Variable selection wrapper in presence of correlated input variables for random forest models
Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark. ORCID iD: 0000-0001-5263-1937
Luleå University of Technology, Department of Social Sciences, Technology and Arts, Business Administration and Industrial Engineering. Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark. ORCID iD: 0000-0003-4222-9631
2024 (English). In: Quality and Reliability Engineering International, ISSN 0748-8017, E-ISSN 1099-1638, Vol. 40, no 1, p. 297-312. Article in journal (Refereed). Published
Abstract [en]

In most data analytic applications in manufacturing, understanding the data-driven models plays a crucial role in complementing the engineering knowledge about the production process. Identifying relevant input variables, rather than only predicting the response through some “black-box” model, is of great interest in many applications. There is, therefore, a growing focus on describing the contributions of the input variables to the model in the form of “variable importance”, which is readily available in certain machine learning methods such as random forest (RF). Once a ranking based on the importance measure of the variables is established, the question arises of how many variables are truly relevant in predicting the output variable. In this study, we focus on the Boruta algorithm, which is a wrapper around the RF model. It is a variable selection tool that assesses the variable importance measure for the RF model. It has been previously shown in the literature that correlation among the input variables, which is common in high-dimensional data, distorts and overestimates the importance of variables. The Boruta algorithm is also affected by this, resulting in a larger set of input variables deemed important. To overcome this issue, we propose an extension of the Boruta algorithm for correlated data that exploits the conditional importance measure. This extension greatly improves the Boruta algorithm in the case of high correlation among variables and provides a more precise ranking of the variables that significantly contribute to the response. We believe this approach can be used in many industrial applications by providing more transparency and understanding of the process.

Place, publisher, year, edition, pages
John Wiley & Sons, 2024. Vol. 40, no 1, p. 297-312
Keywords [en]
additive manufacturing, Boruta algorithm, conditional importance, random forest, variable selection algorithm
National Category
Computer Sciences
Research subject
Quality Technology and Logistics
Identifiers
URN: urn:nbn:se:ltu:diva-99118
DOI: 10.1002/qre.3398
ISI: 001009829700001
Scopus ID: 2-s2.0-85162025946
OAI: oai:DiVA.org:ltu-99118
DiVA, id: diva2:1779045
Note

Validated; 2024; Level 2; 2024-02-14 (sofila)

Full text license: CC BY-NC 4.0

Available from: 2023-07-03. Created: 2023-07-03. Last updated: 2024-02-14. Bibliographically approved

Open Access in DiVA

fulltext (1583 kB), 20 downloads
File information
File name: FULLTEXT02.pdf
File size: 1583 kB
Checksum (SHA-512): 375d3b803900087f5891c2afd70ec8914a78ac61499442f9a737f4421851e05c8c32979e5c962faf48c97ff345d0ffb7740be5c1411be2216b8979622e77913d
Type: fulltext
Mimetype: application/pdf

Authors
Rotari, Marta; Kulahci, Murat
Total: 148 downloads
The number of downloads is the sum of all downloads of full texts. It may include, e.g., previous versions that are no longer available.

Total: 229 hits