Tree-based Machine Learning Methods for Wind Farm Data
Synopsis
Environmental and energy datasets are typically characterized by nonlinear dependencies and a mix of numerical and categorical variables, which calls for flexible modeling approaches. Tree-based machine learning methods are well suited to such data because they combine high predictive performance with a high level of interpretability. In this chapter, we present a comparative study of selected tree-based regression models applied to real-world environmental data from the United States Wind Turbine Database. The evaluated methods include a single regression decision tree, a bagging-based Random Forest ensemble, and modern gradient boosting implementations represented by CatBoost and LightGBM. All models are trained and evaluated within a unified framework using standard regression performance metrics. In our experiments, ensemble-based approaches substantially outperform a single decision tree. In particular, boosting-based models achieve higher predictive accuracy, with LightGBM providing the best overall performance in terms of squared error metrics and the coefficient of determination. Feature importance analysis further highlights the role of technical turbine characteristics and categorical descriptors. These findings indicate that modern gradient boosting frameworks are an effective solution for regression tasks involving large-scale environmental and energy-related datasets.
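The squared error metrics and coefficient of determination mentioned above can be illustrated with a minimal sketch. It assumes the usual definitions (mean squared error, its square root, and R² as one minus the ratio of residual to total sum of squares). The function name `regression_metrics` and the toy values are hypothetical and are not taken from the chapter's experiments.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R^2 for two equal-length sequences of numbers.

    Illustrative helper only (hypothetical name), using the textbook
    definitions of the metrics.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    n = len(y_true)
    mean_true = sum(y_true) / n

    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean target.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)

    mse = ss_res / n
    rmse = math.sqrt(mse)
    # R^2 is undefined when the targets are constant (ss_tot == 0).
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": rmse, "r2": r2}


if __name__ == "__main__":
    # Made-up targets and predictions, for illustration only.
    targets = [1.5, 2.0, 2.5, 3.0]
    predictions = [1.4, 2.1, 2.4, 3.2]
    print(regression_metrics(targets, predictions))
```

Under these definitions, a model that always predicts the mean target has R² equal to zero and a perfect model has R² equal to one, so R² reads as the improvement over that constant baseline when comparing models on the same test set.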






