Machine Learning for Air Quality and CO2 Emissions: The Role of Data Understanding
Synopsis
In recent years, machine learning (ML) techniques have enabled increasingly sophisticated approaches to environmental prediction. However, comparatively little attention has been paid to the nature, origin, and methodological construction of the datasets underlying these models. This study investigates the role of data in ML-based environmental applications, focusing on two domains: greenhouse gas (GHG) emissions, particularly carbon dioxide (CO2), and particulate matter concentrations (PM_{2.5} and PM_{10}). For Poland and Slovakia, a LightGBM model was trained to predict CO2 emissions across all major economic sectors: Residential, Power, Transport, Industry, and Aviation. Predictive performance was highest in sectors with regular, seasonal emission patterns, while low-variability sectors such as Domestic Aviation posed greater challenges. For particulate matter, meteorological variables and time-related features were used to forecast PM_{2.5} and PM_{10} during the heating season. The models captured general temporal patterns, including short-term fluctuations and seasonal peaks, although they partially underestimated extreme events. Overall, the findings show that predictive accuracy is strongly influenced by the quality, resolution, and structure of the input datasets, as well as by emission regularity and environmental conditions. This work underscores the importance of careful dataset design and preprocessing in ML applications for environmental monitoring and provides guidance for improving the reliability of emission and air quality forecasting.
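
The synopsis mentions time-related features without listing them. As a rough illustration of what such features can look like, the sketch below derives cyclical encodings of hour and day of year, a weekend flag, and a heating-season flag from a timestamp. The function name `time_features`, the feature names, and the October-to-March definition of the heating season are all assumptions made for this example; they are not taken from the study.

```python
import math
from datetime import datetime


def time_features(ts: datetime) -> dict:
    """Illustrative time-related features for one timestamp.

    Hypothetical helper: the study's actual feature set is not given in the synopsis.
    """
    # Map hour of day and day of year onto a circle so that 23:00 and 00:00,
    # or 31 December and 1 January, end up close together.
    hour_angle = 2 * math.pi * ts.hour / 24
    day_of_year = ts.timetuple().tm_yday
    year_angle = 2 * math.pi * (day_of_year - 1) / 365.25

    return {
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "doy_sin": math.sin(year_angle),
        "doy_cos": math.cos(year_angle),
        # Saturday and Sunday have weekday() values 5 and 6.
        "is_weekend": int(ts.weekday() >= 5),
        # Assumption: heating season taken as October through March.
        "heating_season": int(ts.month >= 10 or ts.month <= 3),
    }


if __name__ == "__main__":
    print(time_features(datetime(2023, 1, 1, 0, 0)))
```

Features of this kind could be joined with meteorological variables and passed to a gradient-boosting model such as LightGBM. The sine and cosine pairs are one common way to represent periodic quantities without an artificial jump at the end of each cycle.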






