Forecasting Water Quality Parameter Using a Novel Kernel-Based Method with Feature Selection and Multivariate Decomposition

Article type: Full-length scientific research article

Authors

1 Department of Mechanical Engineering, Faculty of Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran

2 Corresponding author, Department of Civil Engineering, Faculty of Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran

3 Department of Civil, Construction and Environmental Engineering (Dept 2470), North Dakota State University

Abstract

Background and objectives: Precise forecasting of water quality (WQ) parameters, specifically potential salinity (PS), is critical for sustainable water utilization. In water-stressed regions such as the Karun River basin in Iran, effective monitoring and prediction of PS is critical because of anthropogenic activities, climate change, and reduced freshwater inflows. Effective machine learning (ML) models and appropriate input data are therefore essential for monitoring and predicting WQ parameters. However, the influencing factors exhibit complex, non-linear relationships, and multicollinearity in the datasets makes the problem challenging for traditional ML models. These limitations can result in inaccurate predictions, which obstruct the establishment of sustainable water management strategies. Accurate forecasting of PS is also essential for water and soil conservation, because it helps mitigate salinity-related degradation of agricultural lands and ensure the sustainability of vital ecosystems. By providing reliable predictions, this study supports the development of effective conservation strategies to maintain soil productivity and WQ in vulnerable regions. To address these issues, the present study introduces a new hybrid model, IKRidge-GRM, which inherits the advantages of improved kernel ridge regression (IKRidge) and generalized ridge regression (GRM). The hybrid model combines IKRidge's improved capacity to identify non-linearity with GRM's resilience against multicollinearity to improve the accuracy of PS prediction. This framework offers improved stability and interpretability of results as well as higher forecast accuracy, making it a helpful tool for environmental monitoring and decision-making.
The proposed strategy could aid policymakers and water resource managers in designing reasonable strategies to alleviate salinity issues, protect aquatic ecosystems, and ensure the long-term survival of vital water sources like the Karun River.
Materials and methods: This study introduces a novel hybrid ML model, IKRidge-GRM, based on two regression techniques: generalized ridge regression (GRM) and improved kernel ridge regression (IKRidge). GRM addresses multicollinearity and overfitting using the iteratively reweighted least squares (IRLS) process. IKRidge incorporates a wavelet kernel function, optimized through the INFO algorithm, and the regularized locally weighted (RLW) approach, enabling it to capture complex, non-linear patterns in the data. This combination allows the hybrid model to overcome limitations of traditional ML methods, making it particularly suitable for the intricate relationships inherent in WQ datasets. To further enhance predictive accuracy, the IKRidge-GRM framework integrates a light gradient boosting machine (LGBM) for feature selection, which reduces dimensionality by identifying the most relevant input variables while eliminating redundant or irrelevant features.
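To make the kernel component concrete, the following is a minimal sketch of ordinary kernel ridge regression with a Morlet-type wavelet kernel, written with NumPy. It is not the authors' IKRidge: the specific kernel form, the class name WaveletKernelRidge, the synthetic data, and the hand-picked values of the dilation a and penalty lam are all assumptions for illustration, and the RLW weighting and INFO-based tuning are omitted.

```python
import numpy as np


def wavelet_kernel(A, B, a=1.0):
    """Morlet-type wavelet kernel between rows of A (n, d) and B (m, d).

    K(x, z) = prod_i cos(1.75 * u_i) * exp(-u_i**2 / 2), u_i = (x_i - z_i) / a.
    The kernel form is an assumption; the source only says "wavelet kernel".
    """
    u = (A[:, None, :] - B[None, :, :]) / a          # shape (n, m, d)
    return np.prod(np.cos(1.75 * u) * np.exp(-0.5 * u**2), axis=2)


class WaveletKernelRidge:
    """Plain kernel ridge regression with a wavelet kernel (illustrative only)."""

    def __init__(self, a=1.0, lam=1e-3):
        self.a = a        # kernel dilation (hypothetical default)
        self.lam = lam    # ridge penalty (hypothetical default)

    def fit(self, X, y):
        self.X_ = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        K = wavelet_kernel(self.X_, self.X_, self.a)
        # Dual coefficients: solve (K + lam * I) alpha = y
        self.alpha_ = np.linalg.solve(K + self.lam * np.eye(len(y)), y)
        return self

    def predict(self, X):
        K = wavelet_kernel(np.asarray(X, dtype=float), self.X_, self.a)
        return K @ self.alpha_


# Synthetic non-linear target, used only to exercise the sketch.
rng = np.random.default_rng(0)
X_train = rng.uniform(-2.0, 2.0, size=(80, 1))
y_train = np.sin(2.0 * X_train[:, 0])
model = WaveletKernelRidge(a=1.0, lam=1e-3).fit(X_train, y_train)
X_new = np.array([[0.5], [-1.0]])
y_new = model.predict(X_new)
```

The ridge penalty lam is what keeps the linear system well conditioned when inputs are strongly correlated. In the study the kernel is reported to be optimized through the INFO algorithm, whereas here a and lam are simply fixed by hand.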
Additionally, the model employs multivariate variational mode decomposition (MVMD) to decompose the input data into high- and low-frequency components, allowing it to capture both short-term fluctuations and long-term trends in WQ parameters. The study used a dataset comprising 48 years of monthly WQ data collected at the Farisat station on the Karun River. Key parameters, including magnesium (Mg), sulfate (SO₄²⁻), calcium (Ca), discharge (Q), sodium (Na), bicarbonate (HCO₃), chloride (Cl), electrical conductivity (EC), total dissolved solids (TDS), and pH, were used as inputs to forecast PS three months ahead.
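The three-months-ahead setup can be read as a supervised learning problem in which the inputs observed in month t are paired with the PS value in month t + 3. The sketch below shows one simple way to build such pairs. The helper name make_supervised and all numbers are made up for illustration, and the study may additionally use lagged or decomposed inputs that are not shown here.

```python
def make_supervised(features, target, horizon=3):
    """Pair the inputs observed at month t with the target at month t + horizon."""
    if len(features) != len(target):
        raise ValueError("features and target must have the same length")
    if horizon < 1 or horizon >= len(target):
        raise ValueError("horizon must be between 1 and len(target) - 1")
    X = features[:-horizon]     # inputs available at month t
    y = target[horizon:]        # value to forecast, `horizon` months later
    return X, y


# Toy monthly records: each row is (EC, TDS) for one month; values are made up.
features = [(900, 580), (950, 610), (1010, 640), (980, 630), (1040, 660), (1100, 700)]
ps = [4.1, 4.4, 4.9, 4.6, 5.0, 5.4]    # made-up PS series

X, y = make_supervised(features, ps, horizon=3)
print(len(X), y)   # 3 [4.6, 5.0, 5.4]
```

If the 48-year monthly record were complete, it would hold 48 × 12 = 576 months and yield 573 input-target pairs at a three-month horizon.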
Results: The proposed IKRidge-GRM model accurately predicted PS values at the Farisat station, significantly outperforming baseline models (Ridge, DELM, and LSSVM) and their MVMD-enhanced versions. By leveraging its hybrid architecture and advanced feature extraction techniques, the MVMD-IKRidge-GRM model achieved, during the testing phase, the highest correlation coefficient (R = 0.977), the lowest RMSE (0.956), and the lowest MAPE (4.521). These metrics indicate the model's superior predictive accuracy and reliability in handling complex, non-linear relationships. The model also achieved high IA (0.988) and KGE (0.948) scores, underscoring its robustness and effectiveness in capturing the dynamics of PS variations. These results highlight the model's ability to uncover hidden patterns in the data and provide highly accurate predictions, even in challenging scenarios involving multicollinearity and non-linear dependencies. The model's performance was further confirmed by visual evaluations such as scatter plots, relative error plots, and Taylor diagrams. Scatter plots showed that the MVMD-IKRidge-GRM model's predictions closely aligned with measured values, with narrow prediction intervals and narrow error distributions, reflecting its precision and consistency. Relative error plots revealed that the model exhibited the most compact and symmetric error distribution, with minimal bias and variability, and indicated the model's ability to generalize well across different data points. Taylor diagrams provided evidence of the model's strong agreement with reference data, showing that it balances accuracy, variability representation, and error minimization effectively. Residual analysis further confirmed the model's precision and reliability.
Among all the models tested, the MVMD-IKRidge-GRM model achieved the smallest mean residual (-0.0073) and the lowest standard deviation (0.0613), demonstrating its ability to minimize prediction errors consistently. This level of precision is critical for practical applications, as it allows the model to provide reliable forecasts for decision-making in water resource management. The model's integration of advanced regression techniques, feature selection, and frequency decomposition enhances its predictive capabilities and establishes it as a robust framework for addressing complex environmental challenges. These findings emphasize the potential of the MVMD-IKRidge-GRM model as a powerful tool for sustainable water resource management, particularly in regions like the Karun River basin, where accurate and reliable predictions are essential for mitigating environmental degradation and ensuring long-term ecological balance.
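For readers who want to reproduce scores of the kind reported above, the sketch below computes R, RMSE, MAPE, IA, KGE, and the residual mean and standard deviation from paired observations and predictions. It assumes the commonly used definitions (Pearson correlation, Willmott's index of agreement, the 2009 form of the Kling-Gupta efficiency, MAPE in percent, residual taken as prediction minus observation). The abstract does not state which variants the authors used, and the function name evaluate and the sample values are illustrative.

```python
import math


def evaluate(obs, pred):
    """Return common goodness-of-fit scores for paired observations and predictions.

    Observations must be non-zero (MAPE) and have a non-zero mean (KGE).
    """
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    so = math.sqrt(sum((o - mo) ** 2 for o in obs) / n)     # population std
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred)) / n
    sse = sum((o - p) ** 2 for o, p in zip(obs, pred))
    res = [p - o for o, p in zip(obs, pred)]                # assumed sign convention
    mres = sum(res) / n

    r = cov / (so * sp)
    return {
        "R": r,
        "RMSE": math.sqrt(sse / n),
        "MAPE": 100.0 / n * sum(abs((o - p) / o) for o, p in zip(obs, pred)),
        "IA": 1.0 - sse / sum((abs(p - mo) + abs(o - mo)) ** 2
                              for o, p in zip(obs, pred)),
        "KGE": 1.0 - math.sqrt((r - 1) ** 2 + (sp / so - 1) ** 2 + (mp / mo - 1) ** 2),
        "mean_residual": mres,
        "std_residual": math.sqrt(sum((e - mres) ** 2 for e in res) / n),
    }


obs = [4.0, 5.0, 6.0, 8.0]                       # made-up values
perfect = evaluate(obs, obs)                     # identical series
shifted = evaluate(obs, [o + 1.0 for o in obs])  # constant bias of +1
```

The shifted case illustrates why several scores are reported together: a constant bias leaves R at 1 while RMSE, MAPE, and KGE all register the error.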
Conclusion: The IKRidge-GRM model predicted PS values at the Farisat station on the Karun River with high accuracy and reliability across all evaluation metrics. The model can uncover hidden patterns in complex, non-linear datasets, and its capacity to deliver precise predictions highlights its potential as a valuable tool for environmental monitoring and management. By integrating advanced regression techniques, such as improved kernel ridge regression (IKRidge) and generalized ridge regression (GRM), with feature selection and decomposition methods like light gradient boosting machine (LGBM) and multivariate variational mode decomposition (MVMD), the model effectively addresses challenges such as multicollinearity, overfitting, and non-linear relationships. This comprehensive framework allows the IKRidge-GRM model to achieve superior predictive performance while maintaining robustness and adaptability across diverse environmental conditions. This study emphasizes the importance of combining advanced ML techniques with effective preprocessing methods to develop reliable models for analyzing and forecasting complex environmental data. Integrating feature selection and frequency decomposition enhances the model's ability to extract meaningful information from high-dimensional datasets and enables it to better capture both short-term fluctuations and long-term trends in WQ parameters. Such capabilities are essential for addressing the multifaceted challenges posed by environmental degradation, particularly in regions like the Karun River basin, where water resources are under significant stress due to anthropogenic activities and climate change.


Authors

  • Masoud Dorfeshan 1
  • Iman Ahmadianfar 2
  • Arvin Samadi-Koucheksaraee 3
1 Dept. of Mechanical Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran
2 Corresponding Author, Dept. of Civil Engineering, Behbahan Khatam Alanbia University of Technology, Behbahan, Iran.
3 Dept. of Civil, Construction and Environmental Engineering (Dept 2470), North Dakota State University, PO Box 6050, Fargo, ND, 58108-6050, USA

Keywords

  • Improved kernel ridge
  • Forecasting
  • WQ
  • Feature selection
  • Decomposition