Abstract

Large-scale data analysis has been the subject of numerous recent studies. In many of today's data-intensive applications, data arrives continually as streams, and analytics engines that handle streaming data must be able to react to data in motion. Data streams pose special challenges because traditional data mining and machine learning methods are designed for static data: they do not account for the characteristic properties of data streams and are poorly suited to analysing data that grows quickly. This research presents A-MERIT-C, a dynamic-learning, multitiered, ensemble-based system for real-time flight data analysis. A-MERIT-C is an active-learning model for dynamic real-time data stream analysis, built on a self-tuning ensemble learning framework that adapts quickly to concepts in near-real-time streaming data. The conceptual architecture adapts to the dynamics of real-time data through an evolving classifier pool: at every epoch, the best-performing classifiers are added to the pool. Another distinguishing characteristic of A-MERIT-C is that it evaluates classifiers prequentially instead of using traditional hold-out evaluation. These features provide significant gains in accuracy, precision, and AUC for streaming data analytics. Through incremental learning and feedback, A-MERIT-C can also address problems that limit current algorithms, including concept evolution and feature drift.
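As an illustration of the prequential (test-then-train) evaluation mentioned in the abstract, the following is a minimal sketch and not the A-MERIT-C implementation. The `MajorityClassLearner` class, the `prequential_accuracy` function, and the four-instance stream are hypothetical names and data introduced only for this example. Each instance is first used to test the model and only afterwards to train it, so no separate hold-out set is needed.

```python
from collections import Counter


class MajorityClassLearner:
    """Toy incremental learner (hypothetical): predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Before any training, fall back to a default label of 0.
        if not self.counts:
            return 0
        return self.counts.most_common(1)[0][0]

    def learn(self, x, y):
        # Incremental update: one instance at a time, no stored history.
        self.counts[y] += 1


def prequential_accuracy(model, stream):
    """Test-then-train: score each instance before the model learns from it."""
    correct = 0
    total = 0
    for x, y in stream:
        if model.predict(x) == y:  # test first
            correct += 1
        model.learn(x, y)          # then train on the same instance
        total += 1
    return correct / total if total else 0.0


# Made-up stream of (features, label) pairs, for illustration only.
stream = [((0.1,), 1), ((0.4,), 1), ((0.9,), 0), ((0.3,), 1)]
acc = prequential_accuracy(MajorityClassLearner(), stream)
print(acc)
```

In an ensemble setting, a score of this kind could be tracked per classifier and used to decide which classifiers enter the pool at each epoch; that selection step is not shown here.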

Keywords

Data stream learning, Dynamic learning, Adaptive ensemble, Multitiered architecture, Prequential evaluation

