Analysis of Job Failure Prediction in a Cloud Environment by Applying Machine Learning Techniques

Faraz Bashir; Farrukh Zeeshan Khan

Authors

Faraz Bashir, Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan.
Farrukh Zeeshan Khan, Department of Computer Science, University of Engineering and Technology, Taxila, Pakistan.

Keywords:

Cloud Service Providers, Virtual Machines, Physical Machines, Machine Learning, Infrastructure as a Service

Abstract

Cloud services provide on-demand availability of resources such as storage, data, and compute power. As cloud computing and storage systems continue to expand, cloud service providers (CSPs) must ensure a reliable and consistent supply of resources to users and businesses even when failures occur. Consequently, large cloud service providers are concentrating on mitigating failures that occur in cloud environments. In this work, we examined the Bitbrains dataset, which keeps traces of 3 years of cloud system VMs, for job failure prediction. The dataset contains data about the resources used in a cloud environment. We evaluated the performance of two machine learning (ML) algorithms, Logistic Regression and K-Nearest Neighbors (KNN). The performance of these algorithms was assessed using cross-validation. KNN and Logistic Regression achieved accuracies of 99% and 95%, respectively. Our study shows that using KNN and Logistic Regression increases the detection accuracy of job failures and can help cloud service providers reduce future failures in cloud resources. Thus, we believe our approach is feasible and can be adapted for use in an existing cloud environment.
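
The evaluation setup named in the abstract (KNN and Logistic Regression scored with cross-validation) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and not taken from the paper: the data is synthetic (two standardised usage features per sample, not the Bitbrains traces), the classifiers are small hand-written versions, and the settings `k=5`, `folds=5`, `lr=0.5`, and `epochs=200` are arbitrary choices.

```python
import math
import random


def make_synthetic(n=200, seed=0):
    """Hypothetical stand-in data: two standardised usage features per VM sample."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        failed = rng.randint(0, 1)            # 1 = job failed, 0 = job succeeded
        centre = 1.0 if failed else -1.0      # assumed class means, not from the paper
        X.append([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)])
        y.append(failed)
    return X, y


def knn_factory(train_X, train_y, k=5):
    """Return a predictor that takes a majority vote among the k nearest samples."""
    def predict(x):
        nearest = sorted(range(len(train_X)),
                         key=lambda i: math.dist(train_X[i], x))[:k]
        votes = sum(train_y[i] for i in nearest)
        return 1 if 2 * votes > k else 0
    return predict


def logreg_factory(train_X, train_y, lr=0.5, epochs=200):
    """Fit logistic regression by batch gradient descent and return a predictor."""
    n, d = len(train_X), len(train_X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(train_X, train_y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            err = 1.0 / (1.0 + math.exp(-z)) - yi   # predicted probability minus label
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return lambda x: 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


def cross_val_accuracy(X, y, factory, folds=5):
    """Mean accuracy over k folds; each fold is held out once for testing."""
    idx = list(range(len(X)))
    random.Random(1).shuffle(idx)
    scores = []
    for f in range(folds):
        test = idx[f::folds]
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        predict = factory([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / folds


X, y = make_synthetic()
knn_acc = cross_val_accuracy(X, y, knn_factory)
logreg_acc = cross_val_accuracy(X, y, logreg_factory)
print(f"KNN mean accuracy: {knn_acc:.3f}")
print(f"Logistic Regression mean accuracy: {logreg_acc:.3f}")
```

The accuracies printed by this sketch describe only the synthetic data and say nothing about the figures reported in the abstract. In practice, a library such as scikit-learn would normally replace the hand-written classifiers and fold logic.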
