Machine Learning for Cybersecurity: Web Attack Detection (Brute Force, XSS, SQL Injection)
Abstract
Security is a top priority in system development because web portals are critical entry points and frequent targets of cyber-attacks. Common attack methods include SQL Injection, Cross-Site Scripting (XSS), and Brute Force. Machine learning is increasingly applied in cybersecurity because of its effectiveness in detecting such threats. This study applies supervised machine learning with six algorithms: K-Nearest Neighbors (KNN), Random Forest, Naïve Bayes, AdaBoost, LightGBM, and XGBoost. It uses the CICIDS2017 and CSE-CIC-IDS2018 datasets, which contain network traffic records labeled with four categories: Benign, Brute Force, XSS, and SQL Injection. To address class imbalance, the study applies the Synthetic Minority Oversampling Technique (SMOTE), and it uses Principal Component Analysis (PCA) for dimensionality reduction. Performance is evaluated with accuracy, precision, recall, and F1-score, complemented by K-Fold Cross Validation, AUC-ROC, and learning-curve analysis. Random Forest achieves the highest classification performance, with an accuracy of 97.77%, precision of 84.07%, recall of 91.96%, and F1-score of 87.28%. The research contributes by demonstrating the applicability of machine learning to real-time web attack detection and by highlighting the advantages of ensemble-based models in handling cybersecurity threats. It also underscores the importance of dataset preprocessing in improving classification performance. Future work should focus on optimizing hyperparameters, integrating real-time network traffic analysis, and exploring hybrid models that combine traditional machine learning with deep learning to further improve detection.
Keywords: machine learning; cybersecurity; web attack detection; random forest; SMOTE; PCA.
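The oversampling and evaluation steps named in the abstract can be illustrated with a small, dependency-free Python sketch. It is not the authors' code: the function names (`smote_oversample`, `macro_scores`), the toy data, the neighbour count `k`, and the use of macro averaging are all assumptions made for illustration. The first function creates synthetic minority samples by interpolating between a minority point and one of its k nearest minority neighbours, which is the core idea behind SMOTE. The second computes precision, recall, and F1 per class and averages them with equal weight per class.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points from a list of minority-class vectors.

    Hypothetical helper: a simplified SMOTE-style interpolation, assuming
    at least two minority points and purely numeric features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over classes in y_true."""
    classes = sorted(set(y_true))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy data (assumed, not taken from CICIDS2017 or CSE-CIC-IDS2018).
minority = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0], [2.0, 2.0]]
new_points = smote_oversample(minority, n_new=6)

y_true = ["Benign"] * 6 + ["XSS"] * 2
y_pred = ["Benign"] * 5 + ["XSS"] + ["XSS", "Benign"]
precision, recall, f1 = macro_scores(y_true, y_pred)
print(len(new_points), round(precision, 4), round(recall, 4), round(f1, 4))
```

In a full pipeline, oversampling of this kind is normally applied to the training split only, so that synthetic samples do not leak into the evaluation data. Whether the study followed that order is not stated in the abstract.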
2020MSC: 68T05
DOI: 10.15408/inprime.v7i1.41025