Enhancing Tuberculosis Diagnosis: Effective Naive Bayes Classification using SMOTE and Tomek Links for Imbalanced Data
Abstract
Naive Bayes classification, grounded in Bayes' theorem, is a well-established probabilistic method. However, it often struggles with datasets whose class distributions are skewed. On imbalanced data, a classifier tends to predict the majority class more accurately, which yields high accuracy for the majority class but low accuracy for the minority class. Resampling techniques such as oversampling, undersampling, or a combination of both can be used to address this problem. This research introduces a novel approach to balancing training data with a hybrid method that combines SMOTE (Synthetic Minority Oversampling Technique) and Tomek Links. We apply the method to tuberculosis (TB) diagnosis data from Mayjend HM Ryacudu Kotabumi Hospital and evaluate the performance of the Naive Bayes classifier on both the original imbalanced data and the balanced data. Of 1,033 patient records, 826 were used for training and 207 for testing. Of the 826 training records, 306 patients had a TB diagnosis and 520 did not. To better balance the two classes, we generated 214 synthetic records for the minority class so that it matched the size of the majority class (520 − 306 = 214); where necessary, we also removed 214 records from the majority class. The results show that the hybrid approach significantly improves the Naive Bayes model in terms of data balance and overall accuracy. Specifically, the hybrid method achieves an average specificity of 96%, a sensitivity of 88%, a false positive fraction (FPF) of 4%, and a false negative fraction (FNF) of 12%. These findings highlight the effectiveness of combining SMOTE and Tomek Links as a robust way to improve classification performance on imbalanced datasets.
Keywords: Naive Bayes classification; SMOTE; Tomek Links; SMOTE+Tomek Links; tuberculosis.
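The abstract does not give the implementation, so the following is only a minimal pure-Python sketch of the ideas it names, not the authors' code. It shows SMOTE-style interpolation between two minority-class points, detection of Tomek links (mutual nearest neighbors from different classes), and how FPF and FNF relate to specificity and sensitivity. The function names, the toy points, and the confusion-matrix counts are all hypothetical; the counts were chosen only so that the rates match the averages reported above.

```python
import math
import random


def smote_sample(x, neighbor, rng):
    """Return one synthetic point on the segment between x and a minority-class neighbor."""
    gap = rng.random()  # interpolation weight in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(x, neighbor))


def nearest(i, points):
    """Index of the point closest to points[i] (Euclidean), excluding i itself."""
    others = (j for j in range(len(points)) if j != i)
    return min(others, key=lambda j: math.dist(points[i], points[j]))


def tomek_links(points, labels):
    """Pairs (i, j) with i < j that are mutual nearest neighbors from different classes."""
    nn = [nearest(i, points) for i in range(len(points))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and labels[i] != labels[j]]


def rates(tp, fn, tn, fp):
    """Sensitivity and specificity, plus their complements FNF and FPF."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "FPF": fp / (fp + tn),  # equals 1 - specificity
        "FNF": fn / (fn + tp),  # equals 1 - sensitivity
    }


# Training-set class counts reported in the abstract.
n_majority, n_minority = 520, 306
n_synthetic = n_majority - n_minority  # 214 synthetic minority records

# Hypothetical toy layout: points 1 and 2 sit on opposite sides of the class boundary.
points = [(0.0, 0.0), (1.0, 0.0), (1.2, 0.0), (3.0, 0.0)]
labels = [0, 0, 1, 1]
links = tomek_links(points, labels)  # [(1, 2)]

# One synthetic minority point between the two minority-class points.
rng = random.Random(0)
synthetic = smote_sample(points[2], points[3], rng)

# Hypothetical counts, chosen only so the rates match the reported averages.
summary = rates(tp=88, fn=12, tn=96, fp=4)
print(n_synthetic, links, synthetic, summary)
```

Which member of a Tomek link to discard is a design choice: a cleaning step may drop only the majority-class point or both points of the pair, and the abstract does not say which variant was used.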
2020MSC: 68T05, 62R07.
DOI: 10.15408/inprime.v6i2.41463