Sentiment Classification in Imbalanced Data: Trade-Offs Between Metrics and Real-World Relevance

Indra Swanto Ritonga; Wanayumini; Dedy Hartama

doi:10.15408/jti.v18i2.46652

Authors

Indra Swanto Ritonga, Master of Computer Science, Faculty of Computer Science, Potensi Utama University, Indonesia. ORCID: https://orcid.org/0009-0001-6415-598X
Wanayumini, Informatics Engineering, Faculty of Engineering, Asahan University, Indonesia. ORCID: https://orcid.org/0000-0002-5178-9449
Dedy Hartama, Information System, STIKOM Tunas Bangsa, Indonesia. ORCID: https://orcid.org/0000-0002-9569-5874

Keywords:

bag of words, BPJS Kesehatan, class imbalance, Naïve Bayes, sentiment analysis, text feature extraction, TF-IDF

Abstract

Sentiment analysis plays a crucial role in assessing public perception, particularly of healthcare services such as BPJS Kesehatan, Indonesia’s national health insurance program. Sentiment classification is complicated by class imbalance, where negative feedback dominates positive responses. This study investigates whether sentiment classification should prioritize traditional evaluation metrics or preserve the real-world sentiment distribution of the original data. Two feature extraction methods, Term Frequency-Inverse Document Frequency (TF-IDF) and Bag of Words (BoW), were evaluated with Naïve Bayes, Support Vector Machine (SVM), and Logistic Regression classifiers, with the maximum feature count varied from 100 to 300 to examine the impact of feature dimensionality. Model performance was measured with traditional metrics such as accuracy, precision, and recall, while sentiment distribution fidelity was assessed by comparing the predicted class proportions with those in the dataset. The results show that TF-IDF achieves higher precision and recall but fails to capture positive sentiments, which skews the representation of real-world trends, whereas BoW yields a more balanced distribution at slightly lower accuracy. Paired t-tests and Wilcoxon signed-rank tests confirmed that the differences in accuracy and recall are statistically significant, whereas the differences in precision and sentiment distribution are not. These findings highlight a trade-off between predictive performance and sentiment diversity that matters in healthcare services and other fields with imbalanced datasets, and they emphasize the need to align evaluation metrics with real-world objectives. Future research should investigate advanced models, such as deep learning and transformer-based approaches, to improve both accuracy and fairness when analyzing imbalanced data.
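
The trade-off described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the toy labels are invented, and `distribution_gap` is a hypothetical fidelity measure (the largest absolute difference between true and predicted class shares), since the abstract does not specify how the proportions were compared. The sketch only shows how a model can score higher on accuracy while reproducing the sentiment distribution less faithfully.

```python
from collections import Counter


def class_proportions(labels, classes):
    """Return the share of each class in a list of labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: counts[c] / total for c in classes}


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def distribution_gap(y_true, y_pred, classes):
    """Largest absolute difference between true and predicted class shares.

    Hypothetical fidelity measure, chosen for illustration only.
    """
    true_share = class_proportions(y_true, classes)
    pred_share = class_proportions(y_pred, classes)
    return max(abs(true_share[c] - pred_share[c]) for c in classes)


CLASSES = ("negative", "positive")

# Invented toy data: 80 negative and 20 positive reviews.
y_true = ["negative"] * 80 + ["positive"] * 20

# Model A labels every review as negative.
pred_a = ["negative"] * 100

# Model B gets 65 of the 80 negatives and 10 of the 20 positives right.
pred_b = (
    ["negative"] * 65 + ["positive"] * 15    # predictions for true negatives
    + ["negative"] * 10 + ["positive"] * 10  # predictions for true positives
)

for name, pred in (("A", pred_a), ("B", pred_b)):
    print(
        name,
        "accuracy:", accuracy(y_true, pred),
        "predicted shares:", class_proportions(pred, CLASSES),
        "gap:", round(distribution_gap(y_true, pred, CLASSES), 3),
    )
```

In this toy setup, Model A is more accurate yet predicts no positive sentiment at all, while Model B is slightly less accurate but its predicted shares sit closer to the true 80/20 split.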
