Comparing K-Prototypes and K-Medoids with CatBoost for Health Profile Clustering of Pesantren Students

Moch. Aghisna Hadzikunnuha; Harits Ar Rosyid; M. Zainal Arifin

doi:10.15408/jti.v19i1.49369

Authors

Moch. Aghisna Hadzikunnuha, Electrical and Informatics Engineering, Faculty of Engineering, Universitas Negeri Malang
Harits Ar Rosyid, Electrical and Informatics Engineering, Faculty of Engineering, Universitas Negeri Malang
M. Zainal Arifin, Electrical and Informatics Engineering, Faculty of Engineering, Universitas Negeri Malang

Keywords:

CatBoost, Clustering, K-Medoids, K-Prototypes, Pesantren Students

Abstract

Health screening in pesantren (Islamic boarding schools) is challenging because of communal living conditions, limited health facilities, and the need to identify vulnerable student groups early. This study compares K-Prototypes and K-Medoids clustering for grouping student health profiles and evaluates whether cluster labels are useful as additional features in a CatBoost classification model. The dataset covers 1,464 new students at Queen Al Falah Islamic Boarding School in the 2025/2026 academic year; it was collected through the admission system and analyzed after preprocessing. Both algorithms are run with three clusters to support interpretation of nutritional and health profiles: although two clusters yield higher silhouette values, three clusters give more meaningful distinctions for practical screening. Classification experiments use CatBoost with an 80:20 stratified train-test split and compare baseline models against hybrid models that add the other algorithm's cluster labels as features. The results are asymmetric: adding K-Prototypes features raises accuracy on the K-Medoids target from 99.66% to 100%, while adding K-Medoids features slightly lowers accuracy on the K-Prototypes target from 98.98% to 98.63%. McNemar tests indicate that neither difference is statistically significant. Overall, the proposed framework supports reliable and interpretable health profile clustering for monitoring pesantren students.
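The reported accuracies and the McNemar result can be checked with a little arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that the 20% test split holds 293 of the 1,464 students, and that the exact (binomial) form of the McNemar test applies. The helper names `mcnemar_exact_p` and `n_errors` are hypothetical and are not taken from the authors' code.

```python
from math import comb


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases the baseline got wrong and the hybrid got right.
    c: cases the baseline got right and the hybrid got wrong.
    """
    n = b + c
    if n == 0:
        return 1.0  # the two models never disagree
    k = min(b, c)
    # Under H0 each discordant case is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Assumption: the 20% test split holds 293 of the 1,464 students.
N_TEST = 293


def n_errors(accuracy_pct: float) -> int:
    """Number of misclassified test cases implied by a reported accuracy."""
    return round(N_TEST * (1 - accuracy_pct / 100))


# K-Medoids target: 99.66% -> 100%.
kmed_base, kmed_hybrid = n_errors(99.66), n_errors(100.0)
# The hybrid makes no errors, so every baseline error is a discordant case.
p_kmedoids = mcnemar_exact_p(b=kmed_base, c=kmed_hybrid)

# K-Prototypes target: 98.98% -> 98.63%.
kproto_base, kproto_hybrid = n_errors(98.98), n_errors(98.63)
# The overlap between the two error sets is not reported, so try every split
# in which `shared` cases are misclassified by both models.
p_kprototypes = [
    mcnemar_exact_p(b=kproto_base - shared, c=kproto_hybrid - shared)
    for shared in range(min(kproto_base, kproto_hybrid) + 1)
]

print("K-Medoids target p-value:", p_kmedoids)
print("K-Prototypes target p-values:", p_kprototypes)
```

Under these assumptions the accuracy changes correspond to a difference of a single test case in each direction, and every scenario gives a p-value of 1.0, which is consistent with the reported lack of statistical significance.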
