Uncovering Hidden Themes in Indie Music: CRISP-DM Guided LDA Topic Modeling on a Kaggle-Based Lyric Generation Dataset

Thoyyibah T; Yan Mitha Djaksana

doi:10.15408/jti.v18i2.46643

Authors

Thoyyibah T, Information System Management Department, BINUS Graduate Program, Master of Information System Management, University of Bina Nusantara, Indonesia. https://orcid.org/0000-0002-6348-8694
Yan Mitha Djaksana, Information Technology Program, Faculty of Computer Science, Pamulang University, Indonesia. https://orcid.org/0000-0003-3783-4618

Keywords:

Natural Language Processing, topic modeling, CRISP-DM, LDA, lyrics dataset

Abstract

The development of music has produced a large body of data, especially lyrical data, which provides insight into the semantic structure of music. This study explores latent thematic patterns in an indie lyric dataset from Kaggle by applying Latent Dirichlet Allocation (LDA). It is the first LDA study of indie music lyrics in the Indonesian context, and it interprets the resulting topics as love, emotional needs, romance, and inner conflict. The study shows that the CRISP-DM (Cross-Industry Standard Process for Data Mining) methodology can be applied effectively to unstructured data, opening up opportunities for better music classification. The methodological stages include business and data understanding, data preparation, modelling, evaluation, and dissemination. In the early stages, the Kaggle dataset was preprocessed with Natural Language Processing techniques: case folding, punctuation removal, stopword removal, stemming, and tokenization. The LDA model was then trained to identify five topics, each with a distinct interpretation. The results are visualized with WordClouds, the distribution of topics across the dataset, and a title-based topic mapping. The model yielded a coherence value of 0.3044, which indicates limited semantic consistency: the words within each topic are reasonably related, but there is still potential for refinement in subsequent studies. The limitations of this study include the small size of the dataset, only 347 rows, and the slight variation in interpretation. Future research should use larger datasets, more diverse interpretations, and additional machine learning models.
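
The preprocessing steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the stopword list, the suffix rules in `naive_stem`, and the English sample lyric are hypothetical stand-ins chosen for illustration, and the abstract does not say which stemmer, stopword list, or lyric language the study used. Tokenization is applied here before stopword removal so that stopwords can be matched token by token.

```python
import string
from typing import List

# Hypothetical, tiny English stopword list for illustration only;
# the study's actual stopword list is not given in the abstract.
STOPWORDS = {"a", "an", "and", "the", "i", "you", "me", "my", "in", "of", "to", "is", "it"}

# Toy suffix rules standing in for a real stemming algorithm (assumption).
SUFFIXES = ("ing", "ed", "s")


def naive_stem(token: str) -> str:
    """Strip one common suffix, keeping at least three characters of the stem."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(lyric: str) -> List[str]:
    """Case folding, punctuation removal, tokenization, stopword removal, stemming."""
    text = lyric.lower()  # case folding
    text = text.translate(str.maketrans("", "", string.punctuation))  # punctuation removal
    tokens = text.split()  # whitespace tokenization
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [naive_stem(t) for t in tokens]  # stemming


if __name__ == "__main__":
    # Made-up sample line, not taken from the Kaggle dataset.
    print(preprocess("I keep dreaming of you, and the nights are burning!"))
```

The resulting token lists would typically be converted into a bag-of-words corpus before a topic model such as LDA is fitted; that step depends on the library used and is left out of this sketch.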
