Struggling Models: An Analysis of Logistic Regression and Random Forest in Predicting Repeat Buyers with Imbalanced Performance Metrics

Siska Farizah Mauludiah, Yunifa Miftachul Arif, Muhammad Faisal, Dony Darmawan Putra


Predicting repeat buyers is essential for businesses seeking to improve customer retention and maximize profitability. This study examines the effectiveness of logistic regression and random forest algorithms in predicting repeat buyers, using an e-commerce dataset from Kaggle. Despite the theoretical strengths of these models, our results indicate significant performance challenges. Both models were evaluated on accuracy, precision, recall, F1 score, and ROC-AUC. Both performed poorly: accuracy hovered around 50%, precision and recall were imbalanced, and ROC-AUC scores barely exceeded 0.5, the level of random guessing. These metrics indicate that the models have limited discriminative power in identifying repeat buyers. The analysis suggests that data quality, feature relevance, and class imbalance contribute to these shortcomings; the models struggled to learn from the data, which led to suboptimal predictions. These results underscore the need for improved feature engineering, better handling of class imbalance, and possibly more advanced algorithms. This study provides a critical assessment of the limitations of logistic regression and random forest for predicting repeat buyers and, in response, applies feature engineering, SMOTE, and hyperparameter tuning with RandomizedSearchCV to obtain better results.
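
To make the evaluation metrics named above concrete, the following minimal Python sketch computes accuracy, precision, recall, F1 score, and ROC-AUC from scratch. It is an illustration only: the function names and the toy labels are hypothetical and are not taken from the study's code or data, and the ROC-AUC is computed with the pairwise-ranking definition (the share of positive/negative pairs in which the positive example receives the higher score, with ties counted as half).

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = repeat buyer)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the probability that a positive outranks a negative (ties = 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data: an uninformative model that gives every customer the same score.
    y_true = [1, 0, 1, 0, 0, 0, 1, 0]
    scores = [0.5] * len(y_true)
    y_pred = [1 if s >= 0.5 else 0 for s in scores]

    print(classification_metrics(y_true, y_pred))
    print("ROC-AUC:", roc_auc(y_true, scores))
```

A model whose scores carry no information about the label yields an ROC-AUC of 0.5 under this definition, which is the baseline against which the reported scores should be read.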


E-commerce, repeat buyers, customer retention, logistic regression, random forest, imbalanced performance metrics

