Investigating Differential Item Functioning (DIF) in Geometry Test Scores: Holistic vs. Analytical Scoring Rubrics
DOI: https://doi.org/10.15408/jp3i.v15i1.40842

Keywords: DIF, Polytomous Data, Holistic Scoring, Analytical Scoring, Geometry Assessment

Abstract
Polytomous scoring of test instruments enables a more detailed assessment of test-takers' abilities, but group differences, such as gender, class, and ethnicity, are often overlooked. Differential Item Functioning (DIF) analysis helps determine whether these group identities influence test performance. This descriptive quantitative study examines DIF in geometry tests scored with Holistic and Analytical Scoring Rubrics across gender, class, and ethnic groups. The study involved 102 undergraduate students from Cenderawasih University, Papua, who completed a geometry test consisting of 10 constructed-response questions. Two scoring rubrics were used: the Holistic Scoring Rubric with three score categories and the Analytical Scoring Rubric with five. Responses were analyzed with the difR package in R, and an item was flagged for DIF when its p-value was below 0.05. The findings show that some test items exhibit DIF with respect to gender, class, and ethnicity. Under the holistic scoring rubric, DIF was detected in items 1, 6, 7, and 10 for the gender groups; items 1, 4, 5, and 8 for the class groups; and items 1, 2, and 6 for the ethnic groups. Under the analytical scoring rubric, DIF was detected in item 10 for the gender groups; items 9 and 10 for the class groups; and item 10 for the ethnic groups. However, not all DIF items are flawed; some assess fundamental skills. The Analytical Rubric demonstrated slightly higher reliability (Cronbach's alpha = 0.903) than the Holistic Rubric (alpha = 0.804). These insights support the development of more equitable and sustainable assessment practices, ensuring fairness and inclusivity in educational evaluations.
License
Copyright (c) 2026 Okky Riswandha Imawan, Heri Retnawati, Haryanto Haryanto, Raoda Ismail

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.