Investigating Differential Item Functioning (DIF) in Geometry Test Scores: Holistic vs. Analytical Scoring Rubrics

Authors

  • Raoda Ismail, Mathematics Education, Cenderawasih University, Indonesia
  • Okky Riswandha Imawan, Mathematics Education, Cenderawasih University, Papua, Indonesia, https://orcid.org/0000-0002-9162-3822
  • Heri Retnawati, Educational Research and Evaluation, Graduate School, Yogyakarta State University, Indonesia
  • Haryanto, Educational Research and Evaluation, Graduate School, Yogyakarta State University, Indonesia

DOI:

https://doi.org/10.15408/jp3i.v15i1.40842

Keywords:

DIF, Polytomous Data, Holistic Scoring, Analytical Scoring, Geometry Assessment

Abstract

The use of polytomous data in test instruments enables a more detailed assessment of test-takers' abilities, but differences between groups, such as gender, class, and ethnicity, are often overlooked. Differential Item Functioning (DIF) analysis helps determine whether these group identities influence test performance. This descriptive quantitative study examines DIF in geometry tests scored with Holistic and Analytical Scoring Rubrics across gender, class, and ethnic groups. The study involved 102 undergraduate students from Cenderawasih University, Papua, who completed a geometry test consisting of 10 constructed-response (essay) questions. Two scoring rubrics were used: the Holistic Scoring Rubric with three score categories and the Analytical Scoring Rubric with five. Responses were analyzed with the difR package in R, with a p-value below 0.05 as the criterion for flagging DIF. The findings show that some test items exhibit DIF with respect to gender, class, and ethnicity. Under the Holistic Scoring Rubric, DIF was detected in items 1, 6, 7, and 10 for gender; items 1, 4, 5, and 8 for class; and items 1, 2, and 6 for ethnicity. Under the Analytical Scoring Rubric, DIF was detected only in item 10 for gender; items 9 and 10 for class; and item 10 for ethnicity. However, not all items flagged for DIF are flawed; some assess fundamental skills. The Analytical Rubric also demonstrated higher reliability (Cronbach's alpha = 0.903) than the Holistic Rubric (alpha = 0.804). These insights support the development of more equitable and sustainable assessment practices, ensuring fairness and inclusivity in educational evaluations.
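
A minimal sketch of the kind of DIF check the abstract describes is given below, using the difR package in R. The data file, item column names, and focal-group label are illustrative assumptions, not the authors' materials; and because difR's detectors expect dichotomous responses, the polytomous rubric scores are dichotomized here for demonstration, which may differ from the authors' exact procedure.

```r
# Illustrative sketch only: file name, column names, and group labels are
# assumptions. difR's DIF detectors work on dichotomous items, so the
# polytomous rubric scores are dichotomized (full credit vs. not) first.
library(difR)

scores <- read.csv("geometry_scores.csv")          # hypothetical data file
items  <- scores[, paste0("item", 1:10)]           # 10 geometry items
items  <- as.data.frame(lapply(items, function(x)  # 1 = full credit, 0 = otherwise
  as.integer(x == max(x, na.rm = TRUE))))

# Logistic-regression DIF test comparing the two gender groups
res <- difLogistic(Data  = items,
                   group = scores$gender,          # group-membership vector
                   focal.name = "female",          # assumed focal-group label
                   type  = "both",                 # uniform + nonuniform DIF
                   alpha = 0.05)                   # flag items with p < .05
print(res)                                         # lists items flagged for DIF
```

The same call with `group` swapped to a class variable would reproduce the two-group comparisons; for more than two groups (e.g., several ethnic groups), difR's generalized logistic-regression detector, difGenLogistic, would apply instead.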

References

Abadyo, & Bastari. (2015). Estimation of ability and item parameters in mathematics testing by using the combination of 3PLM/GRM and MCM/GPCM scoring model. REID (Research and Evaluation in Education), 1(1). http://journal.uny.ac.id/index.php/reid

Andrade, C. (2020). Sample Size and its Importance in Research. Indian Journal of Psychological Medicine, 42(1), 102–103. https://doi.org/10.4103/IJPSYM.IJPSYM_504_19

Angoff, W. H. (1993). Perspectives on differential item functioning methodology. In P. W. Holland & H. Wainer (Eds.), Differential item functioning. Lawrence Erlbaum Associates.

Aziz, R., & Günther, U. (2023). Psychometric Properties of Creative Personality Scale among Secondary School Students. Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia, 12(2), 162–176. https://doi.org/10.15408/jp3i.v12i2.31808

Baker, J. G., Rounds, J. B., & Zevon, M. A. (2000). A comparison of graded response and Rasch partial credit models with subjective well-being. Journal of Educational and Behavioral Statistics, 25(3), 253–270.

Boughton, K. A., Klinger, D. A., & Gierl, M. J. (2001). Effect of random rater error on parameter recovery of the generalized partial credit model and graded response model. Paper presented at the annual meeting of the National Council on Measurement in Education, Seattle, WA.

Chang, L. (1994). A psychometric evaluation of 4-point and 6-point Likert-type scales in relation to reliability and validity. Applied Psychological Measurement, 18(3), 205–215.

Dodeen, H. (2004). The relationship between item parameters and item fit. Journal of Educational Measurement, 41(3), 261–270.

Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Sage Publications.

Hambleton, R., & Swaminathan, H. (2013). Item response theory: Principles and applications. Springer Science & Business Media.

Handoko, S. T., Mardiati, Y., Ismail, R., & Imawan, O. R. (2023). Employing Higher Order Thinking Skills-based Instruction in History Course: A History High School Teacher’s Perspective. AIP Conference Proceedings, 2679(January). https://doi.org/10.1063/5.0127631

Harsana, F. N., Retnawati, H., Dewanti, S. R., Lumenyela, R. A., Sotlikova, R., Adzima, M. F., & Septiana, A. R. (2024). Comparison of item characteristic analysis models of reading literacy test with polytomous Item Response Theory. REID (Research and Evaluation in Education), 10(2), 214–226. https://doi.org/10.21831/reid.v10i2.77852

Hortensius, L. (2012). Advanced Measurement - Logistic regression for DIF detection.

Ibrahim, Z. S., Retnawati, H., Irambona, A., & Pérez, B. E. O. (2024). Stability of estimation item parameter in IRT dichotomy considering the number of participants. REID (Research and Evaluation in Education), 10(1), 114–127. https://doi.org/10.21831/reid.v10i1.73055

Imawan, O. R., Retnawati, H., Haryanto, & Ismail, R. (2024). Confirmatory factor analysis and differential item functioning analysis on mathematical literacy instruments for prospective Indonesian elementary school teachers. AIP Conference Proceedings, 080009. https://doi.org/10.1063/5.0228174

Imawan, O. R., Retnawati, H., Haryanto, & Ismail, R. (2025). The challenges of implementing computerized adaptive testing in Indonesia. Journal of Education and E-Learning Research, 12(2), 124–144. https://doi.org/10.20448/jeelr.v12i2.6677

Isgiyanto, A. (2013). Diagnosis of Student Errors Based on Polytomous Scoring Using the Partial Credit Model in Mathematics. Jurnal Penelitian Dan Evaluasi Pendidikan, 15(2), 308–325. https://doi.org/10.21831/pep.v15i2.1099

Ismail, R., Retnawati, H., Sugiman, Arovah, N. I., & Imawan, O. R. (2024). Contexts proposed by teachers in Papua for developing mathematics hots assessment instruments: A phenomenological study. Journal of Education and E-Learning Research, 11(3), 548–556. https://doi.org/10.20448/jeelr.v11i3.5922

Ismail, R., Retnawati, H., Sugiman, & Imawan, O. R. (2024). Construct validity of mathematics high order thinking skills instrument with cultural context: Confirmatory factor analysis. AIP Conference Proceedings, 080008. https://doi.org/10.1063/5.0228143

Ismail, R., Retnawati, H., Sugiman, S., Setiawati, F. A., Imawan, O. R., & Santoso, P. H. (2024). Optimal Scale Points for Reliable Measurements: Exploring the Impact of Scale Point Variation. JP3I (Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia), 13(1), 44–56. https://doi.org/10.15408/jp3i.v13i1.34173

Karimah, U., Retnawati, H., Hadiana, D., Pujiastuti, P., & Yusron, E. (2021). The characteristics of chemistry test items on nationally-standardized school examination in Yogyakarta City. REID (Research and Evaluation in Education), 7(1), 1–12. https://doi.org/10.21831/reid.v7i1.31297

Kartianom, K., & Mardapi, D. (2018). The utilization of junior high school mathematics national examination data: A conceptual error diagnosis. REID (Research and Evaluation in Education), 3(2), 163–173. https://doi.org/10.21831/reid.v3i2.18120

Kartowagiran, B., Mardapi, D., Purnama, D. N., & Kriswantoro, K. (2019). Parallel tests viewed from the arrangement of item numbers and alternative answers. REID (Research and Evaluation in Education), 5(2), 169–182. https://doi.org/10.21831/reid.v5i2.23721

Kusumawati, M., & Hadi, S. (2018). An analysis of multiple choice questions (MCQs): Item and test statistics from mathematics assessments in senior high school. REID (Research and Evaluation in Education), 4(1), 70–78. http://journal.uny.ac.id/index.php/reid

Lin, C. J. (2008). Comparisons between classical test theory and item response theory in automated assembly of parallel test forms. The Journal of Technology, Learning, and Assessment, 6(8), 1–42.

McGrath, J. M., & Brandon, D. (2018). What Constitutes a Well-Designed Pilot Study? Advances in Neonatal Care, 18(4), 243–245. https://doi.org/10.1097/ANC.0000000000000535

Messick, S. J. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749.

Otaya, L. G., Kartowagiran, B., Retnawati, H., & Mustakim, S. S. (2020). Estimating the ability of pre-service and in-service Teacher Profession Education (TPE) participants using Item Response Theory. REID (Research and Evaluation in Education), 6(2), 160–173. https://doi.org/10.21831/reid.v6i2.36043

Pardede, T., Santoso, A., Diki, D., Retnawati, H., Rafi, I., Apino, E., & Rosyada, M. N. (2023). Gaining a deeper understanding of the meaning of the carelessness parameter in the 4PL IRT model and strategies for estimating it. REID (Research and Evaluation in Education), 9(1), 86–117. https://doi.org/10.21831/reid.v9i1.63230

Ploutz-Snyder, L., Bloomfield, S., & Smith, S. M. (2014). Fundamental Principles of Small Sample Size Research. Frontiers in Physiology, 5, 413.

Retnawati, H. (2014). Teori respon butir dan penerapannya [Item response theory and its application]. Parama Publishing.

Robitzsch, A. (2025). sirt: Supplementary item response theory models [R package]. CRAN. https://doi.org/10.32614/CRAN.package.sirt

Santoso, P. H., Setiawati, F. A., Ismail, R., & Suhariyono, S. (2023). Comparing IRT properties among different category numbers: a case from attitudinal measurement on physics education research. Discover Psychology, 3(1). https://doi.org/10.1007/s44202-023-00101-6

Scott, N. W., Fayers, P. M., Aaronson, N. K., Bottomley, A., de Graeff, A., Groenvold, M., Gundy, C., & Koller, M. (2009). A simulation study provided sample size guidance for differential item functioning (DIF) studies using short scales. Journal of Clinical Epidemiology, 62(3), 288–295.

Setiawan, A., Kassymova, G. K., Mbazumutima, V., & Agustyani, A. R. D. (2024). Differential Item Functioning of the region-based national examination equipment. REID (Research and Evaluation in Education), 10(1), 99–113. https://doi.org/10.21831/reid.v10i1.73270

Sheppard, R., et al. (2006). Differential item functioning by sex and race in the Hogan Personality Inventory.

Stout, W. F. (1990). A new item response theory modeling approach with applications to unidimensionality assessment and ability estimation. Psychometrika, 55(2), 293–325. https://doi.org/10.1007/BF02295289

Sumin, S., Sukmawati, F., & Nurdin, N. (2022). Gender differential item functioning on the Kentucky Inventory of Mindfulness Skills instrument using logistic regression. REID (Research and Evaluation in Education), 8(1), 55–66. https://doi.org/10.21831/reid.v8i1.50809

Sumintono, B., & Widhiarso, W. (2013). Aplikasi model Rasch untuk penelitian ilmu sosial [Application of the Rasch model for social science research] (1st ed.). Trim Komunikata Publishing House.

Susongko, P. (2010). Perbandingan keefektifan bentuk tes uraian dan testlet dengan penerapan graded response model (GRM) [Comparison of the effectiveness of essay and testlet test forms using the graded response model (GRM)]. Jurnal Penelitian dan Evaluasi Pendidikan.

Tang, K. L. (1996). Polytomous item response theory (IRT) models and their applications in large-scale testing programs: Review of literature (TOEFL Monograph Series RM-96-8). Educational Testing Service.

Tognolini, J., & Davidson, M. (2003). How do we operationalise what we value? Some technical challenges in assessing higher order thinking skills. Paper presented at the National Roundtable on Assessment Conference, July 2003, Darwin, Australia.

Wardani, R. E. A., & Prihatni, Y. (2018). Developing assessment model for bandel attitudes based on the teachings of Ki Hadjar Dewantara. REID (Research and Evaluation in Education), 4(2), 117–125. http://journal.uny.ac.id/index.php/reid

Wasis. (2011). Model penskoran partial credit pada butir multiple true-false bidang fisika [Partial credit scoring model for multiple true-false items in physics]. Jurnal Penelitian dan Evaluasi Pendidikan.

Wu, B. C. (2003). Scoring multiple true-false items: A comparison of summed scores and response pattern scores at item and test levels. Research report. Educational Resources Information Center (ERIC).

Yim, L. W. K., Lye, C. Y., & Koh, P. W. (2024). A psychometric evaluation of an item bank for an English reading comprehension tool using Rasch analysis. REID (Research and Evaluation in Education), 10(1), 18–34. https://doi.org/10.21831/reid.v10i1.65284

Yudiana, W., Triwahyuni, A., & Susanto, H. (2023). Multidimensional Rasch Analysis of Gender Differences in Tes Intelegensi Kolektif Indonesia–Tinggi (TIKI-T). JP3I (Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia), 12(1), 1–16. https://doi.org/10.15408/jp3i.v12i1.20417

Zhang, J. (2006). Conditional Covariance Theory and Detect for Polytomous Items. Psychometrika, 72(1), 69–91. https://doi.org/10.1007/s11336-004-1257-7

Published

2026-05-02

Issue

Vol. 15 No. 1 (2026)

Section

Articles

How to Cite

Investigating Differential Item Functioning (DIF) in Geometry Test Scores: Holistic vs. Analytical Scoring Rubrics. (2026). JP3I (Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia), 15(1), 1–21. https://doi.org/10.15408/jp3i.v15i1.40842