The Use of Stocking-Lord and Haebara Methods in Horizontal Equating: A Case of Indonesian Madrasah Competence Assessment

Kusaeri Kusaeri, Ali Ridho, Noor Wahyudi

Abstract


The Indonesian Madrasah Competence Assessment (AKMI) is a national assessment administered annually by the Ministry of Religious Affairs. A distinctive feature of AKMI is that a different test is used every year: the assessment aims to capture the development of learning in madrasahs by comparing current-year test scores with those of the previous year. Such comparisons yield valid results only if the scores are first equated. This research therefore aims to (a) equate the 2022 and 2023 AKMI scientific literacy instruments and (b) evaluate the process by which the AKMI scientific literacy instruments are developed (together with the MSAT design), which has implications for equating. The study adopted a Non-Equivalent groups Anchor Test (NEAT) design because the two test forms were administered in different years to non-equivalent groups drawn from a diverse population. The data come from the Ministry of Religious Affairs' AKMI scientific literacy assessment at the Islamic elementary school level, with 303,987 participants in 2022 and 342,987 in 2023. The item pool comprised 674 scientific literacy items in 2022 and 1,392 items in 2023, of which 90 served as anchor items. The analysis proceeded in three stages: pre-equating analysis, equating calibration, and post-equating analysis. The results show differences in the item parameter estimates between 2022 and 2023, with the 2022 items being more difficult. Furthermore, the Stocking-Lord and Haebara methods proved effective, producing equating results with only minimal differences between them. However, the anchor items on which the equating was based were not representative of the item pool as a whole. These findings indicate the need for firm, psychometrically grounded standardization of the AKMI process, from item development and test assembly through field testing and anchor-item selection to item assembly in the MSAT application.
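To make the two criteria concrete, they can be written in their standard characteristic-curve form (the notation below is supplied for illustration and is not taken from the article). With linking coefficients $A$ and $B$ placing new-form (X) anchor parameters on the base-form (Y) scale via $a_j^{*} = a_j/A$ and $b_j^{*} = A b_j + B$, and with $V$ the anchor set evaluated at quadrature points $\theta_i$, the Haebara method minimizes the summed squared differences between item characteristic curves,

$$H(A,B) = \sum_{i} \sum_{j \in V} \Bigl[ p_j\bigl(\theta_i; \hat{a}_{Yj}, \hat{b}_{Yj}, \hat{c}_{Yj}\bigr) - p_j\bigl(\theta_i; \hat{a}_{Xj}/A,\, A\hat{b}_{Xj} + B,\, \hat{c}_{Xj}\bigr) \Bigr]^2,$$

while the Stocking-Lord method minimizes the squared difference between the two test characteristic curves,

$$SL(A,B) = \sum_{i} \Bigl[ \sum_{j \in V} p_j\bigl(\theta_i; \hat{a}_{Yj}, \hat{b}_{Yj}, \hat{c}_{Yj}\bigr) - \sum_{j \in V} p_j\bigl(\theta_i; \hat{a}_{Xj}/A,\, A\hat{b}_{Xj} + B,\, \hat{c}_{Xj}\bigr) \Bigr]^2.$$

A minimal computational sketch of the Stocking-Lord criterion follows, assuming a 2PL model and synthetic anchor parameters (all values and function names are hypothetical and for illustration only; this is not the authors' code or software):

import numpy as np
from scipy.optimize import minimize

def p2pl(theta, a, b):
    # 2PL response probabilities, shape (n_theta, n_items)
    return 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b)))

def sl_loss(coef, a_x, b_x, a_y, b_y, theta):
    # Squared gap between test characteristic curves after
    # rescaling form-X anchor parameters onto the form-Y scale
    A, B = coef
    tcc_y = p2pl(theta, a_y, b_y).sum(axis=1)
    tcc_x = p2pl(theta, a_x / A, A * b_x + B).sum(axis=1)
    return np.sum((tcc_y - tcc_x) ** 2)

# Synthetic estimates for 90 anchor items; true linking constants A=1.1, B=0.25
rng = np.random.default_rng(0)
a_y = rng.uniform(0.8, 2.0, 90); b_y = rng.normal(0.0, 1.0, 90)
a_x = 1.1 * a_y
b_x = (b_y - 0.25) / 1.1

theta = np.linspace(-4, 4, 41)  # quadrature points
res = minimize(sl_loss, x0=[1.0, 0.0],
               args=(a_x, b_x, a_y, b_y, theta), method="Nelder-Mead")
print("A = %.3f, B = %.3f" % tuple(res.x))  # recovers ~1.100, 0.250

Replacing the test-level sum inside the squared brackets with an item-level sum of squares yields the Haebara criterion; that the two minimizers nearly coincide for AKMI, as reported in the abstract, is the expected behavior when the anchor items function well.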


Keywords


Horizontal equating, AKMI, Stocking-Lord and Haebara


DOI: 10.15408/jp3i.v13i1.38300



Copyright (c) 2024 Kusaeri Kusaeri, Ali Ridho, Noor Wahyudi

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.