Comparing IRT Models: Summated Scaling Effects on Critical Thinking in Vocational Students

Authors

  • Andi Abdurrahman Manggaberani, Universitas Negeri Yogyakarta
  • Samsul Hadi, Department of Educational Research and Evaluation, Graduate School, Universitas Negeri Yogyakarta
  • Nur Hidayanto Pancoro Setyo Putro, Department of Educational Research and Evaluation, Graduate School, Universitas Negeri Yogyakarta
  • Abrar Syahrul Fajri, Universitas Negeri Yogyakarta
  • Heri Retnawati, Universitas Negeri Yogyakarta

DOI:

https://doi.org/10.15408/jp3i.v14i2.42886

Keywords:

Critical Thinking, Summated Rating, Item Response Theory

Abstract

This study compares the efficacy of Summated Rating Scales (SRS) and traditional ordinal scales (raw Likert-type responses) for measuring critical thinking skills among vocational students, employing Item Response Theory (IRT) to evaluate their psychometric properties. Addressing a key limitation of ordinal scales, namely inconsistent intervals between response categories, the research adopts a descriptive quantitative methodology involving 269 students from state vocational high schools in Yogyakarta, Indonesia. Data were collected using a five-point Likert scale instrument, validated for content (Aiken's V = 0.94), and analyzed through two IRT frameworks: polytomous IRT for the unscaled ordinal data and Continuous Response Model (CRM) IRT for the SRS-transformed interval data. Key findings reveal that SRS enhances measurement precision by normalizing response distributions into proportional intervals (e.g., recalibrated scores: 0.00, 0.73, 1.46, 2.07, 2.84), thereby resolving the unequal category spacing inherent to ordinal scales. Polytomous IRT demonstrated robust item fit (e.g., the Partial Credit Model fit 5 of 6 items) and strong difficulty parameter invariance (r = 0.84), yet exhibited unstable ability estimates (r = 0.37) due to extreme response patterns. Conversely, CRM IRT applied to the scaled data produced more stable ability estimates (r = 0.46) and eliminated infinite values in Maximum Likelihood Estimation, underscoring its advantage in handling continuous metrics. However, ordinal scales retained higher consistency in difficulty calibration across subgroups. The study concludes that integrating SRS with CRM IRT offers a refined approach to critical thinking assessment, balancing precision and fairness, while ordinal scales remain pragmatic for contexts prioritizing simplicity. These insights support the adoption of advanced scaling techniques in vocational education to improve the validity of competency evaluations, with recommendations for future research to explore hybrid models and longitudinal applications.
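The abstract's recalibrated category scores (0.00, 0.73, 1.46, 2.07, 2.84) illustrate rescaling ordinal Likert responses onto an interval metric. The article does not reproduce its exact transformation, but one common summated-rating procedure derives category values from the normal deviates of cumulative response proportions. A minimal Python sketch under that assumption (the function name and the example frequencies are hypothetical, for illustration only):

```python
from statistics import NormalDist

def summated_scale_values(freqs):
    """Derive interval-scale category values from Likert response
    frequencies via normal deviates of cumulative proportions
    (an assumed summated-rating rescaling, not necessarily the
    article's exact procedure)."""
    nd = NormalDist()
    total = sum(freqs)
    z, cum = [], 0.0
    for f in freqs:
        p = f / total
        # z-score at the midpoint of each category's cumulative band
        z.append(nd.inv_cdf(cum + p / 2))
        cum += p
    # shift so the lowest category scores 0, as in the abstract's example
    return [round(v - z[0], 2) for v in z]

# Hypothetical frequency distribution over five Likert categories
print(summated_scale_values([10, 20, 40, 20, 10]))
# → [0.0, 0.8, 1.64, 2.49, 3.29]
```

Note how a symmetric response distribution yields near-equal spacing, while skewed or peaked distributions compress or stretch the intervals, which is the unequal-spacing problem the abstract says SRS corrects.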

References

Aiken, L. R. (1985). Three coefficients for analyzing the reliability and validity of ratings. Educational and Psychological Measurement, 45(1), 131–142. https://doi.org/10.1177/0013164485451012

Akour, I. A., Al-Maroof, R. S., Alfaisal, R., & Salloum, S. A. (2022). A conceptual framework for determining metaverse adoption in higher institutions of gulf area: An empirical study using hybrid SEM-ANN approach. Computers and Education: Artificial Intelligence, 3(January), 100052. https://doi.org/10.1016/j.caeai.2022.100052

Alamrani, S., Gardner, A., Falla, D., Russell, E., Rushton, A. B., & Heneghan, N. R. (2023). Content validity of the Scoliosis Research Society questionnaire (SRS-22r): A qualitative concept elicitation study. PLoS ONE, 18(5 May), 1–21. https://doi.org/10.1371/journal.pone.0285538

Ali, U. S., Chang, H., & Anderson, C. J. (2015). Location indices for ordinal polytomous items based on item response theory. In ETS Research Report Series (Vol. 2015, Issue 2). https://doi.org/10.1002/ets2.12065

Alordiah, C. O., & Oji, J. (2024). Test equating in educational assessment: A comprehensive framework for promoting fairness, validity, and cross-cultural equity. Asian Journal of Assessment in Teaching and Learning, 14(1), 70–84. https://doi.org/10.37134/ajatel.vol14.1.7.2024

Astuti, N. D., Hajaroh, M., Prihatni, Y., Setiawan, A., Setiawati, F. A., & Retnawati, H. (2024). Comparison of KMO results, eigen value, reliability, and standard error of measurement: Original & rescaling through summated rating scaling. Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia, 13(2), 199–217. https://doi.org/10.15408/jp3i.v13i2.36684

Baker, M., Lu, P., & Lamm, A. (2021). Assessing the dimensional validity and reliability of the university of florida critical thinking inventory (UFCTI) in chinese: A confirmatory factor analysis. Journal of International Agricultural and Extension Education, 28(3), 41–56. https://doi.org/10.5191/jiaee.2021.28341

Bean, G. J., & Bowen, N. K. (2021). Item response theory and confirmatory factor analysis: Complementary approaches for scale development. Journal of Evidence-Based Social Work (United States), 18(6), 597–618. https://doi.org/10.1080/26408066.2021.1906813

BSKAP Kemendikbudristek. (2022). Dimensi, Elemen, dan Subelemen Profil Pelajar Pancasila pada Kurikulum Merdeka. In Kemendikbudristek.

Casper, W. C., Edwards, B. D., Wallace, J. C., Landis, R. S., & Fife, D. A. (2020). Selecting response anchors with equal intervals for summated rating scales. Journal of Applied Psychology, 105(4), 390–409. https://doi.org/10.1037/apl0000444

Chalmers, R. P. (2012). mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software, 48(6). https://doi.org/10.18637/jss.v048.i06

Dai, S., Vo, T. T., Kehinde, O. J., He, H., Xue, Y., Demir, C., & Wang, X. (2021). Performance of Polytomous IRT Models With Rating Scale Data: An Investigation Over Sample Size, Instrument Length, and Missing Data. Frontiers in Education, 6(September), 1–18. https://doi.org/10.3389/feduc.2021.721963

Febriana, B. W., & Setiawati, F. A. (2024). Increasing measurement accuracy: Scaling effect on academic resilience instrument using Method of Successive Interval (MSI) and Method of Summated Rating Scale (MSRS). Jurnal Penelitian Dan Evaluasi Pendidikan, 28(1), 32–42. https://doi.org/10.21831/pep.v28i1.69334

Fialho, L., & Zyngier, S. (2023). Quantitative methodological approaches to stylistics. In The Routledge handbook of stylistics (2nd ed.). Routledge.

Guenther, P., Guenther, M., Ringle, C. M., Zaefarian, G., & Cartwright, S. (2023). Improving PLS-SEM use for business marketing research. Industrial Marketing Management, 111(April), 127–142. https://doi.org/10.1016/j.indmarman.2023.03.010

Hj. Ebil, S., Salleh, S. M., & Shahrill, M. (2020). The use of E-portfolio for self-reflection to promote learning: A case of TVET students. Education and Information Technologies, 25(6), 5797–5814. https://doi.org/10.1007/s10639-020-10248-7

Jebb, A. T., Ng, V., & Tay, L. (2021). A review of key Likert scale development advances: 1995–2019. Frontiers in Psychology, 12(May), 1–14. https://doi.org/10.3389/fpsyg.2021.637547

Kadigi, R. M. J., Mgeni, C. P., Kangile, J. R., Aku, A. O. ati, & Kimaro, P. (2023). Can a legal game meat trade in Tanzania lead to reduced poaching? Perceptions of stakeholders in the wildlife industry. Journal for Nature Conservation, 76(March), 126502. https://doi.org/10.1016/j.jnc.2023.126502

Kadim, A., & Sunardi, N. (2021). Financial management system (QRIS) based on UTAUT model approach in Jabodetabek. International Journal of Artificial Intelligence Research, 6(1). https://doi.org/10.29099/ijair.v6i1.282

Kinel, E., Korbel, K., Kozinoga, M., Czaprowski, D., Stępniak, Ł., & Kotwicki, T. (2021). The measurement of health-related quality of life of girls with mild to moderate idiopathic scoliosis—comparison of isyqol versus srs-22 questionnaire. Journal of Clinical Medicine, 10(21). https://doi.org/10.3390/jcm10214806

Kusmaryono, I., Wijayanti, D., & Maharani, H. R. (2022). Number of response options, reliability, validity, and potential bias in the use of the likert scale education and social science research: A literature review. International Journal of Educational Methodology, 8(4), 625–637. https://doi.org/10.12973/ijem.8.4.625

Lindner, J. R., & Lindner, N. (2024). Interpreting Likert type, summated, unidimensional, and attitudinal scales: I neither agree nor disagree, Likert or not. Advancements in Agricultural Development, 5(2), 152–163. https://doi.org/10.37433/aad.v5i2.351

Mohamadi, Z. (2018). Comparative effect of online summative and formative assessment on EFL student writing ability. Studies in Educational Evaluation, 59, 29–40. https://doi.org/10.1016/j.stueduc.2018.02.003

Mustika, M., Maknun, J., & Feranie, S. (2019). Case study: Analysis of senior high school students' scientific creative, critical thinking and its correlation with their scientific reasoning skills on the sound concept. Journal of Physics: Conference Series, 1157(3). https://doi.org/10.1088/1742-6596/1157/3/032057

Payan-Carreira, R., Sacau-Fontenla, A., Rebelo, H., Sebastião, L., & Pnevmatikos, D. (2022). Development and validation of a critical thinking assessment-scale short form. Education Sciences, 12(12). https://doi.org/10.3390/educsci12120938

Putranta, H., & Supahar, S. (2019). Development of Physics-Tier Tests (PysTT) to measure students’ conceptual understanding and creative thinking skills: A qualitative synthesis. Journal for the Education of Gifted Young Scientists, 7(3), 747–775. https://doi.org/10.17478/jegys.587203

Robitzsch, A. (2021). A comparison of linking methods for two groups for the two-parameter logistic item response model in the presence and absence of random differential item functioning. Foundations, 1(1), 116–144. https://doi.org/10.3390/foundations1010009

Robitzsch, A., & Lüdtke, O. (2020). A review of different scaling approaches under full invariance, partial invariance, and noninvariance for cross-sectional country comparisons in large-scale assessments. Psychological Test and Assessment Modeling, 62(2), 233–279. https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam-2020-2/03_Robitzsch.pdf

Şad, S. N. (2020). Does difficulty-based item order matter in multiple-choice exams? (Empirical evidence from university students). Studies in Educational Evaluation, 64(September 2019), 100812. https://doi.org/10.1016/j.stueduc.2019.100812

Selçuk, E., & Demir, E. (2024). Comparison of item response theory ability and item parameters according to classical and Bayesian estimation methods. International Journal of Assessment Tools in Education, 11(2), 213–248. https://doi.org/10.21449/ijate.1290831

Shaw, A., Liu, O. L., Gu, L., Kardonova, E., Chirikov, I., Li, G., Hu, S., Yu, N., Ma, L., Guo, F., Su, Q., Shi, J., Shi, H., & Loyalka, P. (2020). Thinking critically about critical thinking: validating the Russian HEIghten® critical thinking assessment. Studies in Higher Education, 45(9), 1933–1948. https://doi.org/10.1080/03075079.2019.1672640

Sidel, J. L., Bleibaum, R. N., & Tao, K. W. C. (2018). Quantitative descriptive analysis. In S. E. Kemp, J. Hort, & T. Hollowood (Eds.), Descriptive analysis in sensory evaluation. John Wiley & Sons Ltd. https://doi.org/10.1002/9781118991657.ch8

Tobón, S., & Luna-Nemecio, J. (2021). Complex thinking and sustainable social development: Validity and reliability of the Complex-21 scale. Sustainability (Switzerland), 13(12), 1–19. https://doi.org/10.3390/su13126591

Tsikritsis, D., Legge, E. J., & Belsey, N. A. (2022). Practical considerations for quantitative and reproducible measurements with stimulated Raman scattering microscopy. Analyst, 147(21), 4642–4656. https://doi.org/10.1039/d2an00817c

Van Hauwaert, S. M., Schimpf, C. H., & Azevedo, F. (2020). The measurement of populist attitudes: Testing cross-national scales using item response theory. Politics, 40(1), 3–21. https://doi.org/10.1177/0263395719859306

Vollmer, F., & Alkire, S. (2022). Consolidating and improving the assets indicator in the global multidimensional poverty index. World Development, 158, 105997. https://doi.org/10.1016/j.worlddev.2022.105997

Zarate, D., Hobson, B. A., March, E., Griffiths, M. D., & Stavropoulos, V. (2023). Psychometric properties of the Bergen Social Media Addiction Scale: An analysis using item response theory. Addictive Behaviors Reports, 17(July 2022), 100473. https://doi.org/10.1016/j.abrep.2022.100473

Zou, G., Zou, L., & Qiu, S. F. (2023). Parametric and nonparametric methods for confidence intervals and sample size planning for win probability in parallel-group randomized trials with Likert item and Likert scale data. Pharmaceutical Statistics, 22(3), 418–439. https://doi.org/10.1002/pst.2280

Published

2025-11-03

Section

Articles

How to Cite

Comparing IRT Models: Summated Scaling Effects on Critical Thinking in Vocational Students. (2025). JP3I (Jurnal Pengukuran Psikologi Dan Pendidikan Indonesia), 14(2), 88-112. https://doi.org/10.15408/jp3i.v14i2.42886