Evaluating Psychometric Properties of Raven’s Coloured Progressive Matrices Test in Indonesian Sample using the Rasch Model

Coloured Progressive Matrices (CPM) is a psychological test well known among Indonesian psychologists as a measure of intelligence. Some researchers who have used CPM report that it has weaknesses with respect to measurement equivalence. Therefore, this research evaluates the psychometric properties of CPM in detail using the Rasch model. This research used a secondary data analysis approach, in which primary data sets from a psychological service were merged into a single file for further analysis. Data were collected from 371 boys and 377 girls aged five to seven years who took an intelligence test to assess their school readiness. The Rasch model analysis showed that CPM met the assumptions of unidimensionality and local independence and had fairly good reliability, but eight items were unsuitable for testing intelligence. Only twenty-eight CPM items were suitable for measuring children’s intelligence in Indonesia.


Introduction
Individual differences in intelligence have become a fascinating topic of study (Hülür et al., 2011). Raven (1983) defines intelligence as an ability that reflects individual differences in capturing information, experiences, and conditions that have been experienced. The community in general is aware that intelligence is one of the most important factors in determining children's educational success (Ardini & Handini, 2018). Furthermore, according to Gorey (2001), intelligence is also closely related to academic achievement in the early childhood development stage. Therefore, intelligence and education in early childhood are inseparable from preparing children for school until they receive education with a more formal curriculum.
For example, children in the early childhood stage from age five to seven in the United States are carefully prepared for elementary school since this age range is the primary age for a child to enter elementary school (Ziol-Guest & McKenna, 2014). Similarly, parents in Indonesia prepare their children to enter elementary school by having them take a series of psychological assessments. Evidence of this can be seen in the data on participants in a psychological assessment conducted by a psychological institute in Yogyakarta in 2021. A total of 321 children aged six to seven went through a series of psychological assessments to determine whether they were prepared for elementary school (Gusniarti et al., 2021). One of the instruments used in the series of assessments is Coloured Progressive Matrices, which measures children's intelligence.
In principle, Coloured Progressive Matrices (CPM) is an instrument to determine a child's level of intelligence without any previous learning process (Sanz-Cervera et al., 2015). It is also used to measure abstract thinking skills in children (van Schoor et al., 2016). CPM can measure a child's abstract thinking ability because all of the items contain patterns, colors, and various shapes, with the expectation that children will analyze the patterns and answer the pictures one by one (Yoshizawa et al., 2014).
CPM is the most widely used psychological test among Indonesian psychologists in assessment activities. The reasons are that its procedure is time-saving, the test is easily administered, and the task is quite simple: children respond to images by choosing the piece that matches the pattern. Besides being widely used in assessment, CPM has other advantages. It is used not only to measure the intelligence of children with normal brain function (Khan, 2015) but also with children with special needs, such as children with autism and attention-deficit/hyperactivity disorder (Sanz-Cervera et al., 2015). Bass (2000) emphasized that CPM is a psychological instrument used by many psychology researchers in various countries. In 2018, CPM was studied with children in Sardinia, Italy. This study used a large sample of 1,626 elementary and junior high school children aged five to thirteen (Nicotra et al., 2018). The research reported descriptive results of the CPM item calibration for each section. It found that the most difficult items in CPM were item A11 (for Section A), item Ab12 (for Section Ab), and item B12 (for Section B). However, the study did not provide more detailed information regarding assumptions, fit statistics, instrument reliability, item bias detection, and other indices that are important to present when conducting an analysis using the Rasch model.
Referring to other studies, CPM is highly likely to have weaknesses in terms of psychometric properties. The main weakness is the inconsistency among the results of studies that used CPM as a research instrument, especially in the analysis of instrument bias. Two previous studies described CPM as a highly consistent instrument that is free from bias (Agnoli et al., 2012; Antoniou et al., 2022). Both studies support CPM as an instrument that upholds the principle of measurement equivalence, or fairness for anyone who uses it. In contrast, Sigmon (1983) and Lúcio et al. (2019) revealed that CPM showed a slight gender bias in its items. Other researchers also support that if CPM shows

Rasch Measurement Theory
In 1960, a mathematician named Georg Rasch introduced a mathematical model known as the Rasch model. The Rasch model was first introduced to analyze dichotomous data (Kreiner, 2013), that is, data obtained from correct or incorrect answers on a test. In addition, the Rasch model can be used to analyze polytomous data, widely used to measure attitudes, through extensions known as the Rating Scale Model and the Partial Credit Model (Andrich & Marais, 2019). In application, the Rasch model yields information-rich interpretations of the data, such as in-depth information about individual abilities and item difficulty levels (Khairani & Razak, 2015).
To describe the relationship between individual ability and item difficulty, the Rasch model can be given by the following formula:

$$P(X_{ni} = 1 \mid \beta_n, \delta_i) = \frac{e^{\alpha_i(\beta_n - \delta_i)}}{1 + e^{\alpha_i(\beta_n - \delta_i)}}$$

where: $X_{ni} = 1$ refers to the response obtained by subject n to item i (a correct response to the item); $\beta_n$ refers to the ability of subject n; $\delta_i$ refers to the difficulty of item i; $\alpha_i$ refers to the item's discrimination index, which is fixed at 1 for every item in the Rasch model; and e is the base of the natural logarithm (e = 2.718…). The individual ability parameter is obtained from the ratio of the number of correct answers an individual gives to the number of incorrect answers. The ratio is then transformed to an interval scale using the logarithm, or log odds; the result is a logit value. The logit value measures an individual's ability precisely enough to be compared with that of other individuals (Khairani & Razak, 2015).
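As a minimal sketch of the dichotomous Rasch model described above, the following Python function computes the probability of a correct response from a person ability and an item difficulty, both in logits. The function name `rasch_probability` is hypothetical, discrimination is assumed to be fixed at 1, and the example values are used only for illustration.

```python
import math

def rasch_probability(beta: float, delta: float) -> float:
    """Probability of a correct response under the dichotomous Rasch model.

    beta  : person ability in logits
    delta : item difficulty in logits
    Discrimination is assumed to be fixed at 1 for every item.
    """
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

# When ability equals difficulty, the chance of success is exactly one half.
print(rasch_probability(0.55, 0.55))

# A person whose ability is above the item difficulty succeeds more often than not.
print(rasch_probability(0.55, 0.0) > 0.5)
```

The larger the gap between ability and difficulty, the closer the probability moves to 1 (or to 0 when the gap is negative).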
The item difficulty parameter is likewise expressed in logits. Its value is generated from the proportion of incorrect answers divided by the proportion of correct answers on the item, which is then transformed using log odds. Because the two parameters share the same unit, the logit, they can be compared directly (Khairani & Razak, 2015). Logits resulting from individual abilities and item difficulty levels can therefore be interpreted more meaningfully (Wright & Stone, 1979).
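The log-odds transformation described above can be sketched as follows. This is only an illustration of the raw starting values, not the iterative estimation a Rasch program performs; the function names and the example counts are hypothetical.

```python
import math

def person_logit(n_correct: int, n_items: int) -> float:
    """Raw ability logit: ln(correct answers / incorrect answers)."""
    n_incorrect = n_items - n_correct
    if n_correct <= 0 or n_incorrect <= 0:
        raise ValueError("extreme scores have no finite logit")
    return math.log(n_correct / n_incorrect)

def item_logit(n_correct: int, n_persons: int) -> float:
    """Raw difficulty logit: ln(incorrect answers / correct answers)."""
    n_incorrect = n_persons - n_correct
    if n_correct <= 0 or n_incorrect <= 0:
        raise ValueError("extreme items have no finite logit")
    return math.log(n_incorrect / n_correct)

# Hypothetical child who answers 27 of 36 items correctly: ln(27 / 9) = ln 3.
print(person_logit(27, 36))

# Hypothetical item answered correctly by 600 of 748 children: negative, so easy.
print(item_logit(600, 748))
```

Because both quantities are on the logit scale, a positive person logit and a negative item logit can be read against the same zero point.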
Rasch model analysis rests on mandatory assumptions, namely unidimensionality and local independence (Mair, 2018), supplemented by the goodness-of-fit indices infit mean-square (MNSQ) and outfit MNSQ. The outfit MNSQ is judged against reasonable limits to determine how well an item fits. If the outfit MNSQ value falls below or above these limits, the analyzed item is likely unsuitable for use. The infit MNSQ, in turn, is very sensitive to the obtained responses (Khairani & Razak, 2015). An advantage of analysis using the Rasch model is the detailed information it provides, such as fit statistics, reliability values, separation indices, and a comparison of individual abilities with item difficulty levels (Clements et al., 2008). The fit statistics show how well the data meet the expectations of the Rasch model. The summary compares the overall individual ability (person mean logit) with the item difficulty level (item mean logit) and reports item and person separation and item and person reliability to show the suitability of the items and persons in the test. According to previous research, the Rasch model is a highly accurate method for assessing the quality of an instrument because item and person parameters are estimated precisely (Jong et al., 2015).
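To make the two mean-square indices concrete, the sketch below computes them for a single item under common textbook definitions: outfit as the unweighted mean of squared standardized residuals, and infit as the sum of squared residuals divided by the sum of model variances. The function names, abilities, and responses are hypothetical, and the abilities and difficulty are assumed to be already estimated.

```python
import math

def rasch_p(beta: float, delta: float) -> float:
    """Expected probability of success under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(beta - delta)))

def item_fit(responses, abilities, delta):
    """Return (infit_mnsq, outfit_mnsq) for one item.

    responses : 0/1 scores, one per person
    abilities : person abilities in logits, same order as responses
    delta     : item difficulty in logits
    """
    squared_residuals = []
    variances = []
    for x, beta in zip(responses, abilities):
        p = rasch_p(beta, delta)
        squared_residuals.append((x - p) ** 2)
        variances.append(p * (1.0 - p))
    # Outfit: plain average of squared standardized residuals.
    outfit = sum(r / v for r, v in zip(squared_residuals, variances)) / len(variances)
    # Infit: residuals weighted by the information (variance) of each response.
    infit = sum(squared_residuals) / sum(variances)
    return infit, outfit

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]

# Very orderly responses: low-ability persons fail, high-ability persons succeed.
print(item_fit([0, 0, 1, 1, 1], abilities, 0.0))

# Same pattern, but the least able person unexpectedly succeeds.
print(item_fit([1, 0, 1, 1, 1], abilities, 0.0))
```

In the second pattern the single unexpected success inflates the outfit value more than the infit value, because the surprising response comes from a person far from the item's difficulty.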

Research Design
The secondary data analysis approach was used in this research (Johnston, 2014). It is a technique for analyzing primary data collected by other people or institutions for other purposes. Secondary data analysis is an empirical research approach that follows the same research principles as studies that collect data directly. It can be carried out for systematic investigation in various fields. The mandatory steps in conducting secondary data analysis are (1) developing research questions, (2) identifying the data set as a whole, and (3) evaluating the data set obtained (this stage is described in the analysis procedure).

Participants
The primary data of this research were obtained from intelligence testing conducted by Darunnisa Psychological Service on children in twenty kindergartens and elementary schools in Bandung from 2017 to 2021. The primary data consisted of hundreds of data files, which were merged into a single Excel file (hereafter referred to as the secondary data). Based on the secondary data, the total number of participants in this study was 748 children, comprising 371 girls (49.6%) and 377 boys (50.4%). The participants were five to seven years old (M = 5.67, SD = 0.54), an age range chosen on the basis of the children's developmental task. These children are at the school readiness stage (Williams & Lerner, 2019).

Instrument
Raven's Coloured Progressive Matrices is an instrument to measure children's intelligence, especially that of children aged five to eleven years (Muniz et al., 2016). There are three sections (Section A, Section Ab, and Section B), each consisting of twelve items with different difficulty levels, so the total number of CPM items is 36. The items in Section B are more difficult than those in Section Ab, which in turn are more difficult than those in Section A. Every item consists of a picture with a blank part that needs to be filled in; children are expected to choose one of the six available answer options. There is only one correct answer for each item, so the minimum CPM score is 0 and the maximum is 36. In practice, CPM can be administered individually or in groups since there is no time limit for the assessment. The test-retest reliability of CPM in previous studies was .90, which means that CPM consistently measures intelligence in children (Lehmann et al., 2014).

Procedure
The research procedure followed the steps of the secondary data analysis approach. The first step concerned the research questions. The research questions were developed after the researchers conducted a literature review and found problems with CPM. The literature review found gaps in the results of previous research, and these gaps formed the basis of the research question. This paper questioned the quality of the psychometric properties of CPM. Therefore, this study aimed to evaluate the psychometric properties of CPM with participants of children in the early childhood stage in Indonesia.
JP3I (Jurnal Pengukuran Psikologi dan Pendidikan Indonesia), 12(2), 2023, 97-107, http://journal.uinjkt.ac.id/index.php/jp3i This is an open access article under CC-BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)
The second step was to identify the data set as a whole. Before identifying the data set, the researchers sent a letter to Darunnisa Psychological Service requesting permission to conduct the CPM research there. After granting permission, Darunnisa Psychological Service provided the data as hundreds of Excel files. The researchers identified the data set from the files given and then combined them into one Excel file. Merging hundreds of data sets into one Excel file made statistical analysis easier, since statistical programs generally analyze only one data file. The last step was to analyze the Excel file using a statistical program, specifically with the Rasch model. The results of the analysis are presented in the results section of this paper.

Statistical Analysis
The research data were analyzed using the Winsteps 3.65 program. The reference values used in this study include: (a) an instrument is considered unidimensional when the raw variance explained by measures is > 40% (Holster & Lake, 2016); (b) items are considered locally independent when the largest standardized residual correlation is below the critical value (Q3) of .30 (Christensen et al., 2017); (c) the ideal limit of the person-item separation index is > 3 (Duncan et al., 2003); (d) an item fits the Rasch model when its outfit MNSQ value is in the range of .5 to 1.5 (Boone et al., 2014); (e) item difficulty levels from item calibration are categorized using the range -.30 to .30 logit (Wicaksono et al., 2021); and (f) an item is indicated to be biased if the DIF contrast value is > .40 (Rogers & Swaminathan, 1990).
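The reference values listed above can be collected into a small checking routine. The sketch below is illustrative only: the function names and dictionary keys are hypothetical, and the example input uses the summary values reported in the results of this paper.

```python
def check_criteria(summary: dict) -> dict:
    """Compare summary statistics with the reference values quoted in the text."""
    return {
        # (a) raw variance explained by measures above 40%
        "unidimensional": summary["variance_explained_pct"] > 40.0,
        # (b) largest standardized residual correlation below .30
        "locally_independent": summary["largest_residual_corr"] < 0.30,
        # (c) separation indices above 3
        "person_separation_ok": summary["person_separation"] > 3.0,
        "item_separation_ok": summary["item_separation"] > 3.0,
    }

def outfit_ok(mnsq: float, low: float = 0.5, high: float = 1.5) -> bool:
    """(d) outfit mean-square inside the accepted range."""
    return low <= mnsq <= high

# Summary values as reported in the results sections of this paper.
reported = {
    "variance_explained_pct": 50.3,
    "largest_residual_corr": 0.23,
    "person_separation": 1.9,
    "item_separation": 13.12,
}
for name, passed in check_criteria(reported).items():
    print(name, passed)
```

With these inputs, only the person separation criterion is not met, which matches the interpretation given in the discussion.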

Unidimensionality and Local Independence
This study also tested the two mandatory assumptions of Rasch model analysis. The first assumption is unidimensionality. The raw variance explained by measures on CPM was 50.3%, exceeding the limit of > 40% set by previous research (Holster & Lake, 2016). This suggests that the CPM in this study measured only one aspect, namely intelligence. The second assumption is local independence. In their book, Bond and Fox (2015) suggest that local independence means that the response to one item is not linked to the response to another item. Linkages between items can be seen from the largest standardized residual correlation between items. The largest standardized residual correlation on CPM, .23, was found between item A2 and item A3. This value is below the standard critical value (Q3) of .30 (Christensen et al., 2017), which may indicate that each CPM item is independent.

Fit Statistics and Reliability
Overall (see Table 1), this study found a person mean of .55 logit and an item mean of .00 logit. When the person mean is greater than the item mean on a cognitive scale, participants had little difficulty with the test. The average intelligence of the participants who took CPM was in the high category. The results of this analysis are in line with those of research on measuring achievement tests (Othman et al., 2015). This suggests that CPM could be too easy for children in the early childhood stage in Indonesia to complete. another). The item standard deviation had a greater value, indicating that CPM items had varying difficulty levels. The person separation index of 1.9 (< 3) supported the evidence from the small person standard deviation, which may indicate that the participants of this study were homogeneous in terms of age, developmental stage, and level of intelligence. The item separation index of 13.12 (> 3) could indicate that the difficulty level grouping of CPM items was appropriate according to the Winsteps program. Direct evidence of the standard deviation and item separation analysis will be further explored in the item fit section of this analysis.
The next result concerns item and person reliability. The person reliability was .78, while the item reliability was .99. The item reliability value here does not show the consistency of the instrument. In this research, item reliability measured how good the tested CPM items were, whereas person reliability measured the appropriateness of the research participants in this study. Item-person reliability in this study could be categorized as good, that is, item reliability is > .70 and person reliability is > .80 (Mohd et al., 2017). Item-person reliability criteria were met for this study, which shows that no problems were found in items or persons (items and participants were appropriate for measuring intelligence).
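The separation indices and reliabilities reported above can be cross-checked against each other using the relationship commonly applied in Rasch measurement, reliability = G² / (1 + G²), where G is the separation index. The sketch below assumes that this identity applies to the reported values; the function names are hypothetical.

```python
import math

def reliability_from_separation(g: float) -> float:
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return g ** 2 / (1.0 + g ** 2)

def separation_from_reliability(r: float) -> float:
    """Inverse relationship: G = sqrt(R / (1 - R))."""
    return math.sqrt(r / (1.0 - r))

# Person separation of 1.9 and item separation of 13.12, as reported in the text.
print(round(reliability_from_separation(1.9), 2))    # 0.78
print(round(reliability_from_separation(13.12), 2))  # 0.99
```

Both results agree with the reported person reliability of .78 and item reliability of .99, which suggests that the summary statistics are internally consistent.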
For the reliability of CPM, as indicated by the Cronbach's alpha value (KR-20), the value of α = .79 > .70 was in the acceptable range for an instrument (Sharma, 2016). This suggests that CPM measures intelligence consistently. In addition, in testing model fit, the question arises whether the Rasch model used is appropriate. This is shown by the chi-square value = 20493.52 and p-value = 1, which indicates that the model fits (p-value > .01) and is acceptable (Renny et al., 2013).
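For dichotomous items, the KR-20 coefficient mentioned above can be computed directly from a matrix of 0/1 scores. The sketch below uses the textbook formula with the population variance of total scores, which is an assumption about the variant used; the function name and the toy data are hypothetical and are not the study's data.

```python
def kr20(scores):
    """KR-20 for dichotomous data.

    scores : list of rows, one per person, each a list of 0/1 item scores.
    Uses the population variance of the total scores (an assumption).
    """
    n_persons = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_persons
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_persons
    # Sum of p * q over items, where p is the proportion answering correctly.
    sum_pq = 0.0
    for i in range(n_items):
        p = sum(row[i] for row in scores) / n_persons
        sum_pq += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)

# Toy data: five persons, four items.
toy = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(toy), 2))  # 0.8
```

A value above .70, as in the criterion cited in the text, would be read as acceptable internal consistency.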

Wright Map, Calibration, and Item Fit
The Wright map (Figure 1) shows the distribution of the research participants' ability levels against the item difficulty levels. The researchers modified the Wright map for each test section to make it easier to understand. In general, the Wright map shows that the ability of the research participants is higher than the difficulty level of the questions. This can be seen from the distribution of participant ability, which lies higher than the distribution of item difficulty. Ten of the 36 items are at the bottom of the diagram, which may suggest that the CPM is a psychological instrument that could be slightly easy for the research participants to complete. Further examination also indicates that CPM is an easy instrument, as can be seen from the calibration results of the CPM items in Table 2. The item calibration divides each section into several difficulty categories. Wicaksono et al. (2021) categorize the level of difficulty of a test based on the range -.30 to +.30 logit (hard if the item logit value is > .30, medium if it ranges from -.30 to .30, and easy if it is < -.30). Section A of the CPM consists of 12 items, categorized into hard, medium, and easy levels. The most difficult items in Section A are items A11, A12, and A9. In Section Ab, the 12 items are likewise divided into three difficulty levels: hard, medium, and easy. Based on the results of item calibration, seven difficult items were found in Section Ab. Section B consists of two levels of difficulty based on the calibration results. Section B was the section with the most difficult items, as can be seen from the eight items (B12, B11, B8, B9, B10, B7, B5, and B6) at the hard level. Meanwhile, there are four easy items in Section B: items B2, B4, B3, and B1.
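The three-level categorization attributed to Wicaksono et al. (2021) can be expressed as a simple rule. The sketch below is illustrative; the function name and the example logit values are hypothetical and are not taken from Table 2.

```python
def difficulty_category(item_logit: float, cutoff: float = 0.30) -> str:
    """Classify an item as hard, medium, or easy from its logit value."""
    if item_logit > cutoff:
        return "hard"
    if item_logit < -cutoff:
        return "easy"
    return "medium"

# Hypothetical calibration values for three items.
for label, logit in [("item X", 1.25), ("item Y", 0.10), ("item Z", -2.40)]:
    print(label, difficulty_category(logit))
```

Items whose logit values fall exactly on the cutoffs are treated as medium here, which follows the stated range of -.30 to .30.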
The results of item calibration in this study were closely related to those of the item fit test. Item fit was analyzed using the Rasch model (see Table 2), and the results showed that three of the CPM items did not fit, as indicated by outfit MNSQ values that did not fall within the range of .5 to 1.5 (Boone et al., 2014). The three items were A3, A2, and B12. The outfit MNSQ values of the items that did not fit are marked in gray in Table 2. In Rasch model analysis, items that do not meet the ideal limits should be discarded. This may suggest that the three items do not measure intelligence accurately or are problematic items in the CPM.

Distractor Analysis
A mismatch was also found between person ability and item difficulty on one CPM item (item Ab12). In general, a child with high ability can answer every item correctly. On item Ab12, however, 13 children (about 2% of the total number of participants) with high ability (average measure = 1.14 logit > .55 person mean logit) incorrectly answered a question of medium difficulty. Item Ab12, when studied further, may have an effective answer distractor. For the other 35 items, which fall in the general item category, the questions could be answered correctly if a child is smart.

DIF Analysis
Another important finding in this study concerns the principle of measurement equivalence. Based on the results of the DIF analysis, CPM in this study seemed to violate the principle of measurement equivalence, which may suggest that several CPM items could be biased toward, or favorable to, one of the groups tested (see Table 3). DIF in CPM occurred only in Section A and Section B, not in Section Ab. This may suggest that Section Ab is an ideal section for measuring intelligence. In Section A, five items were detected to have DIF (A1, A2, A3, A4, and A5), and in Section B, two items had DIF (B7 and B9). The seven CPM items that showed DIF in Section A and Section B were at moderate (.40 to .60) and high (> .60) DIF levels. This accords with previous research, which states that an item displays DIF if the difference in DIF values between groups reaches the referenced values (Rogers & Swaminathan, 1990). The graph (see Figure 2) shows the DIF as the distance between the blue line (female) and the red line (male). Items A1, A2, A3, A4, A5, and B9 showed a higher DIF value in the male group (boys), which indicated that the six items benefited the male group. Only one item, item B7, benefited the female group (girls).
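The DIF levels used above can be sketched as a classification of the absolute difference between the item difficulties estimated separately for the two groups. The function names, the treatment of the boundary values, and the example difficulties below are assumptions for illustration and are not taken from Table 3.

```python
def dif_contrast(difficulty_group_1: float, difficulty_group_2: float) -> float:
    """Difference between an item's difficulty in two groups, in logits."""
    return difficulty_group_1 - difficulty_group_2

def dif_level(contrast: float) -> str:
    """Classify the size of a DIF contrast using the .40 and .60 limits."""
    size = abs(contrast)
    if size > 0.60:
        return "high"
    if size >= 0.40:
        return "moderate"
    return "negligible"

# Hypothetical item difficulties for two groups of test takers.
examples = {
    "item P": (0.90, 0.45),   # contrast 0.45
    "item Q": (-1.20, -0.48), # contrast -0.72
    "item R": (0.30, 0.20),   # contrast 0.10
}
for item, (d1, d2) in examples.items():
    print(item, dif_level(dif_contrast(d1, d2)))
```

The sign of the contrast shows which group found the item harder; only its magnitude is used for the level.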

Table 3. DIF Analysis
Note: A positive t-value indicates that the item tends to benefit the male group, and vice versa.

Discussion
The analysis also demonstrated the assumptions of unidimensionality and local independence. The unidimensionality assumption provides strong evidence that CPM is a test that measures intelligence, as shown by a raw variance explained by measures above the 40% limit. The results of the unidimensionality testing support previous research, which obtained a unidimensional model that fits CPM using confirmatory factor analysis (Lúcio et al., 2019). The test of local independence showed that the items in the CPM did not have a close relationship with one another. All CPM items were almost certain to be independent.
In the fit statistics test, only the person separation index was found lacking, at less than 3 (Duncan et al., 2003), because the study participants were at the same stage of development, namely early childhood, with a similar age range. Thus, the participants were homogeneous. Moreover, in terms of educational age, they were in the ideal age category of children preparing for elementary school (Ziol-Guest & McKenna, 2014).
Regarding the instrument's consistency, the CPM reliability value obtained in this study was .79. In previous studies using CPM, the test-retest reliability value was .90, which means that the reliability obtained in this study was smaller than that in previous studies. This may show that CPM has a lower reliability score in a one-time measurement than in a two-time measurement (test-retest) of intelligence in Indonesia. The results of this study are contrary to those of previous research, which stated that CPM is a consistent instrument (Agnoli et al., 2012). This study found four items that did not perform well in the CPM item calibration. These may have caused CPM to have a smaller reliability value than in previous studies. From a psychometric point of view, it is quite possible that poorly performing items have a substantial impact on reliability calculations. In addition, the research approach chosen was an important factor in determining the consistency of the instrument. This study used secondary data for analysis (one measurement), while Lehmann et al. (2014) conducted experimental research (two measurements). It is highly likely that the research approach affects the reliability value of an instrument. A research approach that uses two measurements may capture both the consistency of the instrument and the consistency of the research participants. A limitation of this study is that the data were not collected directly, since it could be costly and time consuming to collect data from 748 children preparing for school.
Regarding the item calibration analysis, the study yielded useful new information about CPM. In addition to identifying four items that did not meet the outfit MNSQ limit, this study found that each section of the CPM had a different profile of difficulty levels. For the research participants, children at the early childhood stage in Indonesia, no single item was difficult to complete, especially in Section Ab of the CPM. Although the items may not be difficult to complete, this does not mean that they are completely easy; for example, in the CPM category item analysis for item Ab12 of Section Ab, it is highly likely that distractors caused children with high abilities to have difficulty choosing answers. On the one hand, CPM is an easy test, but on the other, some distractors can make it harder for children to answer the questions.
Another strong factor that may explain why high-ability children answered item Ab12 incorrectly is the complexity of the picture pattern. Based on the data on the amount of time test takers spent at Darunnisa Psychological Service, children on average spent more time on CPM item Ab12 than on other items, for example item Ab12 (M = 10.67 seconds), item Ab11 (M = 10.3 seconds), item Ab12 (M = 10.3 seconds), and item Ab12 Ab5 (M = 9.3 seconds). They spent around 5.8 to 7.5 seconds on the other items in Section Ab. This finding suggests that item complexity is closely related to the processing time of the questions.
The indication that item Ab12 has the highest difficulty level in Section Ab of the CPM is in line with the calibration results of previous research conducted in Italy. Nicotra et al. (2018) found that the most difficult items in CPM were item A11 (for Section A), item Ab12 (for Section Ab), and item B12 (for Section B). The findings of this study support similar findings regarding the most difficult items in each test section. This may indicate that children in Indonesia and Italy experience similar levels of difficulty when working on CPM items with complex patterns. This research provides more detailed information about item fit, which we believe is its strength.
Based on the results of the DIF analysis, seven items indicated gender bias (mostly more favorable for the male group). This finding corroborates the results of previous research (Lúcio et al., 2019; Lynn & Irwing, 2004; Sigmon, 1983). The number of biased items was quite large: 7 out of 36, or 19.4% of CPM items, were gender-biased. In addition, six of the seven CPM items shown to be gender-biased were more favorable for the male group in measuring intelligence. In general, men benefit more from tests in visual or geometric form. This is in line with research that has been carried out in Indonesia (Ridho, 2014). Ridho (2014) states that men will benefit more if they are tested in the cognitive realm.
Another limitation of this study is that the data were obtained from a psychological institution. Consequently, it was difficult to analyze whether cultural differences also contributed to the gender bias. In the gender bias test, it was found that the test gave the male group an unfair advantage. The demographic profile of the research participants contained no variable on their culture. Therefore, it is important for future studies to have more control over the demographics. More diverse demographic variables, such as gender, culture, and even country of origin, may provide more useful information.

Conclusion
Based on the data analysis and discussion, eight items should not be used in intelligence testing using CPM. These are items A1, A2, A3, A4, A5, B7, B9, and B12. Items A1, A4, A5, B7, and B9 may not be good enough due to DIF detection. Item B12 did not meet the item fit limit, and items A2 and A3 were flagged on both criteria. Therefore, only 28 items are eligible to be used to measure intelligence in participants at the early childhood stage. The issue of "norming" would be more crucial for psychological tests that are adapted from other countries or cultures, such as the CPM.
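The item counts in this conclusion can be verified with simple set arithmetic on the item lists reported in the item fit and DIF results. The variable names below are illustrative.

```python
# All 36 CPM items: twelve items in each of Sections A, Ab, and B.
all_items = [f"{section}{number}" for section in ("A", "Ab", "B") for number in range(1, 13)]

# Items flagged in the DIF analysis and in the item fit analysis, as reported in the text.
dif_items = {"A1", "A2", "A3", "A4", "A5", "B7", "B9"}
misfit_items = {"A2", "A3", "B12"}

flagged = dif_items | misfit_items
eligible = [item for item in all_items if item not in flagged]

print(len(flagged), len(eligible))        # 8 28
print(sorted(dif_items & misfit_items))   # ['A2', 'A3']
print(sorted(dif_items - misfit_items))   # ['A1', 'A4', 'A5', 'B7', 'B9']
```

The union of the two lists gives the eight excluded items, and removing them from the full set leaves the 28 eligible items.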
Further research could use the 28 eligible CPM items. It is necessary to develop new norms adapted to children in Indonesia. In addition, CPM can be tested for cultural bias because studies in different countries have reported different results. Studies using CPM are likely to indicate cultural bias, which matters especially for researchers and psychologists who are interested in measuring children's intelligence. To establish the level of consistency of CPM in Indonesia, a study using an experimental approach with two measurements (pre-test and post-test) needs to be carried out as a follow-up study to further evaluate the reliability of CPM.

Table 1. Summary Fit Statistics

Table 2. Item Fit & Calibration of Coloured Progressive Matrices
Note: Items marked in gray do not meet the item fit limits.