Rasch Analysis of The Indonesian Version of Individual Work Performance Questionnaire (IWPQ)

The Individual Work Performance Questionnaire (IWPQ) was developed by Koopmans et al. (2013). The questionnaire is based on the construct of individual work performance, which consists of task performance, contextual performance, and counterproductive work behavior. Widyastuti & Hidayat (2018) adapted the IWPQ into Bahasa Indonesia and validated it using the classical test theory (CTT) approach. The findings were therefore only applicable to that study's sample, as validity and reliability could not be legitimately generalized to other study settings. In comparison, the development of the original IWPQ used Rasch analysis to examine its measurement properties. Rasch analysis is a modern psychometric approach based on item response theory (IRT), which has several advantages over CTT. This study aimed to validate the psychometric properties of the Indonesian version of the IWPQ using the Rasch model. The psychometric properties discussed in this study include instrument reliability, person and item reliability, unidimensionality, rating scale functioning, and bias detection (Differential Item Functioning). The 213 participants in this survey were Indonesian citizens aged 18-46 years (mean = 30.64, SD = 8.55) who had been actively working for at least three months at their current job. The results showed that the assumption of unidimensionality was fulfilled for each sub-scale of the IWPQ. The 5-point Likert-type rating scale of this instrument functioned adequately. Person reliability for the sub-scales ranged from .58 to .80, while item reliability ranged from .90 to .97. The separations were considered high, with values ranging from 3.04 to 5.77. All items in this instrument functioned well to measure individual work performance except for one item in the Contextual Performance sub-scale. This item should be revised to achieve a more accurate measurement of the construct. One item in the Contextual Performance sub-scale was considered biased toward gender, and one item in the Counterproductive Work Behavior sub-scale was considered biased toward tenure. These findings have implications for using the Indonesian version of the IWPQ to assess employees' individual work performance and inform recommendations for future research.


Introduction
Every organization has certain goals and uses whatever resources are available to achieve them. Among the many contributing factors, individual work performance is the basic foundation that can predict organizational achievement (Campbell & Wiernik, 2015). Even so, the concept of performance is often misunderstood or used interchangeably with the term productivity. Productivity is output divided by input. Productivity is therefore a concept closely related to results, while performance is closely related to process (Rostiana & Lie, 2019).
Individual work performance is a construct regarding the behaviors or actions of an individual that are relevant to organizational goals. According to the study conducted by Koopmans et al. (2011), individual work performance consists of three dimensions: task performance, contextual performance, and counterproductive work behavior. Task performance is the proficiency with which an individual performs central job tasks. It includes work quantity, work quality, and job knowledge. Contextual performance is defined as individual behaviors that support the organizational environment in which the technical core must function. Counterproductive work behavior is behavior that harms the well-being of the organization, such as being late for work, engaging in off-task behavior, and absenteeism.
A follow-up study on these constructs resulted in an instrument known as the Individual Work Performance Questionnaire or IWPQ (Koopmans et al., 2013). The IWPQ was originally developed in the Netherlands. To date, a few studies have translated the IWPQ into different languages or used it in the context of different countries, such as Sweden (Dåderman et al., 2020), Spain (Ramos-Villagrasa et al., 2019), and South Africa (van der Vaart, 2021). In Indonesia, the adaptation of the IWPQ was conducted by Widyastuti & Hidayat (2018).
There are several approaches to measuring psychological variables. Classical Test Theory (CTT) is a relatively popular psychometric theory used in social science disciplines, including psychology. CTT analysis and interpretation can be carried out according to research needs with a number of properties, including descriptive statistics, difficulty level, discriminant index, item-total correlation, and item weighting (Bond & Fox, 2015). The effectiveness of CTT in demonstrating the validity and reliability of measuring instruments has drawn several criticisms. One is that the reliability value in CTT depends on the sample or the characteristics of the test takers: if a measuring instrument is used in group A, the reliability results might differ in group B. Reliability is thus not a property of the instrument itself but of the scores obtained from a particular sample. In addition, criticism has been raised against CTT for using raw scores in its analysis. CTT is considered to give less accurate results because it treats ordinal raw scores with statistical and mathematical calculations that are properly carried out on interval or ratio data (Alagumalai & Curtis, 2005).
JP3I (Jurnal Pengukuran Psikologi dan Pendidikan Indonesia), 11(2), 2022. This is an open access article under CC-BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/)
Item response theory (IRT) is an approach in measurement theory whose analysis specifically explains the interaction between the person, or subject of measurement, and the items. One of the most popular IRT models is the Rasch model, also called the 1-parameter logistic (1PL) model. Other models include the 2PL and 3PL models. Bond & Fox (2015) noted that the Rasch model has an advantage over the other IRT models as it uses the measurement procedures of the physical sciences as its reference point. Responding to the limitation of CTT regarding the use of raw scores in mathematical calculations, the Rasch model transforms the data into logit units, placing them on a continuum (Sumintono & Widhiarso, 2014). Thus, ordinal data from measuring instruments whose original scale distances are not known can be converted into interval data.
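The logit transformation described above can be illustrated with a minimal sketch. It assumes the dichotomous form of the Rasch model for simplicity (the IWPQ itself uses polytomous items), and the function names are hypothetical, chosen only for this illustration.

```python
import math

def rasch_probability(theta: float, b: float) -> float:
    """Probability of endorsing a dichotomous item under the Rasch (1PL) model.

    theta: person measure in logits; b: item difficulty in logits.
    """
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def log_odds(p: float) -> float:
    """Convert a proportion (0 < p < 1) into a logit (log-odds) value."""
    return math.log(p / (1.0 - p))

# A person located exactly at the item's difficulty has a 50% chance of endorsing it.
print(rasch_probability(0.0, 0.0))

# Equal steps in raw proportions are not equal steps on the logit scale.
for p in (0.50, 0.70, 0.90):
    print(p, round(log_odds(p), 2))
```

The second loop shows why raw scores are treated as ordinal: the step from .70 to .90 covers a larger logit distance than the step from .50 to .70.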
Analysis using the Rasch model produces several measurement properties. The fit of items to the model, commonly expressed through the infit and outfit statistics, indicates how suitable the items in a measuring tool are and can reveal misconceptions. Results of the analysis, including misfitting items, can also be displayed in the form of a map, widely known as the Wright Map. Reliability values are divided into person and item reliability. Differential Item Functioning (DIF) is an indicator used to determine whether there is item bias across specific categories of research subjects, for example, between men and women or between specific age groups (Yu, 2020). Analysis with Rasch modeling through these properties can objectively evaluate the accuracy of an instrument in measuring specific attributes or variables.
The development of the original version of the IWPQ by Koopmans et al. (2013) was carried out in several stages. Individual performance indicators were obtained from the scientific literature, existing measurement tools, and interviews with experts, and were used as the basis for constructing the 47 items of the IWPQ 0.1. The scale was tested on 1,811 field workers, service workers, and office workers in the Netherlands. A factor analysis was then carried out, which yielded three dimensions of individual performance. Koopmans et al. (2013) then conducted an analysis using the Rasch model to identify the accuracy of items and individuals. The Rasch analysis resulted in three dimensions, with each dimension treated as a sub-scale in line with the multidimensional construct of individual work performance. Task performance consisted of 5 items, contextual performance of 8 items, and counterproductive work behavior of 5 items. The reliability results were in the range of .78 to .84. Widyastuti & Hidayat (2018) adapted the IWPQ to the Indonesian language by testing its content validity, calculating the discriminant index of each item, and estimating the reliability of the measuring instrument using Cronbach's alpha coefficient. The discriminant index on each individual performance dimension was in the range of .447 to .747. Cronbach's alpha coefficients showed good reliability, above .8. Based on the classical test theory approach used in that study, the Indonesian version of the IWPQ met the criteria for good psychometric properties. However, this finding was only applicable to the study's sample, as the validity and reliability could not be legitimately generalized to other study settings or samples.
A number of studies on individual performance in various cultural contexts have used the IWPQ as the measurement tool, either in its entirety or the sub-scales which are related to the research topics (Ceschi et al., 2017;Daraba et al., 2021;Metin et al., 2018;van der Lippe & Lippényi, 2020;Varshney & Varshney, 2020). Meanwhile, individual work performance research with Indonesian participants has also been carried out quite a lot, from its relation to personal aspects such as stress (Grasiaswaty, 2020), self-efficacy, and personality (Ramdani et al., 2021) to organizational aspects such as compensation and discipline (Prasetyo et al., 2021), and organizational culture (Srihadi et al., 2019). However, these studies did not use the Indonesian version of the IWPQ adapted by Widyastuti & Hidayat (2018) as a measurement tool.
Moreover, we found no other studies that validated the Indonesian version of the IWPQ using Rasch analysis. Therefore, it was necessary to have an individual work performance measurement tool that is rooted in the original construct and is also proven to be valid and reliable.
Based on this explanation, this study aimed to test the validity and reliability of the Indonesian version of the IWPQ using the Rasch model. If the IWPQ is proven to be valid and reliable, organizations can use it to accurately capture individual performance. Meanwhile, if the test results indicate that there is a need for improvement, then the measuring instrument can be developed to obtain a more suitable and consistent measuring instrument for workers in Indonesia. Validity and reliability tests will provide protection to the public or the scientific community from the use of measuring instruments that are less valid and reliable.

Participant
Sample size in Rasch modeling is governed by the principle of instrument calibration. When an instrument is calibrated on different samples of similar participants, slightly different results are expected. Therefore, if the sample size is too small, the calibration results will be unstable and less sensitive in describing the actual results. A large sample size reduces the differences between calibration results, but cost and time efficiency also need to be considered. Linacre (1994) suggested that, at a 99% confidence level, a sample size of 108 to 243 is sufficient to conduct Rasch analysis. A total of 213 Indonesian workers (145 female, 68 male) participated in this study. All participants were Indonesian citizens aged 18-46 years (mean = 30.64, SD = 8.55) who had been actively working for at least the last three months. They were categorized by tenure at their current job: 3 months to 1 year (42 participants), 1-3 years (55 participants), 3-5 years (30 participants), 5-10 years (36 participants), and more than 10 years (50 participants). A nonprobability sampling method was used. A sampling frame, which is a requirement of probability sampling, could not be established because the size of the population that met the criteria could not be precisely determined.

Instrument
The instrument used in this research was the Indonesian version of the IWPQ adapted by Widyastuti & Hidayat (2018). We obtained permission from these researchers to use the instrument. The instrument consisted of three sub-scales, namely task performance (5 items), contextual performance (8 items), and counterproductive work behavior (5 items), for a total of 18 items. Table 1 shows the blueprint of each sub-scale. Responses were given on a 5-point Likert-type rating scale with five answer choices, namely "jarang" (seldom), "kadang" (sometimes), "sering" (often), "sangat sering" (very often), and "selalu" (always). Respondents were asked to choose the response that best described their condition over at least the last three months. The instrument was distributed as a Google Form link. Consent to be involved in the research was obtained through the form before data collection was carried out. Data submitted by respondents were stored in Google Drive, which could only be accessed by the researchers. Advertisements for the research, along with the Google Form link, were distributed through social media, and the form was closed when the required number of samples had been met.

Data Analysis
Data analysis using the Rasch model in this study was carried out using WINSTEPS® version 5.1.0. The psychometric properties discussed in this study include instrument reliability, person and item reliability, unidimensionality, rating scale functioning, and bias detection (Differential Item Functioning). The Rasch analysis was carried out on each sub-scale separately, in line with the theory that individual work performance is a multidimensional construct. This was consistent with the research of Koopmans et al. (2013), which used Rasch analysis per sub-scale when the instrument was first developed.

Results and Discussion
Unidimensionality
The analysis in this study was performed separately for each dimension of individual work performance, following the pattern of the original study (Koopmans et al., 2013). The unidimensionality of each sub-scale was determined from the raw variance explained by measures, which is the criterion of the Rasch Principal Component Analysis of Residuals (PCAR). According to Holster & Lake (2016), a value of > 40% is sufficient evidence of unidimensionality. The eigenvalue of the first contrast should not be more than 2.0, since a smaller value indicates that the residuals are random noise rather than another dimension. The results of this study showed that the Task Performance sub-scale obtained 56.2% of raw variance explained by measures (first contrast = 1.6), while the second sub-scale, Contextual Performance, obtained 49.3% (first contrast = 1.9). The third sub-scale, Counterproductive Work Behavior, obtained 52.9% (first contrast = 1.6). It can be said that the assumption of unidimensionality was fulfilled for each sub-scale of the IWPQ and that further analysis could be done.
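As a minimal sketch, the two criteria stated above can be applied to the reported values as follows. The class and function names are hypothetical and are not part of any Rasch software; the numbers are those reported in the paragraph above.

```python
from typing import NamedTuple

class PcarResult(NamedTuple):
    subscale: str
    variance_explained: float  # raw variance explained by measures, in percent
    first_contrast: float      # eigenvalue of the first contrast

# Criteria as stated in the text: > 40% variance explained, first contrast <= 2.0.
MIN_VARIANCE_EXPLAINED = 40.0
MAX_FIRST_CONTRAST = 2.0

def is_unidimensional(result: PcarResult) -> bool:
    """Return True when both PCAR criteria from the text are met."""
    return (result.variance_explained > MIN_VARIANCE_EXPLAINED
            and result.first_contrast <= MAX_FIRST_CONTRAST)

reported = [
    PcarResult("Task Performance", 56.2, 1.6),
    PcarResult("Contextual Performance", 49.3, 1.9),
    PcarResult("Counterproductive Work Behavior", 52.9, 1.6),
]

for result in reported:
    print(result.subscale, is_unidimensional(result))
```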

Rating Scale Diagnostics
The IWPQ used a 5-point Likert-type rating scale consisting of "jarang", "kadang", "sering", "sangat sering", and "selalu". A rating scale diagnostic was used to evaluate how individuals used those choices and interpreted the distance between them. This provides more precise and interpretable measures of the construct because researchers can determine the actual distances that respondents apply when choosing among the existing options. The diagnostic results were similar for each IWPQ sub-scale: the response categories functioned as they should. This conclusion was drawn from Table 2, which shows that no category or option had a response frequency of zero in any sub-scale. All sub-scales also showed thresholds increasing from negative to positive across the five response categories (Linacre, 2012). This strongly indicates that respondents used the response categories as intended.
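The idea of ordered thresholds can be sketched under the Andrich rating scale model, which is assumed here for illustration; the text does not state which polytomous model was used. The threshold values below are hypothetical and are not taken from Table 2, and the function names are introduced only for this example.

```python
import math
from typing import List, Sequence

def category_probabilities(theta: float, b: float,
                           thresholds: Sequence[float]) -> List[float]:
    """Andrich rating scale model: probability of each category 0..m.

    thresholds holds the m Rasch-Andrich thresholds (tau_1..tau_m) in logits.
    """
    # Running sums of (theta - b - tau_k); category 0 has an empty sum of 0.
    exponents = [0.0]
    for tau in thresholds:
        exponents.append(exponents[-1] + (theta - b - tau))
    weights = [math.exp(e) for e in exponents]
    total = sum(weights)
    return [w / total for w in weights]

def thresholds_are_ordered(thresholds: Sequence[float]) -> bool:
    """True when every threshold is higher than the one before it."""
    return all(lower < upper for lower, upper in zip(thresholds, thresholds[1:]))

# Illustrative values only (NOT from Table 2): four thresholds for five categories.
hypothetical_thresholds = [-2.0, -0.5, 0.6, 1.9]
probs = category_probabilities(0.0, 0.0, hypothetical_thresholds)

print([round(p, 3) for p in probs])
print(thresholds_are_ordered(hypothetical_thresholds))
```

With ordered thresholds, each category is the most probable response somewhere along the logit continuum, which is what a well-functioning rating scale is expected to show.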

Reliability
The Rasch model estimates reliability for both persons and items. Person reliability refers to the instrument's capability to distinguish respondents on the measured variable. Person and item separation reliability (PSR and ISR) are interpreted in the same manner as Cronbach's α, with a minimum value of around .80 to be considered reliable. The person separation index (PSI) also reflects reliability, although it is based on the logit scale instead of raw scores. PSI should be above 3.0 to be considered high (Linacre, 2012). The results of this study can be seen in Table 3, where person reliability for the sub-scales ranged from .58 to .80. This indicated that sub-scale 3, Counterproductive Work Behavior, was only fair at distinguishing persons on the measured construct. It means that this sub-scale might not be sensitive enough to distinguish between high and low performers (Bond & Fox, 2015). However, item reliability ranged from .90 to .97, supported by Cronbach's α values that ranged from .82 to .86. The separation was considered high, with values ranging from 3.04 to 5.77. These findings indicated that the reliability of the IWPQ was high and that it had good psychometric characteristics, with a particular caveat regarding the PSR value.
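Separation and reliability are two expressions of the same information: for a separation index G, the reliability is G² / (1 + G²). The sketch below applies this standard relationship to the values reported above; the function names are hypothetical. Under this relationship, the reported separations of 3.04 and 5.77 correspond to reliabilities of about .90 and .97, which suggests they are the item separations, while person reliabilities of .58 and .80 would correspond to separations of roughly 1.18 and 2.00. This reading is an inference from the formula, not a statement made in Table 3.

```python
import math

def reliability_from_separation(g: float) -> float:
    """Reliability implied by a separation index G: G^2 / (1 + G^2)."""
    return g * g / (1.0 + g * g)

def separation_from_reliability(r: float) -> float:
    """Separation index implied by a reliability R (0 <= R < 1): sqrt(R / (1 - R))."""
    return math.sqrt(r / (1.0 - r))

# Separation values reported in the text.
for g in (3.04, 5.77):
    print(g, round(reliability_from_separation(g), 2))

# Person reliability values reported in the text.
for r in (0.58, 0.80):
    print(r, round(separation_from_reliability(r), 2))
```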

Item Fit
To determine how well each item measures the construct, the Rasch model tests the item infit, outfit, and point-measure correlation. Table 4 shows the results, sorted from the most difficult items to the easiest. The infit and outfit MNSQ should range between .5 and 1.5 for an item to be considered productive for measurement. Meanwhile, the point-measure correlation should range between .4 and .85 (Fisher, 2007). This study found that the only misfitting item in all three sub-scales of the IWPQ was item CP6 in sub-scale 2, "Saya bernisiatif memulai tugas baru setelah tugas sebelumnya selesai" ("I take the initiative to start new tasks after the previous ones are finished"). This item was considered not fit to measure contextual performance (infit MNSQ = 1.66, outfit MNSQ = 1.69). The point-measure correlations in all three sub-scales were positive and met the criteria. These findings suggest that all items in this instrument function well to measure individual work performance except for item CP6. This item should be revised to achieve a more accurate measurement of the construct.
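The infit and outfit mean squares can be sketched from their usual definitions: outfit is the unweighted mean of squared standardized residuals, and infit is the sum of squared residuals divided by the sum of the model variances. The code below is an illustration only, not the computation performed by the software used in this study; the function names and the example data are hypothetical.

```python
from typing import Sequence, Tuple

def fit_mean_squares(observed: Sequence[float],
                     expected: Sequence[float],
                     variance: Sequence[float]) -> Tuple[float, float]:
    """Return (infit MNSQ, outfit MNSQ) for one item.

    observed: responses of each person to the item
    expected: model-expected response for each person
    variance: model variance of each response (must be > 0)
    """
    squared_residuals = [(x - e) ** 2 for x, e in zip(observed, expected)]
    # Outfit: plain average of squared standardized residuals (outlier-sensitive).
    outfit = sum(r / w for r, w in zip(squared_residuals, variance)) / len(squared_residuals)
    # Infit: information-weighted, so well-targeted responses carry more weight.
    infit = sum(squared_residuals) / sum(variance)
    return infit, outfit

def within_fit_range(mnsq: float, low: float = 0.5, high: float = 1.5) -> bool:
    """Apply the .5 to 1.5 MNSQ range stated in the text."""
    return low <= mnsq <= high

# Hypothetical data in which residuals match the model variance exactly.
infit, outfit = fit_mean_squares([3, 1], [2, 2], [1, 1])
print(infit, outfit)

# The values reported for item CP6 fall outside the range.
print(within_fit_range(1.66), within_fit_range(1.69))
```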

Wright Map
The validity of the construct could be determined by the hierarchy of items that can be observed in a Wright Map. This map showed the difficulty of the item on the right panel and the ability of the person on the left panel. On this map, the easier item is located at the bottom, the item with average difficulty is in the middle (mean, denoted by M on the right side), and the item with greater difficulty is at the top (Yu, 2020). The Wright map of each IWPQ's sub-scale can be seen in Figure 2.
Regarding difficulty level, all items in sub-scale 1, Task Performance, were relatively easy to moderate, as shown by only a few persons being located below the item mean (M). The mean person measure was 1.87 logits (SD = 2.12), much higher than the mean item measure of .00. In sub-scale 2, Contextual Performance, the most difficult item was CP12, "Saya terus mencari tantangan baru dalam pekerjaan saya" ("I keep looking for new challenges in my job"). The mean person measure was .62 logits (SD = 1.25), also higher than the mean item measure of .00. Lastly, it should be noted that sub-scale 3, Counterproductive Work Behavior, consists of 5 items with a negative connotation. This is aligned with the construct, which defines counterproductive work behavior as behavior that harms the well-being of an organization. The most difficult item in sub-scale 3 was item CWB15, "Saya cenderung membesar-besarkan masalah di tempat kerja saya" ("I tend to exaggerate problems at my workplace"). The person distribution in Figure 2 shows that this sub-scale was relatively hard for the participants, as many persons did not choose the extreme responses. The mean person measure was -3.25 logits (SD = 1.89), well below the mean item measure of .00.

Differential Item Functioning
Differential Item Functioning (DIF) analysis was used to examine whether subgroups within the sample (divided by gender and tenure) responded differently to the items despite having equal levels of the underlying characteristic being measured. In this study, gender was divided into two sub-groups, male (L) and female (P). Tenure was divided into five sub-groups: group A (3 months to 1 year), B (1-3 years), C (3-5 years), D (5-10 years), and E (> 10 years). The method used for evaluating DIF in this research was the item-trait chi-square (Linacre, 2007). Significant bias is detected if the probability value of the item is less than .05.
In line with the item fit findings, Table 5 shows that item CP6 was considered biased toward gender (p = .0128). This means that men and women may interpret item CP6, "Saya bernisiatif memulai tugas baru setelah tugas sebelumnya selesai", differently. As seen in Figure 3, women tended to choose a higher rating than men on this item. Men were often described as achievement-oriented, whereas women were often seen as benevolent. The finding of this study was congruent with a phenomenon known as "the stereotype backlash effect", which occurs when individuals' behavior deviates from prescriptive stereotypes (Bohlmann & Zacher, 2021). Therefore, it was necessary to be careful in using item CP6, as there was a tendency for women to engage in higher proactive behavior at work. On a more positive …
Table 5 also shows that bias toward tenure grouping was found in item CWB18, "Saya membicarakan hal-hal negatif dalam pekerjaan dengan orang-orang di luar tempat kerja saya" ("I talk about negative aspects of my work with people outside my workplace"), with p = .0489. As seen in Figure 3, individuals who had only recently started their job (group A, 3 months to 1 year) had the lowest tendency to talk negatively about their employer. Meanwhile, individuals who had been working for 1-3 years (group B) had a higher tendency to bad-mouth their employer, closely followed by the group with the longest tenure (> 10 years). The latter was consistent with the study conducted by Ng & Feldman (2010), which concluded that organizational tenure was positively related to some counterproductive behaviors. The DIF for group B on this item might be related to "the hangover period" of employees, that is, the decline in job satisfaction after approximately 1 year of employment. This was based on the assumption that after a "honeymoon period" at the start of their employment, individuals might find some aspects of their job or organization that they perceive as unsatisfactory. Workers who cannot bear the dissatisfaction leave their organization, while those who have come to terms with it or found ways to cope with it stay (Dobrow & Ganzach, 2014). This explained the decreasing tendency to bad-mouth in the next longer tenure groups (groups C and D).

Conclusion
A few conclusions could be derived from this study. The test of the unidimensionality assumption showed that each sub-scale or dimension of individual work performance was unidimensional. The five response categories in each sub-scale were in order, with thresholds increasing from negative to positive, which indicated that the response categories functioned as they should. The only misfitting item in all three sub-scales of the IWPQ was item CP6 in the Contextual Performance sub-scale (infit MNSQ = 1.66, outfit MNSQ = 1.69). However, the point-measure correlations in all three sub-scales were positive and met the criteria. These findings suggest that all items in this instrument functioned well to measure the construct of individual work performance except for item CP6. This item could be revised to achieve a more accurate measurement of the construct. Regarding difficulty level, all items in sub-scale 1, Task Performance, were relatively easy to moderate, while in sub-scale 2, Contextual Performance, the most difficult item was CP12. The person distribution of sub-scale 3, Counterproductive Work Behavior, showed that this sub-scale was relatively hard for the participants, as many persons did not choose the extreme responses, but it should be noted that the items of this sub-scale had negative connotations. This aligned with the construct, which defined counterproductive work behavior as behavior that harms the well-being of an organization. The most difficult item in sub-scale 3 was item CWB15. The next interesting finding in this study concerned differential item functioning, which showed that one item was considered biased toward gender. Women tended to choose a higher rating than men on item CP6. In line with the earlier finding that this item misfit, it might need revision to increase its accuracy in measuring contextual performance.
However, it was interesting to see that this finding was congruent with a phenomenon known as "the stereotype backlash effect", which occurs when individuals' behavior deviates from a prescriptive stereotype. There might be a tendency for women to engage in higher proactive behavior at work, which further suggests that women's empowerment might be related to contextual performance. Other than that, this study also detected bias based on tenure on item CWB18. This finding implied that individuals who had just started their job had the lowest tendency to talk about negative things about their employer. Meanwhile, employees who underwent "the hangover period" or the decline of job satisfaction after approximately one year of employment had the highest tendency to bad-mouth their employer. This finding strengthens the assumption that after a "honeymoon period" when individuals start their employment, they might find some aspects of their job or organization they perceived as unsatisfactory, which leads to counterproductive work behavior in the form of bad-mouthing or negative talk.
The limitation of this study was the data collection process, which might not accurately represent the whole population of employees in Indonesia because of the non-probability sampling technique. However, Rasch analysis does not depend solely on the sample involved, which allows the evaluation of measurement properties to be generalized for all three sub-scales of the individual work performance construct. Even so, further research is needed to explore the validity and reliability of this instrument in more specific populations or demographic groups, such as by industrial sector and age. Such research would add a more comprehensive understanding of individual work performance measurement in different contexts. Overall, the Indonesian version of the Individual Work Performance Questionnaire can be applied in future research and in practical applications in organizations, with the considerations explained above.