Ensuring Parameter Estimation Accuracy in 3PL IRT Modeling: The Role of Test Length and Sample Size

The objective of this simulation study was to evaluate the accuracy of item parameter estimation under the 3PL IRT model, focusing primarily on sample size and test length (number of test items). The investigation used six datasets produced by WinGen, each comprising 5000 responses, with test lengths ranging from 10 to 40 items. For each dataset, the study conducted simulations and re-analyzed the data 15 times, generating a total of 2025 data subsets and 225 parameter estimates for each item. The results revealed that smaller sample sizes led to more pronounced biases, supporting a recommended minimum sample size of 3000 for precise parameter estimation. Additionally, the study found that a limited number of items (a short test) yielded biased estimates and recommended a test length of at least 25 items, with 40 preferred, for accurate estimation under the 3PL IRT model. These findings offer test developers valuable guidance for making informed decisions about sample size and test length, ultimately ensuring reliable and accurate parameter estimates.


Introduction
One of the crucial assumptions in item response theory (IRT) is the requirement of parameter invariance. In IRT, parameters refer both to those associated with the test item and to those related to the test taker (Paek et al., 2021). "Invariant" signifies that the characteristics of the test item parameters remain unaffected by the test taker's ability, and vice versa: the ability of the test taker is independent of the parameters of the test item (Baker, 1985; Retnawati, 2014; Stenbeck et al., 1992). This invariance of parameters is a fundamental aspect that distinguishes IRT from classical test theory (CTT). Ensuring compliance with this assumption is essential before further analyzing the test items.
The fulfilment of assumptions in IRT models depends heavily on the quality of the test items. Evaluating the quality of these items can be approached from two perspectives: content-wise and statistically (Paek et al., 2021). For test developers, a solid understanding of item development is crucial in crafting test items with high content quality. This ensures that the items effectively assess the intended constructs or latent attributes.
On the other hand, the parameters of the test items obtained from IRT analysis, such as difficulty (b), discrimination (a), and pseudo-guessing (c), can be statistically evaluated for quality if the underlying model has successfully met the assumptions. The IRT model's assumptions, including parameter invariance and other statistical requirements, play a significant role in determining the quality and validity of the parameter estimates. Overall, a comprehensive assessment of test item quality involves content evaluation during item development and rigorous statistical evaluation using appropriate IRT models. By combining these perspectives, test developers can create more reliable and valid measuring instruments, enhancing the accuracy and effectiveness of the assessment process.

Research Results Related to the Accuracy of Item Parameters and Sample Size
Achieving parameter invariance in IRT poses a significant challenge, especially when dealing with limited sample sizes in pilot studies. The issue of constrained sample size is a common hurdle during the calibration of test items using the IRT approach. Barnes and Wise (1991) conducted a study exploring the accuracy of item parameters with smaller sample sizes. Their findings indicated that the stability and accuracy of item parameter estimation (e.g., difficulty level and ability) could be enhanced by modifying the 1-parameter IRT model, setting a specific value for the pseudo-guessing (c) parameter. Similarly, other studies, such as those by Wainer and Wright (1980) and Divgi (1984), have demonstrated parameter accuracy using a modified 1-parameter IRT approach, often referred to as the Rasch model. These studies underscore that the effect of sample size on item parameter accuracy is a longstanding concern. However, in the absence of a definitive guide on sample size requirements for calibrating item parameters, ongoing research on sample size in IRT remains a critical consideration for instrument developers. By continuously exploring this aspect, researchers can make informed choices to enhance the reliability and validity of their measurement instruments.
Recent studies, such as that by Paek et al. (2021), suggest a sample size of at least 500 for calibration using the Rasch model; for longer tests (test length = 40), they recommend a sample size of at least 750. In their research, they conducted simulations employing both the Rasch and 2PL IRT models, varying sample sizes across 13 scenarios, test lengths from 9 to 40, and estimation methods (joint maximum likelihood (JML), marginal maximum likelihood (MML), and conditional maximum likelihood (CML)). The results demonstrated that the 2PL IRT model was more sensitive to changes in sample size than the Rasch model. However, it is worth noting that Paek et al. (2021) did not include the 3PL model in their simulations.
Another study, by Feuerstahler (2022), recommended a sample size of at least 5000 to achieve the best calibration results with the 3PL model. This suggests that meeting the 3PL model's parameter requirements necessitates a large sample size, which can be challenging in practical contexts. On another note, Yen (1981) highlighted the importance of using the 3PL IRT model for estimating ability, especially in the context of multiple-choice tests.

Research Objective
Studies on sample size in IRT have shown the need for further exploration in determining sample sizes for calibrating IRT test items, which affect parameter estimation and the assumption of item parameter invariance. Unlike previous studies, which generated simulation data separately and employed various IRT models for parameter estimation, our study utilizes six datasets already fitting the 3PL IRT model, each with a sample size of 5000. The primary goal of this study is to evaluate the accuracy of item parameter estimation when employing the 3PL IRT model. We address three key research questions: (1) Do sample size and test length affect item parameter estimation accuracy in the IRT 3-PL model? (2) What test length is necessary to uphold item parameter accuracy in the IRT 3-PL model? (3) What sample size is required to preserve item parameter accuracy in the IRT 3-PL model?
For each dataset, we performed resampling by randomly selecting response subsets. We then iteratively estimated item parameters using the IRT 3-PL model on all of these subsets, recording the resulting item parameters and comparing them with the initial ones. The stages of the simulation study are illustrated in Figure 1. We employed three software tools. WinGen (Han, 2007) was used to generate simulation data with consistent parameters for each dataset; the only variation among the six initial datasets was in test length, ranging from 10 to 40 items. RStudio was then used for random resampling from the initial data of 5000 samples, item parameter estimation, quantification of item parameter bias, and ANOVA tests assessing the impact of test length and sample size on item parameter deviations from their original values. Finally, MS Excel was used to store the analysis results, including the ANOVA summary and pairwise comparisons.

Data Generation
The simulation commenced by creating six datasets with specific characteristics using WinGen (Han, 2007). Each dataset shared common attributes: (1) number of examinees = 5000, (2) distribution = normal, (3) mean = 0, (4) standard deviation = 1, (5) number of response categories = 2, and (6) model = 3PLM (3-parameter logistic model). The three item parameters (par.a, par.b, and par.c) were assigned particular distributions and values: (1) par.a: normal distribution with a mean of 0.85 and a standard deviation of 0.1; (2) par.b: normal distribution with a mean of 0.65 and a standard deviation of 0.6; and (3) par.c: uniform distribution with a minimum value of 0.001 and a standard deviation of 0.05. Test length varied across the six datasets: 10, 15, 20, 25, 30, and 40 items, respectively. The generated data consisted of responses in a dichotomous format (1/0), representing correct/incorrect answers.
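The authors generated their data with WinGen; purely as an illustration of the same generating process, the following Python sketch (not the authors' code) draws abilities and item parameters from the distributions listed above and simulates dichotomous responses under the 3PL item characteristic curve, P(correct) = c + (1 − c)/(1 + exp(−a(θ − b))). Since a uniform distribution is defined by its bounds, the sketch treats 0.05 as the upper bound of par.c, which is an assumption.

```python
import numpy as np

def generate_3pl_data(n_examinees=5000, n_items=10, seed=42):
    """Simulate dichotomous (1/0) responses under the 3PL model,
    loosely mirroring the WinGen settings described above."""
    rng = np.random.default_rng(seed)
    theta = rng.normal(0.0, 1.0, n_examinees)   # ability ~ N(0, 1)
    a = rng.normal(0.85, 0.10, n_items)         # discrimination (par.a)
    b = rng.normal(0.65, 0.60, n_items)         # difficulty (par.b)
    c = rng.uniform(0.001, 0.05, n_items)       # pseudo-guessing (par.c; upper bound assumed)
    # 3PL ICC: P(correct) = c + (1 - c) / (1 + exp(-a * (theta - b)))
    z = a[None, :] * (theta[:, None] - b[None, :])
    p = c[None, :] + (1.0 - c[None, :]) / (1.0 + np.exp(-z))
    responses = (rng.random(p.shape) < p).astype(int)  # n_examinees x n_items matrix of 1/0
    return responses, {"a": a, "b": b, "c": c}
```

Calling this with `n_items` set to 10, 15, 20, 25, 30, and 40 would produce rough analogues of the six initial datasets.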

Resampling the Initial Data
The initial datasets each included 5000 samples. Each of the six initial datasets underwent resampling across 15 scenarios, resulting in 90 new data subsets. The subset sample sizes were systematically decreased from the full 5000: 4500, 4000, 3500, 3000, 2500, 2000, 1500, 1000, 800, 600, 500, 400, 300, 250, and 200. This reduction in sample sizes allowed for a comprehensive exploration of the data under various conditions during the resampling process.
The resampling was conducted randomly using the `sample_n(data, size)` function from the 'dplyr' package in RStudio (Wickham et al., 2022). Each subset (resampling scenario) was replicated 15 times, resulting in a total of 2025 data subsets, and for each item the parameters were estimated 225 times (15 × 15).
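The authors performed this step with dplyr's `sample_n` in R; the following Python sketch (an illustrative analogue, not the authors' code) reproduces the bookkeeping for one initial dataset: 15 subset sizes, each drawn 15 times without replacement, yielding the 225 parameter-estimation runs per item described above.

```python
import numpy as np

SUBSET_SIZES = [4500, 4000, 3500, 3000, 2500, 2000, 1500,
                1000, 800, 600, 500, 400, 300, 250, 200]
N_REPLICATIONS = 15

def resample_subsets(data, seed=7):
    """Draw the 15 x 15 random subsets for one initial dataset
    (an analogue of dplyr::sample_n applied repeatedly)."""
    rng = np.random.default_rng(seed)
    subsets = {}
    for n in SUBSET_SIZES:
        for rep in range(N_REPLICATIONS):
            rows = rng.choice(len(data), size=n, replace=False)  # sample rows without replacement
            subsets[(n, rep)] = data[rows]
    return subsets
```

Each of the 225 subsets would then be passed to the item-parameter estimation step described below.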

Estimating Item Parameters
The initial data is used for estimating item parameters (b, a, and c) under the IRT 3-PL model. These parameters are saved as reference parameters (b0, a0, and c0) and serve as the basis for evaluating item parameters obtained from the resampled data subsets. The estimation of item parameters in both the initial data and the data subsets is conducted using the 'mirt' package in RStudio (Fernández-Ballesteros, 2012).
For each of the six test lengths (10, 15, 20, 25, 30, and 40 items), the item parameters (b, a, and c) are estimated in the initial data, producing a set of reference estimates labelled accordingly (e.g., 10.b0, 10.a0, and 10.c0 for a test length of 10 items). In each test length scenario, item parameters are then estimated across 15 resampled data subsets, with each subset replicated 15 times, leading to 225 estimates of each item parameter (b, a, and c) at a given test length.
In summary, this process allows for a comprehensive analysis of the accuracy and precision of item parameters across various test lengths and sample sizes, ensuring a thorough evaluation of the IRT model's performance.

Evaluation of Item Parameter Estimation Accuracy
The evaluation of item parameter estimation accuracy in this study aligns with the approach introduced by Paek et al. (2021), who employed the Root Mean Squared Difference (RMSD) and Mean Absolute Difference (MAD) to gauge the precision and bias of IRT 2-PL item parameters derived from re-sampled data. Notably, their simulation results revealed no significant differences between the resulting RMSD and MAD values, implying that, given their comparable performance, either metric can be chosen to evaluate the accuracy of item parameter estimation in IRT. This method is akin to that used by Wells et al. (2002), who utilized RMSD to assess the accuracy of ability estimation.
In this study, RMSD is the method of choice for evaluating the accuracy of item parameters in the simulation, ensuring that consistent evaluation criteria are applied throughout the analysis. The RMSD of the parameter estimates for each item is calculated using the following equation.
$$\mathrm{RMSD}(i) = \sqrt{\frac{1}{R}\sum_{r=1}^{R}\left(\hat{p}_{ir} - p_{i0}\right)^{2}}$$

where $i$ represents the item number, $r$ represents the replication (1, 2, 3, …, R), $\hat{p}_{ir}$ is the estimated item parameter for item $i$ in replication $r$, and $p_{i0}$ is the initial (reference) parameter for item $i$. RMSD(i) values are computed for each item on the 15 subsets of the initial data across all six test lengths, generating 90 RMSD(i) values, one for each subset. The entire calculation procedure is carried out within RStudio.
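As a concrete reading of this formula, a minimal Python sketch (illustrative only; the study's calculations were done in RStudio):

```python
import numpy as np

def rmsd(estimates, reference):
    """RMSD(i) = sqrt((1/R) * sum_r (p_hat_ir - p_i0)^2): the root mean
    squared difference between the R replicated estimates of one item
    parameter and its initial (reference) value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.sqrt(np.mean((estimates - reference) ** 2)))
```

For example, `rmsd([0.9, 1.1], 1.0)` is approximately 0.1, since both replications deviate from the reference by 0.1; identical replications equal to the reference give an RMSD of exactly 0.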
RMSD serves as a dual-purpose metric, indicating both the accuracy of parameter estimation and the extent of parameter bias introduced by re-sampling the initial parameters.A higher RMSD value signifies a significant deviation of parameter magnitude from the initial values, indicating lower accuracy in the estimation results.Conversely, a lower RMSD value suggests minimal bias and higher accuracy in parameter estimation.
To evaluate the effect of sample size and test length on item parameter accuracy, we conducted a one-way analysis of variance (ANOVA). This analysis scrutinizes each factor individually, considering the sample size of the re-sampled data (subset) and the test length (number of items). The ANOVA uses the 'aov' function from the 'stats' package in RStudio.
If the ANOVA reveals significant differences in RMSD due to the test length and/or sample size factors, a post hoc analysis, specifically Tukey's Honestly Significant Difference (HSD) test, is employed. Tukey's HSD is a commonly utilized method for comparing all pairwise means once significant differences have been detected in ANOVA results. Through Tukey's HSD, we can pinpoint which sample sizes or test lengths exhibit significant deviations in the RMSD of item parameters relative to the initial data, providing in-depth insight into the specific distinctions among these groups.
The evaluation of parameter estimation accuracy depends on the significance of the ANOVA results, which assess the variance in RMSD. The null hypothesis under examination posits that the mean RMSD of item parameters is equal across all sample sizes. A significant result (p-value < 0.05) indicates a noteworthy difference in the RMSD of item parameters for a specific sample size compared to the initial data.
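The authors ran this test with R's `aov`; an equivalent one-way test can be sketched in Python with `scipy.stats.f_oneway` (an illustrative analogue, not the authors' code). The RMSD values below are hypothetical, chosen only to show larger subsets producing smaller RMSD:

```python
from scipy.stats import f_oneway

# Hypothetical RMSD values for one item parameter, grouped by subset sample size.
rmsd_n3000 = [0.05, 0.06, 0.04, 0.05, 0.06]
rmsd_n1000 = [0.12, 0.10, 0.11, 0.13, 0.12]
rmsd_n200 = [0.30, 0.34, 0.28, 0.33, 0.31]

# One-way ANOVA: the null hypothesis is that mean RMSD is equal across groups.
f_stat, p_value = f_oneway(rmsd_n3000, rmsd_n1000, rmsd_n200)
if p_value < 0.05:
    # A post hoc test such as Tukey's HSD would follow to locate the differing groups.
    print("Sample size has a significant effect on RMSD")
```

A significant p-value here would trigger the pairwise post hoc comparisons described above.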

Results
The initial analysis investigated whether sample size and test length affect item parameter estimation accuracy in the IRT 3-PL model. The data analysis in Table 1 shows that sample size significantly affects the RMSD of all item parameters: item difficulty (b), item discrimination (a), and pseudo-guessing (c). Test length likewise significantly affects the RMSD of all item parameters. Further analysis, displayed in Figure 2, Figure 3, and Figure 4, demonstrates an increasing trend in RMSD values, indicating that both decreasing sample size and decreasing test length reduce the accuracy of the estimates of parameters b, a, and c. These findings demonstrate that both the sample size used for test calibration and the test length are crucial factors influencing the accuracy of item parameter estimation in the 3-PL IRT model. Given the significant differences in RMSD across sample sizes and test lengths, we conducted pairwise comparisons (Table 4 and Table 5) to identify distinct groups and to evaluate the minimum sample size required to maintain item parameter accuracy; pairwise comparison was likewise used to relate item parameter accuracy to test length. This pairwise analysis simultaneously addresses the second and third research questions: "What test length is necessary to uphold item parameter accuracy in the IRT 3-PL model?" and "What is the required sample size to preserve item parameter accuracy in the IRT 3-PL model?"
Table 2 displays pairwise comparisons of RMSD between data groups categorized by test length. A significant p-value indicates significant variation in RMSD among these groups. Specifically, the comparison between test lengths 10 and 30 reveals that the RMSD of parameter b is not significantly different (p-value > 0.05); however, test length 10 differs significantly from the other four test lengths. Moreover, pairwise comparisons among all combinations of test lengths 15, 20, 25, and 40 yield p-values > 0.05, indicating that the RMSD of parameter b does not differ significantly across these four test lengths. Pairwise analysis of item discrimination (a) reveals significant differences in RMSD between test length 10 and test lengths 20, 25, and 40, while comparisons among all combinations of test lengths 15, 20, 25, 30, and 40 yield p-values > 0.05, indicating no significant differences across these five test lengths. Pairwise analysis of pseudo-guessing (c) yields results similar to those for the other two parameters: the RMSD for test length 10 differs significantly from the other five test lengths.
Based on these results, the most pronounced RMSD differences were observed in data with a test length of 10. These differences also tend to produce estimation inaccuracies, as the RMSD values for data with a test length of 10 increase significantly with decreasing sample size (see Figure 2, Figure 3, and Figure 4). These findings imply that a minimum of 15 test items is required to obtain unbiased estimates of item parameters.

Discussion
The simulation results highlight the significant influence of test length on the accuracy of item parameters in the IRT 3-PL model, a critical consideration given its impact on participants' psychological states (Ackerman & Kanfer, 2009). Consequently, careful selection of test length is essential for ensuring precise item parameter estimation in pilot studies.
Notably, the simulation results reveal that a test length of 40 provides the most accurate item parameters, whereas shorter tests result in unstable parameter estimation. However, these findings do not mandate lengthy tests for measuring latent attributes; rather, they offer insight into the ideal number of items for pilot-study calibration. Previous research (Ackerman & Kanfer, 2009; Şahin & Anıl, 2017; Wells et al., 2002) has also explored the impact of test length on ability estimation, with some studies finding no significant effect.

Sample size emerges as the most influential factor affecting item parameter stability in the IRT 3-PL model. Parameters b, a, and c remain stable when the sample size surpasses 3000, based on estimations from 225 replications for each item. Consequently, large sample sizes are imperative for accurately calibrating item parameters in the IRT 3-PL model; small sample sizes in pilot studies are likely to yield item parameter estimates with substantial bias.
It is important to note that this study did not explore specific attributes of the sample data, such as distribution or other characteristics that may affect parameter estimation accuracy; the study employed random sampling on subsets of the data. Further research is therefore needed to investigate how different data attributes affect the stability of item parameters in the IRT 3-PL model.

The findings of this study align closely with those of Paek et al. (2021), who examined item parameters in IRT 2-PL and Rasch models. Paek et al. (2021) also observed that the difficulty parameter (b) was sensitive to sample size in IRT models with an item discrimination parameter (a). This sensitivity was similarly observed in our study, with the difficulty parameter exhibiting a significant difference at the sixth sample size, while the item discrimination parameter (a) showed significant differences at the 13th sample size.
Although the RMSD characteristics in this simulation are similar to those of Şahin and Anıl (2017), the conclusions about the accuracy of 3-PL item parameter estimation differ between the two studies. In our study, the RMSD of item parameters can exceed 0.33 after the 12th subset, corresponding to a sample size of 350, whereas Şahin and Anıl (2017) concluded that a minimum sample size of 350 and a test length of 30 yielded precise parameter estimation with RMSD < 0.33.
Our study applied different criteria, employing ANOVA to assess whether the RMSD obtained from large-sample subsets could still be maintained. Consequently, the minimum sample sizes identified in our study for precise parameter estimation were larger than those reported by Şahin and Anıl (2017).
Based on the stability analysis of item parameters (b, a, and c) in this simulation, two crucial conclusions can be drawn for obtaining unbiased item parameters. First, the minimum sample size required for pilot study/calibration with 3-PL IRT models should be at least 3000; larger sample sizes enhance accuracy and stability in parameter estimation, reducing bias. Second, for pilot study/calibration, it is advisable to use a test length of at least ten items, with test lengths of 25 or 40 being preferred.
The results underscore the importance of meeting the assumption of item parameter invariance (Retnawati, 2014; Stenbeck et al., 1992), which necessitates conducting pilot studies with large sample sizes. Small sample sizes can lead to imprecise item parameter estimates, affecting the accuracy with which latent attributes are measured. Therefore, substantial sample sizes are crucial for reliable and accurate item parameter estimation, ensuring precise assessment of latent attributes.
Furthermore, the quality of the items and measuring instruments is intricately linked to the skills and expertise of those involved in item development. While this study offers a quantitative approach to assessing item attributes through statistical testing, it is vital to acknowledge that the procedure for evaluating item content quality plays a central role in determining overall instrument quality. Thus, adhering to good item development guidelines is highly recommended to produce high-quality instrument items.
The findings from this simulation study provide valuable guidance for planning large-scale pilot studies for calibration purposes. By applying the insights gained from this research, practitioners and researchers can enhance the effectiveness and accuracy of the measurement process, ultimately leading to more robust and reliable results in the field of psychometrics.

Conclusion
The study's outcomes yield three main conclusions regarding sample size, test length, and sensitivity to sample size. First, both sample size and test length significantly affect the accuracy of item parameters in the IRT 3-PL model, with larger sample sizes and longer tests leading to more stable parameter estimates. Second, a limited number of items in a short test results in unstable and biased parameter estimates; to enhance accuracy and reliability, it is advisable to use a test length of 25 or 40 items during pilot studies. Third, estimating item parameters using the IRT 3-PL model is highly sensitive to sample size, with smaller sample sizes introducing greater bias in parameter estimates; achieving precise calibration necessitates a minimum sample size of 3000 for estimating parameters (b, a, and c) in the 3-PL IRT model. By incorporating these conclusions, test developers can improve the quality and accuracy of their measurement instruments. Diligence in considering sample size, test length, and calibration procedures enhances the reliability of psychometric evaluations, offering valuable insights into participants' latent attributes.
JP3I (Jurnal Pengukuran Psikologi dan Pendidikan Indonesia), 12(2), 2023. http://journal.uinjkt.ac.id/index.php/jp3i. This is an open access article under a CC-BY-SA license (https://creativecommons.org/licenses/by-sa/4.0/).

Limitations and Suggestions for Further Research
The simulations in this study involved 225 replications for each item's parameter estimation, spanning six test-length scenarios and 15 different sample sizes. The results from these replications offer insight into the sample sizes required for precise item parameter estimation. However, it is important to acknowledge that the simulation data used in this study followed the same distribution characteristics, specifically the normal distribution. This limitation should be recognized, and further studies using simulated data with diverse characteristics are needed for a more comprehensive and robust understanding.
Future research should focus on determining minimum sample sizes while considering participant characteristics in pilot studies; understanding how participant characteristics affect parameter estimation is crucial for improving the accuracy of measuring instruments. Regarding the sensitivity of parameter estimation to test length, this study did not investigate the instability observed at a test length of 30. Exploring the reasons behind this instability through further empirical studies is a promising avenue for future research.
In summary, this study provides valuable insights into the effects of sample size and test length on item parameter estimation while also pointing out areas for future research. Addressing these limitations and conducting additional studies will contribute to a better understanding and more accurate item parameter estimation in the context of the IRT 3-PL model.