## Introduction

Monte Carlo (MC) simulations, in which the true person parameters (*θ*) and item parameters are known, make studies of test construction, item analysis, data-model fit evaluation, differential item functioning, test performance and score distribution, and others not only possible but effective. The MC simulation method is especially important in the CAT arena because it is often the only practical way to study and evaluate CAT programs and their implementation. As noted above, other analytical methods are often not applicable or feasible with CAT programs.

## Basics of SimulCAT

SimulCAT simulates three key components of CAT administration: (1) *θ* estimation, (2) CAT termination policy, and (3) item selection. It supports various CAT administration options to create CAT testing environments that are as realistic as possible. The interim and final score estimates can be calculated using maximum likelihood estimation, the Bayesian maximum a posteriori or expected a posteriori estimations, or maximum likelihood estimation with fences [3]. SimulCAT users can also set the initial score value, the range of score estimates, and restrictions on how much the estimate can change. The length of the CAT administration can be either fixed or variable. For variable-length testing, SimulCAT supports multiple termination rules, including the standard error of estimation (SEE) and score estimate consistency. Within SimulCAT, users can choose the number of test takers to be administered simultaneously in each test time slot and set the frequency of communication between a test server and client computers (i.e., terminals).
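As an illustration of one scoring option, the expected a posteriori (EAP) estimate can be sketched with simple numerical quadrature. This is a generic textbook formulation under the 2PL model with a standard-normal prior, not SimulCAT's internal code; the function names and grid settings are assumptions.

```python
import numpy as np

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_estimate(responses, a, b, grid=np.linspace(-4, 4, 81)):
    """EAP theta estimate with a standard-normal prior on a quadrature grid.

    responses: 0/1 scored item responses; a, b: item parameters.
    """
    prior = np.exp(-0.5 * grid**2)               # unnormalized N(0, 1) density
    like = np.ones_like(grid)
    for u, aj, bj in zip(responses, a, b):
        p = p_2pl(grid, aj, bj)
        like *= p if u == 1 else 1.0 - p         # likelihood of the pattern
    post = prior * like
    post /= post.sum()
    theta_hat = float(np.sum(grid * post))       # posterior mean = EAP estimate
    psd = float(np.sqrt(np.sum((grid - theta_hat)**2 * post)))  # posterior SD
    return theta_hat, psd

# Three correct answers on items at b = 0 pull the estimate above zero,
# while the prior shrinks it back toward the prior mean.
theta_hat, psd = eap_estimate([1, 1, 1], [1.5, 1.5, 1.5], [0.0, 0.0, 0.0])
```

The posterior standard deviation returned here plays the role of the SEE for Bayesian scoring.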

SimulCAT supports seven item selection criteria: (1) maximum Fisher information (MFI), (2) *a*-stratification [5,6], (3) global information [7], (4) interval information [8], (5) likelihood weighted information [8], (6) gradual maximum information ratio [9], and (7) efficiency balanced information [10].
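A minimal sketch of the MFI criterion, assuming a 2PL item pool (the function names are illustrative, not SimulCAT's API):

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item: I(theta) = a^2 * p * (1 - p)."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1.0 - p)

def select_item_mfi(theta_hat, a, b, administered):
    """Pick the not-yet-administered item with maximum information
    at the current theta estimate."""
    info = item_information_2pl(theta_hat, np.asarray(a), np.asarray(b))
    info[list(administered)] = -np.inf           # exclude already-used items
    return int(np.argmax(info))

# With equal discriminations, MFI picks the item whose difficulty is
# closest to the current theta estimate.
pick = select_item_mfi(0.0, [1.0, 1.0, 1.0], [-2.0, 0.0, 2.0], administered=set())
```

The other criteria replace the information function being maximized but follow the same select-score-update loop.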

## Some statistics for evaluating computerized adaptive testing performance

### Measures of measurement precision

Because true *θ* values are readily available in MC simulation studies, evaluating the measurement error and precision of CAT is a straightforward calculation. The bias statistic, which is a measure of systematic measurement error, can be computed simply by averaging the difference between the estimated *θ* ($\hat{\theta}$) and the true *θ* across all test takers (a.k.a. simulees). That is,

$$\mathrm{BIAS} = \frac{1}{I}\sum_{i=1}^{I}\left(\hat{\theta}_i - \theta_i\right),$$

where *I* is the number of simulees. The bias statistic is commonly used with most conventional, nonadaptive test programs, and is still often used with many CAT programs as a “good-to-know” summary statistic. However, because the characteristics of test forms can differ dramatically across *θ* ranges with CAT, conditional bias (CBIAS), which is bias within each *θ* range (for example, *θ* < −2, −2 ≤ *θ* < −1, −1 ≤ *θ* < 0, 0 ≤ *θ* < 1, 1 ≤ *θ* < 2, and *θ* ≥ 2), is generally the more important statistic to investigate in CAT. An example of CBIAS will be presented later in this article.
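Given arrays of true and estimated *θ* from a simulation run, the bias and CBIAS computations reduce to a few lines. The sketch below uses the example bin edges from the text; the function names are illustrative.

```python
import numpy as np

def bias(theta_true, theta_hat):
    """Overall bias: mean of (estimate - true) across all simulees."""
    return float(np.mean(np.asarray(theta_hat) - np.asarray(theta_true)))

def conditional_bias(theta_true, theta_hat, edges=(-2, -1, 0, 1, 2)):
    """CBIAS: bias computed separately within each theta range.

    Bin 0 is theta < -2, bin 1 is -2 <= theta < -1, ..., bin 5 is theta >= 2.
    """
    theta_true = np.asarray(theta_true)
    theta_hat = np.asarray(theta_hat)
    bins = np.digitize(theta_true, edges)        # bin index per simulee
    return {int(k): float(np.mean((theta_hat - theta_true)[bins == k]))
            for k in np.unique(bins)}

# A uniform +0.1 error shows up as a bias of 0.1 overall and in every bin.
true_theta = np.array([-2.5, -0.4, 0.6, 2.3])
est_theta = true_theta + 0.1
overall = bias(true_theta, est_theta)
by_range = conditional_bias(true_theta, est_theta)
```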

The conditional root mean squared error (CRMSE) and the conditional mean absolute error (CMAE) are, likewise, the RMSE and MAE computed within each *θ* range.

CRMSE values computed within *θ* ranges are often interpreted as approximations of the conditional standard error of measurement (CSEM). The exact point estimate of the CSEM can be computed easily using the CRMSE with a simulation design that places thousands of simulees at the same exact *θ* value; for example, for 1,000 simulees with *θ* = −1, the CRMSE of those 1,000 simulees is the CSEM at *θ* = −1.
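The repeated-simulee design just described can be sketched as follows. The estimates here are drawn from a normal distribution purely for illustration; in practice they would come from the CAT simulation output.

```python
import numpy as np

rng = np.random.default_rng(7)

theta_true = -1.0
# Stand-in for 1,000 simulated CAT estimates of simulees all at theta = -1
# (illustrative draws with SD 0.3, not real simulation output).
theta_hat = rng.normal(theta_true, 0.3, size=1000)

# The CRMSE over simulees sharing one true theta is the CSEM at that theta.
csem_at_minus_1 = float(np.sqrt(np.mean((theta_hat - theta_true) ** 2)))
```

Repeating this at a grid of *θ* values traces out the full CSEM curve.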

The reliability coefficient also depends on the sample's *θ* distribution and the *θ* estimation method. With CAT, measurement precision can vary substantially across *θ* ranges, and the reliability coefficient could often mislead people about the quality of measurement at different *θ* levels. For CAT programs, reporting the CSEM (or its approximations, such as the CRMSE or CMAE) at the most relevant *θ* ranges is strongly advised instead of reporting the reliability coefficient. If the reliability coefficient must be reported, it should be computed based on the most representative sample of simulees, and it should be accompanied by CSEM statistics.
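When a reliability coefficient must be reported, one common marginal approximation treats the mean squared SEE as the error variance. This is a sketch of that convention only; programs differ in exactly how they define empirical reliability.

```python
import numpy as np

def empirical_reliability(theta_hat, see):
    """Approximate marginal reliability: estimated true-score variance over
    total variance, using the mean squared SEE as the error-variance estimate."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    err_var = float(np.mean(np.asarray(see, dtype=float) ** 2))
    obs_var = float(np.var(theta_hat))
    return obs_var / (obs_var + err_var)

# Illustrative inputs: N(0, 1) score estimates with a flat SEE of 0.3
# give a reliability of roughly 1 / (1 + 0.09) ~ 0.92.
rng = np.random.default_rng(11)
rel = empirical_reliability(rng.normal(0.0, 1.0, 5000), np.full(5000, 0.3))
```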

### Measures for test security performance

With CAT, item exposure can concentrate differently at each *θ* level. Therefore, it is common for high-stakes tests to evaluate and control the conditional maximum item exposure within each *θ* level [14].

Another frequently used summary of test security is the average between-test overlap index,

$$\bar{T} = \frac{p}{p-1}\cdot\frac{\sum_{j} r_j^{2}}{k} - \frac{1}{p-1},$$

where $r_j$ is the exposure rate of item $j$, $p$ is the number of (fixed-length) CAT forms administered, and $k$ is the number of items in each form. Because this test overlap index is an average and is used frequently in practice, it could cause test practitioners to overlook the worst instances of test overlap. Thus, test practitioners should still investigate the most severe cases of test overlap, even if the average between-test overlap index is within a reasonable range.
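The exposure-based overlap computation can be sketched in a few lines, assuming a standard exposure-rate formulation of the average between-test overlap (the function name is illustrative):

```python
import numpy as np

def avg_test_overlap(exposure_rates, p, k):
    """Average between-test overlap index from item exposure rates r_j,
    the number of fixed-length CAT forms p, and test length k."""
    r = np.asarray(exposure_rates, dtype=float)
    return float((p / (p - 1)) * np.sum(r**2) / k - 1.0 / (p - 1))

# Sanity check: if every form contains the same k items (each with an
# exposure rate of 1.0), any two forms overlap completely.
full_overlap = avg_test_overlap(np.ones(5), p=100, k=5)   # ~ 1.0
```

Scanning the pairwise overlaps, rather than only this average, surfaces the worst cases the text warns about.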

The correlation between item *a*-parameter values and item exposure could provide important information about the pattern of item selection and use given the item selection criterion of the CAT. A test item’s *a*-parameter is one of the most important factors in the item information function, and many item selection criteria, including the MFI, have a strong tendency toward excessive use of items with higher *a*-parameter values. If the correlation coefficient is observed to be too high to accept, a test practitioner might improve the situation by lowering the target item exposure rate or changing the item selection criterion (for example, from the MFI to the *b*-matching method).
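Checking this diagnostic is a one-line correlation once exposure rates are tallied. Both arrays below are fabricated solely to illustrate the computation; real values would come from the item bank and the simulation's exposure counts.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical 200-item pool in which higher-a items are over-selected,
# mimicking the MFI tendency described in the text.
a_values = rng.uniform(0.5, 2.0, size=200)
exposure = np.clip(0.25 * (a_values - 0.5) + rng.normal(0.0, 0.05, 200),
                   0.0, 1.0)

corr = float(np.corrcoef(a_values, exposure)[0, 1])
```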

## Example of a computerized adaptive testing simulation study using SimulCAT

With the true *θ*, the *θ* estimate, and the SEE recorded in the SimulCAT output file (*.sca), we can easily compute and plot the CSEE, CMAE, and CBIAS (SPSS was used in this example). The simulation results show that the CSEE was tightly controlled to be lower than 0.3 across all *θ* areas, as targeted (Fig. 11), and the actual observed errors based on the CMAE seemed to be consistent across all *θ* areas (Fig. 12). The CBIAS (Fig. 13) indicated that the observed systematic error was almost zero. The average number of items administered (while satisfying the SEE criterion of 0.3) was less than 17 when −2 < *θ* < 2 (Fig. 14). The average test length increased to about 18 at the extreme *θ* levels, partly because the initial *θ* value was set to be chosen randomly between −1 and 1 and the *θ* estimate jump was limited to less than 1 for the first 5 items. If one wants to reduce the average number of items for individuals at the extreme *θ* levels, one could consider relaxing the constraint on the *θ* estimate jump.

From the relationship between item exposure and *a*-parameter values (Fig. 15), it is apparent that the studied CAT design too strongly favored items with higher *a*-values. One possible change to remedy this issue would be switching the item selection criterion from the MFI method to the *b*-matching method, which does not take *a*-values into consideration. Regarding the item difficulty (*b*-value) of items in the pool, there seemed to be no shortage of items at any particular difficulty level (Fig. 16).