## Introduction

## Three (iterative) processes of CAT administration

## Content balancing

## Item selection criteria

### Maximized Fisher information

The Fisher information of item *i*, known as the item information function (IIF), can be computed as

$$I_i(\theta) = \frac{[P_i'(\theta)]^2}{P_i(\theta)\,Q_i(\theta)}$$

##### (1)

where $P_i(\theta)$ is the probability of a correct response from a person at a given θ, $Q_i(\theta) = 1 - P_i(\theta)$, and $P_i'(\theta)$ is the first derivative of $P_i(\theta)$. If items are calibrated with a 2-parameter logistic (2PL) IRT model, $P_i(\theta)$ can be computed as

$$P_i(\theta) = \frac{1}{1 + \exp[-D\,a_i(\theta - b_i)]}$$

##### (2)

The IIF of item *i* (Equation 1), therefore, reduces to

$$I_i(\theta) = D^2\,a_i^2\,P_i(\theta)\,Q_i(\theta)$$

##### (3)

where *D* is the scaling constant of 1.702, and $a_i$ and $b_i$ are the discrimination and difficulty parameters of item *i*.

For example, item 4 has the lowest *a*-parameter value (= 0.5). Additionally, there are other items that exhibit a higher IIF at any given θ between −3 and 3. Therefore, there is no chance that item 4 would be selected and used under the MFI criterion in this example. Fig. 4 displays an example of a typical item usage and exposure pattern with the MFI criterion. In this example, CAT administers 30 out of 300 items in the pool based solely on the MFI criterion. The figure clearly shows a pattern of excessive use of items with higher *a*-parameter values, as well as a pattern of infrequent use of items with lower *a*-parameter values. The ‘greedy’ nature of MFI item selection poses serious threats to test security and creates issues with item pool utilization, and has thus led to the development of other item selection criteria and item exposure control methods.
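The MFI selection described above can be sketched in a few lines. This is a minimal illustration, not production CAT code; the `(item_id, a, b)` pool format and the function names are assumptions made here for demonstration.

```python
import math

D = 1.702  # scaling constant from the text

def p_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model (Equation 2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def iif_2pl(theta, a, b):
    """Item information function for a 2PL item (Equation 3): D^2 a^2 P Q."""
    p = p_2pl(theta, a, b)
    return D ** 2 * a ** 2 * p * (1.0 - p)

def select_mfi(theta_hat, items):
    """MFI criterion: pick the eligible item whose IIF is largest at theta_hat.

    `items` is a list of (item_id, a, b) tuples (hypothetical pool format).
    """
    return max(items, key=lambda it: iif_2pl(theta_hat, it[1], it[2]))[0]

# A toy pool: the low-discrimination item (a = 0.5) is never competitive,
# mirroring the 'greedy' usage pattern described above.
pool = [(1, 1.8, 0.0), (2, 1.2, 0.1), (3, 0.9, -0.5), (4, 0.5, 0.2)]
```

Running `select_mfi` repeatedly across interim θ estimates would keep returning the high-*a* items, which is exactly the exposure problem the following criteria try to mitigate.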

### Difficulty matching criterion

The difficulty matching (*b*-matching) criterion evaluates the distance between the interim θ estimate and the *b*-parameters of all eligible items and selects the item with the minimal distance. This approach is commonly used when test items are calibrated with a 1-parameter logistic (1PL) model or Rasch model, since items exhibit the most information when their difficulty is closest to the θ value. In fact, the *b*-matching approach essentially results in the same item selection pattern as the MFI approach when a 1PL or Rasch model is used. The *b*-matching criterion is often used with items calibrated with 2PL or 3PL models as well, since, unlike the MFI criterion, it does not demonstrate the ‘greedy’ item-selection pattern that selects only items with higher *a*-parameter values.
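Because *b*-matching only compares difficulties against the interim θ estimate, it reduces to a nearest-neighbor search. A minimal sketch, assuming a hypothetical `(item_id, b)` pool format:

```python
def select_b_matching(theta_hat, items):
    """b-matching criterion: choose the eligible item whose difficulty (b)
    is closest to the interim theta estimate.

    `items` is a list of (item_id, b) tuples (hypothetical pool format).
    """
    return min(items, key=lambda it: abs(it[1] - theta_hat))[0]

pool = [(1, -1.0), (2, 0.3), (3, 1.2)]
```

Note that the item's *a*-parameter never enters the comparison, which is why this criterion avoids the MFI criterion's greedy preference for high-discrimination items.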

### Interval information criterion

The interval information criterion evaluates item information integrated across an interval of the θ scale rather than at a single point estimate. The interval information for item *i* is

$$II_i = \int_{\hat{\theta}-\delta}^{\hat{\theta}+\delta} I_i(\theta)\, d\theta$$

##### (4)

where δ determines the width of the evaluation interval around the interim estimate of *θ*.
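The interval information can be approximated numerically. The sketch below uses a simple trapezoidal rule over θ̂ ± δ; the interval half-width `delta` and step count are illustrative choices, not values from the text.

```python
import math

D = 1.702  # scaling constant

def iif_2pl(theta, a, b):
    """2PL item information function: D^2 a^2 P (1 - P)."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D ** 2 * a ** 2 * p * (1.0 - p)

def interval_information(a, b, theta_hat, delta=0.5, steps=200):
    """Trapezoidal approximation of the IIF integrated over
    [theta_hat - delta, theta_hat + delta] (delta is an assumed width)."""
    lo, hi = theta_hat - delta, theta_hat + delta
    h = (hi - lo) / steps
    total = 0.5 * (iif_2pl(lo, a, b) + iif_2pl(hi, a, b))
    total += sum(iif_2pl(lo + k * h, a, b) for k in range(1, steps))
    return total * h
```

An item whose difficulty sits inside the interval accumulates more interval information than one whose difficulty lies far outside it.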

### Weighted likelihood information criterion

The weighted likelihood information (WLI) criterion evaluates item information across the *θ* scale, weighted by the likelihood function given the items administered thus far. With the WLI criterion, the item to be selected is the item *i* that results in the maximized value of

$$WLI_i = \int I_i(\theta)\, L(\theta;\, \mathbf{x}_{m-1})\, d\theta$$

##### (5)

where $L(\theta;\, \mathbf{x}_{m-1})$ is the likelihood function of the response vector $\mathbf{x}_{m-1}$ after the (*m*−1)th item administration. Weighting by the likelihood makes the criterion less dependent on a single, possibly inaccurate, interim point estimate, because the information is accumulated across plausible *θ* values.
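The likelihood-weighted integral above can be approximated directly. The sketch below assumes 2PL items, represents the responses so far as hypothetical `(a, b, x)` triples, and uses a trapezoidal rule over an assumed θ range of [−4, 4]:

```python
import math

D = 1.702  # scaling constant

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(theta, responses):
    """Likelihood of the responses observed so far; `responses` is a list of
    (a, b, x) triples with x = 1 for correct, 0 for incorrect."""
    L = 1.0
    for a, b, x in responses:
        p = p_2pl(theta, a, b)
        L *= p if x == 1 else (1.0 - p)
    return L

def wli(a, b, responses, lo=-4.0, hi=4.0, steps=400):
    """Trapezoidal approximation of the IIF weighted by the likelihood and
    integrated across the theta scale (Equation 5)."""
    h = (hi - lo) / steps

    def f(theta):
        p = p_2pl(theta, a, b)
        return D ** 2 * a ** 2 * p * (1.0 - p) * likelihood(theta, responses)

    total = 0.5 * (f(lo) + f(hi)) + sum(f(lo + k * h) for k in range(1, steps))
    return total * h
```

With an all-correct response record, the likelihood mass sits at high θ, so a harder item receives a much larger WLI value than an easy one.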

### a-Stratification method

With the *a*-stratification method, items with higher *a*-parameter values are reserved for use in the later stages of CAT by stratifying all items in the item pool by their *a*-parameter values. For example, if an item pool has 90 eligible items and a total of 9 items need to be selected and administered, as shown in Fig. 5, the items can be grouped into 3 item strata by their *a*-parameter values (30 items in each item stratum). At the beginning of CAT (the first 3 item administrations, for example), CAT selects and uses an item with a difficulty level that is closest to the interim θ estimate from the stratum with the lowest *a*-parameter values; later stages draw from the strata with higher *a*-parameter values. The overall performance of the *a*-stratification method has been proven to be solid as long as the item pool is optimally designed: it avoids the ‘greedy’ item selection pattern seen with the MFI criterion while minimizing the tradeoff in measurement efficiency.

One limitation of the *a*-stratification method is that in real-world applications, it is common to observe a moderate positive correlation between *a*- and *b*-parameters. In other words, items with higher *a*-parameter values tend also to have higher *b*-parameter values. Because of that, stratifying an item pool by items’ *a*-parameter values could unintentionally result in items being stratified by their *b*-parameter values as well. For example, the item stratum with the highest *a*-parameter values is likely to end up with items whose *b*-parameter values are also much higher than those in the strata with lower *a*-parameter values. This could lead to a serious shortage of items with specific difficulty levels within each item stratum. To address this issue, Chang et al. [12] proposed a modification called *a*-stratification with *b*-blocking. In the modified version, items are first stratified by their *b*-parameter values, and then the items from each *b*-parameter stratum are grouped by their *a*-parameter values to construct item strata that are based on *a*-parameters while being balanced in the *b*-parameter.

The *a*-stratification method (and its modification) generally yields stable performance, striking a balance between CAT measurement efficiency and overall item pool utilization, as long as the item pool is large and optimally designed. If the item pool is small, however, or if there are many content categories and test constraints, the actual number of eligible items within each item stratum could be extremely small. Under such circumstances, the CAT’s level of adaptability with this item selection method could suffer a serious downturn. Additionally, because the *a*-stratification method determines which item stratum to select an item from according to the stage of the CAT process, it is not usable when the test length is not fixed.
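The stratification steps above can be sketched as follows. This is a simplified illustration: the `(item_id, a, b)` pool format is an assumption, and the *b*-blocking routine shown here is one straightforward way to realize the "block by *b*, then spread by *a*" idea, not necessarily the exact procedure of Chang et al. [12].

```python
def stratify_by_a(pool, n_strata):
    """Plain a-stratification: sort items by a-parameter and split into
    equal-sized strata. `pool` is a list of (item_id, a, b) tuples."""
    ordered = sorted(pool, key=lambda it: it[1])
    size = len(ordered) // n_strata
    return [ordered[k * size:(k + 1) * size] for k in range(n_strata)]

def stratify_with_b_blocking(pool, n_strata):
    """a-stratification with b-blocking: first block items by b-parameter,
    then spread each b-block across the a-strata so that every stratum
    covers the full difficulty range."""
    strata = [[] for _ in range(n_strata)]
    by_b = sorted(pool, key=lambda it: it[2])
    # Take consecutive b-blocks of size n_strata; within each block the
    # lowest-a item goes to stratum 0, the next to stratum 1, and so on.
    for k in range(0, len(by_b) - n_strata + 1, n_strata):
        block = sorted(by_b[k:k + n_strata], key=lambda it: it[1])
        for s in range(n_strata):
            strata[s].append(block[s])
    return strata

def select_from_stratum(stratum, theta_hat):
    """Within the active stratum, pick the item whose difficulty is
    closest to the interim theta estimate."""
    return min(stratum, key=lambda it: abs(it[2] - theta_hat))[0]
```

Early administrations would call `select_from_stratum` on the lowest-*a* stratum and move to higher-*a* strata as the test progresses.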

### Efficiency balanced information criterion

The efficiency balanced information (EBI) criterion was developed to balance the usage of items with different *a*-parameter values, as in the *a*-stratification method, but with no need to stratify the item pool. One component of the EBI criterion involves evaluating the expected item efficiency (EIE), which is defined as the level of realization of an item’s potential information at an interim θ estimate. The EIE of item *i* at the *j*-th item administration is computed as

$$EIE_{ij} = \frac{I_i(\hat{\theta}_j)}{I_i(b_i)}$$

##### (6)

where the denominator is the maximum of the IIF, which occurs at θ = *b* when using either a 1PL or 2PL model. In the EBI criterion, the EIE (Equation 6) is assessed across a *θ* interval. The width of the *θ* interval for the item efficiency (IE) evaluation is determined by the SEE (ε) and set to 2 SEEs from the interim θ estimate. The IE of item *i* at the *j*-th item administration is computed as

$$IE_{ij} = \int_{\hat{\theta}_j - 2\varepsilon}^{\hat{\theta}_j + 2\varepsilon} \frac{I_i(\theta)}{I_i(b_i)}\, d\theta$$

##### (7)

An item with a lower *a*-parameter will result in a larger IE value if all other conditions are the same among items, because items with a lower *a*-parameter tend to show greater efficiency across a wider range of *θ*. Consequently, items with lower *a*-values tend to have a better chance of being selected at the beginning of CAT, whereas items with higher *a*-values are selected more frequently in the later stages.
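The EIE and IE quantities can be sketched numerically under the 2PL model. The implementation below follows Equations 6 and 7 as reconstructed here (information realized at θ̂ relative to the maximum at θ = *b*, averaged over θ̂ ± 2 SEEs); treat it as an illustration of the idea rather than the published EBI algorithm in full.

```python
import math

D = 1.702  # scaling constant

def iif_2pl(theta, a, b):
    """2PL item information function."""
    p = 1.0 / (1.0 + math.exp(-D * a * (theta - b)))
    return D ** 2 * a ** 2 * p * (1.0 - p)

def eie(theta_hat, a, b):
    """Expected item efficiency (Equation 6): realized information at
    theta_hat relative to the item's maximum information, which occurs
    at theta = b under a 1PL or 2PL model."""
    return iif_2pl(theta_hat, a, b) / iif_2pl(b, a, b)

def item_efficiency(theta_hat, see, a, b, steps=200):
    """Average the EIE over theta_hat +/- 2 SEEs (trapezoidal rule),
    following the reconstructed Equation 7 normalized by interval width."""
    lo, hi = theta_hat - 2 * see, theta_hat + 2 * see
    h = (hi - lo) / steps
    total = 0.5 * (eie(lo, a, b) + eie(hi, a, b))
    total += sum(eie(lo + k * h, a, b) for k in range(1, steps))
    return total * h / (hi - lo)
```

As the text notes, a low-*a* item keeps its efficiency high across a wide θ range, so it outscores a high-*a* item whenever the interim estimate is away from the item's difficulty.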

### Kullback-Leibler information criterion

The Kullback-Leibler information (KLI) at θ for the *i*-th item with response $X_i$ is defined as

$$K_i(\theta_0, \theta) = P_i(\theta_0)\log\frac{P_i(\theta_0)}{P_i(\theta)} + \left[1 - P_i(\theta_0)\right]\log\frac{1 - P_i(\theta_0)}{1 - P_i(\theta)}$$

##### (8)

where $P_i(\theta_0)$ is the probability that a random test taker at proficiency level $\theta_0$ answers the item correctly. The moving average of the KLI is then calculated and used as the item selection criterion, as follows,

$$KI_i(\hat{\theta}_m) = \int_{\hat{\theta}_m - c/\sqrt{m}}^{\hat{\theta}_m + c/\sqrt{m}} K_i(\hat{\theta}_m, \theta)\, d\theta$$

##### (9)

with *c* selected according to a specified coverage probability and with *m* being the number of items administered thus far. Chang and Ying [14] found that replacing the MFI criterion with the KLI criterion often reduced the biases and mean-squared errors of proficiency estimation when the test length was short (*m* < 30) or when the CAT administration was in its early stage, where the interim θ estimates are still far from stable.
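The KLI index can be sketched as follows, assuming 2PL items and a trapezoidal approximation of the integral; the coverage constant `c = 3.0` is an illustrative choice, not a value prescribed by the text.

```python
import math

D = 1.702  # scaling constant

def p_2pl(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def kli(theta, theta0, a, b):
    """Pointwise KL information (Equation 8) between the response
    distributions at theta0 and theta."""
    p0, p = p_2pl(theta0, a, b), p_2pl(theta, a, b)
    return p0 * math.log(p0 / p) + (1 - p0) * math.log((1 - p0) / (1 - p))

def kli_index(theta_hat, m, a, b, c=3.0, steps=200):
    """Integrate the KLI over theta_hat +/- c/sqrt(m) (Equation 9); the
    interval shrinks as more items (m) are administered."""
    delta = c / math.sqrt(m)
    lo, hi = theta_hat - delta, theta_hat + delta
    h = (hi - lo) / steps
    total = 0.5 * (kli(lo, theta_hat, a, b) + kli(hi, theta_hat, a, b))
    total += sum(kli(lo + k * h, theta_hat, a, b) for k in range(1, steps))
    return total * h
```

Because the integration interval is widest early in the test, the KLI index rewards items that separate θ values across a broad region, which is where Chang and Ying [14] observed its advantage over MFI.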

## Item exposure control

### Randomesque

### Sympson-Hetter method

### Unconditional multinomial method

### Conditional multinomial method

### Fade-away method

where *C* is the absolute item usage limit (of the first exposure-control component) and $U_i$ is the item usage for the life of item *i*. With this new method, rarely used items are expected to be promoted more frequently, and excessively used items are likely to “fade away” from item selection. This method can be especially useful and effective in CAT with cloud-based systems.
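A sketch of the fade-away idea is shown below. The exact published weighting function is not reproduced here; the linear weight `(C - U_i) / C` is a hypothetical stand-in that captures the stated behavior (never-used items at full weight, items at the absolute limit excluded entirely).

```python
def fade_away_weight(usage, limit):
    """Hypothetical fade-away weight: 1.0 for a never-used item, shrinking
    linearly to 0.0 as lifetime usage U_i approaches the absolute limit C.
    (Illustrative functional form, not the published one.)"""
    return max(0.0, (limit - usage) / limit)

def select_with_fade_away(criterion_values, usage, limit):
    """Down-weight each item's base selection-criterion value (e.g. its IIF
    at the interim theta) by its fade-away weight, so that overused items
    gradually drop out of contention.

    `criterion_values` maps item_id -> base criterion value;
    `usage` maps item_id -> lifetime usage count U_i."""
    return max(criterion_values,
               key=lambda i: criterion_values[i] * fade_away_weight(usage.get(i, 0), limit))
```

In a cloud-based system, the lifetime usage counts would be read from a shared store so that the weights reflect exposure across all concurrent test sessions.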

## CAT using automated test assembly approaches

With ATA-based approaches such as the shadow test approach (STA), the maximization of test information at the interim estimate on the *θ* scale is set to be an objective for the MIP, and the test content balancing and other test specifications are formulated as constraints. For example, if the goal is to implement an STA that is equivalent to a 10-item-long CAT using the MFI item selection criterion (see Equation 1) with the content balancing scenario shown in Table 2, then the MIP model can be expressed as:

$$\text{maximize} \sum_{i=1}^{I} I_i(\hat{\theta})\, x_i$$

subject to the test length and the content requirements of Table 2,

$$\sum_{i=1}^{I} x_i = n, \qquad \sum_{i=1}^{I} C_{ki}\, x_i = n_k \ \ (k = 1, \ldots, 5), \qquad x_i \in \{0, 1\}$$

##### (13)

where *i* = {1, 2, 3, …, *I*}, *I* is the number of items in the item pool, *n* is the test length, $x_i$ is a binary variable indicating whether item *i* is included in the constructed test form (1 if included and 0 if not included), *g* is the number of items administered so far (those items are constrained to remain in the assembled form), and $C_{1i}$, $C_{2i}$, $C_{3i}$, $C_{4i}$, and $C_{5i}$ are binary (0 or 1) identifiers for item *i* for each content type/area (for example, if $C_{1i}$ is 1, the content type of item *i* is “pure”).

## Choosing an optimal CAT approach

### Interactions between item selection criteria and exposure control

To illustrate, a simulation study compared 4 item selection criteria (MFI, *a*-stratification, *b*-matching, and EBI) paired with 5 different exposure control methods (none, randomesque, SH, CM, and FA). Fig. 6 displays the item usage/exposure patterns by items’ *a*-parameter values under each condition, and Fig. 7 shows the conditional standard error of *θ* estimation (CSEE). Except for the item selection criterion and item exposure control method, all other test conditions were identical: *θ* values were generated for 2,000 simulees following a standard normal distribution, each simulee was administered 20 items, and each item pool contained 400 items. As shown in Fig. 6, when the MFI criterion was used with no item exposure control, more than half of the items in the pool were not used at all, while items with higher *a*-parameter values were used excessively. Because the MFI criterion always selects items that maximize the information function, the CSEE was the smallest with the MFI criterion compared with the other criteria in the absence of any exposure control (Fig. 7). The *a*-stratification and *b*-matching criteria showed even item usage regardless of an item’s *a*-parameter value, even without any additional exposure control. When the EBI criterion was used with no item exposure control, it tended to excessively select items with lower *a*-parameter values. As a result, the CSEE was noticeably larger than in the cases using other item selection criteria.

The SH method made little practical difference for the *a*-stratification and *b*-matching criteria because those criteria never showed a maximum exposure rate larger than 0.2 in the first place. When the FA method was used to control item exposure, the EBI criterion showed the most even item usage pattern among all 4 criteria, and its CSEE was consistently low throughout the *θ* intervals. The MFI criterion with the FA item exposure method also showed a significantly lower maximum exposure rate, without necessarily leading to the promotion of underused items in item selection. When the CM method was used to control item exposure for each of 6 different *θ* groups at the rate of 0.2, the item usage pattern was similar to that seen with the SH method (Fig. 6), but there were serious surges of the CSEE at *θ* < −1.5 and *θ* > 1.5 (Fig. 7), regardless of the item selection criterion chosen. The increased CSEEs at extreme *θ* values were the result of the CM method’s tight control of the item exposure rate, which limited the usage of items with either very low or very high *b*-parameter values in the pool. When the CM method capped the usage of items with very low or very high *b*-parameter values, items lacking the optimal difficulty level were forced into use as an alternative, eventually leading to a dramatic increase in the CSEE.

### Conventional 3-component approach versus ATA-based approaches

For example, suppose one observes increased CSEEs at extreme *θ* ranges when the MFI criterion is used with the CM item exposure control given a particular item pool. Based on the identified issue, one can arrive at possible solutions by tackling the potential causes within each item selection component. For example, relaxing some item content balancing parameters might promote the inclusion of more optimal items given the item selection criterion. Alternatively, adjusting the CM item exposure control setting to have a slightly higher exposure rate target, or to have fewer *θ* groups for conditional control, may help reduce the CSEE. It is also possible that increasing the size of the item pool by adding more items with extreme *b*-parameter values could ultimately resolve the issue. Under the conventional CAT algorithm with 3 item-selection components, different approaches to addressing the issue can easily be tried separately at each component level. In contrast, when ATA-based approaches such as the STA fail to create a test form, it is often difficult to understand what exactly is causing the failure of test assembly.