### Materials and/or subjects

The average number of KMLE candidates is over 3,500 every year. Therefore, this study assumed that 5 item sets would be needed, with 1,000 students taking each set. The simulation study was conducted using 6 years of cumulative data. To validate the data, this study investigated item difficulty parameters and the ability distributions of the candidates for each year. The resulting dataset consisted of a total of 2,410 items, including 450 items from 2012, 400 items from 2013, 400 items from 2014, 400 items from 2015, 400 items from 2016, and 360 items from 2017. This study assumed that the item bank would include 2,410 items and constructed 5 item sets, each consisting of 360 items.

### Study design

The necessary constraints to construct 5 equated item sets are as follows. First, it is important to balance several content areas on the KMLE. Item sets were categorized according to the subjects of the licensing examination. The sub-factors of the KMLE comprise 8 categories according to the subjects of the licensing examination and 18 categories according to a more specific classification. In general, if sub-factors are too specific and numerous, equating 5 item sets is inefficient and cannot be accomplished through LP because there are too many degrees of freedom. As a result, this study sought to balance 8 sub-factors based on the subjects of the licensing examination. The DETECT value [1] was used to examine the extent of the multidimensional simple structure of the KMLE for the 6 years of cumulative data. A confirmatory DETECT analysis was conducted using the ‘sirt’ package [2] in the R statistical software 3.4.4 (The R Foundation for Statistical Computing, Vienna, Austria) [3]. All the DETECT values were less than 0.1 (0.019 in 2012, 0.025 in 2013, 0.024 in 2014, 0.025 in 2015, 0.025 in 2016, and 0.028 in 2017). This indicates that each year of data was essentially unidimensional. The KMLE is composed of easy items, as seen by the fact that its pass rate is over 90%. For this reason, the DETECT program might report the 8-dimensional data as unidimensional. We therefore assumed that each year of data was multidimensional, based on the test specification that comprised 8 categories.

Second, the mean and standard deviation of the item difficulty statistics across the 5 item sets should be the same. Because requiring these values to be exactly equal would drastically reduce the number of mathematically feasible solutions, this study instead required the mean and standard deviation to be similar across the 5 item sets. Therefore, the item difficulty statistics were divided into either 2 or 3 categories using the predicted correct answer rate (PCAR), and the same number of items was assigned for each item difficulty category.

In this case, the item difficulty was determined using the PCAR, which was computed by the KMLE item developers when they created items and could be interpreted as a predicted value. The PCAR ranged from 0 to 100, with values interpreted as the ratio of the number of correct responses to the total number of responses. If the PCAR is large, the item is easy, and vice versa.

Previously, the PCAR was divided into 6 categories based on the subjects of the licensing examination to determine the difficulty constraint. The item parameter distribution of the PCAR is presented in Table 1. Almost 90% of PCARs were between 60 and 90. Based on a previous investigation [4], it is meaningless to divide the PCAR into 6 categories. Two categories divided by a PCAR of 75 or 3 categories divided by PCARs of 60 and 75 would be appropriate for setting equal item difficulty constraints. As a result, this study examined the quality of equating 5 item sets using 2 or 3 divisions of item difficulty. Based on this item difficulty design, this study examined which equating conditions provided 5 equally pre-equated item sets.
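The 2-category and 3-category divisions can be illustrated with a short sketch. The function name, the example PCAR values, and the rule that a PCAR exactly equal to a cut point falls into the upper (easier) category are assumptions made for illustration; the text does not state how boundary values were handled.

```python
from bisect import bisect_right

TWO_CATEGORY_CUTS = [75]        # 2 categories divided by a PCAR of 75
THREE_CATEGORY_CUTS = [60, 75]  # 3 categories divided by PCARs of 60 and 75


def pcar_category(pcar, cuts):
    """Return the 0-based difficulty category of a PCAR value.

    `cuts` is an ascending list of cut points. A value equal to a cut
    point is placed in the upper (easier) category; this boundary rule
    is an assumption, not something stated in the study.
    """
    if not 0 <= pcar <= 100:
        raise ValueError("PCAR must lie between 0 and 100")
    return bisect_right(cuts, pcar)


# Made-up PCAR values, used only to show the binning.
example_pcars = [55, 60, 74.9, 75, 88]
two = [pcar_category(p, TWO_CATEGORY_CUTS) for p in example_pcars]
three = [pcar_category(p, THREE_CATEGORY_CUTS) for p in example_pcars]
```

Category 0 holds the hardest items (lowest PCAR) in both schemes, because a larger PCAR means an easier item.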

Third, this study investigated whether common items could contribute to the accuracy of equating of item sets. Item sets that had 20% of the total items in common and item sets without common items were constructed and compared with each other.

This study was designed through the following procedures. First, 5 item sets were constructed by LP using 2 or 3 divisions of the PCAR, and then compared with 5 item sets constructed by random item selection. Second, 5 item sets with 20% of the items in common were compared with 5 item sets without common items.

To compare the accuracy of equating in each condition, we estimated the actual correct answer rate (ACAR) and the difficulty parameter of item response theory (IRT). The Rasch model was used to estimate the IRT difficulty parameters in this study [5]. In the Rasch model, we used the marginal maximum likelihood method for item parameter estimation and the expected a posteriori method for ability parameter estimation. IRT difficulty parameters were estimated on the assumption that the candidates’ abilities were the same every year. Therefore, the ACAR and IRT difficulty parameters were used to evaluate the equating accuracy.
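For reference, the Rasch model in its standard form gives the probability that candidate $j$ answers item $i$ correctly. The symbols $\theta_j$ (ability) and $b_i$ (difficulty) are conventional notation introduced here and do not appear in the text above.

```latex
P(X_{ij} = 1 \mid \theta_j, b_i) = \frac{\exp(\theta_j - b_i)}{1 + \exp(\theta_j - b_i)}
```

Note that the two difficulty measures run in opposite directions: a larger $b_i$ indicates a harder item, whereas a larger ACAR (like a larger PCAR) indicates an easier one.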

### Technical information

To equate the difficulty of the item sets, this study conducted a simulation study using LP, as suggested by van der Linden [6]. Each item set was composed of 360 items from the item bank. The item bank consisted of 8 sub-factors to consider the content balancing issue [7]. As shown in Table 1, the sub-factors of the item bank had 317, 316, 317, 182, 958, 152, 120, and 48 items, respectively, and each sub-factor was demonstrated to be a unidimensional trait [8]. The item sets were also constructed under 2 conditions: with 20% common items or without common items. The constraints can be summarized as follows: (1) generate 5 item sets; (2) the number of items in each item set is 360; (3) the 8 sub-factors have 45, 45, 45, 25, 154, 20, 20, and 6 items, respectively; and (4) there are either no common items or 20% common items.
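A quick arithmetic check shows that these counts are mutually consistent and that the bank is large enough for 5 item sets. This is only a sketch: the variable names are illustrative, and the common-item count assumes that the same 20% of items is shared by all 5 sets, which the text does not specify.

```python
BANK_SIZES = [317, 316, 317, 182, 958, 152, 120, 48]  # items per sub-factor in the bank
SET_QUOTAS = [45, 45, 45, 25, 154, 20, 20, 6]         # items per sub-factor in one item set
N_SETS = 5
COMMON_PERCENT = 20

set_length = sum(SET_QUOTAS)   # items in one item set
bank_size = sum(BANK_SIZES)    # items in the whole bank

# Without common items, every set needs its own items in each sub-factor.
needed_no_overlap = [N_SETS * q for q in SET_QUOTAS]
feasible_no_overlap = all(
    need <= available for need, available in zip(needed_no_overlap, BANK_SIZES)
)

# With common items (assumption: one block shared by all 5 sets),
# fewer distinct items are drawn from the bank.
n_common = set_length * COMMON_PERCENT // 100
distinct_with_common = n_common + N_SETS * (set_length - n_common)
```

Under these assumptions the no-common-items condition is the more demanding one, since it draws 1,800 distinct items from the bank of 2,410.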

To construct an optimal test using LP, the above constraints must be expressed in terms of decision variables and then converted to a mathematical optimization problem. Decision variables are the variables whose values are chosen so as to reach the best decision in the optimization problem. Solving the problem means finding a set of values such that the objective function is optimal and all constraints are satisfied [4].

The constraints of this study can be solved by selecting the following variables. The index i = 1,…,360 runs over the items in each item set. It is assumed that sub-factor A1 is composed of i = 1,…,45; A2 of i = 46,…,90; A3 of i = 91,…,135; A4 of i = 136,…,160; A5 of i = 161,…,314; A6 of i = 315,…,334; A7 of i = 335,…,354; and A8 of i = 355,…,360.
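The index ranges follow directly from the cumulative sums of the sub-factor counts (45, 45, 45, 25, 154, 20, 20, and 6). A minimal sketch, with illustrative variable names:

```python
from itertools import accumulate

SET_QUOTAS = [45, 45, 45, 25, 154, 20, 20, 6]  # items per sub-factor A1..A8

ends = list(accumulate(SET_QUOTAS))            # last index of each sub-factor
starts = [1] + [end + 1 for end in ends[:-1]]  # first index of each sub-factor

# Map each sub-factor label to its (first, last) item index.
index_ranges = {
    f"A{n}": (start, end)
    for n, (start, end) in enumerate(zip(starts, ends), start=1)
}
```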

The decision variable of this study was a binary variable for each item. If item $i$ is selected, $x_i = 1$, and if item $i$ is not selected, $x_i = 0$. The number of items in each item set is expressed as $\sum_{i=1}^{360} x_i$.

In order to equate the item sets, the average PCAR was used in this study. $P_i$ indicates the PCAR of item $i$. With $x_i = 1$ for selected items, $\sum_{i=1}^{360} P_i x_i$ is the sum of the PCARs. If $\sum_{i=1}^{360} P_i x_i$ is divided by 360, it becomes the average PCAR. This study was designed to make the average PCARs of the item sets as close to each other as possible. The difference in average PCARs should be smaller than $\tau = 0.05$. The LP that obtains the optimal test from these constraints is summarized as follows.

(1) The number of item sets is 5, indexed by $k = 1, \ldots, 5$.

(2) For all $k$, the average PCARs expressed by $\frac{\sum_{i=1}^{360} P_i x_i}{360}$ are similar.

(3) The number of items for the 8 sub-factors in each item set is as follows: 45, 45, 45, 25, 154, 20, 20, and 6 items for A1 through A8, respectively.

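(4) The no-common-items constraint is defined as $\sum_{k=1}^{5} x_{ik} \le 1$ for all $i$, where $x_{ik}$ is the decision variable for item $i$ in item set $k$.

(5) The common-items constraint is defined as $\sum_{k=1}^{5} x_{ik} \le n_0^{\max}$ for all $i$, with $n_0^{\max} > 1$.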

The objective function of (2) is to create 5 item sets that are equated. The constraint in (2) mandates that the average PCARs of the 5 item sets differ by 0.05 or less. The constraint in (3) determines the number of items for the 8 sub-factors in each item set. The constraint in (4) expresses the absence of common items, while (5) formalizes the presence of common items among the 5 item sets.
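To make constraints (2) through (5) concrete, the following sketch checks a candidate assembly against them. It does not solve the LP; it only verifies a proposed solution. The function name, the data layout (dictionaries keyed by item id), and the toy values are all hypothetical.

```python
from collections import Counter


def check_item_sets(item_sets, pcar, subfactor, quotas, tau, max_use):
    """Check candidate item sets against the constraints listed above.

    item_sets : list of lists of item ids, one list per item set
    pcar      : dict mapping item id -> PCAR
    subfactor : dict mapping item id -> sub-factor label
    quotas    : dict mapping sub-factor label -> required count per set
    tau       : largest allowed gap between average PCARs
    max_use   : 1 forbids common items; larger values allow them
    """
    means = [sum(pcar[i] for i in s) / len(s) for s in item_sets]
    content_ok = all(
        Counter(subfactor[i] for i in s) == Counter(quotas) for s in item_sets
    )
    usage = Counter(i for s in item_sets for i in s)  # times each item is used
    return {
        "content_balanced": content_ok,
        "difficulty_matched": max(means) - min(means) <= tau,
        "overlap_ok": max(usage.values()) <= max_use,
    }


# Hypothetical toy data: 6 items, 2 sub-factors, 2 item sets of 3 items.
toy_pcar = {1: 70, 2: 80, 3: 90, 4: 72, 5: 78, 6: 90}
toy_subfactor = {1: "A", 2: "A", 3: "B", 4: "A", 5: "A", 6: "B"}
toy_quotas = {"A": 2, "B": 1}

disjoint = check_item_sets(
    [[1, 2, 3], [4, 5, 6]], toy_pcar, toy_subfactor, toy_quotas, tau=0.05, max_use=1
)
shared = check_item_sets(
    [[1, 2, 3], [1, 5, 6]], toy_pcar, toy_subfactor, toy_quotas, tau=0.05, max_use=1
)
```

In the second call, item 1 appears in both sets, so the no-common-items check fails when `max_use` is 1; raising `max_use` corresponds to the common-items condition.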

All of the constraints were linear in the decision variables. Therefore, the design of this study is equivalent to a 0–1 LP problem. The solution of the LP is an assignment of 0–1 values for which the objective function is minimal and all constraints are met, yielding 5 item sets that are equated as closely as possible. To summarize, this study was designed to construct 5 equated item sets from the item bank.
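The idea of choosing 0–1 values that satisfy all constraints while minimizing the difficulty gap can be illustrated on a very small, made-up bank. The sketch below uses exhaustive enumeration rather than an LP solver, which is feasible only because the toy bank has 8 items; it is not the procedure used in the study, and all names and values are hypothetical.

```python
from itertools import combinations

# Hypothetical toy bank: item id -> (sub-factor, PCAR)
BANK = {
    1: ("A", 62), 2: ("A", 70), 3: ("A", 81), 4: ("A", 88),
    5: ("B", 66), 6: ("B", 74), 7: ("B", 79), 8: ("B", 90),
}
QUOTAS = {"A": 2, "B": 1}  # items required per sub-factor in each set


def candidate_sets(bank, quotas):
    """Yield every item set (as a sorted tuple) that meets the content quotas."""
    by_factor = {}
    for item, (factor, _) in bank.items():
        by_factor.setdefault(factor, []).append(item)

    def expand(factors, chosen):
        if not factors:
            yield tuple(sorted(chosen))
            return
        head, rest = factors[0], factors[1:]
        for combo in combinations(by_factor[head], quotas[head]):
            yield from expand(rest, chosen + list(combo))

    yield from expand(sorted(quotas), [])


def mean_pcar(item_set, bank):
    """Average PCAR of one item set."""
    return sum(bank[i][1] for i in item_set) / len(item_set)


def best_disjoint_pair(bank, quotas):
    """Return (gap, set1, set2): the disjoint pair with the smallest PCAR gap."""
    sets = list(candidate_sets(bank, quotas))
    best = None
    for s1, s2 in combinations(sets, 2):
        if set(s1) & set(s2):
            continue  # enforce the no-common-items constraint
        gap = abs(mean_pcar(s1, bank) - mean_pcar(s2, bank))
        if best is None or gap < best[0]:
            best = (gap, s1, s2)
    return best


gap, first_set, second_set = best_disjoint_pair(BANK, QUOTAS)
```

With so few items, the smallest achievable gap in this toy bank is well above $\tau = 0.05$, which illustrates why a large item bank is needed for the tolerance constraint to be feasible.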