Acceptability of the 8-case objective structured clinical examination of medical students in Korea using generalizability theory: a reliability study

Purpose: This study used generalizability theory (GT) to investigate whether reliability was acceptable when the number of cases in the objective structured clinical examination (OSCE) decreased from 12 to 8.
Methods: This psychometric study analyzed data from the OSCE taken by 439 fourth-year medical students in the Busan and Gyeongnam areas of South Korea from July 12 to 15, 2021. The generalizability study (G-study) considered 3 facets, students (p), cases (c), and items (i), and designed the analysis as p×(i:c) because items were nested in cases. The acceptable generalizability (G) coefficient was set to 0.70. The G-study and decision study (D-study) were performed using G String IV ver. 6.3.8 (Papaworx, Hamilton, ON, Canada).
Results: All G coefficients except for July 14 (0.69) were above 0.70. The major sources of variance components (VCs) were items nested in cases (i:c), from 51.34% to 57.70%, and residual error (pi:c), from 39.55% to 43.26%. The proportion of VCs in cases was negligible, ranging from 0% to 2.03%.
Conclusion: The number of cases decreased in the 2021 Busan and Gyeongnam OSCE; however, the reliability was acceptable. In the D-study, reliability was maintained at 0.70 or higher if there were more than 21 items/case in 8 cases and more than 18 items/case in 9 cases. However, according to the G-study, increasing the number of items nested in cases rather than the number of cases could further improve reliability. The consortium needs to maintain a case bank with various items to implement a reliable blueprinting combination for the OSCE.


theory (CTT) and generalizability theory (GT).
CTT usually works well with multiple-choice tests, in which all examinees answer identical questions. However, CTT does not work well with clinical skill examinations in which students do not see the same patients, and the students are not evaluated by the same examiners simultaneously. Thus, the scores of examinees contain variations according to the examiners and clinical scenarios. These variations are a potential source of measurement error [2].
In GT, sources of variation are referred to as facets. These may include persons (students/examinees), raters (examiners), items, cases, and station settings. GT addresses how similar an examinee's score would be across different tests and scenarios. For example, it can indicate whether a result would generalize to a new objective structured clinical examination (OSCE) with more stations and fewer examiners. The purpose of GT is to quantify the components of the error caused by each facet and by the interactions between facets. GT analysis comprises 2 stages: a generalizability study (G-study) and a decision study (D-study). In the G-study, variance components (VCs) from the facets are estimated, and the reliability is calculated. There are 2 reliability indices: generalizability coefficients (G coefficients), which incorporate relative error variance and are used for normative assessments, and Phi coefficients, which contain absolute error variance and are used for criterion-based assessments. After the G-study, a post hoc projection of reliability is made from the estimated VCs through the D-study. By applying a simulated D-study, it is possible to investigate how the G coefficients would change under a different examination setting and consequently determine theoretically reliable settings for a clinical examination [3].
Therefore, GT is more informative than CTT for measuring the reliability of clinical examinations [4]. If the form of the OSCE changes, a reliability analysis must be performed subsequently, and the VCs should be analyzed.

Objectives
The research question of this study was whether the reliability was acceptable when the number of cases in the OSCE decreased from 12 to 8. This study aimed to examine the reliability of medical school OSCEs conducted in South Korea using GT.

Ethics statement
Since this study did not involve human subjects or human-originated materials, the requirement for informed consent was waived. The Institutional Review Board of Dong-A University approved this study protocol (IRB approval no. 2-1040709-AB-N-01-202206-HR-031-02).

Study design
This was an explorative study to model the implementation of GT. Specifically, this was a psychometric study aimed at measuring the reliability of the OSCE. The present study analyzed clinical skill examination data from 439 fourth-year medical students in the Busan and Gyeongnam areas of South Korea from July 12 to 15, 2021.

Setting
There are 5 medical schools in the Busan and Gyeongnam areas, located in the southeastern part of South Korea. These 5 medical schools form the Busan-Gyeongnam Clinical Skill Examination (BGCSE) consortium. Since 2014, the consortium has conducted joint clinical skill examinations annually as normative evaluations for third- and fourth-year medical students.
In the 2021 BGCSE, there was a change in the form of the OSCE due to changes in the Korean Medical Licensing Examination (KMLE) by the Korea Health Personnel Licensing Examination Institute. In 2022, the number of OSCE simulations of the KMLE was scheduled to be reduced from 12 to 10. However, the BGCSE consortium lacked the resources to operate all 10 simulations, which required a further reduction to 8. As a result, the OSCE comprised 7 stations where students encountered standardized patients (SPs) and 1 station where students performed procedures on a manikin. Table 1 shows the number of examinees, the topics of the cases, and the number of items in the cases on each OSCE day. The average number of items per case on each OSCE day was 20. Students were given 12 minutes at each station.
By 2020, the consortium had tracked the reliability of the OSCE using Cronbach's α, and it remained at an acceptable level (above 0.70). However, with the change in the 2021 OSCE, it was necessary to identify the reliability of the test and its error components. Consequently, the consortium decided to analyze the reliability using GT.
The examiners' training proceeded in the same way as usual. Physician examiners from 4 medical schools evaluated examinees' performance at each station by completing the checklist and assigning a value from global rating scales. The SPs' training also proceeded in the same way as usual. The experienced SP trainer trained SPs on scenarios for 2 hours, and they rehearsed for more than 2 hours. All SPs had more than 5 years of SP experience with the BGCSE consortium.

Participants
A total of 439 fourth-year medical students from 5 medical schools participated in the BGCSE at 4 medical school skill simulation centers for 4 days, from July 12 to 15, 2021.

Variables
In OSCEs, examples of facets usually include students (p), cases (c), items (i), and raters (r), among others. GT estimates the variance associated with each facet and provides information about the examination's measurement characteristics. For example, students (p) refer to the variability in scores between examinees that reflects the true difference in competency between students. A greater variance between students indicates that the difference is due to examinee competency, not measurement errors. Cases (c) refer to the variability in difficulty associated with SP encounters in the OSCE. In this study, examinees were randomly assigned to 8 of 23 cases. Items (i) refer to the variability in difficulty associated with checklist items within each case. Raters (r) refer to the variability among examiners. In this study, only 1 rater assessed each case. Thus, there was no variability caused by different raters. In the OSCE, there are interactions between facets. For instance, person-by-case (p × c) interactions indicate differences in student performance according to the cases. The proportion of VCs from each facet provides valuable information about the examination, such as whether the test discriminates high-performance students from low-performance students and whether the number of cases and items is sufficient for reliability.
In this study, we defined 3 facets, students (p), cases (c), and items (i), and designed them as p × (i:c) because items were nested in cases. Five types of VCs can be derived from this design: (1) p, (2) c, (3) i:c, (4) p × c, and (5) p × (i:c).
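For reference, the five VCs of this design combine into the two error variances and the two reliability indices as sketched below. These are the standard textbook expressions for a balanced, fully random p × (i:c) design and are not reproduced from the study itself. The symbols n_c (number of cases) and n_i (number of items per case) are introduced here for illustration.

```latex
\begin{align}
\sigma^2(\delta) &= \frac{\sigma^2(pc)}{n_c} + \frac{\sigma^2(pi{:}c)}{n_c\, n_i}
  && \text{(relative error variance)} \\
\sigma^2(\Delta) &= \frac{\sigma^2(c)}{n_c} + \frac{\sigma^2(i{:}c)}{n_c\, n_i} + \sigma^2(\delta)
  && \text{(absolute error variance)} \\
E\rho^2 &= \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\delta)}
  && \text{(G coefficient)} \\
\Phi &= \frac{\sigma^2(p)}{\sigma^2(p) + \sigma^2(\Delta)}
  && \text{(Phi coefficient)}
\end{align}
```

Under these expressions, the case and item-within-case main effects enter only the absolute error variance, whereas the p × c interaction and the residual enter both.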

Study outcomes
We set the primary outcomes as examining the reliability presented as G coefficients and analyzing the VCs on each OSCE examination day (G-study). We set the acceptable reliability level of G coefficients to 0.70 [5]. Since this examination was a normative evaluation, phi coefficient criteria were not set. We set the secondary outcomes as the D-study. Using estimates of VCs via the G-study, a post hoc projection of reliability was examined.
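The two outcomes can be illustrated with a minimal Python sketch that computes G and Phi coefficients from a set of VCs for a chosen number of cases and items per case. It assumes a balanced, fully random p × (i:c) design; the function names and the values in `example_vc` are hypothetical and are not the study's estimates.

```python
def relative_error_variance(vc, n_cases, n_items):
    """Relative error variance for a random p x (i:c) design."""
    return vc["pc"] / n_cases + vc["pi:c"] / (n_cases * n_items)


def absolute_error_variance(vc, n_cases, n_items):
    """Absolute error variance: adds the case and item-within-case effects."""
    return (vc["c"] / n_cases
            + vc["i:c"] / (n_cases * n_items)
            + relative_error_variance(vc, n_cases, n_items))


def g_coefficient(vc, n_cases, n_items):
    """Generalizability coefficient (for normative decisions)."""
    return vc["p"] / (vc["p"] + relative_error_variance(vc, n_cases, n_items))


def phi_coefficient(vc, n_cases, n_items):
    """Phi coefficient (for criterion-based decisions)."""
    return vc["p"] / (vc["p"] + absolute_error_variance(vc, n_cases, n_items))


# Hypothetical variance components, NOT the values estimated in this study.
example_vc = {"p": 0.004, "c": 0.001, "i:c": 0.110, "pc": 0.002, "pi:c": 0.085}

if __name__ == "__main__":
    # Project reliability for a few illustrative examination settings.
    for n_cases, n_items in [(8, 20), (8, 25), (10, 20)]:
        g = g_coefficient(example_vc, n_cases, n_items)
        phi = phi_coefficient(example_vc, n_cases, n_items)
        print(f"cases={n_cases} items/case={n_items} G={g:.3f} Phi={phi:.3f}")
```

Because Phi adds the case and item main effects to the error term, it can never exceed the G coefficient for the same setting.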

Data sources/measurement
The data analyzed in this study were from the BGCSE consortium. The scores of examinees' clinical performance were inserted by faculty examiners using a computer program, and the results were automatically processed. All data were recorded in an Excel spreadsheet (Microsoft Corp., Redmond, WA, USA) and available at Dataset 1.

Bias
No bias was found in the study scheme.

Study size
A sample size was not calculated due to the nature of the study design.

Statistical methods
Descriptive statistics for OSCE scores were calculated, including the mean and standard deviation of each case. The G-study and D-study were performed using G String IV ver. 6.3.8 (2013; Papaworx, Hamilton, ON, Canada). G String IV is a user-centered Windows program that applies GT to analyze empirical datasets. It uses Brennan's urGenova command-line program to perform the analogous analysis of variance procedure necessary to estimate VCs. It was designed and coded by Ralph Bloch at Papaworx as part of a project commissioned by the Medical Council of Canada. In 2018, G String V was released, and G String can be downloaded for free from the website papaworx.com.
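To make the analysis of variance step concrete, the following Python sketch estimates the five VCs from expected mean squares. It is not the code run by G String IV or urGenova. It assumes a balanced design (every case has the same number of items and every student takes every case), which is simpler than the actual OSCE data, and the function name and the `scores[p][c][i]` layout are hypothetical. Negative estimates are truncated to zero, a common convention that is also an assumption here.

```python
from statistics import mean


def estimate_variance_components(scores):
    """Estimate VCs for balanced p x (i:c) data given as scores[p][c][i]."""
    n_p, n_c, n_i = len(scores), len(scores[0]), len(scores[0][0])
    persons, cases, items = range(n_p), range(n_c), range(n_i)

    # Means needed for the sums of squares.
    grand = mean(scores[p][c][i] for p in persons for c in cases for i in items)
    m_p = [mean(scores[p][c][i] for c in cases for i in items) for p in persons]
    m_c = [mean(scores[p][c][i] for p in persons for i in items) for c in cases]
    m_pc = [[mean(scores[p][c]) for c in cases] for p in persons]
    m_ic = [[mean(scores[p][c][i] for p in persons) for i in items] for c in cases]

    # Sums of squares for each effect.
    ss_p = n_c * n_i * sum((m_p[p] - grand) ** 2 for p in persons)
    ss_c = n_p * n_i * sum((m_c[c] - grand) ** 2 for c in cases)
    ss_ic = n_p * sum((m_ic[c][i] - m_c[c]) ** 2 for c in cases for i in items)
    ss_pc = n_i * sum((m_pc[p][c] - m_p[p] - m_c[c] + grand) ** 2
                      for p in persons for c in cases)
    ss_res = sum((scores[p][c][i] - m_pc[p][c] - m_ic[c][i] + m_c[c]) ** 2
                 for p in persons for c in cases for i in items)

    # Mean squares (sum of squares divided by degrees of freedom).
    ms_p = ss_p / (n_p - 1)
    ms_c = ss_c / (n_c - 1)
    ms_ic = ss_ic / (n_c * (n_i - 1))
    ms_pc = ss_pc / ((n_p - 1) * (n_c - 1))
    ms_res = ss_res / ((n_p - 1) * n_c * (n_i - 1))

    # Solve the expected mean square equations for each component.
    estimates = {
        "p": (ms_p - ms_pc) / (n_c * n_i),
        "c": (ms_c - ms_pc - ms_ic + ms_res) / (n_p * n_i),
        "i:c": (ms_ic - ms_res) / n_p,
        "pc": (ms_pc - ms_res) / n_i,
        "pi:c": ms_res,
    }
    # Truncate negative estimates to zero.
    return {name: max(0.0, value) for name, value in estimates.items()}
```

The function needs at least 2 students, 2 cases, and 2 items per case so that every degree-of-freedom term is positive.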

Participants
A total of 439 medical students completed the BGCSE, and 128 faculty members participated as examiners.

Main results
The descriptive statistics are shown in Table 2. Raw score data of examinees for each OSCE day are available at Dataset 1.

Generalizability study
All G coefficients except that for July 14 were above 0.70. Items nested in cases (i:c) and residual errors [p × (i:c)] were the major sources of VCs on all examination days (Table 3). Table 4 shows the number of items per case required to reach an acceptable G coefficient for each number of cases. As the number of cases increased, the number of items required to reach acceptable reliability decreased. With 10 cases, 18 items per case were sufficient to secure reliability on all OSCE days, whereas with 8 cases, 21 items per case were required.
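The kind of projection summarized in Table 4 can be sketched in Python as a search for the smallest number of items per case that reaches the target G coefficient for a given number of cases. This is a minimal sketch assuming a balanced, fully random p × (i:c) design; the function names and the VC values are hypothetical and do not reproduce the study's estimates or its Table 4.

```python
def g_coefficient(vc_p, vc_pc, vc_res, n_cases, n_items):
    """G coefficient for a random p x (i:c) design."""
    rel_error = vc_pc / n_cases + vc_res / (n_cases * n_items)
    return vc_p / (vc_p + rel_error)


def min_items_per_case(vc_p, vc_pc, vc_res, n_cases, target=0.70, max_items=100):
    """Smallest items-per-case count reaching the target, or None if not reached."""
    for n_items in range(1, max_items + 1):
        if g_coefficient(vc_p, vc_pc, vc_res, n_cases, n_items) >= target:
            return n_items
    return None


if __name__ == "__main__":
    # Hypothetical variance components (student, student-by-case, residual).
    vc_p, vc_pc, vc_res = 0.004, 0.002, 0.085
    for n_cases in (8, 9, 10):
        needed = min_items_per_case(vc_p, vc_pc, vc_res, n_cases)
        print(f"{n_cases} cases: at least {needed} items per case")
```

The search returns `None` when the student-by-case interaction alone keeps the coefficient below the target, which signals that adding items cannot compensate and more cases would be needed.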

Key results
In the 2021 BGCSE, when the number of cases changed from 12 to 8, the G coefficient was at an acceptable level (above 0.70) except for 1 of the 4 examination days. Most VCs were attributed to the items nested in the case and residual error. If the stakes of the OSCE are changed and the reliability needs to be increased, increasing the number of items nested in each case rather than the number of cases would be reasonable.

Interpretation
According to a systematic review regarding real-world OSCE reliability, the overall reliability presented as α coefficients in medical school examinations was 0.66 (95% confidence interval, 0.62-0.70), which was below the generally accepted minimum reliability [6]. However, the reliability coefficients seem to depend on the purpose of the assessment. If the stakes are high, such as certification, professionals suggest a reliability of at least 0.90. However, for moderate-stakes assessments such as summative examinations in medical school, the reliability is expected to range from 0.80 to 0.89. Lower-stakes assessments, such as formative assessments or those administered by local faculty, would be expected to range from 0.70 to 0.79 [5]. The stakes of the BGCSE are considered low to moderate, as a formative assessment.
According to the D-study, there are 2 approaches to keeping G coefficients above 0.70. One is increasing the number of cases from 8 to 9 or 10, and the other is increasing the number of items nested in cases to more than 20 while maintaining the number of cases at 8.
Each approach has its advantages and disadvantages. Increasing the number of cases will increase reliability, but more resources are needed. If the number of cases rises to 10, the consortium must prepare 2 more cases. This means that an additional 32 physician examiners and 8 SPs will be needed. More staff for the operation of the OSCE and item developers for new cases will also need to join. More manikins and equipment for added stations will also be required. In this case, the consortium will have to consider the cost-effectiveness of the OSCE.
Increasing the number of items will also increase reliability. However, when developing cases, the number of items tends to depend on the case's topic. For example, as shown in Table 1, the vaccination counseling case (a 32-year-old woman is counseled about vaccination for her 9-month-old baby) included 22 items, since many key questions must be asked before vaccination, such as previous vaccination history, allergic reaction history, and current medication history. However, in the case of intimate partner violence (a 41-year-old woman with a swollen and bruised right eye), there may be fewer key questions. If we add superfluous items, these will have low assessment value and eventually reduce the validity of the case. Thus, it will not always be possible to increase the number of items to secure reliability.

Comparison with previous studies
It is well known that the major threat to reliable measurements in evaluating performance is case specificity [7]. Case specificity can be defined as a phenomenon in which student performance varies depending on the scenario [8]. This is because some students may have more prior knowledge or experience in some scenarios than others. Previous studies have shown that case specificity in multicase examinations is naturally a significant VC. Therefore, many cases are needed for a reliable test [9,10]. However, recent studies have shown that cases are not necessarily the main source of variance. Instead, the source of significant variance can be attributed to items nested in cases or other factors [11,12]. The findings of our study are consistent with recent studies because the proportion of VCs for cases was negligible, from 0.00% to 2.03% (Table 3). Therefore, in this examination, increasing the number of items per case can increase the reliability of the examination, since most of the VCs were from items nested in cases (i:c). This study found that if the OSCE was performed with 8 cases, the G coefficient was above 0.70 when the average number of items was above 21. This means that if the number of items in some cases is more than 21, the number of items in other cases could be less than 21. In this situation, a combination of cases with various numbers of items may be important in the blueprinting of the OSCE. The consortium should have sufficient cases with various numbers of items in the case bank.
Table note: The score of each case was converted to 100 points. OSCE, objective structured clinical examination; SD, standard deviation. a) The difference in overall score among the 4 groups was statistically significant (P<0.001) by analysis of variance with the Scheffe post hoc test.
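The blueprinting point in this section (8 cases whose average item count reaches 21, with some cases above and some below that number) can be illustrated with a small Python sketch that lists the case combinations meeting the average. The case names and item counts are hypothetical and are not taken from the consortium's case bank, and treating the threshold as "at least 21" is an assumption made here.

```python
from itertools import combinations


def blueprints_meeting_average(item_counts, n_cases=8, min_average=21):
    """Return case combinations whose mean item count is at least min_average."""
    selected = []
    for combo in combinations(sorted(item_counts), n_cases):
        total_items = sum(item_counts[name] for name in combo)
        # Compare totals to avoid floating-point division.
        if total_items >= min_average * n_cases:
            selected.append(combo)
    return selected


if __name__ == "__main__":
    # Hypothetical case bank: case label -> number of checklist items.
    bank = {"case_A": 22, "case_B": 18, "case_C": 24, "case_D": 20, "case_E": 21,
            "case_F": 23, "case_G": 19, "case_H": 22, "case_I": 20, "case_J": 25}
    options = blueprints_meeting_average(bank)
    print(f"{len(options)} combinations of 8 cases reach an average of 21 items")
```

With a large bank the number of combinations grows quickly, so a real blueprint would usually also filter by topic and skill category before checking the item average.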

Limitations
This study has some limitations. First, it was conducted by 1 consortium, although 5 medical schools participated. Applying the same OSCE may result in different findings depending on the student population. Second, items evaluating patient-physician interactions (PPIs) were excluded from the G-study. Because the number and contents of items evaluating PPIs are already set in all cases by the Korea Health Personnel Licensing Examination Institute, the consortium cannot modify them. Third, the items of the cases belong to categories such as history taking, physical examination, and patient education. The composition ratio of these categories may vary depending on the case. For each case, a sub-design using the p × (i:c) structure was possible. However, we did not analyze whether the number of items in the categories was appropriate because it was beyond our research question. Other studies on this topic should be conducted in the future.

Generalizability
Reliability analysis using GT can be applied to other OSCEs to identify the main sources of measurement error and to guide improvements in reliability.

Suggestions
Because there was 1 examiner for each case in this study, the rater (r) facet was not considered in the G-study design, and intrarater reliability was not verified. Further research is needed on this topic in the future.

Conclusion
In the 2021 BGCSE, the case number decreased from 12 to 8. However, the reliability was acceptable. In the D-study, reliability was maintained at 0.70 or higher if there were more than 21 items per case with 8 cases and more than 18 items per case with 9 cases. However, according to the G-study, increasing the number of items nested in cases rather than the number of cases could further improve reliability because most VCs were from items nested in cases. The consortium needs to maintain a case bank with a diverse number of items to implement reliable blueprinting for the OSCE.