Department of Medical Education, College of Medicine, Korea University, Seoul, Korea
© 2007, National Health Personnel Licensing Examination Board of the Republic of Korea
This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The panel received an ordered item booklet (OIB), which lists the items in order of difficulty, from easiest to hardest. To construct the OIB, item response theory (IRT) was applied, and the examinee ability at which the probability of a correct response equals 2/3 was used as each item's scale score.
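The OIB ordering can be sketched in Python. The text does not say which IRT model was used, so this sketch assumes a three-parameter logistic (3PL) model with hypothetical item parameters `(a, b, c)` and a scaling constant `D = 1.0`. Each item is located at the ability where its probability of a correct response reaches 2/3, and the items are then sorted by that location. The function names are illustrative, not taken from any particular software.

```python
import math

# Sketch only: assumes a 3PL IRT model; the item parameters below are hypothetical.

def rp_location(a, b, c=0.0, rp=2 / 3, D=1.0):
    """Return the ability (theta) at which P(correct) equals rp.

    3PL model: P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b))).
    Solving P(theta) = rp for theta gives
    theta = b + ln((rp - c) / (1 - rp)) / (D * a).
    """
    if a <= 0:
        raise ValueError("discrimination a must be positive")
    if not c < rp < 1:
        raise ValueError("rp must lie strictly between c and 1")
    return b + math.log((rp - c) / (1 - rp)) / (D * a)


def ordered_item_booklet(items, rp=2 / 3):
    """Sort items from easiest to hardest by their response-probability location.

    items maps item_id -> (a, b, c); returns a list of (item_id, location).
    """
    located = [(item_id, rp_location(a, b, c, rp)) for item_id, (a, b, c) in items.items()]
    return sorted(located, key=lambda pair: pair[1])


if __name__ == "__main__":
    # Hypothetical parameters (a, b, c) for three items.
    items = {"Q1": (1.0, -1.0, 0.0), "Q2": (2.0, 0.0, 0.0), "Q3": (0.5, -0.5, 0.0)}
    for page, (item_id, loc) in enumerate(ordered_item_booklet(items), start=1):
        print(f"page {page}: {item_id} at theta = {loc:.3f}")
```

In this hypothetical example, Q3 has a lower difficulty parameter `b` than Q2 but lands later in the booklet, because its low discrimination pushes the 2/3 point further up the ability scale.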
The panel was divided into small groups, typically of six to eight people. In these small groups, the panel examined each item in the OIB and discussed the knowledge and skills required to answer it correctly.
After the group discussion, each panelist set a cut score by placing a bookmark in the OIB, based on his or her own judgment of what students at the basic performance level should know and be able to do. The scale score of the bookmarked item was the first demarcation.
The panel then held a second small-group discussion to reconcile differences of opinion, after which the panelists marked their choices in the OIB; this was the second demarcation.
Next, pairs of small groups were combined into mid-sized groups so that no single individual could dominate the discussion, as can happen in small groups. After the discussion, the panelists were given another opportunity to change their choices and placed their bookmarks in the OIB again.
Finally, all panelists gathered in one place and discussed their choice of bookmarks. After the discussion, they were given a final opportunity to change their choice. They then placed the final bookmark on the OIB.
All final-round bookmark placements were collected, and their median was taken as the panel's recommended cut score.
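The final aggregation step can be sketched as follows. It assumes, hypothetically, that each panelist's bookmark is recorded as a 1-based page number in the OIB and that, as with the first demarcation, each bookmark is converted to the scale score of the bookmarked item before the median is taken. All names and values are illustrative.

```python
from statistics import median


def bookmark_cut_score(oib_locations, bookmarks):
    """Median of the scale scores of the bookmarked items.

    oib_locations: scale scores of the OIB items, easiest item first.
    bookmarks: one bookmarked page per panelist (1-based page numbers).
    """
    # The OIB must run from easiest to hardest.
    if any(later < earlier for earlier, later in zip(oib_locations, oib_locations[1:])):
        raise ValueError("OIB locations must be in ascending order")
    scores = []
    for page in bookmarks:
        if not 1 <= page <= len(oib_locations):
            raise ValueError(f"bookmark page {page} is outside the OIB")
        scores.append(oib_locations[page - 1])
    return median(scores)


if __name__ == "__main__":
    # Hypothetical six-item OIB and five final-round bookmarks.
    oib = [-1.2, -0.8, -0.5, -0.1, 0.3, 0.7]
    final_round = [3, 4, 4, 5, 3]
    print(bookmark_cut_score(oib, final_round))  # -0.1
```

With an even number of panelists, `statistics.median` averages the two middle scale scores, which may fall between two items; the text does not say how such cases were handled.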
Based on the cut score, the panel examined the items preceding the bookmark and wrote performance-level descriptors summarizing the knowledge, skills, and abilities that students at the basic performance level must be able to demonstrate.
The choice of a panel: Generally, the panel consists of subject experts, teachers, relevant administrators, and so on. In the case of a medical licensing examination, the panel may include doctors, professors of the basic and clinical sciences, medical administrators, and so on. Panelists should be content experts who know the characteristics of the examinees well. Opinions on the appropriate panel size vary, but at least 10 panelists are required, and 15–20 are ideal [6–9].
Achievement Level Description (ALD): Conceptualizing achievement levels starts from a policy definition. A government agency, such as the Ministry of Education & Human Resources Development or the Ministry of Health & Welfare, provides a policy definition of the achievement levels. Based on this policy definition, the panel discusses and describes the performance at each level. This description specifies what students at each level should know and be able to do in terms of knowledge, skills, and behaviors, and thereby provides an operational definition. Exemplary items can be provided as well. To save time, the facilitator may provide the panel with a preliminary ALD, so that the panel begins setting the standard with some consensus on the characteristics of a borderline group.
Practice: The panel practices with exemplary items. If the exam mixes different item types, panelists practice with each type. For performance items in particular, actual performance data should be provided so that the panel can get a sense of the examinees' level.
The first round of estimation: The panel is divided into small groups of three or four people. Panelists take the actual exam items and then check their answers against the answer key. Next, each panelist estimates, for each item, the probability that a borderline group of examinees would answer it correctly. The individual estimates are collected, and the cut score is set at the sum of the item medians. The cut score is posted together with the result of applying it, so that the panel has an opportunity to check whether it is realistic and to change it if necessary.
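The first-round computation can be sketched in Python. The estimates below are hypothetical. The sketch sums the per-item medians as described, and adds an illustrative `pass_rate` helper to show one way the result of applying the cut score might be reported; it assumes that an examinee passes when his or her raw score is at least the cut score.

```python
from statistics import median


def angoff_cut_score(estimates):
    """Raw cut score = sum over items of the median panelist estimate.

    estimates maps item_id -> list of probabilities (one per panelist)
    that a borderline examinee answers the item correctly.
    """
    for item_id, probs in estimates.items():
        if not probs or not all(0.0 <= p <= 1.0 for p in probs):
            raise ValueError(f"{item_id}: need at least one probability in [0, 1]")
    return sum(median(probs) for probs in estimates.values())


def pass_rate(raw_scores, cut_score):
    """Proportion of examinees whose raw score reaches the cut score (hypothetical rule)."""
    return sum(score >= cut_score for score in raw_scores) / len(raw_scores)


if __name__ == "__main__":
    # Hypothetical first-round estimates from three panelists.
    estimates = {
        "Q1": [0.6, 0.7, 0.8],
        "Q2": [0.4, 0.5, 0.5],
        "Q3": [0.9, 0.8, 0.85],
    }
    cut = angoff_cut_score(estimates)  # medians 0.7 + 0.5 + 0.85
    print(f"cut score: {cut:.2f} of {len(estimates)} items")
    print(f"pass rate: {pass_rate([1, 2, 3, 2, 3], cut):.0%}")
```

Taking the median per item, rather than the mean, limits the influence of a single panelist with extreme estimates on any one item.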
The second round of estimation: The panel is divided into mid-sized groups. Drawing on the ALD and their experience in the first round, panelists exchange opinions. They then estimate again, for each item, the probability that a borderline group of examinees would answer it correctly.
The third round of estimation: All the panelists gather in one place for discussion, focusing on the items whose estimates show large deviations. After the discussion, they make a third set of estimates, and the results are collected and posted. If three rounds are sufficient, the cut score obtained in the third round becomes the final cut score. This cut score is then transformed into a scale score.
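Identifying the items with large deviations before the third-round discussion can be sketched as below. The text does not define a deviation measure or a cutoff; this sketch assumes the population standard deviation of the panelists' estimates and a hypothetical threshold of 0.15, and it lists flagged items from most to least dispersed.

```python
from statistics import pstdev


def flag_divergent_items(estimates, threshold=0.15):
    """Return items whose estimates have a population SD above threshold.

    estimates maps item_id -> list of panelists' probability estimates.
    The 0.15 default is a hypothetical cutoff, not a published rule.
    """
    spreads = {item_id: pstdev(probs) for item_id, probs in estimates.items()}
    flagged = [item_id for item_id, sd in spreads.items() if sd > threshold]
    # Most-disputed items first, so the discussion can start with them.
    return sorted(flagged, key=lambda item_id: spreads[item_id], reverse=True)


if __name__ == "__main__":
    # Hypothetical second-round estimates from three panelists.
    second_round = {
        "Q1": [0.6, 0.7, 0.8],
        "Q2": [0.2, 0.5, 0.9],
        "Q3": [0.3, 0.8, 0.8],
    }
    print(flag_divergent_items(second_round))  # ['Q2', 'Q3']
```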
Description of minimum competency: If required, the panel describes what students at each level are able to do by analyzing the items and the students' responses to them.
This article is available from: http://jeehp.org/
Year | Examinees (N) | Examinees who passed the exam (N) | Pass rate (%) |
---|---|---|---|
2005 | 3,618 | 3,372 | 93.2 |
2004 | 3,881 | 3,760 | 96.9 |
2003 | 3,647 | 3,159 | 86.6 |
2002 | 3,578 | 3,314 | 92.6 |
2001 | 3,262 | 2,796 | 85.7 |
2000 | 2,961 | 2,772 | 93.6 |
1999 | 3,091 | 2,871 | 92.9 |
Do you think the current cut score is valid? Why or why not?

Yes (N=22) | No (N=16) |
---|---|
Criterion-referenced evaluation, such as the current cut score, is appropriate because the test is a licensing examination. | An absolute cut score is not valid when test difficulty fluctuates from year to year, as it does in Korea. |
Since almost all licensing examinations in Korea adopt the 60-40% cut score, it does not seem problematic. | The current test items are not appropriate for evaluating basic medical knowledge. |
Even though the current cut score is not valid in a theoretical sense, there is no alternative, considering the cost of setting standards and the leakage of test items. | The reliability and validity of the test should be checked, and skills as well as knowledge should be evaluated. |
| | The current cut score does not serve the social function of controlling the quality of doctors. |
Nation | Test | Standard setting methods |
---|---|---|
Australia | Australian Medical Council (AMC) examination for overseas-trained medical practitioners (no examination for citizens) | Transformed scores; general items (200) and core items (60) scored separately; general items: 250/500; core items: 300/500 |
Canada | Medical Council of Canada Qualifying Examination (MCCQE) Parts I & II | Part I: Nedelsky method; Part II: Angoff method; OSCE: Borderline Group method |
England | Professional and Linguistic Assessments Board (PLAB) | Written: Angoff method; Skills: Borderline Group method (under consideration) |
Ireland | Temporary Registration Assessment Scheme (TRAS) | Part I (MCQ): equalized score of 45%, penalty count system; Part II (OSCE): 85–90% |
New Zealand | New Zealand Registration Examination (NZREX) | Written: modified Angoff method; Skills: Contrasting Groups method |
U.S.A. | United States Medical Licensing Examination (USMLE) | Modified Angoff method |
U.S.A. | Comprehensive Osteopathic Medical Licensing Examination (COMLEX) | A criterion-referenced method is used; the pass rate is about 90%. |
Common features | Criterion-referenced methods are used. No country applies a fixed cut score to raw scores. | |
Feature | Bookmark | Modified Angoff |
---|---|---|
Process | The panel places a bookmark in the OIB at the hardest item that a borderline group is expected to answer correctly. | The panel estimates, for each item, the probability that a borderline group answers it correctly. |
Time | Within one day | At least two days |
Cost | Because the burden on the panel is light, costs are comparatively low, but an additional cost for the OIB must be budgeted. | Pre-analysis by psychometricians is required, but no OIB is needed. When the exam contains many items, the whole process takes a long time, which increases costs. |
Advantages | Shorter time and lower cost. Because it has already been applied in Korea, its validity has been tested. | As a classic psychometric method, it is easy to understand and explain. Relatively little preparation is required. |
Disadvantages | Preparation, such as building the OIB, is required. | Because an estimate must be made for every item, the panel has a heavy workload. |
OSCE: Objective Structured Clinical Examination; MCQ: Multiple Choice Questions.
OIB: Ordered Item Booklet.