Purpose This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) on standardized urology multiple-choice items used in the United States.
Methods In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4 in February 2024, and responses were recorded. Items were categorized by topic and complexity (recall, interpretation, and problem-solving), and the accuracy of GPT-3.5 and GPT-4 was compared across item types.
Results GPT-4 answered 44.4% of items correctly, compared to 30.9% for GPT-3.5 (P<0.00001). GPT-4 was more accurate than GPT-3.5 on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items, while differences on endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items were not statistically significant. By complexity, GPT-4 outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, but the difference on the highest-complexity, problem-solving items was not significant (41.8% vs. 34.5%, P=0.56).
Conclusions ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, although GPT-4 outperformed GPT-3.5. Both versions scored below the proposed minimum passing standard (60%) for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity. As artificial intelligence grows more sophisticated, ChatGPT may become more accurate on board examination items; for now, its responses should be scrutinized.
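The headline comparison can be replicated with a standard two-proportion test. Below is a minimal Python sketch, with item-level counts back-calculated from the reported overall accuracies (44.4% and 30.9% of 700 items); the counts and the SciPy chi-square test are illustrative, not the authors' actual analysis.

```python
# Two-proportion comparison via a 2x2 chi-square test (SciPy).
# Counts are illustrative, back-calculated from the reported overall
# accuracies (GPT-4: 44.4% of 700 items; GPT-3.5: 30.9% of 700 items).
from scipy.stats import chi2_contingency

n_items = 700
correct_gpt4 = round(0.444 * n_items)   # ~311 items correct
correct_gpt35 = round(0.309 * n_items)  # ~216 items correct

table = [
    [correct_gpt4, n_items - correct_gpt4],    # GPT-4: correct, incorrect
    [correct_gpt35, n_items - correct_gpt35],  # GPT-3.5: correct, incorrect
]

chi2, p, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p:.2e}")  # p falls well below 0.0001
```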
Citations to this article as recorded by CrossRef
Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions. Kiera L Vrindten, Megan Hsu, Yuri Han, Brian Rust, Heili Truumees, Brian M Katt. Cureus. 2025;[Epub].
Assessing the performance of large language models (GPT-3.5 and GPT-4) and accurate clinical information for pediatric nephrology. Nadide Melike Sav. Pediatric Nephrology. 2025;[Epub].
Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination. Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Oh. Journal of Orthopaedic Science. 2025;[Epub].
From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance. Markus Kipp. Information. 2024;15(9):543.
Artificial Intelligence can Facilitate Application of Risk Stratification Algorithms to Bladder Cancer Patient Case Scenarios. Max S Yudovich, Ahmad N Alzubaidi, Jay D Raman. Clinical Medicine Insights: Oncology. 2024;[Epub].
Purpose This study aimed to assess the performance of the Ebel standard-setting method for the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination, which consists of multiple-choice questions. Specifically, the following parameters were evaluated: inter-rater agreement, the correlation between Ebel scores and item facility indices, the impact of raters’ knowledge of the correct answers on Ebel scores, and the effects of raters’ specialty on inter-rater agreement and Ebel scores.
Methods Data were drawn from a Royal College of Physicians and Surgeons of Canada certification examination. The Ebel method was applied to 203 multiple-choice questions by 49 raters, and facility indices were derived from the responses of 194 candidates. We computed the Fleiss kappa and the Pearson correlation between Ebel scores and item facility indices, and we used t-tests to examine whether Ebel scores differed according to whether correct answers were provided and whether inter-rater agreement and Ebel scores differed between internists and other specialists.
Results The Fleiss kappa was below 0.15 for both facility and relevance ratings. The correlation between Ebel scores and facility indices was low when correct answers were provided and negligible when they were not. Ebel scores did not differ according to whether correct answers were provided, and neither inter-rater agreement nor Ebel scores differed significantly between internists and other specialists.
Conclusion Inter-rater agreement and correlations between item Ebel scores and facility indices were consistently low; furthermore, raters’ knowledge of the correct answers and raters’ specialty had no effect on Ebel scores in the present setting.
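The two core statistics reported here, Fleiss' kappa for inter-rater agreement and the Pearson correlation between Ebel scores and facility indices, can be computed with standard Python libraries. The sketch below is illustrative only: the ratings and scores are synthetic placeholders, the 203-item and 49-rater dimensions come from the abstract, and the 3 rating categories are an assumption for the example.

```python
# Illustrative sketch of the reported analyses (synthetic data, not the
# study's): Fleiss' kappa for inter-rater agreement and the Pearson
# correlation between per-item Ebel scores and facility indices.
import numpy as np
from scipy.stats import pearsonr
from statsmodels.stats.inter_rater import fleiss_kappa

rng = np.random.default_rng(0)
n_items, n_raters, n_categories = 203, 49, 3  # 3 rating levels assumed

# ratings[i, j] = category assigned to item i by rater j (synthetic).
ratings = rng.integers(0, n_categories, size=(n_items, n_raters))

# fleiss_kappa expects an items x categories table of rating counts.
counts = np.zeros((n_items, n_categories), dtype=int)
for c in range(n_categories):
    counts[:, c] = (ratings == c).sum(axis=1)
print(f"Fleiss kappa: {fleiss_kappa(counts):.3f}")  # near 0 for random ratings

# Correlation between item Ebel scores and facility indices (synthetic).
ebel_scores = rng.uniform(0.4, 0.9, size=n_items)
facility = rng.uniform(0.2, 1.0, size=n_items)
r, p = pearsonr(ebel_scores, facility)
print(f"Pearson r = {r:.3f} (p = {p:.3f})")
```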
Citations to this article as recorded by CrossRef
Sustained Improvement in Medical Students' Academic Performance in Renal Physiology Through the Application of Flipped Classroom. Thana Thongsricome, Rahat Longsomboon, Nattacha Srithawatpong, Sarunyapong Atchariyapakorn, Kasiphak Kaikaew, Danai Wangsaturaka. The Clinical Teacher. 2025;[Epub].
Investigating assessment standards and fixed passing marks in dental undergraduate finals: a mixed-methods approach. Ting Khee Ho, Lucy O’Malley, Reza Vahid Roudsari. BMC Medical Education. 2025;[Epub].
Competency Standard Derivation for Point-of-Care Ultrasound Image Interpretation for Emergency Physicians. Maya Harel-Sterling, Charisse Kwan, Jonathan Pirie, Mark Tessaro, Dennis D. Cho, Ailish Coblentz, Mohamad Halabi, Eyal Cohen, Lynne E. Nield, Martin Pusic, Kathy Boutis. Annals of Emergency Medicine. 2023;81(4):413.
The effects of a land-based home exercise program on surfing performance in recreational surfers. Jerry-Thomas Monaco, Richard Boergers, Thomas Cappaert, Michael Miller, Jennifer Nelson, Meghan Schoenberger. Journal of Sports Sciences. 2023;41(4):358.
Medical specialty certification exams studied according to the Ottawa Quality Criteria: a systematic review. Daniel Staudenmann, Noemi Waldner, Andrea Lörwald, Sören Huwendiek. BMC Medical Education. 2023;[Epub].
A Target Population Derived Method for Developing a Competency Standard in Radiograph Interpretation. Michelle S. Lee, Martin V. Pusic, Mark Camp, Jennifer Stimec, Andrew Dixon, Benoit Carrière, Joshua E. Herman, Kathy Boutis. Teaching and Learning in Medicine. 2022;34(2):167.
Pediatric Musculoskeletal Radiographs: Anatomy and Fractures Prone to Diagnostic Error Among Emergency Physicians. Winny Li, Jennifer Stimec, Mark Camp, Martin Pusic, Joshua Herman, Kathy Boutis. The Journal of Emergency Medicine. 2022;62(4):524.
Possibility of independent use of the yes/no Angoff and Hofstee methods for the standard setting of the Korean Medical Licensing Examination written test: a descriptive study. Do-Hwan Kim, Ye Ji Kang, Hoon-Ki Park. Journal of Educational Evaluation for Health Professions. 2022;19:33.
Image interpretation: Learning analytics–informed education opportunities. Elana Thau, Manuela Perez, Martin V. Pusic, Martin Pecaric, David Rizzuti, Kathy Boutis. AEM Education and Training. 2021;[Epub].
Computerized adaptive testing (CAT) has been used in high-stakes examinations such as the National Council Licensure Examination-Registered Nurses in the United States since 1994, and the National Registry of Emergency Medical Technicians in the United States adopted CAT for certifying emergency medical technicians in 2007. This review introduces the implementation of CAT for health professions licensing examinations. Most implementations of CAT are based on item response theory, which posits that both the examinee and the items have their own stable characteristics. Implementing CAT involves 5 steps: first, determining whether the CAT approach is feasible for a given testing program; second, establishing an item bank; third, pretesting, calibrating, and linking item parameters via statistical analysis; fourth, determining the specifications for the final CAT in terms of the 5 components of the CAT algorithm; and finally, deploying the final CAT after all the necessary components have been specified. The 5 components of the CAT algorithm are the item bank, the starting item, the item selection rule, the scoring procedure, and the termination criterion; a minimal sketch of these components appears below. CAT management includes content balancing, item analysis, item scoring, standard setting, practice analysis, and item bank updates. Remaining issues include the cost of constructing CAT platforms and of deploying the computer technology required to build an item bank. In conclusion, to ensure more accurate estimation of examinees’ ability, CAT may be a good option for national licensing examinations, and measurement theory can support its implementation for high-stakes examinations.
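To make the 5 algorithm components concrete, the following minimal Python sketch simulates a CAT session under a 2-parameter logistic (2PL) IRT model. The simulated item bank, the starting rule (begin at ability 0), maximum-information item selection, grid-based maximum likelihood scoring, and the stopping rule (standard error below 0.3 or 40 items) are all simplified assumptions for illustration, not a production algorithm.

```python
# Minimal CAT loop under a 2PL IRT model: item bank, starting item,
# max-information item selection, grid-based ML scoring, and a
# standard-error termination criterion. Illustrative only.
import numpy as np

rng = np.random.default_rng(42)
a = rng.uniform(0.8, 2.0, size=300)  # item discrimination parameters
b = rng.normal(0.0, 1.0, size=300)   # item difficulty parameters
true_theta = 0.7                     # simulated examinee ability

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information of each item at ability theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1.0 - p)

grid = np.linspace(-4, 4, 161)       # ability grid for ML scoring
theta_hat, administered, responses = 0.0, [], []

while len(administered) < 40:        # hard cap on test length
    # Item selection: most informative unused item at the current estimate
    # (the starting item is simply the most informative item at theta = 0).
    info = information(theta_hat, a, b)
    info[administered] = -np.inf
    item = int(np.argmax(info))
    administered.append(item)

    # Simulate the examinee's response to the selected item.
    responses.append(rng.random() < p_correct(true_theta, a[item], b[item]))

    # Scoring: maximum likelihood over the ability grid.
    idx = np.array(administered)
    resp = np.array(responses, dtype=float)
    p = p_correct(grid[:, None], a[idx], b[idx])
    loglik = (resp * np.log(p) + (1 - resp) * np.log(1 - p)).sum(axis=1)
    theta_hat = float(grid[np.argmax(loglik)])

    # Termination: stop once the standard error is small enough.
    se = 1.0 / np.sqrt(information(theta_hat, a[idx], b[idx]).sum())
    if se < 0.3:
        break

print(f"theta_hat = {theta_hat:.2f} after {len(administered)} items (SE = {se:.2f})")
```

Operational programs layer the management tasks named above, such as content balancing and item exposure control, on top of this basic loop.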
Citations to this article as recorded by CrossRef
From Development to Validation: Exploring the Efficiency of Numetrive, a Computerized Adaptive Assessment of Numerical Reasoning. Marianna Karagianni, Ioannis Tsaousis. Behavioral Sciences. 2025;15(3):268.
Validation of the cognitive section of the Penn computerized adaptive test for neurocognitive and clinical psychopathology assessment (CAT-CCNB). Akira Di Sandro, Tyler M. Moore, Eirini Zoupou, Kelly P. Kennedy, Katherine C. Lopez, Kosha Ruparel, Lucky J. Njokweni, Sage Rush, Tarlan Daryoush, Olivia Franco, Alesandra Gorgone, Andrew Savino, Paige Didier, Daniel H. Wolf, Monica E. Calkins, J. Cobb S. Brain and Cognition. 2024;174:106117.
Comparison of real data and simulated data analysis of a stopping rule based on the standard error of measurement in computerized adaptive testing for medical examinations in Korea: a psychometric study. Dong Gi Seo, Jeongwook Choi, Jinha Kim. Journal of Educational Evaluation for Health Professions. 2024;21:18.
The irtQ R package: a user-friendly tool for item response theory-based test data analysis and calibration. Hwanggyu Lim, Kyungseok Kang. Journal of Educational Evaluation for Health Professions. 2024;21:23.
Implementing Computer Adaptive Testing for High-Stakes Assessment: A Shift for Examinations Council of Lesotho. Musa Adekunle Ayanwale, Julia Chere-Masopha, Mapulane Mochekele, Malebohang Catherine Morena. International Journal of New Education. 2024;[Epub].
The current utilization of the patient-reported outcome measurement information system (PROMIS) in isolated or combined total knee arthroplasty populations. Puneet Gupta, Natalia Czerwonka, Sohil S. Desai, Alirio J. deMeireles, David P. Trofa, Alexander L. Neuwirth. Knee Surgery & Related Research. 2023;[Epub].
Evaluating a Computerized Adaptive Testing Version of a Cognitive Ability Test Using a Simulation Study. Ioannis Tsaousis, Georgios D. Sideridis, Hannan M. AlGhamdi. Journal of Psychoeducational Assessment. 2021;39(8):954.
Accuracy and Efficiency of Web-based Assessment Platform (LIVECAT) for Computerized Adaptive Testing. Do-Gyeong Kim, Dong-Gi Seo. The Journal of Korean Institute of Information Technology. 2020;18(4):77.
Transformaciones en educación médica: innovaciones en la evaluación de los aprendizajes y avances tecnológicos (parte 2) [Transformations in medical education: innovations in learning assessment and technological advances (part 2)]. Veronica Luna de la Luz, Patricia González-Flores. Investigación en Educación Médica. 2020;9(34):87.
Introduction to the LIVECAT web-based computerized adaptive testing platform. Dong Gi Seo, Jeongwook Choi. Journal of Educational Evaluation for Health Professions. 2020;17:27.
Computerised adaptive testing accurately predicts CLEFT-Q scores by selecting fewer, more patient-focused questions. Conrad J. Harrison, Daan Geerards, Maarten J. Ottenhof, Anne F. Klassen, Karen W.Y. Wong Riff, Marc C. Swan, Andrea L. Pusic, Chris J. Sidey-Gibbons. Journal of Plastic, Reconstructive & Aesthetic Surgery. 2019;72(11):1819.
Presidential address: Preparing for permanent test centers and computerized adaptive testing. Chang Hwi Kim. Journal of Educational Evaluation for Health Professions. 2018;15:1.
Updates from 2018: Being indexed in Embase, becoming an affiliated journal of the World Federation for Medical Education, implementing an optional open data policy, adopting principles of transparency and best practice in scholarly publishing, and appreci. Sun Huh. Journal of Educational Evaluation for Health Professions. 2018;15:36.
Linear programming method to construct equated item sets for the implementation of periodical computer-based testing for the Korean Medical Licensing Examination. Dong Gi Seo, Myeong Gi Kim, Na Hui Kim, Hye Sook Shin, Hyun Jung Kim. Journal of Educational Evaluation for Health Professions. 2018;15:26.