JEEHP : Journal of Educational Evaluation for Health Professions

Search results for "Certification": 3 articles
Research articles
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
Max Samuel Yudovich, Elizaveta Makarova, Christian Michael Hague, Jay Dilip Raman
J Educ Eval Health Prof. 2024;21:17.   Published online July 8, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.17
Abstract
Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.
Methods
In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
Results
GPT-4 answered 44.4% of items correctly, compared to 30.9% for GPT-3.5 (P<0.00001). GPT-4 (vs. GPT-3.5) had higher accuracy on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items showed no significant difference between versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, but the difference was not significant for the highest-complexity, problem-solving items (41.8% vs. 34.5%, P=0.56).
Conclusions
ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standards for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate with respect to board examination items. For now, its responses should be scrutinized.
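The headline comparison (44.4% vs. 30.9% accuracy on 700 items per model) can be checked with a standard pooled two-proportion z-test; a minimal sketch in Python, where the correct-answer counts (311 and 216) are back-calculated from the reported percentages rather than taken from the article's data:

```python
from math import sqrt, erfc

def two_prop_z(x1, n1, x2, n2):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)               # pooled success proportion
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, erfc(abs(z) / sqrt(2))             # normal-approximation p-value

# Counts approximated from 44.4% and 30.9% of 700 items each
z, p = two_prop_z(311, 700, 216, 700)
```

The resulting p-value falls well below the article's reported threshold of P<0.00001, consistent with the overall comparison.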

Citations

Citations to this article as recorded by  
  • Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions
    Kiera L Vrindten, Megan Hsu, Yuri Han, Brian Rust, Heili Truumees, Brian M Katt
    Cureus.2025;[Epub]     CrossRef
  • Assessing the performance of large language models (GPT-3.5 and GPT-4) and accurate clinical information for pediatric nephrology
    Nadide Melike Sav
    Pediatric Nephrology.2025;[Epub]     CrossRef
  • Retrieval-augmented generation enhances large language model performance on the Japanese orthopedic board examination
    Juntaro Maruyama, Satoshi Maki, Takeo Furuya, Yuki Nagashima, Kyota Kitagawa, Yasunori Toki, Shuhei Iwata, Megumi Yazaki, Takaki Kitamura, Sho Gushiken, Yuji Noguchi, Masataka Miura, Masahiro Inoue, Yasuhiro Shiga, Kazuhide Inage, Sumihisa Orita, Seiji Oh
    Journal of Orthopaedic Science.2025;[Epub]     CrossRef
  • From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
    Markus Kipp
    Information.2024; 15(9): 543.     CrossRef
  • Artificial Intelligence can Facilitate Application of Risk Stratification Algorithms to Bladder Cancer Patient Case Scenarios
    Max S Yudovich, Ahmad N Alzubaidi, Jay D Raman
    Clinical Medicine Insights: Oncology.2024;[Epub]     CrossRef
Performance of the Ebel standard-setting method for the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination consisting of multiple-choice questions  
Jimmy Bourque, Haley Skinner, Jonathan Dupré, Maria Bacchus, Martha Ainslie, Irene W. Y. Ma, Gary Cole
J Educ Eval Health Prof. 2020;17:12.   Published online April 20, 2020
DOI: https://doi.org/10.3352/jeehp.2020.17.12
Abstract
Purpose
This study aimed to assess the performance of the Ebel standard-setting method for the spring 2019 Royal College of Physicians and Surgeons of Canada internal medicine certification examination consisting of multiple-choice questions. Specifically, the following parameters were evaluated: inter-rater agreement, the correlations between Ebel scores and item facility indices, the impact of raters’ knowledge of correct answers on the Ebel score, and the effects of raters’ specialty on inter-rater agreement and Ebel scores.
Methods
Data were drawn from a Royal College of Physicians and Surgeons of Canada certification exam. The Ebel method was applied to 203 multiple-choice questions by 49 raters. Facility indices came from 194 candidates. We computed the Fleiss kappa and the Pearson correlations between Ebel scores and item facility indices. We investigated differences in the Ebel score according to whether correct answers were provided or not and differences between internists and other specialists using the t-test.
Results
The Fleiss kappa was below 0.15 for both facility and relevance. The correlation between Ebel scores and facility indices was low when correct answers were provided and negligible when they were not. The Ebel score was the same whether the correct answers were provided or not. Inter-rater agreement and Ebel scores were not significantly different between internists and other specialists.
Conclusion
Inter-rater agreement and correlations between item Ebel scores and facility indices were consistently low; furthermore, raters’ knowledge of the correct answers and raters’ specialty had no effect on Ebel scores in the present setting.
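The Fleiss kappa reported in the Results is computed from an items-by-categories matrix of rating counts; a minimal sketch, where the small example matrix is purely illustrative and not the study's data:

```python
def fleiss_kappa(mat):
    """Fleiss' kappa for a matrix of rating counts.

    mat[i][j] = number of raters assigning item i to category j;
    every row must sum to the same number of raters n.
    """
    n_items = len(mat)
    n = sum(mat[0])                              # raters per item
    # mean per-item observed agreement
    p_obs = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in mat) / n_items
    # chance agreement from marginal category proportions
    totals = [sum(row[j] for row in mat) for j in range(len(mat[0]))]
    p_exp = sum((t / (n_items * n)) ** 2 for t in totals)
    return (p_obs - p_exp) / (1 - p_exp)

# 3 items rated by 4 raters into 2 categories (illustrative)
kappa = fleiss_kappa([[4, 0], [0, 4], [2, 2]])
```

Values below 0.15, as observed in the study for both facility and relevance, indicate only slight agreement beyond chance under conventional interpretations.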

Citations

Citations to this article as recorded by  
  • Sustained Improvement in Medical Students' Academic Performance in Renal Physiology Through the Application of Flipped Classroom
    Thana Thongsricome, Rahat Longsomboon, Nattacha Srithawatpong, Sarunyapong Atchariyapakorn, Kasiphak Kaikaew, Danai Wangsaturaka
    The Clinical Teacher.2025;[Epub]     CrossRef
  • Investigating assessment standards and fixed passing marks in dental undergraduate finals: a mixed-methods approach
    Ting Khee Ho, Lucy O’Malley, Reza Vahid Roudsari
    BMC Medical Education.2025;[Epub]     CrossRef
  • Competency Standard Derivation for Point-of-Care Ultrasound Image Interpretation for Emergency Physicians
    Maya Harel-Sterling, Charisse Kwan, Jonathan Pirie, Mark Tessaro, Dennis D. Cho, Ailish Coblentz, Mohamad Halabi, Eyal Cohen, Lynne E. Nield, Martin Pusic, Kathy Boutis
    Annals of Emergency Medicine.2023; 81(4): 413.     CrossRef
  • The effects of a land-based home exercise program on surfing performance in recreational surfers
    Jerry-Thomas Monaco, Richard Boergers, Thomas Cappaert, Michael Miller, Jennifer Nelson, Meghan Schoenberger
    Journal of Sports Sciences.2023; 41(4): 358.     CrossRef
  • Medical specialty certification exams studied according to the Ottawa Quality Criteria: a systematic review
    Daniel Staudenmann, Noemi Waldner, Andrea Lörwald, Sören Huwendiek
    BMC Medical Education.2023;[Epub]     CrossRef
  • A Target Population Derived Method for Developing a Competency Standard in Radiograph Interpretation
    Michelle S. Lee, Martin V. Pusic, Mark Camp, Jennifer Stimec, Andrew Dixon, Benoit Carrière, Joshua E. Herman, Kathy Boutis
    Teaching and Learning in Medicine.2022; 34(2): 167.     CrossRef
  • Pediatric Musculoskeletal Radiographs: Anatomy and Fractures Prone to Diagnostic Error Among Emergency Physicians
    Winny Li, Jennifer Stimec, Mark Camp, Martin Pusic, Joshua Herman, Kathy Boutis
    The Journal of Emergency Medicine.2022; 62(4): 524.     CrossRef
  • Possibility of independent use of the yes/no Angoff and Hofstee methods for the standard setting of the Korean Medical Licensing Examination written test: a descriptive study
    Do-Hwan Kim, Ye Ji Kang, Hoon-Ki Park
    Journal of Educational Evaluation for Health Professions.2022; 19: 33.     CrossRef
  • Image interpretation: Learning analytics–informed education opportunities
    Elana Thau, Manuela Perez, Martin V. Pusic, Martin Pecaric, David Rizzuti, Kathy Boutis
    AEM Education and Training.2021;[Epub]     CrossRef
Review article
Overview and current management of computerized adaptive testing in licensing/certification examinations  
Dong Gi Seo
J Educ Eval Health Prof. 2017;14:17.   Published online July 26, 2017
DOI: https://doi.org/10.3352/jeehp.2017.14.17
Abstract
Computerized adaptive testing (CAT) has been used in high-stakes examinations such as the National Council Licensure Examination-Registered Nurses in the United States since 1994, and the National Registry of Emergency Medical Technicians in the United States adopted CAT for certifying emergency medical technicians in 2007. This review introduces the implementation of CAT for medical health licensing examinations. Most implementations of CAT are based on item response theory, which hypothesizes that both the examinee and the items have their own invariant characteristics. There are 5 steps for implementing CAT: first, determining whether the CAT approach is feasible for a given testing program; second, establishing an item bank; third, pretesting, calibrating, and linking item parameters via statistical analysis; fourth, determining the specifications for the final CAT in terms of the 5 components of the CAT algorithm; and finally, deploying the final CAT after specifying all the necessary components. The 5 components of the CAT algorithm are the item bank, the starting item, the item selection rule, the scoring procedure, and the termination criterion. CAT management includes content balancing, item analysis, item scoring, standard setting, practice analysis, and item bank updates. Remaining issues include the cost of constructing CAT platforms and deploying the computer technology required to build an item bank. In conclusion, to ensure more accurate estimation of examinees’ ability, CAT may be a good option for national licensing examinations, and measurement theory can support its implementation for high-stakes examinations.
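The 5 components named above (item bank, starting item, item selection rule, scoring procedure, termination criterion) can be sketched as a simple simulation loop; a minimal illustration using a 2PL model with maximum-information selection and EAP scoring, not the implementation of any of the named examinations, and with all item parameters invented for the example:

```python
import math
import random

def p_correct(theta, a, b):
    """2PL item response function: P(correct | theta) for item (a, b)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def eap(responses):
    """EAP ability estimate and posterior SD over an N(0,1) prior grid."""
    grid = [-4.0 + 0.1 * i for i in range(81)]
    post = []
    for t in grid:
        w = math.exp(-t * t / 2.0)               # standard normal prior
        for (a, b), u in responses:
            p = p_correct(t, a, b)
            w *= p if u else 1.0 - p
        post.append(w)
    s = sum(post)
    mean = sum(t * w for t, w in zip(grid, post)) / s
    var = sum((t - mean) ** 2 * w for t, w in zip(grid, post)) / s
    return mean, math.sqrt(var)

def run_cat(bank, respond, se_stop=0.35, max_items=30):
    """Administer items until the posterior SD falls below se_stop."""
    theta, se, used, responses = 0.0, 1.0, set(), []   # starting point
    while len(used) < min(max_items, len(bank)) and se > se_stop:
        # item selection rule: maximum information at the current estimate
        i = max((j for j in range(len(bank)) if j not in used),
                key=lambda j: item_info(theta, *bank[j]))
        used.add(i)
        responses.append((bank[i], respond(bank[i])))
        theta, se = eap(responses)                     # scoring procedure
    return theta, se, len(used)

# Simulated examinee with true ability 1.0 against a 21-item bank
random.seed(7)
bank = [(1.5, -2.0 + 0.2 * k) for k in range(21)]      # (a, b) per item
true_theta = 1.0
respond = lambda item: 1 if random.random() < p_correct(true_theta, *item) else 0
theta_hat, se, n_used = run_cat(bank, respond)
```

The termination criterion here is a posterior-SD threshold; operational programs combine this with content balancing and exposure control, as the review's management section notes.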

Citations

Citations to this article as recorded by  
  • From Development to Validation: Exploring the Efficiency of Numetrive, a Computerized Adaptive Assessment of Numerical Reasoning
    Marianna Karagianni, Ioannis Tsaousis
    Behavioral Sciences.2025; 15(3): 268.     CrossRef
  • Validation of the cognitive section of the Penn computerized adaptive test for neurocognitive and clinical psychopathology assessment (CAT-CCNB)
    Akira Di Sandro, Tyler M. Moore, Eirini Zoupou, Kelly P. Kennedy, Katherine C. Lopez, Kosha Ruparel, Lucky J. Njokweni, Sage Rush, Tarlan Daryoush, Olivia Franco, Alesandra Gorgone, Andrew Savino, Paige Didier, Daniel H. Wolf, Monica E. Calkins, J. Cobb S
    Brain and Cognition.2024; 174: 106117.     CrossRef
  • Comparison of real data and simulated data analysis of a stopping rule based on the standard error of measurement in computerized adaptive testing for medical examinations in Korea: a psychometric study
    Dong Gi Seo, Jeongwook Choi, Jinha Kim
    Journal of Educational Evaluation for Health Professions.2024; 21: 18.     CrossRef
  • The irtQ R package: a user-friendly tool for item response theory-based test data analysis and calibration
    Hwanggyu Lim, Kyungseok Kang
    Journal of Educational Evaluation for Health Professions.2024; 21: 23.     CrossRef
  • Implementing Computer Adaptive Testing for High-Stakes Assessment: A Shift for Examinations Council of Lesotho
    Musa Adekunle Ayanwale, Julia Chere-Masopha, Mapulane Mochekele, Malebohang Catherine Morena
    International Journal of New Education.2024;[Epub]     CrossRef
  • The current utilization of the patient-reported outcome measurement information system (PROMIS) in isolated or combined total knee arthroplasty populations
    Puneet Gupta, Natalia Czerwonka, Sohil S. Desai, Alirio J. deMeireles, David P. Trofa, Alexander L. Neuwirth
    Knee Surgery & Related Research.2023;[Epub]     CrossRef
  • Evaluating a Computerized Adaptive Testing Version of a Cognitive Ability Test Using a Simulation Study
    Ioannis Tsaousis, Georgios D. Sideridis, Hannan M. AlGhamdi
    Journal of Psychoeducational Assessment.2021; 39(8): 954.     CrossRef
  • Accuracy and Efficiency of Web-based Assessment Platform (LIVECAT) for Computerized Adaptive Testing
    Do-Gyeong Kim, Dong-Gi Seo
    The Journal of Korean Institute of Information Technology.2020; 18(4): 77.     CrossRef
  • Transformaciones en educación médica: innovaciones en la evaluación de los aprendizajes y avances tecnológicos (parte 2)
    Veronica Luna de la Luz, Patricia González-Flores
    Investigación en Educación Médica.2020; 9(34): 87.     CrossRef
  • Introduction to the LIVECAT web-based computerized adaptive testing platform
    Dong Gi Seo, Jeongwook Choi
    Journal of Educational Evaluation for Health Professions.2020; 17: 27.     CrossRef
  • Computerised adaptive testing accurately predicts CLEFT-Q scores by selecting fewer, more patient-focused questions
    Conrad J. Harrison, Daan Geerards, Maarten J. Ottenhof, Anne F. Klassen, Karen W.Y. Wong Riff, Marc C. Swan, Andrea L. Pusic, Chris J. Sidey-Gibbons
    Journal of Plastic, Reconstructive & Aesthetic Surgery.2019; 72(11): 1819.     CrossRef
  • Presidential address: Preparing for permanent test centers and computerized adaptive testing
    Chang Hwi Kim
    Journal of Educational Evaluation for Health Professions.2018; 15: 1.     CrossRef
  • Updates from 2018: Being indexed in Embase, becoming an affiliated journal of the World Federation for Medical Education, implementing an optional open data policy, adopting principles of transparency and best practice in scholarly publishing, and appreci
    Sun Huh
    Journal of Educational Evaluation for Health Professions.2018; 15: 36.     CrossRef
  • Linear programming method to construct equated item sets for the implementation of periodical computer-based testing for the Korean Medical Licensing Examination
    Dong Gi Seo, Myeong Gi Kim, Na Hui Kim, Hye Sook Shin, Hyun Jung Kim
    Journal of Educational Evaluation for Health Professions.2018; 15: 26.     CrossRef
