-
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
-
Max Samuel Yudovich, Elizaveta Makarova, Christian Michael Hague, Jay Dilip Raman
-
J Educ Eval Health Prof. 2024;21:17. Published online July 8, 2024
-
DOI: https://doi.org/10.3352/jeehp.2024.21.17
-
-
1,652
View
-
293
Download
-
2
Web of Science
-
2
Crossref
-
Abstract
PDFSupplementary Material
- Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.
Methods In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
Results GPT-4 answered 44.4% of items correctly compared to 30.9% for GPT-3.5 (P<0.00001). GPT-4 (vs. GPT-3.5) had higher accuracy with urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items did not show significant differences in performance across versions. GPT-4 also outperformed GPT-3.5 with respect to recall (45.9% vs. 27.4%, P<0.00001), interpretation (45.6% vs. 31.5%, P=0.0005), and problem-solving (41.8% vs. 34.5%, P=0.56) type items. This difference was not significant for the higher-complexity items.
Conclusions ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standards for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate with respect to board examination items. For now, its responses should be scrutinized.
-
Citations
Citations to this article as recorded by
- From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
Markus Kipp Information.2024; 15(9): 543. CrossRef - Artificial Intelligence can Facilitate Application of Risk Stratification Algorithms to Bladder Cancer Patient Case Scenarios
Max S Yudovich, Ahmad N Alzubaidi, Jay D Raman Clinical Medicine Insights: Oncology.2024;[Epub] CrossRef
|