
JEEHP : Journal of Educational Evaluation for Health Professions

Research article
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
Max Samuel Yudovich1*, Elizaveta Makarova2, Christian Michael Hague1, Jay Dilip Raman1

DOI: https://doi.org/10.3352/jeehp.2024.21.17 [Epub ahead of print]
Published online: July 8, 2024
1Department of Urology, Penn State Health Milton S. Hershey Medical Center, Hershey, PA, USA
2Penn State College of Medicine, Hershey, PA, USA
*Corresponding email: myudovich@pennstatehealth.psu.edu

Editor: Sun Huh, Hallym University, Korea

• Received: 11 June 2024   • Accepted: 27 June 2024

Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) on standardized urology multiple-choice items in the United States.
Methods
In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4 in February 2024, and the responses were recorded. Items were categorized by topic and by question complexity (recall, interpretation, and problem-solving), and the accuracy of GPT-3.5 and GPT-4 was compared across item types.
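The abstract does not state how the items were submitted (for example, through the ChatGPT web interface or the API) or what prompt wording was used. Purely as an illustrative sketch, assuming programmatic access through the OpenAI Python client, with assumed model identifiers, prompt wording, and item structure, a submission-and-scoring loop could look like the following:

```python
# Illustrative sketch only: the study does not report whether the ChatGPT web
# interface or the API was used, nor the exact prompt. The model names, prompt
# wording, and item structure below are assumptions, not reported details.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical item format: stem, lettered options, keyed answer, and labels.
items = [
    {
        "stem": "A 55-year-old man presents with ...",
        "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
        "answer": "B",
        "topic": "urologic oncology",
        "complexity": "recall",
    },
]

def ask(model: str, item: dict) -> str:
    """Submit one multiple-choice item and return the model's raw reply."""
    options = "\n".join(f"{k}. {v}" for k, v in item["options"].items())
    prompt = (
        f"{item['stem']}\n{options}\n"
        "Answer with the single letter of the best option."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

results = []
for item in items:
    for model in ("gpt-3.5-turbo", "gpt-4"):  # assumed model identifiers
        reply = ask(model, item)
        correct = reply[:1].upper() == item["answer"]
        results.append({"model": model, "topic": item["topic"],
                        "complexity": item["complexity"], "correct": correct})
```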
Results
GPT-4 answered 44.4% of items correctly, compared to 30.9% for GPT-3.5 (P<0.0001). GPT-4 (vs. GPT-3.5) had higher accuracy on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Differences for endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items were not statistically significant. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, whereas the difference for the higher-complexity problem-solving items (41.8% vs. 34.5%, P=0.56) was not significant.
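The abstract does not name the statistical test used for these comparisons. As a rough, assumption-laden check of the headline result, back-calculating the correct-answer counts from the reported percentages (44.4% and 30.9% of 700 items, roughly 311 and 216 correct) and applying a chi-square test to the resulting 2×2 table gives a P-value well below 0.0001, consistent with the reported significance:

```python
# Approximate re-check of the overall comparison. The per-model correct counts
# are back-calculated from the reported percentages, and the chi-square test is
# an assumption; the study does not state which test was used.
from scipy.stats import chi2_contingency

n_items = 700
correct_gpt4 = round(0.444 * n_items)    # ~311
correct_gpt35 = round(0.309 * n_items)   # ~216

table = [
    [correct_gpt4, n_items - correct_gpt4],    # GPT-4: correct, incorrect
    [correct_gpt35, n_items - correct_gpt35],  # GPT-3.5: correct, incorrect
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, dof={dof}, P={p:.2e}")  # P is well below 0.0001
```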
Conclusions
ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy of both versions fell below the proposed minimum passing standard (60%) for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity. As artificial intelligence models continue to advance, ChatGPT may become more capable and accurate on board examination items; for now, its responses should be scrutinized.

