Skip Navigation
Skip to contents

JEEHP : Journal of Educational Evaluation for Health Professions

OPEN ACCESS
SEARCH
Search

Search

Page Path
HOME > Search
1 "Xiomara Cristina Benavente Chalco"
Filter
Filter
Article category
Keywords
Publication year
Authors
Research article
Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study  
Betzy Clariza Torres-Zegarra, Wagner Rios-Garcia, Alvaro Micael Ñaña-Cordova, Karen Fatima Arteaga-Cisneros, Xiomara Cristina Benavente Chalco, Marina Atena Bustamante Ordoñez, Carlos Jesus Gutierrez Rios, Carlos Alberto Ramos Godoy, Kristell Luisa Teresa Panta Quezada, Jesus Daniel Gutierrez-Arratia, Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2023;20:30.   Published online November 20, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.30
  • 2,489 View
  • 205 Download
  • 9 Web of Science
  • 11 Crossref
AbstractAbstract PDFSupplementary Material
Purpose
We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods
This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results
GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, and the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In assessing the educational value of justifications provided by GPT-4 and Bing, neither showed any significant differences in certainty, usefulness, or potential use in the classroom.
Conclusion
Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better at Peru-specific MCQs. Moreover, the educational value of justifications provided by the GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.

Citations

Citations to this article as recorded by  
  • Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
    Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki
    JMIR Medical Education.2024; 10: e57054.     CrossRef
  • Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins' Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al
    Kay Durairaj, Omer Baker
    Facial Plastic Surgery & Aesthetic Medicine.2024; 26(3): 276.     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis
    Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi
    Journal of Medical Internet Research.2024; 26: e60807.     CrossRef
  • Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
    Giacomo Rossettini, Lia Rodeghiero, Federica Corradi, Chad Cook, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Stefania Chiappinotto, Silvia Gianola, Alvisa Palese
    BMC Medical Education.2024;[Epub]     CrossRef
  • Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments
    Oliver Vij, Henry Calver, Nikki Myall, Mrinalini Dey, Koushan Kouranloo, Thiago P. Fernandes
    PLOS ONE.2024; 19(7): e0307372.     CrossRef
  • Large Language Models in Pediatric Education: Current Uses and Future Potential
    Srinivasan Suresh, Sanghamitra M. Misra
    Pediatrics.2024;[Epub]     CrossRef
  • Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control
    Yan Wang, Lihua Liang, Ran Li, Yihua Wang, Changfu Hao
    Journal of Multidisciplinary Healthcare.2024; Volume 17: 3917.     CrossRef
  • Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam
    Misaki Fujimoto, Hidetaka Kuroda, Tomomi Katayama, Atsuki Yamaguchi, Norika Katagiri, Keita Kagawa, Shota Tsukimoto, Akito Nakano, Uno Imaizumi, Aiji Sato-Boku, Naotaka Kishimoto, Tomoki Itamiya, Kanta Kido, Takuro Sanuki
    Cureus.2024;[Epub]     CrossRef
  • Dermatological Knowledge and Image Analysis Performance of Large Language Models Based on Specialty Certificate Examination in Dermatology
    Ka Siu Fan, Ka Hay Fan
    Dermato.2024; 4(4): 124.     CrossRef
  • Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
    Hyunju Lee, Soobin Park
    Journal of Educational Evaluation for Health Professions.2023; 20: 39.     CrossRef

JEEHP : Journal of Educational Evaluation for Health Professions
TOP