Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study
Betzy Clariza Torres-Zegarra, Wagner Rios-Garcia, Alvaro Micael Ñaña-Cordova, Karen Fatima Arteaga-Cisneros, Xiomara Cristina Benavente Chalco, Marina Atena Bustamante Ordoñez, Carlos Jesus Gutierrez Rios, Carlos Alberto Ramos Godoy, Kristell Luisa Teresa Panta Quezada, Jesus Daniel Gutierrez-Arratia, Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2023;20:30. Published online November 20, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.30
Abstract
Purpose
We aimed to describe the performance of artificial intelligence chatbots (GPT-3.5, GPT-4, Bard, Claude, and Bing) on the Peruvian National Medical Licensing Examination (P-NLME) and to evaluate the educational value of the justifications they provided.
Methods
This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs by medical area and item type and noted whether each MCQ required Peru-specific knowledge. They also assessed the educational value of the justifications provided by the 2 top performers (GPT-4 and Bing).
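As a concrete illustration of the repeated-entry protocol, the sketch below shows how the 3 attempts per MCQ could be automated. This is a minimal sketch, not the authors' procedure: the abstract does not say whether questions were entered manually or via an API, nor how the 3 responses per question were combined, so the `ask_chatbot` stand-in and the majority-vote rule are both assumptions.

```python
from collections import Counter

def ask_chatbot(prompt: str) -> str:
    # Hypothetical stand-in for one chatbot call; in practice this would
    # wrap the web interface or API of GPT-3.5/GPT-4, Bing, Bard, or Claude.
    return "A"  # placeholder reply so the sketch runs end to end

def answer_mcq(stem: str, options: dict[str, str], attempts: int = 3) -> str:
    # Pose one MCQ `attempts` times and keep the most frequent answer letter
    # (majority vote; the abstract does not state the aggregation rule).
    prompt = stem + "\n" + "\n".join(f"{key}) {text}" for key, text in options.items())
    replies = [ask_chatbot(prompt).strip()[:1].upper() for _ in range(attempts)]
    return Counter(replies).most_common(1)[0][0]

print(answer_mcq("Which vector transmits bartonellosis in Peru?",
                 {"A": "Lutzomyia verrucarum", "B": "Aedes aegypti",
                  "C": "Anopheles darlingi", "D": "Triatoma infestans"}))
```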
Results
GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; for comparison, the historical performance of Peruvian examinees is 55%. Of the factors examined, only the requirement for Peru-specific knowledge was associated with lower odds of a correct answer (odds ratio, 0.23; 95% confidence interval, 0.09–0.61); the remaining factors showed no association. In the assessment of the educational value of the justifications provided by GPT-4 and Bing, no significant differences were found in certainty, usefulness, or potential use in the classroom.
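For readers unfamiliar with how such an interval is obtained, the sketch below recomputes a Wald 95% confidence interval for an odds ratio from a 2×2 table using Woolf's logit method. The counts are purely illustrative, chosen so the odds ratio lands near the reported 0.23; the study's raw counts are not given in the abstract.

```python
import math

# Hypothetical 2x2 table (NOT the study's raw data): rows split MCQs by
# whether they require Peru-specific knowledge; columns are correct/incorrect.
a, b = 15, 10    # Peru-specific MCQs: correct, incorrect (assumed counts)
c, d = 130, 20   # general MCQs:       correct, incorrect (assumed counts)

odds_ratio = (a * d) / (b * c)                 # cross-product ratio
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)   # SE of ln(OR), Woolf's method
low = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
high = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
print(f"OR = {odds_ratio:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
# -> OR = 0.23, 95% CI [0.09, 0.58] with these illustrative counts
```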
Conclusion
Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing can be deemed appropriate. However, it is essential to begin addressing the educational value of these chatbots, rather than merely their performance on examinations.
Citations
Citations to this article as recorded by Crossref
- Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki. JMIR Medical Education. 2024;10:e57054.
- Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins’ Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al
Kay Durairaj, Omer Baker. Facial Plastic Surgery & Aesthetic Medicine. 2024;26(3):276.
- Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
Xiaojun Xu, Yixiao Chen, Jing Miao. Journal of Educational Evaluation for Health Professions. 2024;21:6.
- Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis
Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi. Journal of Medical Internet Research. 2024;26:e60807.
- Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
Giacomo Rossettini, Lia Rodeghiero, Federica Corradi, Chad Cook, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Stefania Chiappinotto, Silvia Gianola, Alvisa Palese. BMC Medical Education. 2024;[Epub].
- Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments
Oliver Vij, Henry Calver, Nikki Myall, Mrinalini Dey, Koushan Kouranloo, Thiago P. Fernandes. PLOS ONE. 2024;19(7):e0307372.
- Large Language Models in Pediatric Education: Current Uses and Future Potential
Srinivasan Suresh, Sanghamitra M. Misra. Pediatrics. 2024;[Epub].
- Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control
Yan Wang, Lihua Liang, Ran Li, Yihua Wang, Changfu Hao. Journal of Multidisciplinary Healthcare. 2024;17:3917.
- Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam
Misaki Fujimoto, Hidetaka Kuroda, Tomomi Katayama, Atsuki Yamaguchi, Norika Katagiri, Keita Kagawa, Shota Tsukimoto, Akito Nakano, Uno Imaizumi, Aiji Sato-Boku, Naotaka Kishimoto, Tomoki Itamiya, Kanta Kido, Takuro Sanuki. Cureus. 2024;[Epub].
- Dermatological Knowledge and Image Analysis Performance of Large Language Models Based on Specialty Certificate Examination in Dermatology
Ka Siu Fan, Ka Hay Fan. Dermato. 2024;4(4):124.
- ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review
Alexandra Aster, Matthias Carl Laupichler, Tamina Rockwell-Kollmann, Gilda Masala, Ebru Bala, Tobias Raupach. Medical Science Educator. 2024;[Epub].
- PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation
Lucija Gosak, Gregor Štiglic, Lisiane Pruinelli, Dominika Vrbnjak. Journal of Nursing Scholarship. 2024;[Epub].
- Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study
Yikai Chen, Xiujie Huang, Fangjie Yang, Haiming Lin, Haoyu Lin, Zhuoqun Zheng, Qifeng Liang, Jinhai Zhang, Xinxin Li. BMC Medical Education. 2024;[Epub].
- Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis
Volodymyr Mavrych, Paul Ganguly, Olena Bolgova. Clinical Anatomy. 2024;[Epub].
- Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
Hyunju Lee, Soobin Park. Journal of Educational Evaluation for Health Professions. 2023;20:39.