
JEEHP : Journal of Educational Evaluation for Health Professions

OPEN ACCESS

Author index

Javier Alejandro Flores-Cohaila 2 Articles
Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study  
Betzy Clariza Torres-Zegarra, Wagner Rios-Garcia, Alvaro Micael Ñaña-Cordova, Karen Fatima Arteaga-Cisneros, Xiomara Cristina Benavente Chalco, Marina Atena Bustamante Ordoñez, Carlos Jesus Gutierrez Rios, Carlos Alberto Ramos Godoy, Kristell Luisa Teresa Panta Quezada, Jesus Daniel Gutierrez-Arratia, Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2023;20:30.   Published online November 20, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.30
  • 2,213 View
  • 198 Download
  • 9 Web of Science
  • 9 Crossref
Purpose
We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods
This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs by medical area, item type, and whether the MCQ required Peru-specific knowledge. They also assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results
GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; by comparison, the historical performance of Peruvian examinees was 55%. Among the factors examined for association with correct answers, only the need for Peru-specific knowledge was associated with lower odds of a correct answer (odds ratio, 0.23; 95% confidence interval, 0.09–0.61); the remaining factors showed no association. The educational value of the justifications provided by GPT-4 and Bing did not differ significantly in certainty, usefulness, or potential use in the classroom.
Conclusion
Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.

Citations

Citations to this article as recorded by  
  • Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
    Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki
    JMIR Medical Education.2024; 10: e57054.     CrossRef
  • Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins’ Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al
    Kay Durairaj, Omer Baker
    Facial Plastic Surgery & Aesthetic Medicine.2024; 26(3): 276.     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis
    Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi
    Journal of Medical Internet Research.2024; 26: e60807.     CrossRef
  • Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
    Giacomo Rossettini, Lia Rodeghiero, Federica Corradi, Chad Cook, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Stefania Chiappinotto, Silvia Gianola, Alvisa Palese
    BMC Medical Education.2024;[Epub]     CrossRef
  • Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments
    Oliver Vij, Henry Calver, Nikki Myall, Mrinalini Dey, Koushan Kouranloo, Thiago P. Fernandes
    PLOS ONE.2024; 19(7): e0307372.     CrossRef
  • Large Language Models in Pediatric Education: Current Uses and Future Potential
    Srinivasan Suresh, Sanghamitra M. Misra
    Pediatrics.2024;[Epub]     CrossRef
  • Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control
    Yan Wang, Lihua Liang, Ran Li, Yihua Wang, Changfu Hao
    Journal of Multidisciplinary Healthcare.2024; Volume 17: 3917.     CrossRef
  • Information amount, accuracy, and relevance of generative artificial intelligences’ answers to learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
    Hyunju Lee, Soo Bin Park
    Journal of Educational Evaluation for Health Professions.2023; 20: 39.     CrossRef
Factors associated with medical students’ scores on the National Licensing Exam in Peru: a systematic review  
Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2022;19:38.   Published online December 29, 2022
DOI: https://doi.org/10.3352/jeehp.2022.19.38
  • 3,916 View
  • 311 Download
  • 2 Crossref
Purpose
This study aimed to identify factors that have been studied for their associations with National Licensing Examination (ENAM) scores in Peru.
Methods
A search was conducted of literature databases and registers, including EMBASE, SciELO, Web of Science, MEDLINE, Peru’s National Register of Research Work, and Google Scholar. The following key terms were used: “ENAM” and “associated factors.” Studies in English and Spanish were included. The quality of the included studies was evaluated using the Medical Education Research Study Quality Instrument (MERSQI).
Results
In total, 38,500 participants were enrolled in 12 studies. Most (11/12) studies were cross-sectional, except for one case-control study. Three studies were published in peer-reviewed journals. The mean MERSQI score was 10.33. Better performance on the ENAM was associated with a higher grade point average (GPA) (n=8), internship setting in EsSalud (n=4), and regular academic status (n=3). Other factors showed associations in various studies, such as medical school, internship setting, age, gender, socioeconomic status, simulation tests, study resources, preparation time, learning styles, study techniques, test anxiety, and self-regulated learning strategies.
Conclusion
Performance on the ENAM is multifactorial; our model gives students a locus of control over what they can do to improve their scores (i.e., implement self-regulated learning strategies) and gives faculty, health policymakers, and managers a framework for improving ENAM scores (i.e., design remediation programs to improve GPA and integrate anxiety-management courses into the curriculum).

Citations

Citations to this article as recorded by  
  • Medical Student’s Attitudes towards Implementation of National Licensing Exam (NLE) – A Qualitative Exploratory Study
    Saima Bashir, Rehan Ahmed Khan
    Pakistan Journal of Health Sciences.2024; : 153.     CrossRef
  • Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study
    Javier A Flores-Cohaila, Abigaíl García-Vicente, Sonia F Vizcarra-Jiménez, Janith P De la Cruz-Galán, Jesús D Gutiérrez-Arratia, Blanca Geraldine Quiroga Torres, Alvaro Taype-Rondan
    JMIR Medical Education.2023; 9: e48039.     CrossRef
