Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study

Purpose We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME). Methods This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing). Results GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, whereas the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In assessing the educational value of the justifications provided by GPT-4 and Bing, neither showed significant differences in certainty, usefulness, or potential use in the classroom. Conclusion Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.


Introduction
Background/rationale: Recently, there has been growing interest in the performance of chatbots such as ChatGPT, Bing, Bard, and Claude on national licensing medical examinations (NLMEs). Some studies have reported outstanding performance, in which chatbots matched and even outperformed medical examinees [1-3]. However, there is a lack of studies comparing the performance of different chatbots, which hinders their potential use in classifying examination complexity [4]. Furthermore, studies exploring the quality of chatbot justifications for multiple-choice questions (MCQs), focusing on their educational value, are lacking in the current literature. In this study, we aimed to address these issues.

Objectives
In this study, we aimed to describe the performance and evaluate the educational value of justifications provided by ChatGPT (using GPT-3.5 and GPT-4), Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination (P-NLME) of 2023. The following objectives were addressed: to describe the performance of the chatbots on the P-NLME over 3 attempts per chatbot; to identify factors associated with correct answers provided by the chatbots on the P-NLME; and to assess the educational value of the justifications provided by the 2 top-performing chatbots in terms of certainty, usefulness, and potential use in the classroom setting.

Methods

Ethics statement
We did not seek institutional review board approval or informed consent for this study because we analyzed the performance of chatbots on an NLME rather than conducting human-subject research.

Study design
This cross-sectional analytical study compared the accuracy of chatbots (GPT-3.5, GPT-4, Bing, Claude, and Bard) with the historical performance of examinees on an NLME, which has been previously published [5]. Additionally, based on previous research, we assessed the factors associated with correct answers [1] and evaluated the educational value of the justifications provided by chatbots [6].

Setting and procedures
On July 25, 2023, we entered the 2023 P-NLME into the selected chatbots (GPT-3.5, GPT-4, Bing, Claude, and Bard) 3 times, as the responses provided by chatbots are not deterministic. The answers provided by the chatbots were saved in Microsoft Excel (Microsoft Corp.), and at least 2 medical educators categorized the MCQs according to the area and type of MCQ and whether they required Peru-specific knowledge.
Subsequently, the justifications provided by the 2 top-performing chatbots (Bing and GPT-4) were analyzed using items from previously published instruments that assess the quality of open-access medical education resources [7]. The prompt used is available in Supplement 1. The 180 MCQ items are available in Supplement 2.

Participants
The chatbots were counted as participants, resulting in 15 participants (5 chatbots with 3 attempts each). The evaluator team comprised 4 medical educators with training in developing and evaluating MCQs.

Variables
The dependent variable was the answer provided by the chatbot (correct or incorrect), with the correct choice defined as that selected by at least 3 medical educators. The independent variables were the area of the MCQ, the type of item, whether Peru-specific knowledge was required, and the educational value of the justifications.

Data sources/measurements
MCQs from the P-NLME were analyzed according to the area of medicine to which they belonged (surgery, internal medicine, pediatrics, obstetrics and gynecology, public health, and emergency medicine), following the P-NLME specifications table [8]. For item type, we classified MCQs into 2 categories: those that solely assessed the recall of information and those that evaluated the application of knowledge, involving decision-making in the form of diagnosis or treatment [9]. Regarding the requirement for Peru-specific knowledge, MCQs were categorized as "yes" if they assessed or required knowledge specific to Peru, such as epidemiological data, clinical practice guidelines, or diseases restricted to this country.
Finally, to evaluate educational value, we adapted the Academic Life in Emergency Medicine (ALiEM) Approved Instructional Resources (AIR) instrument [7,10]. We considered educational value to comprise the certainty, usefulness, and potential classroom use of the responses provided by the chatbots. Certainty was defined as the accuracy of the information provided in each chatbot response (GPT-4 and Bing). Usefulness was defined as the number of educational pearls (stand-alone clinically relevant details), and potential use in the classroom referred to the potential to use the response in hypothetical classes. The categorization of MCQs and the assessment of educational value were carried out by at least 2 independent authors (J.A.F.C., C.J.G.R., C.A.R.G., K.T.P.Q., J.D.G.A.), who had previous experience training medical examinees for the ENAM (Examen Nacional de Medicina). The rating scale employed to assess educational value is available in Supplement 3.

Study size
We analyzed all 180 MCQs from the 2023 P-NLME. Therefore, a sample size calculation was not required.

Statistical methods
Descriptive statistics were used to analyze the scores for each chatbot and the remaining categories; they are presented as absolute values along with their frequencies. We conducted an agreement test for each chatbot using the Fleiss kappa, considering a kappa <0.20 as indicating no agreement, 0.21 to 0.39 as minimal, 0.40 to 0.59 as weak, 0.60 to 0.79 as moderate, 0.80 to 0.90 as strong, and above 0.90 as almost perfect agreement [11]. Then, inferential statistics were employed, using the chi-square test to compare the highest rating on certainty, usefulness, and potential use between Bing and GPT-4, considering a P-value ≤0.05 as statistically significant. Additionally, we employed a bivariate logistic regression model to identify potential factors associated with correct answers for each chatbot's best attempt. All analyses were conducted using RStudio ver. 4.1.2 (RStudio).
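As an illustration only, the agreement computation and interpretation scale described above can be sketched as follows. The ratings matrix is hypothetical (not the study's data), and the Python `statsmodels` implementation is assumed as a stand-in for the study's R-based analysis:

```python
# Hypothetical example: 3 attempts (raters) per chatbot, each classifying
# every MCQ into one of 5 answer options (A-E). Rows are MCQs, columns are
# options; each cell counts how many attempts chose that option (rows sum to 3).
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

ratings = np.array([
    [3, 0, 0, 0, 0],  # all 3 attempts agree on option A
    [0, 3, 0, 0, 0],  # all agree on option B
    [2, 1, 0, 0, 0],  # partial agreement
    [0, 0, 3, 0, 0],  # all agree on option C
    [1, 1, 1, 0, 0],  # complete disagreement
])

kappa = fleiss_kappa(ratings)

def interpret(k):
    """Interpretation thresholds used in the study [11]."""
    if k < 0.20:
        return "no agreement"
    if k < 0.40:
        return "minimal"
    if k < 0.60:
        return "weak"
    if k < 0.80:
        return "moderate"
    if k <= 0.90:
        return "strong"
    return "almost perfect"

print(f"Fleiss kappa = {kappa:.3f} ({interpret(kappa)})")
```

With these hypothetical ratings the kappa falls in the "weak" band; the study applied the same computation per chatbot across all 180 MCQs.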
Results

The level of agreement among the chatbots is displayed in Table 1. Most chatbots exhibited strong agreement, except for Bard, which showed only moderate agreement. When analyzing the remaining categories, the level of agreement ranged from moderate to strong for all chatbots. In emergency medicine, GPT-4 demonstrated almost perfect agreement, whereas Bing and Claude showed no agreement.
Table 2 shows the best performance of each chatbot across the various categories. GPT-4 outperformed the other chatbots in all categories except obstetrics and gynecology, public health, MCQs requiring Peru-specific knowledge, and questions that evaluated recall rather than the application of knowledge; in these specific cases, Bing outperformed GPT-4.

Factors associated with correct answers
Table 3 presents the bivariate regression models for each chatbot. Although we analyzed multiple categories, some noteworthy associations emerged. Specifically, questions that required Peru-specific knowledge had lower odds of being answered correctly by GPT-4 (odds ratio [OR], 0.23; 95% confidence interval [CI], 0.09–0.61). Similarly, for questions solved by Bard that required the application of knowledge, the odds of being correct were lower (OR, 0.43; 95% CI, 0.16–0.99).
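For illustration, an odds ratio for a binary factor, as reported in Table 3, can be obtained from a 2×2 table of counts. The counts below are hypothetical (chosen only so the result lands near the reported OR of 0.23), and the Wald confidence interval mirrors the usual logistic regression output; this is a sketch, not the study's actual data or code:

```python
# Hypothetical 2x2 table for one chatbot:
# rows: MCQ requires Peru-specific knowledge (yes / no)
# cols: chatbot answer (correct / incorrect)
import numpy as np

table = np.array([
    [18, 12],    # Peru-specific: 18 correct, 12 incorrect (hypothetical)
    [130, 20],   # not Peru-specific: 130 correct, 20 incorrect (hypothetical)
])

a, b = table[0]
c, d = table[1]

# Cross-product odds ratio: odds of a correct answer for Peru-specific MCQs
# divided by the odds for non-Peru-specific MCQs.
odds_ratio = (a * d) / (b * c)

# Wald 95% CI on the log-odds scale (standard logistic-regression output).
se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se)

print(f"OR = {odds_ratio:.2f} (95% CI, {ci_low:.2f}-{ci_high:.2f})")
```

An OR below 1 here means Peru-specific MCQs had lower odds of being answered correctly, mirroring the direction of the GPT-4 finding.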

The educational value of responses provided by GPT-4 and Bing
We selected the best attempts from GPT-4 and Bing, including their corresponding responses and justifications, to assess their educational value. The findings are summarized in Table 4. Medical educators rated Bing's justifications as "full of educational pearls" more often than GPT-4's (42 versus 59 for GPT-4 and Bing, respectively). However, GPT-4 outperformed Bing in the number of responses containing 3 or more educational pearls (86 versus 59). Therefore, although the 2 chatbots exhibited different strengths in these categories, no statistically significant difference was observed when the 2 categories ("full of educational pearls" and "3 or more educational pearls") were combined (χ2=1.284, P=0.257).
Furthermore, for the item "Potential use of the justification provided by chatbots in classes," fewer than 20% of the justifications were rated as "I would not use anything" (13.33% for GPT-4 and 12.22% for Bing). For "Yes, I would use the entire explanation," there was no significant difference between GPT-4 and Bing (χ2=1.284, P=0.112).
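Comparisons like these can be sketched with a chi-square test on a contingency table. The counts below are hypothetical, not the study's data, and `scipy` is assumed here as a stand-in for the R-based analysis:

```python
# Hypothetical counts of justifications receiving the highest rating on one
# item ("Yes, I would use the entire explanation") versus any lower rating,
# out of 180 best-attempt justifications per chatbot.
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([
    [120, 60],   # GPT-4: highest rating, lower rating (hypothetical)
    [105, 75],   # Bing: highest rating, lower rating (hypothetical)
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.3f}, dof = {dof}, P = {p:.3f}")
if p > 0.05:
    print("No statistically significant difference at the 0.05 level")
```

Note that `chi2_contingency` applies the Yates continuity correction by default for 2×2 tables, which slightly lowers the statistic relative to the uncorrected chi-square.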
All research data are available in Dataset 1.

Discussion

Key results
Our major findings are as follows: (1) the chatbots' average performance was above the historical performance of Peruvian examinees on the P-NLME, with GPT-4 and Bing being the top performers; (2) we did not detect any associations between correct answers and the specific area of the MCQ, item type, or the requirement for Peru-specific knowledge for the majority of chatbots, except GPT-4; and (3) there were no statistically significant differences between GPT-4 and Bing in terms of certainty (P=0.777) or the potential use of responses in the classroom (P=0.112). Furthermore, there were no differences regarding the presence of 3 or more educational pearls in GPT-4 and Bing responses. These findings suggest the superiority of GPT-4 and Bing in written assessments in medical education regarding both performance and educational value.

Interpretation
The outstanding performance of the chatbots, mainly GPT-4 (87.2%) and Bing (84.4%), is not surprising, as previous studies have reported similar outcomes [1,2]. We hypothesized that internet access may impact specific categories, as it is well known that the performance of chatbots depends on the dataset used to train them, and Bing has access to the internet. This is supported by the fact that MCQs requiring Peru-specific knowledge were associated with lower odds of correct answers from GPT-4, whereas this tendency was not observed for Bing. This suggests that for educational purposes, commercial chatbots may need to be trained or tailored to a specific setting, such as the epidemiological data or beliefs particular to a country.
Regarding our major topic of interest, we evaluated the educational value of the justifications provided by GPT-4 and Bing. We found that their outstanding performance was not solely quantitative but also qualitative, with the chatbots providing "educational pearls" in more than 50% of their justifications.
Furthermore, the educators indicated that they would consider using the justifications in their classes. This shows the feasibility of their use in the teaching and learning process in medical education, a field not yet explored in the literature but suggested in a previous report [12].

Comparison with previous studies
Previous studies have compared the performance of commercial chatbots on NLMEs, with similar findings for GPT-4. On the Japanese NLME, it scored 79.9%, while examinees scored 84.9% [13]; on the Chinese NLME, it scored 84% [3]; and it scored 83.46% and 84.75% on the United States Medical Licensing Examination Step 1 and Step 2, respectively [2]. Furthermore, a previous study showed that GPT-4 scored 86% on the Peruvian NLME, whereas examinees scored 54% [1]. Therefore, the performance of GPT-4 across several NLMEs appears to be homogeneous and independent of the setting or language. Regarding educational value, one study showed that the acceptability of justifications provided by GPT-3.5 was 52% [6]. No other studies have assessed this aspect; therefore, a comprehensive comparison is lacking.

Limitations
While our study aimed to compare chatbots with examinees' historical performance on the P-NLME, this approach could introduce selection bias, and the findings may not be directly comparable because they come from different years. However, it is worth noting that the P-NLME is designed from a standardized test blueprint, intended to measure the same constructs consistently across years, which reduces this potential bias. Moreover, the assessment of "educational value" was conducted by medical educators, a factor that may introduce evaluation bias due to subjective interpretations. To mitigate this bias, the evaluation process was conducted in duplicate.

Generalizability
Given prior research, it is plausible that our findings may be extended to other NLMEs that adhere to a meticulous development process, such as those from Peru, China, Japan, and the United States.
Nevertheless, it is pivotal to note that our educational value findings involved trained medical educators experienced in crafting and assessing MCQs, potentially limiting generalizability across all contexts.
Additionally, considering the rapid advancement of chatbots, newer versions may surpass the iterations studied here within a few weeks or months of this study's publication.

Implications and suggestions
We employed and compared all available commercial chatbots; thus, we offer some perspective on how each chatbot performs on questions regarding medical knowledge. This can inform educators and students about which chatbots are more suitable for academic tasks. Additionally, we provided evidence that GPT-4 had lower odds of answering correctly when questions required Peru-specific knowledge, a phenomenon not observed in Bing. This may suggest that Bing is more suitable for non-English medical education tasks, such as explaining topics, developing MCQs, or other endeavors not yet explored. Future research should address this in more specific tasks, such as decision-making related to country-specific guidelines. We found that the justifications offered by both GPT-4 and Bing were deemed valuable by medical educators. However, it is important to recognize that our conclusions may not be universally applicable, as our own inherent biases and paradigms influenced them, and only 2 authors assessed each justification. Consequently, future research should explore the educational value of chatbot justifications by gathering perspectives from a broader spectrum of educators, ranging from novices to experienced educators.

Conclusion
Among the chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.

Fig. 1. Scores obtained on the Peruvian National Licensing Medical Examination by chatbots, compared with the average score of Peruvian examinees from 2009 to 2019.

Table 1. Agreement between the 3 attempts of each chatbot, calculated using the Fleiss kappa.

Table 2. Total and subgroup scores of the best attempt of each chatbot. Values are presented as number (%) or number.

Table 3. Factors associated with correct answers provided by chatbots in a bivariate logistic regression model. Values are presented as odds ratio (95% confidence interval). a) The odds ratio was statistically significant.

Table 4. Ratings of certainty, usefulness, and potential use in class for the best GPT-4 and Bing attempts. Values are presented as number (%). P-values are from the chi-square test comparing GPT-4 and Bing on the highest rating of each item. a) The difference is statistically significant according to the chi-square test.