JEEHP : Journal of Educational Evaluation for Health Professions
Research article
Performance of large language models on Thailand’s national medical licensing examination: a cross-sectional study
Prut Saowaprut, Romen Samuel Wabina*, Junwei Yang, Lertboon Siriwat

DOI: https://doi.org/10.3352/jeehp.2025.22.16
Published online: May 12, 2025

Department of Clinical Epidemiology and Biostatistics, Ramathibodi Hospital, Faculty of Medicine, Mahidol University, Bangkok, Thailand

*Corresponding email: romensamuel.wab@mahidol.ac.th

Editor: A Ra Cho, The Catholic University of Korea, Korea

• Received: March 15, 2025   • Accepted: May 2, 2025

© 2025 Korea Health Personnel Licensing Examination Institute

This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

  • Purpose
    This study aimed to evaluate the feasibility of general-purpose large language models (LLMs) in addressing inequities in medical licensure exam preparation for Thailand’s National Medical Licensing Examination (ThaiNLE), which currently lacks standardized public study materials.
  • Methods
    We assessed 6 multi-modal LLMs (GPT-4, GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, and Gemini 1.0/1.5 Pro) using a 304-question ThaiNLE Step 1 mock examination (10.2% image-based), applying deterministic API configurations and 5 inference repetitions per model. Performance was measured via micro- and macro-accuracy metrics compared against historical passing thresholds.
  • Results
    All models exceeded passing scores, with GPT-4o achieving the highest accuracy (88.9%; 95% confidence interval, 88.7–89.1), surpassing Thailand’s national average by more than 2 standard deviations. Claude 3.5 Sonnet (80.1%) and Gemini 1.5 Pro (72.8%) followed. Models demonstrated robustness across 17 of 20 medical domains, but variability was noted in genetics (74.0%) and cardiovascular topics (58.3%). While models demonstrated proficiency with images (Gemini 1.0 Pro: +9.9% vs. text), text-only accuracy remained superior (GPT-4o: 89.6% vs. 82.6%).
  • Conclusion
    General-purpose LLMs show promise as equitable preparatory tools for ThaiNLE Step 1. However, domain-specific knowledge gaps and inconsistent multi-modal integration warrant refinement before clinical deployment.
Background/rationale
Large language models (LLMs) have demonstrated remarkable capabilities in medical question-answering (QA) tasks, including interpreting complex terminology, clinical reasoning, and analyzing multi-modal data [1,2]. These models excel in standardized medical examinations such as the United States Medical Licensing Examination (USMLE), achieving accuracies exceeding 85% on Step 1 questions [3-5]. Their proficiency extends globally, with validated performance across diverse languages and healthcare systems, including those of Germany, Japan, Chile, and South Korea [6-9]. Recent advances in multi-modal LLMs (e.g., OpenAI’s ChatGPT, Google’s Gemini, Anthropic’s Claude) further enhance their utility by integrating visual and textual inputs—critical features for image-based medical assessments [3,10]. Despite these advancements, applications in low-resource settings remain underexplored, particularly in regions lacking standardized training materials.
In Thailand, medical students preparing for the Thai National Licensing Examination (ThaiNLE) face significant challenges due to the absence of publicly available, expert-validated study resources. Current preparation relies primarily on informally shared recollections of past exam questions, disproportionately disadvantaging students from rural institutions with limited peer networks [11]. This inequity highlights the urgent need for scalable, high-quality educational tools.
Objectives
We hypothesize that state-of-the-art LLMs can achieve passing scores on ThaiNLE Step 1 questions and demonstrate robust multi-modal capabilities, thereby serving as equitable preparatory resources that mitigate disparities in medical education access. To test this hypothesis, we evaluate GPT, Gemini, and Claude models using a comprehensive ThaiNLE mock examination dataset, assessing their accuracy, domain-specific performance, and image-text integration efficacy.
Ethics statement
Given that no human participants were involved, the study was exempt from institutional review board review.
Study design
A cross-sectional study was conducted.
Data sources
We utilized a mock examination dataset consisting of 304 multiple-choice questions in English. The dataset includes “general principles” (G) and “organ systems” (S) questions, organized into 20 groups, such as biochemistry and molecular biology, biology of cells, and human development and genetics (Table 1). This mock examination is widely used and has a difficulty level comparable to actual examinations, although somewhat simpler than Step 1 of the USMLE. Additionally, 10.2% of the questions included images. The number of questions per topic was imbalanced, reflecting the relative emphasis typically found in real examinations, though not exactly matching the official syllabus. A total of 19 board-certified physicians from various specialties—including internal medicine, pediatrics, and pathology—verified the answers. Any discrepancies in responses were resolved through a consensus majority approach, whereby the final answer reflected the response most frequently selected by the reviewers.
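To make the consensus step concrete, a minimal sketch of a majority-vote resolution is shown below; the function and its inputs (one answer letter per reviewer) are illustrative assumptions, not the authors' actual pipeline.

```python
from collections import Counter

def consensus_answer(reviewer_choices: list[str]) -> str:
    """Return the option letter chosen most often by the reviewers."""
    counts = Counter(choice.strip().upper() for choice in reviewer_choices)
    # most_common(1) yields [(letter, count)]; the most frequent letter is the consensus.
    return counts.most_common(1)[0][0]

# Example: 19 reviewers, majority selects "C".
print(consensus_answer(["C"] * 12 + ["B"] * 5 + ["A"] * 2))  # -> C
```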
Passing scores
The original examination consists of 300 questions covering topics aligned with the Medical Competency Assessment Criteria for National License 2012. For clarity, we renamed the topic identifiers: the original convention used “B1.x” for general principles topics and “B2–B11” for systems topics, and these were renamed “G” for general principles and “S” for systems.
We used the national passing scores from the Center for Medical Competency Assessment as benchmarks for evaluating LLM performance. Table 2 presents the passing scores and national averages for the main (summer) examinations from 2019 to 2024. The mean passing score for the main examination rounds from 2019 to 2024 was 52.3%, while the national average was 56.1%. The passing scores showed fluctuations over this 6-year period, with the highest passing percentage recorded in 2021 (53%) and the lowest in 2024 (51%). National average scores also exhibited variability, notably increasing to 63.05% in 2024—significantly higher than in previous years. The standard deviation ranged from 12.60 in 2020 to 16.08 in 2024, indicating varying degrees of score dispersion across these years.
Models and inference
We evaluated 6 multi-modal LLMs (GPT-4, GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Gemini 1.0 Pro, and Gemini 1.5 Pro) using their official Application Programming Interfaces (APIs) between March and June 2024, following established benchmarking protocols for medical QA systems [2]. All models were configured with zero temperature to ensure deterministic outputs, suppressing stochastic generation while maintaining clinical response fidelity. Token limits followed manufacturer specifications (32,768 tokens for GPT-4/Claude 3 Opus; undisclosed for Gemini models), sufficient for processing exam questions with integrated images (<5 MB/file). Model versions remained fixed during the evaluation period (Supplement 1), with API requests programmatically managed via Python’s requests library. No prompt engineering or fine-tuning was applied, maintaining consistency with prior medical benchmarking studies [3].
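As an illustration of this configuration, the sketch below shows how a deterministic request could be issued with Python's requests library against OpenAI's chat completions endpoint. The endpoint and payload shown are OpenAI-specific (Anthropic and Google expose different APIs), and the model name and environment variable are placeholders rather than the authors' exact code.

```python
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"  # OpenAI endpoint; other vendors differ
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "Content-Type": "application/json",
}

def ask_model(prompt: str, model: str = "gpt-4o") -> str:
    """Send one exam question with temperature 0 and return the raw model reply."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,  # deterministic decoding, as described above
    }
    response = requests.post(API_URL, headers=HEADERS, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```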
Each model answered the entire dataset in 5 inference runs. Questions were submitted individually to avoid confusing the model and to simulate how a student might submit questions one at a time. Answers from each run were compared with the ground truth and scored accordingly. Scores were then averaged across runs, and a 95% confidence interval (CI) was calculated.
A zero-shot prompt served as the foundational instruction outlining the specific question and task for the LLMs. The prompt instructed the model to output a single letter, clearly articulating the directive for the QA task. The exact prompt used was: “Given the following medical multiple-choice question, select the best answer from the provided options. Your response should be in the form of a single letter (A, B, C, D, E). Do not provide any explanation or additional information—only the letter or digit corresponding to the correct answer. Answer the following question.”
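A sketch of how this prompt might be combined with each question, and how the single-letter reply could be parsed, is given below; the question fields (stem, options) and the regular-expression parsing are illustrative assumptions, not the authors' implementation.

```python
import re

INSTRUCTION = (
    "Given the following medical multiple-choice question, select the best answer "
    "from the provided options. Your response should be in the form of a single letter "
    "(A, B, C, D, E). Do not provide any explanation or additional information—only the "
    "letter or digit corresponding to the correct answer. Answer the following question."
)

def build_prompt(stem: str, options: dict[str, str]) -> str:
    """Append the question stem and lettered options to the fixed zero-shot instruction."""
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    return f"{INSTRUCTION}\n\n{stem}\n{option_lines}"

def parse_answer(reply: str) -> str | None:
    """Extract the first standalone answer letter (A-E) from the model's reply."""
    match = re.search(r"\b([A-E])\b", reply.strip().upper())
    return match.group(1) if match else None
```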
Statistical methods
To evaluate the performance of the LLMs, we employed micro-average and macro-average metrics, along with their 95% CIs. Using both metrics provides a comprehensive evaluation of the LLMs’ performance. While micro-average gives an overall performance view, macro-average ensures performance in each subtopic is equally considered, preventing any topic from being neglected.
The micro-average is determined by the percentage of correct answers, as shown in Equation 1. It emphasizes overall performance by giving equal weight to each individual question.
(Equation 1)
$$\text{Micro-average} = \frac{\text{Number of correct answers}}{\text{Total number of questions}}$$
While the micro-average can be influenced by the most frequent classes, we also utilized the macro-average (Equation 2) because it provides equal weight to each subtopic, irrespective of the number of questions per topic, ensuring performance in all subtopics is equally evaluated.
(Equation 2)
$$\text{Macro-average} = \frac{1}{N}\sum_{i=1}^{N}\frac{\text{Number of correct answers in subtopic } i}{\text{Total number of questions in subtopic } i}$$
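For illustration, both metrics, together with the averaging over the 5 inference runs, can be computed along the following lines; the normal-approximation 95% CI shown here is an assumption, as the paper does not specify the exact CI construction.

```python
import statistics

def micro_average(correct_flags: list[bool]) -> float:
    """Equation 1: share of correct answers over all questions."""
    return sum(correct_flags) / len(correct_flags)

def macro_average(correct_by_topic: dict[str, list[bool]]) -> float:
    """Equation 2: mean of per-topic accuracies, each topic weighted equally."""
    per_topic = [sum(flags) / len(flags) for flags in correct_by_topic.values()]
    return sum(per_topic) / len(per_topic)

def mean_with_ci(run_scores: list[float], z: float = 1.96) -> tuple[float, float, float]:
    """Average a metric over repeated inference runs with a normal-approximation 95% CI."""
    mean = statistics.mean(run_scores)
    se = statistics.stdev(run_scores) / len(run_scores) ** 0.5
    return mean, mean - z * se, mean + z * se
```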
Main results
Table 3 shows the overall and group accuracies for GPT, Claude, and Gemini models (Dataset 1). The overall accuracy demonstrated a clear hierarchical trend, with GPT-4o achieving the highest performance among all models at 88.9% (95% CI, 88.7%–89.1%). GPT-4 followed with an accuracy of 83.3%, though it showed a wider CI (79.6%–87.0%), indicating variability in its results. Claude-3.5-Sonnet achieved 80.1% (79.6%–80.6%), notably higher than its predecessor, Claude-3-Opus, which reached 77.8% (77.4%–78.2%). The Gemini models showed the lowest performance, with Gemini-1.5-Pro obtaining 72.8% (72.2%–73.4%) and Gemini-1.0-Pro reaching 61.4% (61.3%–61.5%). In terms of macro-average accuracy, the results were consistent, reflecting similar hierarchical trends across the models. The high balanced accuracy of GPT-4o (89.1%; 95% CI, 88.8%–89.4%) indicates robust performance across diverse topics, confirming its reliability in handling varied medical content.
Comparison of model performance
To evaluate whether the performance differences were statistically significant, we conducted pairwise z-tests comparing the micro-average accuracies of all 6 models over the 304-question dataset. GPT-4o significantly outperformed all other models, including GPT-4 (P<0.001), Claude-3.5-Sonnet (P=0.0028), and Gemini-1.5-Pro (P<0.0001), with Cohen’s h values indicating moderate to large effect sizes. Claude-3.5-Sonnet also significantly outperformed Gemini-1.5-Pro (P=0.034) and Gemini-1.0-Pro (P<0.00001). These results confirmed that the observed performance differences were both statistically significant and practically meaningful.
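These comparisons can be reproduced with a standard two-proportion z-test and Cohen's h for effect size, roughly as sketched below; this is a generic implementation under the stated sample sizes, not the authors' analysis code.

```python
import math

def two_proportion_z(p1: float, p2: float, n1: int, n2: int) -> tuple[float, float]:
    """Two-sided two-proportion z-test; returns (z statistic, P-value)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))  # 2 * (1 - Phi(|z|))
    return z, p_value

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h effect size for the difference between two proportions."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# Example: GPT-4o (88.9%) vs. Gemini-1.5-Pro (72.8%) over 304 questions each.
print(two_proportion_z(0.889, 0.728, 304, 304))
print(cohens_h(0.889, 0.728))
```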
Results by topic
Fig. 1 presents the accuracy results categorized by general principles and systems. All models demonstrated higher accuracy on “general principles” questions than on “systems” questions. For general principles, Gemini-1.0-Pro achieved 64.1%, Gemini-1.5-Pro improved markedly to 84.6%, and Claude-3-Opus and Claude-3.5-Sonnet reached 80.2% and 84.7%, respectively. GPT-4 achieved 83.5%, while GPT-4o led with an accuracy of 90.0%. In contrast, performance in the systems category varied notably across models: Gemini-1.0-Pro scored 62.3%, Gemini-1.5-Pro improved to 67.9%, and Claude-3-Opus and Claude-3.5-Sonnet reached 74.5% and 79.1%, respectively. GPT-4 achieved 82.2%, and GPT-4o again outperformed all others with an accuracy of 88.3%. Overall, GPT-4o consistently delivered the highest accuracy across both general principles and systems. Table 4 illustrates the micro-accuracy of each model by topic.
Comparing questions with and without images
This section compares model accuracy on questions with and without images, highlighting variation in performance across models.
Questions with images
Micro-accuracy varied, with Gemini-1.0-Pro achieving 70.3% (69.0%–71.6%), and Gemini-1.5-Pro slightly lower at 68.4% (66.0%–70.8%). Claude-3-Opus and Claude-3.5-Sonnet improved accuracy to 72.9% (68.2%–77.6%) and 74.8% (73.5%–76.1%), respectively. GPT-4 had higher accuracy at 78.4% (75.0%–81.8%), with GPT-4o reaching the highest accuracy of 82.6% (81.1%–84.1%) (Table 5).
Most models performed better on text-only questions than on image-based questions (Fig. 2). Notably, Gemini-1.0-Pro performed better on image questions, showing an increase of 9.9 percentage points compared to text-only questions. GPT-4 and GPT-4o consistently outperformed other models in both categories, with GPT-4o achieving the highest overall accuracy. This variation in performance between image and non-image questions suggests differing levels of robustness across models in handling visual inputs.
Key results
This study aimed to assess the performance of LLMs on the ThaiNLE Step 1 examination by analyzing their overall and domain-specific performances in handling multimodal questions. All models surpassed the national passing thresholds, with GPT-4o achieving the highest overall accuracy and consistent domain-specific performance. Although the models demonstrated strong general capabilities, performance varied considerably across medical specialties and question types, especially for image-based questions.
Our evaluation of GPT, Gemini, and Claude models on the ThaiNLE mock examination revealed 3 key findings. First, all models exceeded historical passing thresholds, with GPT-4o demonstrating superior overall accuracy (88.9%; 95% CI, 88.7–89.1) and balanced performance across topics (macro-average 89.1%) (Table 3). GPT-4o also maintained dominance across both major question categories, achieving 90.0% accuracy in general principles and 88.3% in systems (Fig. 1), outperforming Claude-3.5-Sonnet (80.1% overall, 84.7% systems) and Gemini-1.5-Pro (72.8% overall, 67.9% systems).
Second, the performance variability across medical domains highlights model-specific limitations (Table 4). GPT-4’s weaker quantitative reasoning (74.0% in G9 vs. 99.2% in S4) aligns with its documented limitations regarding mathematical precision, while Claude-3.5-Sonnet’s inconsistent cardiovascular performance (96.7% in S1 vs. 58.3% in S5) might indicate insufficient exposure to procedural clinical scenarios. LLM variability across specialties is likely due to differences in training data and required clinical reasoning skills. For instance, the high accuracy in obstetrics and gynecology (84.6%) reflects standardized protocols and workflows in the specialty. Conversely, lower accuracy in psychiatry (78.0%–86.0%) may highlight challenges related to cultural factors (Thai-specific practices) and ethical ambiguities inherent in interpreting subjective symptoms.

Third, most models showed decreased accuracy on image-based questions compared to text-only questions. An exception was Gemini-1.0-Pro, which unexpectedly achieved higher accuracy (+9.9%) on visual questions despite poor textual performance (61.4% micro-average). This finding suggests its vision module may compensate for language-processing deficiencies—a trade-off warranting further investigation.
Interpretation
The observed performance hierarchy (GPT-4o > GPT-4 > Claude > Gemini) suggests architectural advantages in OpenAI’s models for medical QA tasks, especially their ability to maintain high macro-average accuracy across diverse topics. GPT-4o’s superior performance across 17 out of 20 topics (Table 4), including perfect scores in biochemistry (G2) and neurology (S4), demonstrates its robust reasoning capabilities. This advanced performance is likely attributable to its enhanced multimodal transformer architecture and larger effective context window, enabling richer contextual understanding and reliable clinical reasoning. Pairwise statistical comparisons confirmed that GPT-4o significantly outperformed all other models (P<0.001), with large effect sizes (Cohen’s h >0.8) observed in most comparisons. Meanwhile, differences between Claude-3.5-Sonnet and GPT-4, as well as between Gemini-1.0 and Gemini-1.5, were also statistically significant, albeit with smaller effect sizes. Persistent knowledge gaps—such as GPT-4o’s genetics accuracy (80.0% in G3) and Claude-3.5-Sonnet’s musculoskeletal performance (58.3% in S5)—appear unrelated to overall model size or training recency. Although GPT-4o consistently achieved the highest accuracy, its advantage was smaller in certain domains, likely due to topic complexity, uneven training data, or domain-specific terminology. Gemini-1.0-Pro’s superior image-based accuracy, despite poor textual performance, further highlights that vision-module benefits are highly model-specific and do not uniformly translate into superior multimodal performance.
Comparison with previous studies
Our findings align with previous studies evaluating multimodal LLM performance on medical examinations. Research conducted in Chile [8] and Japan [7] indicates that integrating visual data does not always enhance accuracy, whereas studies in Germany [6] and Taiwan [12] found that overall improvements in newer models do not necessarily translate into consistent gains across all medical domains. Similarly, USMLE-based evaluations confirm that, despite improvements in overall scores by newer models, significant variability remains in specific medical domains [4]. Our observation that the evaluated models passed the examination aligns closely with studies from Germany, Chile, and Taiwan [6,8,12], but contrasts with findings from Japan [7], where models failed to achieve passing scores.
Limitations
This study has 3 primary limitations: (1) restricted dataset access (304 questions) limits reproducibility and generalizability; (2) a small subset of image-based questions (n=31) precludes definitive conclusions regarding multimodal capabilities; and (3) the exclusion of fine-tuned models obscures potential performance enhancements achievable through domain adaptation. Additionally, focusing exclusively on multiple-choice questions neglects the complexity inherent in open-ended questions, where subjective interpretation and cultural nuances play substantial roles. The lack of real-world testing conditions further limits direct comparability to human performance.
Generalizability
Our findings suggest that high-performing LLMs have significant potential as globally scalable solutions for medical education, especially in resource-limited settings. Although the evaluation used Thai-language and region-specific content, the consistency of domain-specific performance variability parallels patterns seen internationally [4,6,12], supporting broader applicability. Nonetheless, generalizability may be limited by language nuances, cultural contexts, and curricular differences, highlighting the necessity for localized fine-tuning. Considerations such as cost-effectiveness (e.g., affordability of artificial intelligence [AI]-driven tutorials), technical accessibility (e.g., stable internet access), and instructor-student ratios should also be taken into account, as these factors could mitigate or exacerbate educational disparities in low-resource settings [13].
Suggestions
Future research should explore domain-specific fine-tuning and multilingual optimization to enhance LLM reliability, particularly in complex or underperforming domains. Additionally, comparative studies evaluating real-world deployment—such as classroom integration or personalized tutoring—may further elucidate their practical educational utility.
Conclusion
Our results confirm the hypothesis that LLMs can effectively address inequities in ThaiNLE preparation, as all models exceeded passing thresholds. GPT-4o’s narrow confidence intervals (88.7%–89.1%) suggest reliable performance, particularly beneficial to rural students lacking traditional resources. Nonetheless, the observed domain-specific weaknesses necessitate targeted improvements before widespread clinical deployment. By integrating these findings into medical education frameworks, institutions can deliver equitable, AI-driven support while maintaining rigorous competency standards.

Authors’ contributions

Conceptualization: PS, RSW. Data curation: PS, LS. Formal analysis: PS, RSW, JY. Methodology: PS, RSW, JY. Validation: PS, RSW, JY. Project administration: PS, LS. Writing–original draft: PS, RSW. Writing–review and editing: PS, RSW.

Conflict of interest

No potential conflict of interest relevant to this article was reported.

Funding

None.

Data availability

Data files are available from Harvard Dataverse: https://doi.org/10.7910/DVN/GIVTQY

Dataset 1. Model prediction data.

jeehp-22-16-dataset1.xlsx

Acknowledgments

The authors sincerely thank Kittinan Junsui for providing the mock dataset “Mock NL1 by RA53,” along with all other contributors whose efforts were essential in its development.

Supplementary materials
Supplementary files are available from Harvard Dataverse: https://doi.org/10.7910/DVN/GIVTQY
Supplement 1. Supplementary information and sample questions.
jeehp-22-16-suppl1.docx
Supplement 2. Audio recording of the abstract.
jeehp-22-16-abstract-recording.avi
Fig. 1.
Macro-accuracy in general principles and systems. CI, confidence interval.
jeehp-22-16f1.jpg
Fig. 2.
Comparison between questions with and without images.
jeehp-22-16f2.jpg
Graphical abstract
jeehp-22-16f3.jpg
Table 1.
Dataset characteristics (N=304)
Group Topic No. of questions
G1 Biochemistry and molecular biology 10
G2 Biology of cells 10
G3 Human development and genetics 5
G4 Normal immune responses 12
G5 Pathogenesis, pathophysiology, basic pathological processes, & lab investigation 25
G6 Gender, ethnic, and behavioral considerations affecting disease treatment and prevention, including psychosocial, cultural, occupational, and environmental 10
G7 Multisystem processes 2
G8 General pharmacology 12
G9 Quantitative methods 10
S2 Hematopoietic and lymphoreticular systems 22
S3 Central and peripheral nervous systems 22
S4 Skin and related connective tissue 12
S5 Musculoskeletal system and connective tissue 12
S6 Respiratory system 22
S7 Cardiovascular system 22
S8 Gastrointestinal system 22
S9 Renal/urinary system 30
S10 Reproductive system and perinatal period 22
S11 Endocrine system 22
Table 2.
Passing scores and national averages for main (summer) exams
Year Passing score (%) National average (%) Standard deviation
2019 52.33 55.40 13.30
2020 52.67 53.80 12.60
2021 53.00 56.10 14.10
2022 52.84 54.20 15.10
2023 51.67 53.79 14.38
2024 51.00 63.05 16.08
Table 3.
Overall results
Model Micro-average (%) Macro-average (%)
Gemini-1.0-Pro 61.4 (61.3–61.5) 63.1 (62.9–63.3)
Gemini-1.5-Pro 72.8 (72.2–73.4) 75.5 (75.0–76.0)
Claude-3-Opus 77.8 (77.4–78.2) 77.1 (76.9–77.3)
Claude-3.5-Sonnet 80.1 (79.6–80.6) 81.7 (81.5–81.9)
GPT-4 83.3 (79.6–87.0) 82.8 (78.7–86.9)
GPT-4o 88.9 (88.7–89.1) 89.1 (88.8–89.4)

Values are presented as average (95% confidence interval).

Table 4.
Micro-accuracy across each topic
Group G1.0-Pro G1.5-Pro C3-Opus C3.5-S GPT-4 GPT-4o Average (min–max)
G1 60.0 100.0 80.0 90.0 85.0 90.0 84.2 (60.0–100.0)
G2 90.0 100.0 100.0 90.0 100.0 100.0 96.7 (90.0–100.0)
G3 60.0 100.0 80.0 80.0 80.0 80.0 80.0 (60.0–100.0)
G4 66.7 76.7 85.0 85.0 83.3 91.7 81.4 (66.7–91.7)
G5 64.0 76.8 80.8 91.2 88.8 90.4 82.0 (64.0–91.2)
G6 80.0 80.0 88.0 78.0 83.0 86.0 82.5 (78.0–88.0)
G7 50.0 100.0 50.0 100.0 75.0 100.0 79.2 (50.0–100.0)
G8 66.7 58.3 80.0 68.3 82.5 91.7 74.6 (58.3–91.7)
G9 40.0 70.0 78.0 80.0 74.0 80.0 70.3 (40.0–80.0)
S1 83.3 50.0 66.7 96.7 80.0 83.3 76.7 (50.0–96.7)
S2 56.2 62.5 67.5 82.5 79.4 87.5 72.6 (56.2–87.5)
S3 59.1 68.2 67.3 76.4 82.7 86.4 73.4 (59.1–86.4)
S4 81.7 83.3 85.0 91.7 99.2 100.0 90.2 (81.7–100.0)
S5 58.3 55.0 61.7 58.3 67.5 83.3 64.0 (55.0–83.3)
S6 63.6 71.8 83.6 81.8 80.9 90.0 78.6 (63.6–90.0)
S7 45.5 68.2 70.9 64.5 80.0 82.7 68.6 (45.5–82.7)
S8 63.6 68.2 74.5 81.8 82.3 86.4 76.1 (63.6–86.4)
S9 46.7 62.0 75.3 78.7 83.0 90.0 72.6 (46.7–90.0)
S10 59.1 67.3 80.9 76.4 79.1 86.4 74.9 (59.1–86.4)
S11 68.2 90.9 86.4 81.8 90.5 95.5 88.9 (68.2–95.5)
Table 5.
Micro-accuracy for questions with images
Model With images (n=31) Without images (n=273)
Gemini-1.0-Pro 70.3 (69.0–71.6) 60.4 (60.4–60.4)
Gemini-1.5-Pro 68.4 (66.0–70.8) 73.3 (72.7–73.9)
Claude-3-Opus 72.9 (68.2–77.6) 78.4 (78.0–78.8)
Claude-3.5-Sonnet 74.8 (73.5–76.1) 80.7 (80.1–81.3)
GPT-4 78.4 (75.0–81.8) 83.8 (80.0–87.6)
GPT-4o 82.6 (81.1–84.1) 89.6 (89.4–89.8)
