JEEHP : Journal of Educational Evaluation for Health Professions

Search results: 81 articles for "Medical Education"
Research articles
Empirical effect of the Dr Lee Jong-wook Fellowship Program to empower sustainable change for the health workforce in Tanzania: a mixed-methods study
Masoud Dauda, Swabaha A. Yusuph, Harouni Yasini, Issa Mmbaga, Perpetua Mwambinngu, Hansol Park, Gyeongbae Seo, Kyoung Kyun Oh
J Educ Eval Health Prof. 2025;22:6.   Published online January 20, 2025
DOI: https://doi.org/10.3352/jeehp.2025.22.6    [Epub ahead of print]
  • 434 View
  • 64 Download
Abstract
Purpose
This study evaluated the Dr Lee Jong-wook Fellowship Program’s impact on Tanzania’s health workforce, focusing on relevance, effectiveness, efficiency, impact, and sustainability in addressing healthcare gaps.
Methods
A mixed-methods research design was employed. Data were collected from 97 out of 140 alumni through an online survey, 35 in-depth interviews, and one focus group discussion. The study was conducted from November to December 2023 and included alumni from 2009 to 2022. Measurement instruments included structured questionnaires for quantitative data and semi-structured guides for qualitative data. Quantitative analysis involved descriptive and inferential statistics (Spearman’s rank correlation, non-parametric tests) using Python ver. 3.11.0 and Stata ver. 14.0. Thematic analysis was employed to analyze qualitative data using NVivo ver. 12.0.
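As an illustration of the quantitative analysis described above, here is a minimal Python sketch of Spearman's rank correlation (the study reports using Python ver. 3.11.0); the per-alumnus scores below are hypothetical, not study data:

```python
# Minimal sketch of a Spearman rank correlation between two evaluation
# criteria; all scores are hypothetical illustrations (0-100 scale).
from scipy.stats import spearmanr

effectiveness = [86, 91, 78, 88, 95, 70, 84]
impact        = [88, 93, 75, 90, 96, 72, 85]

rho, p_value = spearmanr(effectiveness, impact)  # non-parametric correlation
print(f"Spearman rho={rho:.3f}, P={p_value:.4f}")
```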
Results
Findings indicated high relevance (mean=91.6, standard deviation [SD]=8.6), effectiveness (mean=86.1, SD=11.2), efficiency (mean=82.7, SD=10.2), and impact (mean=87.7, SD=9.9), with improved skills, confidence, and institutional service quality. However, sustainability had a lower score (mean=58.0, SD=11.1), reflecting challenges in follow-up support and resource allocation. Effectiveness strongly correlated with impact (ρ=0.746, P<0.001). The qualitative findings revealed that participants valued tailored training but highlighted barriers, such as language challenges and insufficient practical components. Alumni-led initiatives contributed to knowledge sharing, but limited resources constrained sustainability.
Conclusion
The Fellowship Program enhanced Tanzania’s health workforce capacity, but it requires localized curricula and strengthened alumni networks for sustainability. These findings provide actionable insights for improving similar programs globally, confirming the hypothesis that tailored training positively influences workforce and institutional outcomes.
Reliability and construct validation of the Blended Learning Usability Evaluation–Questionnaire with interprofessional clinicians in Canada: a methodological study
Anish Kumar Arora, Jeff Myers, Tavis Apramian, Kulamakan Kulasegaram, Daryl Bainbridge, Hsien Seow
J Educ Eval Health Prof. 2025;22:5.   Published online January 16, 2025
DOI: https://doi.org/10.3352/jeehp.2025.22.5    [Epub ahead of print]
  • 174 View
  • 49 Download
Abstract
Purpose
To generate Cronbach’s alpha and further mixed-methods construct validity evidence for the Blended Learning Usability Evaluation–Questionnaire (BLUE-Q).
Methods
Forty interprofessional clinicians completed the BLUE-Q after finishing a 3-month-long blended learning professional development program in Ontario, Canada. Reliability was assessed with Cronbach’s α for each of the 3 sections of the BLUE-Q and for all quantitative items together. Construct validity was evaluated through the framework of Grand-Guillaume-Perrenoud et al., which consists of 3 elements: congruence, convergence, and credibility. To compare quantitative and qualitative results, descriptive statistics, including means and standard deviations, were calculated for each Likert-scale item of the BLUE-Q.
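Cronbach's α for a section with k items is k/(k-1) × (1 - sum of item variances / variance of total scores). A minimal sketch of this computation with hypothetical 5-point ratings (not study data):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of Likert ratings."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # per-item variance
    total_var = items.sum(axis=1).var(ddof=1)   # variance of summed scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical ratings from 5 respondents on a 4-item section.
ratings = np.array([
    [5, 5, 4, 5],
    [4, 4, 4, 5],
    [5, 5, 5, 5],
    [3, 4, 3, 4],
    [5, 4, 5, 5],
])
print(round(cronbach_alpha(ratings), 2))
```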
Results
Cronbach’s α was 0.95 for the pedagogical usability section, 0.85 for the synchronous modality section, 0.93 for the asynchronous modality section, and 0.96 for all quantitative items together. Mean ratings (with standard deviations) were 4.77 (0.506) for pedagogy, 4.64 (0.654) for synchronous learning, and 4.75 (0.536) for asynchronous learning. Of the 239 qualitative comments received, 178 were identified as substantive, of which 88% were considered congruent and 79% were considered convergent with the high means. Among all congruent responses, 69% were considered confirming statements and 31% were considered clarifying statements, suggesting appropriate credibility. Analysis of the clarifying statements assisted in identifying 5 categories of suggestions for program improvement.
Conclusion
The BLUE-Q demonstrates high reliability and appropriate construct validity in the context of a blended learning program with interprofessional clinicians, making it a valuable tool for comprehensive program evaluation, quality improvement, and evaluative research in health professions education.
Empathy and tolerance of ambiguity in medical students and doctors participating in art-based observational training at the Rijksmuseum in Amsterdam, Netherlands: a before-and-after study
Stella Anna Bult, Thomas van Gulik
J Educ Eval Health Prof. 2025;22:3.   Published online January 14, 2025
DOI: https://doi.org/10.3352/jeehp.2025.22.3    [Epub ahead of print]
  • 221 View
  • 48 Download
Abstract
Purpose
This research presents an experimental study using validated questionnaires to quantitatively assess the outcomes of art-based observational training in medical students, residents, and specialists. The study tested the hypothesis that art-based observational training leads to measurable effects on judgment skills (tolerance of ambiguity) and empathy in medical students and doctors.
Methods
An experimental cohort study with pre- and post-intervention assessments was conducted using validated questionnaires and qualitative evaluation forms to examine the outcomes of art-based observational training in medical students and doctors. Between December 2023 and June 2024, 15 art courses were conducted in the Rijksmuseum in Amsterdam. Participants were assessed on empathy using the Jefferson Scale of Empathy (JSE) and tolerance of ambiguity using the Tolerance of Ambiguity in Medical Students and Doctors scale (TAMSAD).
Results
In total, 91 participants were included; 29 completed the JSE and 62 completed the TAMSAD. The results showed statistically significant post-test increases in mean JSE and TAMSAD scores (3.71 points on the JSE, whose scores range from 20 to 140, and 1.86 points on the TAMSAD, whose scores range from 0 to 100). The qualitative findings were predominantly positive.
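A minimal sketch of one standard way to test such paired pre/post gains, a paired t-test; the abstract does not name the authors' exact test, and the JSE totals below are hypothetical:

```python
from scipy.stats import ttest_rel

# Hypothetical paired JSE totals (scale range 20-140) before and after the
# museum course; invented for illustration, not study data.
pre  = [108, 115,  99, 120, 111, 104]
post = [112, 118, 104, 122, 116, 107]

t_stat, p_value = ttest_rel(post, pre)  # tests the post-pre differences
mean_gain = sum(b - a for a, b in zip(pre, post)) / len(pre)
print(f"mean gain={mean_gain:.2f} points, t={t_stat:.2f}, P={p_value:.4f}")
```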
Conclusion
The results suggest that incorporating art-based observational training in medical education improves empathy and tolerance of ambiguity. This study highlights the importance of art-based observational training in medical education in the professional development of medical students and doctors.
Inter-rater reliability and content validity of the measurement tool for portfolio assessments used in the Introduction to Clinical Medicine course at Ewha Womans University College of Medicine: a methodological study  
Dong-Mi Yoo, Jae Jin Han
J Educ Eval Health Prof. 2024;21:39.   Published online December 10, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.39
  • 364 View
  • 124 Download
Abstract
Purpose
This study aimed to examine the reliability and validity of a measurement tool for portfolio assessments in medical education. Specifically, it investigated scoring consistency among raters and assessment criteria appropriateness according to an expert panel.
Methods
A cross-sectional observational study was conducted from September to December 2018 for the Introduction to Clinical Medicine course at the Ewha Womans University College of Medicine. Data were collected for 5 randomly selected portfolios scored by a gold-standard rater and 6 trained raters. An expert panel assessed the validity of 12 assessment items using the content validity index (CVI). Statistical analysis included Pearson correlation coefficients for rater alignment, the intraclass correlation coefficient (ICC) for inter-rater reliability, and the CVI for item-level validity.
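A minimal sketch of two of the metrics named above: the Pearson correlation of a trained rater against the gold-standard rater, and the item-level content validity index (conventionally the share of experts rating an item 3 or 4 on a 4-point relevance scale). All numbers are invented for illustration:

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical scores for 5 portfolios: gold-standard rater vs. one trained
# rater, mirroring the rater-alignment analysis.
gold   = [82, 75, 90, 68, 88]
rater1 = [80, 77, 91, 70, 85]
r, _ = pearsonr(gold, rater1)
print(f"Pearson r={r:.4f}")

# Item-level content validity index (I-CVI): the proportion of experts
# rating an item relevant (3 or 4 on a 4-point scale); ratings are invented.
expert_ratings = np.array([4, 4, 3, 4, 2, 4, 3, 4, 4])
cvi = (expert_ratings >= 3).mean()
print(f"CVI={cvi:.2f}")
```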
Results
Rater 1 had the highest Pearson correlation with the gold-standard rater (0.8916), while Rater 5 had the lowest (0.4203). The ICC for all raters was 0.3821, improving to 0.4415 after excluding Raters 1 and 5, a 15.6% increase in reliability. Most assessment items met the CVI threshold of ≥0.75, with some achieving a perfect score (CVI=1.0); however, items such as “sources” and “level and degree of performance” fell below the threshold (CVI=0.72).
Conclusion
The present measurement tool for portfolio assessments demonstrated moderate reliability and strong validity, supporting its use as a credible tool. For a more reliable portfolio assessment, more faculty training is needed.
History article
History of the medical licensure system in Korea from the late 1800s to 1992
Sang-Ik Hwang
J Educ Eval Health Prof. 2024;21:36.   Published online December 9, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.36
  • 237 View
  • 71 Download
Abstract
The introduction of modern Western medicine in the late 19th century, notably through vaccination initiatives, marked the beginning of governmental involvement in medical licensure, with the licensing of doctors who performed vaccinations. The establishment of the national medical school “Euihakkyo” in 1899 further formalized medical education and licensure, granting graduates the privilege to practice medicine without additional examinations. The Regulations on Doctors, enacted by the Joseon government in 1900, aimed to comprehensively define doctor qualifications, covering both modern and traditional practitioners. However, resistance from the traditional medical community hindered its full implementation. During the Japanese colonial occupation of the Korean Peninsula from 1910 to 1945, the medical licensure system was controlled by colonial authorities, leading to the marginalization of traditional Korean medicine and the imposition of imperial hierarchical structures. Following liberation from Japanese colonial rule in 1945, the Korean government undertook significant reforms, culminating in the National Medical Law, enacted in 1951. This law redefined doctor qualifications and reinstated the status of traditional Korean medicine. The introduction of national examinations for physicians increased state involvement in ensuring medical competence. The privatization of the Korean Medical Licensing Examination led to the establishment of the Korea Health Personnel Licensing Examination Institute in 1992, which assumed responsibility for administering licensing examinations for all healthcare workers. This shift reflected a move towards specialized management of professional standards. The evolution of the medical licensure system in Korea illustrates a dynamic process shaped by the historical context, balancing the protection of public health with the rights of medical practitioners.
Research articles
Validation of the Blended Learning Usability Evaluation–Questionnaire (BLUE-Q) through an innovative Bayesian questionnaire validation approach  
Anish Kumar Arora, Charo Rodriguez, Tamara Carver, Hao Zhang, Tibor Schuster
J Educ Eval Health Prof. 2024;21:31.   Published online November 7, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.31
  • 576 View
  • 166 Download
  • 1 Web of Science
Abstract
Purpose
The primary aim of this study is to validate the Blended Learning Usability Evaluation–Questionnaire (BLUE-Q) for use in the field of health professions education through a Bayesian approach. As Bayesian questionnaire validation remains elusive, a secondary aim of this article is to serve as a simplified tutorial for engaging in such validation practices in health professions education.
Methods
A total of 10 health education-based experts in blended learning were recruited to participate in a 30-minute interviewer-administered survey. On a 5-point Likert scale, experts rated how well they perceived each item of the BLUE-Q to reflect its underlying usability domain (i.e., effectiveness, efficiency, satisfaction, accessibility, organization, and learner experience). Ratings were descriptively analyzed and converted into beta prior distributions. Participants were also given the option to provide qualitative comments for each item.
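A minimal sketch of one plausible way to turn such expert Likert ratings into a beta distribution and flag "low endorsement" items; the dichotomization rule (≥4 counts as endorsement), the uniform starting prior, and the 50% threshold are illustrative assumptions, not the authors' published procedure:

```python
from scipy.stats import beta

# Hypothetical expert ratings (1-5 Likert) for one BLUE-Q item; the
# elicitation rule below is an assumption for illustration only.
ratings = [5, 4, 2, 4, 3, 5, 4, 2, 3, 4]
endorse = sum(r >= 4 for r in ratings)   # treat 4-5 as endorsement
n = len(ratings)

dist = beta(1 + endorse, 1 + n - endorse)  # beta distribution over the endorsement rate
p_low = dist.cdf(0.5)                      # probability the true rate is below 50%
print(f"P(low endorsement) = {p_low:.3f}")
```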
Results
After reviewing the computed expert prior distributions, 31 quantitative items were identified as having a probability of “low endorsement” and were thus removed from the questionnaire. Additionally, qualitative comments were used to revise the phrasing and order of items to ensure clarity and logical flow. The BLUE-Q’s final version comprises 23 Likert-scale items and 6 open-ended items.
Conclusion
Questionnaire validation can generally be a complex, time-consuming, and costly process, inhibiting many from engaging in proper validation practices. In this study, we demonstrate that a Bayesian questionnaire validation approach can be a simple, resource-efficient, yet rigorous solution to validating a tool for content and item-domain correlation through the elicitation of domain expert endorsement ratings.
Performance of GPT-3.5 and GPT-4 on standardized urology knowledge assessment items in the United States: a descriptive study
Max Samuel Yudovich, Elizaveta Makarova, Christian Michael Hague, Jay Dilip Raman
J Educ Eval Health Prof. 2024;21:17.   Published online July 8, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.17
  • 2,035 View
  • 311 Download
  • 2 Web of Science
  • 3 Crossref
Abstract
Purpose
This study aimed to evaluate the performance of Chat Generative Pre-Trained Transformer (ChatGPT) with respect to standardized urology multiple-choice items in the United States.
Methods
In total, 700 multiple-choice urology board exam-style items were submitted to GPT-3.5 and GPT-4, and responses were recorded. Items were categorized based on topic and question complexity (recall, interpretation, and problem-solving). The accuracy of GPT-3.5 and GPT-4 was compared across item types in February 2024.
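A minimal sketch of one way to compare the two models' overall accuracy, a chi-square test on counts reconstructed from the reported percentages. The abstract does not name the authors' exact test; because both models answered the same 700 items, a paired test such as McNemar's would fit the item-level pairing, but only aggregate counts are available here:

```python
from scipy.stats import chi2_contingency

# Counts reconstructed from the reported accuracies on 700 items
# (44.4% -> ~311 correct for GPT-4; 30.9% -> ~216 for GPT-3.5).
table = [[311, 389],   # GPT-4: correct, incorrect
         [216, 484]]   # GPT-3.5: correct, incorrect
chi2, p, dof, _ = chi2_contingency(table)
print(f"chi-square={chi2:.1f}, P={p:.2e}")
```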
Results
GPT-4 answered 44.4% of items correctly, compared to 30.9% for GPT-3.5 (P<0.00001). GPT-4 (vs. GPT-3.5) had higher accuracy on urologic oncology (43.8% vs. 33.9%, P=0.03), sexual medicine (44.3% vs. 27.8%, P=0.046), and pediatric urology (47.1% vs. 27.1%, P=0.012) items. Endourology (38.0% vs. 25.7%, P=0.15), reconstruction and trauma (29.0% vs. 21.0%, P=0.41), and neurourology (49.0% vs. 33.3%, P=0.11) items showed no significant differences across versions. GPT-4 also outperformed GPT-3.5 on recall (45.9% vs. 27.4%, P<0.00001) and interpretation (45.6% vs. 31.5%, P=0.0005) items, but the difference was not significant for the higher-complexity problem-solving items (41.8% vs. 34.5%, P=0.56).
Conclusions
ChatGPT performs relatively poorly on standardized multiple-choice urology board exam-style items, with GPT-4 outperforming GPT-3.5. The accuracy was below the proposed minimum passing standards for the American Board of Urology’s Continuing Urologic Certification knowledge reinforcement activity (60%). As artificial intelligence progresses in complexity, ChatGPT may become more capable and accurate with respect to board examination items. For now, its responses should be scrutinized.

Citations

Citations to this article as recorded by  
  • Evaluating the Performance of ChatGPT4.0 Versus ChatGPT3.5 on the Hand Surgery Self-Assessment Exam: A Comparative Analysis of Performance on Image-Based Questions
    Kiera L Vrindten, Megan Hsu, Yuri Han, Brian Rust, Heili Truumees, Brian M Katt
    Cureus.2025;[Epub]     CrossRef
  • From GPT-3.5 to GPT-4.o: A Leap in AI’s Medical Exam Performance
    Markus Kipp
    Information.2024; 15(9): 543.     CrossRef
  • Artificial Intelligence can Facilitate Application of Risk Stratification Algorithms to Bladder Cancer Patient Case Scenarios
    Max S Yudovich, Ahmad N Alzubaidi, Jay D Raman
    Clinical Medicine Insights: Oncology.2024;[Epub]     CrossRef
Educational/Faculty development material
The 6 degrees of curriculum integration in medical education in the United States  
Julie Youm, Jennifer Christner, Kevin Hittle, Paul Ko, Cinda Stone, Angela D. Blood, Samara Ginzburg
J Educ Eval Health Prof. 2024;21:15.   Published online June 13, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.15
  • 2,994 View
  • 414 Download
Abstract
Despite explicit expectations and accreditation requirements for integrated curricula, there is little consensus on an accepted common definition, best practices for implementation, or criteria for successful curriculum integration. To address this lack of consensus, we reviewed the literature and herein propose a definition of curriculum integration for the medical education audience. We further believe that medical education is ready to move beyond “horizontal” (1-dimensional) and “vertical” (2-dimensional) integration and propose a model of “6 degrees of curriculum integration” to expand the 2-dimensional concept for future designs of medical education programs and best prepare learners to meet the needs of patients. These 6 degrees are: interdisciplinary, timing and sequencing, instruction and assessment, incorporation of basic and clinical sciences, knowledge and skills-based competency progression, and graduated responsibilities in patient care. We encourage medical educators to look beyond 2-dimensional integration to this holistic and interconnected representation of curriculum integration.
Research articles
Redesigning a faculty development program for clinical teachers in Indonesia: a before-and-after study
Rita Mustika, Nadia Greviana, Dewi Anggraeni Kusumoningrum, Anyta Pinasthika
J Educ Eval Health Prof. 2024;21:14.   Published online June 13, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.14
  • 1,247 View
  • 299 Download
Abstract
Purpose
Faculty development (FD) is important to support teaching, including for clinical teachers. Since 2008, the Faculty of Medicine Universitas Indonesia (FMUI) has conducted a clinical teacher training program developed by its medical education department, both for FMUI teachers and for those at other centers in Indonesia. However, participation is often challenging due to clinical, administrative, and research obligations. The coronavirus disease 2019 pandemic heightened the need to transform this program. This study aimed to redesign and evaluate an FD program for clinical teachers that focuses on their needs and current situation.
Methods
A 5-step design thinking framework (empathizing, defining, ideating, prototyping, and testing) was used with a pre/post-test design. Design thinking made it possible to develop a participant-focused program, while the pre/post-test design enabled an assessment of the program’s effectiveness.
Results
Seven medical educationalists and 4 senior and 4 junior clinical teachers participated in a group discussion in the empathize phase of design thinking. The research team formed a prototype of a 3-day blended learning course, with an asynchronous component using the Moodle learning management system and a synchronous component using the Zoom platform. Pre-post-testing was done in 2 rounds, with 107 and 330 participants, respectively. Evaluations of the first round provided feedback for improving the prototype for the second round.
Conclusion
Design thinking enabled an innovative-creative process of redesigning FD that emphasized participants’ needs. The pre/post-testing showed that the program was effective. Combining asynchronous and synchronous learning expands access and increases flexibility. This approach could also apply to other FD programs.
Challenges and potential improvements in the Accreditation Standards of the Korean Institute of Medical Education and Evaluation 2019 (ASK2019) derived through meta-evaluation: a cross-sectional study  
Yoonjung Lee, Min-jung Lee, Junmoo Ahn, Chungwon Ha, Ye Ji Kang, Cheol Woong Jung, Dong-Mi Yoo, Jihye Yu, Seung-Hee Lee
J Educ Eval Health Prof. 2024;21:8.   Published online April 2, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.8
  • 1,709 View
  • 318 Download
  • 1 Web of Science
  • 1 Crossref
Abstract
Purpose
This study aimed to identify challenges and potential improvements in Korea's medical education accreditation process according to the Accreditation Standards of the Korean Institute of Medical Education and Evaluation 2019 (ASK2019). Meta-evaluation was conducted to survey the experiences and perceptions of stakeholders, including self-assessment committee members, site visit committee members, administrative staff, and medical school professors.
Methods
A cross-sectional study was conducted using surveys sent to 40 medical schools. The 332 participants included self-assessment committee members, site visit team members, administrative staff, and medical school professors. The t-test, one-way analysis of variance and the chi-square test were used to analyze and compare opinions on medical education accreditation between the categories of participants.
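A minimal sketch of one of the comparisons named above, a one-way analysis of variance across participant categories; the group labels follow the study's stakeholder categories, but the 5-point agreement ratings are invented:

```python
from scipy.stats import f_oneway

# Hypothetical 5-point agreement ratings on one survey item (e.g., the
# necessity of accreditation) from three of the surveyed stakeholder groups.
self_assessment = [4, 5, 4, 3, 4, 5]
site_visit      = [5, 5, 4, 5, 4, 5]
faculty         = [3, 4, 3, 4, 3, 4]

f_stat, p_value = f_oneway(self_assessment, site_visit, faculty)
print(f"F={f_stat:.2f}, P={p_value:.4f}")
```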
Results
Site visit committee members placed greater importance on the necessity of accreditation than faculty members. A shared positive view on accreditation’s role in improving educational quality was seen among self-evaluation committee members and professors. Administrative staff highly regarded the Korean Institute of Medical Education and Evaluation’s reliability and objectivity, unlike the self-evaluation committee members. Site visit committee members positively perceived the clarity of accreditation standards, differing from self-assessment committee members. Administrative staff were most optimistic about implementing standards. However, the accreditation process encountered challenges, especially in duplicating content and preparing self-evaluation reports. Finally, perceptions regarding the accuracy of final site visit reports varied significantly between the self-evaluation committee members and the site visit committee members.
Conclusion
This study revealed diverse views on medical education accreditation, highlighting the need for improved communication, expectation alignment, and stakeholder collaboration to refine the accreditation process and quality.

Citations

Citations to this article as recorded by  
  • The new placement of 2,000 entrants at Korean medical schools in 2025: is the government’s policy evidence-based?
    Sun Huh
    The Ewha Medical Journal.2024;[Epub]     CrossRef
Review
Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review  
Xiaojun Xu, Yixiao Chen, Jing Miao
J Educ Eval Health Prof. 2024;21:6.   Published online March 15, 2024
DOI: https://doi.org/10.3352/jeehp.2024.21.6
  • 5,906 View
  • 566 Download
  • 11 Web of Science
  • 15 Crossref
Abstract
Background
ChatGPT is a large language model (LLM) based on artificial intelligence (AI) capable of responding in multiple languages and generating nuanced and highly complex responses. While ChatGPT holds promising applications in medical education, its limitations and potential risks cannot be ignored.
Methods
A scoping review was conducted for English articles discussing ChatGPT in the context of medical education published after 2022. A literature search was performed using PubMed/MEDLINE, Embase, and Web of Science databases, and information was extracted from the relevant studies that were ultimately included.
Results
ChatGPT exhibits various potential applications in medical education, such as providing personalized learning plans and materials, creating clinical practice simulation scenarios, and assisting in writing articles. However, challenges associated with academic integrity, data accuracy, and potential harm to learning were also highlighted in the literature. The reviewed literature emphasizes certain recommendations for using ChatGPT, including the establishment of guidelines. Based on the review, 3 key research areas were proposed: cultivating the ability of medical students to use ChatGPT correctly, integrating ChatGPT into teaching activities and processes, and proposing standards for the use of AI by medical students.
Conclusion
ChatGPT has the potential to transform medical education, but careful consideration is required for its full integration. To harness the full potential of ChatGPT in medical education, attention should not only be given to the capabilities of AI but also to its impact on students and teachers.

Citations

Citations to this article as recorded by  
  • AI-assisted patient education: Challenges and solutions in pediatric kidney transplantation
    MZ Ihsan, Dony Apriatama, Pithriani, Riza Amalia
    Patient Education and Counseling.2025; 131: 108575.     CrossRef
  • Exploring predictors of AI chatbot usage intensity among students: Within- and between-person relationships based on the technology acceptance model
    Anne-Kathrin Kleine, Insa Schaffernak, Eva Lermer
    Computers in Human Behavior: Artificial Humans.2025; 3: 100113.     CrossRef
  • Chatbots in neurology and neuroscience: Interactions with students, patients and neurologists
    Stefano Sandrone
    Brain Disorders.2024; 15: 100145.     CrossRef
  • ChatGPT in education: unveiling frontiers and future directions through systematic literature review and bibliometric analysis
    Buddhini Amarathunga
    Asian Education and Development Studies.2024; 13(5): 412.     CrossRef
  • Evaluating the performance of ChatGPT-3.5 and ChatGPT-4 on the Taiwan plastic surgery board examination
    Ching-Hua Hsieh, Hsiao-Yun Hsieh, Hui-Ping Lin
    Heliyon.2024; 10(14): e34851.     CrossRef
  • Preparing for Artificial General Intelligence (AGI) in Health Professions Education: AMEE Guide No. 172
    Ken Masters, Anne Herrmann-Werner, Teresa Festl-Wietek, David Taylor
    Medical Teacher.2024; 46(10): 1258.     CrossRef
  • A Comparative Analysis of ChatGPT and Medical Faculty Graduates in Medical Specialization Exams: Uncovering the Potential of Artificial Intelligence in Medical Education
    Gülcan Gencer, Kerem Gencer
    Cureus.2024;[Epub]     CrossRef
  • Research ethics and issues regarding the use of ChatGPT-like artificial intelligence platforms by authors and reviewers: a narrative review
    Sang-Jun Kim
    Science Editing.2024; 11(2): 96.     CrossRef
  • Innovation Off the Bat: Bridging the ChatGPT Gap in Digital Competence among English as a Foreign Language Teachers
    Gulsara Urazbayeva, Raisa Kussainova, Aikumis Aibergen, Assel Kaliyeva, Gulnur Kantayeva
    Education Sciences.2024; 14(9): 946.     CrossRef
  • Exploring the perceptions of Chinese pre-service teachers on the integration of generative AI in English language teaching: Benefits, challenges, and educational implications
    Ji Young Chung, Seung-Hoon Jeong
    Online Journal of Communication and Media Technologies.2024; 14(4): e202457.     CrossRef
  • Unveiling the bright side and dark side of AI-based ChatGPT : a bibliographic and thematic approach
    Chandan Kumar Tiwari, Mohd. Abass Bhat, Abel Dula Wedajo, Shagufta Tariq Khan
    Journal of Decision Systems.2024; : 1.     CrossRef
  • Artificial Intelligence in Medical Education and Mentoring in Rehabilitation Medicine
    Julie K. Silver, Mustafa Reha Dodurgali, Nara Gavini
    American Journal of Physical Medicine & Rehabilitation.2024; 103(11): 1039.     CrossRef
  • The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education
    Sauliha Rabia Alli, Soaad Qahhār Hossain, Sunit Das, Ross Upshur
    JMIR Medical Education.2024; 10: e51446.     CrossRef
  • A Systematic Literature Review of Empirical Research on Applying Generative Artificial Intelligence in Education
    Xin Zhang, Peng Zhang, Yuan Shen, Min Liu, Qiong Wang, Dragan Gašević, Yizhou Fan
    Frontiers of Digital Education.2024; 1(3): 223.     CrossRef
  • Artificial intelligence in medical problem-based learning: opportunities and challenges
    Yaoxing Chen, Hong Qi, Yu Qiu, Juan Li, Liang Zhu, Xiaoling Gao, Hao Wang, Gan Jiang
    Global Medical Education.2024;[Epub]     CrossRef
Research articles
Negative effects on medical students’ scores for clinical performance during the COVID-19 pandemic in Taiwan: a comparative study  
Eunice Jia-Shiow Yuan, Shiau-Shian Huang, Chia-An Hsu, Jiing-Feng Lirng, Tzu-Hao Li, Chia-Chang Huang, Ying-Ying Yang, Chung-Pin Li, Chen-Huan Chen
J Educ Eval Health Prof. 2023;20:37.   Published online December 26, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.37
  • 2,116 View
  • 112 Download
  • 1 Web of Science
  • 1 Crossref
Abstract
Purpose
Coronavirus disease 2019 (COVID-19) has heavily impacted medical clinical education in Taiwan. Medical curricula have been altered to minimize exposure and limit transmission. This study investigated the effect of COVID-19 on Taiwanese medical students’ clinical performance using online standardized evaluation systems and explored the factors influencing medical education during the pandemic.
Methods
Medical students were scored from 0 to 100 based on their clinical performance from 1/1/2018 to 6/30/2021. The students were placed into pre-COVID-19 (before 2/1/2020) and midst-COVID-19 (on and after 2/1/2020) groups. Each group was further categorized into COVID-19-affected specialties (pulmonary, infectious, and emergency medicine) and other specialties. Generalized estimating equations (GEEs) were used to compare and examine the effects of relevant variables on student performance.
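A minimal sketch of a Gaussian GEE with an exchangeable working correlation in statsmodels, reflecting repeated scores clustered within students; the data frame and column names below are assumptions for illustration, not the study's data:

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Hypothetical long-format data: one row per clinical score, clustered by
# student; all values are invented.
df = pd.DataFrame({
    "score":       [90.1, 88.4, 91.0, 87.2, 89.5, 88.0, 90.6, 87.9],
    "midst_covid": [0, 1, 0, 1, 0, 1, 0, 1],  # 1 = scored on/after 2/1/2020
    "affected":    [1, 1, 0, 0, 1, 1, 0, 0],  # COVID-19-affected specialty
    "female":      [1, 1, 0, 0, 1, 1, 1, 1],
    "student_id":  [1, 1, 2, 2, 3, 3, 4, 4],
})

# Gaussian GEE with an exchangeable correlation structure for repeated
# measures within each student.
model = smf.gee(
    "score ~ midst_covid + affected + female",
    groups="student_id",
    data=df,
    cov_struct=sm.cov_struct.Exchangeable(),
    family=sm.families.Gaussian(),
)
print(model.fit().summary())
```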
Results
In total, 16,944 clinical scores were obtained for COVID-19-affected specialties and other specialties. For the COVID-19-affected specialties, the midst-COVID-19 score (88.51±3.52) was significantly lower than the pre-COVID-19 score (90.14±3.55) (P<0.0001). For the other specialties, the midst-COVID-19 score (88.32±3.68) was also significantly lower than the pre-COVID-19 score (90.06±3.58) (P<0.0001). There were 1,322 students (837 males and 485 females). Male students had significantly lower scores than female students (89.33±3.68 vs. 89.99±3.66, P=0.0017). GEE analysis revealed that the COVID-19 pandemic (unstandardized beta coefficient [B]=-1.99, standard error [SE]=0.13, P<0.0001), COVID-19-affected specialties (B=0.26, SE=0.11, P=0.0184), female students (B=1.10, SE=0.20, P<0.0001), and female attending physicians (B=-0.19, SE=0.08, P=0.0145) were independently associated with students’ scores.
Conclusion
COVID-19 negatively impacted medical students' clinical performance, regardless of their specialty. Female students outperformed male students, irrespective of the pandemic.

Citations

Citations to this article as recorded by  
  • The emergence of generative artificial intelligence platforms in 2023, journal metrics, appreciation to reviewers and volunteers, and obituary
    Sun Huh
    Journal of Educational Evaluation for Health Professions.2024; 21: 9.     CrossRef
Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study  
Betzy Clariza Torres-Zegarra, Wagner Rios-Garcia, Alvaro Micael Ñaña-Cordova, Karen Fatima Arteaga-Cisneros, Xiomara Cristina Benavente Chalco, Marina Atena Bustamante Ordoñez, Carlos Jesus Gutierrez Rios, Carlos Alberto Ramos Godoy, Kristell Luisa Teresa Panta Quezada, Jesus Daniel Gutierrez-Arratia, Javier Alejandro Flores-Cohaila
J Educ Eval Health Prof. 2023;20:30.   Published online November 20, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.30
  • 3,036 View
  • 220 Download
  • 13 Web of Science
  • 18 Crossref
Abstract
Purpose
We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME).
Methods
This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs in terms of medical area, item type, and whether the MCQ required Peru-specific knowledge. They assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing).
Results
GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude, and the historical performance of Peruvian examinees was 55%. Among the factors associated with correct answers, only MCQs that required Peru-specific knowledge had lower odds (odds ratio, 0.23; 95% confidence interval, 0.09–0.61), whereas the remaining factors showed no associations. In assessing the educational value of justifications provided by GPT-4 and Bing, neither showed any significant differences in certainty, usefulness, or potential use in the classroom.
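A minimal sketch of how such an odds ratio with a 95% confidence interval can be obtained from item-level data via logistic regression; the data are simulated to roughly echo the reported effect and are not taken from the study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated item-level data (not the study's): 180 MCQs flagged for whether
# they required Peru-specific knowledge, with lower odds of a correct answer
# built into the simulation for flagged items.
rng = np.random.default_rng(0)
df = pd.DataFrame({"peru_specific": rng.integers(0, 2, 180)})
logit_p = 1.5 - 1.4 * df["peru_specific"]
df["correct"] = (rng.random(180) < 1 / (1 + np.exp(-logit_p))).astype(int)

fit = smf.logit("correct ~ peru_specific", data=df).fit(disp=False)
or_point = np.exp(fit.params["peru_specific"])
ci_low, ci_high = np.exp(fit.conf_int().loc["peru_specific"])
print(f"OR={or_point:.2f}, 95% CI {ci_low:.2f}-{ci_high:.2f}")
```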
Conclusion
Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.

Citations

Citations to this article as recorded by  
  • PICOT questions and search strategies formulation: A novel approach using artificial intelligence automation
    Lucija Gosak, Gregor Štiglic, Lisiane Pruinelli, Dominika Vrbnjak
    Journal of Nursing Scholarship.2025; 57(1): 5.     CrossRef
  • Capable exam-taker and question-generator: the dual role of generative AI in medical education assessment
    Yihong Qiu, Chang Liu
    Global Medical Education.2025;[Epub]     CrossRef
  • Comparison of artificial intelligence systems in answering prosthodontics questions from the dental specialty exam in Turkey
    Busra Tosun, Zeynep Sen Yilmaz
    Journal of Dental Sciences.2025;[Epub]     CrossRef
  • Benchmarking LLM chatbots’ oncological knowledge with the Turkish Society of Medical Oncology’s annual board examination questions
    Efe Cem Erdat, Engin Eren Kavak
    BMC Cancer.2025;[Epub]     CrossRef
  • Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study
    Masao Noda, Takayoshi Ueno, Ryota Koshu, Yuji Takaso, Mari Dias Shimada, Chizu Saito, Hisashi Sugimoto, Hiroaki Fushiki, Makoto Ito, Akihiro Nomura, Tomokazu Yoshizaki
    JMIR Medical Education.2024; 10: e57054.     CrossRef
  • Response to Letter to the Editor re: “Artificial Intelligence Versus Expert Plastic Surgeon: Comparative Study Shows ChatGPT ‘Wins' Rhinoplasty Consultations: Should We Be Worried? [1]” by Durairaj et al
    Kay Durairaj, Omer Baker
    Facial Plastic Surgery & Aesthetic Medicine.2024; 26(3): 276.     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis
    Mingxin Liu, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, Takahiro Kiuchi
    Journal of Medical Internet Research.2024; 26: e60807.     CrossRef
  • Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study
    Giacomo Rossettini, Lia Rodeghiero, Federica Corradi, Chad Cook, Paolo Pillastrini, Andrea Turolla, Greta Castellini, Stefania Chiappinotto, Silvia Gianola, Alvisa Palese
    BMC Medical Education.2024;[Epub]     CrossRef
  • Evaluating the competency of ChatGPT in MRCP Part 1 and a systematic literature review of its capabilities in postgraduate medical assessments
    Oliver Vij, Henry Calver, Nikki Myall, Mrinalini Dey, Koushan Kouranloo, Thiago P. Fernandes
    PLOS ONE.2024; 19(7): e0307372.     CrossRef
  • Large Language Models in Pediatric Education: Current Uses and Future Potential
    Srinivasan Suresh, Sanghamitra M. Misra
    Pediatrics.2024;[Epub]     CrossRef
  • Comparison of the Performance of ChatGPT, Claude and Bard in Support of Myopia Prevention and Control
    Yan Wang, Lihua Liang, Ran Li, Yihua Wang, Changfu Hao
    Journal of Multidisciplinary Healthcare.2024; Volume 17: 3917.     CrossRef
  • Evaluating Large Language Models in Dental Anesthesiology: A Comparative Analysis of ChatGPT-4, Claude 3 Opus, and Gemini 1.0 on the Japanese Dental Society of Anesthesiology Board Certification Exam
    Misaki Fujimoto, Hidetaka Kuroda, Tomomi Katayama, Atsuki Yamaguchi, Norika Katagiri, Keita Kagawa, Shota Tsukimoto, Akito Nakano, Uno Imaizumi, Aiji Sato-Boku, Naotaka Kishimoto, Tomoki Itamiya, Kanta Kido, Takuro Sanuki
    Cureus.2024;[Epub]     CrossRef
  • Dermatological Knowledge and Image Analysis Performance of Large Language Models Based on Specialty Certificate Examination in Dermatology
    Ka Siu Fan, Ka Hay Fan
    Dermato.2024; 4(4): 124.     CrossRef
  • ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review
    Alexandra Aster, Matthias Carl Laupichler, Tamina Rockwell-Kollmann, Gilda Masala, Ebru Bala, Tobias Raupach
    Medical Science Educator.2024;[Epub]     CrossRef
  • Performance of ChatGPT and Bard on the medical licensing examinations varies across different cultures: a comparison study
    Yikai Chen, Xiujie Huang, Fangjie Yang, Haiming Lin, Haoyu Lin, Zhuoqun Zheng, Qifeng Liang, Jinhai Zhang, Xinxin Li
    BMC Medical Education.2024;[Epub]     CrossRef
  • Using large language models (ChatGPT, Copilot, PaLM, Bard, and Gemini) in Gross Anatomy course: Comparative analysis
    Volodymyr Mavrych, Paul Ganguly, Olena Bolgova
    Clinical Anatomy.2024;[Epub]     CrossRef
  • Information amount, accuracy, and relevance of generative artificial intelligence platforms’ answers regarding learning objectives of medical arthropodology evaluated in English and Korean queries in December 2023: a descriptive study
    Hyunju Lee, Soobin Park
    Journal of Educational Evaluation for Health Professions.2023; 20: 39.     CrossRef
Efficacy and limitations of ChatGPT as a biostatistical problem-solving tool in medical education in Serbia: a descriptive study  
Aleksandra Ignjatović, Lazar Stevanović
J Educ Eval Health Prof. 2023;20:28.   Published online October 16, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.28
  • 3,978 View
  • 216 Download
  • 9 Web of Science
  • 12 Crossref
Abstract
Purpose
This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool in solving biostatistical problems and to identify any potential drawbacks that might arise from using ChatGPT in medical education, particularly in solving practical biostatistical problems.
Methods
In this descriptive study, ChatGPT was tested to evaluate its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT (versions 3.5 and 4).
Results
GPT-3.5 solved 5 practical problems on the first attempt, related to categorical data, a cross-sectional study, measuring reliability, probability properties, and the t-test. It failed to provide correct answers regarding analysis of variance, the chi-square test, and sample size within 3 attempts. GPT-4 additionally solved a task related to the confidence interval on the first attempt and, with precise guidance and monitoring, solved all questions within 3 attempts.
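Given the conclusion below that ChatGPT's statistical output should be verified, here is a minimal sketch of recomputing two of the tested problem types (a chi-square test and a t-test) with scipy; the numbers are illustrative, not taken from the handbook:

```python
from scipy.stats import chi2_contingency, ttest_ind

# Recompute a chi-square test of independence from a 2x2 table; the counts
# are invented for illustration.
table = [[30, 20],
         [15, 35]]
chi2, p, dof, _ = chi2_contingency(table, correction=False)
print(f"chi-square={chi2:.2f}, dof={dof}, P={p:.4f}")

# Recompute an independent-samples t-test on two illustrative groups.
group_a = [5.1, 4.8, 5.6, 5.0, 4.9]
group_b = [5.9, 6.1, 5.7, 6.3, 5.8]
t_stat, p_value = ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, P={p_value:.4f}")
```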
Conclusion
The assessment of both versions of ChatGPT on 10 biostatistical problems revealed below-average performance, with correct response rates of 5 and 6 out of 10 on the first attempt for GPT-3.5 and GPT-4, respectively; GPT-4 provided all correct answers within 3 attempts. These findings indicate that this tool, even when it runs and reports various statistical analyses, can be wrong; students should be aware of ChatGPT’s limitations and be careful when incorporating this model into medical education.

Citations

Citations to this article as recorded by  
  • From statistics to deep learning: Using large language models in psychiatric research
    Yining Hua, Andrew Beam, Lori B. Chibnik, John Torous
    International Journal of Methods in Psychiatric Research.2025;[Epub]     CrossRef
  • Assessing the Current Limitations of Large Language Models in Advancing Health Care Education
    JaeYong Kim, Bathri Narayan Vajravelu
    JMIR Formative Research.2025; 9: e51319.     CrossRef
  • Can Generative AI and ChatGPT Outperform Humans on Cognitive-Demanding Problem-Solving Tasks in Science?
    Xiaoming Zhai, Matthew Nyaaba, Wenchao Ma
    Science & Education.2024;[Epub]     CrossRef
  • Opportunities, challenges, and future directions of large language models, including ChatGPT in medical education: a systematic scoping review
    Xiaojun Xu, Yixiao Chen, Jing Miao
    Journal of Educational Evaluation for Health Professions.2024; 21: 6.     CrossRef
  • Comparing the Performance of ChatGPT-4 and Medical Students on MCQs at Varied Levels of Bloom’s Taxonomy
    Ambadasu Bharatha, Nkemcho Ojeh, Ahbab Mohammad Fazle Rabbi, Michael Campbell, Kandamaran Krishnamurthy, Rhaheem Layne-Yarde, Alok Kumar, Dale Springer, Kenneth Connell, Md Anwarul Majumder
    Advances in Medical Education and Practice.2024; Volume 15: 393.     CrossRef
  • Revolutionizing Cardiology With Words: Unveiling the Impact of Large Language Models in Medical Science Writing
    Abhijit Bhattaru, Naveena Yanamala, Partho P. Sengupta
    Canadian Journal of Cardiology.2024; 40(10): 1950.     CrossRef
  • ChatGPT in medicine: prospects and challenges: a review article
    Songtao Tan, Xin Xin, Di Wu
    International Journal of Surgery.2024;[Epub]     CrossRef
  • In-depth analysis of ChatGPT’s performance based on specific signaling words and phrases in the question stem of 2377 USMLE step 1 style questions
    Leonard Knoedler, Samuel Knoedler, Cosima C. Hoch, Lukas Prantl, Konstantin Frank, Laura Soiderer, Sebastian Cotofana, Amir H. Dorafshar, Thilo Schenck, Felix Vollbach, Giuseppe Sofo, Michael Alfertshofer
    Scientific Reports.2024;[Epub]     CrossRef
  • Evaluating the quality of responses generated by ChatGPT
    Danimir Mandić, Gordana Miščević, Ljiljana Bujišić
    Metodicka praksa.2024; 27(1): 5.     CrossRef
  • A Comparative Evaluation of Statistical Product and Service Solutions (SPSS) and ChatGPT-4 in Statistical Analyses
    Al Imran Shahrul, Alizae Marny F Syed Mohamed
    Cureus.2024;[Epub]     CrossRef
  • ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review
    Alexandra Aster, Matthias Carl Laupichler, Tamina Rockwell-Kollmann, Gilda Masala, Ebru Bala, Tobias Raupach
    Medical Science Educator.2024;[Epub]     CrossRef
  • Exploring the potential of large language models for integration into an academic statistical consulting service–the EXPOLS study protocol
    Urs Alexander Fichtner, Jochen Knaus, Erika Graf, Georg Koch, Jörg Sahlmann, Dominikus Stelzer, Martin Wolkewitz, Harald Binder, Susanne Weber, Bekalu Tadesse Moges
    PLOS ONE.2024; 19(12): e0308375.     CrossRef
Brief report
Comparing ChatGPT’s ability to rate the degree of stereotypes and the consistency of stereotype attribution with those of medical students in New Zealand in developing a similarity rating test: a methodological study  
Chao-Cheng Lin, Zaine Akuhata-Huntington, Che-Wei Hsu
J Educ Eval Health Prof. 2023;20:17.   Published online June 12, 2023
DOI: https://doi.org/10.3352/jeehp.2023.20.17
  • 3,156 View
  • 156 Download
  • 3 Web of Science
  • 4 Crossref
Abstract
Learning about one’s implicit bias is crucial for improving one’s cultural competency and thereby reducing health inequity. To evaluate bias among medical students following a previously developed cultural training program targeting New Zealand Māori, we developed a text-based, self-evaluation tool called the Similarity Rating Test (SRT). The development process of the SRT was resource-intensive, limiting its generalizability and applicability. Here, we explored the potential of ChatGPT, an automated chatbot, to assist in the development of the SRT by comparing ChatGPT’s and students’ evaluations of the SRT. Although ChatGPT’s and students’ ratings were neither significantly equivalent nor significantly different, ChatGPT’s ratings were more consistent than the students’ ratings. The consistency rate was higher for non-stereotypical than for stereotypical statements, regardless of rater type. Further studies are warranted to validate ChatGPT’s potential for assisting in SRT development for implementation in medical education and the evaluation of ethnic stereotypes and related topics.

Citations

Citations to this article as recorded by  
  • The Performance of ChatGPT on Short-answer Questions in a Psychiatry Examination: A Pilot Study
    Chao-Cheng Lin, Kobus du Plooy, Andrew Gray, Deirdre Brown, Linda Hobbs, Tess Patterson, Valerie Tan, Daniel Fridberg, Che-Wei Hsu
    Taiwanese Journal of Psychiatry.2024; 38(2): 94.     CrossRef
  • ChatGPT and Other Large Language Models in Medical Education — Scoping Literature Review
    Alexandra Aster, Matthias Carl Laupichler, Tamina Rockwell-Kollmann, Gilda Masala, Ebru Bala, Tobias Raupach
    Medical Science Educator.2024;[Epub]     CrossRef
  • Psychiatric Care, Training and Research in Aotearoa New Zealand
    Chao-Cheng (Chris) Lin, Charlotte Mentzel, Maria Luz C. Querubin
    Taiwanese Journal of Psychiatry.2024; 38(4): 161.     CrossRef
  • Efficacy and limitations of ChatGPT as a biostatistical problem-solving tool in medical education in Serbia: a descriptive study
    Aleksandra Ignjatović, Lazar Stevanović
    Journal of Educational Evaluation for Health Professions.2023; 20: 28.     CrossRef
