Purpose This study aimed to determine whether ChatGPT-4o, a generative artificial intelligence (AI) platform, could pass a simulated written European Board of Interventional Radiology (EBIR) exam and whether it could be used to train medical students and interventional radiologists at different levels of expertise by generating exam items on interventional radiology.
Methods GPT-4o was asked to answer 370 simulated exam items from the Cardiovascular and Interventional Radiology Society of Europe (CIRSE) for EBIR preparation (CIRSE Prep). Subsequently, GPT-4o was asked to generate exam items on interventional radiology topics at levels of difficulty suitable for medical students and for the EBIR exam. These generated items were answered by four participants: a medical student, a resident, a consultant, and an EBIR holder. The number of correctly answered items was counted. One investigator checked the answers and the items generated by GPT-4o for correctness and relevance. The study was conducted from April to July 2024.
Results GPT-4o correctly answered 248 of the 370 CIRSE Prep items (67.0%). On a subset of 50 CIRSE Prep items, the medical student answered 46.0% correctly, the resident 42.0%, the consultant 50.0%, and the EBIR holder 74.0%. All participants answered 82.0% to 92.0% of the 50 GPT-4o-generated student-level items correctly. For the 50 GPT-4o items at the EBIR level, the medical student answered 32.0% correctly, the resident 44.0%, the consultant 48.0%, and the EBIR holder 66.0%. All participants passed the GPT-4o-generated student-level items, whereas only the EBIR holder passed the GPT-4o-generated EBIR-level items. Two (1.3%) of the 150 items generated by GPT-4o were assessed as implausible.
Conclusion GPT-4o could pass the simulated written EBIR exam and create exam items of varying difficulty to train medical students and interventional radiologists.