Purpose This study aimed to assess the performance of ChatGPT (GPT-3.5 and GPT-4) as a study tool for solving biostatistical problems and to identify potential drawbacks of using ChatGPT in medical education, particularly for practical biostatistical problem-solving.
Methods In this descriptive study, ChatGPT (versions 3.5 and 4) was tested on its ability to solve biostatistical problems from the Handbook of Medical Statistics by Peacock and Peacock. Tables from the problems were transformed into textual questions. Ten biostatistical problems were randomly chosen and used as text-based input for conversation with ChatGPT.
Results GPT-3.5 solved 5 practical problems on the first attempt, relating to categorical data, cross-sectional studies, measuring reliability, probability properties, and the t-test. It failed to provide correct answers within 3 attempts for analysis of variance, the chi-square test, and sample size. GPT-4 additionally solved a confidence interval task on the first attempt and, with precise guidance and monitoring, answered all questions correctly within 3 attempts.
Conclusion The assessment of both ChatGPT versions on 10 biostatistical problems revealed below-average performance, with first-attempt correct response rates of 5 of 10 for GPT-3.5 and 6 of 10 for GPT-4; GPT-4 succeeded in providing all correct answers within 3 attempts. These findings indicate that this tool, even when performing and explaining different statistical analyses, can be wrong; students should recognize ChatGPT's limitations and exercise caution when incorporating the model into medical education.
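To illustrate the kind of independent verification the conclusion calls for, the following is a minimal sketch, not taken from the study, that re-runs the analyses named in the abstract (t-test, chi-square test, analysis of variance, and a sample-size calculation) with SciPy and statsmodels. All data and parameters are invented for demonstration.

```python
# Minimal sketch: re-checking ChatGPT-suggested analyses in statistical software.
# All numbers below are invented; only the test types come from the abstract.
import numpy as np
from scipy import stats
from statsmodels.stats.power import TTestIndPower

# Hypothetical two-group measurements for an independent-samples t-test.
group_a = np.array([5.1, 4.8, 5.6, 5.0, 4.9, 5.3])
group_b = np.array([4.2, 4.5, 4.1, 4.7, 4.4, 4.3])
t_stat, t_p = stats.ttest_ind(group_a, group_b)
print(f"t-test: t = {t_stat:.3f}, p = {t_p:.4f}")

# Hypothetical 2x2 contingency table for a chi-square test of independence.
table = np.array([[30, 10],
                  [20, 25]])
chi2, chi_p, dof, expected = stats.chi2_contingency(table)
print(f"chi-square: chi2 = {chi2:.3f}, p = {chi_p:.4f}, dof = {dof}")

# Hypothetical three-group data for a one-way analysis of variance.
g1, g2, g3 = [4.9, 5.2, 5.0], [4.1, 4.3, 4.6], [5.6, 5.8, 5.5]
f_stat, f_p = stats.f_oneway(g1, g2, g3)
print(f"ANOVA: F = {f_stat:.3f}, p = {f_p:.4f}")

# Hypothetical sample-size calculation: participants per group for a two-sample
# t-test detecting a medium effect (d = 0.5) at alpha = 0.05 with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"sample size: n = {n_per_group:.1f} per group (round up)")
```

Re-running an answer this way does not prove ChatGPT's reasoning was sound, but it catches arithmetic and test-selection errors of the kind the study observed.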
Purpose The framing effect is a phenomenon whereby people change their decisions significantly when the same problem is presented using different representations of the information. This study aimed to explore whether the framing effect could be reduced in medical students and residents by teaching them the statistical concepts of effect size, probability, and sampling for use in the medical decision-making process.
Methods Ninety-five second-year medical students and 100 second-year medical residents of Austral University and Buenos Aires University, Argentina were invited to participate in the study between March and June 2017. A questionnaire was developed to assess different types of framing effects in medical situations. After the questionnaire was first administered, the students and residents were taught statistical concepts, including effect size, probability, and sampling, in 2 separate official biostatistics courses. After these interventions, the same questionnaire was randomly administered again, and pre- and post-intervention outcomes were compared among students and residents.
Results Almost every type of framing effect was reproduced in either the students or the residents. After the students and residents were taught the analytical processes behind these statistical concepts, significant reductions were observed in the sample-size, risky-choice, pseudo-certainty, number-size, attribute, goal, and probabilistic formulation framing effects.
Conclusion The decision-making of medical students and residents in simulated medical situations may be affected by different frame descriptions, and these framing effects can be partially reduced by training individuals in probability analysis and statistical sampling methods.
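As a concrete illustration of how framing interacts with the statistical concepts taught here (effect size and probability), the following is a minimal sketch with invented numbers, not data from the study: the same hypothetical treatment effect is expressed in a relative frame and an absolute frame, the classic contrast behind risky-choice and probabilistic formulation framing.

```python
# Minimal sketch: one invented treatment effect, two frames.
# A "50% relative risk reduction" and a "2 percentage-point absolute risk
# reduction" describe the same data yet tend to elicit different decisions.
baseline_risk = 0.04  # hypothetical event risk without treatment
treated_risk = 0.02   # hypothetical event risk with treatment

arr = baseline_risk - treated_risk  # absolute risk reduction
rrr = arr / baseline_risk           # relative risk reduction
nnt = 1 / arr                       # number needed to treat

print(f"Relative frame: risk falls by {rrr:.0%}")
print(f"Absolute frame: risk falls by {arr:.1%} points (NNT = {nnt:.0f})")
```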
We designed and evaluated an objective structured biostatistics examination (OSBE) on a trial basis to determine whether it was feasible for formative or summative assessment. At Ataturk University, the curriculum for every cohort across all 5 years of undergraduate education is organized as a seminar system. Each seminar integrates several subjects; there are three to six seminars per year, each meeting for six to eight weeks, and at the end of each seminar term an examination is conducted as a formative assessment. In 2010, 201 students took the OSBE, and in 2011, 211 students took the same examination at the end of a seminar that included biostatistics as one module. The examination was conducted in four groups, with two groups examined together. Each group had to complete 5 stations in each row; there were thus two parallel lines with different instructions to follow, so 10 students were examined simultaneously across the two lines. After the examination, the students were invited to receive feedback from the examiners and to provide their reflections. There was a significant difference between male and female scores among the 2010 students (P=0.004), but no gender difference was found in 2011. Comparisons between the parallel lines and among the four groups showed that two groups, A and B, did not differ significantly (P>0.05) in either class. Nonetheless, among the four groups, there was a significant difference in both 2010 (P=0.001) and 2011 (P=0.001). The inter-rater reliability coefficient was 0.60. Overall, the students were satisfied with the testing method, although they felt some stress. The overall experience of the OSBE was useful for learning as well as for assessment.
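The abstract reports an inter-rater reliability coefficient of 0.60 without naming the statistic used. As a hedged illustration only, the following sketch computes Cohen's kappa, one common two-rater agreement coefficient, on invented pass/fail ratings; the study may well have used a different coefficient.

```python
# Minimal sketch: Cohen's kappa for two examiners' pass/fail ratings.
# The ratings are invented, and the choice of kappa is an assumption,
# not a statement of the study's actual method.
from sklearn.metrics import cohen_kappa_score

rater_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # hypothetical ratings, examiner 1
rater_2 = [1, 0, 0, 1, 0, 1, 1, 1, 1, 0]  # hypothetical ratings, examiner 2

kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```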