Harnessing the Potential of Generative AI in Medical Undergraduate Education Across Different Disciplines – A Comparative Study on Performance of ChatGPT-3.5 and GPT-4 in Physiology and Biochemistry Modified Essay Questions

Nathasha LUKE1, LEE Seow Chong2, Kenneth BAN2, Amanda WONG1, CHEN Zhi Xiong1,3, LEE Shuh Shing3, Reshma Taneja1, Dujeepa D. SAMARASEKARA3, Celestial T. YAP1

1Department of Physiology, Yong Loo Lin School of Medicine (NUSMed)
2Department of Biochemistry, NUSMed
3Centre for Medical Education (CenMED), NUSMed

Editor’s Note: Nathasha and her co-authors share a summary of the findings from the study they presented at HECC 2023 under the sub-theme “AI and Education”.


Nathasha presenting her team’s study during HECC 2023.

Luke, W. A. N. V., Lee, S. C., Ban, K., Wong, A., Chen, Z. X., Lee, S. S., Taneja, R., Samarasekara, D., & Yap, C. T. (2024, April 26). Harnessing the potential of generative AI in medical undergraduate education across different disciplines – A comparative study on performance of ChatGPT-3.5 and GPT-4 in physiology and biochemistry modified essay questions [HECC 2023 Summary]. Teaching Connections. https://blog.nus.edu.sg/teachingconnections/2024/04/26/harnessing-the-potential-of-generative-ai-in-medical-undergraduate-education-across-different-disciplines-a-comparative-study-on-performance-of-chatgpt-3-5-and-gpt-4-in-physiology-and-bioche/


Introduction

Generative artificial intelligence (AI) is becoming an integral part of modern-day education. ChatGPT’s success in passing medical examinations (Kung et al., 2023; Subramani et al., 2023) and solving complex clinical cases (Eriksen et al., 2023) has demonstrated its potential to revolutionise medical education. The capabilities and limitations of this technology across disciplines should be identified to promote the optimal use of these models in education. We conducted a study evaluating the performance of ChatGPT on modified essay questions (MEQs) in Physiology and Biochemistry for medical undergraduates.


Answer Generation, Assessment, and Data Analysis

The modified essay questions were extracted from Physiology and Biochemistry tutorials and case-based learning scenarios. A total of 44 MEQs in Physiology and 43 MEQs in Biochemistry were entered into GPT-3.5 to generate answers. Each response was graded independently by two examiners, guided by a marking scheme. The examiners also rated the answers on concordance, accuracy, language, organisation, and information, and provided qualitative comments. Descriptive statistics, including the mean, standard deviation, and variance, were calculated for the overall scores and for subgroups of questions categorised by Bloom’s taxonomy. A single-factor ANOVA was performed on the subgroups to test for statistically significant differences.
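As a rough illustration of this kind of analysis, the Python sketch below computes descriptive statistics for Bloom’s-taxonomy subgroups and runs a single-factor ANOVA with SciPy. The grouping and the scores are hypothetical placeholders, not the study data.

```python
# A minimal sketch of the scoring analysis, assuming answers are grouped
# by Bloom's taxonomy level. All scores here are illustrative placeholders.
import statistics
from scipy import stats

scores_by_level = {
    "lower_order": [100, 95, 90, 80, 100, 85],   # hypothetical scores out of 100
    "higher_order": [60, 45, 70, 55, 40, 65],
}

# Descriptive statistics per subgroup: mean, standard deviation, variance
for level, scores in scores_by_level.items():
    print(f"{level}: mean={statistics.mean(scores):.1f}, "
          f"SD={statistics.stdev(scores):.1f}, "
          f"variance={statistics.variance(scores):.1f}")

# Single-factor (one-way) ANOVA across the Bloom's subgroups
f_stat, p_value = stats.f_oneway(*scores_by_level.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```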


Subsequently, a subgroup of 15 questions from each subject was selected to represent the different score categories of the GPT-3.5 answers. Answers to these questions were then generated with GPT-4 and graded in the same manner.


Results

The answers generated by GPT-3.5 (n = 44) obtained a mean score of 74.7 (SD 25.96) in Physiology. Of these, 16/44 (36.4%) scored 90 marks or above out of 100, and 13/44 (29.5%) obtained full marks. There was a statistically significant difference in mean scores between the higher-order and lower-order questions on Bloom’s taxonomy (p < 0.05). Qualitative comments commended ChatGPT’s strength in producing exemplary answers to most Physiology questions, excelling mostly in the lower-order questions. Deficiencies were noted in applying physiological concepts in the clinical context.

The mean score for Biochemistry was 59.3 (SD 26.9). Only 2/43 (4.7%) of the answers obtained full marks, while 7/43 (16.3%) scored 90 marks or above. There was no statistically significant difference in scores between the higher-order and lower-order questions on Bloom’s taxonomy. The examiners’ comments highlighted that the answers lacked relevant information and contained faulty explanations of concepts. Examiners also noted that the outputs demonstrated breadth, but not the expected depth.

Figure 1. Distribution of scores of answers generated in GPT-3.5. 


The 15 Physiology questions answered by both models had mean scores of 59.2 (SD 32.04) for GPT-3.5 and 68.6 (SD 29.77) for GPT-4. For the 15 Biochemistry questions, the mean score was 59.33 (SD 29.08) for GPT-3.5 and 85.33 (SD 18.48) for GPT-4. The improvement in GPT-4 scores was statistically significant in Biochemistry (p = 0.006), but not in Physiology (p = 0.4).
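Since the same 15 questions were answered by both models, a paired comparison is one natural way to test such a difference. The study does not state which test was used, so the paired t-test and the scores in the sketch below are assumptions for illustration only.

```python
# A hedged sketch of comparing GPT-3.5 and GPT-4 scores on the same
# 15 questions with a paired t-test. Scores are invented placeholders;
# the study does not specify which significance test it applied.
from scipy import stats

gpt35 = [60, 40, 100, 30, 70, 55, 80, 45, 65, 90, 20, 75, 50, 85, 25]
gpt4  = [90, 70, 100, 60, 95, 80, 85, 75, 90, 100, 55, 95, 80, 100, 65]

t_stat, p_value = stats.ttest_rel(gpt4, gpt35)
print(f"Paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```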


Discussion

Our study showcases the potential of generative AI as a helpful educational resource. However, current generative AI technology is not yet at a point where it can serve as a sole educational resource, as its performance was suboptimal on most of the questions. Responses were often incomplete or contained errors, particularly in applying concepts to the clinical context. Used judiciously as an adjunct to other resources, generative AI can augment the learning process. Students and educators should be aware of the strengths and weaknesses of generative AI technologies in their own disciplines. One strategy would be to have students generate an initial draft with generative AI, critically analyse the response, and, based on their analysis, work towards improving their application of disciplinary knowledge. In addition, educators will have to revise certain educational tools to ensure that course objectives are still met in the generative AI era. This technology should be incorporated into learning pedagogies where appropriate.


Our study demonstrates the differential performance of ChatGPT across the two subjects. The performance of language models depends largely on the availability of training data; hence, their efficacy may vary across subject areas. Subject- and domain-specific training focused on areas of deficiency could enhance the performance of language models.


Overall, GPT-4 performed better than GPT-3.5 on the subset of questions tested. Users should understand the strengths and limitations of each type and iteration of generative AI to promote optimal adoption of the technology.


Acknowledgments

We would like to thank the examiners who took the time to grade the ChatGPT responses. We also thank CDTL for its support and for the Teaching Enhancement Grant (TEG), which funded this project.


References

Eriksen, A. V., Möller, S., & Ryg, J. (2023, December 11). Use of GPT-4 to diagnose complex clinical cases. NEJM AI, 1(1). https://doi.org/10.1056/aip2300031

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2023, February 9). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198. https://doi.org/10.1371/journal.pdig.0000198

Subramani, M., Jaleel, I., & Krishna Mohan, S. (2023, June 1). Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Advances in Physiology Education, 47(2), 270–271. https://doi.org/10.1152/advan.00036.2023



Nathasha LUKE is a Lecturer in the Department of Physiology at the Yong Loo Lin School of Medicine (NUSMed), and a Resident Physician at Ng Teng Fong General Hospital.

Nathasha can be reached at nathasha@nus.edu.sg.

LEE Seow Chong is a Lecturer in the Department of Biochemistry at NUSMed. He can be reached at bchlees@nus.edu.sg.

Kenneth BAN is a Senior Lecturer in the Department of Biochemistry at NUSMed. In addition, he leads the longitudinal Health Informatics track that aims to build foundational competencies in data science for our medical students.

Kenneth can be reached at kenneth_ban@nus.edu.sg.

Amanda WONG is an Instructor in the Department of Physiology at NUSMed. She can be reached at phswhpa@nus.edu.sg.
CHEN Zhi Xiong is an Associate Professor at the Department of Physiology and the Assistant Dean (Education) at NUSMed. He can be reached at zhixiong_chen@nus.edu.sg.
LEE Shuh Shing is the Assistant Director of the Centre for Medical Education (CenMED) of NUSMed. She can be reached at shuhshing@nus.edu.sg.
Reshma Taneja is the Head of the Department of Physiology at NUSMed. She can be reached at phsrt@nus.edu.sg.
Dujeepa SAMARASEKARA is the Senior Director of the Centre for Medical Education (CenMED), NUSMed, and Senior Consultant (Health Professions Education) at the Ministry of Health Singapore. He can be reached at dujeepa@nus.edu.sg.
Celestial YAP is an Associate Professor & Physiology Program Director (Medicine, Dentistry, Pharmacy) at NUSMed. She can be reached at phsyapc@nus.edu.sg.
