Harnessing the Potential of Generative AI in Medical Undergraduate Education Across Different Disciplines—Comparative Study on Performance of ChatGPT in Physiology and Biochemistry Modified Essay Questions

W. A. Nathasha Vihangi LUKE1*, LEE Seow Chong2, Kenneth BAN2, Amanda WONG1, CHEN Zhi Xiong1,3, LEE Shuh Shing3, Reshma TANEJA1,
Dujeepa SAMARASEKARA3, Celestial T. YAP1

1Department of Physiology, Yong Loo Lin School of Medicine (YLLSOM)
2Department of Biochemistry, YLLSOM
3Centre for Medical Education, YLLSOM

*nathasha@nus.edu.sg

 

Luke, W. A. N. V., Lee, S. C., Ban, K., Wong, A., Chen, Z. X., Lee, S. S., Taneja, R., Samarasekara, D., & Yap, C. T. (2023). Harnessing the potential of generative AI in medical undergraduate education across different disciplines—comparative study on performance of ChatGPT in physiology and biochemistry modified essay questions [Paper presentation]. In Higher Education Campus Conference (HECC) 2023, 7 December, National University of Singapore. https://blog.nus.edu.sg/hecc2023proceedings/harnessing-the-potential-of-generative-ai-in-medical-undergraduate-education-across-different-disciplines-comparative-study-on-performance-of-chatgpt-in-physiology-and-biochemistry-modified-es/ 
 

SUB-THEME

AI and Education

 

KEYWORDS

Generative AI, artificial intelligence, large language models, physiology, biochemistry

 

CATEGORY

Paper Presentations

 

INTRODUCTION & JUSTIFICATION

Revolutions in generative artificial intelligence (AI) have led to profound discussions on its potential implications across various disciplines in education. ChatGPT passing the United States Medical Licensing Examination (USMLE) (Kung et al., 2023) and excelling in other discipline-specific examinations (Subramani et al., 2023) displayed its potential to revolutionise medical education. The capabilities and limitations of this technology across disciplines should be identified to promote its optimal use in medical education. This study evaluated the performance of ChatGPT, a large language model (LLM) by OpenAI powered by GPT-3.5, on modified essay questions (MEQs) in physiology and biochemistry for medical undergraduates.

 

METHODOLOGY

MEQs extracted from physiology and biochemistry tutorials and case-based learning scenarios were entered into ChatGPT, and answers were generated for 44 MEQs in physiology and 43 in biochemistry. Each response was graded independently by two examiners, guided by a marking scheme. In addition, the examiners rated the answers on concordance, accuracy, language, organisation, and information, and provided qualitative comments. Descriptive statistics (mean, standard deviation, and variance) were calculated for the overall scores and for subgroups of questions classified by Bloom's taxonomy. A single-factor ANOVA was performed on the subgroups to assess for a statistically significant difference.
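As an illustration of how the answer-generation step could be reproduced programmatically, the sketch below submits one MEQ to GPT-3.5 through the OpenAI Python client (pre-1.0 interface, current at the time of the study). This is an assumption for demonstration only: the study does not state whether the ChatGPT web interface or the API was used, and `generate_answer` is a hypothetical helper.

```python
import openai

openai.api_key = "sk-..."  # placeholder; supply a real API key

def generate_answer(meq_text: str) -> str:
    """Submit one modified essay question to GPT-3.5 and return its answer."""
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",  # GPT-3.5, the model evaluated in this study
        messages=[{"role": "user", "content": meq_text}],
    )
    return response.choices[0].message.content
```

The statistical analysis can be sketched in the same way. The snippet below computes the descriptive statistics and runs a single-factor ANOVA across Bloom's-taxonomy subgroups; the score values are invented placeholders, not the study data.

```python
from statistics import mean, stdev, variance
from scipy.stats import f_oneway

# Hypothetical per-question scores (out of 100), grouped by Bloom's level
lower_order = [100, 95, 90, 100, 85, 100, 70, 95]
higher_order = [60, 45, 80, 55, 70, 40, 65, 50]

# Descriptive statistics for each subgroup
for label, scores in (("lower-order", lower_order),
                      ("higher-order", higher_order)):
    print(f"{label}: mean={mean(scores):.1f}, SD={stdev(scores):.1f}, "
          f"variance={variance(scores):.1f}")

# Single-factor (one-way) ANOVA across the subgroups
f_stat, p_value = f_oneway(lower_order, higher_order)
print(f"ANOVA: F={f_stat:.2f}, p={p_value:.4f}")  # p < 0.05 => significant difference
```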

 

RESULTS

ChatGPT answers (n = 44) obtained a mean score of 74.7 (SD 25.96) in physiology. 16/44 (36.3%) of the answers scored 90 marks or above out of 100, and 13/44 (29.5%) obtained full marks. There was a statistically significant difference in mean scores between the higher-order and lower-order questions of Bloom's taxonomy (p < 0.05). Qualitative comments commended ChatGPT for producing exemplary answers to most physiology questions, excelling mostly in lower-order questions. Deficiencies were noted in applying physiological concepts in a clinical context.

 

The mean score for biochemistry was 59.3 (SD 26.9). Only 2/43 (4.6%) of the answers obtained full marks, while 7/43 (16.3%) scored 90 marks or above. There was no statistically significant difference in scores between the higher-order and lower-order questions of Bloom's taxonomy. The examiners' comments highlighted that answers lacked relevant information and contained faulty explanations of concepts; outputs demonstrated breadth, but not the depth expected.


Figure 1. Distribution of scores.

 

CONCLUSIONS AND RECOMMENDATIONS

Overall, our study demonstrates the differential performance of ChatGPT across the two subjects. ChatGPT performed with a high degree of accuracy on most physiology questions, particularly excelling in lower-order questions of Bloom's taxonomy. Generative AI answers in biochemistry scored comparatively lower; examiners commented that these answers demonstrated less precision and specificity and lacked depth in their explanations.

 

The performance of language models largely depends on the availability of training data; hence their efficacy may vary across subject areas. This differential performance highlights the need for future iterations of LLMs to receive subject- and domain-specific training to enhance performance.

 

This study further demonstrates the potential of generative AI technology in medical education. Educators should be aware of the abilities and limitations of generative AI in different disciplines and revise learning tools accordingly to ensure integrity. Efforts should be made to integrate this technology into pedagogy where possible.

 

The performance of ChatGPT in MEQs highlights the potential of generative AI as an educational tool for students. However, this study suggests that the current technology is not yet suited to serve as a sole resource; rather, it should be a supplementary tool alongside other learning resources. In addition, students should take the differential performance across subjects into consideration when determining the extent to which this technology is incorporated into their learning.

 

REFERENCES

Kung, T. H., Cheatham, M., Medenilla, A., Sillos, C., De Leon, L., Elepaño, C., Madriaga, M., Aggabao, R., Diaz-Candido, G., Maningo, J., & Tseng, V. (2023). Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health, 2(2), e0000198. https://doi.org/10.1371/journal.pdig.0000198

Subramani, M., Jaleel, I., & Krishna Mohan, S. (2023). Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Advances in Physiology Education, 47(2), 270–271. https://doi.org/10.1152/advan.00036.2023

 
