Anatomy exam model for the circulatory and respiratory systems using GPT-4: a medical school study


TEKİN A., Karamus N. F., ÇOLAK T.

SURGICAL AND RADIOLOGIC ANATOMY, vol.47, no.1, 2025 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 47 Issue: 1
  • Publication Date: 2025
  • DOI: 10.1007/s00276-025-03667-z
  • Journal Name: SURGICAL AND RADIOLOGIC ANATOMY
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, CINAHL, MEDLINE
  • Affiliated with Kocaeli Üniversitesi: Yes

Abstract

Purpose: The study aimed to evaluate the effectiveness of anatomy multiple-choice questions (MCQs) generated by GPT-4, focusing on their methodological appropriateness and their alignment with the cognitive levels defined in Bloom's revised taxonomy, with the goal of enhancing assessment.

Methods: The assessment questions for medical students were created with GPT-4 and comprised 240 MCQs organized into subcategories consistent with Bloom's revised taxonomy. The prompts used to generate the MCQs included details about the lesson's purpose, the learning objectives, and the students' prior experience, to ensure the questions were contextually appropriate. A set of 30 MCQs was randomly selected from the generated pool for testing. A total of 280 students sat the examination, from which the difficulty index of each MCQ, the item discrimination index, and the overall test difficulty were calculated. Expert anatomists evaluated how accurately GPT-4 had classified its questions within the taxonomy.

Results: Students achieved a median score of 50 points (range, 36.67-60) on the test. The test's internal consistency, assessed by KR-20, was 0.737, and the average test difficulty was 0.5012. Difficulty and discrimination indices were obtained for each AI-generated question. The expert anatomists' taxonomy-based classifications matched GPT-4's classifications for 26.6% of the questions. Meanwhile, 80.9% of students found the questions clear, and 85.8% expressed interest in retaking the assessment exam.

Conclusion: This study demonstrates GPT-4's significant potential for generating medical education exam questions. While GPT-4 effectively assesses basic knowledge recall, it does not sufficiently evaluate the higher-order cognitive processes outlined in Bloom's revised taxonomy. Future research should consider alternative approaches that combine AI with expert evaluation and specialized multimodal models.
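For readers unfamiliar with the item statistics cited above (difficulty index, discrimination index, KR-20), the following is a minimal illustrative sketch of how such values are conventionally computed from a students-by-items matrix of 0/1 responses. It is not taken from the paper: the function name, the synthetic data, and the top/bottom-27% split used for discrimination are assumptions introduced here for illustration only.

```python
# Illustrative item-analysis sketch (not from the paper): computes item
# difficulty, item discrimination, and KR-20 reliability from a hypothetical
# students-by-items matrix of dichotomous (0/1) scores.
import numpy as np

def item_analysis(scores: np.ndarray):
    """scores: (n_students, n_items) array of 0/1 responses."""
    n_students, n_items = scores.shape
    totals = scores.sum(axis=1)

    # Difficulty index p: proportion of students answering each item correctly.
    difficulty = scores.mean(axis=0)

    # Discrimination index: difference in item success between the top 27%
    # and bottom 27% of students ranked by total score (a common convention;
    # the 27% cut is an assumption, not stated in the abstract).
    k = max(1, int(round(0.27 * n_students)))
    order = np.argsort(totals)
    low, high = scores[order[:k]], scores[order[-k:]]
    discrimination = high.mean(axis=0) - low.mean(axis=0)

    # KR-20 for dichotomous items:
    # KR20 = (k / (k - 1)) * (1 - sum(p * q) / var(total scores))
    p, q = difficulty, 1.0 - difficulty
    kr20 = (n_items / (n_items - 1)) * (1.0 - (p * q).sum() / totals.var(ddof=0))

    return difficulty, discrimination, kr20

# Synthetic demo data standing in for the 280 students x 30 MCQs in the study.
rng = np.random.default_rng(0)
demo = (rng.random((280, 30)) < 0.5).astype(int)
diff, disc, kr20 = item_analysis(demo)
print(f"mean difficulty={diff.mean():.3f}, KR-20={kr20:.3f}")
```

Under this convention, the study's reported average difficulty of 0.5012 corresponds to roughly half of the students answering a typical item correctly, and KR-20 = 0.737 indicates acceptable internal consistency for a 30-item test.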