Comparative evaluation of radiological anatomy knowledge and accuracy of ChatGPT‐5, Gemini 2.5, and Grok 4 across normal and thinking modes


Sivri İ., Özden F. M., Celik H., Gokturk O., Çolak T.

ANATOMICAL SCIENCES EDUCATION, vol.0, 2026 (Scopus)

  • Publication Type: Article / Full Article
  • Volume: 0
  • Publication Date: 2026
  • DOI Number: 10.1002/ase.70292
  • Journal Name: ANATOMICAL SCIENCES EDUCATION
  • Journal Indexes: Social Science Premium Collection (ProQuest), Biomedical Reference Collection: Corporate Edition (EBSCO), Education Collection (ProQuest), Education Source Ultimate (EBSCO), Health Research Premium Collection (ProQuest), Scopus, Agricultural & Environmental Science Database, Education Abstracts, EMBASE, ERIC (Education Resources Information Center), EBSCO Education Source, MEDLINE, Psycinfo
  • Kocaeli University Affiliated: Yes

Abstract

This study compared the performance of three large language models, ChatGPT‐5 Plus, Gemini 2.5 Pro, and SuperGrok 4, in identifying anatomical structures on radiographic images using standardized anatomical terminology. Thirty radiographs from different body regions were selected from an open‐access atlas and analyzed by the models in Normal and Thinking modes using standardized prompts based on Terminologia Anatomica (version 2.07). Responses were evaluated independently by two anatomists using a 0–2 scoring system. Overall accuracy across both modes and models ranged from 47.4% to 85.7%. Data were analyzed using Friedman and Wilcoxon signed‐rank tests. Temporal response consistency was assessed with weighted kappa coefficients. Gemini 2.5 Pro and ChatGPT‐5 Plus significantly outperformed SuperGrok 4 in both modes. In Normal mode, Gemini 2.5 Pro achieved the highest overall accuracy (82.7%), significantly exceeding ChatGPT‐5 Plus (60.7%, p = 0.001) and SuperGrok 4 (47.4%, p < 0.001). In Thinking mode, accuracies were 85.7% for Gemini 2.5 Pro, 77.6% for ChatGPT‐5 Plus, and 49.5% for SuperGrok 4. Gemini 2.5 Pro demonstrated a significant advantage over ChatGPT‐5 Plus only in Normal mode (p = 0.001), whereas Thinking mode significantly improved performance only for ChatGPT‐5 Plus (p = 0.01). Temporal stability analysis showed high response consistency for Gemini 2.5 Pro and SuperGrok 4 across all modes (r > 0.94, p < 0.001). Conversely, the stability of ChatGPT‐5 Plus decreased from substantial agreement in Normal mode (r = 0.697, p < 0.001) to moderate agreement in Thinking mode (r = 0.539, p < 0.001). Despite their educational potential, these models need refinement to reliably identify anatomical structures on radiographic images.
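
As an illustration of the scoring and consistency measures named in the abstract, the following is a minimal Python sketch, not the authors' analysis code. It assumes that overall accuracy is the summed 0–2 score divided by the maximum attainable score, and that temporal consistency is a linearly weighted Cohen's kappa between two response sessions; the abstract states neither detail. The function names `percent_accuracy` and `weighted_kappa` are hypothetical.

```python
from collections import Counter

CATEGORIES = (0, 1, 2)  # the 0-2 scoring scale mentioned in the abstract


def percent_accuracy(scores):
    """Summed score as a percentage of the maximum attainable score.

    Assumed definition; the abstract does not spell out the formula.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    return 100.0 * sum(scores) / (max(CATEGORIES) * len(scores))


def weighted_kappa(first, second, categories=CATEGORIES):
    """Linearly weighted Cohen's kappa between two rating sessions.

    Linear weights are an assumption; the abstract only says "weighted kappa".
    """
    if not first or len(first) != len(second):
        raise ValueError("sessions must be non-empty and of equal length")
    n = len(first)
    span = max(categories) - min(categories)
    observed = Counter(zip(first, second))  # joint counts of (session 1, session 2)
    row = Counter(first)                    # marginal counts, session 1
    col = Counter(second)                   # marginal counts, session 2

    # Weighted disagreement actually observed.
    obs = sum(abs(i - j) / span * observed[(i, j)] / n
              for i in categories for j in categories)
    # Weighted disagreement expected by chance from the marginals.
    exp = sum(abs(i - j) / span * (row[i] / n) * (col[j] / n)
              for i in categories for j in categories)
    if exp == 0:
        raise ValueError("kappa is undefined when both sessions use one identical category")
    return 1.0 - obs / exp
```

For example, with these definitions `percent_accuracy([2, 2, 1, 0])` evaluates to 62.5, and two identical sessions such as `[0, 1, 2, 2]` give a kappa of 1.0.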