Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology


YÖRÜKOĞLU H. U., AKSU C., Sultan P., Tulgar S.

Medical Science Monitor, vol.32, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 32
  • Publication Date: 2026
  • DOI: 10.12659/msm.951815
  • Journal Name: Medical Science Monitor
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE, MEDLINE
  • Keywords: Anesthesia, Anesthesiology, Artificial Intelligence, Comparative Study, Multilingualism, Natural Language Processing, Patient Education as Topic, Quality of Health Care, Surveys and Questionnaires
  • Kocaeli University Affiliated: Yes

Abstract

Background: With the increasing use of large language model (LLM) chatbots in healthcare, evaluating their ability to provide reliable and understandable information in multiple languages is critical, particularly in fields such as anesthesia, where patient education is essential. The primary aim of this study was to compare the quality of English responses from ChatGPT 4.0 and DeepSeek V3; the secondary aims were to evaluate differences in content and communication quality between English and Turkish responses.

Material/Methods: Anesthesiologists proficient in both languages were recruited as experts. Ten frequently asked questions in anesthesia were selected and translated for evaluation. Responses from ChatGPT 4.0 and DeepSeek V3 in both English and Turkish were assessed for overall quality, content quality (accuracy, comprehensiveness, and safety), and communication quality (understanding, empathy/tone, and ethics), and the evaluators compared the Turkish and English responses.

Results: Eleven experts evaluated the responses. The English responses of ChatGPT 4.0 were superior to the English responses of DeepSeek V3 in overall quality (P<0.001). The English responses of ChatGPT 4.0 were superior to its Turkish responses in terms of overall, content, and communication quality (P<0.001 each). The English responses of DeepSeek V3 were superior to its Turkish responses in terms of overall (P<0.001), content (P<0.001), and communication (P=0.001) quality.

Conclusions: ChatGPT 4.0 performed better than DeepSeek V3 in English in terms of the overall quality of responses to 10 frequently asked questions in the field of anesthesia, and the English responses provided by both ChatGPT 4.0 and DeepSeek V3 outperformed their Turkish responses.