Do Large Language Models Perform Equally Across Languages? A Comparison of Responses to Frequently Asked Questions in Anesthesiology

YÖRÜKOĞLU, HADİ; AKSU, CAN; Sultan, Pervez; Tulgar, Serkan

doi:10.12659/msm.951815

YÖRÜKOĞLU H. U., AKSU C., Sultan P., Tulgar S.

Medical Science Monitor, vol. 32, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 32
Publication Date: 2026
DOI: 10.12659/msm.951815
Journal Name: Medical Science Monitor
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, EMBASE, MEDLINE
Keywords: Anesthesia, Anesthesiology, Artificial Intelligence, Comparative Study, Multilingualism, Natural Language Processing, Patient Education as Topic, Quality of Health Care, Surveys and Questionnaires
Kocaeli University Affiliated: Yes

Abstract

Background: With the increasing use of large language model (LLM) chatbots in healthcare, evaluating their ability to provide reliable and understandable information in multiple languages is critical, particularly in fields such as anesthesia, where patient education is essential. The study primarily aimed to compare the quality of ChatGPT 4.0’s and DeepSeek V3’s English responses, with secondary aims to evaluate content and communication differences between English and Turkish responses.

Material/Methods: Anesthesiologists proficient in both languages were recruited as experts. Ten frequently asked questions in anesthesia were selected and translated for evaluation. Responses from ChatGPT 4.0 and DeepSeek V3 in both English and Turkish were assessed for overall quality, content quality (accuracy, comprehensiveness, and safety), and communication quality (understanding, empathy/tone, and ethics), and the evaluators compared the Turkish and English responses.

Results: Eleven experts evaluated the responses. The English responses of ChatGPT 4.0 were superior to the English responses of DeepSeek V3 in overall quality (P<0.001). The English responses of ChatGPT 4.0 were superior to its Turkish responses in terms of overall, content, and communication quality (P<0.001 each), and the English responses of DeepSeek V3 were superior to its Turkish responses in terms of overall (P<0.001), content (P<0.001), and communication (P=0.001) quality.

Conclusions: ChatGPT 4.0 performed better than DeepSeek V3 in English in terms of overall quality of responses to 10 frequently asked questions in the field of anesthesia, and the English responses provided by ChatGPT 4.0 and DeepSeek V3 outperformed the Turkish responses.