Exploring clinical feasibility of zero-shot learning in a large language model for immunofixation electrophoresis image interpretation


ÖZTAŞ B., KÖSESOY İ.

Clinica Chimica Acta, vol. 587, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 587
  • Publication Date: 2026
  • DOI: 10.1016/j.cca.2026.120974
  • Journal Name: Clinica Chimica Acta
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Chemical Abstracts Core, Chimica, EMBASE
  • Keywords: Immunofixation electrophoresis, Monoclonal gammopathy, Visual question answering, Zero-shot learning, Zero-shot visual question answering
  • Kocaeli University Affiliated: Yes

Abstract

Introduction: Immunofixation electrophoresis (IFE) is the gold-standard method for detecting and typing monoclonal immunoglobulins, but its interpretation requires expert evaluation and remains susceptible to subjectivity and variability. Recent advances in multimodal large language models (LLMs) have enabled zero-shot visual reasoning without task-specific training. This study aimed to evaluate the feasibility and diagnostic performance of zero-shot multimodal LLMs for automated interpretation of IFE images.

Materials and methods: A dataset of 487 immunofixation electrophoresis images representing ten semantic classes (AK, AL, GK, GL, K, L, LGL, MK, ML, and NONE) was retrospectively collected and annotated by expert laboratory specialists. Two multimodal LLMs, ChatGPT-5.2 and Gemini3 Pro, were evaluated using a zero-shot visual question answering (ZS-VQA) framework without additional training or fine-tuning. Each image was analyzed under two prompt configurations: a simple prompt and a detailed prompt with structured diagnostic guidance. Model performance was assessed using precision, recall, F1-score, and confusion matrix analysis.

Results: Gemini outperformed ChatGPT across most classes, particularly with detailed prompts, achieving high F1-scores in clinically relevant monoclonal categories such as MK (89.80), AK (84.75), and GK (71.54). Detailed prompting consistently improved recall and F1-scores, underscoring the importance of prompt design. In contrast, ChatGPT showed lower recall, F1-scores, and classification consistency. Both models performed less effectively on visually subtle patterns.

Conclusions: Zero-shot multimodal LLMs, particularly Gemini, show promising potential for interpreting IFE images without task-specific training. However, performance variability and limitations in certain classes indicate that further optimization and validation are required before clinical implementation.
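The evaluation described above scores each model's free-form answer as a ten-way classification and reports per-class precision, recall, and F1. As an illustration only (the paper's actual scoring pipeline is not published here), the computation can be sketched in plain Python; the class labels mirror the ten IFE categories named in the abstract, and the toy predictions are invented for demonstration:

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} in percent, one-vs-rest per class."""
    scores = {}
    for label in labels:
        # Count true positives, false positives, and false negatives for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (100 * precision, 100 * recall, 100 * f1)
    return scores

# The ten semantic classes from the study
LABELS = ["AK", "AL", "GK", "GL", "K", "L", "LGL", "MK", "ML", "NONE"]

# Toy data (hypothetical): expert annotations vs. model answers
y_true = ["MK", "MK", "GK", "NONE", "AK"]
y_pred = ["MK", "GK", "GK", "NONE", "AK"]

for lab, (p, r, f) in per_class_scores(y_true, y_pred, LABELS).items():
    if p or r or f:
        print(f"{lab}: P={p:.1f} R={r:.1f} F1={f:.1f}")
```

F1 combines precision and recall as their harmonic mean, which is why the abstract's detailed-prompt improvements in recall translate directly into higher F1-scores.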