Exploring clinical feasibility of zero-shot learning in a large language model for immunofixation electrophoresis image interpretation


ÖZTAŞ B., KÖSESOY İ.

Clinica Chimica Acta, vol. 587, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 587
  • Publication Date: 2026
  • DOI: 10.1016/j.cca.2026.120974
  • Journal Name: Clinica Chimica Acta
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Chemical Abstracts Core, Chimica, EMBASE
  • Keywords: Immunofixation electrophoresis, Monoclonal gammopathy, Visual question answering, Zero-shot learning, Zero-shot visual question answering
  • Kocaeli University Affiliated: Yes

Abstract

Introduction: Immunofixation electrophoresis (IFE) is the gold-standard method for detecting and typing monoclonal immunoglobulins, but its interpretation requires expert evaluation and remains susceptible to subjectivity and variability. Recent advances in multimodal large language models (LLMs) have enabled zero-shot visual reasoning without task-specific training. This study aimed to evaluate the feasibility and diagnostic performance of zero-shot multimodal LLMs for automated interpretation of IFE images.

Materials and methods: A dataset of 487 immunofixation electrophoresis images representing ten semantic classes (AK, AL, GK, GL, K, L, LGL, MK, ML, and NONE) was retrospectively collected and annotated by expert laboratory specialists. Two multimodal LLMs, ChatGPT-5.2 and Gemini3 Pro, were evaluated using a zero-shot visual question answering (ZS-VQA) framework without additional training or fine-tuning. Each image was analyzed under two prompt configurations: a simple prompt and a detailed prompt with structured diagnostic guidance. Model performance was assessed using precision, recall, F1-score, and confusion matrix analysis.

Results: Gemini outperformed ChatGPT across most classes, particularly with detailed prompts, achieving high F1-scores in clinically relevant monoclonal categories such as MK (89.80), AK (84.75), and GK (71.54). Detailed prompting consistently improved recall and F1-scores, underscoring the importance of prompt design. In contrast, ChatGPT showed lower recall, F1-scores, and classification consistency. Both models performed less effectively on visually subtle patterns.

Conclusions: Zero-shot multimodal LLMs, particularly Gemini, show promising potential for interpreting IFE images without task-specific training. However, performance variability and limitations in certain classes indicate that further optimization and validation are required before clinical implementation.
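The evaluation described above scores each model's free-form answer as a ten-way classification and reports per-class precision, recall, and F1. As an illustration only (the paper's actual scoring pipeline is not published here), the computation can be sketched in plain Python; the class labels mirror the ten IFE categories named in the abstract, and the toy predictions are invented for demonstration:

```python
def per_class_scores(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} in percent, one-vs-rest per class."""
    scores = {}
    for label in labels:
        # Count true positives, false positives, and false negatives for this class
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[label] = (100 * precision, 100 * recall, 100 * f1)
    return scores

# The ten semantic classes from the study
LABELS = ["AK", "AL", "GK", "GL", "K", "L", "LGL", "MK", "ML", "NONE"]

# Toy data (hypothetical): expert annotations vs. model answers
y_true = ["MK", "MK", "GK", "NONE", "AK"]
y_pred = ["MK", "GK", "GK", "NONE", "AK"]

for lab, (p, r, f) in per_class_scores(y_true, y_pred, LABELS).items():
    if p or r or f:
        print(f"{lab}: P={p:.1f} R={r:.1f} F1={f:.1f}")
```

F1 combines precision and recall as their harmonic mean, which is why the abstract's detailed-prompt improvements in recall translate directly into higher F1-scores.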