The future of action recognition: are multi-modal visual language models the key?


Gumuskaynak E., Eken S.

Signal, Image and Video Processing, no. 4, 2025 (SCI-Expanded)

Abstract

This study investigates the potential of Visual Language Models (VLMs) for action recognition, a critical task in video analysis. Traditional action recognition methods rely predominantly on visual features and often struggle with complex actions, varied environments, and high intra-class variability. VLMs, which integrate visual and textual data, offer a promising alternative by leveraging contextual information to improve recognition accuracy and robustness. We evaluate several state-of-the-art multi-modal VLMs, including Moondream2, Florence-2-large, PaliGemma-3B, and Meta Chameleon 7B, on the UCF101 and Kinetics-400 action recognition datasets. The models are analyzed without any fine-tuning, providing insight into their out-of-the-box applicability and effectiveness for action recognition. Our results indicate that while these models show substantial potential, fine-tuning and further optimization could unlock even greater performance. This study contributes to the understanding of VLM capabilities in action recognition and highlights directions for future research and development.
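
The abstract describes evaluating off-the-shelf VLMs on action recognition without fine-tuning. As an illustration only (the paper's evaluation code is not reproduced here), the sketch below shows one plausible zero-shot protocol: sample a frame from a clip, caption it with Florence-2-large via Hugging Face Transformers, and map the caption to the nearest class label by simple word overlap. The prompt, the single-frame sampling, the tiny class list, the clip path, and the word-overlap matcher are all assumptions made for illustration, not the authors' protocol.

```python
# Minimal zero-shot sketch (illustrative only): caption one video frame with a VLM
# and guess the action class by word overlap. Not the authors' evaluation pipeline.
import re

import cv2
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-large"  # one of the evaluated models
# Hypothetical, tiny subset of UCF101 class names, for illustration only.
CLASSES = ["ApplyEyeMakeup", "Basketball", "HorseRiding", "PlayingGuitar"]


def middle_frame(video_path: str) -> Image.Image:
    """Grab the middle frame of a clip as a PIL image (assumed sampling strategy)."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))


def caption_frame(image: Image.Image) -> str:
    """Generate a detailed caption with Florence-2, fine-tuning-free."""
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype=torch.float32
    )
    prompt = "<MORE_DETAILED_CAPTION>"  # Florence-2 captioning task token
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
        num_beams=3,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


def guess_class(caption: str) -> str:
    """Pick the class whose CamelCase-split name shares the most words with the
    caption. This matcher is a placeholder, not the mapping used in the paper."""
    words = set(caption.lower().split())

    def score(label: str) -> int:
        parts = [p.lower() for p in re.findall(r"[A-Z][a-z]*", label)]
        return sum(p in words for p in parts)

    return max(CLASSES, key=score)


if __name__ == "__main__":
    img = middle_frame("v_HorseRiding_g01_c01.avi")  # hypothetical UCF101 clip path
    caption = caption_frame(img)
    print("caption:", caption)
    print("predicted class:", guess_class(caption))
```

In practice a study like this would likely caption several frames per clip and use a stronger caption-to-label matcher (e.g., text embeddings) across all 101 or 400 classes; the exact prompting and matching choices in the paper may differ from this sketch.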