The future of action recognition: are multi-modal visual language models the key?


Gumuskaynak E., Eken S.

Signal, Image and Video Processing, no. 4, 2025 (SCI-Expanded)

Abstract

This study investigates the potential of Visual Language Models (VLMs) for action recognition, a critical task in video analysis. Traditional action recognition methods rely predominantly on visual features and often struggle with complex actions, varied environments, and high intra-class variability. VLMs, which integrate visual and textual data, offer a promising alternative by leveraging contextual information to improve recognition accuracy and robustness. We evaluate several state-of-the-art multi-modal VLMs, including Moondream2, Florence-2-large, PaliGemma-3B, and Meta Chameleon 7B, on the UCF101 and Kinetics-400 action recognition datasets. The models are analyzed without any fine-tuning, providing insight into their out-of-the-box applicability and effectiveness for action recognition. Our results indicate that while these models show substantial potential, fine-tuning and further optimization could unlock even greater performance. This study contributes to the understanding of VLM capabilities in action recognition and highlights directions for future research and development.
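
The abstract describes evaluating off-the-shelf VLMs on action recognition without fine-tuning. As an illustration only (the paper's evaluation code is not reproduced here), the sketch below shows one plausible zero-shot protocol: sample a frame from a clip, caption it with Florence-2-large via Hugging Face Transformers, and map the caption to the nearest class label by simple word overlap. The prompt, the single-frame sampling, the tiny class list, the clip path, and the word-overlap matcher are all assumptions made for illustration, not the authors' protocol.

```python
# Minimal zero-shot sketch (illustrative only): caption one video frame with a VLM
# and guess the action class by word overlap. Not the authors' evaluation pipeline.
import re

import cv2
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-large"  # one of the evaluated models
# Hypothetical, tiny subset of UCF101 class names, for illustration only.
CLASSES = ["ApplyEyeMakeup", "Basketball", "HorseRiding", "PlayingGuitar"]


def middle_frame(video_path: str) -> Image.Image:
    """Grab the middle frame of a clip as a PIL image (assumed sampling strategy)."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"Could not read a frame from {video_path}")
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))


def caption_frame(image: Image.Image) -> str:
    """Generate a detailed caption with Florence-2, fine-tuning-free."""
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, trust_remote_code=True, torch_dtype=torch.float32
    )
    prompt = "<MORE_DETAILED_CAPTION>"  # Florence-2 captioning task token
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=128,
        num_beams=3,
    )
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


def guess_class(caption: str) -> str:
    """Pick the class whose CamelCase-split name shares the most words with the
    caption. This matcher is a placeholder, not the mapping used in the paper."""
    words = set(caption.lower().split())

    def score(label: str) -> int:
        parts = [p.lower() for p in re.findall(r"[A-Z][a-z]*", label)]
        return sum(p in words for p in parts)

    return max(CLASSES, key=score)


if __name__ == "__main__":
    img = middle_frame("v_HorseRiding_g01_c01.avi")  # hypothetical UCF101 clip path
    caption = caption_frame(img)
    print("caption:", caption)
    print("predicted class:", guess_class(caption))
```

In practice a study like this would likely caption several frames per clip and use a stronger caption-to-label matcher (e.g., text embeddings) across all 101 or 400 classes; the exact prompting and matching choices in the paper may differ from this sketch.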