The future of action recognition: are multi-modal visual language models the key?


Gumuskaynak E., EKEN S.

SIGNAL IMAGE AND VIDEO PROCESSING, vol.19, no.4, 2025 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 19 Issue: 4
  • Publication Date: 2025
  • DOI Number: 10.1007/s11760-025-03951-w
  • Journal Name: SIGNAL IMAGE AND VIDEO PROCESSING
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, zbMATH
  • Kocaeli University Affiliated: Yes

Abstract

This study investigates the potential of Visual Language Models (VLMs) for action recognition, a critical task in video analysis. Traditional action recognition methods rely predominantly on visual features and often struggle with challenges such as complex actions, varied environments, and high intra-class variability. VLMs, which integrate visual and textual data, offer a promising alternative by leveraging contextual information to enhance recognition accuracy and robustness. We evaluate several state-of-the-art multi-modal VLMs, including Moondream2, Florence-2-large, PaliGemma-3B, and Meta Chameleon 7B, on the UCF101 and Kinetics-400 action recognition datasets. The performance of these models is analyzed in their fine-tuning-free states, providing insights into their applicability and effectiveness in action recognition tasks. Our results indicate that while these models demonstrate substantial potential, further fine-tuning and optimization could unlock even greater performance. This study contributes to the understanding of VLMs' capabilities in action recognition and highlights areas for future research and development.
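
To make the fine-tuning-free protocol concrete, the sketch below shows one plausible way to run a zero-shot, frame-level evaluation with one of the listed models. Everything beyond the model name is an assumption rather than the paper's method: the public microsoft/Florence-2-large checkpoint on Hugging Face, middle-frame sampling via OpenCV, and a naive substring match between the generated caption and the ground-truth class name.

```python
# Minimal sketch: zero-shot (fine-tuning-free) action recognition with a VLM.
# Assumptions (not taken from the paper): one middle frame per clip, the
# microsoft/Florence-2-large checkpoint, and a prediction counted as correct
# when the ground-truth class name appears in the generated caption.
import cv2
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-large"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

def middle_frame(video_path: str) -> Image.Image:
    """Grab the middle frame of a video as a PIL image."""
    cap = cv2.VideoCapture(video_path)
    n = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, n // 2)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        raise IOError(f"could not read a frame from {video_path}")
    return Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

def describe(image: Image.Image) -> str:
    """Ask the VLM for a detailed caption of the frame."""
    inputs = processor(text="<MORE_DETAILED_CAPTION>", images=image,
                       return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=64,
        )
    return processor.batch_decode(ids, skip_special_tokens=True)[0]

def accuracy(samples: list[tuple[str, str]]) -> float:
    """Evaluate (video_path, class_name) pairs, e.g. drawn from UCF101."""
    hits = 0
    for path, label in samples:
        caption = describe(middle_frame(path)).lower()
        # Naive matching heuristic: "Apply_Eye_Makeup" -> "apply eye makeup".
        hits += label.replace("_", " ").lower() in caption
    return hits / len(samples)
```

Caption-to-label substring matching is deliberately crude; a real evaluation would more likely score class names by prompt-constrained generation or embedding similarity, and would sample several frames per clip rather than one.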