Transfer learning for video action recognition: A comparative overview of weight initialization strategies

ÇELİK, ASLI

doi:10.1007/s11760-025-04887-x

Signal, Image and Video Processing, vol. 19, no. 15, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 19, Issue: 15
Publication Date: 2025
DOI: 10.1007/s11760-025-04887-x
Journal Name: Signal, Image and Video Processing
Journal Indexed In: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, zbMATH
Keywords: Action classification, Action recognition, Knowledge distillation, Self-supervised learning, Transfer learning
Affiliated with Kocaeli University: Yes

Abstract

Transfer learning has demonstrated remarkable success in static image tasks; however, its application to video understanding remains significantly more challenging due to the architectural gap between 2D and 3D models, the need to capture temporal dynamics, and the limited availability of large-scale, well-annotated video datasets. In this context, weight initialization becomes a critical factor for effective model training, especially under data-scarce conditions. This work presents a comparative overview of four widely used initialization strategies for video action recognition: 2D-to-3D weight inflation, supervised video pretraining, knowledge distillation from 2D teacher models, and self-supervised learning. Based on reported results from standard benchmarks such as UCF101 and HMDB51, all strategies outperform training from scratch. While supervised pretraining often yields the strongest results on standard benchmarks, alternative strategies can offer competitive or complementary advantages, particularly under specific constraints such as limited dataset size, low domain similarity, or restricted computational budget. By organizing these approaches under a unified perspective, this study clarifies their relative strengths and limitations and offers practical guidance for future research in video action recognition.
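
Of the four strategies named above, 2D-to-3D weight inflation is the easiest to illustrate in a few lines. The sketch below is not taken from the paper. It assumes one common formulation: each 2D convolution kernel is repeated along a new temporal axis and divided by the temporal length, so that a clip of identical frames produces the same response as the original 2D filter on a single frame. The function name `inflate_conv2d_weight`, the tensor layout, and the shapes are illustrative assumptions.

```python
import numpy as np


def inflate_conv2d_weight(w2d: np.ndarray, time_dim: int) -> np.ndarray:
    """Inflate a 2D conv kernel (out_ch, in_ch, kH, kW) to a 3D kernel
    (out_ch, in_ch, T, kH, kW) by temporal repetition and rescaling.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    if w2d.ndim != 4:
        raise ValueError("expected a 4D kernel: (out_ch, in_ch, kH, kW)")
    if time_dim < 1:
        raise ValueError("time_dim must be a positive integer")
    # Copy the spatial kernel to every temporal position.
    w3d = np.repeat(w2d[:, :, np.newaxis, :, :], time_dim, axis=2)
    # Rescale so that summing over time recovers the 2D kernel.
    return w3d / time_dim


rng = np.random.default_rng(0)
w2d = rng.standard_normal((8, 3, 3, 3))       # e.g. an image-pretrained kernel
w3d = inflate_conv2d_weight(w2d, time_dim=5)  # shape (8, 3, 5, 3, 3)

# Sanity check on one receptive field: a static clip (the same patch in
# every frame) should give the same filter response as the 2D kernel.
patch2d = rng.standard_normal((3, 3, 3))                # (in_ch, kH, kW)
patch3d = np.repeat(patch2d[:, np.newaxis], 5, axis=1)  # (in_ch, T, kH, kW)
resp2d = np.einsum("oihw,ihw->o", w2d, patch2d)
resp3d = np.einsum("oithw,ithw->o", w3d, patch3d)
print(np.allclose(resp2d, resp3d))
```

The rescaling by the temporal length is what lets the inflated network start from the image model's behavior on static inputs. Temporal filters that respond to motion still have to be learned during fine-tuning on video.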