Signal, Image and Video Processing, vol. 19, no. 15, 2025 (SCI-Expanded, Scopus)
Transfer learning has demonstrated remarkable success in static image tasks; however, its application to video understanding remains significantly more challenging due to the architectural gap between 2D and 3D models, the need to capture temporal dynamics, and the limited availability of large-scale, well-annotated video datasets. In this context, weight initialization becomes a critical factor for effective model training, especially under data-scarce conditions. This work presents a comparative overview of four widely used initialization strategies for video action recognition: 2D-to-3D weight inflation, supervised video pretraining, knowledge distillation from 2D teacher models, and self-supervised learning. Based on results reported on standard benchmarks such as UCF101 and HMDB51, all four strategies outperform training from scratch. While supervised pretraining often yields the strongest results on these benchmarks, the alternative strategies can offer competitive or complementary advantages under specific constraints such as limited dataset size, low domain similarity, or a restricted computational budget. By organizing these approaches under a unified perspective, this study clarifies their relative strengths and limitations and offers practical guidance for future research in video action recognition.
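Of the four strategies, 2D-to-3D weight inflation is the most mechanical: a pretrained 2D convolution kernel is replicated along a new temporal axis and rescaled so that a temporally constant ("boring") video produces the same activations as the original 2D network would on a single frame. The following is a minimal NumPy sketch of this idea, not code from the paper; the function name and shapes are illustrative assumptions.

```python
import numpy as np

def inflate_2d_to_3d(w2d, t):
    """Inflate a 2D conv kernel to 3D along a new temporal axis.

    w2d: pretrained 2D weights of shape (out_ch, in_ch, kH, kW).
    t:   temporal kernel size of the target 3D convolution.

    Each 2D kernel is repeated t times along the temporal axis and
    divided by t, so that convolving a video whose frames are all
    identical reproduces the 2D network's response on one frame.
    """
    return np.repeat(w2d[:, :, None, :, :], t, axis=2) / t

# Tiny demo: inflate 8 RGB 3x3 kernels to temporal depth 3.
w2d = np.random.randn(8, 3, 3, 3)
w3d = inflate_2d_to_3d(w2d, 3)
assert w3d.shape == (8, 3, 3, 3, 3)
# Summing over the temporal axis recovers the original 2D kernel.
assert np.allclose(w3d.sum(axis=2), w2d)
```

The 1/t rescaling is the standard choice because it preserves activation magnitudes at initialization; without it, inflated filters would respond t times more strongly than their 2D counterparts.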