ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.

Submitted & under review
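
To make the fusion idea above concrete, here is a minimal PyTorch sketch of an attentive, gated multi-view projector. This is an illustration, not the published ProfVLM code: the class name echoes the paper's module, but the dimensions, the one-pooled-token-per-view simplification, and the gating formulation are all assumptions.

```python
import torch
import torch.nn as nn

class AttentiveGatedProjectorSketch(nn.Module):
    """Hedged sketch of an attentive, gated multi-view projector.
    Not the ProfVLM implementation: all design details here are illustrative."""

    def __init__(self, vid_dim=768, lm_dim=2048, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(vid_dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(vid_dim, vid_dim), nn.Sigmoid())
        self.proj = nn.Linear(vid_dim, lm_dim)

    def forward(self, views):
        # views: (batch, num_views, vid_dim), one pooled feature per camera
        fused, _ = self.attn(views, views, views)  # views attend to each other
        g = self.gate(fused)                       # per-feature gate in [0, 1]
        fused = g * fused + (1.0 - g) * views      # gated residual mix
        return self.proj(fused.mean(dim=1))        # (batch, lm_dim) LM prefix

# Example: fuse one egocentric + two exocentric views into one LM-space token.
tokens = AttentiveGatedProjectorSketch()(torch.randn(2, 3, 768))
print(tokens.shape)  # torch.Size([2, 2048])
```
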
PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains the full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.

2025 IEEE Sport Technology and Research Workshop
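
As a rough illustration of the sampling idea described above, the sketch below splits a clip into contiguous segments and takes an unbroken run of frames inside each one. This is a reading of the abstract, not the released PATS code; the function name, the centering heuristic, and all parameter values are assumptions.

```python
import numpy as np

def pats_like_sampling(num_frames, num_segments, frames_per_segment):
    """Split a clip into contiguous segments and take an unbroken run of
    frames inside each, preserving movement continuity instead of
    scattering frames uniformly across the whole video."""
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    indices = []
    for start, end in zip(bounds[:-1], bounds[1:]):
        run = min(frames_per_segment, end - start)
        offset = start + (end - start - run) // 2  # center the run in the segment
        indices.extend(range(offset, offset + run))
    return indices

# Example: a 300-frame clip, 4 segments, 16 consecutive frames per segment.
print(pats_like_sampling(300, 4, 16))
```
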
SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. Evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.

2025 International Conference on Machine Vision
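
SkillFormer's efficiency claim rests on Low-Rank Adaptation; the generic LoRA wrapper below shows why so few parameters end up trainable. This is a textbook-style sketch, not SkillFormer's code: the rank, scaling, and initialization choices are assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: the pretrained weight stays frozen and only the
    low-rank update B @ A is trained (illustrative, not SkillFormer's
    exact configuration)."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # freeze pretrained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # only A and B: 2 * 8 * 768 = 12288 parameters
```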

Gate-Shift-Pose (GSP): Enhancing Action Recognition in Sports with Skeleton Information

Free University of Bozen-Bolzano
Winter Conference on Applications of Computer Vision (WACV) Workshops 2025

Gate-Shift-Pose


Overview of the GSP (Gate-Shift-Pose) network architecture with two fusion strategies for integrating RGB and skeletal information. Top: In the early-fusion approach, pose data is preprocessed as a Gaussian heatmap and concatenated with RGB frames, forming a four-channel input for the GSF network. Bottom: In the late-fusion approach, RGB frames and skeletal data are processed in separate streams using a GSF network and a Pose network, respectively. Normalized features from each stream are then combined in a fusion layer, followed by multi-head attention and alignment layers to integrate relevant spatio-temporal features before classification.

Abstract

This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model's capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns.

Early Fusion: Gaussian Heatmap for Skeleton-Based Feature Extraction

Gaussian Heatmap Example

In the early-fusion approach, pose information is incorporated by augmenting each RGB frame with an additional channel containing a Gaussian heatmap of pose keypoints. This results in a four-channel input (RGB + pose) per frame, which is processed by a GSF network. This strategy allows the network to learn correlations between pose and appearance features at an early stage, potentially capturing valuable low-level interactions between modalities.

Early-fusion is computationally efficient, as both RGB and pose information are processed jointly by the same feature extractor, eliminating the need for separate processing pipelines. However, by fusing the modalities from the initial layers, this approach may be limited in its ability to capture higher-level semantic interactions between pose and RGB features that emerge in deeper network layers.
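
A minimal sketch of how such a four-channel input can be assembled is shown below, assuming keypoints given in pixel coordinates; the function name, sigma, and image size are illustrative choices, not the paper's.

```python
import numpy as np

def pose_heatmap(keypoints, height, width, sigma=2.0):
    """Render pose keypoints as one Gaussian heatmap channel."""
    ys, xs = np.mgrid[0:height, 0:width]
    heatmap = np.zeros((height, width), dtype=np.float32)
    for x, y in keypoints:                     # (x, y) in pixel coordinates
        g = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2))
        heatmap = np.maximum(heatmap, g)       # keep the strongest response
    return heatmap

rgb = np.random.rand(224, 224, 3).astype(np.float32)    # stand-in RGB frame
kps = [(100, 80), (120, 95)]                             # stand-in keypoints
frame4 = np.dstack([rgb, pose_heatmap(kps, 224, 224)])   # four-channel input
print(frame4.shape)  # (224, 224, 4)
```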

Late Fusion: Pose Model and Alignment Layers


The late-fusion strategy employs a two-stream architecture with separate branches for RGB frames and poses. The RGB stream processes raw visual data with a GSF network, while the pose stream uses a dedicated MLP-based model that maps 34-dimensional joint coordinates (17 joints with x,y coordinates) through three fully connected layers (34→64→128→128 dimensions) with ReLU activations, producing a compact representation of skeletal dynamics.
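
The layer sizes below follow the text directly (34→64→128→128); expressing the stream as an nn.Sequential and the placement of the final activation are assumptions.

```python
import torch.nn as nn

# Pose stream as described: 17 (x, y) joints flattened to a 34-dim vector.
pose_mlp = nn.Sequential(
    nn.Linear(34, 64), nn.ReLU(),
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),   # compact 128-dim skeletal embedding
)
```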

Following independent feature extraction, the RGB and pose features undergo L2 normalization to ensure balanced scaling between modalities. The normalized features are concatenated and processed by a multi-head attention layer, which dynamically emphasizes contextually relevant features across both streams. This attention mechanism enhances the model's sensitivity to critical aspects of each modality within the fused representation.

The attention-enhanced output is refined through a feature refinement module consisting of two sequential linear layers that progressively reduce dimensionality (halving at each stage), with batch normalization, ReLU activation, and dropout applied between layers. This structure provides regularization, compressing the fused features and mitigating noise to produce a more robust and discriminative feature set for final classification.
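
Putting the normalization, attention, and refinement steps together, a sketch of the late-fusion head might look as follows; the feature dimensions, head count, and dropout rate are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LateFusionHeadSketch(nn.Module):
    """Sketch of the late-fusion head: L2-normalize each stream, concatenate,
    apply multi-head attention, then refine with two halving linear layers."""

    def __init__(self, rgb_dim=512, pose_dim=128, num_heads=4, num_classes=2):
        super().__init__()
        d = rgb_dim + pose_dim
        self.attn = nn.MultiheadAttention(d, num_heads, batch_first=True)
        self.refine = nn.Sequential(
            nn.Linear(d, d // 2),
            nn.BatchNorm1d(d // 2), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(d // 2, d // 4),
        )
        self.classifier = nn.Linear(d // 4, num_classes)

    def forward(self, rgb_feat, pose_feat):
        # Balance the modalities before fusing them
        fused = torch.cat([F.normalize(rgb_feat, dim=-1),
                           F.normalize(pose_feat, dim=-1)], dim=-1)
        x = fused.unsqueeze(1)                 # treat as a one-token sequence
        attended, _ = self.attn(x, x, x)
        return self.classifier(self.refine(attended.squeeze(1)))

head = LateFusionHeadSketch()
logits = head(torch.randn(8, 512), torch.randn(8, 128))
print(logits.shape)  # torch.Size([8, 2])
```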

The late-fusion approach enables the model to effectively leverage complementary RGB and pose information, enhancing its ability to distinguish complex movements. Although this architecture introduces additional computational overhead compared to early-fusion, it facilitates contextually aware interactions between modalities, which is particularly advantageous for recognizing complex and dynamic actions in sports like figure skating.

Experimental Results

Comparison with State-of-the-Art

Gate-Shift-Pose achieves its highest accuracy (98.08%) with early-fusion on ResNet50 using a batch size of 4 and 32 segments, demonstrating that integrating RGB and pose data at the input stage is highly effective for models with greater capacity. For the lighter ResNet18 backbone, the best result (95.19%) was achieved with late-fusion using the same batch size and segment count, showing that maintaining separate RGB and pose streams until later stages is advantageous for smaller architectures.

The integration of pose data consistently outperformed the original RGB-only GSF baseline across all configurations. Specifically, the inclusion of skeleton-based features improved accuracy from 67.79% to 95.19% with ResNet18 (a relative improvement of roughly 40%) and from 81.73% to 98.08% with ResNet50 (roughly 20%). These results confirm the effectiveness of pose information in capturing complex motion patterns essential for figure skating action recognition.

Experimental parameters also played a significant role: a batch size of 4 resulted in improved performance due to its stabilizing effect on training with relatively small datasets, while a higher segment count (32) consistently led to better results across both backbones by providing richer temporal context for modeling dynamic motion patterns. In summary, early-fusion is well-suited for larger backbones, enabling effective multimodal integration, whereas late-fusion better supports smaller backbones by keeping RGB and pose in separate streams until later stages, maintaining strong performance despite their more limited capacity.

BibTeX

@InProceedings{Bianchi_2025_WACV,
    author    = {Bianchi, Edoardo and Lanz, Oswald},
    title     = {Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {February},
    year      = {2025},
    pages     = {1257-1264}
}