ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features, projected from a frozen TimeSformer backbone into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.

Submitted & under review
PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains the full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.
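As a rough illustration of the core idea (not the released PATS implementation), the sketch below draws several contiguous frame windows spread across a clip, so each window keeps an unbroken stretch of motion instead of scattering isolated frames; the segment boundaries and window length are assumptions made for exposition.

# Illustrative sketch: contiguous windows drawn from multiple temporal segments.
# Window placement and length are assumptions, not the exact PATS procedure.
def contiguous_multi_segment_indices(num_frames: int, num_segments: int, window: int):
    indices = []
    seg_len = num_frames // num_segments
    for s in range(num_segments):
        start = s * seg_len
        # center the window inside the segment, clamped to the valid frame range
        offset = min(start + max((seg_len - window) // 2, 0), num_frames - window)
        indices.append(list(range(offset, offset + window)))
    return indices

# e.g. a 300-frame clip, 3 segments, 16-frame windows
print(contiguous_multi_segment_indices(300, 3, 16))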

2025 IEEE Sport Technology and Research Workshop
Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model's capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns.
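To make the early-fusion input concrete, here is a minimal sketch (assumed shapes and sigma, not the paper's exact preprocessing) that renders pose keypoints as Gaussian heatmaps and stacks them with the RGB frame along the channel axis:

import numpy as np

def keypoint_heatmaps(keypoints, height, width, sigma=2.0):
    # keypoints: iterable of (x, y) pixel coordinates
    ys, xs = np.mgrid[0:height, 0:width]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) for x, y in keypoints]
    return np.stack(maps, axis=0)                        # (K, H, W)

rgb = np.random.rand(3, 224, 224).astype(np.float32)     # placeholder RGB frame
pose = keypoint_heatmaps([(100, 80), (120, 150)], 224, 224).astype(np.float32)
early_fused = np.concatenate([rgb, pose], axis=0)         # (3 + K, H, W) network input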

2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)

SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation

Free University of Bozen-Bolzano
International Machine Vision Conference 2025

SkillFormer


Multi-view video inputs (one egocentric and up to four exocentric) are processed through a shared TimeSformer backbone fine-tuned with LoRA. Features are fused using the CrossViewFusion module and passed to a classification head.
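A minimal sketch of this pipeline, with hypothetical module and argument names (the shared backbone, fusion module, and classification head follow the caption; everything else is illustrative):

import torch
import torch.nn as nn

class SkillFormerSketch(nn.Module):
    def __init__(self, backbone, fusion, num_classes, hidden_dim=1536):
        super().__init__()
        self.backbone = backbone      # shared TimeSformer, fine-tuned with LoRA
        self.fusion = fusion          # CrossViewFusion (see the module sketch below)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, views):
        # views: list of V clips, each (B, T, C, H, W) -- 1 egocentric + up to 4 exocentric
        feats = [self.backbone(v) for v in views]   # V tensors of shape (B, d)
        x = torch.stack(feats, dim=1)               # (B, V, d) view-specific features
        fused = self.fusion(x)                      # (B, hidden_dim) fused representation
        return self.head(fused)                     # (B, num_classes) proficiency logits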

Abstract

Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. In fact, when evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.

CrossViewFusion Module


Detailed architecture of the CrossViewFusion module. Input features (B,V,d) undergo: (1) Layer normalization per view, (2) Multi-head cross-attention enabling each view to attend to all others, (3) View aggregation via mean pooling, (4) Feed-forward transformation with GELU activation, (5) Learnable gating mechanism g = σ(Linear(h)) for selective feature modulation, (6) Final projection, and (7) Adaptive self-calibration using learnable statistics to align with classification space.
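The following PyTorch sketch maps those seven steps onto code; the dimensions and the exact calibration rule are assumptions for illustration, not the released implementation.

import torch
import torch.nn as nn

class CrossViewFusionSketch(nn.Module):
    def __init__(self, d=768, hidden=1536, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(d)                                     # (1) per-view layer norm
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)   # (2) cross-attention over views
        self.ffn = nn.Sequential(nn.Linear(d, hidden), nn.GELU())       # (4) feed-forward + GELU
        self.gate = nn.Linear(hidden, hidden)                           # (5) learnable gate
        self.proj = nn.Linear(hidden, hidden)                           # (6) final projection
        self.mu = nn.Parameter(torch.zeros(hidden))                     # (7) learnable statistics
        self.sigma = nn.Parameter(torch.ones(hidden))

    def forward(self, x):                 # x: (B, V, d) view-specific features
        h = self.norm(x)
        h, _ = self.attn(h, h, h)         # each view attends to all others
        h = h.mean(dim=1)                 # (3) aggregate views -> (B, d)
        h = self.ffn(h)
        g = torch.sigmoid(self.gate(h))   # g = sigma(Linear(h))
        h = self.proj(g * h)              # selective modulation, then projection
        h = (h - h.mean(-1, keepdim=True)) / (h.std(-1, keepdim=True) + 1e-6)
        return self.sigma * h + self.mu   # calibrate toward the classification space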

Comparison with State-of-the-Art Methods


SkillFormer achieves state-of-the-art classification accuracy in both Exos (46.3%) and Ego+Exos (47.5%) settings, outperforming the best TimeSformer baseline by up to 16.4%. In contrast to baselines—which train separate models for egocentric and exocentric inputs and perform late fusion at inference—SkillFormer uses a single unified model for each configuration, simplifying the architecture and inference pipeline.

Beyond improved accuracy, SkillFormer demonstrates exceptional computational efficiency, using 4.5x fewer trainable parameters (27M vs. 121M) and requiring 3.75x fewer training epochs (4 vs. 15) compared to TimeSformer baselines. Furthermore, we do not apply multi-crop testing, reducing computational overhead while maintaining superior performance. These results highlight SkillFormer's ability to achieve both higher accuracy and greater compute-efficiency.

It is worth noting that the proficiency label distribution is notably imbalanced, skewed toward intermediate and late experts due to targeted recruitment of skilled participants. This may bias overall accuracy by underrepresenting novice classes. Random and majority-class baselines perform significantly worse (24.9% and 31.1% respectively), underscoring the inherent complexity of the skill assessment task.

Per-Scenario Performance


SkillFormer consistently outperforms baseline models in the Ego+Exos setting for structured and physically grounded activities such as Basketball (77.88%) and Cooking (60.53%). These domains benefit from synchronized egocentric and exocentric perspectives, which enable better modeling of spatial layouts and temporally extended actions. The fusion of multi-view signals allows SkillFormer to exploit cross-perspective cues, such as object-hand interactions and full-body movement trajectories.

Interestingly, view-specific advantages emerge in certain domains. In Music, the Ego-only configuration achieves the highest accuracy (72.41%), suggesting that head-mounted views sufficiently capture detailed instrument manipulation. In Bouldering, the Exos-only configuration outperforms with 33.52% accuracy, indicating that third-person perspectives better capture full-body spatial positioning and climbing technique assessment. This highlights SkillFormer's flexibility to adapt to view-specific signal quality.

However, subjective domains like Dancing reveal limitations: SkillFormer's Ego+Exos accuracy (13.68%) falls significantly below both the majority baseline (51.61%) and the baseline Ego model (55.65%). This indicates that tasks with high intra-class variability and weak structure may not benefit from multi-view fusion, or may require additional modalities such as audio to disambiguate subtle skill indicators. These trends underscore that SkillFormer is particularly effective in domains requiring precise spatial-temporal reasoning and multi-view integration.

Architecture Configuration


As the number of views increases, our design prioritizes efficiency without compromising accuracy. We strategically reduce the number of frames per view (32→16)—preserving the temporal span with fewer sampled tokens—while proportionally increasing the LoRA rank (32→64), alpha (64→128), and hidden dimension (1536→2560). This trade-off compensates for reduced per-view tokens by enabling richer cross-view transformations through enhanced adapter capacity.
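The quoted numbers correspond roughly to the following two operating points (the labels and the pairing of values across configurations are our reading of the text, shown only to make the trade-off explicit):

# Illustrative configurations; only the values quoted above are taken from the text.
CONFIGS = {
    "single_view": {"frames_per_view": 32, "lora_rank": 32, "lora_alpha": 64,  "hidden_dim": 1536},
    "multi_view":  {"frames_per_view": 16, "lora_rank": 64, "lora_alpha": 128, "hidden_dim": 2560},
}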

Our choice of LoRA rank and fusion dimensionality is not arbitrary. Higher ranks allow the adapter to express richer transformations across views, compensating for the reduced token budget. Empirically, increasing these parameters moderately yields significant gains in accuracy while maintaining training efficiency within tractable computational budgets. SkillFormer-Ego+Exos achieves 47.5% accuracy with only 27M trainable parameters—a 4.5x reduction compared to full fine-tuning of the 121M parameter TimeSformer backbone—demonstrating the effectiveness of low-rank adaptation and targeted fusion.
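As a back-of-the-envelope check on why raising the rank stays cheap, a LoRA adapter on one weight matrix adds roughly rank * (d_in + d_out) trainable parameters, so doubling the rank doubles only the adapter cost while the frozen backbone is untouched; the numbers below are illustrative, not the exact SkillFormer budget.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    # trainable parameters of one low-rank adapter: A (d_in x r) plus B (r x d_out)
    return rank * (d_in + d_out)

print(lora_params(768, 768, rank=32))   # 49,152 per adapted projection
print(lora_params(768, 768, rank=64))   # 98,304 per adapted projection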

This design reflects a key motivation behind SkillFormer: enabling scalable, parameter-efficient skill recognition across multi-view egocentric and exocentric inputs. By balancing frame sampling, adapter capacity, and fusion complexity, we achieve state-of-the-art performance while maintaining computational tractability for real-world deployment scenarios.

BibTeX

@misc{bianchi2025skillformerunifiedmultiviewvideo,
      title={SkillFormer: Unified Multi-View Video Understanding for Proficiency Estimation}, 
      author={Edoardo Bianchi and Antonio Liotta},
      year={2025},
      eprint={2505.08665},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2505.08665}, 
}