ProfVLM: A Lightweight Video-Language Model for Multi-View Proficiency Estimation

Existing approaches to skill proficiency estimation often rely on black-box video classifiers, ignoring multi-view context and lacking explainability. We present ProfVLM, a compact vision-language model that reformulates this task as generative reasoning: it jointly predicts skill level and generates expert-like feedback from egocentric and exocentric videos. Central to our method is an AttentiveGatedProjector that dynamically fuses multi-view features from a frozen TimeSformer backbone and projects them into a language model tuned for feedback generation. Trained on EgoExo4D with expert commentaries, ProfVLM surpasses state-of-the-art methods while using up to 20x fewer parameters and reducing training time by up to 60%. Our approach not only achieves superior accuracy across diverse activities, but also outputs natural language critiques aligned with performance, offering transparent reasoning. These results highlight generative vision-language modeling as a powerful new direction for skill assessment.

Submitted & under review
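
To make the fusion idea concrete, here is a minimal, hypothetical sketch of an attentive gated projector in Python (PyTorch): per-view features from a frozen backbone are cross-attended, blended through a learned gate, and projected into the language model's embedding space. The module layout, dimensions, and mean-pooling over views are illustrative assumptions, not the paper's exact architecture.

import torch
import torch.nn as nn

class AttentiveGatedProjector(nn.Module):
    # Hypothetical sketch: fuses per-view features and maps them to LM space.
    def __init__(self, feat_dim=768, lm_dim=2048, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, n_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.Sigmoid())
        self.proj = nn.Linear(feat_dim, lm_dim)

    def forward(self, views):
        # views: (batch, n_views, feat_dim), one pooled feature per camera view
        attended, _ = self.attn(views, views, views)  # each view attends to all views
        g = self.gate(torch.cat([views, attended], dim=-1))
        fused = g * attended + (1 - g) * views        # gated residual fusion
        return self.proj(fused.mean(dim=1))           # (batch, lm_dim) token for the LM

feats = torch.randn(4, 3, 768)                        # e.g. 1 egocentric + 2 exocentric views
print(AttentiveGatedProjector()(feats).shape)         # torch.Size([4, 2048])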
Gate-Shift-Pose: Enhancing Action Recognition in Sports with Skeleton Information

This paper introduces Gate-Shift-Pose, an enhanced version of Gate-Shift-Fuse networks, designed for athlete fall classification in figure skating by integrating skeleton pose data alongside RGB frames. We evaluate two fusion strategies: early-fusion, which combines RGB frames with Gaussian heatmaps of pose keypoints at the input stage, and late-fusion, which employs a multi-stream architecture with attention mechanisms to combine RGB and pose features. Experiments on the FR-FS dataset demonstrate that Gate-Shift-Pose significantly outperforms the RGB-only baseline, improving accuracy by up to 40% with ResNet18 and 20% with ResNet50. Early-fusion achieves the highest accuracy (98.08%) with ResNet50, leveraging the model's capacity for effective multimodal integration, while late-fusion is better suited for lighter backbones like ResNet18. These results highlight the potential of multimodal architectures for sports action recognition and the critical role of skeleton pose information in capturing complex motion patterns.

2025 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW)
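
For intuition, the following sketch (Python/NumPy) builds the kind of early-fusion input described above: RGB channels are stacked with one Gaussian heatmap per pose keypoint, so the backbone receives 3 + K input channels. The keypoint coordinates and sigma value are assumptions for illustration.

import numpy as np

def keypoint_heatmaps(keypoints, h, w, sigma=4.0):
    # One Gaussian heatmap per (x, y) keypoint; returns shape (K, h, w).
    ys, xs = np.mgrid[0:h, 0:w]
    maps = [np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
            for x, y in keypoints]
    return np.stack(maps).astype(np.float32)

def early_fuse(frame, keypoints):
    # Concatenate RGB (3, h, w) with pose heatmaps along the channel axis.
    h, w = frame.shape[1:]
    return np.concatenate([frame, keypoint_heatmaps(keypoints, h, w)], axis=0)

frame = np.random.rand(3, 224, 224).astype(np.float32)   # dummy RGB frame
kpts = [(112, 60), (100, 120), (124, 120)]               # hypothetical keypoints
print(early_fuse(frame, kpts).shape)                     # (6, 224, 224)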
SkillFormer: Unified Multi-View Proficiency Estimation

Assessing human skill levels in complex activities is a challenging problem with applications in sports, rehabilitation, and training. In this work, we present SkillFormer, a parameter-efficient architecture for unified multi-view proficiency estimation from egocentric and exocentric videos. Building on the TimeSformer backbone, SkillFormer introduces a CrossViewFusion module that fuses view-specific features using multi-head cross-attention, learnable gating, and adaptive self-calibration. We leverage Low-Rank Adaptation to fine-tune only a small subset of parameters, significantly reducing training costs. Evaluated on the EgoExo4D dataset, SkillFormer achieves state-of-the-art accuracy in multi-view settings while demonstrating remarkable computational efficiency, using 4.5x fewer parameters and requiring 3.75x fewer training epochs than prior baselines. It excels in multiple structured tasks, confirming the value of multi-view integration for fine-grained skill assessment.

2025 International Conference on Machine Vision
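
The Low-Rank Adaptation step can be sketched in a few lines of Python (PyTorch): a frozen pretrained weight is augmented with a trainable low-rank update B·A, so only a small fraction of parameters is updated. The rank and scaling below are illustrative defaults, not SkillFormer's actual settings.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # Sketch of LoRA: y = W x + (alpha / r) * B A x, with W frozen.
    def __init__(self, base, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():              # freeze pretrained weights
            p.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(768, 768))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 12288 trainable vs. 590592 frozen parameters in the dense layer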

PATS: Proficiency-Aware Temporal Sampling for Multi-View Sports Skill Assessment

Free University of Bozen-Bolzano
IEEE International Workshop on Sport, Technology and Research (STAR) 2025

PATS


PATS is a novel video sampling strategy designed specifically for automated sports skill assessment. Unlike traditional methods that randomly sample frames or use uniform intervals, PATS preserves complete fundamental movements within continuous temporal segments.

Abstract

Automated sports skill assessment requires capturing fundamental movement patterns that distinguish expert from novice performance, yet current video sampling methods disrupt the temporal continuity essential for proficiency evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling (PATS), a novel sampling strategy that preserves complete fundamental movements within continuous temporal segments for multi-view skill assessment. PATS adaptively segments videos to ensure each analyzed portion contains full execution of critical performance components, repeating this process across multiple segments to maximize information coverage while maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with SkillFormer, PATS surpasses state-of-the-art accuracy across all viewing configurations (+0.65% to +3.05%) and delivers substantial gains in challenging domains (+26.22% bouldering, +2.39% music, +1.13% basketball). Systematic analysis reveals that PATS successfully adapts to diverse activity characteristics, from high-frequency sampling for dynamic sports to fine-grained segmentation for sequential skills, demonstrating its effectiveness as an adaptive approach to temporal sampling that advances automated skill assessment for real-world applications.

Proficiency-Aware Temporal Sampling

PATS Example

In this example configuration, PATS extracts Ntarget = 32 frames from Ns = 2 continuous temporal segments of duration ds = 3 s from a 10 s video. Within each segment, ⌊Ntarget/Ns⌋ = 16 frames are sampled uniformly (red vertical lines), preserving temporal continuity within segments. Automatic spacing between segments prevents overlap and ensures comprehensive temporal coverage. This configuration is used in the basketball and bouldering domains.
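
A minimal Python sketch of this segment-then-sample procedure, assuming evenly spaced, non-overlapping segments (the paper's exact placement rule may differ):

import numpy as np

def pats_sample(video_dur, fps, n_target=32, n_segments=2, seg_dur=3.0):
    # Returns n_target frame indices drawn from n_segments continuous windows.
    per_seg = n_target // n_segments                  # floor(Ntarget / Ns) per segment
    gap = (video_dur - n_segments * seg_dur) / (n_segments + 1)  # even spacing, no overlap
    indices = []
    for s in range(n_segments):
        start = gap + s * (seg_dur + gap)             # segment start time in seconds
        times = np.linspace(start, start + seg_dur, per_seg)  # uniform within segment
        indices.extend(int(t * fps) for t in times)
    return indices

idx = pats_sample(video_dur=10.0, fps=30)
print(len(idx))   # 32 frames total, 16 from each 3 s segment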

Comparison with State-of-the-Art Methods

PATS Overall Accuracy

SkillFormer+PATS surpasses state-of-the-art accuracy across all viewing configurations. Our approach delivers consistent improvements over the original SkillFormer: 47.3% accuracy for egocentric views (+3.05%), 46.6% for exocentric views (+0.65%), and 48.0% for combined views (+1.05%). These gains are achieved while maintaining computational efficiency, with 14-27M parameters and 4 training epochs.

Optimal Configuration Patterns

Systematic analysis reveals that PATS adapts to activity characteristics: a budget of 32 frames proves universally optimal; sampling rates vary with dynamics (4.0-5.33 FPS for dynamic sports vs. 0.89 FPS for sequential skills); view selection matches skill type (egocentric for proprioceptive skills, fused views for technique-based ones); and the number of segments correlates inversely with action continuity (2-12 segments).
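
As a quick sanity check on these rates, the within-segment sampling rate is simply frames-per-segment divided by segment duration; the example configuration above (16 frames over a 3 s segment) reproduces the 5.33 FPS upper end of the dynamic range. The parameters behind the other reported rates are not fully specified on this page.

def within_segment_fps(n_target, n_segments, seg_dur):
    # Effective sampling rate inside each segment: floor(Ntarget / Ns) / ds.
    return (n_target // n_segments) / seg_dur

print(within_segment_fps(32, 2, 3.0))  # 5.333... FPS, the basketball/bouldering setup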

Scenario-Specific Optimal Configuration

PATS demonstrates strong adaptability across diverse activity domains, with optimal configurations varying by skill characteristics. Basketball achieves the highest accuracy at 78.76% using rapid multi-view sampling, while music reaches 74.14% through fine-grained egocentric capture with 12 segments. Cooking and bouldering show distinct preferences for exocentric-only and egocentric-only views respectively, both utilizing high-frequency sampling.

The most substantial improvements appear in domains requiring precise temporal coordination. Bouldering shows the largest gain at +26.22% over SkillFormer, followed by music at +2.39% and basketball at +1.13%. These results validate PATS' effectiveness for proprioceptive skills where temporal continuity is critical.

However, limitations emerge in certain scenarios. Dancing presents a mixed case where PATS improves over SkillFormer but remains below baseline methods, suggesting the explored parameter combinations may not fully capture rhythmic and aesthetic components. Soccer shows a specific decline in egocentric view accuracy, indicating that PATS' temporal sampling strategy may not suit all activity-view combinations. These challenges point to opportunities for further refinement in scenario-specific configurations and potentially automated parameter selection mechanisms for broader applicability.

BibTeX

@INPROCEEDINGS{Bian2510:PATS,
AUTHOR="Edoardo Bianchi and Antonio Liotta",
TITLE="{PATS:} {Proficiency-Aware} Temporal Sampling for {Multi-View} Sports Skill
Assessment",
BOOKTITLE="2025 IEEE International Workshop on Sport, Technology and Research (STAR)
(IEEE STAR 2025)",
ADDRESS="Trento, Italy",
PAGES=6,
DAYS=29,
MONTH=oct,
YEAR=2025,
ABSTRACT="Automated sports skill assessment requires capturing fundamental movement
patterns that distinguish expert from novice performance, yet current video
sampling methods disrupt the temporal continuity essential for proficiency
evaluation. To this end, we introduce Proficiency-Aware Temporal Sampling
(PATS), a novel sampling strategy that preserves complete fundamental
movements within continuous temporal segments for multi-view skill
assessment. PATS adaptively segments videos to ensure each analyzed portion
contains full execution of critical performance components, repeating this
process across multiple segments to maximize information coverage while
maintaining temporal coherence. Evaluated on the EgoExo4D benchmark with
SkillFormer, PATS surpasses the state-of-the-art accuracy across all
viewing configurations (+0.65\% to +3.05\%) and delivers substantial gains
in challenging domains (+26.22\% bouldering, +2.39\% music, +1.13\%
basketball). Systematic analysis reveals that PATS successfully adapts to
diverse activity characteristics, from high-frequency sampling for dynamic
sports to fine-grained segmentation for sequential skills, demonstrating its
effectiveness as an adaptive approach to temporal sampling that advances
automated skill assessment for real-world applications."
}