Part-Level Action Parsing via a Pose-Guided Coarse-to-Fine Framework

Existing action recognition methods usually treat an input video as a whole and learn models from video-level labels, so they cannot capture fine-grained cues for human actions. Researchers have therefore started to focus on Part-level Action Parsing, which predicts both the video-level action and the frame-level fine-grained actions of body parts. We propose a coarse-to-fine framework for this task that first predicts the video-level action, then localizes body parts and predicts the part-level actions. To balance accuracy and computation, we recognize the part-level actions from segment-level features. We also propose a pose-guided positional embedding method to localize body parts accurately.
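The control flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the mean pooling, the fixed part list, and the idea of passing the video-level label to the part classifier are all assumptions made for illustration, and body-part localization (including the pose-guided positional embedding) is left out entirely. The sketch only shows a video-level prediction followed by part-level predictions that are computed once per segment and then copied to every frame in that segment.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Feature = List[float]


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices into contiguous, non-empty, near-equal segments."""
    num_segments = max(1, min(num_segments, num_frames))
    bounds = [i * num_frames // num_segments for i in range(num_segments + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(num_segments)]


def mean_pool(features: Sequence[Feature]) -> Feature:
    """Average a non-empty list of equal-length feature vectors."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def parse_video(
    frame_features: Sequence[Feature],
    parts: Sequence[str],  # hypothetical fixed list of body parts
    video_classifier: Callable[[Feature], str],  # hypothetical stand-in model
    part_classifier: Callable[[str, str, Feature], str],  # hypothetical stand-in model
    num_segments: int,
) -> Tuple[str, List[Dict[str, str]]]:
    # Coarse step: one action label for the whole video.
    video_action = video_classifier(mean_pool(frame_features))

    # Fine step: one prediction per (segment, part), shared by the segment's frames.
    # Assumption: the part classifier may use the video-level label as context.
    per_frame: List[Dict[str, str]] = [{} for _ in frame_features]
    for segment in split_into_segments(len(frame_features), num_segments):
        segment_feature = mean_pool([frame_features[i] for i in segment])
        for part in parts:
            label = part_classifier(video_action, part, segment_feature)
            for i in segment:
                per_frame[i][part] = label
    return video_action, per_frame
```

With `num_segments` much smaller than the number of frames, the part classifier runs far fewer times than in a per-frame scheme, which is the accuracy/computation trade-off the abstract refers to; how the segments and features are actually built in the paper is not stated here.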
