Details
Presenter(s)
Display Name
Xiaodong Chen
Affiliation
University of Science and Technology of China
- Country
Abstract
Existing action recognition methods usually treat an input video as a whole and learn models from video-level labels, so they cannot capture fine-grained cues for human action. Researchers have therefore begun to focus on Part-level Action Parsing, which predicts both the video-level action and the frame-level fine-grained actions of body parts. We propose a coarse-to-fine framework for this task that first predicts the video-level action, then localizes body parts and predicts the part-level actions. To balance accuracy and computation, we recognize part-level actions from segment-level features. Furthermore, we propose a pose-guided positional embedding method to localize body parts accurately.
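The segment-level idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how segment features are built or how segment predictions map back to frames, so the mean pooling, the equal-length segments, and the copying of each segment's label to its frames are all assumptions made here for illustration. Every name below (`split_into_segments`, `mean_pool`, `segment_level_part_actions`, the `classify` callback) is hypothetical.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def split_into_segments(num_frames: int, num_segments: int) -> List[range]:
    """Split frame indices into contiguous, near-equal, non-empty segments."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    # Never use more segments than frames, and always use at least one.
    k = max(1, min(num_segments, num_frames))
    bounds = [i * num_frames // k for i in range(k + 1)]
    return [range(bounds[i], bounds[i + 1]) for i in range(k)]


def mean_pool(features: Sequence[Vector]) -> Vector:
    """Average a non-empty list of equal-length feature vectors (assumed pooling)."""
    n = len(features)
    return [sum(column) / n for column in zip(*features)]


def segment_level_part_actions(
    frame_features: Sequence[Vector],
    num_segments: int,
    classify: Callable[[Vector], str],
) -> List[str]:
    """Classify once per segment, then copy the label to every frame in it.

    The classifier runs num_segments times instead of once per frame,
    which is the accuracy/computation trade-off this sketch illustrates.
    """
    labels = [""] * len(frame_features)
    for segment in split_into_segments(len(frame_features), num_segments):
        pooled = mean_pool([frame_features[i] for i in segment])
        label = classify(pooled)
        for i in segment:
            labels[i] = label
    return labels


# Toy usage: four frames of 1-D features for one body part, two segments,
# and a stand-in threshold classifier (hypothetical, not a trained model).
toy_features = [[0.0], [0.2], [0.9], [1.0]]
toy_labels = segment_level_part_actions(
    toy_features, 2, lambda v: "wave" if v[0] > 0.5 else "rest"
)
```

With fewer segments the classifier is called less often but frame-level labels become coarser; with one segment per frame the sketch reduces to plain per-frame classification.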