    Poster
    Presenter(s)
    Rui Zhong
    Affiliation
    Central China Normal University
    Abstract

    The high redundancy among keyframes is a critical issue for existing summarization methods when dealing with user-created videos. To address it, we present an unsupervised learning method, a Spatial Attention Model guided Bi-directional Long Short-Term Memory (Bi-LSTM) network, built on a combination of visual and semantic features. For the visual feature, we design a Salient-Area-Size-based spatial attention model, motivated by the observation that humans tend to focus on sizable and moving objects in videos. The Bi-LSTM network is leveraged to exploit the semantic feature. The soft selection probabilities generated from the spatial attention and the semantic feature are then fused to obtain the final probability for keyframe selection. A reinforcement learning framework, trained with the Deep Deterministic Policy Gradient algorithm, is adopted for unsupervised training. Extensive experiments on the SumMe and TVSum datasets demonstrate that our method outperforms state-of-the-art methods in terms of F-score.
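
    As a concrete illustration of the pipeline sketched in the abstract (Bi-LSTM semantic scores weighted by a salient-area-size attention cue, then fused into a soft selection probability), here is a minimal PyTorch sketch. It is an assumption-laden illustration rather than the authors' implementation: the feature dimension, the convex-combination fusion rule, and the salient_area_ratio input are hypothetical placeholders.

        import torch
        import torch.nn as nn

        class KeyframeScorer(nn.Module):
            # Hypothetical sketch of the described pipeline; not the authors' code.
            def __init__(self, feat_dim=1024, hidden=256):
                super().__init__()
                # Bi-LSTM captures temporal/semantic context across frames.
                self.bilstm = nn.LSTM(feat_dim, hidden,
                                      batch_first=True, bidirectional=True)
                self.head = nn.Linear(2 * hidden, 1)

            def forward(self, feats, salient_area_ratio):
                # feats: (B, T, feat_dim) per-frame visual features.
                # salient_area_ratio: (B, T) values in [0, 1], the fraction of
                # each frame covered by sizable/moving objects (the spatial cue).
                h, _ = self.bilstm(feats)                                # (B, T, 2*hidden)
                semantic_prob = torch.sigmoid(self.head(h)).squeeze(-1)  # (B, T)
                # Fuse the two cues into the final keyframe-selection
                # probability (a simple convex combination is assumed here).
                return 0.5 * semantic_prob + 0.5 * salient_area_ratio

        scorer = KeyframeScorer()
        feats = torch.randn(2, 120, 1024)   # 2 clips, 120 frames each
        ratio = torch.rand(2, 120)          # dummy salient-area ratios
        probs = scorer(feats, ratio)        # (2, 120) selection probabilities

    In the unsupervised setting described in the abstract, these probabilities would act as the policy's frame-selection actions and be optimized with the Deep Deterministic Policy Gradient algorithm against a summary-quality reward; that training loop is omitted from this sketch.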

    Slides
    • Unsupervised Learning of Visual and Semantic Features for Video Summarization (PDF)