    Details

    Presenter(s)
        Qinyu Zhang (Tongji University)

    Author(s)
        Qinyu Zhang (Tongji University)
        Pengjie Tang (Jinggangshan University)
        Hanli Wang (Tongji University)
        Jinjing Gu (Tongji University)
    Abstract

    Video captioning is a challenging cross-modal task that requires taking full advantage of both vision and language. In this paper, multiple tasks are assigned to fully mine multi-concept knowledge across vision and language, including video-to-video knowledge, video-to-text knowledge, and text-to-text knowledge. Experimental results on the benchmark MSVD and MSR-VTT datasets show that the proposed method achieves remarkable improvements on all metrics for the MSVD dataset and on two of the four metrics for the MSR-VTT dataset.
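
    A common way to combine several auxiliary objectives like the three knowledge tasks above is a weighted sum of per-task losses. The sketch below is only an illustration of that general pattern, not the paper's actual formulation; the function name, task names, and weights are hypothetical.

    ```python
    # Hypothetical sketch: combine per-task losses (video-to-video,
    # video-to-text, text-to-text) into one training objective via a
    # weighted sum. The weights are illustrative, not from the paper.
    def multi_task_loss(l_v2v, l_v2t, l_t2t, weights=(1.0, 1.0, 1.0)):
        w_v2v, w_v2t, w_t2t = weights
        return w_v2v * l_v2v + w_v2t * l_v2t + w_t2t * l_t2t

    # Example: equal weighting simply sums the three task losses.
    total = multi_task_loss(0.5, 1.2, 0.3)
    ```

    In practice the weights are tuned so that no single auxiliary task dominates the captioning objective.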

    Slides
    • Multi-Concept Mining for Video Captioning Based on Multiple Tasks (application/pdf)