Details
Presenter(s)
Qinyu Zhang
- Affiliation: Tongji University
Abstract
Video captioning is a challenging cross-modal task that requires taking full advantage of both vision and language. In this paper, multiple tasks are assigned to fully mine multi-concept knowledge in both vision and language, including video-to-video knowledge, video-to-text knowledge, and text-to-text knowledge. Experimental results on the benchmark MSVD and MSR-VTT datasets show that the proposed method achieves remarkable improvements on all metrics on the MSVD dataset and on two of four metrics on the MSR-VTT dataset.
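
The abstract names three knowledge-mining tasks but gives no formulation. As a minimal sketch, one plausible way to combine them is a weighted multi-task loss: a contrastive video-to-video term, a contrastive video-to-text alignment term, and a token-level text-to-text captioning term. The function name, loss forms, and weights below are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: the paper's real objective may differ.
import torch
import torch.nn.functional as F

def multi_concept_loss(video_view_a, video_view_b, video_emb, text_emb,
                       caption_logits, caption_targets,
                       w_vv=1.0, w_vt=1.0, w_tt=1.0, temperature=0.07):
    """Weighted sum of assumed video-to-video, video-to-text,
    and text-to-text task losses."""
    batch = video_emb.size(0)
    targets = torch.arange(batch, device=video_emb.device)

    # Video-to-video (assumed): contrastive alignment between two
    # augmented views of the same clip within a batch.
    sim_vv = F.normalize(video_view_a, dim=-1) @ F.normalize(video_view_b, dim=-1).t()
    loss_vv = F.cross_entropy(sim_vv / temperature, targets)

    # Video-to-text (assumed): align each video embedding with its
    # paired caption embedding, contrasting against other pairs.
    sim_vt = F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    loss_vt = F.cross_entropy(sim_vt / temperature, targets)

    # Text-to-text (assumed): standard cross-entropy over generated
    # caption tokens against the ground-truth caption.
    loss_tt = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))

    return w_vv * loss_vv + w_vt * loss_vt + w_tt * loss_tt
```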