Details
Presenter(s)
Qinyu Zhang
- Affiliation: Tongji University
Abstract
Video captioning is a challenging cross-modal task that requires taking full advantage of both vision and language. In this paper, multiple tasks are assigned to fully mine multi-concept knowledge in both vision and language, including video-to-video knowledge, video-to-text knowledge, and text-to-text knowledge. Experimental results on the benchmark MSVD and MSR-VTT datasets show that the proposed method achieves remarkable improvements on all metrics on the MSVD dataset and on two of four metrics on the MSR-VTT dataset.
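
The abstract names three knowledge-mining tasks but gives no formulation. As a minimal sketch, one plausible way to combine them is a weighted multi-task loss: a contrastive video-to-video term, a contrastive video-to-text alignment term, and a token-level text-to-text captioning term. The function name, loss forms, and weights below are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch only: the paper's real objective may differ.
import torch
import torch.nn.functional as F

def multi_concept_loss(video_view_a, video_view_b, video_emb, text_emb,
                       caption_logits, caption_targets,
                       w_vv=1.0, w_vt=1.0, w_tt=1.0, temperature=0.07):
    """Weighted sum of assumed video-to-video, video-to-text,
    and text-to-text task losses."""
    batch = video_emb.size(0)
    targets = torch.arange(batch, device=video_emb.device)

    # Video-to-video (assumed): contrastive alignment between two
    # augmented views of the same clip within a batch.
    sim_vv = F.normalize(video_view_a, dim=-1) @ F.normalize(video_view_b, dim=-1).t()
    loss_vv = F.cross_entropy(sim_vv / temperature, targets)

    # Video-to-text (assumed): align each video embedding with its
    # paired caption embedding, contrasting against other pairs.
    sim_vt = F.normalize(video_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    loss_vt = F.cross_entropy(sim_vt / temperature, targets)

    # Text-to-text (assumed): standard cross-entropy over generated
    # caption tokens against the ground-truth caption.
    loss_tt = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1))

    return w_vv * loss_vv + w_vt * loss_vt + w_tt * loss_tt
```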