Dense Video Captioning By Utilizing Auxiliary Image Data
Dense video captioning aims at detecting events in untrimmed videos and generating an accurate and coherent caption for each detected event. It is one of the most challenging captioning tasks, since the generated sentences must form a meaningful and fluent paragraph that respects the temporal dependencies and ordering of the events, and most previous work relies heavily on visual features extracted from the videos. Collecting textual descriptions is especially costly for dense video captioning, since each event in the video must be annotated separately and a long descriptive paragraph must be provided. In this thesis, we investigate a way to mitigate this heavy annotation burden and propose a new dense video captioning approach that leverages captions of similar images as auxiliary context while generating coherent captions for the events in a video. Our model retrieves visually relevant images and combines noun and verb phrases from their captions to generate coherent descriptions. We employ a generator and discriminator design, together with an attention-based fusion technique, to incorporate image captions as context in the video caption generation process. The best generated caption is chosen by a hybrid discriminator that considers temporal and semantic dependencies between events. The effectiveness of our model is demonstrated on the ActivityNet Captions dataset, where our approach achieves favorable performance compared to a strong baseline on both automatic metrics and qualitative evaluations.
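The attention-based fusion mentioned above can be illustrated with a minimal sketch. The code below is a hypothetical, simplified version (not the thesis implementation): it uses scaled dot-product attention, with the video event feature as the query and the embeddings of the retrieved image captions as keys and values, and concatenates the resulting caption context onto the video feature. The function name `attention_fuse`, the dimensionalities, and the random toy inputs are all assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(video_feat, caption_feats):
    """Fuse retrieved image-caption embeddings into a video event feature.

    video_feat:    (d,)   feature of the current video event (query)
    caption_feats: (k, d) embeddings of k retrieved image captions (keys/values)
    Returns a (2*d,) vector: the video feature concatenated with the
    attention-weighted caption context.
    """
    d = video_feat.shape[0]
    scores = caption_feats @ video_feat / np.sqrt(d)  # (k,) relevance scores
    weights = softmax(scores)                         # attention over captions
    context = weights @ caption_feats                 # (d,) weighted caption summary
    return np.concatenate([video_feat, context])

# toy example: one 8-dim event feature, 3 retrieved caption embeddings
rng = np.random.default_rng(0)
v = rng.normal(size=8)
C = rng.normal(size=(3, 8))
fused = attention_fuse(v, C)
print(fused.shape)  # (16,)
```

In a full model, the fused vector would condition the caption generator's decoder, so that phrases from relevant image captions can influence word choice for the current event.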