Thomas Winterbottom, a PhD student at Durham University, working in collaboration with Durham University’s Dr Noura Al-Moubayed and Sarah Xiao and Carbon CDO Dr Al Mclean, recently had a paper accepted to the British Machine Vision Conference. Its findings could have huge implications for how image-based media is analysed and optimised.
Video Question Answering
The problem of video question answering is to find relevant clips in a video that answer questions about the visual content of the video. Multi-modal video question answering also takes into account subtitles (or audio) as well as the video.
To further research on this problem, datasets have been proposed and created so that academics can concentrate on improving the machine learning capabilities of their solutions. TVQA is a large-scale standard dataset based on popular TV shows that has been designed specifically to ‘require both vision and language understanding to answer’. It contains over 150K question/answer pairs and 460 hours of video. The questions are designed to encourage multi-modal reasoning: the labeller is asked to create two-part compositional questions, which are then labelled with timestamps for the relevant video frames and subtitles. An example is shown below:
The experimental setup Tom created allowed us to test individual modalities and combinations of them. Our video channel consists of three sub-channels, which focus on core image features, regions of interest in the image, and a transformation of those regions to visual concepts (such as ‘person’, ‘table’ etc). This is a complex system with many parameters to learn. For example, the ResNet101 box in the bottom left has about 45M parameters the machine needs to learn.
The first striking result was that the subtitles dominate rather than complement the visual data. Adding subtitles took accuracy from around 45% to 68%. Furthermore, on deeper analysis, it turned out that the subtitles were actively suppressing the useful information in the visual channels.
The following figure also shows that the visual models are better at complementing each other: they answer different questions (low overlap). The subtitles, by contrast, dominate and have low overlap with the visual elements.
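One way to make "overlap" concrete is to compare the sets of questions each model answers correctly. The sketch below uses Jaccard overlap, which is an assumption for illustration and not necessarily the measure behind the figure. The model names and question IDs are hypothetical toy data, not results from the paper.

```python
from itertools import combinations


def jaccard_overlap(correct_a, correct_b):
    """Shared correct questions divided by questions either model gets right."""
    union = correct_a | correct_b
    if not union:
        return 0.0
    return len(correct_a & correct_b) / len(union)


# Hypothetical toy data: IDs of the questions each model answers correctly.
correct_by_model = {
    "image_features": {1, 2, 3, 4},
    "regions": {3, 4, 5, 6},
    "subtitles": {4, 6, 7, 8, 9, 10},
}

# Overlap score for every pair of models.
pairwise = {
    (a, b): jaccard_overlap(correct_by_model[a], correct_by_model[b])
    for a, b in combinations(correct_by_model, 2)
}

for pair, score in pairwise.items():
    print(pair, round(score, 2))
```

A low score means two models tend to get different questions right, so combining them has more room to help.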
Simple Solutions Work Better
This led us to investigate whether simple solutions could give us state-of-the-art results. Rather than creating complex neural network architectures, we replaced the way subtitles were represented in the models with a more recent and better approach to representing language.
The table shows that any model that uses subtitles (S) can be improved to achieve state-of-the-art results (in bold) by changing the language model from GloVe to BERT.
Question Type Analysis
We then proceeded to dissect the dataset even further by looking at the performance of different question types (the five W’s and one H).
The table shows the relative performance of each question type versus the average for all question types, for each model. It clearly illustrates that non-subtitle models (top half) underperform on ‘which’ and ‘who’ question types. This makes intuitive sense, as names and relevant nouns commonly appear in the subtitles.
The lower half of the table shows the subtitle models, which perform significantly above average on ‘why’ and ‘how’ questions. Intuitively, these question types are harder because the answers are implied rather than concrete, and they often revolve around explanations that are best represented in language.
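The relative scores described above can be sketched as follows. This is a minimal illustration that assumes "relative performance" means per-type accuracy minus the model's overall accuracy; the paper may normalise differently. The function name and the toy results are hypothetical.

```python
from collections import defaultdict


def relative_accuracy_by_type(results):
    """Map each question type to (type accuracy - overall accuracy).

    `results` is an iterable of (question_type, is_correct) pairs for one model.
    """
    results = list(results)
    if not results:
        return {}
    totals = defaultdict(int)
    hits = defaultdict(int)
    for question_type, is_correct in results:
        totals[question_type] += 1
        hits[question_type] += int(is_correct)
    overall = sum(hits.values()) / len(results)
    return {q: hits[q] / totals[q] - overall for q in totals}


# Hypothetical toy results for a single model.
toy_results = [
    ("who", True), ("who", False),
    ("why", True), ("why", True),
]
relative = relative_accuracy_by_type(toy_results)
print(relative)
```

A negative value marks a question type on which the model does worse than its own average, which is the pattern the non-subtitle models show for ‘which’ and ‘who’.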
A Truly Multi-modal Dataset
Our last experiment was to look for the section of the dataset that was truly multi-modal. To do this, we found the questions that could only be answered correctly when a multi-modal model was used: we removed any question that could be answered correctly by any unimodal model. Our results showed that only 3.79% of the dataset was in fact multi-modal!
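The filtering step can be sketched with plain set operations. This is an illustrative sketch under stated assumptions, not the code used for the paper: the function name, model names and question IDs are all hypothetical toy data.

```python
def multimodal_only(question_ids, unimodal_correct, multimodal_correct):
    """Questions some multi-modal model gets right and no unimodal model does."""
    any_unimodal = set().union(*unimodal_correct.values())
    any_multimodal = set().union(*multimodal_correct.values())
    return (any_multimodal - any_unimodal) & set(question_ids)


# Hypothetical toy data: ten questions and the IDs each model answers correctly.
question_ids = range(1, 11)
unimodal_correct = {
    "video": {1, 2, 3},
    "subtitles": {3, 4, 5, 6},
}
multimodal_correct = {
    "video+subtitles": {1, 3, 4, 5, 6, 7},
}

subset = multimodal_only(question_ids, unimodal_correct, multimodal_correct)
fraction = len(subset) / len(question_ids)
print(sorted(subset), fraction)
```

In this toy example only question 7 survives, because every other correctly answered question is also answered by at least one unimodal model.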
Our deep analysis has shown that it is challenging to design questions without introducing biases that discourage multimodality. We should always explore and understand the inherent bias in datasets before we invest in complex and expensive solutions. Doing so can allow us to develop simple solutions that outperform overly complicated ones.