Image Verification Corpus Released
As we approach the end of project year one, the first R&D results are becoming available in REVEAL. Here, Symeon Papadopoulos of the Centre for Research and Technology Hellas (CERTH-ITI) presents an image verification data corpus that is freely available to the research community.
Fake Images Spreading
One of the key challenges that REVEAL faces is the verification of images residing on the web. Recent experience has shown that as soon as a big news story breaks, numerous fake images circulate in online social networks. This is predominantly the case with Twitter as it is the primary network for the dissemination of news.
In fact, since social networks have been widely used for the dissemination of news for quite some time, there are now numerous documented examples of fake images that were distributed online in recent years. Hurricane Sandy is perhaps the best-known case study. During Sandy, a number of fake images were spread via Twitter. Most interestingly, during and right after the hurricane, the number of tweets linking to fake images was so high that it was possible to train a machine learning algorithm to distinguish between tweets that linked to verified images and those that linked to fake content.
A Particular Challenge
Despite the extremely interesting and promising results of that research, when we tried to replicate the experiment we were confronted with a particular challenge: it was really hard to find the tweets that linked to fake content, and even more difficult to replicate the experiment with exactly the same dataset. Furthermore, our experiments indicated that using the same event (Hurricane Sandy in this case) to both train and test the machine learning algorithm led to over-optimistic conclusions regarding the effectiveness of automatic image validation.
A Free Image Verification Data Corpus
Motivated by all these challenges, we decided to create, and make publicly available, a collection of fake and real images that were disseminated via Twitter in the context of different events. We call this collection “the image verification corpus”; it is currently maintained on GitHub.
Two important characteristics of the corpus are:
- it contains examples of both fake and real images in the context of the same event, and
- it contains images from a variety of events.
The first is important so that machine learning algorithms can be properly trained, using both positive and negative examples from the same event. The second is crucial for testing the effectiveness of automatic approaches in realistic settings (i.e. testing them on events that were not used for training the algorithms). In fact, we have also made available an open source implementation of a simple technique to test whether a tweet linking to an image is likely to be sharing a fake or not.
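To make the train/test separation concrete, here is a minimal sketch of a leave-one-event-out split in Python. It is not the open source implementation mentioned above; the field names (`event`, `tweet_id`, `label`) and the toy records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical toy records; the real corpus format may differ.
SAMPLES = [
    {"event": "sandy", "tweet_id": 1, "label": "fake"},
    {"event": "sandy", "tweet_id": 2, "label": "real"},
    {"event": "boston", "tweet_id": 3, "label": "fake"},
    {"event": "boston", "tweet_id": 4, "label": "real"},
    {"event": "sochi", "tweet_id": 5, "label": "fake"},
    {"event": "sochi", "tweet_id": 6, "label": "real"},
]


def leave_one_event_out(samples):
    """Yield (held_out_event, train, test) so that no event is in both sets."""
    by_event = defaultdict(list)
    for sample in samples:
        by_event[sample["event"]].append(sample)

    for held_out in sorted(by_event):
        test = by_event[held_out]
        train = [
            sample
            for event, items in by_event.items()
            if event != held_out
            for sample in items
        ]
        yield held_out, train, test


if __name__ == "__main__":
    for held_out, train, test in leave_one_event_out(SAMPLES):
        print(held_out, "train:", len(train), "test:", len(test))
```

A classifier would be trained on `train` and evaluated on `test` in each round, so every reported score comes from an event the model has never seen.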
The corpus currently contains images from several events (Hurricane Sandy, the Boston Marathon bombings, the Sochi Olympics, the missing Malaysia Airlines flight MH370, and a few minor additional ones). It is maintained manually: we monitor major breaking news stories and then look for trustworthy online sources that document cases of fake images. We then identify the tweets that shared these images. Note that identifying such tweets is not an easy task, since we are looking not only for tweets containing a specific set of URLs, but for any tweet that shares one of these images or a near-duplicate of it (e.g. cropped or slightly modified). We do this semi-automatically by employing algorithms for large-scale near-duplicate image detection, and then verifying the results by manual inspection.
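As a rough illustration of what near-duplicate matching involves, the following Python sketch uses a simple average hash compared by Hamming distance. This is a swapped-in toy technique, not the large-scale algorithms we actually use. It assumes each image has already been reduced to a small grayscale grid of equal size, and the threshold `max_distance` is an arbitrary illustrative value. It tolerates slight modifications such as brightness changes; cropped images would require more robust features.

```python
def average_hash(pixels):
    """Return a bit list: 1 where a pixel is brighter than the image mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    """Treat two images as near-duplicates if their hashes differ in few bits."""
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Hypothetical 4x4 grayscale thumbnails (values 0-255).
ORIGINAL = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [11, 21, 201, 211],
]
# Same picture, slightly brightened.
BRIGHTENED = [[value + 10 for value in row] for row in ORIGINAL]
# A different picture: bright on the left, dark on the right.
DIFFERENT = [list(reversed(row)) for row in ORIGINAL]

if __name__ == "__main__":
    print(is_near_duplicate(ORIGINAL, BRIGHTENED))
    print(is_near_duplicate(ORIGINAL, DIFFERENT))
```

Candidate matches produced this way would still be checked by a person, in line with the semi-automatic process described above.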
Hopes and Expectations for Future Work
Our hope is that this dataset will become a valuable resource for researchers working in the area of media verification. One aim of making it available for free is to aid in the reproducibility of research results and the development of new approaches. We welcome ideas and contributions to grow and improve this dataset. If you want to get in touch or contribute, we would be more than happy to hear from you (contacts below)!
The work described in  and the initial work on the collection of the corpus was funded by the EC co-funded SocialSensor project. The work is continuing in the context of REVEAL. The image verification corpus is primarily maintained by CERTH researcher Christina Boididou (follow her on Twitter @CMpoi).
About the Author
Dr Symeon Papadopoulos is a Post-doctoral Research Fellow at the Centre for Research and Technology Hellas (CERTH-ITI). His main research areas are social network analysis, social media content mining and multimedia indexing and retrieval. You can follow him on Twitter @sympapadopoulos