Semi-Automated Extraction of Attributed Verification and Debunking Reports from Social Media
Content from social media sites is becoming an important part of modern journalism. Of particular importance to real-time breaking news are amateur on-the-spot incident reports and eyewitness images and videos. With breaking news having tight reporting deadlines, measured in minutes not days, the need to quickly verify suspicious content is paramount.
Journalists are increasingly looking to pre-filter and automate the simpler parts of the verification process. Current tools available to journalists can be broadly categorized as dashboard and in-depth analytic tools:
- Dashboard tools display filtered traffic volumes, trending hashtags and maps of content by topic, author and/or location.
- In-depth analysis tools use techniques such as sentiment analysis, social network graph visualization and topic tracking.
These tools help journalists to manage social media content, but unverified rumours and fake news stories on social media are becoming both increasingly common and increasingly difficult to spot. The current best practice for journalistic user-generated content (UGC) verification follows a hard-to-scale manual process: journalists review content from trusted sources, with the ultimate goal of phoning authors to verify specific images or videos and then asking permission to use that content for publication.
REVEAL’s trust and credibility model
In the REVEAL project we are developing ways to automate simpler verification steps, empowering journalists and helping them to focus on cross-checking tasks that need human expertise. We are creating a trust and credibility model able to process real-time evidence extracted using a combination of natural language processing, image analysis, social network analysis and semantic analysis. This article describes our work on text analysis, extracting and processing fake and genuine claims from tweets referencing suspicious images and videos. Our central hypothesis is that the “wisdom of the crowd” is not really wisdom at all when it comes to verifying suspicious images and videos. Instead, it is better to rank evidence from Twitter according to the most trusted and credible sources in a way similar to human journalists. We describe a semi-automated approach, automatically extracting claims about real or fake content and their source attributions and comparing them to a manually created list of trusted sources. A cross-checking step ranks conflicting claims and selects the most trustworthy evidence on which to base a final fake/real decision.
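The cross-checking idea can be illustrated with a minimal Python sketch. It assumes claims have already been extracted, and it is not the REVEAL implementation: the `Claim` structure, the three trust levels, the contents of `TRUSTED_SOURCES` and the tie-breaking rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical trusted-source list; in practice this list is created manually.
TRUSTED_SOURCES = {"bbcnews", "reuters", "ap"}


@dataclass
class Claim:
    label: str             # "fake" or "real"
    source: Optional[str]  # attributed source name, or None if unattributed


def trust_level(claim: Claim) -> int:
    """Assumed ranking: 2 = trusted source, 1 = other source, 0 = unattributed."""
    if claim.source is None:
        return 0
    name = claim.source.lower().lstrip("@")
    return 2 if name in TRUSTED_SOURCES else 1


def decide(claims: List[Claim]) -> str:
    """Base the decision on the most trusted claims only, not on a majority vote."""
    if not claims:
        return "unknown"
    best = max(trust_level(c) for c in claims)
    labels = {c.label for c in claims if trust_level(c) == best}
    # If equally trusted claims conflict, decline to decide (an assumption).
    return labels.pop() if len(labels) == 1 else "unknown"


claims = [
    Claim("real", None),
    Claim("real", None),
    Claim("fake", "@BBCNews"),
]
print(decide(claims))  # the single trusted claim outweighs two unattributed ones
```

In this sketch a single claim attributed to a trusted source overrides any number of unattributed claims, which mirrors the hypothesis that source credibility matters more than crowd volume.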
These two pictures display examples where verification tools might be able to assist journalists during the verification process.
MediaEval-2015 Verification Challenge
The MediaEval 2015 Verifying Multimedia Use challenge is an annual event that tests international teams of computer science researchers on their ability to verify multimedia content. Teams receive sets of tweets mentioning suspicious images or videos and must use multimedia features and textual patterns to decide whether each one is real or fake. A fake is considered to be a manipulated image (e.g. a photoshopped image) or an original image presented in the wrong context (e.g. photos of the wrong war zone presented as an atrocity) [see figures 1 and 2 for examples]. All teams submit their image and video classifications, which the challenge organizers then compare to a hidden ground truth based on a human assessment of the content. The teams are scored by how many classifications they get right, then ranked, and a winner is chosen. The winning team is the one with the best balance between a low error rate and a high classification rate.
Our approach’s strength is that it has a very low false positive rate, and in fact it made no mistakes at all when classifying the MediaEval-2015 Verifying Multimedia Use challenge dataset. Figure 3 highlights the low false positive rate with a maximum precision score of 1.0. Full details can be found in the working notes paper.
Our approach’s weakness is a lower classification rate: not all images have tweeted claims about their verification or debunking status, so we were not always able to reach a decision and had to label such content as ‘unknown’. Figure 4 highlights the low classification rate with a modest recall score.
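To make the trade-off between precision and recall concrete, the following Python sketch computes both scores for the ‘fake’ class using the standard definitions. The toy labels are invented for illustration and are not challenge data, and the choice to count an ‘unknown’ label on a fake item as a missed detection is an assumption, not necessarily how the challenge organizers scored submissions.

```python
def precision_recall(predicted, truth, target="fake"):
    """Standard precision and recall for one target class.

    An 'unknown' prediction never counts as a false positive, but an
    'unknown' on an item whose true label is the target counts as a miss.
    """
    pairs = list(zip(predicted, truth))
    tp = sum(1 for p, t in pairs if p == target and t == target)
    fp = sum(1 for p, t in pairs if p == target and t != target)
    fn = sum(1 for p, t in pairs if p != target and t == target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented toy example: every decision made is correct, but two fakes stay undecided.
predicted = ["fake", "unknown", "unknown", "real"]
truth = ["fake", "fake", "fake", "real"]
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A classifier that only commits when it has trustworthy evidence can therefore reach perfect precision while its recall stays modest, which is the pattern described above.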
We extract claims from textual patterns in tweets about the image being fake or real, and attribution statements about the source of the content. We compare attributed source named entities (e.g. BBCNews) to a list of trusted sources in the same way a human journalist might do. Our trust and credibility model is based on a classic natural language processing pipeline involving tokenization, part-of-speech (PoS) tagging, named entity recognition and relation extraction. Full technical details can be found in the working notes paper.
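As a rough illustration of claim and attribution extraction, the Python sketch below uses plain regular expressions in place of the full pipeline of tokenization, PoS tagging, named entity recognition and relation extraction. The patterns, the `extract_claim` function and the contents of `TRUSTED_SOURCES` are hypothetical and far simpler than the patterns described in the working notes paper.

```python
import re

# Hypothetical trusted-source list (lowercase names, no leading '@').
TRUSTED_SOURCES = {"bbcnews", "reuters", "ap"}

# Illustrative patterns only; a real system would use PoS tags and named entities.
FAKE_PATTERN = re.compile(r"\b(fake|hoax|photoshopped|debunked|not real)\b", re.IGNORECASE)
REAL_PATTERN = re.compile(r"\b(confirmed|verified|genuine|is real)\b", re.IGNORECASE)
ATTRIBUTION_PATTERN = re.compile(r"\b(?:via|according to|source:?)\s+@?(\w+)", re.IGNORECASE)


def extract_claim(tweet):
    """Return a dict describing the claim in a tweet, or None if it makes no claim."""
    if FAKE_PATTERN.search(tweet):
        label = "fake"
    elif REAL_PATTERN.search(tweet):
        label = "real"
    else:
        return None
    match = ATTRIBUTION_PATTERN.search(tweet)
    source = match.group(1) if match else None
    trusted = source is not None and source.lower() in TRUSTED_SOURCES
    return {"label": label, "source": source, "trusted": trusted}


print(extract_claim("This shark photo is fake via @BBCNews"))
print(extract_claim("Wow, look at this!"))
```

Tweets that match neither pattern produce no claim at all, which is one reason a pattern-based approach cannot classify every image.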
In the context of journalistic verification these results are promising. Given enough tweeted claims about an image or video, we can rank the most trustworthy claims and provide a highly accurate classification result. This means that once images and videos, such as eyewitness content, go viral on Twitter we will be able to provide a real-time view of their verification status. Our approach does not replace manual verification techniques – someone still needs to actually verify the content – but it can rapidly alert journalists to trustworthy reports of verification and/or debunking. This in turn should speed up the verification cycle and allow the ‘time to publish’ to be shortened.
We are working on a range of trust and verification algorithms in addition to this work. Examples include automated image verification via a cross-check of known facts, automatically downloading the historical weather and time of day for an event and checking this against image features (e.g. if it’s raining in an image but it was a dry day for the event the image must be a fake). We are also developing interactive analytical visualizations to both display clusters of content geolocated on maps, and display temporally sampled content on timelines. These visualizations will allow journalists to explore contextual social media content, quickly finding evidence that can be used for cross-checking facts about a story.
We hope to release a live demonstration tool in the Spring of 2016. Announcements will be made via the REVEAL website. Alternatively you can follow us on Twitter to be the first to know!
 Silverman, C. (Ed.), 2013. Verification Handbook. European Journalism Centre
 Silverman, C. 2015. Lies, Damn Lies, and Viral Content. How News Websites Spread (and Debunk) Online Rumors, Unverified Claims, And Misinformation. Tow Center for Digital Journalism, Columbia Journalism School
 Spangenberg, J. Heise, N. 2014. News from the Crowd: Grassroots and Collaborative Journalism in the Digital Age. In Proceedings of the Companion Publication of the 23rd International Conference on World Wide Web Companion (WWW 2014). Seoul, Korea, 765-768
 Boididou, C. Andreadou, K. Papadopoulos, S. Dang-Nguyen, D. Boato, G. Riegler, M. Kompatsiaris, Y. 2015. Verifying Multimedia Use at MediaEval 2015. In Proceedings of the MediaEval 2015 Workshop, Wurzen, Germany
 Middleton, S.E. 2015. Extracting Attributed Verification and Debunking Reports from Social Media: MediaEval-2015 Trust and Credibility Analysis of Image and Video. MediaEval-2015, Wurzen, Germany, Sept 2015
 MediaEval-2015, http://wwwu.edu.uni-klu.ac.at/miriegle/mediaeval/index2015.html
About the author
Stuart E. Middleton is a senior research engineer at the University of Southampton IT Innovation Centre. His main research interests are social media, sensor systems, data fusion and semantics. Stuart has a PhD in Computer Science from the University of Southampton.