If you were tasked with reviewing and making sense of a huge stack of documents you’ve never seen before, you would probably go about it in a pretty standard way. Skim the first page and make a quick decision about whether it’s relevant or about a specific topic, then move to page two and make that decision again. After a few pages, you might have a few separate piles describing what you’ve seen so far.
As you continue reading, the piles might get more sophisticated. In one pile, you might place emails containing specific complaints to the school board. In another, policy proposals from a public official’s top adviser. On and on you go until you get through enough of the pile to have a fairly good idea of what’s inside.
For investigative journalists reviewing massive document dumps — responses to public records requests, for example — this may be one of the very first steps in the reporting process. The faster reporters understand what they have, the faster they can decide whether there’s a story worth digging into.
Overview, a project to help journalists sift through massive document dumps
Making sense of documents as efficiently as possible is the primary purpose of Overview, an open-source tool originally developed by The Associated Press and funded by a collection of grants from the Knight Foundation and Google, among others.
Upload your documents into Overview and it will automatically process them first using optical character recognition. It then uses a clustering algorithm called term frequency-inverse document frequency to try to sort each individual document into a series of piles. It’s somewhat similar to the way a human reporter would sort documents if she were reading the pages one by one.
TF-IDF is built on a really basic assumption. It counts the number of times each word is used in each document — say a single email in a batch of thousands. It then compares those counts to the number of times the same words are used in the larger collection of documents. If a few of the emails have words in common that are relatively uncommon in the whole collection of emails, the assumption is that those documents are related in some way.
Overview doesn’t actually derive any meaning from the words it’s counting, so the assumption the algorithm makes about documents being related might be wrong or totally unhelpful. But Overview also allows users to tag individual documents (or whole piles) with custom labels. It might, for example, help a reporter more quickly identify those complaints to the school board or the policy proposals to the public official because they’re all grouped together by the algorithm.
Overview has a few other helpful features, like fast searching and the ability to rerun the clustering algorithm with different parameters — specific terms of interest or stop words, for example. It’s also seamlessly integrated with another tool called DocumentCloud, a popular platform journalists use to annotate and publish documents online.