Snoozing differently

By Anne, Jeneé, Michelle and Tyler

Our group discussed a common problem with wake-up apps: how often we hit the snooze button. So we came up with a feature that lets the user set up two separate playlists: songs you love and songs you hate.

When the alarm clock rings, you can select the “hype” playlist or the “hate” playlist for the next time the alarm sounds to get you out of bed. We also discussed ways to integrate the alarm app with services like Spotify and Pandora and use them to further randomize the hype and hate songs based on the preferences you’ve already stored.

Overview: Find stories faster in massive document dumps

If you were tasked with reviewing and making sense of a huge stack of documents you’ve never seen before, you would probably go about it in a pretty standard way. Skim the first page and make a quick decision about whether it’s relevant or about a specific topic, then move to page two and make that decision again. After a few pages, you might have a few separate piles describing what you’ve seen so far.

As you continue reading, the piles might get more sophisticated. In one pile, you might place emails containing specific complaints to the school board. In another, policy proposals from a public official’s top adviser. On and on you go until you get through enough of the pile to have a fairly good idea of what’s inside.

For investigative journalists reviewing massive document dumps — responses to public records requests, for example — this may be one of the very first steps in the reporting process. The faster reporters understand what they have, the faster they can decide whether there’s a story worth digging into.

Overview, a project to help journalists sift through massive document dumps

Making sense of documents as efficiently as possible is the primary purpose of Overview, an open-source tool originally developed by The Associated Press and funded by a collection of grants from the Knight Foundation and Google, among others.

Upload your documents into Overview and it will first process them automatically using optical character recognition. It then uses a technique called term frequency-inverse document frequency, or TF-IDF, to score the words in each document, and a clustering algorithm uses those scores to try to sort each individual document into a series of piles. It’s somewhat similar to the way a human reporter would sort documents if she were reading the pages one by one.

TF-IDF is built on a basic assumption. It counts the number of times each word is used in each document (say, a single email in a batch of thousands). It then weighs those counts against how common the same words are across the larger collection of documents. If a few of the emails share words that are relatively uncommon in the whole collection, the assumption is that those documents are related in some way.
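To make that counting concrete, here is a minimal sketch of one common textbook version of TF-IDF in plain Python. It is not Overview's actual code; the function name `tf_idf` and the three sample emails are invented for illustration, and real tools typically add smoothing and better tokenizing than a simple split on spaces.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: score} dict per document (hypothetical helper)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized:
        doc_freq.update(set(words))

    scores = []
    for words in tokenized:
        counts = Counter(words)
        total = len(words)
        scores.append({
            # Term frequency (share of this document's words) times
            # inverse document frequency (rarity across the collection).
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Made-up sample data, standing in for a batch of emails.
emails = [
    "the board ignored the bus complaint",
    "the bus complaint reached the board",
    "the adviser drafted the budget proposal",
]
scores = tf_idf(emails)
```

In this toy batch, "the" appears in every email, so its score is zero everywhere. Words like "bus" and "complaint" appear in only the first two emails and get a positive score in both, which is the signal that those two documents might belong in the same pile.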

Overview doesn’t actually derive any meaning from the words it’s counting, so the assumption the algorithm makes about documents being related might be wrong or totally unhelpful. But when the grouping works, it might, for example, help a reporter more quickly identify those complaints to the school board or the policy proposals from the public official’s adviser because they’re all grouped together by the algorithm. Overview also allows users to tag individual documents (or whole piles) with custom labels.

Overview has a few other helpful features, like fast searching and the ability to rerun the clustering algorithm with different parameters — specific terms of interest or stop words, for example. It’s also seamlessly integrated with another tool called DocumentCloud, a popular platform journalists use to annotate and publish documents online.
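As a rough illustration of why stop words matter, the sketch below compares two made-up emails using cosine similarity on raw word counts, with and without a stop-word list. This is an assumption-laden toy, not how Overview implements its clustering; the helper names, the `STOP_WORDS` set and the sample text are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical stop-word list: boilerplate terms a reporter chooses to ignore.
STOP_WORDS = frozenset({"please", "forward", "the", "to"})


def word_counts(text, stop_words=frozenset()):
    """Count the words in a text, skipping any stop words."""
    return Counter(w for w in text.lower().split() if w not in stop_words)


def cosine_similarity(a, b):
    """Return a score from 0.0 (no shared words) to 1.0 (identical mix)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


complaint = "please forward the complaint to the board"
proposal = "please forward the proposal to the adviser"

# Without stop words, shared boilerplate makes the emails look alike.
before = cosine_similarity(word_counts(complaint), word_counts(proposal))

# With stop words removed, only the distinctive terms are compared.
after = cosine_similarity(
    word_counts(complaint, STOP_WORDS),
    word_counts(proposal, STOP_WORDS),
)
```

Here `before` is high because both emails share the same filler phrasing, while `after` drops to zero once that filler is ignored. That is the kind of shift a reporter might be after when rerunning the sort with a different stop-word list.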

Tyler’s bio

I’m Tyler Dukes, and I’m a 2017 fellow at the Nieman Foundation for Journalism at Harvard. In real life, I’m an investigative reporter for the state politics team at WRAL News in Raleigh, North Carolina, where I work on longform stories and specialize in data and public records. I’m really interested in finding ways to use technology to enhance in-depth reporting and make data journalism more accessible to underserved media markets. That means (I think) developing better methods for training working journalists and educating journalism students in ways that allow them to apply these skills practically on the beat.

At WRAL, I’ve led the reporting on deep dives into the state’s mental healthcare system, deaths in the prisons and, oddly enough, the search for sunken treasure off the Carolina coast. I also built systems that allow readers to search more than a million pages of records from a major university athletic scandal and explore the campaign cash fueling each state lawmaker’s election bid. Prior to working at WRAL, I managed a research project at Duke University’s DeWitt Wallace Center for Media and Democracy called the Reporters’ Lab, aimed at finding ways to reduce the cost of investigative reporting. I also freelanced as a science and technology reporter for several newspapers and worked as an adviser to North Carolina State University’s (then-)daily student newspaper.

While my background is in reporting, writing, editing etc., I’m also proficient in Python, JavaScript and HTML/CSS (although far, far, far from being an expert). I’m also really good at prying records from the clutches of government officials.

I’m a native North Carolinian, devotee of Eastern-NC barbecue and fan of gas station coffee. I love all dogs and very few cats.

Follow me on Twitter and Instagram.