The American Community Survey – 3 in 1: explainer, engagement, data story

I have thought about creating a census fan page many times. Looking at data all day makes one appreciate the history, scale, and effort of this massive public endeavor. Not only does the census officially guide public funding and policy, it has also, over the years, ritualistically structured our understanding of our environment. Since 1790, the census has evolved not just to adapt to a massive increase in population (from under 4 million to 318 million today) and migration (from 5.1% urban to 81% in 2000), but its format has also changed to reflect our attitudes. In this three-part (hopefully) series of assignments/makeup assignments, I focus on explaining and visualizing the American Community Survey (ACS), a newer data offering of the census: a yearly long-form survey of a 1% sample of the population.

Last summer, while interning at a newsroom, I built a Twitter bot based on the ACS, inspired by how nuanced and evocative the dataset's original collected format is. Each tweet is one person's data reconstituted into a mini bio. In the year since, people have retweeted when an entry is absurd or sad, but most often when an entry reminds them of themselves or someone they know. It quickly became clear that narratives are more digestible than data plotted on a map. However, I was at a loss for how to extend this line of inquiry to include more data in bigger narratives.

Part of my research is to experiment with ways of making public data accessible so that individuals can make small incremental changes to improve their own environment. Many of these small daily decisions are driven by public data, but making the underlying data public is not always enough. While still plotting data on maps regularly, I started to think about narratives. Can algorithmically constructed narratives and narrative visualizations stand alone as long-form creative nonfiction?

There are so many wonderful public data projects out there that go the extra step. Social Explorer does a great job of aggregating the data. Projects from timeLab show many examples of how census data has been used for a variety of purposes, even entertainment. And just last week, the Macro Connections group unveiled a beautiful and massive effort to expose public datasets, one that takes the data all the way into a story presentation.

Constraints are blessings…

It’s fortunate that I work in such a time and environment, but it is also very intimidating. What can I contribute to an already rich body of work, where each endeavor normally requires many hours and even months of teamwork, not to mention a variety of skills? More selfishly, what can visual artists add to the conversation beyond simply dressing up the results? This series of three assignments is a start.

1. Explainer – the evolution of the census

Instead of focusing on how the population has changed, here is a visualization of how census questions have changed to reflect the attitudes and needs of the times. Unfortunately, this is unfinished and only covers 1790 to 1840 right now.


View closeups here – 1790_1840

2. Engagement – how special are you?

I have been procrastinating by spending a lot of time guessing the correlation. I think buzzfeed-type quizzes are among the best data collection tools. Of course, there is also this incredible NYT series. People who commented on the census bot often directly addressed tweets that described themselves. This is an experiment to get people to learn something about the data by allowing them to place themselves in it.

This is also still very much in progress.

3. Data Story

To be continued …

What Is A Bot, Anyway?

(with Adrienne)

Bots are having their 15 minutes, so to speak. Recently, Microsoft launched its “Tay” AI bot and chaos ensued. But bots had already been making a name for themselves on Twitter, on Tumblr, and even on collaboration platforms like Slack and GitHub. Recognizing a bot when we see one, though, doesn’t help us understand what’s going on. To make the lives of non-coders everywhere easier, we’ve prototyped an app that can create and configure a veritable cornucopia of bots, no code required.

* For those who are interested in a little more detail, we’ve also created a simple example: an activist bot that tweets excerpts from the Boston Police Patrolmen’s Association newsletter, which is…unfortunately surprising.

What is a bot?

Broadly speaking, a bot is a computer program that acts like a human user on a social media platform. No artificial intelligence has yet passed the Turing Test, so it is pretty easy to distinguish the humans from the code. Essentially, a bot takes in some information or content from source A (or A + B, or A + B + C, or…well, you get the idea), potentially transforms it based on rules the developer has given it, and saves the newly crafted content to a database. From there, the bot could also have instructions to share its creation on Twitter, but that’s not a requirement.

Minimum Viable Bot is just Information In, Information Out.
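As a sketch, that minimal information-in, information-out loop is only a few lines of Python (the source list and the uppercasing rule below are placeholders, not any particular bot):

```python
# A minimal "information in, information out" bot sketch:
# read items from a source, apply one developer-defined rule,
# and hand the result to an output function (here, just print).

def transform(item: str) -> str:
    """The developer-defined rule. Here it just uppercases the text."""
    return item.upper()

def run_bot(source, publish):
    """Feed every item from the source through the rule and publish it."""
    for item in source:
        publish(transform(item))

if __name__ == "__main__":
    run_bot(["hello world"], print)
```

Swapping `source` for a feed and `publish` for a Twitter client turns this toy into any of the bot types below.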

What are the different kinds of bots?


Bots can take lots of different forms depending on their purpose. Some bots can help you schedule meetings through email. Others are more nefarious, and try to circumvent spam filters in your email or on Twitter. Funnily enough, the hugely popular @horse_ebooks started out as a spam bot, until it was taken over by a reporter from BuzzFeed.

We should note that there is no canonized taxonomy, but we’re going to offer a few informal categories here.

Mash Up Bots:
These bots combine different sources of content and post them.
Example: A bot that tweets out a combination of headlines.


Image Poster Bots:
These bots post an image, sometimes with additional information, or generated content.
Example: A bot that posts live TV stills and improvises subtitles for them.

Smart Learner Bots:
Some bots grow more “intelligent” the more they are interacted with. Smart learner bots require an extra level of human care, as Microsoft learned with Tay. To learn more about ethics in bot curation, Motherboard just posted a great explainer with some of the leaders in social bot technology.
Example: Microsoft’s ill-fated “Tay,” which “learned” by accepting as valuable everything that was said to it.


Auto Notifier Bots:
Auto Notifiers listen to a content source, and then perform an action when new content is posted, or something changes. It’s kind of like If This, Then That, the extremely popular service for connecting various web platforms together. These bots are also very common in journalism. They frequently take template text and “fill in the blanks” with the latest relevant information. 

Our demo bot is a version of this kind of bot, because we are not transforming our text in any way. We are simply waiting for a new newsletter to be posted, and then periodically tweeting sentences from it.
Example: A twitter bot that tweets each time there is an earthquake near L.A.
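Sticking with the earthquake example, the “fill in the blanks” step can be sketched like this (the template wording and event fields are invented for illustration, not from a real feed):

```python
# Sketch of an auto-notifier's templating step. A real bot would poll a
# feed (e.g. a geological survey API) and tweet the rendered text when a
# new event appears; here we just render a hypothetical event dict.

TEMPLATE = "A magnitude {mag} earthquake struck {dist} miles from {place}."

def render(event: dict) -> str:
    """Fill the blanks in the template with the latest event's fields."""
    return TEMPLATE.format(**event)

latest = {"mag": 3.2, "dist": 14, "place": "Los Angeles"}
print(render(latest))
```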


Replier Bots:
These bots talk to the user based on rules written by the developer. Sometimes this needs to be something the user says directly to the bot, and sometimes these bots will tweet at someone in reaction to something that’s been said. Many platforms (e.g. Twitter) have rules for keeping these bots on their best behavior.
Example: A bot that takes nouns from your tweets and turns them into tributes to deities.


Expert Bots:
Much like phone trees, these bots may either offer (semi-)useful information, or take responses and decide what to say next based on them. These bots can also sometimes be found on e-commerce sites with services like Live Chat. The bot will help to quickly sort the chatter for a human.
Example: The Bank of America customer service bot.


Where do bots live?

  • Email
  • Github
  • Slack
  • Twitter
  • IRC
  • and many more!

How do bots work?

Bots typically have a place they get their content from. In some cases, this may be a very advanced system. In the case of our demo app and bot, we simply feed in a web address pointing to our desired content, and the bot posts it sentence by sentence.


As with any program that deals with a large amount of data, most of the work is typically in cleaning up the data so that, in this case, what the bot says is correct.


Some bots will try to detect what is relevant in the data you feed it. Some will simply take the data and reproduce it without a second thought. Tay’s “repeat after me” feature did this, to disastrous effect.

It’s common for one person, once they have acquired the skills, to make and manage many bots! To see this ailment in action, have a look at the work of the wonderful Darius Kazemi!

To end, here is an example of the code that would run our bot, which tweets out random sentences from the Boston Police Patrolmen’s Association newsletter. This script would typically be set up on a server and run on a schedule.

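A rough sketch of what such a script could look like (the newsletter URL and every credential below are placeholders; the posting step assumes the third-party `tweepy` library and a registered Twitter app):

```python
# Sketch of a newsletter-to-Twitter bot: fetch a page, split it into
# tweet-sized sentences, post one at random. Run on a schedule (cron).
import random
import re
from urllib.request import urlopen

NEWSLETTER_URL = "https://example.com/newsletter.txt"  # placeholder

def fetch_text(url: str) -> str:
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8", errors="replace")

def split_sentences(text: str) -> list:
    """Naive splitter; in practice most of the work is cleaning this output."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if 0 < len(p) <= 140]  # must fit in a tweet

def tweet(sentence: str) -> None:
    import tweepy  # assumes tweepy is installed and the app keys are real
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
    tweepy.API(auth).update_status(sentence)

# One scheduled run would be:
#   tweet(random.choice(split_sentences(fetch_text(NEWSLETTER_URL))))
```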

It may not have been as complicated as you thought to build your own bot! If you would like an even more automated route, have a look at the article How To Make A Twitter Bot with Google Spreadsheets.


How badly does Shelby County discriminate against black businesses?

This bad. (The link is to a YouTube video. And yes, I know the first slide says it’s under 2 minutes but the video is 2:12. That will be updated in later versions.)

I’ve written about the racial disparity in municipal contracting processes – which sounds dry as sh*t, I’d be the first to admit – but it’s actually really important. (Here’s a graphic I created a couple of years ago on the topic.)

The short version: Black people finance their own discrimination when they pay taxes into a county government that then awards an unfair share of county contracts to businesses owned by white men. This has been happening in the county (and city) where I’m from for decades.

My video is an attempt to explain the issue by stressing the consensus values in the middle of Hallin’s Sphere and the deviance of continuing to use tax dollars to give one group an unfair advantage over another.

Here’s the story I was trying to explain. The story includes the numbers I cite, but people who won’t read the story WILL watch a video.

I’d like to redo it and make it snappier, add some sound effects and pictures. Some of the slides with not much text could have been shorter, but there’s no (easy) way in Keynote to vary the length of the slides.

Creating this was a beast. (Pro tip: If you create a Powerpoint, export the slides as JPEGs and import the JPEGs into iMovie, the stills will be so blurry as to be unreadable. A workaround: Use Keynote to create a slideshow, export it as a QuickTime movie and then upload to YouTube or wherever. Or alternately, get a legit video editor like Premiere or Final Cut.)

I think THIS is the future of news. I’d like to create a series of videos like this, ideally under a minute. The series (#MLK50, referenced at the end of the video) will be focused on how public policy reinforces racial/economic injustice in Memphis – and what policies would create a more economically equitable environment.

My “fierce urgency of now” is that in two years, Memphis will mark the 50th anniversary of the violent interruption of Martin Luther King’s vision of economic equality. King came to Memphis to demand that local government treat its mistreated black sanitation workers fairly, but 48 years later, the black community is still getting the short end of the stick.

Refugee Resettlement in the United States

An explainer on the process refugees go through to relocate to the U.S. — a collaboration from Brittany and me…

From Brussels to Paris, the growing number of terror attacks in the West has bred both fear and ignorance around the number of Syrian refugees resettled in the United States. The Republican presidential frontrunner has even gone so far as to pledge that he will send resettled refugees back to Syria if elected. Yet, for all of the hand-wringing about the influx of potential jihadists, official government data tells another story.

Since the Syrian civil war broke out in March of 2011, just under 2,200 refugees have been admitted into the United States. According to the Pew Research Center, of the 70,000 refugees the United States was able to legally accept in the 2015 fiscal year, roughly 25% were from Burma, 20% from Iraq, and 13% from Somalia.  While the Obama Administration will raise the refugee cap to 85,000 to accommodate 10,000 Syrian refugees in 2016, Syrians will still make up less than 12% of the total admitted refugee population. Also, while the average processing time for refugees is 18 to 24 months, Syrian applications can take significantly longer because of security concerns and difficulties in verifying their information. Aid organizations currently put the actual processing time at 33 months.

Rather than just throwing more numbers at the reader, we decided to let them engage with the Syrian asylum application process directly via Typeform. A survey tool with style and easy on the eyes, Typeform allows the designer to simulate a conversation through “logic jumps,” which adapt the survey based on a respondent’s answers. Try your hand at the journey here.

My [future] tool: Uliza

I heard the phrase “digital divide” for the first time about six months ago. As someone just sticking a toe into the larger debate around ICTs, net neutrality, and zero-rating products, it’s been a slightly overwhelming dive down the rabbit hole, to say the least. It has also led me to the tool I’ll be introducing today: Uliza.

What is it?

Uliza, which means “ask” in Swahili, is a telephone service that leverages existing technologies in voice recognition, cloud-computing, and translation to provide access to information for the 4.5 billion people who are off-net or illiterate in a major internet language. It is currently being developed for market in East Africa by a team of graduate students at The Fletcher School, MIT, and UC Berkeley.

How does it work?

Anyone with a phone can call a toll-free number, ask a question in their own language, and receive an answer through an automated service, at no cost.

Caller experience: Uliza caller process
Back-end experience: Uliza backend process

Why does it matter?

With only 5% of the world’s languages available on the internet, representing linguistic diversity online continues to be a major challenge. Uliza is one product in a growing suite of tools that seek to bridge the information divide between networked and un-networked communities. The original three-person team behind Uliza — who collectively have more than a decade’s worth of experience working in East Africa — chose to roll out Uliza in Kenya due to the high adoption of mobile technology even among low-income populations, a growing telecom industry, and a need to scale Swahili-language resources.



(How) Can Algorithms be Racist?

Technology can be the ultimate equalizer: once access is provided, it can erase borders, education, race, and class. But a new study suggests that the same tools said to provide a level playing field might also have blind spots. Are the algorithms used to drive images and ads perpetuating human prejudices? One study says yes. But how can algorithms (which seem to be based on reason) discriminate?

Flash preview: (How) can algorithms be racist? An illustrated story #doodles #datamining #race #partnews

A video posted by Sophie C (@petit.chou)

For this assignment, Alicia and I wanted to tackle the issue of bias and discrimination in algorithms in a creative way. Our response is to this short article from the Guardian, “Can Googling be Racist?“.  The Instagram video is a preview of the resulting story, which I plan to scan into a static web-readable series.

To explain, we supplemented Latanya Sweeney’s research paper with my own knowledge of data mining and algorithms, in an easily digestible format. One of my biggest gripes as a computer scientist/machine-learner is the assumption that algorithms are either value-free or a mysterious black box. As Mark Twain (might have) said,

“There are three kinds of lies: lies, damned lies, and statistics.”


Tracing the links of the Germanwings disaster

A week ago a German jet crashed into the Alps, killing all 150 people on board. For the first several hours after the tragedy it was considered an accident, but it is now apparent that the plane’s co-pilot, Andreas Lubitz, is responsible, and details continue to emerge about his past. As more facts surface, news outlets covering the tragedy have released them in incremental updates. These updates have touched on a wide variety of questions: Why was no one aware of or worried about his mental health issues? Should he have been flying a plane in the first place? Have suicide plane crashes happened before? How has small-town Germany — such as the town of the 16 high school students on board or the pilot’s hometown — reacted to the horrific event?

When publishing these updates, publishers are often linking back to previous stories as a proxy for background information. The “original” story breaking the incident tends to be low on hyperlinks (such as the first link above, which only links to a Germany topic page) while later updates start to link back to archival stories for context. I was curious whether these internal, archival hyperlinks could be followed in order to automatically create a community of stories, one that touches on a variety of aspects of the incident. Links are rarely added to stories retroactively, so in general, following the links means traveling back in time. Could a crawler organize all the links for me, and present historical content (whether over the past 3 days or 10 years) for the Germanwings disaster?

I built a crawler that follows the inline, internal links in an article, and subsequently builds a graph spidering out from the source, storing metadata like link location and anchor text along the way. It doesn’t include navigational links, only links inside the article text; and it won’t follow links to YouTube or Wikipedia, just, for instance, the Times. This quickly builds up a dialogue of stories within a publisher’s archive around one story; from here, it is easy to experiment with simple ranking algorithms like most-cited, oldest, or longest article.
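The core filtering step — extracting inline links and keeping only those that stay on the publisher’s own domain — might be sketched like this in Python (class and function names are illustrative, not the actual crawler; it assumes article text sits inside an `<article>` element):

```python
# Sketch: collect hyperlinks found inside <article>...</article>, with
# their anchor text, then keep only links to the same publisher's domain.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class ArticleLinkParser(HTMLParser):
    """Records (href, anchor text) pairs seen inside the article body."""
    def __init__(self):
        super().__init__()
        self.in_article = False
        self.current_href = None
        self.links = []  # list of (href, anchor_text)

    def handle_starttag(self, tag, attrs):
        if tag == "article":
            self.in_article = True
        elif tag == "a" and self.in_article:
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "article":
            self.in_article = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        if self.current_href:  # first text chunk of the link is enough here
            self.links.append((self.current_href, data.strip()))
            self.current_href = None

def internal_links(page_url, html):
    """Resolve relative hrefs and drop anything off the publisher's domain."""
    parser = ArticleLinkParser()
    parser.feed(html)
    site = urlparse(page_url).netloc
    return [(urljoin(page_url, href), text)
            for href, text in parser.links
            if urlparse(urljoin(page_url, href)).netloc == site]
```

Repeating this over each discovered URL, breadth-first, builds the story graph; counting inbound edges then gives a simple “most cited” ranking.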

I chose three incremental update articles from March 30, one each from the Times, the Post, and the Guardian, all reporting that Lubitz was treated for suicidal tendencies:

For each of these three, I spidered out as far as the crawler could go (though in the case of the Times that turned infinite, so I had to stop it somewhere).

New York Times

My first strategy was to simply look at the links that the article already contained. While the system can track links pointing in as well as out, this article only had outlinks; presumably this is because a) it was a very recent article at the time of the query, and b) we cannot be sure that we have all of the related stories from the given spider.

Clicking on a card will reveal the card’s links in turn–both inlinks and outlinks.

The “germanwings-crash.html” article had several links that formed a clear community, including archival stories about plane crashes from 1999 and 2005. The 1999 story was about an EgyptAir crash that has also been deemed a pilot suicide. This suggests that old related articles could surface from following hyperlinks, even if they were not initially tagged or indexed as being related. The 2005 crash is linked in the context of early speculation about the cause of the crash (cabin depressurization was initially considered). It is a less useful signal, but it could be useful in the right context.

This community of links is generally relevant, but it does veer into other territories sometimes. The Times’ large topic pages about France, Spain, and Germany all led the crawler towards stories about the Eurozone economy and the Charlie Hebdo shooting.

Washington Post

The WaPo article collected just 32 links, forming a small community. When I limited the spidering to just 3 levels out, it yielded 12 Germanwings stories covering various aspects of the incident, as well as two older ones, one of which is titled “Ten major international airlines disasters in the past 50 years.”

Click on the image to see the graph in Fusion Tables.

The Washington Post articles dipped the farthest back in the past, with tangential but still related events like the missing Malaysia Airlines flight and the debate over airline cell phone regulations.

The Guardian

The Guardian crawler pulled 59 links, including the widest variety of topic and entity pages. It also picked up article author homepages, though. 32 of these links ended up being relevant Germanwings articles, which is far more than I expected to see…I wouldn’t have guessed the Guardian had published so many stories about it so quickly. These ranged from the forthcoming Lufthansa lawsuit to the safety of the Airbus.

Click on the image to see the graph in Fusion Tables

The Guardian seems to have amassed the biggest network, and tellingly, they already have a dedicated topic page to show for it, even if it’s just a simple timeline format. The graph appears more clustered than WaPo’s, which was more sequential. But it doesn’t dip as far back into the past, and at one point the crawler found itself off-topic on a classical music tangent (the culprit was a story about an opera performance that honored the Germanwings victims).


In the end, the crawler worked well on a limited scope, but I found two problems for link-oriented recommendation and context provision:

  1. The links were often relevant, but it wasn’t clear why. More detail about the context surrounding the link is crucial. This could be served by previewing the paragraph on the page where the link occurs, so a reader could dive into the story itself. In short, a simple list wouldn’t be as detailed as a complete graph or more advanced views.
  2. The topic pages were important hubs, but also noisy and impermanent. Most NYT topic pages feature the most recent stories that have been tagged as such; this works better for a page like “Airbus SAS” than it does for “France.” Such an algorithm therefore needs to treat topic pages with more nuance. Considering topic pages as “explainer” pages in their own right, one wonders how they could be improved or customized for a given event or incident.

Another wrinkle: I returned to the NYT article the next day after a few improvements to the crawler, and found that they had removed a crucial link from the article, one that connected it to the rest of the nodes. So already my data is outdated! This shows the fragility of a link-oriented recommendation scheme as it stands now.

Demystifying the Internet in Cuba

A group of early adopters at CENIAI, Havana, 1996. Photo courtesy of Larry Press.


When it comes to the Internet, Cuba is routinely compared to countries like China, Iran, and Vietnam, where broad-reaching Internet censorship regimes exist. The Cuban government does exert a high degree of control over Internet use. But unlike these and many other countries, there is no evidence that the Cuban government conducts systematic censorship of online content.

Similarly, there is no reliable data on how many people in Cuba actually use the Internet — regularly-cited statistics range from 2.9%-25%. And one could spend years reading western media coverage of Cuba’s Internet and its embattled blogging community (as both of these authors have) and never figure out precisely how the Internet works there, how many people use it, and what kinds of restrictions they face in doing so. Like many other aspects of public life and experience on the island, Cuba’s digital culture is poorly understood by outsiders…

Read the whole explainer by me and Elaine on Medium.