Visualizing Comments by Gender at the NYT

I recently read Emma Pierson’s study about commenters and gender at the NYT. I thought it was a great piece with compelling data, some of which I tried to pull out in the following infographic.

A few challenges: the program I used didn’t allow me a lot of flexibility in terms of editing the charts, so I had to be creative about which points I chose to pull out of her findings. This visualization also uses word clouds, which some folks find terribly unsophisticated, but I really liked the visual comparison of the types of words that men and women use in comments on the same articles side by side.

Without further ado, here’s the visualization. (Unfortunately, I had to paste in a screenshot because the original PNG file wouldn’t copy into this post, so the quality of this version is a little lower than I would have hoped.)

comments and gender snip

It’s low cost energy, stupid.

Recently, the Department of Energy announced it will participate in the development of the Plains & Eastern Clean Line Project (Clean Line), a major clean energy infrastructure project that will bring low-cost renewable power to my home state of Arkansas, as well as to Tennessee and other markets in the Mid-South and Southeast. The approximately 700-mile, high-voltage direct current transmission line and associated facilities have the capacity to deliver 4,000 megawatts (MW) of wind power from the Oklahoma Panhandle region.

The all-Republican Arkansas congressional delegation has already issued a statement against the decision, citing executive overreach. Yet, in a state whose per capita GDP of $40,924 falls well below the US average of $54,307 and where inexpensive energy is hard to come by, I thought the case for the project deserved to be made.

For this assignment, I designed the following graphic for the Arkansas Times, the state’s go-to alternative news source.

Clean Line Energy (2)



Boston’s Urban Orchards

With the weather becoming warmer again, for this week’s assignment I reported on a lighter and sweeter topic: urban orchards in the Boston area.

I was fascinated to learn that there are, in fact, fruit trees and berry bushes around Cambridge, Somerville, Boston, and beyond that are open to public foraging. This dataset from the City of Boston data portal lists the known plants, and I overlaid it onto a color-coded map.

I was unable to get the embed working in WordPress, which does not allow iframes, so here’s the link!

Is San Francisco’s Hot Housing Market Literally On Fire?

This project is a collaboration between David Jimenez, Charles Kaioun, Celeste LeCompte, and Léa Steinacker.

In San Francisco, there is a growing concern about residential fires, which have displaced more than 100 residents from their homes since the beginning of the year. Have there been more big fires? If so, why? We turned to the data to answer the question.

FIRE-in-SFO-draft_3

Read on for more background on our analysis.

Visualizing newspapers’ words

For this assignment I worked on a project that evolved and was folded into a larger one. I’m currently working with my hometown university (ITESO) to understand what the newspapers are publishing about political candidates. With elections for mayors and the local congress 60 days away, the team in Mexico is collecting campaign-related news every day.

With the news as a data set, I processed it in Wordij, a tool that generates semantic networks from .txt files. The software also outputs word counts as a CSV file. These files were processed in Excel and then visualized in Tableau. With all the data in place, we can run queries to find out how many times each newspaper has mentioned a candidate, and by adding data every day we are getting a larger picture of whom the newspapers are talking about.
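As a rough illustration of the kind of query described above (how often each newspaper mentions each candidate), here is a minimal Python sketch. It does not reproduce the Wordij, Excel, or Tableau steps. The newspaper names and article text are invented placeholders, and `count_mentions` is a hypothetical helper, not part of any of those tools.

```python
import re
from collections import Counter, defaultdict

# Candidate surnames mentioned in this post.
CANDIDATES = ["Alfaro", "Villanueva", "Petersen"]


def count_mentions(articles, names=CANDIDATES):
    """Return {newspaper: Counter({name: mentions})} for (newspaper, text) pairs.

    Matching is case-insensitive and whole-word only.
    """
    patterns = {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in names
    }
    totals = defaultdict(Counter)
    for newspaper, text in articles:
        for name, pattern in patterns.items():
            totals[newspaper][name] += len(pattern.findall(text))
    return dict(totals)


# Placeholder data: newspaper names and sentences are invented for illustration.
sample_articles = [
    ("Newspaper A", "Alfaro spoke downtown. Later, Alfaro met Villanueva."),
    ("Newspaper A", "Petersen presented a health plan."),
    ("Newspaper B", "Villanueva and Petersen debated; Villanueva closed."),
]

mentions = count_mentions(sample_articles)
for paper, counts in sorted(mentions.items()):
    print(paper, dict(counts))
```

Running the same function over each day’s articles and summing the counters would give the cumulative picture described above.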

Screenshot 2015-04-07 22.49.09

As part of another project that I’m involved with to monitor the political campaigns, we decided to include the data-viz tool on the site. All the information is in Spanish, but anyone who wants to try the tool can use the right-side panel to adjust dates or word frequency, or to search for specific words. For example, search for the political parties “PRI”, “PAN” or “MC”, or for the candidates’ last names “Villanueva”, “Alfaro” or “Petersen”. The visualization is here.

Visualization showing queries for "Alfaro", "Villanueva" and "Petersen"


Narrative of education in Pakistani media sources

The United Nations has recently announced that international donors have pledged $1 billion to provide education to millions of children in Pakistan. Nearly 25 million children are currently out of school in Pakistan, and about seven million of these children have yet to receive primary schooling, according to a recent report prepared by Society for the Protection of the Rights of the Child (SPARC).

Education in Pakistan has long been in a state of crisis. After Musharraf’s regime, Pakistan resumed elections in 2008, and media, judiciary and other democratic institutions have strengthened since then. What does the narrative of education look like in current times, and what kind of discourse underlies the education narrative? These are the questions that we explore in this inquiry.

To understand the narrative of education in Pakistan, we applied an unsupervised learning algorithm to the text corpus provided by Alif Ailaan, an education advocacy group in Pakistan. The corpus comprises education stories curated from Pakistani media sources (including Dawn, The Express Tribune, Nation, The News and Pakistan Today) since Feb. 2013. The purpose of using an unsupervised learning algorithm was to delineate the underlying topical themes present in the text corpus.

We extracted five topic structures using our learning algorithm. The intuition behind the algorithm is that documents exhibit multiple topics. For instance, within a single document, ‘Malala’, ‘woman’ and ‘education’ may be grouped together as one topic, while ‘federal’, ‘funding’ and ‘government’ are grouped into another. Using this technique we extracted the keywords associated with each of the five topics that the algorithm discovered.
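The post does not name the algorithm we used, so the sketch below is only an illustration of the general idea in the paragraph above. It assumes latent Dirichlet allocation (LDA), one common topic model in which documents mix several topics, fitted with a tiny collapsed Gibbs sampler in plain Python. The function name `fit_lda`, the hyperparameters, and the toy documents are all hypothetical; the toy vocabulary borrows a few keywords from the example above plus invented filler words.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA (illustrative, not optimized).

    docs is a list of token lists. Returns (topic_word, doc_topic):
    word counts per topic and topic counts per document.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    doc_topic = [[0] * num_topics for _ in docs]
    assignments = []

    # Start by giving every token a random topic.
    for d, doc in enumerate(docs):
        doc_assign = []
        for word in doc:
            k = rng.randrange(num_topics)
            doc_assign.append(k)
            topic_word[k][word] += 1
            topic_total[k] += 1
            doc_topic[d][k] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Take the token out of the counts before resampling it.
                k = assignments[d][i]
                topic_word[k][word] -= 1
                topic_total[k] -= 1
                doc_topic[d][k] -= 1
                # Weight each topic by how much this document uses it and
                # how strongly the topic is associated with this word.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                topic_word[k][word] += 1
                topic_total[k] += 1
                doc_topic[d][k] += 1
    return topic_word, doc_topic


# Toy documents, invented for illustration (not from the Alif Ailaan corpus).
toy_docs = [
    "malala woman education rights".split(),
    "malala education woman peace".split(),
    "federal funding government budget".split(),
    "government federal funding policy".split(),
]

topic_word, doc_topic = fit_lda(toy_docs, num_topics=2)
for t, counts in enumerate(topic_word):
    top_words = [word for word, count in counts.most_common(3) if count > 0]
    print(f"topic {t}: {top_words}")
```

On a real corpus one would normally use an established library rather than a hand-rolled sampler; the point here is only to show how words that co-occur in the same documents end up sharing a topic.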

Below is a bubble graph of the entire topical space. Each bubble shows a keyword’s proportional weight within its topical cluster, and the clusters are differentiated by color.

Topics from education corpus


Now we will look at each topic individually. We have labeled the first topic “Federal Education” because it loosely reflects the discourse surrounding federal education policies and issues, in the form of keywords like ‘federal’, ‘administration’, ‘CADD’ and ‘FDE’. Both the Capital Administration and Development Division (CADD) and the Federal Directorate of Education (FDE) are constitutional bodies that are responsible for federal functions on education.

Topic: Federal Education


We have labeled the second cluster as “Higher Education” since it contains terms like ‘university’, ‘international’, ‘technology’, ‘faculty’, and ‘science’ which are characteristic of higher education in Pakistan. The Higher Education Commission of Pakistan (‘HEC’) is a constitutionally established institution that drives higher education efforts in Pakistan.


Topic: Higher Education


We have labeled the third cluster “Primary Education” because of terms like ‘child’, ‘primary’, ‘enrollment’, ‘school’, ‘literacy’, ‘teacher’, and ‘english’. Last year, successful primary enrollment drives took place at the provincial level in Pakistan to register out-of-school children in public schools.

Topic: Primary Education


The fourth cluster of topics, which we have labeled “Malala”, is the most telling one. Malala became “the spokesperson for a generation of girls” after being shot in the head by the Taliban. Almost half of rural young women in Pakistan have never attended school, according to a 2012-2013 UNESCO report. Malala is the only personal name that appears in the topical space on education in Pakistan. This cluster of words is also marked by tension between heterogeneous discourses in Pakistan, including Talibanization, religion, security, peace, rights, and gender, which highlights the disruptive power of the “Malala” narrative on the discourses around education.

Topic: Malala


Lastly, the fifth cluster of topics includes province-related terms such as ‘sindh’, ‘punjab’, ‘local’, ‘district’, and ‘provincial’. We have labeled this topic “Provinces and Education”.

Topic: Provinces and Education


In the chart below we show a timeline of the news stories curated in the Alif Ailaan corpus. Malala gave her first speech at the United Nations in July 2013; the increase in the number of education stories that month could be related to her speech. Similarly, spikes in Aug. 2013 and Sept. 2013 could be explained by enrollment drives in Punjab and Khyber Pakhtunkhwa provinces. These campaigns aimed to enroll out-of-school children in public schools. Finally, the spike in Feb. 2014 could be related to the launch of the Annual Status of Education Report (ASER), which highlighted Pakistan’s education crisis and made headlines in national newspapers. An in-depth analysis of these correlations is needed to provide more concrete insights into these trends.

News stories timeline


In summary, these preliminary findings suggest that the current narrative of education in the Pakistani media landscape is rich and diverse, and that it covers the entire gamut of concerns around the education crisis. The topics we discovered suggest that media attention on education is driven by an active state of affairs.


Background: Kevin Hu & Travis Rich built a site called GIFGIF, which aims to crowd tag animated gifs with various emotions. From GIFGIF’s website: “An animated gif is a magical thing. It contains the power to convey emotion, empathy, and context in a subtle way that text or emoticons simply can’t. GIFGIF is a project to capture that magic with quantitative methods. Our goal is to create a tool that lets people explore the world of gifs by the emotions they evoke, rather than by manually entered tags.” As we know, animated gifs are also a popular storytelling mechanism for social news and entertainment websites.

The cultural phenomenon of using animated gifs to express emotions has been the subject of numerous journalistic inquiries:

Fresh From the Internet’s Attic – NYTimes

Christina Hendricks on an Endless Loop: The Glorious GIF Renaissance –

GIF hearts Tumblr: a fairytale for the Internet age –

Visualization project for this week: Kevin, Travis, and I built a map tool so people can explore GIFGIF’s current dataset to see which gifs are most representative of certain emotions across different countries. Out of 1.8 million votes, 1.4 million votes had IP data which links the votes to the location of the voter. GIFGIFmap can be found here.

Screen Shot 2014-04-02 at 1.03.12 AM

In a future version, we would like to show which top gifs per emotion countries have in common with each other, and which top gifs are unique to each country (along the lines of What We Watch). However, there are limitations to the GIFGIF data set in terms of global coverage. For example, the top 21 countries account for 92% of the votes. Additionally, we excluded countries that had fewer than 10,000 total votes across all categories, so as to avoid making generalizations based on limited data. We chose to include the number of votes per country (per emotion) to make the data set more transparent in terms of representation.
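To make the two coverage checks above concrete, here is a small Python sketch of a minimum-vote filter and a "share of votes held by the top N countries" calculation. Only the 10,000-vote threshold comes from this post. The country names and vote totals are invented, and `filter_countries` and `top_share` are hypothetical helpers rather than GIFGIF code.

```python
MIN_VOTES = 10_000  # threshold stated in the post


def filter_countries(votes_by_country, min_votes=MIN_VOTES):
    """Keep only countries whose total votes meet the threshold."""
    return {
        country: votes
        for country, votes in votes_by_country.items()
        if votes >= min_votes
    }


def top_share(votes_by_country, top_n):
    """Fraction of all votes cast by the top_n countries."""
    ranked = sorted(votes_by_country.values(), reverse=True)
    total = sum(ranked)
    return sum(ranked[:top_n]) / total if total else 0.0


# Invented vote totals, for illustration only (not GIFGIF data).
sample_votes = {
    "Country A": 600_000,
    "Country B": 250_000,
    "Country C": 40_000,
    "Country D": 9_000,
    "Country E": 1_000,
}

kept = filter_countries(sample_votes)
print(sorted(kept))
print(round(top_share(sample_votes, top_n=2), 3))
```

Here Country D and Country E fall below the threshold and are dropped, while the share calculation still uses every country so the concentration of votes stays visible.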

We think the tool we are building could complement existing stories about the phenomenon of using animated gifs to communicate (stories like the ones we linked to above).

These are some potential questions that we hope journalists could explore using a map interface to the GIFGIF dataset:

1) Do people from different countries interpret the emotional content of gifs differently?

2) If there are variances in interpretation, are there clusters of countries that have more similar interpretations? Do these match up with proximity, or immigration patterns?

3) What top gifs per emotion are unique to a given country?


Note: GIFGIF’s data will soon be made publicly available through an API.


Boston high schools- by the numbers

My Quest for Truth

It all started with a simple question: How many high schools are there in Boston? lists “all public and private high schools located in Boston” and says there are 17. lists 32 public and private high schools. US News says there are 32 schools just within the Boston Public School District. Wikipedia says 33. The Massachusetts Department of Education lists 42 public and private.

I compiled a list of 56.

Why the discrepancy over a seemingly basic question? Is it because

  • We can’t agree on what “high school” means?
  • We can’t agree on what “in Boston” means?

Charter schools, special education, adult education, vocational training, private schools, religious schools- there are many ways to designate what is and is not a “high school” that could explain the differences.
Boston public schools, Boston city limits, Greater Boston- the discrepancy may also be caused by varying definitions of what it means for a high school to be “in Boston.”

I aim to create an authoritative central portal that lists all high schools in Boston. I will continue exploring this in future assignments (talk to me if you want to collaborate!).

Cold Calling For Data

To preempt a similar situation arising when trying to figure out how many high school students there are in Boston, this time I chose a bottom-up rather than a top-down approach. I picked up the phone and began cold calling every high school on my list. I asked every school receptionist two questions:

  • How many students go to your school?
  • What makes your school special?

I chose these two questions because I thought they would be a good foundation to explore both quantitative and qualitative data, and the answers could give me potential follow-on questions if I continue focusing on Boston high schools.

Another Course to College- their Annual Report states 220 students; their receptionist told me 224.

Boston Adult Technical Academy- their Annual Report states 257 students; their receptionist told me 300.

Boston Arts Academy- their Annual Report states 420 students; their receptionist told me 400.

Boston International High School- their Annual Report states 359 students; their receptionist told me 500.

… and the list goes on. I could present more data but I’m not sure what story I want it to tell yet. Yes, I could add up all the numbers and create “the authoritative Julia guide to how many high school students there are in Boston.” Yes, I could put together another “a-ha” moment showing the discrepancies in calculating this number across organizations and websites. But I don’t want to present a repeat of other dry, going nowhere data pieces.
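For the four schools quoted above, the gap between the Annual Report figure and the receptionist’s figure can be tabulated with a few lines of Python. This is only a sketch using the numbers already given in this post; the `discrepancies` helper is a hypothetical name.

```python
# (school, Annual Report count, receptionist count), as quoted in this post.
enrollment = [
    ("Another Course to College", 220, 224),
    ("Boston Adult Technical Academy", 257, 300),
    ("Boston Arts Academy", 420, 400),
    ("Boston International High School", 359, 500),
]


def discrepancies(rows):
    """Return (school, difference, percent difference vs. the Annual Report)."""
    result = []
    for school, reported, phoned in rows:
        diff = phoned - reported
        result.append((school, diff, 100 * diff / reported))
    return result


for school, diff, pct in discrepancies(enrollment):
    print(f"{school}: {diff:+d} students ({pct:+.1f}%)")
```

A positive number means the receptionist quoted more students than the Annual Report, and a negative number means fewer.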

Telling a Story

I recently read the book Made to Stick: Why Some Ideas Survive and Others Die, which drove home for me the importance of telling a compelling story. With the school mapping project I am working on, I have been more focused on organizing and presenting the information and hoping others will find stories to tell, rather than having to tell the story myself. My model has been Wikipedia, which presents information in a way that is useful to the reader. Would you say that Wikipedia tells a story?

My aim has been to build a school mapping platform using data and communication tools that are informative and useful. I thought that would be enough. What I’m struggling with now is how to build a platform that tells a story, and what story do I want it to tell.