As state and national government agencies continue to make community-related data available online (mass.gov/data, NY community health data), there is an exciting opportunity to look for rich information across datasets.

For example, one dataset may contain low birth rates across counties in New York, while another may contain youth pregnancy data. Exploring correlations between the two datasets is an important first step towards uncovering the next big scoop (or interesting facts!).

Throughout the semester, I identified three hurdles that make these types of analyses difficult.

  1. Data import: Simply getting data into a program you’re comfortable with is hard. The data may be stored as JSON, or displayed as an HTML table on a website. Government datasets often come as hundreds of files in a zip file.
  2. Extracting location information: Many datasets are difficult to deal with because the statistics (e.g., low birth rates) are assigned to communities, cities, counties, or states. One dataset may report birth rates by county, while another reports them by zip code. We can only compare the two if we know which zip codes belong to which county.
  3. Looking for patterns: Managing even 5 different datasets can quickly become unwieldy.
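
To make the second hurdle concrete, here is a minimal Python sketch of rolling zip-level numbers up to counties so they can be lined up with a county-level dataset. This is not EasyData's code: the lookup table, the function names, and every number are made up for illustration, and a plain average ignores population differences between zip codes.

```python
from collections import defaultdict

# Illustrative lookup only; a real zip-to-county table has tens of thousands of rows.
ZIP_TO_COUNTY = {"12207": "Albany", "12210": "Albany", "14201": "Erie"}


def rollup_to_county(values_by_zip, zip_to_county):
    """Average zip-level values within each county, skipping unknown zips."""
    grouped = defaultdict(list)
    for zip_code, value in values_by_zip.items():
        county = zip_to_county.get(zip_code)
        if county is not None:
            grouped[county].append(value)
    return {county: sum(vals) / len(vals) for county, vals in grouped.items()}


def join_on_county(left, right):
    """Pair up the values of two county-level tables for counties present in both."""
    return {county: (left[county], right[county]) for county in left.keys() & right.keys()}


# Made-up numbers, only to show the shape of the data.
pregnancy_by_zip = {"12207": 10.0, "12210": 20.0, "14201": 30.0, "99999": 5.0}
low_birth_by_county = {"Albany": 7.5, "Monroe": 8.0}

pregnancy_by_county = rollup_to_county(pregnancy_by_zip, ZIP_TO_COUNTY)
print(join_on_county(low_birth_by_county, pregnancy_by_county))
```

Only counties that appear in both tables survive the join, which is usually what you want before looking for a relationship between the two statistics.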

EasyData is a prototype that tries to make each of these three steps less cumbersome by automating it as much as possible.

Easy Data Import

EasyData will do a good enough job of importing your data. If the data contains headers (e.g., “birth_rate”, “county_name”), it will try to identify them. If the data has errors, it will ignore them.
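
As a rough illustration of what "good enough" importing could look like, here is a small Python sketch for CSV text. The heuristics are my assumptions, not EasyData's actual rules: the first row is treated as a header when none of its cells parse as a number, and "ignoring errors" is interpreted as dropping rows with the wrong number of columns. The names `load_table` and `_is_number` are hypothetical.

```python
import csv
import io


def _is_number(cell):
    """True if the cell text parses as a float."""
    try:
        float(cell)
    except ValueError:
        return False
    return True


def load_table(text):
    """Return (header, rows) from CSV text, skipping rows of the wrong width."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows:
        return [], []
    first = rows[0]
    # Assumed heuristic: a first row without any numeric cell is a header.
    has_header = not any(_is_number(cell.strip()) for cell in first)
    header = first if has_header else [f"col{i}" for i in range(len(first))]
    body = rows[1:] if has_header else rows
    # Assumed meaning of "ignore errors": drop rows whose width is off.
    good_rows = [row for row in body if len(row) == len(header)]
    return header, good_rows


# Made-up sample with one malformed line.
sample = "county_name,birth_rate\nAlbany,7.5\nbroken row\nErie,8.1\n"
header, rows = load_table(sample)
print(header, rows)
```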

Automatic Geocoding

EasyData will analyze your data to see if any of the columns contain zip codes, state names, addresses, or latitude/longitude coordinates. If it thinks it sees an address and a state name, it will try to automatically geocode the table. Otherwise, you can tell EasyData which columns to geocode and it will do the rest.
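
One simple way to guess what a column contains is to test each value against a few patterns and accept a label when most values match. The Python sketch below is an assumption about how such a detector might work, not EasyData's implementation; the 80% threshold, the tiny state list, and the function names are all placeholders, and address detection is left out because it is much harder.

```python
import re

ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")
# Deliberately tiny sample; a real list would cover every state.
STATE_NAMES = {"massachusetts", "new york", "texas", "california"}


def _is_coordinate(text):
    """True for decimal numbers in the range a latitude or longitude can take."""
    try:
        number = float(text)
    except ValueError:
        return False
    return "." in text and -180.0 <= number <= 180.0


# Checks run in order; the first label that enough values satisfy wins.
CHECKS = [
    ("zipcode", lambda s: bool(ZIP_RE.match(s))),
    ("state", lambda s: s.lower() in STATE_NAMES),
    ("coordinate", _is_coordinate),
]


def guess_column_type(values, threshold=0.8):
    """Return 'zipcode', 'state', 'coordinate', or None for a column of values."""
    cleaned = [str(v).strip() for v in values if str(v).strip()]
    if not cleaned:
        return None
    for label, check in CHECKS:
        hits = sum(1 for s in cleaned if check(s))
        if hits / len(cleaned) >= threshold:
            return label
    return None


print(guess_column_type(["02139", "02139-4307", "10001"]))
```

Requiring only most values to match, rather than all of them, leaves room for the occasional blank or mistyped cell.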

Automatic Correlation Search

Once your data has been geocoded, EasyData will combine the statistics from datasets that reference the same location (e.g., downtown Boston) and see if there are any interesting trends. It will plot the most interesting trends.
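
A correlation search over location-keyed statistics can be sketched in a few lines of Python. I am assuming here that "interesting" means a large absolute Pearson correlation, which may not be the measure EasyData uses; the function names, the `min_shared` cutoff, and the numbers are invented for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def rank_correlations(stats, min_shared=3):
    """stats maps a statistic name to {location: value}; returns pairs sorted by |r|."""
    results = []
    for a, b in combinations(sorted(stats), 2):
        shared = sorted(stats[a].keys() & stats[b].keys())
        if len(shared) < min_shared:
            continue  # too few common locations to say anything
        r = pearson([stats[a][loc] for loc in shared], [stats[b][loc] for loc in shared])
        results.append((a, b, r))
    return sorted(results, key=lambda item: abs(item[2]), reverse=True)


# Made-up values keyed by location.
stats = {
    "a": {"x": 1, "y": 2, "z": 3},
    "b": {"x": 2, "y": 4, "z": 6},
    "c": {"x": 3, "y": 1, "z": 2},
}
ranked = rank_correlations(stats)
print(ranked[0])
```

With only a handful of shared locations a high correlation is weak evidence, so anything this surfaces is a lead to investigate rather than a finding.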

Here’s a screenshot so you know it’s real.


Fact Checking a Technique

Felipe Andres Coronel, better known as Immortal Technique, is a popular underground rapper of Afro-Peruvian descent whose lyrics focus on controversial issues such as class imbalance, racial inequality, and institutional oppression. Unfortunately, many people dismiss his lyrics as conspiracy theories and the antics of a wild man. In response, he lucidly argues that his lyrics are simply “the truth”, and the truth is often seen as revolutionary.

In his own words

I give niggaz the truth, cause they pride is indigent

On March 17, 2012, Immortal Technique will take the stage at Boston’s Paradise Rock Club. As a fan of Mr. Coronel’s lyrical prowess and a budding journalist, I will naturally attend the concert and listen to his thoughts on current issues. I find it valuable to perform a cursory factual assessment of his lyrics. To do this, I picked his verse from the song Young Lords on his most recent album, The Martyr:

I survived the cointelpro assassinations.
AIDS epidemic, Crack era, fractured a nation,
The Interpretation of American Democracy,
Is best exemplified in it's foreign policy dichotomy,
I live a double life of political philosophy,
But revolution follows me, the struggle for equality,
Against the morally bankrupt claiming to be born again,
It's a civil war again like MS-13s origin
Ban ethnic studies claiming our culture will swallow them,
But you can't conquer people and build a country on top of them,
And then feel offended that they breathe the same oxygen,
Your family values lack the wisdom of Solomon,
But Operation Condor and Operation Bootstrap are Polisci 101,
Research for the new jack,
It's hard to reach Communist Utopia tomorrow,
When your hands are in a fuckin glass jar like Che Guevara,
Forget the distorted historical facts you were given,
Slave trade was the capital for capitalism,
Trapped in a prison mentally, dying existentially,
Separated from people you can't see yourself to be,
Then racially integrated into a burning house colony of an empire,
Economically burning out,
Can't win a debate so they sponsor every threat to me,
I wonder if agent 800 is standing next to me!    

Let’s go through the first half.

COINTELPRO assassinations: COINTELPRO was a series of covert, illegal, and since declassified projects to remove power from domestic political organizations, such as the KKK and Black Panthers. The summary report by the Senate acknowledges that “… the domestic activities of the intelligence community at times violated specific statutory prohibitions and infringed the constitutional rights of American citizens.” However, the projects were active between 1956 and 1971, and are unlikely to have directly affected Felipe (born 1978).

The Interpretation of American Democracy / Is best exemplified in it’s foreign policy dichotomy: The US government’s foreign policy has often been called a dichotomy, for example when the government calls for reducing weapons in the Middle East while supplying tanks to countries in the Gulf. Similarly, while the United States is called the “greatest democracy on the planet”, controversies such as the financial institutions’ ties with the Federal Reserve, and the 1%, suggest otherwise.

Civil war like MS-13s origin: MS-13 is a Los Angeles gang notorious for its excessive cruelty. It originated as a group to protect Salvadoran immigrants, who were fleeing civil war in their home country, from existing, well-established Mexican gangs in the area, even though both sides were immigrant populations living in the same region. Stepping back, we can see that much of the news in the past several months has focused on income disparities and the resulting unrest. Comparing current events to a civil war between a militant government and a guerrilla coalition is certainly an overstatement.

Based on an admittedly small sample set, the relationships that Technique weaves between (factually accurate) historical events, current events, and himself are tenuous at best. This is an instance where the individual facts are correct but the contextual information is “pants on fire”.

To perform the fact checking I used a combination of Rap Genius (not a very good source), Wikipedia, and old-fashioned Google searches.


The key principles that must be followed in any election are that a large portion of the voters participate in the process, and that there is a lack of corruption. The 2012 Presidential Election has become a major turning point in the United States’ election process, in that both principles have disappeared from the radar. Despite the efforts of organizations such as Rock the Vote, voter apathy has only increased since the last election. When interviewed, many cited a lack of faith in the political system and the amount of corruption.

These complaints stem from the second change: the unprecedented emphasis on money. Money has always been a part of the process. Many experts argue that Obama won the 2008 election by spending a record-shattering $700 million on his campaign, as opposed to McCain’s $100 million. Despite the campaign funding, there was still a limit on how much an individual could donate to a candidate. However, the birth of a new player, the SuperPAC, is poised to completely change the election process by giving wealthy individuals and organizations the power to directly market and advertise on behalf of (or against) candidates all the way up until election day.

This issue was once again brought to light at a recent talk at MIT’s Media Lab by Lawrence Lessig, political activist and director of the Foundation for Ethics at Harvard University. Lessig has spent the past five years battling political corruption, and he focused the discussion on systemic corruption in the current political system.

He emphasized SuperPACs as a poignant example of corruption. The January 2010 ruling in Citizens United v. FEC overturned a law that prohibited corporations and unions from spending on “electioneering communication”, which encompasses any television, radio or other communication that mentions a candidate close to election time. The law was designed to limit the amount of influence third parties can have on the election. With the law overturned, any corporation can act as a candidate’s third arm and effectively help fund the candidate’s marketing campaign throughout the election. Candidates are no longer


There has been a surge in the number and funding of SuperPACs. The following diagram depicts the number of (dashed line) and total funding of (solid line) SuperPACs over the past ten years. Although SuperPACs (otherwise called Independent Expenditures) have always existed, the number has doubled since the Citizens United ruling, while the total funding has increased 12-fold, from $9 million to $122 million.

Not only have the number and total funding of SuperPACs gone up, so has the disparity between the top PACs and smaller PACs. The following diagram is a cumulative plot of PAC funds. The x-axis shows the size of a single PAC’s (let’s call it PACMan) funds in the millions, while the y-axis shows the cumulative amount of funds of all PACs the same size or smaller than PACMan. Each differently colored line is the data for a different year. We see a similar story: before 2010, the largest PAC only had $2.5 million. Now, the largest SuperPACs command over $35 million, far more than some candidates’ entire budgets in the last election. This is money used not only to back candidates that support the SuperPAC’s policies, but also to attack candidates that speak contrary to the SuperPAC’s opinions.
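
For readers who want to rebuild the cumulative plot, the points for one year's line can be computed roughly as in the Python sketch below. This is my reconstruction of the idea rather than the code in the repository; the function name is hypothetical and the fund sizes are invented placeholders, not FEC figures.

```python
from itertools import accumulate


def cumulative_funds(pac_funds):
    """Return (size, total funds of all PACs that size or smaller), ascending by size."""
    ordered = sorted(pac_funds)
    points = {}
    # Running sum over the sorted sizes; for tied sizes the last (largest) sum wins.
    for size, running_total in zip(ordered, accumulate(ordered)):
        points[size] = running_total
    return list(points.items())


# Invented fund sizes in millions of dollars for a single year.
funds = [2.5, 0.5, 1.0, 1.0]
print(cumulative_funds(funds))
```

Plotting the first element of each pair on the x-axis and the second on the y-axis, once per year, gives one colored line per year.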

Lessig made a powerful point during his talk. The mere existence of SuperPACs with so much money and power will force politicians to buy political insurance. Politicians will cater to a SuperPAC’s policies for fear that a competitor comes along armed with a different SuperPAC. At what point will these politicians consider the public?

All data provided by the Federal Election Commission.

All source code available on github.

Tracking a media diet


I approached the Media Diet in two ways.  This post will discuss one of them, which is a tool called IdeaPrint.

In the current news environment, a small number of corporations and individuals control the vast majority of the media that is produced, while a vast number of smaller blogs and organizations, in aggregate, provide balanced and far-reaching reports of the news.   Despite so many choices, many people still stick to a small number of news sources, which will inevitably result in biased views.

The first component of my media diet is a tool called IdeaPrint.  The ultimate goal is a tool that can keep track of an individual’s “idea consumption” and construct a unique “ideaprint”, similar to a fingerprint.  The ideaprint includes information about the biases that influence your idea sources.  For example, how many of the ideas you consume come from sources owned by Rupert Murdoch?  This information can be further used to suggest additional articles and commentary to provide a more balanced view on topics.  It can also be used across your social circle to identify homogeneous thought processes and enhance the variety of news content that you and your friends read.
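
The ownership question could be answered with something like the Python sketch below, which turns per-site visit counts into a share per owner. The prototype does not do this yet; the function name, the site names, and the owner labels are all placeholders, and a real version would need an actual site-to-owner mapping.

```python
from collections import defaultdict


def owner_shares(visits_by_site, owner_by_site):
    """Return the fraction of all visits that went to each owner's sites."""
    total = sum(visits_by_site.values())
    if total == 0:
        return {}
    visits_by_owner = defaultdict(int)
    for site, visits in visits_by_site.items():
        # Sites missing from the mapping are grouped under "unknown".
        visits_by_owner[owner_by_site.get(site, "unknown")] += visits
    return {owner: count / total for owner, count in visits_by_owner.items()}


# Placeholder sites and owners, not real ownership data.
visits = {"site-a.example": 6, "site-b.example": 2, "site-c.example": 2}
owners = {"site-a.example": "Owner A", "site-b.example": "Owner A"}
print(owner_shares(visits, owners))
```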

The current tool is built as a Google Chrome extension that simply aggregates the number of visits to major websites (those that have Wikipedia articles) and displays the top 9 as a bar chart.  In contrast to tools such as RescueTime, the goal is to enhance the list of visited sites with information about the site owners and the amount of time spent reading an article, and to provide a simple API for custom analyses.
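
The aggregation step itself is small.  The extension is not written in Python, so the sketch below only illustrates the logic: count visits per hostname, keep the sites in a known set (standing in for "has a Wikipedia article"), and return the top 9.  The function name, the `known_sites` parameter, and the sample URLs are assumptions for the example.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_sites(visited_urls, known_sites, limit=9):
    """Count visits per known site and return the most visited ones."""
    counts = Counter()
    for url in visited_urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one site
        if host in known_sites:
            counts[host] += 1
    return counts.most_common(limit)


# Sample history; known_sites stands in for the "has a Wikipedia article" check.
history = [
    "https://www.nytimes.com/a",
    "http://nytimes.com/b",
    "https://example.org/x",
    "https://www.bbc.co.uk/news",
]
print(top_sites(history, known_sites={"nytimes.com", "bbc.co.uk"}))
```

The resulting list of (site, visits) pairs is the data behind the bar chart.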

The current implementation is a very hacked up prototype.  You can check out the source code at: https://github.com/sirrice/ideaprint