An Analysis of Sina Weibo Censorship Using WeiboScope Search Data
Starting at 4:28 PM on May 19, 2012, I posted two names on my Sina Weibo account, as well as the Chinese words for “Taiwan independence.” The first name was “Chen Guangcheng” (in English and Chinese), the blind lawyer who escaped house arrest in Shandong province and made his way to the U.S. Embassy in Beijing. The second was “Bo Xilai” (in English and Chinese), the former Party Secretary of Chongqing who recently fell from power. Less than 14 hours later, I received a message from Sina Weibo’s system administrator informing me that my two posts on “Chen Guangcheng” were “inappropriate” and had been censored. While I can still see the two “Chen Guangcheng” posts on my Sina Weibo account page, no one else can. Surprisingly, my posts on “Bo Xilai” and “Taiwan independence” were not censored.
Herein lies the conundrum with censorship in China. We know that certain topics are censored on blogs hosted in China, on Chinese search engines and on weibos (microblogs). But we don’t know where the line lies, partly because the line is constantly moving. Baidu, Sina and Tencent could help identify the line by publishing a list of banned topics or keywords, but they don’t. Rather, they hire “monitoring editors” and rely on self-censorship to ensure that user-generated content does not run afoul of Chinese authorities.
Some computer scientists in academia have tried to make sense of censorship on Sina Weibo by analyzing the data. In March 2012, David Bamman, Brendan O’Connor and Noah Smith at Carnegie Mellon University published a paper entitled “Censorship and deletion practices in Chinese social media” in First Monday. After analyzing 56 million Sina Weibo messages, they found that more than 16% had been deleted. King-Wa Fu and Cedric Sam at the University of Hong Kong’s Journalism and Media Studies Centre have built the Weibo Scope Search, which archives deleted posts on Sina Weibo.
For my MIT Media Lab final project, I’ve tried to build on King-Wa Fu and Cedric Sam’s work by analyzing the data collected by the Weibo Scope Search to try to make some sense of Sina Weibo censorship. Between its inception on February 1, 2012 and May 20, 2012, the Weibo Scope Search collected 12,032 deleted messages from Sina Weibo. The first thing I did was simply plot all the deleted messages on a timeline from February 1, 2012 to May 20, 2012, and this is what I got:
My findings were consistent with the Carnegie Mellon team’s findings. There are spikes in Sina Weibo censorship as a result of media reports and rumors. During the Carnegie Mellon study period, from June 27, 2011 to September 30, 2011, a rumor that former President Jiang Zemin had passed away caused a spike in Sina Weibo deletions. From February 1, 2012 to May 20, 2012, the following incidents in China caused the censors employed by Sina Weibo to work overtime:
- February 6, 2012 - Chongqing Public Security Bureau head Wang Lijun goes to the U.S. Consulate in Chengdu with information about the death of British businessman Neil Heywood that implicates Chongqing Party Secretary Bo Xilai.
- March 8, 2012 - Chongqing Party Secretary Bo Xilai fails to show up at the National People’s Congress, sparking rumors that he has fallen from power.
- March 15, 2012 - Bo Xilai is removed from his post as Chongqing Party Secretary.
- March 18, 2012 - A Ferrari crashes on Beijing’s Fourth Ring Road, killing one person and injuring two.
- April 22, 2012 - Blind lawyer Chen Guangcheng escapes from house arrest in Shandong province and makes his way to the U.S. Embassy in Beijing.
- May 14, 2012 - The Beijing Daily posts a message on its official Weibo charging that U.S. Ambassador to China Gary Locke is posing as an ordinary citizen and calls for Locke to disclose his wealth.
Interestingly, deletions of Sina Weibo messages tend to hit a low on Saturdays. I’m not sure why that is, except that maybe the censors want to take time off on weekends as well. If you want to maximize the length of time your message remains on Sina Weibo, the best time to post is probably after 11 PM on Friday night.
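One rough way to check the Saturday pattern is to count deletions by day of the week. The following is a minimal Python sketch, assuming the “deleted” timestamps are available as strings in the `YYYY-MM-DD HH:MM:SS` form shown later in this post; the function name `deletions_by_weekday` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def deletions_by_weekday(deleted_timestamps):
    """Count deletions per weekday from 'YYYY-MM-DD HH:MM:SS' strings."""
    counts = Counter()
    for stamp in deleted_timestamps:
        moment = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # weekday() returns 0 for Monday through 6 for Sunday
        counts[WEEKDAYS[moment.weekday()]] += 1
    return counts

# Three "time deleted" values taken from the table later in this post
sample = ["2012-05-18 18:22:25", "2012-04-27 20:40:12", "2012-03-28 07:55:28"]
for day, count in deletions_by_weekday(sample).items():
    print(day, count)
```

Keep in mind that these are the times Weibo Scope Search detected the deletions, not the times the censors acted, so a deletion late on one day can be counted on the next.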
The second analysis I did with the Weibo Scope Search was to try to figure out how long it took the censors to delete messages on Sina Weibo. Each Sina Weibo post has a timestamp for when it was created. The Weibo Scope Search checks Sina Weibo’s timeline at most four times a day (but usually fewer, due to limits that Sina Weibo imposes). Let’s say, for instance, that a user posts a message on Sina Weibo at 8 AM and that Weibo Scope Search checks Sina Weibo’s timeline at 9 AM, 3 PM, 9 PM, and 3 AM. If the censor deleted the message at 10 AM, Weibo Scope Search would record its “deleted time” as 3 PM.
The fastest a post was deleted on Sina Weibo was just over 4 minutes. The longest it took for the censor to get around to deleting a message on Sina Weibo was over four months. For the posts created on May 20, 2012 and deleted on the same day, it took on average 11 hours for Weibo Scope Search to detect the deletion. It took the censors about 14 hours to delete my “Chen Guangcheng” posts. Determining the average time it takes for censors to delete “irresponsible” messages is a bit tricky, since we don’t have data on exactly how long it takes for each post to be deleted. Out of curiosity, I pulled up three messages that took over four months to delete to see what they said:
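To make the polling example concrete, here is a small Python sketch that finds the first check after a deletion. The four check times are the hypothetical ones from the example above, not Weibo Scope Search’s actual schedule, and the function name `first_poll_after` is my own.

```python
from datetime import datetime, timedelta

# Hypothetical check times from the example: 3 AM, 9 AM, 3 PM and 9 PM
POLL_HOURS = (3, 9, 15, 21)

def first_poll_after(deleted_at):
    """Return the first check time at or after the moment a post was deleted."""
    for day_offset in (0, 1):
        day = (deleted_at + timedelta(days=day_offset)).replace(
            minute=0, second=0, microsecond=0)
        for hour in POLL_HOURS:
            poll = day.replace(hour=hour)
            if poll >= deleted_at:
                return poll
    raise AssertionError("unreachable: a check always falls within 24 hours")

# A post deleted at 10 AM is only noticed at the 3 PM check
print(first_poll_after(datetime(2012, 5, 20, 10, 0)))
```

With checks six hours apart, the recorded “deleted time” can overstate the real deletion time by up to six hours, and by more whenever a check is skipped.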
| time created | time deleted | hours (h:mm:ss) | message | English translation |
|---|---|---|---|---|
| 2011-12-29 00:30:41 | 2012-05-18 18:22:25 | 3401:51:45 | “如果明年欧美名校在三四月份一起召开家长会的话，那么中国的十八大就很可能开不了了。” | “If the top universities in Europe and the U.S. hold their parent-teacher conference next March or April, then China will not be able to hold its 18th Party Congress then.” |
| 2011-12-17 20:52:01 | 2012-04-27 20:40:12 | 3167:48:11 | “【媿尔公侯高窃位，怆然世事急抢滩】国际盲人日当天@张海迪 通过私信@我是闻正兵 公布了她之前为光诚的努力。当年我和袁伟静嫂子向她求助时抱了很大期望，但坦率说从未得到她哪怕一个电话询问或慰问，这肯定谈不上“做了应该做的一切”。如今舆论环境更好而光诚处境则更糟，难道她一点努力都不能做吗？” | “On World Blind Day, paraplegic writer Zhang Haidi told Wen Zhengbin in a private letter that she did her utmost to help Chen Guangcheng. At the time, Chen Guangcheng’s wife Yuan Weijing and I asked her for help, but she didn’t even call us or ask us how we were doing. She didn’t do everything she should have. Today, Chen Guangcheng’s situation is even worse while there is greater openness for debate. Shouldn’t she have done more?” |
| 2011-11-26 16:13:26 | 2012-03-28 07:55:28 | 2943:42:02 | RT: “演藝界人士周星馳表支持唐英年參選下屆行政長官，欣賞他的處事，為人豁達開通。他又說，唐英年一點也不蠢，是有智慧的人，自己亦不會與蠢的人做朋友。對於唐英年有感情缺失，會否流失支持，周星馳認為並無關係，因為現時是選特首，不是選男朋友。 http://t.cn/SUjFh1” | “Hong Kong actor Stephen Chow said he supports Henry Tang’s bid to be the next Chief Executive of Hong Kong. Chow admires Tang’s way of doing things and open-mindedness. Chow added that Tang is not stupid, but a smart person. Chow says that he wouldn’t be friends with stupid people. Regarding whether Tang’s infidelities will cause him to lose support, Chow says that it shouldn’t matter because people are voting for the Chief Executive, not choosing a boyfriend.” |
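The elapsed times in the table above are just the difference between the two timestamps, written as hours, minutes and seconds. A minimal Python sketch of that arithmetic follows; the helper name `elapsed_hms` is hypothetical, and it assumes both timestamps are in the same time zone.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

def elapsed_hms(created, deleted):
    """Return the time between two timestamps as 'hours:mm:ss'."""
    delta = datetime.strptime(deleted, FMT) - datetime.strptime(created, FMT)
    total_seconds = int(delta.total_seconds())
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"

# Second row of the table above
print(elapsed_hms("2011-12-17 20:52:01", "2012-04-27 20:40:12"))  # 3167:48:11
```

Dividing the hour count by 24 turns it into days: 3,167 hours is roughly 132 days, or a little over four months.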
I’m not sure why it took so long to delete these posts. Cedric Sam points out that the posts may have been deleted much earlier and simply not turned up in the Weibo Scope Search database until several months later. The researchers at the University of Hong Kong’s Journalism and Media Studies Centre are constantly adding new Sina Weibo accounts to their list. Or they may have only recently turned on the deletion marking system in the Weibo Scope Search, so that it caught some censored posts that hadn’t been caught before.
To be sure, there is no way to tell whether some of the posts were deleted by the users themselves instead of by “monitoring editors.” Sina’s API returns two types of error messages: “Weibo does not exist” and “Permission denied.” We assume that when a post is deleted by the user, the “Weibo does not exist” error message comes up, and that when a post is censored, the “Permission denied” error message comes up. Weibo Scope Search keeps track of all the deleted posts that have the “Permission denied” error message.
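That assumption can be written down as a simple rule. The Python sketch below is only an illustration: the two strings are the messages as quoted above, the real API may word or encode them differently, and the function name and labels are made up.

```python
def classify_deletion(error_message):
    """Guess why a post is gone, based on the error message quoted above.

    The mapping is an assumption, not something Sina documents here:
    'Permission denied' is read as censorship and 'Weibo does not exist'
    as deletion by the user.
    """
    if error_message == "Permission denied":
        return "censored"
    if error_message == "Weibo does not exist":
        return "deleted_by_user"
    return "unknown"

for message in ("Permission denied", "Weibo does not exist", "Timeout"):
    print(message, "->", classify_deletion(message))
```

Anything that matches neither message is left as “unknown” rather than being forced into one of the two groups.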
If I had more time (and knew how to code), I would have liked to analyze more of the data that Weibo Scope Search collected. Among the things I would have liked to explore are:
- Geographic distribution of deleted messages on Sina Weibo - The Carnegie Mellon paper also looked at the geographic distribution of censored Sina Weibo messages and found that messages originating from Tibet, Qinghai and Ningxia are deleted at a higher rate. Weibo Scope Search also has data on the city and province that each message originated from. However, I didn’t have enough time to figure out how to convert Sina’s city and province data into a format that could be plotted on a map.
- Relationships between the most censored Sina Weibo accounts - Using Weibo Scope Search, we’re able to rank the 3,524 users whose Sina Weibo messages have been deleted, from most deletions to fewest. One thing I’d be interested in exploring is how many followers these Sina Weibo accounts have and whether they follow each other. It’s not clear to me whether the censors have compiled a list of influential Sina Weibo accounts and are tracking them daily, or whether they are using keyword searches to figure out what to censor.
- A deeper analysis of the most censored Chinese words on Sina Weibo - Several weeks ago, I made a word cloud of the most censored Chinese words on Sina Weibo to see what came up. By far, the most censored words were the Chinese words for “retweet,” followed by “ha ha” or some variation. It makes sense, but it’s not very helpful. Given more time, I would have liked to dig a little deeper to see if any words or code words consistently came up after filtering out the “retweets,” “ha ha,” and other stop words.
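Two of these ideas, ranking accounts by number of deletions and counting words after dropping stop words, come down to simple tallies. Here is a hedged Python sketch of both. The `user_id` field name is an assumption about how the data is laid out, the messages are assumed to be already split into words (Chinese text needs a separate word-segmentation step that is not shown), and the stop-word list is a two-word placeholder.

```python
from collections import Counter

def rank_users(deleted_posts):
    """Return (user_id, deletion_count) pairs, most-deleted first."""
    return Counter(post["user_id"] for post in deleted_posts).most_common()

def top_words(tokenized_messages, stop_words, n=10):
    """Return the n most frequent words that are not stop words."""
    counts = Counter(
        word
        for tokens in tokenized_messages
        for word in tokens
        if word not in stop_words
    )
    return counts.most_common(n)

# Placeholder stop words: "retweet" and "ha ha"
STOP_WORDS = {"转发", "哈哈"}

posts = [{"user_id": "a"}, {"user_id": "b"}, {"user_id": "a"}]
print(rank_users(posts))
print(top_words([["转发", "x"], ["哈哈", "x", "y"]], STOP_WORDS, n=1))
```

A real stop-word list would need to be much longer, and the results depend heavily on how the text is segmented.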
How to Analyze WeiboScope Search Data
King-Wa Fu and Cedric Sam at the University of Hong Kong’s Journalism and Media Studies Centre have built a WeiboScope Search that sends all of the deleted Weibo posts to a server in Hong Kong and stores them. However, the data is in JSON format, which looks like this:
To make sense of the data collected, we first need to clean it up. I used Google Refine to do this as follows:
- 1) Download + install Google Refine
- 2) Click on “Create Project”
- 3) Click on “Web Addresses (URLs)”
- 4) Insert link http://research.jmsc.hku.hk/social/sinaweibo/lastpermissiondenied.all.json
- 5) Click “Next”
- 6) Highlight the fields you’re interested in and left-click. Google Refine should automatically put all the fields into columns.
- 7) Click “Create Project”
- 8) Click “Export”
- 9) Click “Excel”
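For anyone who would rather script this step than click through it, the same clean-up can be roughly approximated in Python. This is a sketch under assumptions, not a description of the actual file: it assumes the JSON is a list of flat records, and the field names `created_at`, `deleted` and `text` are guesses based on the column names used below. It writes CSV rather than Excel, which spreadsheet and charting tools can also open.

```python
import csv
import io
import json

# Hypothetical field names; check them against the real JSON before use
FIELDS = ["created_at", "deleted", "text"]

def records_to_csv(json_text, fields=FIELDS):
    """Turn a JSON list of records into CSV text with one column per field."""
    records = json.loads(json_text)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        # Missing fields become empty cells instead of raising an error
        writer.writerow({field: record.get(field, "") for field in fields})
    return out.getvalue()

sample = json.dumps([
    {"created_at": "2012-05-20 08:00:00",
     "deleted": "2012-05-20 15:00:00",
     "text": "hello"},
])
print(records_to_csv(sample))
```

If the real file nests its records under another key, `json.loads` will return a dictionary rather than a list and the loop will need to be pointed at the right level.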
Now that we have the data formatted, we want to make sense of it.
- 1) Download + install Tableau
- 2) Click on “Open Data”
- 3) Under “Connect to Data: In a file”, click on “Microsoft Excel”
- 4) When the “Excel Workbook Connection” window pops up, click “Ok”
- 5) Change the format of the “created at” and “deleted” columns from “text” to “Date & time” by right-clicking each column and selecting “Change Data Type,” then “Date & time.”
- 6) Go to the Dimensions box and drag the “deleted” data set to “Columns”
- 7) Go to the Measures box and drag the “Number of Records” data set to “Rows”
- 8) In the “Show Me” box, select the type of graph you want. Voila!
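The chart these steps produce is essentially a count of deleted messages per day. As a rough stand-in for the Tableau steps, the Python sketch below computes those daily counts from the “deleted” timestamps, assuming they are strings in `YYYY-MM-DD HH:MM:SS` form; the function name `deletions_per_day` is made up for illustration.

```python
from collections import Counter
from datetime import datetime

def deletions_per_day(deleted_timestamps):
    """Return (date, count) pairs in date order from timestamp strings."""
    counts = Counter(
        datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S").date().isoformat()
        for stamp in deleted_timestamps
    )
    return sorted(counts.items())

sample = [
    "2012-03-15 10:00:00",
    "2012-03-15 18:30:00",
    "2012-03-16 09:15:00",
]
for day, count in deletions_per_day(sample):
    print(day, count)
```

Days with no deletions do not appear in the output, so a plotting tool would need to fill those gaps with zeros to draw a continuous timeline.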