Is it cancer?


Cyberchondria refers to unfounded health concerns perpetuated by medical information found online. WebMD is a popular website and often a top search result for people seeking to self-diagnose conditions and symptoms. Its tendency to heighten concern about potential conditions and exaggerate the seriousness of symptoms has made it the butt of many jokes. In particular, articles online have pointed out how easy it is to arrive at a cancer diagnosis on the website.

We cannot determine the validity of the entire WebMD site by fact-checking the answers given by each page, but we can perhaps answer this question – given a symptom, how far away is a person from a diagnosis of cancer on WebMD?

So here is an experiment that attempts to use the physical properties (text and links) of a website to determine its message. The goal is to investigate the structure and content of webmd.com in order to determine whether, and how much, it perpetuates the diagnosis of cancer.

The site is a big nest of links, so the scope is limited to the A-Z common topics page. This section lists 482 health-related topic pages, from Acid Reflux to Zoster (Herpes) Virus. The content examined is further limited to the main article for each condition.

The experiment looks at each page’s center content section for 2 things: cancer-related words (a limited list I found on the internet), and all the outlinks from that section of the page. It keeps following links from page to page until it arrives at a page with cancer keywords, a page with no links, or a page that is outside of WebMD.
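The stopping rules above can be sketched roughly as follows. This is only an illustration, not the scraper I actually ran: the keyword list is a short made-up stand-in, `fetch` is a hypothetical function that returns the main-article text and outlinks for a URL, and the label names are my own. Following links breadth-first is also an assumption; it has the side effect that `depth` records how many clicks each page is from a starting topic page.

```python
import re
from collections import deque
from urllib.parse import urlparse

# Assumption: a tiny stand-in for the keyword list, which is not reproduced here.
CANCER_WORDS = {"cancer", "tumor", "carcinoma", "lymphoma", "leukemia", "melanoma"}


def is_internal(url, site="webmd.com"):
    """True if the URL's host is the site itself or one of its subdomains."""
    host = urlparse(url).netloc.lower()
    return host == site or host.endswith("." + site)


def crawl(seeds, fetch, site="webmd.com"):
    """Breadth-first crawl from the seed pages.

    fetch(url) is a hypothetical callable returning (main_text, outlinks).
    Returns (labels, depth, edges): a label per visited page, the number of
    clicks from a seed to each discovered page, and the (source, target) links.
    """
    labels, depth, edges = {}, {}, []
    queue = deque()
    for seed in seeds:
        if seed not in depth:
            depth[seed] = 0
            queue.append(seed)
    while queue:
        url = queue.popleft()
        if not is_internal(url, site):
            labels[url] = "external"   # stop: the page is outside the site
            continue
        text, links = fetch(url)
        words = set(re.findall(r"[a-z]+", text.lower()))
        if words & CANCER_WORDS:
            labels[url] = "cancer"     # stop: cancer keyword in the main content
            continue
        if not links:
            labels[url] = "dead_end"   # stop: nothing left to follow
            continue
        labels[url] = "no_cancer"
        for target in links:
            edges.append((url, target))
            if target not in depth:    # queue each page only once
                depth[target] = depth[url] + 1
                queue.append(target)
    return labels, depth, edges
```

Passing `fetch` in as an argument keeps the stopping logic separate from the HTML parsing, so the rules can be tried on a small made-up set of pages before touching the real site.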

Using this method, the simple web scraper picked up 9714 web pages. Of these,

  • 7976 pages do not have cancer-related keywords on them.
  • 726 pages are cancer-related conditions, because keywords were found in the main content.
  • 1012 information pages had either no outlinks (the liver page, for example) or outlinks that redirected to a sponsored page.

From the resulting pages I made a directed network graph with a force-directed layout, where each page is a node and each link between pages is an edge. It is a rat’s nest. The cancer-related pages are colored in red. It is not immediately noticeable which categories of pages have more prominence. However, it is clear that there are central nodes in the network to which almost every page eventually leads.

[Figure: force-directed network graph of the scraped pages, with cancer-related pages in red]

I calculated PageRank for each page (node) to determine its prominence.

PageRank, the best-known part of Google’s search algorithm, measures the relative importance of a page based on the links pointing to it; it is one of the algorithms that determine the order of search results. Below are the top 1000 pages by PageRank, in descending order. We can see that pages with cancer do not have the highest scores, and are distributed throughout the ranking.

[Figure: the top 1000 pages by PageRank, in descending order]
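For readers curious about how PageRank scores come out of a list of links, here is a minimal sketch using plain power iteration. It is not necessarily how the scores above were computed: the damping factor of 0.85 and the fixed number of iterations are conventional defaults I am assuming, and `edges` is a hypothetical list of (source, target) page pairs.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Approximate PageRank by power iteration over a list of (source, target) links."""
    nodes = sorted({node for edge in edges for node in edge})
    outlinks = {node: [] for node in nodes}
    for source, target in edges:
        outlinks[source].append(target)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}  # start from a uniform distribution
    for _ in range(iterations):
        # Rank held by pages without outlinks is spread evenly over all pages.
        dangling = sum(rank[node] for node in nodes if not outlinks[node])
        new_rank = {node: (1 - damping) / n + damping * dangling / n for node in nodes}
        # Every page passes an equal share of its rank along each of its outlinks.
        for source in nodes:
            for target in outlinks[source]:
                new_rank[target] += damping * rank[source] / len(outlinks[source])
        rank = new_rank
    return rank
```

A ranking like the one shown here would then be `sorted(rank, key=rank.get, reverse=True)[:1000]`.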

Unfortunately, this is a much more complicated project than I expected, so I can only tell you that, given what I have seen of the network, cancer-related pages do not behave differently from, or hold prominence over, other topic pages. It is not clear, however, whether the site’s range of conditions covers cancer-related topics proportionally more than it should. Nor is it clear, when a cancer diagnosis does occur, how much of it is driven by the behavior of the medical advice seeker, who may tend to follow the path toward the worst-case scenario.

If WebMD is not about diagnosing cancer, then where is any given WebMD query most likely to lead? A few pages stood out far from the rest, with significantly higher centrality and PageRank. These pages focus on 2 things: policy and medicine.

The page to which every page eventually leads is, as expected, the disclaimer, which states that WebMD’s contents “are for informational purposes only. The Content is not intended to be a substitute for professional medical advice, diagnosis, or treatment…”

An equally prominent page is a tool to identify medication. The drug index comes in 3rd, but has the most user input on the website, with its thousands of reviews of specific drugs.

The next most prominent pages serve similar purposes: the privacy policy and the conditions of use.
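One simple way to see why site-wide pages such as the disclaimer dominate is to count how many distinct pages link to each page. The sketch below computes in-degree centrality from a hypothetical list of (source, target) links; which centrality measure to use is an assumption on my part, and this is only one of several possible choices.

```python
from collections import Counter


def in_degree_centrality(edges):
    """Fraction of the other pages that link to each page."""
    nodes = {node for edge in edges for node in edge}
    # Count each (source, target) pair once and ignore links from a page to itself.
    incoming = Counter(target for source, target in set(edges) if source != target)
    denominator = max(len(nodes) - 1, 1)
    return {node: incoming[node] / denominator for node in nodes}
```

A page linked from the footer of every article would score close to 1.0 here, whatever its topic.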

… to be continued

One thought on “Is it cancer?”

  1. This is interesting, but I don’t know what the story itself is. (Or maybe you didn’t intend this to be a story, in which case disregard what’s below.)

    How you got the data is impressive to non-tech folks, but if the reader can’t tell WHAT the point is, does the technical scraping process matter?

    Perhaps the “how I did this” could be a footnote to a short, succinct statement that a general audience could understand. Journalists usually write for a general audience, but I’d argue that your average reader wouldn’t be able to tell you what those charts say.

    Maybe the headline could have been: Nope, not all WebMD roads lead to cancer. (I’m not sure that’s what the headline would be, but that’s just to give you an idea.)
