An Analysis of hNews Usage

On Wednesday, Martin Moore, one of the creators of the hNews microformat, posted on Idea Lab about how hNews is in use at 577 U.S. news sites. As someone who has long been interested in standardized markup and interchange formats for news content, I found this interesting and set out to investigate for myself.

Moore’s post contains a link to a Google document containing a list of the 577 sites. I downloaded this spreadsheet as a CSV for easy consumption and analysis, and set to work.

Duplicates/Sub-sites

The first thing I noticed is that 25 entries have a domain that already appears earlier in the list. For example, ‘http://www.picayune-times.com/’ is listed twice, in rows 34 and 43. There are also sub-sites listed, such as ‘http://www.tcpalm.com/’ (row 119) and ‘http://www.tcpalm.com/news/sebastian-sun/’ (row 121). If we remove all of these redundancies, we’re down to 552 sites.
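A minimal sketch of how such a redundancy check could look, assuming each URL is compared by hostname only. The function name `find_redundant` and the sample list are made up for illustration; this is not the script linked in the appendix.

```python
from urllib.parse import urlparse


def find_redundant(urls):
    """Split URLs into (unique, redundant) by hostname.

    An entry counts as redundant when its hostname has already
    appeared earlier in the list, which covers both exact
    duplicates and sub-sites on the same domain.
    """
    seen = set()
    unique, redundant = [], []
    for url in urls:
        host = urlparse(url.strip()).netloc.lower()
        if host in seen:
            redundant.append(url)
        else:
            seen.add(host)
            unique.append(url)
    return unique, redundant


# Illustrative sample built from the examples mentioned in the text.
sample = [
    "http://www.picayune-times.com/",
    "http://www.tcpalm.com/",
    "http://www.picayune-times.com/",
    "http://www.tcpalm.com/news/sebastian-sun/",
]
unique, redundant = find_redundant(sample)
print(len(unique), "unique;", len(redundant), "redundant")
```

Comparing by hostname treats ‘www.example.com’ and ‘example.com’ as different sites, so a real run might want to normalize that as well.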

CMSs

Then I started visiting sites in the list at random, and noticed that many of them looked remarkably similar. It soon became apparent that there were a number of large chains sharing the same CMS and almost identical templates.

I came up with simple heuristics for determining which CMS powered a site (e.g., by looking for references to the CMS company’s domain in a CSS stylesheet or JavaScript file link). I wrote a small Python program to go through all the URLs, check the HTML source of each page against my heuristics, and identify the CMS.
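The heuristic described above can be sketched roughly as follows. The marker strings in `CMS_MARKERS` are assumptions chosen for illustration (the real heuristics live in the script linked in the appendix), and `identify_cms` is a hypothetical helper name.

```python
from html.parser import HTMLParser

# Assumed marker substrings, for illustration only. The actual
# heuristics used for the analysis may differ.
CMS_MARKERS = {
    "TownNews": "townnews.com",
    "MatchBin": "matchbin.com",
    "Zope4Media": "zope.com",
}


class AssetCollector(HTMLParser):
    """Collect the URLs of stylesheets and scripts referenced by a page."""

    def __init__(self):
        super().__init__()
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("href"):
            self.assets.append(attrs["href"])
        elif tag == "script" and attrs.get("src"):
            self.assets.append(attrs["src"])


def identify_cms(html):
    """Return the first CMS whose marker appears in an asset URL, else 'other'."""
    collector = AssetCollector()
    collector.feed(html)
    for url in collector.assets:
        for cms, marker in CMS_MARKERS.items():
            if marker in url.lower():
                return cms
    return "other"


page = '<html><head><link rel="stylesheet" href="http://static.townnews.com/site.css"></head></html>'
print(identify_cms(page))
```

Fetching each homepage (for example with `urllib.request.urlopen`) and passing the decoded body to `identify_cms` would turn this into a batch run over the CSV.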

Results

The results (see appendix for full document):

170 sites using TownNews products (e.g., BLOX)
101 sites using MatchBin (many MediaNewsGroup papers)
213 sites using Zope.com/Zope4Media (many from the Community Newspaper Holdings chain)
21 sites using Ellington (Scripps chain)
2 sites using MediaNewsGroup (rest of MNG’s papers still using MatchBin, see above)
24 sites from the Rush Publishing chain (I couldn’t find a corporate site, but you can see a list of papers here)
17 sites from the Swift Communications chain
4 sites from the Freedom Communications chain
5 sites built using Microsoft FrontPage (seriously? in 2010?)
20 other sites with either custom or obscure CMSs

I’m not claiming that my script’s results are perfect, but they are at least indicative. I draw two conclusions here:

The majority of the 577 claimed sites are from a small set of CMSs. Notably absent are open-source CMSs. If you only count once for each CMS that has implemented hNews, you have the 8 listed above, plus perhaps a few more from the ‘other’ list.
The list does not contain the full client list for each of the chains represented, and thus there are far more sites that support hNews which just aren’t listed there.

Compliance

I had intended to algorithmically verify that all of these sites were actually hNews-compliant.

The microtron library has simple hNews extraction/validation which I was able to leverage, but I needed a way to find an individual article page on each site.

My first instinct was to use the RSS feed of each site to extract a recent article link. However, I discovered (and tweeted in despair) that almost none of the sites have a <link> tag in their homepage <head> to support RSS auto-discovery. A few had RSS feeds on their “News” section pages, but it became clear that automagically trying to find RSS feeds was doomed to fail. You would think that a site that implements a microformat standard would also implement the ubiquitous standard of RSS or Atom…
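For reference, feed auto-discovery relies on a <link rel="alternate"> element whose type is an RSS or Atom media type. A minimal sketch of checking for it, assuming the homepage HTML is already in hand (`discover_feeds` is a hypothetical helper, and for simplicity it scans the whole document rather than only the <head>):

```python
from html.parser import HTMLParser

# Media types conventionally used for feed auto-discovery.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        media_type = (attrs.get("type") or "").lower().strip()
        if "alternate" in rel and media_type in FEED_TYPES and attrs.get("href"):
            self.feeds.append(attrs["href"])


def discover_feeds(html):
    """Return the feed URLs advertised by the page, or an empty list."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


with_feed = '<head><link rel="alternate" type="application/rss+xml" href="/rss.xml"></head>'
without_feed = '<head><link rel="stylesheet" href="/site.css"></head>'
print(discover_feeds(with_feed), discover_feeds(without_feed))
```

An empty list for a homepage is the failure case described above: nothing to follow to a recent article.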

Instead, I manually verified compliance with examples of each of the CMSs discovered, and was glad to see that they all passed. I apologize for the scare quotes in one of my tweets; I did not mean to imply that Mr. Moore or anyone else was being disingenuous.

Final Thoughts

I am quite happy to see that hNews is being adopted more broadly, and I hope to see that trend continue. As a structured data fan, I am glad that so many articles out there are easily parseable.

However, I think it’s a bit unfair for the hNews proponents to use individual sites as the primary metric for “reach” of the microformat. Instead, I would prefer that they look at the CMSs supported, and continue evangelism in an effort to convince the other newspaper chains out there to adopt hNews in their own systems.

Appendix

My annotated Google Spreadsheet with CMS breakdown and domain uniqueness.

My Python script (‘hnews.csv’ is Moore’s original spreadsheet exported as a CSV, and ‘hnews_rev.csv’ was the basis for my own spreadsheet above).

Max Cutler