An Analysis of hNews Usage

On Wednesday, Martin Moore, one of the creators of the hNews microformat, posted on Idea Lab about how hNews is in use at 577 U.S. news sites. As someone who has long been interested in standardized markup and interchange formats for news content, I found this interesting and set out to investigate for myself.

Moore’s post contains a link to a Google document containing a list of the 577 sites. I downloaded this spreadsheet as a CSV for easy consumption and analysis, and set to work.

Duplicates/Sub-sites

The first thing I noticed is that 25 entries have a domain that already appears earlier in the list. For example, ‘http://www.picayune-times.com/’ is listed twice, in rows 34 and 43. There are also sub-sites listed separately from their parent site, such as ‘http://www.tcpalm.com/’ (row 119) and ‘http://www.tcpalm.com/news/sebastian-sun/’ (row 121). If we remove all of these redundancies, we’re down to 552 sites.
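To make the redundancy check concrete, here is a minimal sketch of how it can be done by comparing hostnames. This is an illustration, not the script I actually ran; the function name `find_redundant` and the four-row sample list are made up for the example, and the row numbers it prints refer to the sample, not to the real spreadsheet.

```python
from urllib.parse import urlparse


def find_redundant(urls):
    """Return (row, url) pairs whose hostname already appeared earlier.

    Rows are numbered from 1. Sub-sites count as redundant because only
    the hostname is compared, not the path.
    """
    seen = set()
    redundant = []
    for row, url in enumerate(urls, start=1):
        host = urlparse(url).netloc.lower()
        if host in seen:
            redundant.append((row, url))
        else:
            seen.add(host)
    return redundant


# Hypothetical sample built from the two examples mentioned above.
sample = [
    "http://www.picayune-times.com/",
    "http://www.tcpalm.com/",
    "http://www.picayune-times.com/",
    "http://www.tcpalm.com/news/sebastian-sun/",
]
redundant = find_redundant(sample)
for row, url in redundant:
    print(row, url)
```

Comparing only the hostname is a deliberate simplification: it treats every sub-site as a duplicate of its parent, which matches how I counted here but would be wrong for a list where unrelated papers share one domain.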

CMSs

Then I started visiting sites in the list at random, and noticed that many of them looked remarkably similar. It soon became apparent that there were a number of large chains sharing the same CMS and almost identical templates.

I came up with simple heuristics for determining which CMS powered a site (e.g., by looking for references to the CMS company’s domain in a CSS stylesheet or JavaScript file link). I wrote a small Python program to go through all the URLs, check the HTML source of each page against my heuristics, and identify the CMS.
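The heuristic amounts to a substring search over the page source. The sketch below shows the idea only; the CMS names and marker strings in `CMS_MARKERS` are placeholders I invented for illustration (they are not the vendors or patterns from my real script), and fetching the page is left out so the example stays self-contained.

```python
# Hypothetical marker table: CMS name -> strings that give it away when
# they appear in a stylesheet or script URL. All entries are placeholders.
CMS_MARKERS = {
    "ExampleCMS": ["static.examplecms.invalid"],
    "OtherCMS": ["cdn.othercms.invalid", "othercms-core.js"],
}


def identify_cms(html, markers=CMS_MARKERS):
    """Return the first CMS whose marker appears in the HTML, else 'other'."""
    lowered = html.lower()
    for cms, needles in markers.items():
        if any(needle.lower() in lowered for needle in needles):
            return cms
    return "other"


page = """
<html><head>
<link rel="stylesheet" href="http://static.examplecms.invalid/css/site.css">
</head><body>...</body></html>
"""
print(identify_cms(page))
print(identify_cms("<html><head></head><body></body></html>"))
```

A check this crude will misfire if a site merely links to a vendor's domain for some other reason, which is one reason the results should be read as indicative rather than exact.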

Results

The results (see appendix for full document):

I’m not claiming that my script’s results are perfect, but they are at least indicative. I draw two conclusions here:

  1. The majority of the 577 claimed sites are from a small set of CMSs. Notably absent are open-source CMSs. If you count each CMS that has implemented hNews only once, you have the 8 listed above, plus perhaps a few more from the ‘other’ list.
  2. The list does not contain the full client list for each of the chains represented, and thus there are far more sites that support hNews which just aren’t listed there.

Compliance

I had intended to algorithmically verify that all of these sites were actually hNews-compliant.

The microtron library has simple hNews extraction/validation which I was able to leverage, but I needed a way to find an individual article page on each site.

My first instinct was to use the RSS feed of each site to extract a recent article link. However, I discovered (and tweeted in despair) that almost none of the sites have a <link> tag in their homepage <head> to support RSS auto-discovery. A few had RSS feeds on their “News” section pages, but it became clear that automagically trying to find RSS feeds was doomed to fail. You would think that a site that implements a microformat standard would also implement the ubiquitous standard of RSS or Atom…
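For reference, feed auto-discovery just means scanning the page for a <link> tag with rel="alternate" and an RSS or Atom MIME type. Here is a rough sketch using only Python's standard library; the class and function names are my own for this example, and for simplicity it looks at every <link> tag in the document rather than only those inside <head>.

```python
from html.parser import HTMLParser

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect href values of <link rel="alternate"> tags that point at feeds."""

    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None (e.g. <link rel>), so normalize them.
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            self.feeds.append(attr["href"])


def discover_feeds(html):
    """Return a list of feed URLs advertised by the page (possibly empty)."""
    finder = FeedLinkFinder()
    finder.feed(html)
    return finder.feeds


with_feed = (
    '<html><head><link rel="alternate" type="application/rss+xml" '
    'title="News" href="/feed.xml" /></head><body></body></html>'
)
without_feed = '<html><head><link rel="stylesheet" href="/s.css"></head></html>'
print(discover_feeds(with_feed))
print(discover_feeds(without_feed))
```

The returned href may be relative, so a real crawler would still need to resolve it against the page URL before fetching the feed.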

Instead, I manually verified compliance with examples of each of the CMSs discovered, and was glad to see that they all passed. I apologize for the scare quotes in one of my tweets; I did not mean to imply that Mr. Moore or anyone else was being disingenuous.

Final Thoughts

I am quite happy to see that hNews is being adopted more broadly, and I hope to see that trend continue. As a structured data fan, I am glad that so many articles out there are easily parseable.

However, I think it’s a bit unfair for the hNews proponents to use individual sites as the primary metric for “reach” of the microformat. Instead, I would prefer that they look at the CMSs supported, and continue evangelism in an effort to convince the other newspaper chains out there to adopt hNews in their own systems.

Appendix

My annotated Google Spreadsheet with CMS breakdown and domain uniqueness.

My Python script (‘hnews.csv’ is Moore’s original spreadsheet exported as a CSV, and ‘hnews_rev.csv’ was the basis for my own spreadsheet above).


7 Responses to An Analysis of hNews Usage
  1. Megan Taylor

    Wow, Max, thanks for doing all this analysis. The thing that makes me sad is that so many sites are using such crappy CMSs.

  2. Martin Moore

    Hi Max,

    On the one hand I think it’s great that you’ve taken the time to do a detailed analysis of the CSV file I posted on Google Docs – and I’ll go through and check the 25 you’ve cited.

    On the other hand, I’d have loved it if you’d helped us find the other 300-400 sites I understand are also publishing news with hNews (which would increase the number of sites closer to 1,000).

    I take your point about websites being a relatively crude measure but up till now we hadn’t even had that. It was because of our frustration that we didn’t have any numbers at all that we built a simple scraper to find as many as we could.

    If you’d like you could do the same and try and track down a bunch more.

    Re hNews in open source CMSs. We developed an hNews plugin for WordPress, and I understand Tech and Law blog has also worked on one for blogger, so it’s not hard to integrate. It’s more a question of telling people about it and getting them to do it (and we’re doing our best to do that). Re stats – it’s also hard to work out who is publishing hNews in WordPress or Blogger (all suggestions welcome).

    Which is why we should talk. If you’re into structured data – like us – then you’ll want to help us promote hNews (and other structured data). I’ve left my email with this comment (unpublished) so drop me a note.

    Best wishes, Martin

  3. Max

    Martin, thanks for reading.

    What prompted my investigation was the feeling that “577” was such an exact number, and when I found that it was inaccurate, it got me wondering what else could be wrong. Hence my desire for a better success metric.

    I did intend to augment your list by scraping the entire client list of the newspaper chains I found using it, but I simply ran out of time. It’s on my to-do list, and when I get around to it I’ll be sure to pass along my results to you.

    Regarding OSS CMSs, I did see your WP plugin when doing my research. You should contact the WordPress.com folks, as they have a huge client base that they could push it on, and many VIP clients who would make nice high profile hNews examples. If you need help contacting someone there, let me know and I can connect you.

    I’ll drop you an email to talk more about structured data privately. Thanks,

    Max

  4. [...] I’m grateful to Max Cutler for spotting a number of duplicate entries in the original list which h... martinjemoore.com/us_sites_publishing_hnews
  5. Todd Martin

    Fantastic analysis. To Martin’s point…find more as there are more out there. The Associated Press continues to work with content management vendors to drive adoption. The negative slant with regard to the vendors who have adopted and are actively supporting hNews merits a much more objective assessment as these vendors represent a large percentage of the US domestic news sites.

    Cheers –

  6. Miles

    “…that almost none of the sites have a <link> tag in their homepage <head> to support RSS auto-discovery.”

    Just a quick note–your blog front page doesn’t either, Max.

  7. Max

    Right you are, Miles. There was a bug in this theme, and I hadn’t updated it since January. Should be fixed now.

    In my defense, there is a “feed” link on every page; it just wasn’t adding the auto-discovery meta tag. I’m also not a news organization that makes money by attracting readers and serving their interests.
