July | 2013 | el ñasco a la barrapan

I’ve just learned about next.data.gov, and at first glance it looks much more usable than the well-known data.gov version. This CKAN-based deployment made me wonder about the future of the OGPL, but I digress…

When getting to the data catalog, I was greeted with this message at the top of the page:

The message said that data.gov is now hosting 75,712 datasets. I followed the link to the site’s homepage and found this:

So apparently, the figure was not the right one as the number of datasets seems to be 152,977. So I followed the link to the catalog and got this:

Hmmm… I’m confused.

Since the new website announcement was part of the fourth anniversary announcements, I recalled the announcements made on previous anniversaries. For example, as part of the third anniversary announcement, we could read: «Growing from 47 datasets in 2009 to nearly 450,000 datasets today…»

I’m even more confused. The progress and growth of data.gov has been significant. The number of agencies publishing datasets (174 at the time of writing) has grown over the last four years, and yet, in the best-case scenario, what I’m seeing in the catalog is roughly one third of the datasets that were there one year ago? I haven’t found the time to look in depth just yet, but I’m pretty sure that’s not the case. It is more likely a usability issue on one hand and different ways of counting datasets over time on the other.
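To put rough numbers on that comparison, here is a minimal Python sketch using only the figures quoted in this post. The third anniversary figure was stated as "nearly 450,000", so the 450,000 used here is an approximation, and the labels and the `share_of` helper are names made up for illustration.

```python
# Dataset counts quoted in this post (labels are illustrative, not official names).
counts = {
    "catalog banner": 75_712,
    "homepage": 152_977,
}

# "Nearly 450,000" in the third anniversary announcement; treated as approximate.
third_anniversary = 450_000


def share_of(count, reference):
    """Return count as a percentage of reference, rounded to one decimal place."""
    return round(100 * count / reference, 1)


shares = {source: share_of(n, third_anniversary) for source, n in counts.items()}

for source, pct in shares.items():
    print(f"{source}: {pct}% of last year's figure")
```

Under that assumption, the homepage figure comes out at about 34% of last year's number and the catalog banner figure at about 17%, which is where the "roughly one third in the best-case scenario" comes from.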

This shows something I have mentioned quite a few times before and that gives this blog post its title: counting datasets is bad. And, in fact, it is quite meaningless.

I understand that data catalogs need to show a total number somewhere, but the issue here is the interpretations that might be derived from it. I have heard people claim that catalog X is better than catalog Y because it publishes so many more datasets and, frankly, this is a totally questionable claim. In fact, we have yet to determine what makes an open data catalog good and why catalog X can be considered better than catalog Y.

The bottom line to me is: the number of datasets is just a simple metric that tells very little about the usefulness of an open data catalog.

We need more research to understand these issues and the impact of open data in general, even to understand whether or not an open data central point of access (a data.gov.* website) is the best way to achieve the promised benefits of open data.


Personal blog of Josema Alonso


Counting Datasets Is Bad