Enabling the Future Internet for Smart Cities

Laying the foundations for benchmarking open data automatically: a method for surveying data portals from the whole web

The spread of open data portals worldwide makes automated methods for data gathering and assessment a milestone in benchmarking. Some practitioners have already introduced automated approaches, focused mainly on quality assessment of selected lists of data portals, but little has been studied about how to ensure that benchmarking exercises cover the widest range of initiatives across the world, including incipient ones. The purpose of this paper is to provide a method for surveying data portals from the whole web, aiming to produce a whitelist of URLs that point to healthy data portals on the internet. The method was tested on 3.3 billion web addresses, from which we found 1,339 open data portals worldwide running the main software platforms on the market: CKAN, Socrata, OpenDataSoft and ArcGIS Open Data. Findings showed that the whole-web approach increased the number of data portals found and offered a workaround for the redundancy, discoverability and traceability issues of current repositories, which are sparse and manually maintained. This work contributes to the development of a fully automated method for building an independent, reliable and up-to-date repository that serves as a single source of open data portals operated around the world. It also provides insights about dataset estimation and geographic localization, from which benchmarking exercises may benefit by running on a larger scale, at higher frequency and at lower cost.
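As an illustration of the whitelist idea, the following is a minimal sketch and not the pipeline used in this work. It assumes that a candidate URL can be classified by probing one well-known API path per platform and checking that the response parses as JSON. The probe paths, the names `PROBES`, `detect_platform`, `build_whitelist` and `http_fetch`, and the "parses as JSON" health criterion are all assumptions made for this example.

```python
import json
import urllib.request
from typing import Callable, Dict, Iterable, Optional

# Assumed probe paths, one per platform (illustrative, not taken from the paper).
PROBES: Dict[str, str] = {
    "CKAN": "/api/3/action/package_list",
    "Socrata": "/api/views.json",
    "OpenDataSoft": "/api/datasets/1.0/search/",
    "ArcGIS Open Data": "/api/v3/datasets",
}

Fetch = Callable[[str], Optional[str]]


def http_fetch(url: str) -> Optional[str]:
    """Fetch a URL with the standard library; return None on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and timeouts are OSError subclasses; malformed URLs raise ValueError.
        return None


def detect_platform(base_url: str, fetch: Fetch) -> Optional[str]:
    """Return the first platform whose probe URL yields parseable JSON."""
    base = base_url.rstrip("/")
    for platform, path in PROBES.items():
        body = fetch(base + path)
        if body is None:
            continue
        try:
            json.loads(body)
        except ValueError:
            # Not JSON (for example an HTML error page): try the next platform.
            continue
        return platform
    return None


def build_whitelist(urls: Iterable[str], fetch: Fetch) -> Dict[str, str]:
    """Map each responsive portal URL to its detected platform.

    Trailing slashes are stripped so trivially redundant entries collapse.
    """
    whitelist: Dict[str, str] = {}
    for url in urls:
        platform = detect_platform(url, fetch)
        if platform is not None:
            whitelist[url.rstrip("/")] = platform
    return whitelist
```

Calling `build_whitelist(candidate_urls, http_fetch)` would probe live hosts. Passing the fetch function as a parameter keeps the classification logic testable offline and lets a crawler substitute its own rate-limited client.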