Guest Post: Data Sketching With the OpenCorporates API

Tony Hirst, a lecturer in the Department of Communication and Systems at The Open University, and author of the blog has been using the OpenCorporates API for some time. Here’s a review of some of his experiments.

Looking back over my datajunkie notes, I may only have been using the OpenCorporates API since March of this year (2012; we’re now in December…) but it’s become one of the richest data playgrounds for me, in part because of the far-ranging linking it affords both internally and externally, to other data sources.

Diving into the data for the first time, not even a year ago now, my first thought was to look for something structured that I could use as a warm-up exercise to familiarise myself with the API. Focusing primarily on UK companies, and looking through some of the results for some of the larger UK registered companies, I noticed that (recently registered?) trademark ownership information was available. (Specifically, the company data points to other OpenCorporates data records which include records for trademark registrations).

Included in the OpenCorporates recorded data was a unique WIPO generated identifier for each trademark. A quick websearch revealed that WIPO publishes information about trademarks at URLs that include the trademark identifier, so I could use the OpenCorporates data – trademarks registered to a particular corporate entity – to draw down additional data from WIPO about particular trademarks. For trademarks that are registered images, this included an image file, so it was a relatively simple exercise to generate a quick sketch of (at least some of) the graphical trademarks registered to a particular company.

Tesco trademarks

One thing I noticed in searching for companies on OpenCorporates is that, for the bigger companies at least, there are lots of corporate entities associated with a particular company name. OpenCorporates currently provides an in-part crowd-sourced “community groupings” feature that tries to bundle together different companies that are part of a corporate group, but as I poked around the data I noticed that director filings might provide one way of automatically grouping companies. And so I went graph hunting…

The new release of the OpenCorporates API makes it trivial to look up directors, but 6 months or so ago, all we had to hand was partially structured director filings. It was enough, though, to be able to pull out the directors associated with a particular corporate entity. And having got a list of directors by company, we could do a search around a company with many corporate entities – Tesco, for example – and map out which entities were connected to which by virtue of common director names. Directors’ data is starting to appear as such on OpenCorporates, which makes this sort of mapping easier, although now we are faced with the problem of deciding whether two director records sharing the same name are part of the same “director grouping”!

Tesco director dealings

Using network visualisation tools such as Gephi, it’s possible to easily decompose graphs such as these that show connections between companies and directors to a form that just shows co-director links (directors joined by a common company) or potential corporate groupings (companies connected by N or more common directors).
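For readers who want to try the same decomposition without a graph tool, here is a minimal Python sketch of the two projections described above. It assumes you have already pulled director names for each company into a plain dictionary; the function names, the `min_shared` parameter and the sample data are all made up for illustration and are not part of the OpenCorporates API or of Gephi.

```python
from collections import defaultdict
from itertools import combinations


def company_links(directors_by_company, min_shared=2):
    """Pairs of companies that share at least `min_shared` director names."""
    links = {}
    for a, b in combinations(sorted(directors_by_company), 2):
        shared = set(directors_by_company[a]) & set(directors_by_company[b])
        if len(shared) >= min_shared:
            links[(a, b)] = shared
    return links


def codirector_links(directors_by_company):
    """Pairs of directors joined by at least one common company."""
    links = defaultdict(set)
    for company, directors in directors_by_company.items():
        for a, b in combinations(sorted(set(directors)), 2):
            links[(a, b)].add(company)
    return dict(links)


# Made-up sample data (hypothetical companies and names)
sample = {
    "Example Stores Ltd": ["A SMITH", "B JONES", "C PATEL"],
    "Example Property Ltd": ["A SMITH", "B JONES"],
    "Example Bank Ltd": ["C PATEL", "D WONG"],
}

groupings = company_links(sample, min_shared=2)
codirectors = codirector_links(sample)
```

Note that this matches on raw name strings, so it inherits the same-name problem mentioned earlier: two different people called "A SMITH" would be treated as one director.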

Another possible link between companies was their registered address, so we could also start to explore which similarly named companies might be sharing a physical office. It’s not hard to imagine a time when OpenCorporates will associate geolocation based data with corporate entities, which makes this route to identifying pattern and structure in the data from a geographical, location based perspective a ready possibility.
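As a rough illustration of the shared-address idea, the sketch below groups companies by a crudely normalised registered address. It assumes the addresses are already available as plain strings; `normalise_address`, `group_by_address` and the sample records are hypothetical, and real address matching would need rather more care than this (abbreviations, missing postcodes and so on).

```python
import re
from collections import defaultdict


def normalise_address(address):
    """Crude normalisation: upper-case, drop punctuation, collapse whitespace."""
    cleaned = re.sub(r"[^A-Z0-9 ]", " ", address.upper())
    return " ".join(cleaned.split())


def group_by_address(companies):
    """Map each normalised address to the companies registered there,
    keeping only addresses shared by two or more companies."""
    groups = defaultdict(list)
    for name, address in companies:
        groups[normalise_address(address)].append(name)
    return {addr: names for addr, names in groups.items() if len(names) > 1}


# Made-up sample records (hypothetical companies and addresses)
records = [
    ("Example Stores Ltd", "1 High Street, Anytown, AB1 2CD"),
    ("Example Property Ltd", "1 High Street Anytown  ab1 2cd."),
    ("Other Ltd", "2 Low Road, Anytown"),
]

shared_offices = group_by_address(records)
```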

Tesco registered office locations

Revealing the implied structures that are hidden away inside the OpenCorporates database by virtue of common links between corporate entities, directors, and/or locations represents one significant form of value. But there is also much to be gained through linking the OpenCorporates data to other data sources as part of investigations that span datasets. A trivial example is a transparency supporting service that lets us quickly look up (fuzzily, it has to be said!) the directorships of local councillors. Using data from OpenlyLocal, we can pull down a list of councillor names for a particular council, and then look up those names as directors on OpenCorporates. Using open spending data, a further step might be to look up the companies that have received payments over £500 from the same council; and then look to see whether there are any matches.
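The fuzzy look-up step can be sketched with nothing more than Python's standard library. This is only one possible approach, assuming you already hold a list of councillor names and a list of director names; the 0.85 threshold, the function names and the sample names are arbitrary choices of mine, and any hit is a lead to check by hand rather than a confirmed match.

```python
from difflib import SequenceMatcher


def normalise_name(name):
    """Lower-case, strip punctuation and sort the tokens,
    so that "SMITH, Jane" and "Jane Smith" compare as equal."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(sorted(cleaned.split()))


def fuzzy_matches(councillors, directors, threshold=0.85):
    """Return (councillor, director, score) triples scoring at or above threshold."""
    matches = []
    for councillor in councillors:
        for director in directors:
            score = SequenceMatcher(
                None, normalise_name(councillor), normalise_name(director)
            ).ratio()
            if score >= threshold:
                matches.append((councillor, director, round(score, 2)))
    return matches


# Made-up sample names (hypothetical people)
councillors = ["Jane Smith", "Robert Brown"]
directors = ["SMITH, Jane", "Alan Green"]

possible = fuzzy_matches(councillors, directors)
```

The same pattern could then be repeated against the list of companies paid by the council, comparing company names rather than personal names.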

Whilst preparing for a recent presentation about open data, it struck me that OpenCorporates has the potential to be disruptive in the sense of Clayton Christensen’s “Innovator’s Dilemma”: whilst the data quality may still be lacking in certain respects, OpenCorporates is good enough to use at least as a starting point for certain company related data searches. As the corporate mapping tools evolve, curating corporate groupings (both automatically/heuristically, and via human curators) will become ever easier and ever more accurate. As the director database evolves, I’m sure techniques will emerge for “de-duping” director entities.

The library world may have tools and ideas to help in this respect, for example via the notion of “Virtual International Authority Files” (VIAF), which provide comprehensive, authoritative identifiers for known entities, or some of the competing(!?) personal identifier schemes, e.g. Open Researcher and Contributor ID (ORCID) and International Standard Name Identifier (ISNI). (To a certain extent, the aim of OpenCorporates appears to be the creation of such authority files for corporate entities globally, whatever territory they are registered in.)

An approach that I believe holds much promise is the OpenCorporates Reconciliation API. This provides a clean and efficient way of integrating look-ups to OpenCorporates with data cleansing tools such as OpenRefine. The reconciliation API provides a fuzzy match on a corporate name that returns a set of ranked “possible matches” in the OpenCorporates database and that makes it relatively easy to annotate third party datasets containing company names with OpenCorporates identifiers. This sort of tool may prove invaluable when trying to reconcile council spending data against corporate groupings.
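To give a flavour of what a reconciliation look-up involves, here is a small Python sketch that builds a query URL and ranks the candidates in a response. Several things here are assumptions on my part rather than documented facts: the endpoint address, the `query` parameter, and a response shaped like the one OpenRefine-style reconciliation services generally return (a `result` list whose items carry `id`, `name`, `score` and `match`). The sample response is invented and the identifiers in it are placeholders; check the current OpenCorporates documentation before relying on any of it.

```python
import json
from urllib.parse import urlencode

# Assumed endpoint; verify against the OpenCorporates documentation
RECONCILE_ENDPOINT = "https://opencorporates.com/reconcile"


def reconcile_url(company_name, endpoint=RECONCILE_ENDPOINT):
    """Build a single-query reconciliation URL for a company name."""
    return endpoint + "?" + urlencode({"query": company_name})


def best_candidates(response, min_score=0.0, limit=3):
    """Return up to `limit` (id, name, score) tuples, highest score first."""
    candidates = [c for c in response.get("result", []) if c.get("score", 0) >= min_score]
    candidates.sort(key=lambda c: c.get("score", 0), reverse=True)
    return [(c["id"], c["name"], c["score"]) for c in candidates[:limit]]


# Invented response in the assumed format (placeholder ids, hypothetical companies)
sample_response = json.loads("""
{"result": [
  {"id": "/companies/gb/00000002", "name": "EXAMPLE SECURITY SERVICES LTD", "score": 41.5, "match": false},
  {"id": "/companies/gb/00000001", "name": "EXAMPLE SECURITY LTD", "score": 78.0, "match": false}
]}
""")

url = reconcile_url("Example Security Ltd")
ranked = best_candidates(sample_response, min_score=50)
```

In practice OpenRefine makes these calls for you; the point of the sketch is simply that the ranked candidates, rather than a single answer, are what you get back, so a human still decides which match to accept.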

G4S spending Sankey diagram

I’m also hopeful for an appearance of a directors reconciliation service…;-)

By continuing to take an open approach to its data, providing robust linking strategies out to other identifier namespaces, in to the OpenCorporates namespace, and within OpenCorporates itself through corporate and director groupings, OpenCorporates can both add value to other services and gain value from external enrichment.
