We’ve often heard company hierarchies and networks referred to as the Holy Grail of business information. That’s not just a recognition of the value and importance of this data. It’s also that it’s really difficult to find… and to collect, and to make usable too.
What’s more, where this does exist as a dataset it’s not only expensive and tied up in restrictive proprietary licences, it’s also Black Box Data – that is, data that you just have to take on trust, without knowing where it came from or how recently it was retrieved. This is despite the fact that most of this data comes from official sources: data held by company registers, various regulatory data, or companies’ statutory filings.
This is pretty much exactly the sweet spot for open data – pulling together data from different sources, maintaining the provenance, and increasing access for the benefit of all.
So we were delighted when the Alfred P Sloan Foundation last year gave OpenCorporates a grant to build an open data version of this dataset, and since late last year we’ve been working hard on tackling this problem.
Today, we’re excited to publicly launch three things:
1. An open data corporate network platform
The most important part is a new platform for collecting, collating and allowing access to different types of corporate relationship data – subsidiary data, parent company data, and shareholding data. This means that governments around the world (and companies too) can publish corporate network data, and it can all be combined in a single open-data repository for a more complete picture. We think this is a game-changer, as it not only allows seamless, lightweight co-operation, but will also identify errors and contradictions. We’ll be blogging about the platform in more detail over the coming weeks, but it’s been a genuinely hard computer-science problem that has resulted in some really innovative work.
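To make the idea of combining sources and surfacing contradictions concrete, here is a minimal sketch in Python. It is not the platform's actual data model, which this post does not describe: the `Statement` class, its fields, the `find_contradictions` helper, and the company and source names are all assumptions made up for illustration. Each claim carries its provenance, and a pair of companies is flagged when two sources report different percentages for it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """One claim about a relationship, with its provenance attached (hypothetical model)."""
    parent: str
    child: str
    percentage: float  # share of the child held by the parent
    source: str        # where the claim came from
    retrieved: str     # ISO date the source was last checked


def find_contradictions(statements):
    """Group statements by (parent, child) and keep pairs whose sources disagree."""
    by_pair = defaultdict(list)
    for s in statements:
        by_pair[(s.parent, s.child)].append(s)
    return {
        pair: group
        for pair, group in by_pair.items()
        if len({s.percentage for s in group}) > 1
    }


# Invented statements from two invented sources.
statements = [
    Statement("Alpha Holdings", "Beta Ltd", 100.0, "register-a", "2000-06-01"),
    Statement("Alpha Holdings", "Beta Ltd", 75.0, "regulator-b", "2000-05-15"),
    Statement("Alpha Holdings", "Gamma Ltd", 100.0, "register-a", "2000-06-01"),
]

for (parent, child), group in find_contradictions(statements).items():
    claims = ", ".join(f"{s.source}: {s.percentage}%" for s in group)
    print(f"{parent} -> {child} disagrees ({claims})")
```

Because every statement keeps its source and retrieval date, a disagreement can be traced back to the records that caused it rather than silently overwritten.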
2. Three key initial datasets
To test out this platform we’ve ingested three large corporate network datasets, which combined give a much better picture than any individual one could. These three datasets were chosen because they have significant amounts of data, pose different challenges, and say subtly different things.
The shareholder data from the New Zealand company register, for example, is granular and up to date, and is available as data if you have API access. It talks about parental control, often at a very granular level, and importing this data allows you to see not just a company’s shareholders (which you can also see on the NZ Companies Office pages) but also which companies are owned by another company (which you can’t). And it’s throwing up some interesting examples, of which more in a later blog post.
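Seeing which companies another company owns amounts to reading the shareholder records in the opposite direction. The sketch below shows that inversion in Python; the record layout, the `holdings_by_owner` function, and every company name are hypothetical, and it does not reflect the register's real API or the platform's code.

```python
from collections import defaultdict

# Hypothetical records: (company, shareholder, percentage held).
shareholdings = [
    ("Kiwi Widgets Ltd", "Kiwi Holdings Ltd", 100.0),
    ("Kiwi Gadgets Ltd", "Kiwi Holdings Ltd", 60.0),
    ("Kiwi Gadgets Ltd", "Jane Doe", 40.0),
    ("Kiwi Holdings Ltd", "Global Parent Inc", 100.0),
]


def holdings_by_owner(records):
    """Invert company -> shareholders into shareholder -> companies held."""
    index = defaultdict(list)
    for company, shareholder, pct in records:
        index[shareholder].append((company, pct))
    return dict(index)


owned = holdings_by_owner(shareholdings)
print(owned["Kiwi Holdings Ltd"])
# [('Kiwi Widgets Ltd', 100.0), ('Kiwi Gadgets Ltd', 60.0)]
```

A register page organised by company answers "who holds shares in X?"; the inverted index answers "what does X hold?", which is the view that is otherwise unavailable.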
The data from the Federal Reserve’s National Information Center is also fairly up to date, but is (for the biggest banks) locked away in horrendous PDFs and talks about companies controlled by other companies.
The data from the 10-K and 20-F filings from the US Securities and Exchange Commission is the most problematic of all: it is published once a year, as arbitrary text (pretty shocking in the 21st century for this still to be the case), and talks about ‘significant subsidiaries’.
Again we’ll be blogging further about these datasets in the future – what they mean, what they don’t mean, where we’ve found errors and problems, and what inferences you can draw.
We’ll also be adding more datasets in the future, and would love to hear from company registers who want to become more transparent by publishing shareholding information as open data, from banking regulators who’d like to have their bank network data included (and hence combined with other banking regulators’ data), and from corporations who’d like to publish their corporate networks as open data. Drop us a line at email@example.com.
3. An example of the power of this dataset
We think just pulling the data together as open data is pretty cool, and that many of the best uses will come from other users (we’re going to include the data in the next version of our API in a couple of weeks). But we’ve built in some network visualisations to allow the information to be explored. Check out Barclays Bank PLC, Pearson PLC, The Gap or Starbucks.
We also worked together with our friends at Kiln to do a rather cool visualisation based on the data for the six biggest US banks, showing in particular the complexity of the networks by geographic region, and just how long the control chains are. If you’re interested in doing other visualisations, and can’t wait for the data to be included in the API, just drop us a line.
What’s next? More datasets, more features, and also crowdsourcing features for adding, fixing, and improving the data. If you want to be an alpha tester of the crowdsourcing features, email us at firstname.lastname@example.org
Finally, it’s worth stressing what this is, and what this isn’t. This isn’t every company’s corporate network, or even every large corporation’s network. It’s still a subset of that. That’s true of every corporate network database, even the most expensive proprietary ones.
Nor is it error-free. In fact we know it will contain errors, as all large datasets do, particularly ones built from disparate, difficult-to-extract sources such as corporate network information. But the power of this platform, and of open data itself, is that it makes those errors visible to a wider audience, so that they can be identified, fixed, and used to improve the quality of the underlying data.