Last week, OpenCorporates reached another minor milestone, as we hit 90 million companies. That’s a long way from the 3 million companies we started with 4 years ago, and we’re not finished yet. In fact, we’ve got several more jurisdictions (totalling several million more companies) going through our data quality checks at the moment, and we think there’s an outside chance we may reach 100 million by the end of this year.
Adding these jurisdictions is not as simple as you might think. And this is not necessarily for the reasons you’d expect – for example, difficulties around languages and alphabets can be solved with the help of our growing worldwide community.
In fact, some of the trickiest problems are around the data quality of the register itself. We’ll be writing more about this in the future, but we thought that on the occasion of passing 90 million, we’d share some of the problems, and how we’re tackling them in a transparent and open way.
Being transparent about data-quality issues
A case in point is where we have found problems with company identifiers. We’ve written several times about the importance of good company identifiers, and how they are critical to good quality company data. In fact, the Open Company Data Index (which OpenCorporates operates in partnership with the World Bank) now marks down, for data-quality problems, those company registers that don’t issue company numbers. This is because companies change their names so frequently that, without such persistent identifiers, it’s impossible to be sure whether ‘Foo Bar Ltd’ is the same company as ‘Hello World Ltd’ or a different one.
Over the past weeks, frequent users will have noticed that we’ve improved the look and feel of the list of more than 100 jurisdictions we have data for, making it sortable and searchable, listing the number of companies and officers for each, and including both the Open Company Data Index and the Basel Anti-Money-Laundering Index scores.
We’ve also added indicators where we believe there are data quality problems with registers, and clicking through these will give additional details of the problems. In the cases where more explanation is needed, we’re now starting to add data quality reports on the OpenCorporates wiki.
For example, one of the very first jurisdictions we added to OpenCorporates was Jersey, although we were limited to just the basic private/public limited companies (excluding the partnerships and other more esoteric company structures). The problem was that the identifiers issued by the Jersey register were not unique across all company types, and because we use the company number as a primary identifier (e.g. https://opencorporates.com/companies/je/11), adding those other company types would have produced clashing identifiers.
However, rather than deal with this opaquely, we wanted a policy for handling such situations which was transparent, comprehensible and sensible. We created such a policy several months ago, and then, when reviewing the full Jersey dataset as part of our Quality Control procedures, we discovered that identifiers were reused not only among different types of companies, but among companies of the same type too. This meant that the initial idea of using prefixes for company numbers (e.g. RC_1234) wouldn’t work.
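To make the point about prefixes concrete, here is a minimal sketch in Python. The records, the "LP" prefix and the UID format are all invented for illustration and are not real Jersey data; the sketch only shows why a type prefix stops helping once a number is reused within a single company type.

```python
from collections import Counter

# Hypothetical register rows: (internal_uid, company_type, company_number).
# All values are invented for illustration; they are not real Jersey data.
RECORDS = [
    ("uid-001", "RC", "1234"),
    ("uid-002", "LP", "1234"),  # same number, different company type
    ("uid-003", "RC", "5678"),
    ("uid-004", "RC", "5678"),  # same number, same company type
]


def colliding_keys(keys):
    """Return the sorted list of keys that occur more than once."""
    counts = Counter(keys)
    return sorted(key for key, n in counts.items() if n > 1)


# Three candidate identifier schemes built from the same rows.
number_only = [number for _uid, _ctype, number in RECORDS]
prefixed = [f"{ctype}_{number}" for _uid, ctype, number in RECORDS]
by_uid = [uid for uid, _ctype, _number in RECORDS]

print(colliding_keys(number_only))  # ['1234', '5678']
print(colliding_keys(prefixed))     # ['RC_5678']
print(colliding_keys(by_uid))       # []
```

In this toy data the prefix resolves the clash between company types, but not the clash inside one type. Only the register's own internal UID is unique for every row.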
Full details of the approach we’ve taken are on our wiki (with a link from the register page), but in short we’re using the internal UIDs used by Jersey in the URL, with redirects so that existing URLs still work. We’re also being clear about how we’re doing this, publishing a detailed description of the problem and the decisions we’ve taken. We believe this is not just the right thing to do, but also the most useful for our users.
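As a rough illustration of the redirect idea, the Python sketch below resolves a company URL either directly by UID or by redirecting from an older number-based URL. It is not our actual implementation: the UID values, the lookup tables and the resolve function are hypothetical, and it assumes each legacy number points at exactly one company.

```python
# Hypothetical lookup tables; the UID format is invented for illustration.
LEGACY_TO_UID = {"11": "uid-0011"}    # old company number -> internal UID
KNOWN_UIDS = {"uid-0011", "uid-0012"}  # UIDs that have a company page

PREFIX = "/companies/je/"


def resolve(path):
    """Return (status, location) for a /companies/je/<key> path."""
    if not path.startswith(PREFIX):
        return (404, None)
    key = path[len(PREFIX):]
    if key in KNOWN_UIDS:
        # Already a UID-based URL: serve it as-is.
        return (200, path)
    if key in LEGACY_TO_UID:
        # Old number-based URL: redirect permanently to the UID-based URL.
        return (301, PREFIX + LEGACY_TO_UID[key])
    return (404, None)


print(resolve("/companies/je/11"))        # (301, '/companies/je/uid-0011')
print(resolve("/companies/je/uid-0011"))  # (200, '/companies/je/uid-0011')
```

The point of the sketch is simply that old links keep working while the canonical URL is built on an identifier that is unique.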
Please do let us know what you think about this approach, and join us in building a community dedicated to making company information more transparent, more usable, more trustworthy, and more useful.