More countries, more companies, and what this means for reconciliation

The open data collaboration with ScraperWiki to add more companies and more countries has been going incredibly well, with over 20 scrapers for company registers being worked on or completed. Because of this we’ve already been able to import companies from Iceland, Singapore and, today, Malta, bringing the total number of companies we’ve got URLs for to over 8 million, from 13 jurisdictions, with many more coming to OpenCorporates soon.

However, as the list of jurisdictions we’re covering grows, there’s an obvious question: when you’re matching a company which appears with exactly the same name in several jurisdictions, how do you know which one to match? This is particularly pertinent to matching with Google Refine.

To a certain extent we can handle this with some intelligent scoring, marking down foreign branches compared with the home company, and Google Refine (or others using the Google Refine API that OpenCorporates supports) can use this score to rank likely matches. (For those who aren’t familiar with branches, a company doing business in another country/jurisdiction will often have to, or choose to, register a branch there; this is different to having a subsidiary there, which would have its own board, shareholders, etc.)
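As a rough illustration of the idea (not the actual OpenCorporates scoring, whose details aren’t given here), marking down branches could look something like the Python sketch below. The Candidate class, the BRANCH_PENALTY value and the example company are all made up for the purpose.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A hypothetical reconciliation candidate; not the real API response shape."""
    name: str
    jurisdiction: str
    base_score: float
    is_branch: bool


# Assumed multiplier applied to branches; the real weighting is not documented here.
BRANCH_PENALTY = 0.5


def adjusted_score(candidate: Candidate) -> float:
    """Mark a foreign branch down relative to a home-company registration."""
    if candidate.is_branch:
        return candidate.base_score * BRANCH_PENALTY
    return candidate.base_score


def rank(candidates: list[Candidate]) -> list[Candidate]:
    """Return candidates ordered from most to least likely match."""
    return sorted(candidates, key=adjusted_score, reverse=True)


# Two registrations with an identical name and identical raw score.
candidates = [
    Candidate("Example Trading Ltd", "gi", 100.0, is_branch=True),
    Candidate("Example Trading Ltd", "gb", 100.0, is_branch=False),
]
ranked = rank(candidates)
```

With these made-up inputs the home registration is ranked ahead of the branch, but two unrelated companies with the same name would still tie, which is the case the rest of this post deals with.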

However, there will increasingly be cases where there are two separate companies (they may or may not be related) with the same name in different jurisdictions, or where the register doesn’t say if it’s a branch. An example would be Barclays Bank PLC:

For people navigating the info through the browser, it’s fairly easy to click through the filters on the right, but for users of Google Refine, you really want the ability to restrict a reconciliation to a particular jurisdiction.

Well, now you can, because we’ve implemented jurisdiction-specific Google Refine reconciliation endpoints.

How does it work? Simple. You just add the jurisdiction code to the end of the normal Google Refine reconciliation URL and you’re done. The jurisdiction codes are the ISO 3166-derived letter codes that we use in all company URLs (e.g. ‘lu’ for Luxembourg, ‘is’ for Iceland, and ‘us_mi’ for the US state of Michigan – see this post for more details), so the Google Refine endpoint for Gibraltar is http://opencorporates.com/reconcile/gi rather than http://opencorporates.com/reconcile.
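If you’re setting up reconciliation services from a script, the URL rule above can be sketched in a few lines of Python. The function name reconcile_endpoint is just an illustrative helper, and lower-casing the code is an assumption based on the examples given here.

```python
from typing import Optional
from urllib.parse import quote

# The general (all-jurisdiction) reconciliation endpoint mentioned above.
BASE_RECONCILE_URL = "http://opencorporates.com/reconcile"


def reconcile_endpoint(jurisdiction_code: Optional[str] = None) -> str:
    """Build the reconciliation URL, optionally restricted to one jurisdiction.

    Hypothetical helper: it simply appends the jurisdiction code to the
    base URL, lower-casing it on the assumption that codes are lower case.
    """
    if not jurisdiction_code:
        return BASE_RECONCILE_URL
    return f"{BASE_RECONCILE_URL}/{quote(jurisdiction_code.strip().lower())}"


gibraltar = reconcile_endpoint("gi")
michigan = reconcile_endpoint("us_mi")
everything = reconcile_endpoint()
```

Here gibraltar is http://opencorporates.com/reconcile/gi, and calling the helper with no code falls back to the unrestricted endpoint.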

We’re planning to allow more granular reconciliation in the future (e.g. when a data file might have a different jurisdiction in each row), but we’re still figuring out the best way to do that 😉

4 thoughts on “More countries, more companies, and what this means for reconciliation”

  1. Why isn’t the jurisdiction just another query property? I might want to provide a list of jurisdictions. It seems wrong to mangle the URL to add a qualifier to the query when the query mechanism has such rich qualifiers already.

    Thanks!

    Ralph

    1. We plan on allowing the jurisdiction to be provided as an individual filter for each row in the future, but doing global jurisdiction restrictions seemed to solve many of the biggest use cases. So we did that first 😉
