OpenCorporates’ mission is to be able to list every company in the world, using only public sources to provide full transparency and provenance. In order to achieve this, the OpenCorporates Data team works constantly to expand its coverage of jurisdictions where companies can be registered (120 and counting so far!), whilst maintaining a rigorous set of data quality standards.
To provide some openness about how we go about this, we’re going behind the scenes in a series of data-focused blog posts intended to help explain what happens when we introduce a new company jurisdiction to OpenCorporates.
At a broad level, the process goes through these steps:
- Scouting – finding new sources of data & choosing the most appropriate one
- Analysis – understanding the data in depth and mapping attributes to the OpenCorporates data model [Read here]
- Development – writing code to automate data collection and ongoing updates [Read here]
- Quality Assurance – testing the data to ensure it meets our quality standards
- Pre-import readiness – final sign-off and configuration setup
- Import – data is finally added to OpenCorporates, ready for use
- BAU – Post-import and ongoing support activities, the “business as usual” process through which we ensure the smooth running of regular data updates
We’ll update this post with links to all parts of the series as they’re written.
This first post covers Scouting.
Why & when do we need to scout?
There can be a number of reasons that prompt us to start our investigations into a country that we’ve not already obtained company data for:
- Demand from our open-data or commercial clients
- Tip-offs about a new open dataset from our network of community partners, or increasingly from the register itself when it publishes its data as open data
- Internal prioritisation of our roadmap
We also need to scout for sources when an existing source of data already in OpenCorporates becomes unavailable, for example when a registry website stops offering free access to basic company records, as happened with Spain and Gibraltar.
For all of these we generally carry out the same process outlined below.
Working with the local community
OpenCorporates could not do what it does without the help of a wide network of corporate transparency campaigners, NGOs, open data activists and even government itself, from whom we glean local knowledge, translation/language skills and friendly advice. Liaising with the wider community (like we did in the case of Israel) at the beginning of the process speeds up the overall process and leads to better quality data. Working alone risks incorrect assumptions being made about the different sources of data available, and having a local subject matter expert on tap reduces this risk considerably. Both sides benefit, as we often ask detailed domain-specific questions about the data that even the government hasn’t thought of before. And of course, everyone benefits from the increased transparency that occurs when we publish the data and make it available via the website and the API.
Finding New Sources
When looking to introduce a new jurisdiction, the first step is (fairly obviously) to work out where we should get our data from. OpenCorporates only obtains company data from freely available, publicly owned sources that authoritatively identify companies, and so we start with tracking down the official company registration authority (or authorities) for that jurisdiction.
Of course, OpenCorporates maintains its own list of these, the Open Company Data Index, which not only lists company registers, but also scores them for openness, and makes all the data available as open data too. There are also some third-party resources we occasionally use:
- Wikipedia list of company registers
- OKFN list of company open datasets
- European Commission list of EU company registers
- RBA Information Services List
- A list of official company registers by country provided by Companies House
- A list of official company registers around the world maintained by the Commercial Register Office of the Canton St. Gallen, Switzerland
If this does not enable us to pin down the correct source, we then put our sleuthing hats on, and start searching with the help and collaboration of our community colleagues.
So, what is the ‘correct’ source to use? This varies from jurisdiction to jurisdiction, and there may be multiple government bodies that make company data available. Here are some common examples:
- Company registration bodies, e.g. a national register, or regional Chambers of Commerce
- Government agencies acting as a data aggregator – for example national statistics bodies
- Governmental Open Data agencies
- Tax authorities, e.g. corporate tax databases
- Business licensing bodies publishing data on companies licensed to trade in that country
- Official government notices providing listings of new company registrations or amendments – for example gazettes or court judgements/listings
We’ll look into each data source and start to assess its suitability for use in OpenCorporates. The main questions we ask are:
- How authoritative is the source?
- Is it the main originator of the data, or is it combining & republishing other authorities’ data?
- Does it contain complete listings of companies in that country, or just a sub-set? For example, only active companies, or just companies of a certain type, might be available.
- Are there unique and persistent identifiers for each company?
- How rich is the available data, e.g. what attributes are available?
- How easy is the data to obtain? Are there any technical constraints?
- Are there any restrictive legal terms & conditions regarding re-use of the data?
We compare the sources and pick the best one, using various criteria. First, having good, unique, persistent identifiers is a prerequisite. Second, the sources are weighed against each other, with higher weights given to the source closest to the company registration process, having the richest data, and with the most permissive T&Cs. We also prioritise the use of open, bulk data (e.g. in CSV or XML format) or APIs over other approaches to obtaining data such as web scraping.
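To make the comparison a little more concrete, here is a minimal sketch of how such a ranking could be expressed in code. It is purely illustrative and not our actual tooling: the `Source` fields, the 0 to 3 rating scales, the `ACCESS_SCORE` weights and the candidate sources are all assumptions invented for this example.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """A candidate data source (all fields are hypothetical ratings)."""
    name: str
    has_persistent_ids: bool  # prerequisite: unique, persistent identifiers
    closeness: int            # 0-3: how close to the registration process
    richness: int             # 0-3: how many useful attributes are available
    permissiveness: int       # 0-3: how permissive the T&Cs are
    access: str               # "bulk", "api" or "scrape"


# Assumed weights: open bulk data and APIs are preferred over scraping.
ACCESS_SCORE = {"bulk": 2, "api": 2, "scrape": 0}


def score(source):
    """Sum the ratings, plus a bonus for the preferred access methods."""
    return (source.closeness + source.richness
            + source.permissiveness + ACCESS_SCORE[source.access])


def pick_best(sources):
    """Drop sources lacking persistent identifiers, then take the top score."""
    eligible = [s for s in sources if s.has_persistent_ids]
    if not eligible:
        return None  # no usable source: the jurisdiction goes on hold
    return max(eligible, key=score)


# Made-up candidates for an imaginary jurisdiction.
candidates = [
    Source("national_register", True, 3, 2, 1, "scrape"),
    Source("open_data_portal", True, 2, 2, 3, "bulk"),
    Source("gazette", False, 3, 3, 3, "bulk"),
]

best = pick_best(candidates)
print(best.name if best else "no suitable source")
```

In this made-up example the gazette is excluded despite its high ratings, because it fails the identifier prerequisite. In practice the judgement is made by people rather than by a formula, so treat the numbers as a way of thinking about the trade-offs, not as a scoring scheme we apply.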
What happens when we can’t find a good source?
Sometimes we are not able to find a good source of data: perhaps the company register is behind a paywall, or is simply not available online. In this case we’ll put that jurisdiction on hold, and work with our community networks to support their efforts in working with politicians and government bodies to open up register data, by providing evidence on the benefits gained through increased transparency, thought leadership, or support for publicity or pressure campaigns.
Case Study: Texas, USA
The official company register in Texas is maintained by the Secretary of State. Laws in the State allow it to charge a $1 fee for each search (http://www.statutes.legis.state.tx.us/Docs/GV/htm/GV.405.htm Sec. 405.018), making company data only accessible to those able to afford it.
In contrast, all corporations registered in Texas are required to pay Franchise Tax, and the Texas Tax Code designates information about companies and most corporate officers and directors as public information. This allows the Texas Comptroller of Public Accounts (who manages Franchise Tax collection) to make company data freely available, either via their search pages of taxable entities [https://mycpa.cpa.state.tx.us/coa/] or as a series of open data files of taxpayers [https://comptroller.texas.gov/transparency/open-data/search-datasets/].
Based on this, we analysed the data available from the Texas Comptroller to validate how complete it would be in terms of the number of companies and the availability of other data (including company numbers issued by the Secretary of State), and to check that it came with a permissive re-use licence. The results made it an easy choice to use it as our main data source for Texas.
Once we’ve established the preferred source, we’ll update the Open Company Data Index with any changes that are necessary. We’ll then analyse the available data in more detail and start to work with our developers to sample the data and map it to the OpenCorporates standard Company data model, a topic that will be covered in more detail in the next blog post – stay tuned!