This is the second of our behind-the-scenes series of data-focused blog posts intended to help explain what happens when we introduce a new company jurisdiction to OpenCorporates. In the previous blog post we discussed how we find new sources of company data & choose the most appropriate one.
In this, Part 2, we’re covering Analysis – the process by which we aim to understand the company register in depth and map the data attributes to the OpenCorporates “Company” data model schema.
What’s in scope?
The first thing we establish is the kind of data the source publishes, so that we take only the entity types we are interested in, namely:
- Legal entities – covering the majority of registered companies
- Natural persons acting in a business capacity (sole traders)
- Partnerships – limited or unlimited
- Foreign branches of out-of-jurisdiction companies
- Local branches/representative or agency offices that have legal personality in their own right, e.g.
  - local law allows the branch to enter into contracts in their own right
  - local branch that can be taken to court separately from their parent
- Other entity formation types, e.g.
  - State bodies or state-owned companies
  - Corporations sole
  - Entities formed via contract, e.g. Luxembourg “SICAV” funds
  - Unincorporated Associations
  - Non-profits, Foundations, Trusts, etc.
Basically, anything that forms a “company” – an entity that has the ability to enter into commercial contracts, and/or can carry out commercial activities.
What’s out of scope
Company register search engines sometimes offer up other types of data which do not fall under the above types of entity. We run various tests to ensure that we exclude these types of data as “companies” in OpenCorporates. Here are the most frequently-seen types:
- Registered trademarks or symbols
- Name reservations – where a prospective company that has not yet incorporated is reserving a name to prevent other companies from taking the same name; or where an existing company registers alternative trading names separately from the main company.
- Local branches that do not have distinct enough legal personality from the parent company in the same jurisdiction, e.g.:
  - Local establishments
  - Representative offices
  - Agency offices
  - Factories, shops, call centers, etc.
Company Number identifiers
It’s essential we are able to categorically and uniquely identify legal entities. Companies change their names relatively frequently, and legal names are even reused. Being able to use the identifiers issued by corporate registers is an essential requirement of our analysis, as such identifiers are the only way of categorically identifying legal entities (we use the term ‘company numbers’, even though the identifiers issued by corporate registers may not always be numeric).
In most cases, corporate registers issue persistent identifiers. However, we have discovered a number of registers where the identifiers are neither well designed, persistent nor unique. We’ve written a public policy document that outlines our approach to handling company numbers, as well as what we do when there are problems with the numbers issued by the register.
During the analysis phase, we will assess the company numbering scheme in use as outlined in the policy paper, and ensure that the correct approach is taken. This includes a “light touch” normalisation of identifiers in some cases – whilst we always capture the original number, we do remove spaces and punctuation to help with matching of company data to other data items, and to allow the normalised number to form part of the public OpenCorporates URL (e.g. https://opencorporates.com/companies/gb/XXXXXXXX) which is our unique, open identifier for a company.
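As a rough illustration of the “light touch” normalisation described above, here is a minimal Python sketch. The names `CompanyNumber`, `normalise_company_number` and `company_url` are hypothetical, and the sketch assumes that only whitespace and punctuation are dropped while letter case is left untouched; the actual rules are set out in the policy document and may differ by register.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CompanyNumber:
    """Hypothetical container: the number as issued plus its normalised form."""
    original: str     # exactly as published by the register
    normalised: str   # spaces and punctuation removed


def normalise_company_number(raw: str) -> CompanyNumber:
    # Keep letters and digits only; whitespace and punctuation are dropped.
    # Assumption: case is preserved, since the text does not mention changing it.
    cleaned = "".join(ch for ch in raw if ch.isalnum())
    return CompanyNumber(original=raw, normalised=cleaned)


def company_url(jurisdiction_code: str, number: CompanyNumber) -> str:
    # Follows the URL pattern quoted in the text above.
    return f"https://opencorporates.com/companies/{jurisdiction_code}/{number.normalised}"


number = normalise_company_number("SC 123-456")  # illustrative value only
print(number.original, "->", number.normalised)
print(company_url("gb", number))
```

Keeping the original alongside the normalised form means nothing the register published is lost, while the normalised form can be used for matching and in URLs.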
Corporate registers might also publish other identifiers that are used elsewhere in the jurisdiction, for example Taxpayer or VAT IDs, Charity numbers or “Business numbers”. We look for these because they help us reconcile other data types (licences, register entries) to the Company record.
Most countries have a standardised way of identifying the various economic activities being undertaken by companies in that jurisdiction. When collected and published by company registers, the information can be used to help identify potential clients (and competitors), suppliers or new markets for investment.
When a source does publish the data, we carry out analysis to ensure we can map it to the standard classifications used by that country. We start with the UN listings of country-specific classification systems to establish which local authority is responsible for maintaining the codes and descriptions, and we then follow up with that authority to obtain the full lists in a structured data format. Part of our analysis effort covers the assessment of how country classification codes can be rolled up or mapped to these standard classifications.
Mapping the fields
The most important part of our analysis is to carry out a detailed field-to-field mapping of corporate register data to OpenCorporates’ company schema. This allows us to model a company’s data attributes in a consistent way irrespective of what jurisdiction they are registered in.
We start with the core company fields – name, number, incorporation date, etc – and work our way through all the data that is available, including Officers and Filings. We carry out a wide range of data sampling against the source, perhaps looking at what attributes are made available by company type, or active/inactive statuses, as these may vary. Some fields in the source might need splitting into multiple fields in our schema, requiring more complex mapping or transformation rules.
Our aim is to ensure we have analysed all fields made available by the source, and that all have been correctly mapped / transformed to the OpenCorporates Company schema.
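To make the idea of a field-to-field mapping concrete, here is a small Python sketch. Everything in it is an assumption for illustration: the source field names (`RegNo`, `Name`, `Registered`, `Status`, `Fax`), the target field names and the “status since date” format are invented, and do not reflect any particular register or the real OpenCorporates schema. It shows one source field being split into two target fields, and it reports any source field that has no mapping yet.

```python
from datetime import date
from typing import Any, Callable, Dict, List, Tuple


def split_status(value: str) -> Dict[str, Any]:
    # Hypothetical source format: "Dissolved since 2015-03-01".
    # One source field becomes two target fields.
    status, _, iso_date = value.partition(" since ")
    out: Dict[str, Any] = {"current_status": status.strip()}
    if iso_date:
        out["status_date"] = date.fromisoformat(iso_date.strip())
    return out


# Source field -> transformation producing one or more target fields.
FIELD_MAP: Dict[str, Callable[[str], Dict[str, Any]]] = {
    "RegNo": lambda v: {"company_number": v.strip()},
    "Name": lambda v: {"name": v.strip()},
    "Registered": lambda v: {"incorporation_date": date.fromisoformat(v)},
    "Status": split_status,
}


def map_record(source: Dict[str, str]) -> Tuple[Dict[str, Any], List[str]]:
    """Return the mapped record and the list of source fields left unmapped."""
    mapped: Dict[str, Any] = {}
    unmapped: List[str] = []
    for field, value in source.items():
        transform = FIELD_MAP.get(field)
        if transform is None:
            unmapped.append(field)  # flags a field the analysis has not covered
        else:
            mapped.update(transform(value))
    return mapped, unmapped


record = {"RegNo": " 123 ", "Name": "Acme Ltd", "Registered": "2001-02-03",
          "Status": "Dissolved since 2015-03-01", "Fax": "n/a"}
mapped, unmapped = map_record(record)
print(mapped)
print("Unmapped source fields:", unmapped)
```

Reporting unmapped fields instead of silently dropping them is one simple way to check that every field offered by the source has been looked at.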
Where there are any ambiguities or queries raised about the data, we will liaise with the register or with our in-country community collaborators to help ensure our understanding about the data is correct.
Case study: Belarus
Earlier in 2017 we added Belarus companies to OpenCorporates. As the register publishes relatively limited company data, the mapping was fairly straightforward as can be seen in this screenshot. Our main challenge was working with Cyrillic text, and we worked with native speakers to speed up the analysis effort.
If we are able, we try to normalise addresses into these components:
- Street Address
- Postal Code
- Country name
- Country code
Local language / script is retained for addressing elements (street address, locality, region, country), with the exception of country code which we use to support searches and data matching.
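A minimal Python sketch of such a normalised address follows. The `Address` class, the `with_country_code` helper and the tiny lookup table are hypothetical, the example address is made up, and the use of two-letter ISO 3166-1 codes is an assumption, as the text does not say which code list is used.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Address:
    """Hypothetical address record; text fields stay in the local script."""
    street_address: Optional[str] = None
    locality: Optional[str] = None
    region: Optional[str] = None
    postal_code: Optional[str] = None
    country: Optional[str] = None
    country_code: Optional[str] = None  # script-independent, used for matching


# Illustrative lookup only; a real table would cover every country name variant.
COUNTRY_CODES = {"Беларусь": "BY", "Belarus": "BY"}


def with_country_code(address: Address) -> Address:
    # The country name is kept as published; only the code is derived.
    code = COUNTRY_CODES.get((address.country or "").strip())
    return replace(address, country_code=code)


addr = with_country_code(Address(street_address="вул. Леніна 1", country="Беларусь"))
print(addr.country, addr.country_code)
```

Because the country code does not depend on language or script, it gives searches and data matching a stable value to work with even when every other address element is in Cyrillic.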
Low cardinality field analysis
These are the main fields used to assist users in filtering results on the OpenCorporates website:
- Current Status
- Company Type
- Officer Position
We review the list of values and carry out some minimal normalisation to ensure that these fields are low cardinality fields (i.e. in theory there should be relatively few unique company type values), so that users are easily able to search and filter using the data. We’re also looking at how we can provide more standardized values for these fields, and perhaps also tie in with the new ISO standard on Entity Legal Forms.
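The review of values can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, “minimal normalisation” is taken to mean collapsing whitespace and ignoring case, and the limit of 50 distinct values is an arbitrary threshold, not a figure from our process.

```python
from collections import Counter
from typing import Iterable


def tidy(raw: str) -> str:
    # Collapse runs of whitespace and ignore case so trivial variants merge.
    return " ".join(raw.split()).casefold()


def value_counts(values: Iterable[str]) -> Counter:
    """Count how often each tidied value occurs in a sample."""
    return Counter(tidy(v) for v in values)


def is_low_cardinality(values: Iterable[str], limit: int = 50) -> bool:
    # `limit` is an arbitrary illustrative threshold.
    return len(value_counts(values)) <= limit


sample = ["Active", "ACTIVE ", " active", "Struck  off", "struck off"]
print(value_counts(sample))
print(is_low_cardinality(sample, limit=2))
```

Looking at the counts per value, not just the number of distinct values, also helps spot rare spellings that are probably variants of a common value.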
As mentioned in the scope section above, branches are companies based in other jurisdictions that have a presence in the jurisdiction being looked at. Many jurisdictions require registration of branches for tax or other regulatory purposes.
We are usually able to directly identify them based on information provided in company or entity type (such as “Foreign Company” or “Branch of an offshore company”), but registers can sometimes have other ways to denote them, such as a variation in company number prefix.
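As a sketch of how these two signals might be combined, the Python below checks the entity type first and falls back to the company number prefix. The type labels are the examples quoted above; the `F` prefix and the function name `is_branch` are hypothetical, since each register has its own conventions.

```python
from typing import Optional

# Type labels taken from the examples above, compared case-insensitively.
BRANCH_TYPES = {"foreign company", "branch of an offshore company"}

# Hypothetical prefix; real registers define their own numbering rules.
BRANCH_NUMBER_PREFIXES = ("F",)


def is_branch(company_type: Optional[str], company_number: str) -> bool:
    """Return True if either the type or the number prefix marks a branch."""
    if company_type and company_type.strip().casefold() in BRANCH_TYPES:
        return True
    # str.startswith accepts a tuple of candidate prefixes.
    return company_number.strip().upper().startswith(BRANCH_NUMBER_PREFIXES)


print(is_branch("Foreign Company", "123456"))
print(is_branch("Private Limited Company", "F000123"))
print(is_branch("Private Limited Company", "123456"))
```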
Language, localisation & character sets
Company data is multilingual, and often presented in non-Latin script. OpenCorporates currently translates a few attributes to English, where the source is available only in non-Latin character sets (e.g. Arabic, Chinese, Cyrillic, Japanese, etc). This is limited to the main search filters on our website – current status and company type. We work with our community partners to assist with translation. All data is stored as UTF-8 (currently 3-byte Basic Multilingual Plane Unicode characters).
For example:
Original: 有限会社
English: Limited Company
==> company_type: Limited Company (有限会社)
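The display format in the example above, and the note about Basic Multilingual Plane characters, can be sketched in Python as follows. The translation table and the function names are hypothetical; in practice translations come from working with community partners, not from a hard-coded dictionary.

```python
# Illustrative translation table with a single entry from the example above.
TRANSLATIONS = {"有限会社": "Limited Company"}


def display_company_type(original: str) -> str:
    """Return 'English (original)' when a translation exists, else the original."""
    english = TRANSLATIONS.get(original)
    if english is None:
        return original
    return f"{english} ({original})"


def is_bmp(text: str) -> bool:
    # Basic Multilingual Plane code points are U+0000..U+FFFF,
    # which UTF-8 encodes in at most three bytes each.
    return all(ord(ch) <= 0xFFFF for ch in text)


company_type = display_company_type("有限会社")
print(company_type)
print(is_bmp(company_type), len("有限会社".encode("utf-8")))
```

Keeping the original text in brackets means the translated value stays searchable in English without discarding what the register actually published.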
Where we’re changing from one source to another, or if the source we are using changes its layout or structure – perhaps adding or removing some fields, or amending how others are displayed – we carry out a significant amount of additional analysis to determine the impact on existing data already in OpenCorporates. This then drives requirements for any data migration development work that might be needed.
Next steps – obtaining sample data
Once the initial mapping analysis has been completed, reviewed and signed off, we will obtain sample data (unstructured, structured, and transformed into our Company schema format). We need sufficient data to be able to confirm any hypotheses about our field mapping and transformation rules, and to help construct our unit tests. The sample dataset then serves as a baseline for our fetcher development process… which we’ll cover in the next blog post!