Understanding corporate networks. Part 4: how we record the data

In parts 1, 2 and 3 of this series, we explored the complex world of corporate control, and how it is described in various regulations. We found that a company may control other companies in many different ways, from majority and minority share holdings to contractual relationships.

At OpenCorporates we believe corporate control data is the cornerstone of a useful Open Corporate Data platform. Without it, there’s no way to link apparently unconnected regulated events that are in fact related. For example, in 2009, the City of Prineville, Oregon granted a 15-year property tax exemption to a company called Vitesse LLC. This is, in fact, a tax exemption that ultimately benefits Facebook Inc:

[Image: Vitesse LLC corporate network diagram]

Therefore, one of our goals at OpenCorporates is to capture corporate control information as Open Data. There are already sources of corporate hierarchy data online, but these are not open, and we consider this a problem.

Why? Not because you have to pay to access this kind of information (though this shuts out many potential users, and thus has negative effects on data quality). Rather, it’s because these datasets are a proprietary asset: their value relies on the fact they are hard to reproduce. Such vendors, therefore, have an interest in not telling you where their data came from. By contrast, at OpenCorporates we ensure that every statement we make has a full provenance, so you can check our facts – which both gives you confidence, and allows you to let us know if we’ve made any mistakes.

As we’ve seen, the notion of “control” is hard to define: it depends on regulations that vary between jurisdictions and that often rely on subjective judgement to interpret. This is the main reason why it’s important to know the sources for data: it allows you to understand those judgements, or even create your own models of control based on the underlying data, if you have the time! If you don’t have the time, of course, the control models we come up with will hopefully be good enough.

The main purpose of this article is to describe how we model our data in detail. We want our users to understand what the data on OpenCorporates means, and to decide whether the control models work for them. We also think this granular, provenanced approach to facts is innovative, and rather than attempt to patent it, we’d rather share it with the whole community. 

How we model the data

In the core OpenCorporates database, every bit of data can be described as a:

STATEMENT about one or more COMPANIES or PLACEHOLDERS, each with a PROVENANCE

What does each of these terms mean? I’ll describe them in reverse order:

Provenances

A “provenance” is everything you need to decide if you want to trust a statement made by OpenCorporates. Here’s one:

A provenance on OpenCorporates

The most important part is the source: the primary reference where you can find the information yourself. This should always be something you can check yourself – normally a link to a website, or a copy of an original document.

The provenance also tells you who found the information (in this case some automated software known as a “bot”), when it was found, and the confidence that our interpretation of the source is correct. Crucially, the only sources which we consider trustworthy enough to include by default are regulatory sources, by which we mean official records like government registries, and notices which are backed by law.
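As a rough illustration, the parts of a provenance described above could be sketched as a small record. This is not OpenCorporates’ actual schema: the field names, the 0–1 confidence scale, the `regulatory` flag and the example values are all assumptions made for the sketch.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Provenance:
    """Hypothetical record of where a piece of data came from."""
    source_url: str    # primary reference a reader can check for themselves
    found_by: str      # who found it, e.g. a person or a bot
    found_on: date     # when it was found
    confidence: float  # assumed scale: 0.0 (no trust) to 1.0 (certain)
    regulatory: bool   # True for official registers and notices backed by law


def included_by_default(provenance: Provenance) -> bool:
    """Only regulatory sources are trusted enough to include by default."""
    return provenance.regulatory


# Illustrative values only; the URL and date are placeholders, not real data.
example = Provenance(
    source_url="https://example.org/regulatory-filing",
    found_by="bot",
    found_on=date(2013, 1, 1),
    confidence=0.8,
    regulatory=True,
)
```

Keeping the regulatory flag separate from the confidence score reflects the distinction drawn above: confidence is about our interpretation of the source, while the flag is about what kind of source it is.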

Companies or Placeholders

A “placeholder” is the term OpenCorporates uses to describe something we believe is probably a company. For example, this is how Facebook’s subsidiaries are listed in a recent regulatory filing:

Subsidiaries of Facebook in their 10-K SEC filing, 2012

This seems to be telling us that a company called Vitesse, LLC, registered in Delaware, is a subsidiary of Facebook, Inc. But there are various reasons this might not be the case, such as:

  • The person inputting the text may have made a typing error, or written the wrong place of incorporation
  • The name of Vitesse, LLC may have changed since the filing…
  • …and there may now even be another, different company called Vitesse, LLC.

Thus, being pedantic, this document is telling us that these are probably companies. At OpenCorporates, we take pride in pedantry, so we refuse to call Vitesse, LLC a company until we can prove it exists, with reference to an entry in the official corporate register for Delaware (and of course there are some company registers that do not make this information freely available). Until then, we call it a placeholder.

When we feel we can reliably say that a placeholder is, in fact, a company, we create a new record in our system which we call a “company reconciliation link”, and record the provenance for that link as a separate data point.
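To make the distinction concrete, here is a minimal sketch of a placeholder, a company and a reconciliation link between them. The class names, fields and the `resolve` helper are hypothetical, and the company number is a dummy value rather than a real register entry.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Placeholder:
    """Something a source names that is probably a company."""
    name: str
    jurisdiction: str


@dataclass(frozen=True)
class Company:
    """An entity confirmed against an official corporate register."""
    name: str
    jurisdiction: str
    company_number: str


@dataclass(frozen=True)
class ReconciliationLink:
    """Asserts that a placeholder is a given company, with its own source."""
    placeholder: Placeholder
    company: Company
    source: str  # provenance of the match itself, recorded separately


def resolve(placeholder: Placeholder,
            links: List[ReconciliationLink]) -> Optional[Company]:
    """Return the reconciled company, or None if it is still a placeholder."""
    for link in links:
        if link.placeholder == placeholder:
            return link.company
    return None


vitesse = Placeholder("Vitesse, LLC", "Delaware")
# Dummy company number and source, for illustration only.
registered = Company("Vitesse, LLC", "Delaware", "0000000")
link = ReconciliationLink(vitesse, registered, source="matched by an editor")
```

With no links, `resolve(vitesse, [])` returns `None` and Vitesse, LLC stays a placeholder; once the link exists, `resolve(vitesse, [link])` returns the registered company.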

Statements

A “statement” is a fact or assertion that we’ve derived from a primary source. There are various types of statement in OpenCorporates, such as “Licences” (a permission for a company to engage in a regulated activity) and “Subsidiary Relationships”:

Example statement from OpenCorporates

Technically, a statement is composed of several bits of information:

  1. The data point. In the example above, this is “There is a subsidiary relationship that existed on December 31, 2012”. The data point is derived directly from information in the primary source.
  2. The subject company or placeholder. In the example above, this is Facebook, Inc.
  3. The object company or placeholder. In this case, Vitesse, LLC.
  4. The verbs linking the respective companies to the data point. In this case, Facebook, Inc “has a subsidiary”, and Vitesse, LLC “is a subsidiary”. Internally, we call these “placeholder data links”.

Optionally, there may also be company reconciliation links linking the placeholders to companies (as described above).
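The four parts listed above can be sketched in code as follows. This is an illustrative approximation, not our internal representation: the class and field names and the `describe` helper are assumptions, and the provenances and optional reconciliation links are left out to keep it short.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Placeholder:
    name: str


@dataclass(frozen=True)
class DataPoint:
    """The fact itself, derived directly from the primary source."""
    kind: str    # e.g. "subsidiary relationship"
    as_of: date  # the date on which the relationship existed


@dataclass(frozen=True)
class PlaceholderDataLink:
    """Connects a placeholder to a data point with a verb."""
    placeholder: Placeholder
    verb: str


@dataclass(frozen=True)
class Statement:
    data_point: DataPoint
    subject_link: PlaceholderDataLink
    object_link: PlaceholderDataLink


def describe(statement: Statement) -> str:
    """Render a statement as a readable sentence."""
    subject = statement.subject_link
    obj = statement.object_link
    return (f"{subject.placeholder.name} {subject.verb}: "
            f"{obj.placeholder.name} "
            f"(as of {statement.data_point.as_of.isoformat()})")


statement = Statement(
    data_point=DataPoint("subsidiary relationship", date(2012, 12, 31)),
    subject_link=PlaceholderDataLink(Placeholder("Facebook, Inc"), "has a subsidiary"),
    object_link=PlaceholderDataLink(Placeholder("Vitesse, LLC"), "is a subsidiary"),
)

print(describe(statement))
# Facebook, Inc has a subsidiary: Vitesse, LLC (as of 2012-12-31)
```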

Here’s how we represent the structure of a statement internally:

The schema for an OpenCorporates statement

And here’s the same structure, with the data from the Vitesse, LLC example we’re using:

How a statement about Facebook is recorded

Crucially, every component in the diagram above also comes with a provenance. This means that we know:

  • where the data came from (in this case, the SEC);
  • where the placeholders and data links came from (in this case, they were inferred by software, but could be inferred by a site editor);
  • where the reconciliation links came from (either a person matching placeholders to companies, or software again); and
  • where the companies came from (corporate registers, invariably).

How we model corporate control networks

So those are the bits of information that lie behind any assertion we make on OpenCorporates. When it comes to corporate relationships of control, such as the nice tree diagrams excerpted above, we are primarily interested in the following types of statement:

  • Subsidiaries: statements that X is a subsidiary of Y
  • Share Holdings: statements that X holds shares in Y
  • Acquisitions: statements that X acquired Y. Often these statements are derived from press releases, so are not considered as reliable as other kinds of statements.
  • Branches: entities permitted to operate in a jurisdiction but with a legal personality registered elsewhere.

As we’ve seen, to make statements about control relationships, you have to make a number of assumptions – for example, what percentage of share ownership constitutes control. You also have to think about the confidence level you’re willing to accept; is a press release a sufficiently reliable source for you? Or are you only interested in regulated information of the kind available in corporate registers?

This kind of decision involves a considerable amount of judgement, analysis and time, which is why we’ve come up with a way of combining these statements into corporate networks on our network pages. We currently combine subsidiaries, shareholdings and acquisitions to form networks, and are planning to add branches in 2014. Our network pages allow you to set the confidence level you’re willing to accept, and the shareholding level you want to consider as implying control.

Options for controlling which companies appear in a network view

Additionally, if you click on a company name, you can view the provenances for the statements behind its presence in the network, and check them for yourself.
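If you want to experiment with your own control model, the filtering described above can be approximated in a few lines. The sketch below is not the code behind our network pages: the `Relationship` record, the 0–1 confidence scale, the percentage threshold and the company names are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class Relationship:
    """A simplified statement linking a parent to a child entity."""
    parent: str
    child: str
    kind: str          # "subsidiary", "shareholding" or "acquisition"
    confidence: float  # assumed scale: 0.0 to 1.0
    percentage: Optional[float] = None  # only meaningful for shareholdings


def control_edges(relationships: List[Relationship],
                  min_confidence: float,
                  min_shareholding: float) -> List[Relationship]:
    """Keep only relationships that meet the reader's chosen thresholds."""
    kept = []
    for rel in relationships:
        if rel.confidence < min_confidence:
            continue
        if rel.kind == "shareholding" and (
                rel.percentage is None or rel.percentage < min_shareholding):
            continue
        kept.append(rel)
    return kept


def controlled_by(root: str, edges: List[Relationship]) -> Set[str]:
    """Collect everything reachable from root by following kept edges."""
    children: Dict[str, List[str]] = {}
    for edge in edges:
        children.setdefault(edge.parent, []).append(edge.child)
    seen: Set[str] = set()
    stack = [root]
    while stack:
        node = stack.pop()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


# Invented companies and figures, for illustration only.
relationships = [
    Relationship("A Corp", "B Ltd", "subsidiary", 0.9),
    Relationship("B Ltd", "C GmbH", "shareholding", 0.9, percentage=60.0),
    Relationship("A Corp", "D Inc", "acquisition", 0.3),  # e.g. a press release
    Relationship("C GmbH", "E SA", "shareholding", 0.9, percentage=20.0),
]

strict = controlled_by("A Corp", control_edges(relationships, 0.5, 50.0))
loose = controlled_by("A Corp", control_edges(relationships, 0.2, 10.0))
```

With the strict settings, only B Ltd and C GmbH appear under A Corp; loosening both thresholds also brings in D Inc and E SA. The same underlying statements produce different networks depending on the judgements you are willing to make.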

Given the complexity of defining corporate control, it’s impossible to guarantee that any corporate graphs are accurate, but this is precisely the strength of our Open Data approach: we have no incentives to hide the sources of our information, and you have every incentive to help improve our data for the public good.

Our way of modelling networks of corporate control is just one suggested way; we encourage you to use our modelling as inspiration, and build on OpenCorporates’ open data platform to produce your own models. Let us know what you manage to find out, and we’ll help you share this information with the community!