OpenCorporates’ mission is to increase corporate transparency by making official company information more accessible, more useful, and more usable for the public benefit. We aim, ultimately, to list every single company (legal entity) in the world.
We believe our deep expertise, our technology and processes, and the ‘many eyes’ of hundreds of thousands of disparate users mean OpenCorporates company data is the best in the world. However, bringing this information together is complex work. The OpenCorporates data team is constantly working to expand its coverage while meeting a rigorous set of data quality standards, and of course maintaining existing datasets.
This is the third in a series of data-focused blog posts explaining what goes on behind the scenes when we bring a new jurisdiction into OpenCorporates.
Got a question? Email us: community[@]opencorporates.com
In this series, we’ve previously covered how the OpenCorporates Data Team finds new sources of company data, and chooses the most appropriate one. We’ve also written about how we analyse it to understand the company register in depth and map data attributes to the OpenCorporates Company data model schema. Our goal throughout is to set and maintain the highest standard of data quality.
In Part 3, we’re covering Development – where our developers convert the analysis documentation into working code that fetches and transforms the data into a format that can be regularly imported to OpenCorporates. This work is necessarily complex in nature, but we’ve attempted to explain the concepts in non-technical language wherever possible.
Fetching the data
We follow a standard pattern for obtaining data, based on how the register makes its company information available for use. Data is mainly sourced via:
- Data dumps – company data that is made available in bulk for direct download, perhaps in CSV, XML or spreadsheet formats
- API – some registers make structured data available through APIs, a method which allows OpenCorporates’ servers to directly talk to the register’s servers (perhaps using SOAP or REST web services) and obtain information about companies in JSON or XML format
- Web scraping – where we automate the searching and browsing of company registers’ websites and parse the resulting web pages or PDF files to extract the required information.
Sometimes a combination of the above is used; however, we prefer the first two over web scraping, for reasons we explain below.
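To make the first route concrete, here is a minimal sketch of parsing a bulk CSV data dump into company records. The column names (`CompanyNumber`, `CompanyName`, `Status`) are hypothetical – every register uses its own layout – and the method name is ours, not part of any OpenCorporates code.

```ruby
require "csv"

# Illustrative only: parse a bulk CSV "data dump" into company hashes.
# Column names are invented; real registers each have their own layout.
def parse_csv_dump(raw)
  CSV.parse(raw, headers: true).map do |row|
    {
      company_number: row["CompanyNumber"],
      name:           row["CompanyName"],
      current_status: row["Status"]
    }
  end
end

dump = <<~CSV
  CompanyNumber,CompanyName,Status
  12345,ACME WIDGETS LTD,Active
  12346,EXAMPLE HOLDINGS LTD,Dissolved
CSV

records = parse_csv_dump(dump)
```

API responses in JSON or XML follow the same shape: a retrieval step, then a mapping step from source field names to a common record format.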
OpenCorporates has developed an in-house data fetching framework that’s configured so that three common tasks can be carried out:
- Finding and retrieving new companies
- Refreshing known companies to update them with the latest information from the register
- Transforming the retrieved data into the standard OpenCorporates company data model and saving the record
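The three tasks above can be sketched as a base class with hooks that each jurisdiction-specific fetcher overrides. This is purely illustrative – class and method names are invented, and the real openc_bot gem has its own API.

```ruby
# A minimal sketch of the fetcher framework pattern described above.
# Names are invented for illustration; the real openc_bot gem differs.
class CompanyFetcher
  def initialize
    @saved = []
  end
  attr_reader :saved

  # The three common tasks: find new companies, refresh known ones,
  # and transform + save each retrieved record.
  def run
    fetch_new_records.each { |raw| save_entity(transform(raw)) }
    refresh_known_records.each { |raw| save_entity(transform(raw)) }
  end

  # Jurisdiction-specific fetchers override these hooks.
  def fetch_new_records
    raise NotImplementedError
  end

  def refresh_known_records
    [] # default: nothing to refresh
  end

  def transform(raw)
    raise NotImplementedError
  end

  private

  def save_entity(record)
    @saved << record # stand-in for the real SQLite upsert
  end
end

# A hypothetical jurisdiction-specific fetcher.
class DemoFetcher < CompanyFetcher
  def fetch_new_records
    [{ "nr" => "1", "nm" => "ACME LTD" }]
  end

  def transform(raw)
    { company_number: raw["nr"], name: raw["nm"] }
  end
end
```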
The framework contains additional tried and tested libraries that are useful for web scraping, such as dealing with data in tabular format, parsing textual dates into standard date formats, or managing different ways of finding new companies (e.g. by incrementing company numbers).
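Two of those helper behaviours can be sketched briefly – normalising textual dates and incrementing company numbers. The method names here are our own inventions for illustration.

```ruby
require "date"

# Normalise assorted textual date formats to ISO 8601, or nil if unparseable.
def normalise_date(text)
  Date.parse(text).iso8601
rescue ArgumentError
  nil
end

# Find candidate new companies by incrementing the numeric part of a
# company number, preserving zero padding, e.g. "C000123" -> "C000124".
def next_company_number(number)
  number.sub(/\d+\z/) { |digits| format("%0#{digits.size}d", digits.to_i + 1) }
end
```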
In addition, the framework:
- Keeps count of the number of new/existing companies it has retrieved and processed
- Manages state so it can pick up from where it left off
- Validates parsed data against the OpenCorporates Company schema
- Handles errors with alerting, logging and exception management
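As a toy stand-in for the schema validation step, the check below rejects records missing required fields. The real framework validates against the full OpenCorporates Company schema; the field list here is an invented subset.

```ruby
# Toy stand-in for schema validation; the real framework validates
# against the full OpenCorporates Company schema.
REQUIRED_KEYS = %i[company_number name jurisdiction_code].freeze

def validation_errors(record)
  REQUIRED_KEYS
    .reject { |k| record[k] && !record[k].to_s.empty? }
    .map { |k| "missing #{k}" }
end
```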
The data retrieved from each source is saved in its own SQLite database file. This isolates the data from the other fetchers and from the core OpenCorporates database. Data can also be more easily moved between servers, and SQLite allows our analysts to easily query/profile the data for quality assurance testing. The bot framework provides generic methods for handling save functions (e.g. UPSERT/MERGE type operations – “insert or update”) against the database.
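The “insert or update” pattern maps onto SQLite’s native `INSERT ... ON CONFLICT ... DO UPDATE` statement. The sketch below shows that SQL (with invented table and column names) and emulates its semantics in memory so the behaviour can be demonstrated without a database.

```ruby
# SQLite's native upsert pattern; table/column names are illustrative.
UPSERT_SQL = <<~SQL
  INSERT INTO companies (company_number, name, retrieved_at)
  VALUES (:company_number, :name, :retrieved_at)
  ON CONFLICT (company_number) DO UPDATE
    SET name = excluded.name, retrieved_at = excluded.retrieved_at
SQL

# In-memory emulation of the same "insert or update" semantics,
# keyed on the company number.
def upsert(table, record, key: :company_number)
  existing = table[record[key]] || {}
  table[record[key]] = existing.merge(record)
  table
end
```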
Custom fetcher code
Each jurisdiction-specific fetcher has its own code that extends or overrides the framework with functionality specifically designed for the registry in question. This mainly consists of the logic that maps and parses the source data fields to the OpenCorporates Company object. This mapping is done based on the analysis work previously carried out (see Part 2 of this series of blogs).
The developer starts by writing the custom fetcher code and unit tests, and conducts an exploratory partial fetch of the dataset, so that we can preview the parsed data to support analysis and the development of the QA tests. The parsed sample data is used for testing purposes so that it can be reviewed on an ongoing basis as bugs are fixed or parsing logic altered. It is tested against expected results, checked for validity against the Company schema, and compared to previous runs for consistency or regression issues.
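The consistency check against previous runs can be sketched as a field-by-field diff between a freshly parsed record and the output of an earlier, reviewed run. The helper name and field names are illustrative, not the team’s actual test code.

```ruby
# Sketch of a regression check: compare a freshly parsed record with the
# stored output of a previous run, surfacing any fields that changed.
def regression_diff(previous, current)
  (previous.keys | current.keys).select { |k| previous[k] != current[k] }
end

previous = { "company_number" => "123", "name" => "ACME LTD",     "status" => "Active" }
current  = { "company_number" => "123", "name" => "ACME LIMITED", "status" => "Active" }
changed  = regression_diff(previous, current)
```

A non-empty diff flags the record for review before the parsing change is accepted.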
Once all the tests pass, a full dataset is retrieved and processed so that a full test suite can be run.
Many scrapers are written to cope with change and ambiguity in data. We deliberately take the opposite approach, as we believe this is the only way to rigorously ensure high data quality. Specifically, the code is written to be brittle – i.e. it breaks easily if the source changes. We want to know as soon as possible if the source has changed its layout or added/removed data attributes; otherwise errors could be introduced into the main OpenCorporates database. We are alerted to these breaks and can then quickly develop and test changes to the fetcher code to get it back up and running.
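Deliberate brittleness might look like the sketch below: rather than silently ignoring unexpected input, the parser raises as soon as the source’s fields differ from what was analysed. The field names are illustrative.

```ruby
# Deliberately brittle parsing: raise loudly if the source layout changes,
# rather than guessing. Field names are invented for illustration.
EXPECTED_FIELDS = %w[CompanyNumber CompanyName Status].freeze

def parse_row!(row)
  unexpected = row.keys - EXPECTED_FIELDS
  missing    = EXPECTED_FIELDS - row.keys
  unless unexpected.empty? && missing.empty?
    raise "Source layout changed: unexpected=#{unexpected} missing=#{missing}"
  end
  {
    company_number: row["CompanyNumber"],
    name:           row["CompanyName"],
    current_status: row["Status"]
  }
end
```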
Our main company fetcher code and frameworks are written in Ruby, and we use several helper gems, including Nokogiri, Mechanize, ScraperWiki and our own openc_bot gem. Data is persisted in SQLite. We use GitHub for source code management, Jira for managing our development tasks, and we broadly follow the Kanban agile software development methodology for managing the backlog and prioritising work in hand, supported by dashboards showing the status of the bots.
A note on web scraping
Our preference is to avoid scraping if possible, and we encourage registrars to make their data available as Open Data under a permissive re-use licence so that the data is more readily accessible to the public.
Where OpenCorporates takes the approach to scrape data from web pages, it does so in an ethical way, to minimise any impact on the source website. In most cases, data fetching takes place out of local business hours, when the source website will be least in demand. We only retrieve the minimum of data needed to keep our data fresh, and we often pause between requests so that we don’t flood the server.
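A simple sketch of that polite-scraping behaviour: pause between requests, and only run inside a configured off-peak window. The class name, delay, and hours below are invented examples, not OpenCorporates’ actual settings.

```ruby
# Sketch of polite scraping: throttle requests and restrict fetching to
# an off-peak window. Values here are illustrative, not real settings.
class PoliteScheduler
  def initialize(delay: 2.0, window: (20..23))
    @delay  = delay    # seconds to sleep between requests
    @window = window   # allowed hours of day, local to the source website
  end

  def off_peak?(now = Time.now)
    @window.cover?(now.hour)
  end

  def throttle
    sleep(@delay)
  end
end
```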
Once these processes are complete, we have an output which can be tested for data quality issues and is in a format that can be imported for use by the OpenCorporates website and API.
The following diagram illustrates the end-to-end process, starting with the source register (in this example ZEFIX, the Swiss Central Business Index, whose SOAP web service we use to pick up daily updates), followed by the fetcher code, the parsed output in JSON format, and the imported data.
Next steps before ingestion
This record is now in a format that is ready for Quality Assurance and ingestion, the next stage in the development process. We’ll be covering this in more depth in our next blog post.