Since we launched the dataset of over 5 million German companies earlier this week, we’ve had lots of questions about how we assembled the data. This post aims to answer those questions. It’s quite detailed and technical, but we hope it will be of interest to a technical audience at least. Apologies that it is not in German, but do feel free to translate!
(A reminder: as well as being on OpenCorporates, the dataset can be downloaded from offeneregister.de, run by the Open Knowledge Foundation Deutschland, to whom we donated the data, and who in almost zero time built a great website around it.)
First, a general overview:
- There are 5.3 million companies in the dataset, of which 2.3 million are currently registered and 2.9 million are “removed”.
- Virtually every major datapoint comes not from Handelsregister.de but from the gazette notices at Handelsregisterbekanntmachungen.de. These notices announce a company’s incorporation or dissolution, or a change in its officers, district court, or address. The text is unstructured and inconsistent, but we have been able to successfully extract data from them.
- These notices are numbered sequentially, and so we iterate through them, parsing the notices and extracting data – in particular the district court (Amtsgericht) with which the company is registered, the identifier (company number) issued by that court, and the company type (HRB, HRA etc), which is used in the next step. We also parse the other information from the notices, including information on officers, addresses, and also changes in court registration (when the entity moves from one district court to another).
- Parsing the officers is particularly difficult, given the free-text nature of the data, and we in general prioritise quality over quantity, i.e. we don’t want to generate lots of bad data (false positives) in our hunt for every last genuine result. This means that there will definitely be some missing officers, and also a small number of minor parsing issues, particularly in parsing names into their constituent parts. So please email firstname.lastname@example.org with any issues that you see – whether missing officers, or other problems. It’s also worth pointing out that if the underlying data were made available to all as open data – and not just to those who pay to have privileged access – this would cease to be a problem.
- We then visit the Handelsregister search page to see if the company with those details exists on the register. Some don’t – mainly as a result of court reorganisations as well as changes in company numbers on the Handelsregister that are not reflected in the original gazette notices, mostly affecting registered clubs and associations (meaning that a court/number based search using the legacy detail does not return a matching entry). We then retrieve the company name and the current status from the search results. The company name is a little tricky to parse from the gazette notices, given the dirtiness of the data, but it’s not impossible, should we need to rely on just the gazette notices themselves. The status could also be derived from these notices, and that’s something we’re considering.
- We are not currently taking any information from the “entity data” details page for the company on Handelsregister – the incorporation date, share capital, legal form, deletion date, registered address. We looked at scraping this and did some tests, but it’s a lot of requests and we haven’t yet figured out how to do it without putting a strain on the source, which we don’t wish to do. In fact, we hope that now the data has been made available as open data, the need for others to scrape the register is much reduced.
- A tiny amount of information was manually obtained from searching the Handelsregister website. 22 companies with more than one headquarters (“Doppelsitz” or “Mehrfachsitz”) were identified by means of the advanced search functionality on the website. This information was then manually transformed into “Alternate Registration” data and inserted into the relevant company in our dataset. Manual collection is not something we usually do – but in this case it was necessary.
- Data for around 45,000 companies has been sourced solely from Handelsregister search listings – these are mostly inactive companies that pre-date the publication of electronic gazette notices at Handelsregisterbekanntmachungen.
- There are a few pieces of data returned in Handelsregister searches that are stored in our database but are actually redundant, and which we may remove. These include the Registered Office town (we extract the full registered office from the gazette notice), and different representations of the company number and Amtsgericht.
- Another key piece of work that we have done was matching different registrations of the same entity together – this happens when a legal entity moves from one area, and thus Amtsgericht, to another. The gazette notices about these moves are also very messy, and we’ve done a huge amount of work to figure out the situation and represent it as data – something the Handelsregister hasn’t done. For these situations, there are usually two notices: one for the existing court stating that the registration has moved to the new court, and one for the new court stating that the registration has come from the old jurisdiction.
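To give a flavour of the first parsing step described above, here is a minimal Python sketch of pulling the court, register type and company number out of a notice. It assumes the notice header has the shape `Amtsgericht <court> Aktenzeichen: <type> <number>`; that shape, the list of register types, and the names `RegisterRef` and `parse_register_ref` are illustrative assumptions, not a description of the production parser. In keeping with the quality-over-quantity approach, it returns nothing rather than guessing when the header does not match.

```python
import re
from typing import NamedTuple, Optional


class RegisterRef(NamedTuple):
    court: str          # district court (Amtsgericht) name as printed
    register_type: str  # e.g. HRB, HRA
    number: str         # identifier issued by that court


# Assumed header shape; real notices are messier than this.
HEADER_RE = re.compile(
    r"Amtsgericht\s+(?P<court>.+?)\s+Aktenzeichen:\s*"
    r"(?P<type>HRA|HRB|GnR|PR|VR)\s*(?P<number>\d+)\b"
)


def parse_register_ref(text: str) -> Optional[RegisterRef]:
    """Extract (court, type, number), or None if the header is not recognised."""
    match = HEADER_RE.search(text)
    if match is None:
        return None  # prefer a missing record over a wrong one
    court = " ".join(match.group("court").split())  # collapse stray whitespace
    return RegisterRef(court, match.group("type"), match.group("number"))


if __name__ == "__main__":
    # Made-up notice text, for illustration only.
    sample = "Amtsgericht Frankfurt am Main Aktenzeichen: HRB 12345 Bekannt gemacht am: ..."
    print(parse_register_ref(sample))
```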
The steps in detail:
1. Scrape a gazette notice from Handelsregisterbekanntmachungen.de, e.g. https://www.handelsregisterbekanntmachungen.de/en/skripte/hrb.php?rb_id=350704&land_abk=ni
2. Parse the gazette notice and attempt to extract the following data:
   a. Company number
   b. Court name (Amtsgericht)
   c. Event date
   d. Publication date
   e. Type of notice (New, Amendment, Deletion)
   f. Related registration (subsequent or previous registration), including details of the related court and company number
   g. Officers – name, city, date of birth, position, type (company or person, derived)
   h. Registered address
3. Match the gazette notice court name from 2b to a valid XJustiz court ID so that it can be linked to the Handelsregister. This requires fuzzy matching, as the court names on Handelsregisterbekanntmachungen notices are often misspelled
4. Compute the OpenCorporates company number for the gazette notice
5. For the computed number, search the Handelsregister for that company
6. From the Handelsregister search listings, capture the name, the current status, and whether additional data is available from the register
7. Construct the final company object:
   a. Attempt to cross-match new officers against the existing list so that any resignation of officers can be marked with an end_date
   b. Construct the officers array based on 7a
   c. Construct related_registrations based on the information parsed from 2f
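The fuzzy matching of misspelled court names to court IDs could look something like the following Python sketch, which uses the standard library's `difflib`. The lookup table, the placeholder IDs, the 0.85 cutoff and the function names are all assumptions made for illustration; the matching method actually used is not described here.

```python
import difflib
from typing import Optional

# Hypothetical lookup table: court name -> court ID (placeholder values).
COURT_IDS = {
    "Hannover": "X0001",
    "Hamburg": "X0002",
    "Walsrode": "X0003",
    "Frankfurt am Main": "X0004",
}


def normalise(name: str) -> str:
    """Lower-case and collapse whitespace so trivial differences don't matter."""
    return " ".join(name.casefold().split())


# Map normalised names back to the canonical spelling.
_CANONICAL = {normalise(name): name for name in COURT_IDS}


def match_court(raw_name: str, cutoff: float = 0.85) -> Optional[str]:
    """Return the ID of the closest known court, or None if nothing is close."""
    candidates = difflib.get_close_matches(
        normalise(raw_name), list(_CANONICAL), n=1, cutoff=cutoff
    )
    if not candidates:
        return None  # a strict cutoff means we would rather miss than mismatch
    return COURT_IDS[_CANONICAL[candidates[0]]]
```

A high cutoff keeps false positives down at the cost of leaving some notices unmatched, which fits the general preference for quality over quantity.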
And here’s a concrete example. The JSON for PPP3 UG (haftungsbeschränkt) looks like: https://gist.github.com/CountCulture/e30c192b14018ffa3c563fa0b432f441
This information was derived from the following gazette notices:
- https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=350099&land_abk=ni Includes the move of the registration from one Amtsgericht to another
- https://www.handelsregisterbekanntmachungen.de/skripte/hrb.php?rb_id=351890&land_abk=ni Includes the move of the registration from one Amtsgericht to another
We will be posting more details on the Amtsgericht moves and the problems with German company identifiers soon.
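As a rough illustration of how a pair of move notices like these can be tied together, the Python sketch below treats each notice as a claim about an (old registration, new registration) link and keeps only links claimed from both sides. The `MoveNotice` structure, its field names and the rule of requiring both notices are simplifying assumptions; the real notices are far messier than this.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

Registration = Tuple[str, str]          # (court, company number)
Link = Tuple[Registration, Registration]  # (old registration, new registration)


@dataclass(frozen=True)
class MoveNotice:
    court: str         # court that published the notice
    number: str        # company number at that court
    direction: str     # "to": moved away to other; "from": arrived from other
    other_court: str
    other_number: str


def claimed_link(notice: MoveNotice) -> Link:
    """Turn one notice into the (old, new) link it asserts."""
    own = (notice.court, notice.number)
    other = (notice.other_court, notice.other_number)
    if notice.direction == "to":
        return (own, other)
    if notice.direction == "from":
        return (other, own)
    raise ValueError(f"unknown direction: {notice.direction!r}")


def confirmed_links(notices: List[MoveNotice]) -> List[Link]:
    """Keep only links asserted by both the old court and the new court."""
    seen: Dict[Link, Set[str]] = {}
    for notice in notices:
        seen.setdefault(claimed_link(notice), set()).add(notice.direction)
    return sorted(link for link, sides in seen.items() if sides == {"to", "from"})
```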
Photograph by Alex Skene
EU Horizon 2020
The collection of the German company data has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 780247, TheyBuyForYou