How we model data: Making gazette notices standard and comparable

A few weeks ago, we announced the launch of OpenGazettes, a new project that will make the information in government gazettes more accessible. Gazettes are periodicals in which notices are published to comply with legal requirements for public notice. Corporations are required to publish notices of corporate events in gazettes, including notices of liquidation, dissolution, winding-up orders, annual general meetings, and director actions. This rich and timely information is valuable to a variety of users, from banks performing due diligence, to journalists investigating companies, to companies collecting intelligence on the market.

As a next step towards the public release of OpenGazettes, we are sharing a first version of a schema for gazette notices. (A schema describes a standard way of storing information.) To see the schema in practice, take a look at these two examples of it being used to store information about a French gazette notice.

The challenge in developing a schema for gazette notices is the variability across jurisdictions in what information is published and how it’s published. For example, Ireland’s Iris Oifigiúil is published as PDF files (see the December 15 issue), from which we can identify about 20 data elements (“what’s a data element?”); these data elements need to be carefully extracted from the PDF’s unstructured text. On the other hand, France’s BODACC is published as XML files (see their open data page), which provide a clear structure for about 75 data elements.

Most gazettes in the world are more like Ireland’s than France’s. In this first version, our priority is to provide a way to store the common data elements that most gazette notices share, while maintaining a way to store the additional data elements that some gazettes publish.

Our approach to schema development

Our approach to developing a new schema is straightforward. First, we look at the source information (whether it’s a PDF or an XML file) and identify data elements – like the title of the notice, the date of the issue, or the names of companies in the text of the notice. We list all these data elements in a spreadsheet. The purpose of this first step is to establish the universe of things to represent in the schema.
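To make this first step concrete, here is a minimal sketch of such an inventory in Python. The gazette keys and element names are hypothetical and chosen only for illustration; they are not the actual contents of our spreadsheet. The sketch separates the data elements that every gazette shares from the additional ones that only some gazettes publish.

```python
# Hypothetical inventory: data elements spotted in each gazette.
# The names are illustrative assumptions, not the real list.
elements_by_gazette = {
    "ie_iris_oifigiuil": {
        "notice_title", "issue_date", "company_name", "notice_text",
    },
    "fr_bodacc": {
        "notice_title", "issue_date", "company_name",
        "registration_number", "court_name",
    },
}


def common_elements(inventory):
    """Return the data elements that every gazette in the inventory publishes."""
    element_sets = list(inventory.values())
    return set.intersection(*element_sets) if element_sets else set()


def additional_elements(inventory):
    """Return, per gazette, the elements that are not common to all gazettes."""
    shared = common_elements(inventory)
    return {name: elements - shared for name, elements in inventory.items()}


common = common_elements(elements_by_gazette)
extras = additional_elements(elements_by_gazette)

print(sorted(common))
for gazette, elements in extras.items():
    print(gazette, sorted(elements))
```

In practice the inventory lives in a spreadsheet, but the same split between common and additional elements is what drives the priorities described earlier.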

Once we’ve looked at a few gazettes and feel confident that we have a good sample of all the data elements, the next step is to organize the data elements into sensible collections. (You may be familiar with the term “class” for a collection, and any of the terms “property”, “attribute” or “field” for a data element – especially if you are familiar with object-oriented programming.) Once this organization is done and the relationships between the different collections are defined, we have a taxonomy – not unlike the biological classification of animals with species, genus, family, etc. Creating a taxonomy is more like applied philosophy (ontology) than anything relating to programming. We create a new column in the spreadsheet to assign each property to a class.
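As a rough illustration of the class column, the sketch below reads a small spreadsheet exported as CSV and groups each property under its class. The class names (Notice, Issue, Company) and the property names are assumptions made up for this example, not the published taxonomy.

```python
import csv
import io
from collections import defaultdict

# Hypothetical spreadsheet export: one row per property, with its assigned class.
SPREADSHEET = """property,class
title,Notice
body,Notice
date_published,Issue
edition_number,Issue
name,Company
company_number,Company
"""


def group_by_class(csv_text):
    """Group property names by the class assigned to them in the spreadsheet."""
    classes = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        classes[row["class"]].append(row["property"])
    return dict(classes)


taxonomy = group_by_class(SPREADSHEET)

for class_name, properties in taxonomy.items():
    print(class_name, properties)
```

Keeping the assignment in a single column makes it cheap to move a property from one class to another while the taxonomy is still being debated.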

Once we have our taxonomy, we refine the terms used to refer to its classes and properties, by choosing terms that are more precise, more consistent with other taxonomies (external consistency), more consistent with OpenCorporates’ other taxonomies (internal consistency), or that more clearly describe their meaning (i.e. semantics). Open Knowledge’s Linked Open Vocabularies is a great resource for finding other taxonomies.

Once we’re satisfied with our choice of terms (our vocabulary), we can turn to the implementation details of the schema, like what data types (for example, integer, text, list) are allowed as values for each property. We might also constrain the valid values by describing the format (for example, ISO 8601 for dates) or by using a code list (like ISO 3166-1 alpha-2 codes for countries).
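The two kinds of constraint mentioned above, a format and a code list, could be checked along these lines. This is only a sketch under stated assumptions: the property names date_published and jurisdiction_code are hypothetical, and the code list is a three-entry subset of ISO 3166-1 alpha-2 used for illustration.

```python
from datetime import date

# Illustrative subset of ISO 3166-1 alpha-2 codes; a real code list is much longer.
COUNTRY_CODES = {"FR", "IE", "GB"}


def is_iso_date(value):
    """Return True if value is a string that parses as an ISO 8601 calendar date."""
    if not isinstance(value, str):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def is_country_code(value):
    """Return True if value is a string found in the code list."""
    return isinstance(value, str) and value in COUNTRY_CODES


def validate_notice(notice):
    """Return a list of problems found in a notice (hypothetical property names)."""
    errors = []
    if not is_iso_date(notice.get("date_published")):
        errors.append("date_published must be an ISO 8601 date")
    if not is_country_code(notice.get("jurisdiction_code")):
        errors.append("jurisdiction_code must be in the code list")
    return errors


print(validate_notice({"date_published": "2015-12-15", "jurisdiction_code": "IE"}))
print(validate_notice({"date_published": "15/12/2015", "jurisdiction_code": "Ireland"}))
```

A format constrains the shape of a value, while a code list constrains it to a closed set; most properties need only one of the two.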

At last, we can describe all this information using JSON Schema, a format for describing schemas in JSON, and then author supporting documentation. Throughout this process, we’re soliciting feedback internally and externally to stay on the right track.
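To give a feel for what a JSON Schema document looks like, here is a small hypothetical fragment written as a Python dictionary, together with a deliberately tiny checker. The property names are assumptions for illustration and this is not the published gazette notice schema. The checker understands only the "required", "type" and "enum" keywords; a real project would rely on a full JSON Schema validator library instead.

```python
import json

# Hypothetical schema fragment; property names are illustrative assumptions.
NOTICE_SCHEMA = {
    "type": "object",
    "required": ["title", "date_published"],
    "properties": {
        "title": {"type": "string"},
        "date_published": {"type": "string"},
        "jurisdiction_code": {"type": "string", "enum": ["FR", "IE"]},
    },
}

# Maps the few JSON Schema type names used above to Python types.
JSON_TYPES = {"string": str, "array": list, "object": dict}


def check(instance, schema):
    """Check an object against the 'required', 'type' and 'enum' keywords only."""
    errors = []
    for name in schema.get("required", []):
        if name not in instance:
            errors.append(f"missing required property: {name}")
    for name, rules in schema.get("properties", {}).items():
        if name not in instance:
            continue
        value = instance[name]
        if not isinstance(value, JSON_TYPES[rules["type"]]):
            errors.append(f"{name} must be of type {rules['type']}")
        elif "enum" in rules and value not in rules["enum"]:
            errors.append(f"{name} must be one of {rules['enum']}")
    return errors


# The schema itself is plain JSON, so it can be published as a file.
print(json.dumps(NOTICE_SCHEMA, indent=2))
print(check({"title": "Avis de liquidation", "date_published": "2015-12-15"}, NOTICE_SCHEMA))
print(check({"title": 5, "jurisdiction_code": "DE"}, NOTICE_SCHEMA))
```

Because the schema is data rather than code, the same document can drive validation, documentation and tooling in any programming language.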

What’s next

As we collect data from more jurisdictions, we will make iterative improvements to the schema. If you have experience working with gazettes, we hope you’ll take a look at the schema, and send us your comments and feedback.

We are taking this opportunity to release all our JSON Schemas under the MIT license, meaning the schemas and any contributions will be free for anyone to use, modify and distribute. Our other schemas include company, license and filing schemas, among others.


OpenGazettes is supported by ODINE.

If you would like to be part of our contributor community, join our Slack.