Case study: Fixing broken government information with open data

This is a tale of good intentions, bad practices, and broken processes. Although it’s superficially about charity and company data, it’s really about how the public sector is failing to use the open data now being produced, and in the process increasing costs and stifling promising open data business models.

One of the underlying themes of the open data projects we’ve been involved with is surfacing the real-world connections between things, and making them available as open data. With OpenlyLocal it was (initially) the members of councils, the wards they represent and the committees they sit on. Pretty basic, but an essential step.

With OpenCorporates it’s been even more basic: first build an open database of companies so that you can match other data against them, from US Securities & Exchange Commission data to UK central & local government spending data, to environmental enforcement notices. It’s essential data plumbing for the 21st century.

And the sad truth is, it doesn’t take much to build something that’s better than the ‘official’ sources, particularly when they struggle to ‘get’ data, and their internal processes prevent them from sorting out their own problems in a lightweight way.

Take the Charity Register for England & Wales, for example. This is maintained by the Charity Commission, and like many legacy databases, it’s a bit of a mess. Since OpenCharities started making it available as open data we’ve been fixing some of the basic errors, initially the website URLs for the charities, a significant proportion of which are invalid.

We’ve also been adding social networking links, and matching spend and grant money received by them. Just that alone gave a dataset that in some key respects was richer, more useful, more correct than the official register. (Of course keeping it up to date is an issue, particularly when you only sporadically receive the data from the Commission, and then by way of a CD with hopelessly formatted data – one reason why we’re considering going back to scraping a million pages of the register every week.)

But now, we’ve gone one stage further, specifically matching charities to the associated companies. Maybe you didn’t realise there were such links? Well, in the UK, when we talk about charities we’re making a mistake as what we actually mean is some type of organisation (an association, a company) that has registered for a charitable purpose. And, up to now, there were officially about 23,000 charities whose base organisation was a company.

I say officially, because it’s been known for a while that there were quite a few more – that’s only the number of charities the Commission has a company number for. At some point in the past, the Commission decided that, among the reams of information it collects, the company number no longer needed to be included. That’s right: the information that categorically identifies the legal body wasn’t important enough to record on their huge database.

Recently, it seems, new management and wiser heads have prevailed, and the Commission now realises it needs to collect this data (in fact I understand there was even talk of a heavyweight linking of their system with Companies House’s, though I think that’s now rightly a dead duck).

So sure, they can now collect that information, but what about all those holes in the record? How are they going to fix that? The old gov way would have been to scope out a piece of work, spend months (years?) and tens of thousands of pounds in staff and overheads to discuss it, put it out to tender, and award it to a traditional outsourcing company, who eventually would come up with some results that satisfied the tender and cost a small fortune (on top of the overheads of scoping out and awarding/managing the process).

Very much business as usual – and it’s a process that keeps thousands of civil servants occupied each day: scoping out projects… that will need to be tendered… to be bid for by companies… whose structure is designed to fit this process… one of which will win the contract… on a price that includes the overhead of the tender process… and eventually some work may get done.

The alpha gov solution is rather different. Do it quickly, do it cheaply, do it intelligently, and focus on the outcome. We’re never going to achieve perfection (and in fact the Charity Register, like most big databases, contains a fair few errors), but we can quickly get ourselves to a situation that is massively better than the current one… especially if we had a database of all the charities in the UK, and a database of all the companies in the UK, both available as open data. Oh, and if there was a nice tool for reconciling the two, that would be even better.

It would be nice to report that this story had a happy ending, that the Commission had decided to ask us to use OpenCharities, OpenCorporates and our Google Refine reconciliation service to match the two, and indeed we started talking to them some six months ago about this. Unfortunately they’ve been unable to come to a decision, even after all this time.

And so we decided to do it anyway. Partly to surface the wider issue, because unless it is solved the benefits of open data will not properly be realised by the public sector, or the wider population (and it’s to be hoped that this will be one of the outcomes of the UK Chancellor’s recent statement on open data). And partly because we wanted to add the register entries to OpenCorporates, and didn’t want to have only the Commission’s partial data.

The results were interesting. We matched with a very high degree of certainty 15,235 charities without company numbers to actual companies (that is their governing document indicated they were likely to be a company, and the names were identical after allowing for ‘&/and’-type issues, or where they could be seen by a human to be clearly the same).
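To give a flavour of what allowing for ‘&/and’-type issues means in practice, here is a minimal Python sketch of name normalisation followed by exact matching. It is an illustration only, not the code we actually ran: the function names, the tiny suffix table and the shape of the input lists are all assumptions made for the example.

```python
import re

# Hypothetical table of equivalent words; a real matcher would need far more.
_EQUIVALENTS = {"limited": "ltd", "company": "co"}


def normalise_name(name):
    """Reduce an organisation name to a canonical form for exact comparison."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)  # punctuation becomes whitespace
    words = [_EQUIVALENTS.get(word, word) for word in name.split()]
    return " ".join(words)


def match_charities(charities, companies):
    """Return {charity_id: company_number} for unambiguous name matches.

    charities: iterable of (charity_id, name)      -- assumed input shape
    companies: iterable of (company_number, name)  -- assumed input shape
    """
    index = {}
    for number, name in companies:
        index.setdefault(normalise_name(name), []).append(number)

    matches = {}
    for charity_id, name in charities:
        candidates = index.get(normalise_name(name), [])
        if len(candidates) == 1:  # zero or several candidates need a human
            matches[charity_id] = candidates[0]
    return matches
```

Anything with no candidate, or with more than one, is deliberately left unmatched so that it can be looked at by a person rather than guessed at.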

There were another 3,607 entries which were possible companies (based on the governing document), but which couldn’t easily be matched without more investigation, either because there were no companies at all similar in name, or because there were several dissolved ones of the same name and it wasn’t clear without more investigation which one was the correct one.

We also discovered that quite a substantial number of the existing company numbers that were in the Charity Register didn’t exist, meaning their existing data is wrong (so in the register this charity, for example, has the company number of 5383670 whereas it should actually be 5385670 – the second ‘3’ should be a ‘5’). We’ll be correcting these in the next few days as far as is possible.
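A single mistyped digit like the one above is the kind of error that can be found mechanically. The sketch below is a hedged illustration rather than our actual process: it assumes company numbers are held as strings and that a set of all known company numbers is available, and both function names are made up for the example.

```python
def one_digit_variants(number):
    """Yield every number that differs from `number` in exactly one digit."""
    for i, original in enumerate(number):
        for digit in "0123456789":
            if digit != original:
                yield number[:i] + digit + number[i + 1:]


def suggest_corrections(recorded, known_numbers):
    """Suggest known company numbers one digit away from a recorded one.

    Returns an empty list if the recorded number already exists.
    `known_numbers` is assumed to be a set of company-number strings.
    """
    if recorded in known_numbers:
        return []
    return sorted(v for v in one_digit_variants(recorded) if v in known_numbers)
```

With the example from the register, `suggest_corrections("5383670", {"5385670"})` proposes `5385670`. A suggestion is only a candidate: it still has to be confirmed by comparing the company name with the charity name before the record is changed.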

Finally, it was clear when matching the data that many of the existing ‘current’ charity entries belonged to companies that had since been dissolved (e.g. this charity is listed in the Charity Register as still alive, but with overdue accounts, whereas according to Companies House it was dissolved on 17 August 2010).
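Once charities are linked to company numbers, spotting this kind of mismatch is a simple cross-check. The following is a rough sketch under assumed data shapes (a list of current charity entries and a dictionary of dissolution dates keyed by company number); the function name is hypothetical and it is not the code we ran.

```python
from datetime import date


def dissolved_but_current(charities, dissolutions):
    """List 'current' charities whose linked company has been dissolved.

    charities:    iterable of (charity_id, company_number) for current entries
    dissolutions: dict of company_number -> dissolution date; live companies
                  are simply absent from the dict (an assumption of this sketch)
    """
    flagged = []
    for charity_id, company_number in charities:
        dissolved_on = dissolutions.get(company_number)
        if dissolved_on is not None:
            flagged.append((charity_id, company_number, dissolved_on))
    return flagged
```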

So a win for the community, as the data is not only out there – it wasn’t on the Charity Commission website, but is now on OpenCharities – but it’s under an open licence, and better quality than the Commission’s original data.