Tech notes: Towards a faster OpenCorporates

We’re always looking for ways to improve OpenCorporates, both our data and how we make it available to our users. More jurisdictions, more data, more insights, all provenanced White Box data that you can trust – and we’re working on some incredibly cool new developments in our data collection system, thanks to a grant from The Mohn Westlake Foundation (we’ll be talking more about this in the coming months). 

But it’s not just the data, it’s also how we get it to users – and as part of a project funded by Luminate, we’ve been working on improving the user experience of the OpenCorporates website and API. As well as UX research, we’ve been looking at improving the speed of the site. To this end, we recently optimised a function deep within our codebase which sped up parts of our website and API quite dramatically.

We thought we would describe this as an illustration of the kind of work we do within the development team at OpenCorporates. 

It’s just one example, though: we do a wide variety of work, including projects that take our technology in new directions, such as the machine learning work currently underway to completely overhaul OpenCorporates’ reconciliation (or entity resolution) functionality.

A lot of the time, though, we’re working behind the scenes: sometimes moving to new technologies, as when we switched from neo4j to TigerGraph, and sometimes to better architectures, as with our ongoing work to carve up our large codebase into more of a microservices architecture. But sometimes, as in this case, we’re simply tweaking the existing codebase.

Our Ruby on Rails app

The core codebase is a Ruby on Rails app. Weighing in at 619,700 lines of code, it is both complicated and complex, particularly within the models layer, which has been handed down by previous developers from years gone by (one of whom is busy being the CEO now).

It might be unfair to describe it as “legacy”, since the code still serves us well and was generally well built with good test coverage, but it presents some challenges, particularly in getting new developers up to speed.

Optimisation efforts & the “best_data_objects” method

Lurking in the models code are a few areas in need of optimisation, particularly as our database grows. We knew this because we had noticed some company pages on our website loading very slowly as the number of “data objects” attached to them grew.

In OpenCorporates a “data object” is how we represent a piece of information that is related to a company, for example a business licence or a shareholding. In the worst cases, we had a few pages that could take over a minute to load. We recently decided, as a team, we would commit to tackling these slow pages.
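
To make the rest of this post concrete, here is a hypothetical sketch of how such models might look in a Rails app. The class, association and column names (Company, DataObject, data_type and so on) are illustrative stand-ins, not our actual schema.

    # Hypothetical Rails models, for illustration only (not our actual schema).
    # A company has many data objects; each data object records one piece of
    # information (a licence, a shareholding, etc.), identified here by a
    # data_type column.
    class Company < ApplicationRecord
      has_many :data_objects
    end

    class DataObject < ApplicationRecord
      belongs_to :company
      validates :data_type, presence: true # e.g. "LicenceData" or "Shareholding"
    end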

We had already pinned it down to one troublesome method called “best_data_objects”. We knew this method kicked off a lot of queries against our SQL database, but we also knew that it was deeply embedded not only in the website code, but also in the logic for inserting new data objects. Changing it seemed risky: there was the risk of breaking a lot of things… but also the promise of speeding up a lot of things!
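
The real method is more involved than we want to reproduce here, but the shape of the problem, written against the illustrative models above, was roughly a per-type loop like this (best_data_objects_slow is our stand-in name, not the real code):

    # Hypothetical "before" shape of a method like best_data_objects (a
    # stand-in, not the real implementation). For every data type attached to
    # the company we go back to the database to pick the "best" (here, simply
    # the most recent) object of that type: one extra SQL query per data type,
    # which gets slower as the number of attached data objects grows.
    class Company < ApplicationRecord
      def best_data_objects_slow
        data_objects.distinct.pluck(:data_type).map do |data_type|
          data_objects.where(data_type: data_type).order(created_at: :desc).first
        end
      end
    end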

So when Ivan Bashkirov, one of our software engineers, stepped forward and developed an optimisation, we all eyed it cautiously at first. The code change seemed simple and logically flawless: it boiled down to an SQL join query replacing a Ruby loop. The tests passed, and the method produced the same output as far as we could see.
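
To give a flavour of the idea (again using the hypothetical models and stand-in names above, not our actual code), the change was in the spirit of pushing the per-type selection into a single query:

    # Hypothetical "after" shape: let the database do the work in one query.
    # A LEFT JOIN of data_objects against itself keeps only the rows for which
    # no newer object of the same type exists for this company. Table, column
    # and method names are still illustrative; the real change was more involved.
    class Company < ApplicationRecord
      def best_data_objects_fast
        newer_join = <<~SQL
          LEFT JOIN data_objects newer
            ON newer.company_id = data_objects.company_id
           AND newer.data_type  = data_objects.data_type
           AND newer.created_at > data_objects.created_at
        SQL

        data_objects.joins(newer_join).where(newer: { id: nil })
      end
    end

The win with this kind of rewrite is that the database does the per-type selection in a single round trip, rather than the app issuing one query per data type.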

But however comprehensive your test coverage, we all know that your tests won’t cover every single case that can occur in the real world. Here, we wanted to be absolutely sure, so we developed an additional, rather brute-force, real-data testing approach. We set a process running in a live production Rails console which compared the method’s output before and after the change for several hundred thousand real company records. In every case, the before and after versions produced the same results.
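
In spirit, the check was a loop along these lines, run from the console (the _slow and _fast names again stand in for the before and after versions of the method):

    # Hypothetical sketch of the brute-force comparison run in a production
    # Rails console: compare the old and new versions of the method on real
    # company records and collect any differences.
    mismatches = []

    Company.find_each(batch_size: 1_000).with_index(1) do |company, i|
      old_ids = company.best_data_objects_slow.map(&:id).sort
      new_ids = company.best_data_objects_fast.map(&:id).sort

      mismatches << company.id if old_ids != new_ids
      puts "#{i} companies checked, #{mismatches.size} mismatches" if (i % 10_000).zero?
    end

    puts mismatches.empty? ? "All results matched" : "Mismatches for company ids: #{mismatches.inspect}"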

Then we deployed, and our troublesome slow pages were slow no more: page load times dropped by a factor of roughly 50!

We have more work to do to bring you company data faster, and of course we continue to expand our database, but we’re pleased about speeding up these slow pages, not least because our servers are humming a little more quietly now.

Want to know more?

Read more about how our tech team has helped make open company data more available and easy to access via OpenCorporates.

