A question we get asked a lot is “what does your tech stack look like?”, so we thought we’d do a brief blog post to answer it. We hope this is useful to other people working on data projects, and we’d be interested to hear how teams at other organisations have tackled similar problems.
Our tech stack is dependable and thoroughly production-tested. The core has been in place since the beginning, but we're constantly evaluating which components could be changed for the better. Leaning on these tools has meant we've been able to consistently improve the OpenCorporates platform, growing it to contain information on 127 million companies.
Ruby on Rails
The main OpenCorporates application is built on top of the tried and tested Ruby on Rails web framework. It allowed OpenCorporates to get up and running quickly in the beginning, and it continues to allow us to develop new features efficiently with a relatively small team.
Sinatra
The API is kept small and nimble, and is built with Sinatra – another Ruby web framework, but one that is significantly smaller in scope than Rails. Sinatra allows us to keep the API codebase simple and separate from the main OpenCorporates website: it simply renders JSON or XML at the API's various endpoints.
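At its core, a JSON endpoint like this is just code that parses a path and returns a serialized hash. Here's a stdlib-only sketch of that idea using a bare Rack-style handler (the endpoint shape and fields are hypothetical, not our actual API):

```ruby
require "json"

# A minimal Rack-style handler: a callable that takes a request env hash
# and returns [status, headers, body]. Sinatra routes compile down to
# something very like this. Hypothetical endpoint and fields.
COMPANY_ENDPOINT = lambda do |env|
  # In Sinatra this would be: get "/companies/:jurisdiction/:number" do ... end
  _, jurisdiction, number = env["PATH_INFO"].match(%r{/companies/(\w+)/(\w+)}).to_a

  body = JSON.generate(
    company: {
      jurisdiction_code: jurisdiction,
      company_number:    number
    }
  )
  [200, { "Content-Type" => "application/json" }, [body]]
end

status, _headers, chunks = COMPANY_ENDPOINT.call("PATH_INFO" => "/companies/gb/00000006")
puts status
puts chunks.first
```

Keeping the handler this thin is what makes it easy to split the API off from the main Rails application.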
MySQL
We use MySQL as our primary relational database. Our MySQL servers house a lot of data, with some tables reaching several billion rows. Because we store so much data, we have to be very careful about how we store it: schema migrations at that scale are painful when we get things wrong.
Due to the nature of the data we're capturing, and the many varied sources from which we get it, we also store supplementary data in serialized form. This allows us to capture details that are unique to a particular data source, or that we're not yet sure how we'll make use of. MySQL isn't the perfect database for this use case, but Rails makes it relatively easy.
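Rails can serialize a whole Ruby hash into a single text column (for example via `serialize` on a model), which is how source-specific extras can be kept without a schema change. A stdlib-only sketch of the round trip – the attribute names here are hypothetical:

```ruby
require "json"

# Supplementary attributes that only one registry provides; rather than
# adding a column per source, the whole hash is serialized into one
# text column. Field names are illustrative.
supplementary = {
  "source"       => "hypothetical-registry",
  "filing_codes" => ["AR01", "CS01"],
  "local_status" => "aktiv"   # a field unique to this source
}

stored   = JSON.generate(supplementary)   # what goes into the text column
restored = JSON.parse(stored)             # what the model sees on read

puts restored["local_status"]  # prints "aktiv"
```

The trade-off is that serialized fields can't be queried or indexed by MySQL directly, which is fine for data we only need to display or process later.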
Elasticsearch
Elasticsearch is an open-source distributed search engine built on Lucene. We use it at OpenCorporates to power the full-text search in our API and web interface, and to filter and facet data through a rich API. We denormalise some of the data, allowing us to search it more easily and making some queries significantly cheaper; the company pages on the OpenCorporates website, for example, pull together data from many different tables. In a sense we're using Elasticsearch almost like a document store, but the indexed documents are always computed from our relational data.
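Denormalising means flattening what would be several joined tables into one document at index time, so a search touches a single index instead of the relational database. A rough sketch with hypothetical rows and field names (the real mapping will differ):

```ruby
require "json"

# Rows as they might come out of separate relational tables.
company  = { id: 1, name: "Example Widgets Ltd", jurisdiction: "gb" }
address  = { company_id: 1, in_full: "1 Example Street, London" }
officers = [
  { company_id: 1, name: "A. Director",  role: "director"  },
  { company_id: 1, name: "B. Secretary", role: "secretary" }
]

# One flat, denormalised document per company, built at index time.
# Searching officer names now needs no join at query time.
DOC = {
  name:               company[:name],
  jurisdiction_code:  company[:jurisdiction],
  registered_address: address[:in_full],
  officer_names:      officers.map { |o| o[:name] }
}

puts JSON.pretty_generate(DOC)
```

The cost is that documents must be re-indexed when the underlying rows change, which is why the relational data stays the source of truth.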
Neo4j
Neo4j is a graph database that helps us find and understand relationships in our data. It's very different to a relational database, and allows us to perform queries on complex relationships that would otherwise not be practical.
We currently use it in a couple of places, such as the network views on some company pages. As we work more to connect data together – including corporate structures and beneficial ownership data – we’ll probably lean on Neo4j more heavily to shine a light on those relationships.
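To give a flavour of why a graph database fits here: Neo4j's query language, Cypher, can walk a chain of relationships that would need many self-joins in SQL. A hypothetical query to find companies within two ownership hops of a given company (the labels and relationship types are illustrative, not our actual schema):

```ruby
# Variable-length pattern: *1..2 follows one or two CONTROLS
# relationships, something that gets unwieldy fast in SQL.
CYPHER = <<~QUERY
  MATCH (c:Company {company_number: $number})<-[:CONTROLS*1..2]-(owner:Company)
  RETURN owner.name, owner.company_number
QUERY

puts CYPHER
```

Extending the traversal depth is a one-character change to the pattern, rather than another round of joins.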
Ceph
Ceph is a distributed system for object, block, and file storage. At the application level we use the object storage in a variety of ways, such as caching filings and other PDF documents from various data sources. We also use it occasionally when scraping, to cache web pages for processing at a later stage.
Resque
Resque is a library for creating, enqueuing, and processing background jobs. We use Resque workers to process jobs in the OpenCorporates application and throughout our data pipeline. We have regular jobs – some of them quite long-running – that fetch, process, and transform data from different sources behind the scenes.
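A Resque job is just a plain Ruby class with a queue name and a class-level `perform` method; `Resque.enqueue(FetchFilingJob, url)` pushes it onto Redis, and a worker later calls `perform` with the same arguments. The job below is hypothetical, and because it's a plain class method it can be run directly too, which keeps jobs easy to test:

```ruby
# Hypothetical background job following Resque's conventions:
# @queue names the queue, self.perform does the work.
class FetchFilingJob
  @queue = :filings

  def self.perform(url)
    # In the real pipeline this would fetch and store the document;
    # here we just simulate the work and return a status string.
    "fetched #{url}"
  end
end

# This is what a worker does with a dequeued job, minus Redis:
puts FetchFilingJob.perform("https://example.com/filing.pdf")
```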
Docker
Docker is a tool for packaging up self-contained application environments, allowing them to run anywhere – a bit like a lightweight virtual machine. We run Resque workers in Docker containers managed by Mesos (see below). We also use containers to sandbox bots in our scraping platform, Turbot, which lets us run code in an isolated environment that can't interfere with the rest of our infrastructure.
Docker is also used by our Jenkins CI server to build images that mirror the production environment for running tests.
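For context, the kind of image definition involved is small. A hypothetical Dockerfile for a worker image – the base image, build steps, and entrypoint are illustrative, not our actual build:

```dockerfile
# Hypothetical sketch of a Resque worker image.
FROM ruby:2.3

WORKDIR /app
COPY Gemfile Gemfile.lock ./
RUN bundle install --deployment

COPY . .

# Each container runs a single worker process; QUEUE=* listens on all queues.
CMD ["bundle", "exec", "rake", "resque:work", "QUEUE=*"]
```

Because the image bundles the exact gem versions and system libraries, the same artifact can run under Mesos in production and on the Jenkins CI server.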
Mesos
Mesos is a platform for abstracting and distributing resources across our infrastructure. We use it to scale Dockerized Resque workers up and down, as well as other background services, such as the bots that scrape and retrieve company data.
Want to share your experience of using these tools or have better recommendations? Let us know in the comments.
Also, if you like the sound of this and want to work on it, we're hiring!