This is a guest post by Waldo Jaquith, who runs US Open Data and who this month pretty much single-handedly persuaded the US state of Virginia to open up all their company register data.
OpenCorporates’ work is in aggregating openly published corporate registries, but less well known is the work that the organization foments. Moving corporate registers from closed records to open data is a laborious process, requiring work from a bunch of stakeholders. Here’s the story of how one new registry was opened up, effective today.
The U.S. state of Virginia is adjacent to Washington DC, is roughly the size of North Korea, and has 8 million residents and over half a million businesses. In the United States, corporations are registered with the state, not with the federal government, so instead of a national registry, we have 55 registries, each using different standards and practices.
Like many U.S. states, Virginia has long been a “dark state” for corporate data. The State Corporation Commission, the independent state agency that regulates corporations, does not provide open data about Virginia’s businesses. There’s no API and no bulk downloads. There is a pair of web-based interfaces to look up records, one at a time: one with a limited subset of the data and another that is complete, but with a horrifically bad interface.
However, Virginia has long sold bulk data. They require that a contract be signed and a payment of $450 be made every three months, in exchange for FTP-based access to the data. This data is a mess, to put it gently.
Two years ago, I started buying Virginia’s corporate data to give it away for free. I didn’t have any great plans for it – I was mostly just embarrassed to see my home state as a blank spot in OpenCorporates’ collection. Corporate data needs to exist within OpenCorporates. Period.
Making the file itself freely available was as easy as writing a cron job to copy it into an S3 bucket once a week. But getting the data into useable shape was a much greater challenge. The data published by Virginia’s State Corporation Commission was just a mess, on every level. Here’s a sample:
|020369964COLONIAL POINT CIVIC LEAGUE, INC. 00200903200000000019910117VA001420 FLINTFIELD CRESCENT CHESAPEAKE VA23321000000000000CLARINE B. ROBERTS 1420 Flintfield Crescent Chesapeake VA233210000199101171236N00000000000 0 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000|
|020712928BARGER INC 10201101030000000020090819VA00 00000000000000000BARBARA J IFKO 1201 OLEANDER AVE CHESAPEAKE VA233250000200908191236S00000000100 0COMMON 0000000010000000000000000000000000000000000000000000000000000000000000000000000000000000|
|020622258ASTA CRS, INC. 00201010120000000020040816VA0044121 HARRY BYRD HWY STE 230 ASHBURN VA20147000000000000PRABHAKAR THANGARAJAH 44121 HARRY BYRD HWY STE 230 ASHBURN VA201470000201207091153S00000005000 0COMMON 0000000500000000000000000000000000000000000000000000000000000000000000000000000000000000|
|020369967ATLANTIC NEWS-FEATURES, INC. 00200302280000000019910117VA009108 WINDOVER CT HENRICO VA23229000000000000STEPHEN P NASH 9108 WINDOVER CT RICHMOND VA232290000200303052143S00000005000 0COMMON 0000000500000000000000000000000000000000000000000000000000000000000000000000000000000000|
|020369971OBERMEYER CONSTRUCTION CO., INC., FRED 10201105310000000019910117VA006012 LEEWOOD DR ALEXANDRIA VA22310000000000000FRED OBERMEYER 6012 Leewood Drive Alexandria VA223100000199101171129S00000005000 0COMMON 0000000500000000000000000000000000000000000000000000000000000000000000000000000000000000|
|02F132519CONTAINER-CARE VIRGINIA, INC. 40201003020000000020040825TX002633 CAMINO RAMON, STE 450 SAN RAMON CA94583000000000000CT CORPORATION SYSTEM 4701 COX RD STE 301 GLEN ALLEN VA230606802200401055143S00000001000 0COMMON 0000000100000000000000000000000000000000000000000000000000000000000000000000000000000000|
The data was divided into nine different files, which they concatenated together into a single file for some reason. So using the data first required breaking it up into its individual files. Then the fixed-width data had to be mapped into structured data. Then the character encodings had to be normalized, because the agency has periodically used different character encodings – some of which are truly mysterious – without converting the older records. (Sometimes they switch encodings mid-record.)
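For anyone facing a similar file, here is a minimal Python sketch of those three steps. It is not the code I actually ran. The field layout (`LAYOUT`) is hypothetical, since the real offsets come from the agency's documentation and the sample above has had its padding collapsed. The idea that the leading two characters identify which of the nine files a record belongs to is only a guess based on the sample rows. The list of encodings to try is likewise an assumption. Fields are sliced from raw bytes and decoded one at a time, so a record that switches encodings partway through can still be read.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical layout: (field name, byte offset, width). Illustrative only;
# the real offsets would come from the agency's file documentation.
LAYOUT: List[Tuple[str, int, int]] = [
    ("table_id", 0, 2),
    ("corp_id", 2, 7),
    ("name", 9, 30),
    ("state", 39, 2),
]

# Assumed candidate encodings, tried in order before the latin-1 fallback.
CANDIDATE_ENCODINGS = ("utf-8", "cp1252")


def decode_field(raw: bytes) -> str:
    """Decode one field, trying each candidate encoding in turn."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every byte to a character, so this cannot fail.
    return raw.decode("latin-1")


def parse_record(raw: bytes) -> Dict[str, str]:
    """Slice a fixed-width record by byte offset and decode each field."""
    return {
        name: decode_field(raw[start:start + width]).strip()
        for name, start, width in LAYOUT
    }


def split_by_table(records: Iterable[bytes]) -> Dict[str, List[bytes]]:
    """Group records by their leading two bytes (assumed to be a table ID)."""
    groups: Dict[str, List[bytes]] = {}
    for record in records:
        table_id = record[:2].decode("ascii", errors="replace")
        groups.setdefault(table_id, []).append(record)
    return groups
```

Slicing bytes rather than decoded text keeps the offsets stable even when a name contains multi-byte characters. The trade-off is that a field which happens to be valid in more than one encoding is decoded with whichever comes first in the list, so the order of `CANDIDATE_ENCODINGS` matters.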
At this point, OpenCorporates was able to harvest the records and incorporate them into the site.
Mission accomplished? Not quite.
I didn’t want to settle for buying public data, indefinitely, for $1,800/year. So I set about trying to convince the Virginia State Corporation Commission to stop selling the data, and to instead give it away. I promoted the fact that they only had six paying customers. I called them out via Twitter with every check that I wrote.
I conducted a study that found that this data had US$100 million in untapped value to localities in Virginia, and made sure that it got lots of press coverage. And then, finally, in April, I had a passive-aggressive sheet cake delivered to the head of the agency.
It’s hard to say which of these things worked – maybe all of them, maybe none of them – but on July 1, the agency announced a 180° change: they would start giving away the data for free, as CSV, effective today, August 1.
This was beyond what I’d hoped for: not only are they giving the data away for free, but they’re providing it as structured data in an open format, available via HTTP. I no longer have to write checks every three months, OpenCorporates can harvest the data directly from Virginia, the state can stop wasting money administering what is surely a money-losing data-sales system, and this data is now available to anybody. Everybody wins. It took a couple of years, but the effort that OpenCorporates set in motion is now complete.
OpenCorporates is more than a repository of corporate registries. Its existence is a wedge that’s opening up corporate registries throughout the world. This isn’t happening quickly—it took two years of non-trivial work just to get one medium-sized U.S. state to publish its data openly. But OpenCorporates is leading the charge towards openness in corporate registries, as evidenced by Virginia’s change today.