What you need to know before sourcing data directly from US state registries

A time, effort, and cost analysis

Summary

Collecting, cleaning, and maintaining company data directly from 50 separate US state registries (plus DC and territories) is possible – but only with significant, ongoing expense in money, time, and specialist effort. Variation across states in data availability, formats, fee models, and cadence – paired with legal/licensing friction – turns “do‑it‑yourself” into a long, laborious program rather than a one‑off project. 

OpenCorporates has already solved this at scale: normalizing disparate sources, preserving provenance, and delivering continuously updated, usable data through one interface. For most teams, the DIY path will cost more, take longer, and introduce operational risk while diverting scarce engineering and data resources away from product priorities. 

The fragmented US registry landscape

Unlike jurisdictions with a single national companies register, the United States has no centralized federal register for ordinary business entities; registration happens at the state level through Secretaries of State or equivalent offices. The Small Business Administration’s guidance is explicit: how and where you register depends on your state and business structure, and for most small businesses this means registering with state and local governments.

Across states, “basic” attributes (name, entity type, formation date, status, and registered agent) are common, but richer elements (officers/directors, status histories, filing images) are inconsistent and access mechanisms vary widely. For example, Florida exposes officer/authorized‑person lookups and makes annual‑report images downloadable free of charge once posted; California’s business search offers free PDF copies of 17M+ imaged documents. By contrast, Delaware’s public interface is a pay‑per‑entity model ($10 for status; $20 for status with tax/history).

Two routes to direct acquisition – and their real costs

1) Buy official bulk data (where offered).
A subset of states publish or sell bulk datasets. The offerings, scope, cadence, and price points vary dramatically. Below are just a few examples:

  • Minnesota sells Active Business Data (name + primary address for active entities) for $30, either as a one‑time purchase or as a recurring weekly $30 debit. Officer data are not included in the bulk file; officer information is available only through name searches at $35 each, which is impractical at bulk scale.
  • Indiana’s rulemaking sets a $9,500 fee for a bulk download with monthly updates; a one‑time bulk download is $8,000.
  • Kentucky’s Business Records subscription ($2,000/month) includes monthly full files and officers/principals, plus daily/weekly deltas (new filings, officer changes).
  • West Virginia offers a Business Entity List Service (build your own file) with a $25 minimum search fee + $0.05 per record, alongside a monthly Bulk Data Service for subscribers.
  • New York publishes multiple Department of State datasets on the open portal – e.g., Corporations & Other Entities: All Filings – Name Status History, plus documentation and related entity datasets.
  • Texas, notably, does not publish an open bulk registry dataset; access is via SOSDirect, with a $1 per search charge (waived when an order is placed from the results) and standard fees for copies/certificates.

Even when you pay, the contents often exclude officers, historical statuses, or images; some states separate images into other products or do not offer images in bulk at all. The Kentucky example – where officers are packaged and deltas exist – shows what “good” looks like; many states provide far less.
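
To make the spread in fee models concrete, the figures quoted in the examples above can be annualized in a few lines. This is a rough sketch, not a pricing tool: the function names are illustrative, the West Virginia fee is read as additive ($25 plus $0.05 per record), the record count is a made-up input, and recurring fees are assumed to run for a full year.

```python
# Fee figures are the ones quoted in the examples above; everything else is an assumption.

def west_virginia_list_cost(records: int) -> float:
    """$25 minimum search fee plus $0.05 per record, read here as additive."""
    cents = 2_500 + 5 * records  # work in integer cents to avoid float drift
    return cents / 100


def annualize(fee_per_period: int, periods_per_year: int) -> int:
    """Yearly cost of a recurring fee, assuming it runs for a full year."""
    return fee_per_period * periods_per_year


kentucky_per_year = annualize(2_000, 12)   # $2,000/month subscription
minnesota_per_year = annualize(30, 52)     # $30 weekly debit
west_virginia_pull = west_virginia_list_cost(100_000)  # hypothetical record count

print(kentucky_per_year, minnesota_per_year, west_virginia_pull)
```

Under these assumptions Kentucky alone comes to $24,000 a year, while Minnesota's weekly file costs $1,560 a year for far fewer fields, which is the kind of mismatch between price and coverage that makes a nationwide budget hard to predict.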

2) Scrape public web portals.
Scraping trades cash costs for engineering time and legal/operational risk. Technically, you’ll build ~50 custom scrapers for heterogeneous search flows, HTML structures, throttles, and anti‑bot systems – then maintain them whenever sites change. Legally, US precedent has narrowed one major risk: in April 2022, the Ninth Circuit reaffirmed in hiQ v. LinkedIn that scraping publicly accessible pages likely does not violate the “without authorization” prong of the Computer Fraud and Abuse Act (CFAA). Even so, terms of use, privacy, IP, and state‑level rules still apply, and registries can throttle or block abusive automated access.

Hidden workstreams most teams underestimate

Schema harmonization and reference data. States label similar concepts differently (“Active” vs “In Existence” vs “Good Standing”), encode entity types with different vocabularies, and present addresses/agents in non‑uniform formats. Creating a canonical schema and maintaining mapping tables is unavoidable – and it grows each time a state tweaks its fields.
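
A mapping table of this kind might start out as small as the sketch below. The three raw labels are the ones mentioned above; collapsing them into a single canonical value is an assumption made for illustration (it is not a statement of legal equivalence), and the names `STATUS_MAP`, `normalize_status`, and `unmapped_labels` are hypothetical.

```python
from typing import Optional

# Illustrative mapping only: the raw labels come from the text above, and
# treating them as one canonical value is an assumption.
STATUS_MAP = {
    "active": "active",
    "in existence": "active",
    "good standing": "active",
}


def normalize_status(raw: str) -> Optional[str]:
    """Return the canonical status, or None when the label has no mapping yet."""
    key = " ".join(raw.lower().split())  # collapse case and stray whitespace
    return STATUS_MAP.get(key)


def unmapped_labels(raw_labels):
    """Labels that need a human decision before they enter the canonical schema."""
    return sorted({label for label in raw_labels if normalize_status(label) is None})
```

Returning `None` for unknown labels, rather than guessing, is what surfaces schema drift: when a state introduces a new status value, it shows up in `unmapped_labels` instead of being silently misclassified.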

Identity resolution. Without a single, nationwide ID, cross‑state deduplication (same firm registered in multiple states, name variants, conversions/mergers) requires robust match logic and caution about false merges.
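
As a minimal sketch of what such match logic can look like, the snippet below normalizes company names and flags likely duplicates using Python's standard `difflib`. The suffix list, the 0.92 threshold, and the function names are all assumptions chosen for illustration; a production matcher would also compare addresses, agents, and filing history before merging anything.

```python
import re
from difflib import SequenceMatcher

# Hypothetical, deliberately short suffix list; a real one would be much longer.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "llc", "co", "company", "ltd", "limited"}


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop trailing legal-form suffixes."""
    text = re.sub(r"[.']", "", name.lower())            # "L.L.C." -> "llc"
    tokens = re.sub(r"[^a-z0-9]+", " ", text).split()
    while tokens and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    return " ".join(tokens)


def is_candidate_match(a: str, b: str, threshold: float = 0.92) -> bool:
    """Flag a pair for review; a True result is a candidate, not a confirmed merge."""
    na, nb = normalize_name(a), normalize_name(b)
    if not na or not nb:
        return False
    return SequenceMatcher(None, na, nb).ratio() >= threshold
```

The function only nominates candidates. Two unrelated firms can share a name across states, so a name hit should be treated as a prompt for further evidence rather than as grounds for an automatic merge.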

Provenance. For compliance workflows, you need to track where every attribute came from and when: URL or document ID, retrieval date, and jurisdiction. Building lineage capture and exposing it to end users is non‑trivial and must be baked into ingestion pipelines and downstream APIs.
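
One way to bake lineage in from the start is to store every attribute together with its source, as in the sketch below. The class and field names, the `us_xx` jurisdiction code, and the `registry.example` URL are placeholders invented for illustration, not any registry's real identifiers.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SourcedValue:
    """A single attribute value plus the evidence trail behind it."""
    value: str
    jurisdiction: str    # placeholder code format, e.g. "us_xx"
    source_ref: str      # URL or document ID at the registry
    retrieved_on: date


def cite(field: str, sv: SourcedValue) -> str:
    """Render an attribute with its provenance for display or audit logs."""
    return (f"{field}={sv.value!r} "
            f"[{sv.jurisdiction}, {sv.source_ref}, retrieved {sv.retrieved_on.isoformat()}]")


status = SourcedValue("Active", "us_xx",
                      "https://registry.example/entity/12345", date(2024, 1, 15))
print(cite("status", status))
```

Making the value immutable (`frozen=True`) means a later refresh has to create a new `SourcedValue` with its own retrieval date instead of overwriting the old evidence in place.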

Licensing and permitted use. State bulk programs typically require agreements – often distinguishing non‑commercial vs commercial use and restricting redistribution. Minnesota, for example, requires commercial users to execute a license agreement to obtain bulk data; Kentucky requires a subscriber agreement.

The “freshness” burden (and why a snapshot goes stale fast)

Formation, dissolution, and officer changes happen constantly. The US Census Bureau’s Business Formation Statistics (BFS) show high‑frequency flows of new business applications and formations, with monthly national and state‑level series. That means a one‑time registry pull degrades quickly; keeping data fresh requires ingesting deltas frequently and reconciling them against full refreshes on a regular cadence.

A few states help: Kentucky publishes daily/weekly deltas for new companies, new officers, and company/officer changes alongside monthly fulls; others provide only monthly files (or no deltas at all), pushing teams toward periodic full diffs or per‑record scraping.
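
Where a state supplies only periodic full files, the deltas have to be derived locally. A minimal sketch of such a full diff follows, assuming each snapshot has already been parsed into a dictionary keyed by the registry's entity ID; the `Snapshot` type, the function name, and the sample IDs are hypothetical.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, dict]  # registry entity ID -> record fields


def diff_snapshots(previous: Snapshot,
                   current: Snapshot) -> Tuple[List[str], List[str], List[str]]:
    """Return sorted IDs that were added, removed, or changed between two full files."""
    added = sorted(current.keys() - previous.keys())
    removed = sorted(previous.keys() - current.keys())
    changed = sorted(k for k in current.keys() & previous.keys()
                     if current[k] != previous[k])
    return added, removed, changed


last_month = {"A1": {"status": "active"}, "B2": {"status": "active"}}
this_month = {"A1": {"status": "dissolved"}, "C3": {"status": "active"}}
added, removed, changed = diff_snapshots(last_month, this_month)
```

An ID that disappears from a file is not necessarily a dissolved company. In a file limited to active entities, for instance, a drop-out may only mean a status change, so removed IDs usually need a follow-up lookup before the record is closed.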

What a national DIY program really costs

Cash outlay. The shape of costs varies widely depending on your needs, and is dominated by a handful of expensive jurisdictions. Put together, a legitimate nationwide buy can readily land in the tens or even hundreds of thousands of dollars annually, even if you lean on open states like New York.

Time and effort. Expect months to procure files, stand up pipelines, normalize fields, and reconcile entities, followed by continuous operations for updates, break‑fix, and schema drift. Scraping replaces fees with engineering hours and fragility; bulk buying reduces scraping effort but still requires significant data engineering and contract administration across dozens of programs.

Why OpenCorporates (buy) beats building your own (DIY)

OpenCorporates has spent more than a decade assembling, standardizing, and maintaining company data from official registries worldwide (including all US states), with provenance and freshness as first‑class features. Instead of building 50+ integrations, matchers, and monitoring jobs yourself, you integrate once and offload the ongoing risk of portal changes, schema updates, and licensing nuances to a provider whose core competency is lawful, scalable acquisition and normalization. The result is faster time‑to‑value, lower total cost of ownership, and higher confidence for verification/KYB, onboarding, and investigations use cases.

The economics: One relationship instead of dozens of contracts and fee schedules; economies of scale vs buying the priciest states piecemeal.

The time‑to‑value: Ship features using standardized data and IDs instead of building pipelines and reference mappings.

The risk: The provider absorbs breakage from portal/format changes and manages provenance system‑wide.

Implementation recommendation

  1. Make OpenCorporates your authoritative baseline for US company data and provenance. Use direct state portals selectively for edge cases (e.g., certified copies, audit requests).
  2. Design for evidence and audit. Even when you use an aggregator, keep clear proof paths to source registries (document IDs/URLs and retrieval dates).
  3. Focus your teams on differentiation. Allocate engineering/data science to customer value (risk models, entity resolution in your graph, case workflows), not registry plumbing.

For more information

Learn more about how OpenCorporates’ data can help you understand corporate structures and manage risk. Reach out for a demo or explore our services.
