Your risk team thinks you’re dealing with Acme Corp, finance books it as ACME CO, and marketing targets Acme Corporation in New York. Three “different” customers that are, in fact, the same legal entity. Multiply that by thousands of records, and duplicate data quietly drains budgets, derails onboarding, and invites regulatory fines every single day. One missed connection can green-light a sanctioned counterparty or sink a multi-million-dollar deal. The race to nail down a golden record for every business has never been more urgent.
Why this isn’t just another data hygiene post
Most conversations about entity resolution stop at “clean your CRM.” We’re going deeper. Today we’ll explore the landscape of US business registries, expose why the lack of a universal identifier still haunts data teams, and reveal the playbook data aggregators use to turn messy records into rock-solid truth. You’ll see how OpenCorporates, the world’s largest open company database, slots into that playbook and why its partnership with the Legal Entity Identifier (LEI) system is quietly changing everything. By the end you’ll know exactly where the major potholes lie, the tools top aggregators deploy, and how to tell if your “single source of truth” is really golden or just gold-plated.
Why business data is so fragmented
Corporate registration in America is complicated. There’s no federal companies register. Instead, more than 50 separate state and territory registries keep their own formats, IDs, and disclosure rules.
For data teams this means:
- A Delaware-incorporated firm that files to do business in Texas and California now holds three different IDs and, often, three slightly different legal names.
- One state might publish officers and directors; another won’t even reveal a mailing address.
- “Active,” “Good Standing,” and “In Business” might all describe the same legal status, depending on which clerk keyed it in.
Even basic due-diligence questions like “is this company real” can require stitching together dozens of records.
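The status-label problem above is often handled with a small lookup table. Here is a minimal sketch of that idea; the raw strings, the canonical labels, and the `normalize_status` helper are illustrative assumptions, not any registry's actual vocabulary.

```python
# Hypothetical mapping from raw registry status strings to one canonical label.
# Both the raw strings and the canonical labels are illustrative assumptions.
CANONICAL_STATUS = {
    "active": "ACTIVE",
    "good standing": "ACTIVE",
    "in business": "ACTIVE",
    "dissolved": "INACTIVE",
}


def normalize_status(raw: str) -> str:
    """Map a raw status string to a canonical label, or UNKNOWN if unmapped."""
    # Collapse case and repeated whitespace before the lookup.
    key = " ".join(raw.strip().lower().split())
    return CANONICAL_STATUS.get(key, "UNKNOWN")
```

Returning UNKNOWN rather than guessing keeps unmapped statuses visible, so someone can review them and extend the table.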
Entity Resolution 101
So the question is, how do you transform that tangled web of near-duplicates into a single, reliable profile? Entity Resolution, also known as entity matching or record linkage, is the data-management discipline that identifies when different records actually point to the same real-world entity and then links or merges them into one “golden record.”
A definition we like is: Entity Resolution “identifies and links records across multiple data sources to create a unified view – or golden record – that represents the best version of critical entities.”
The mechanics of Entity Resolution:
- Standardise the inputs – Cleanse and harmonise names, addresses, and dates so “Acme Corp.” and “Acme Corporation” share a comparable format.
- Generate candidate pairs – Use blocking keys (like ZIP code or company number) to drastically shrink the search space.
- Score the similarity – Apply deterministic rules (exact registration-number matches) or machine-learning models that weigh fuzzy name similarity, shared officers, and geospatial closeness.
- Decide & persist – If the confidence score clears your threshold, collapse the records; if it’s borderline, escalate for human review.
- Iterate & govern – Track provenance, version each change, and update whenever source data refreshes, because companies move, rename, merge, and dissolve faster than you can say “quarter-close.”
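The first four steps above can be sketched in a few lines of standard-library Python. This is a toy illustration under stated assumptions: the record fields (`name`, `zip`, `reg_no`), the suffix table, and the 0.9 / 0.7 thresholds are all hypothetical, and `difflib.SequenceMatcher` stands in for whatever similarity model a real pipeline would use.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
import re

# Illustrative abbreviation table; a production list would be far longer.
SUFFIXES = {"corp": "corporation", "co": "company", "inc": "incorporated"}


def standardize(name: str) -> str:
    """Lowercase, strip punctuation, and expand common legal-form abbreviations."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(SUFFIXES.get(t, t) for t in tokens)


def candidate_pairs(records):
    """Block on ZIP code so only records sharing a ZIP are compared."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[rec["zip"]].append(rec)
    for block in blocks.values():
        yield from combinations(block, 2)


def score(a, b) -> float:
    """Deterministic rule first, then fuzzy name similarity in [0, 1]."""
    if a.get("reg_no") and a.get("reg_no") == b.get("reg_no"):
        return 1.0
    return SequenceMatcher(
        None, standardize(a["name"]), standardize(b["name"])
    ).ratio()


def decide(s: float, merge_at: float = 0.9, review_at: float = 0.7) -> str:
    """Merge above the upper threshold, escalate the borderline band to review."""
    if s >= merge_at:
        return "merge"
    if s >= review_at:
        return "review"
    return "distinct"
```

Blocking trades recall for speed: two records for the same company with different ZIP codes are never compared here, which is why real pipelines usually combine several blocking keys.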
Everything that follows – golden sources, unique identifiers, the OpenCorporates/LEI bridge – assumes you can reliably collapse duplicates. No matter how rich your datasets or how cutting-edge your graph database, if you match “Acme Holdings LLC” (parent) to “Acme Corp Ltd” (subsidiary) by mistake, your risk scores, supply-chain exposure, and ESG metrics all fall like dominoes. Entity Resolution is the hinge on which the entire narrative, and any real-world compliance program, turns.
The anatomy of a golden source, and why it matters
“Master data management … creates a ‘golden record,’ a single source of truth everyone can trust.”
Golden source data starts with authoritative inputs, typically the official state registries, and flows through five critical steps:
- Ingest: pull filings from every jurisdiction (plus SEC, tax, and licensing data).
- Standardize: normalize names, dates, and addresses into one schema.
- Match & Merge: run fuzzy logic and machine learning to collapse duplicates.
- Survivorship Rules: decide which source “wins” when facts conflict.
- Enrich & Publish: attach external identifiers (LEI, DUNS, etc.) and push a clean API.
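Of the five steps, survivorship is the one most often left implicit, so here is a hedged sketch of one possible rule: the most trusted source wins, and ties break on the most recent date. The source names, their ranking, and the `survive` helper are assumptions made for illustration; real rules are usually set per field and reviewed by data stewards.

```python
from datetime import date

# Hypothetical source ranking: lower number wins. Purely an assumption.
SOURCE_PRIORITY = {"state_registry": 0, "sec": 1, "crm": 2}


def survive(candidates):
    """Pick one value per field from conflicting (value, source, as_of) triples.

    Rule: most trusted source wins; ties break on the most recent as_of date.
    """
    golden = {}
    for field, options in candidates.items():
        value, source, as_of = min(
            options,
            # Unknown sources rank last; negating the ordinal prefers newer dates.
            key=lambda o: (SOURCE_PRIORITY.get(o[1], 99), -o[2].toordinal()),
        )
        # Keep provenance alongside the winning value.
        golden[field] = {"value": value, "source": source, "as_of": as_of}
    return golden
```

Storing the winning source and date next to each value means the published record can still be traced back to the filing it came from.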
Three big things every data leader should know
1. Scale is nothing without provenance
OpenCorporates now holds data on 220 million+ companies across 140+ jurisdictions, and every field links back to its official filing. That provenance isn’t a nice-to-have; auditors demand it, regulators expect it, and machine-learning teams need it for training sets.
2. Identifiers are the new currency
LEIs cover almost 2.9 million entities globally. That is tiny next to the full business universe, but when a record carries one, matching becomes a deterministic lookup and accuracy jumps to near-perfect. In 2023, GLEIF linked over half of all LEIs directly to OpenCorporates IDs, creating an open bridge between registry truth and global financial reporting.
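To show why an identifier changes the matching problem, here is a minimal sketch of identifier-first linking: records that carry a known LEI are resolved by lookup, and everything else falls through to fuzzy matching. The `link_by_lei` helper, the record fields, and the crosswalk dictionary are hypothetical; this does not reflect the actual format of any GLEIF or OpenCorporates mapping file.

```python
def link_by_lei(records, lei_to_registry_id):
    """Split records into deterministic LEI matches and a fuzzy-matching queue.

    lei_to_registry_id is a hypothetical crosswalk: {LEI: registry identifier}.
    """
    linked, unresolved = [], []
    for rec in records:
        # Tolerate missing values, stray whitespace, and lowercase input.
        lei = (rec.get("lei") or "").strip().upper()
        registry_id = lei_to_registry_id.get(lei)
        if registry_id:
            linked.append({**rec, "registry_id": registry_id})
        else:
            unresolved.append(rec)
    return linked, unresolved
```

Because only a small share of businesses hold an LEI, the unresolved queue will typically be the larger of the two, so the fuzzy pipeline remains necessary.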
3. Real-world wins: Quantexa’s risk graph
Analytics firm Quantexa ingested OpenCorporates bulk data to seed a network graph, then overlaid transaction and sanctions lists. The resulting graph instantly connected hidden director networks across borders and flagged fraud rings that were invisible in siloed datasets.
What this means for you tomorrow
- Faster onboarding, lower false positives – Plugging a golden source into KYC can slash manual review by double-digit percentages. Compliance teams spend less time Googling company names and more time chasing real red flags.
- Cheaper data ops – Why pay for 50 state feeds when one API covers them? OpenCorporates’ data plus IDs reduce ingestion and licensing overhead.
- Future-proof governance – As the US rolls out a federal Beneficial Ownership database, systems already anchored to transparent IDs will bolt on new attributes painlessly.
A note of caution: No silver bullets
Many US states still hide officers, and only a slice of businesses carry LEIs. Bias can creep into machine-learning matches, and survivorship rules require human oversight. Finally, if your downstream teams ignore the golden source and keep editing records locally, duplicates will creep back. Entity resolution is a discipline, not a one-off upload.
For more information
Learn more about how OpenCorporates’ data can help you understand corporate structures and manage risk. Reach out for a demo or explore our services.