The Year 0003 Problem

I'm building a regional economic publication out of three federal data sources: the Colorado Secretary of State, the Bureau of Labor Statistics QCEW, and IRS Form 990 filings. Three different schemas, three different file formats, three different ideas of what a date is.

I am two weeks in and I have already learned something I should have known fifteen years ago.

The schema lies. Not the README. Not the data dictionary. The actual column definitions in the actual table. They claim a shape the data does not have.

Concrete example. The SOS dump comes with an effectivedate column. The schema says DATE. I joined it into our load table with a DATE NOT NULL constraint and the import failed on sixteen rows. Sixteen rows out of 275,180. I went and looked at them. Their effectivedate was 0003-01-30. The year three. Not 2003. Not 1903. Three.

These are real records. Somebody really did file them with that date. Possibly a clerk who typed 3 when they meant 2003 and skipped past a validator. Possibly a legacy system migration where a NULL got coerced to 0003 instead of NULL. I do not know. What I know is the column type promised DATE and the values include "the year a Roman emperor named Septimius Severus was alive." The schema was a promise. The data is the truth.

I have started thinking of every column in a federal data ingestion as having two definitions. The one in the file header, and the one I am going to discover by querying the actual values. The job of a load script is to take the second definition and decide what to do about the gap. Cast the offending rows to NULL? Drop them? Quarantine them in a separate table marked effectivedate_suspicious for the editor to look at later? It depends on the piece. For the multi-year business-formation analysis I am running, those sixteen rows are sixteen rows in 275,000, so I dropped them with a comment and moved on. For an investigative piece on shell companies, those sixteen rows might be the entire story.

Exclude Before Include

The second thing I learned this week is the inverse of how I would have approached this five years ago.

I needed to filter the SOS file down to El Paso County. The schema does not have a county column. It has a free-text city column populated by whoever filed the entity. So I wrote what felt like the natural query: match my list of El Paso County cities, return the matching rows. Colorado Springs, Manitou Springs, Monument, Fountain, Falcon, Black Forest, the works.

The result had a quiet 11,000-row leak. Teller County rows. The substring match was tolerant in ways I had not thought about. "Florissant Mountain Estates, Colorado Springs" matched on the wrong token. Free-text city fields filed by 275,000 different people are not clean enumerations. They are a field that looks structured and is actually a junk drawer.

I rewrote the filter. Instead of "match the cities I want," it became "exclude the cities I know I do not want, then take the rest." Name the Teller County cities. Name the Pueblo County cities. Name the Douglas County cities that sometimes cross-list. Anything left after the exclusions is treated as El Paso County until proven otherwise, and the proof gets logged.

This is the inversion. The naive shape is: include what you want. The real shape is: exclude what you do not want, then look at what is left. The world does not sort itself into your includes. It sorts itself into your excludes. This is the pattern in spam filters, in allowlists versus denylists for production firewalls, in feature flags rolled out by exclusion ring. I had used it everywhere and never said it out loud.

In a regional dataset where free-text city names are the only geo-locator you have, exclude-before-include is the difference between a 92% precision filter and a 99.9% precision filter. The 8% you would have leaked are concentrated and findable. The 0.1% you leak after the inversion are scattered and statistically tolerable.

What this changes about how I load data

Two practical things I am doing now that I was not doing two weeks ago.

First, every load script starts with a diagnostic pass. Before the data goes anywhere near the warehouse table, I run a half-page of summary queries. Min and max of every date column. Count of distinct values in every categorical column. Length-of-string histograms for every text column. Null percentages. The diagnostic pass takes about thirty seconds to write and surfaces the year 0003 rows on the first pull. I would have caught it eventually in QA. Catching it before the table even exists is cheaper.

Second, every filter I write has an exclude version sitting next to it. If I am writing WHERE city IN (...), I am also writing a complementary filter against any proxy I can find, and I diff the result counts. If they disagree by more than a percent, something is wrong with one of them and I figure out which.

Both of these are small. Both of them are the kind of thing a senior engineer mutters about when they tell you they have started doing it after the third bad incident. I am not saying I am that engineer. I am saying I am one bad incident closer.

What the schema is for

The schema is not a description of the data. It is a description of what the system that produced the data wishes were true. That gap is where the actual engineering happens.

The schema gets you eighty percent of the way to a clean ingest. The remaining twenty percent is reading the data with your own eyes, in small bites, with diagnostic queries, before you trust any aggregate built on top of it. Every dashboard built on an unread dataset is a dashboard reporting on the schema's wishes, not the world.

The publication I am building lives or dies on whether the numbers are real. The schema is not what makes them real. Reading the data is.

I will let you know when I hit my next year 0003.