This week I published an economic brief that rests on a single number: in 2024, El Paso County started 2.52 businesses for every one that closed, down from a decade spent comfortably above four to one. That number is the whole piece. Everything else is there to make a reader trust it.
What I want to write about here is not the finding. It is the part nobody sees, which is that getting to a trustworthy 2.52 took almost no SQL and a surprising amount of arguing with myself about definitions. The data was 1,482,011 transaction records across 275,180 business entities, pulled from the Colorado Secretary of State and loaded into a 500-megabyte SQLite file on my desk. The query that produces the ratio is about six lines. The six lines were the easy day. The hard part was deciding what the words in them mean.
Problem one: there is no county
I wanted a number for El Paso County. The Secretary of State file does not have a county column. It has a free-text field called principalcity that whoever filed the paperwork typed by hand. So instead of a clean join I had a pile of strings, and the strings are exactly as bad as you would expect from a form that millions of people fill out with no validation.
Colorado Springs alone shows up as "Colorado Springs," "Colo Spgs," "CO SPGS," "Col Springs," "coloradosprings," and a dozen more. Some rows are just "CO," which is the state, not a city, and means nothing. Some are zip codes glued to the name. A naive WHERE city = 'Colorado Springs' would have silently dropped tens of thousands of real businesses because somebody abbreviated.
So the city field needed a normalizer before it could be trusted. The pipeline I landed on is four steps: clean the string (lowercase, strip the trailing state and zip, expand the common abbreviations so "spgs" becomes "springs"), try an exact match against a canonical list of towns, then fall back to fuzzy matching with a score floor, and finally leave genuinely unknown names alone rather than forcing a bad guess. Junk tokens like "na," "various," and bare "co" get mapped to nothing on purpose.
The part that mattered most was the canonical list, and specifically what I chose to keep out of El Paso County. The fuzzy matcher is happy to pull in Woodland Park and Cripple Creek, which are real, nearby, and in Teller County. If I had let it, every Teller County formation would have quietly inflated the El Paso numbers and I would never have known, because nothing errors. The output would just be wrong by a confident-looking margin. Drawing a county boundary out of hand-typed city names is the kind of task that looks like data cleaning and is actually a series of judgment calls about what to exclude.
Problem two: there is no "closed" flag
Here is the one that took the longest. I needed to count business closures. The data does not have a closure flag either. What it has is a transaction log, where each row carries a historydes string describing what kind of filing it was. There are a lot of these strings, and the question of which ones mean "this business ended" turns out to be genuinely contestable.
The obvious ones are easy: dissolving an LLC, dissolving a profit corporation, dissolving a nonprofit. But the edges are where the number lives or dies.
"Intent to Dissolve" is a filing a business submits before it actually dissolves. It is a real, dated event. If I count it, and then also count the actual dissolution that follows, I have counted one closure twice. So intent has to be excluded, not because it is irrelevant, but because it double-books.
"Statement Curing Delinquency" is the opposite trap. It sounds administrative and dead, but it is a revival, a business that was on its way out coming back to good standing. Count that as a closure and you have recorded a death for something that is alive.
Then there is a whole category of filings that look like closures and are not: terminating a registered agent, withdrawing a trade name. The business persists. The agent or the name went away. If you are pattern-matching on the word "withdraw" or "terminate," you will catch these, and your closure count will float upward for no real reason.
I ended up with fifteen transaction labels that I decided constitute a business actually ceasing to exist, and an explicit, written-down list of the ones that look like death but are not. The exclusions are the work. Anyone can write COUNT(*) WHERE it looks dissolved. The credibility of 2.52 is entirely in the filings I chose not to count.
The number is the last five percent
There was one more honesty check. The cohort half of the brief traces every business that dissolved in 2024 back to the year it was born, to see whether the closures were concentrated in the recent pandemic-era startup boom. That join only works for entities that actually carry a formation date, so before I trusted the cohort story I measured the match rate: what fraction of the 2024-dissolved entities could even be traced back. If that number had been seventy percent I would have said so, or not run the analysis at all. It was high enough to stand on, and I wrote it into the methodology so a reader can see the floor I was building on.
This is the shape of most data engineering I do, and it is almost the inverse of how the work gets described. The pipeline diagram makes it look like the value is in the movement of data from A to B. The actual value sat in maybe twenty small decisions about what a word means: is this city in the county, is this filing a death, does this entity count. None of them are technically hard. All of them are places the final number could have gone quietly wrong, with no error message, no failed test, nothing but a clean-looking result that happened to be false.
The query was six lines. I trust the six lines because of the two days I spent deciding what they were allowed to mean.

