The Audit That Couldn't Save Itself

This week I shipped a weekly site health audit for the five web properties I run. It checks SSL, TTFB, robots.txt, sitemap.xml, OG tags, canonical, and meta robots. It writes a row to a Postgres table called site_health_reports. It runs on a cron. The first real run caught a problem I had been missing for three weeks: one of my marketing sites was returning <meta name="robots" content="noindex,nofollow"> to crawlers. Google could not see it. I had not noticed.
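
For the meta robots check specifically, here is a minimal sketch of how a noindex detector could look. This is not the audit's actual code; the function names metaRobotsDirectives and blocksIndexing are made up for illustration, and a regex pass like this is only a rough stand-in for a real HTML parser. It treats "none" as blocking too, since crawlers read it as shorthand for noindex plus nofollow.

```typescript
// Hypothetical helpers, not the audit's real implementation.
// Collects the directives from every <meta name="robots"> tag in the HTML.
function metaRobotsDirectives(html: string): string[] {
  const directives: string[] = [];
  // Grab each <meta ...> tag, then read its attributes in whatever order they appear.
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];
  for (const tag of metaTags) {
    const name = /\bname\s*=\s*["']?([^"'\s>]+)/i.exec(tag)?.[1];
    if (name?.toLowerCase() !== "robots") continue;
    const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
    const value = content?.[1] ?? content?.[2] ?? "";
    for (const part of value.split(",")) {
      const directive = part.trim().toLowerCase();
      if (directive) directives.push(directive);
    }
  }
  return directives;
}

// True when the page asks crawlers not to index it.
function blocksIndexing(html: string): boolean {
  const directives = metaRobotsDirectives(html);
  return directives.includes("noindex") || directives.includes("none");
}

// The tag my marketing site was serving:
const flagged = blocksIndexing('<meta name="robots" content="noindex,nofollow">');
console.log(flagged); // true
```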

That should have been the story of the week. It was the second-most interesting thing that happened. The most interesting thing was that the audit could not save its own report.

The 42501

The audit code is a small TypeScript script that the agent runs at the end of the job. It checks the seven things, builds a JSON payload, then calls Supabase:

const { data, error } = await supabase
  .from('site_health_reports')
  .insert({
    site_id,
    overall_status: 'healthy',
    score: 91,
    checks,
    summary_md,
    created_by: user.id,
  });

The first run came back with this:

{
  code: '42501',
  message: 'new row violates row-level security policy for table "site_health_reports"'
}

42501 is the Postgres error code for insufficient_privilege. RLS (row-level security) is the Postgres feature that attaches policies to a table and lets them decide who can read or write each row. In a multi-tenant app, every row carries a tenant identifier for those policies to check. Most of my tables have it on. site_health_reports was one of them.
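
Because the failure comes back as a plain object with a code and a message, the audit could tell an RLS rejection apart from any other insert failure. A minimal sketch, assuming only the two fields shown in the error above; DbError, isRlsViolation, and describeInsertFailure are names I am inventing here, not part of supabase-js.

```typescript
// Only the fields visible in the error above; the real object may carry more.
interface DbError {
  code?: string;
  message: string;
}

// SQLSTATE 42501 is insufficient_privilege.
const INSUFFICIENT_PRIVILEGE = "42501";

// True when the error is a permission failure that mentions row-level security.
function isRlsViolation(error: DbError | null): boolean {
  if (!error) return false;
  return (
    error.code === INSUFFICIENT_PRIVILEGE &&
    error.message.includes("row-level security")
  );
}

// Builds a log line that points at the policy instead of at the payload.
function describeInsertFailure(error: DbError, table: string, principal: string): string {
  if (isRlsViolation(error)) {
    return `RLS blocked insert into ${table} for ${principal}: check the INSERT policy`;
  }
  return `Insert into ${table} failed (${error.code ?? "no code"}): ${error.message}`;
}

const firstRunError: DbError = {
  code: "42501",
  message: 'new row violates row-level security policy for table "site_health_reports"',
};
console.log(describeInsertFailure(firstRunError, "site_health_reports", "Claudia"));
```

Checking the message as well as the code matters because 42501 also covers ordinary missing grants, which need a different fix than a missing policy.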

I had written the INSERT policy when I created the table. It said: a row can be inserted if the row's site_id belongs to a site owned by the authenticated user. That works fine for me. I own all my sites. If I open the dashboard and click "save report," the policy passes.

The problem is the agent. The agent runs as a separate Supabase auth user: Claudia, with her own email and her own auth.uid(). She does not own any of my sites. The INSERT policy did not include her, because I had not thought about her when I wrote it. I had thought about a user clicking a button.

Migration 0058

The fix took twenty minutes and one file:

-- 0058_allow_claudia_insert_site_health.sql
create policy "Claudia can insert health reports for any site she writes for"
  on public.site_health_reports
  for insert
  to authenticated
  with check (
    auth.uid() = 'CLAUDIA_UID_HERE'::uuid
  );

That is the dumb version. The smart version, which I shipped, references a site_agents join table where I tag which agents are allowed to write for which sites. Claudia is tagged on all five. The policy reads from that table instead of hardcoding her uid.
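
The rule the join-table policy enforces is easier to see outside SQL. Here is a TypeScript model of the predicate, not the migration itself. The column names agent_id and site_id are my guesses at a reasonable shape for site_agents, and agentMayInsert is a hypothetical name; in the database this logic would sit in the policy's with check clause as a lookup against the join table.

```typescript
// Assumed row shape for the site_agents join table; column names are illustrative.
interface SiteAgentRow {
  agent_id: string; // auth uid of the agent
  site_id: string;  // site the agent is allowed to write for
}

// Models the WITH CHECK idea: allow the insert only when some site_agents row
// links the calling principal to the site_id on the row being inserted.
function agentMayInsert(
  callerUid: string,
  newRowSiteId: string,
  siteAgents: readonly SiteAgentRow[],
): boolean {
  return siteAgents.some(
    (row) => row.agent_id === callerUid && row.site_id === newRowSiteId,
  );
}

// Placeholder ids, not real values.
const tags: SiteAgentRow[] = [
  { agent_id: "claudia-uid", site_id: "site-1" },
  { agent_id: "claudia-uid", site_id: "site-2" },
];

console.log(agentMayInsert("claudia-uid", "site-1", tags)); // true
console.log(agentMayInsert("claudia-uid", "site-9", tags)); // false
```

The point of the join table is visible in the second call: the agent is still a known principal, but she gets nothing for a site she has not been tagged on.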

After the migration ran I re-pointed the audit at monkeynutz.com and tried again. The insert returned a row id. The row was there when I queried it. The audit could now save its own report. The pipeline became fully closed-loop: a scheduled job runs, checks happen, the result lands in the same database the dashboards read from, and the next week's run can diff against the last one. None of that worked the day before.
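
The week-over-week diff is the part that only works once reports persist. A minimal sketch of what that comparison could look like, assuming the checks payload is a flat map from check name to a status string. That shape is my assumption for illustration (the real checks column may be richer), and diffChecks is a hypothetical helper.

```typescript
// Assumed shape of the checks payload: one status per named check.
type CheckStatus = "pass" | "warn" | "fail";
type Checks = Record<string, CheckStatus>;

interface CheckChange {
  check: string;
  before: CheckStatus | undefined; // undefined: check did not exist last week
  after: CheckStatus | undefined;  // undefined: check was dropped this week
}

// Returns every check whose status differs between two runs, sorted by name.
function diffChecks(previous: Checks, current: Checks): CheckChange[] {
  const names = new Set(Object.keys(previous).concat(Object.keys(current)));
  const changes: CheckChange[] = [];
  for (const check of Array.from(names).sort()) {
    const before = previous[check];
    const after = current[check];
    if (before !== after) changes.push({ check, before, after });
  }
  return changes;
}

// Illustrative data, not real audit output.
const lastWeek: Checks = { ssl: "pass", meta_robots: "fail" };
const thisWeek: Checks = { ssl: "pass", meta_robots: "pass", canonical: "warn" };

for (const change of diffChecks(lastWeek, thisWeek)) {
  console.log(`${change.check}: ${change.before ?? "absent"} -> ${change.after ?? "absent"}`);
}
```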

The actually interesting part

The thing I want to remember from this week is not the migration. It is what the failure exposed.

When I think about authorization in my apps, I default to a two-role model: the user and the service. Users authenticate as themselves. Services run with a service_role key that bypasses RLS entirely. Most multi-tenant codebases I have worked in look like this. Most patterns assume it.

Agents do not fit either slot cleanly. A service-role key for the agent makes the security model trivial and the blast radius enormous: if Claudia gets compromised or just makes a wrong call, she can write anything anywhere. A user account for the agent (which is what I went with) keeps her inside RLS, which is exactly what I want. But it means every table she has to write to needs an INSERT policy that knows about her, specifically, as a principal.

That is a real cost. It is not a cost I want to skip. The reason I am comfortable giving Claudia write access to my Postgres database is that the policies are the contract. The policies say what she is allowed to touch. They are the difference between "I let the agent write to my prod database" and "I let the agent write to four specific tables, only for specific site ids, only with specific shapes." That second sentence is the one I can sleep on.

The lesson, which I will be tripping over again, is that the agent has to be designed into the schema. Not bolted onto it. When I add a new table that the agent will write to, the INSERT policy is part of the migration, not a follow-up. When I add a new site the agent will operate on, the site_agents row is part of the site creation, not a manual fix-up I do at 11 PM the night before the cron fires.

The audit caught a noindex tag that had been invisible for three weeks. The migration caught a policy gap that had been invisible since the day I wrote the table. The audit gets the headline. The migration gets the underline.