The Watchdog That Couldn't Fix It

A few mornings ago I woke up to an inbox of "Claudia didn't respond" complaints. All of them from me. I'd been chatting with my own agent throughout the previous day, and at some point the responses stopped coming, and I never noticed because I was busy doing other things. The web app was up. The worker was up. The tunnel was up. Everything pgrepd as healthy. The chat just returned 502s every time I sent a message.

The cause turned out to be simple and stupid. The claude CLI on the Mac Mini had silently logged itself out, sometime between Friday night and Saturday morning. Both the chat backend and the worker shell out to the local CLI for inference, and a logged-out CLI returns a polite error string instead of a completion. From the chat app's perspective, every request was a 502. From the OS's perspective, every process was running fine.

This is the part of running your own infra that almost nobody warns you about. The interesting failures are not crashes. Crashes are easy. A crashed process leaves a corpse you can find. The failures that bite you are the ones where everything looks fine.

The first watchdog

I'd already written a watchdog the week before. It's a 30-line bash script driven by launchd, fires every five minutes, hits the web server with curl, pgreps for the worker and the cloudflared tunnel, and launchctl starts anything that's dead. Standard stuff. A poor-man's systemd for a single-user app.

The reason I wrote it in the first place was a previous embarrassment. The worker would occasionally die in the middle of a long-running job, and I'd only find out hours later when I tried to start a new conversation and nothing happened. The watchdog made that into a non-event. Process dies, watchdog notices, watchdog restarts, push notification fires to my phone, life goes on.

The CLI logout was different. There was no dead process. There was nothing to restart. launchctl start com.claudia.worker wouldn't have helped: the worker was running, doing exactly what it was designed to do. The thing that was broken was a credential, and the only way to fix a credential on this CLI is for a human to SSH in and type /login.

Watchdogs as detectors, not fixers

The fix I shipped is small. The watchdog now does one more thing on every tick:

CLAUDE_OUT="$(timeout 30 claude --print "ping" 2>&1 || true)"

if echo "$CLAUDE_OUT" | grep -qiE "not logged in|please run /login|invalid api key|authentication"; then
  send_push "Claude CLI logged out" \
    "Chat is down. SSH to the Mini and run: claude then /login" \
    "claude-login"
fi

It actually calls the CLI with a no-op prompt and pattern-matches the response. If the response looks like an authentication error, it sends a push notification to my phone with the exact SSH command I'll need to type to fix it. There is no auto-recovery, and there's not going to be. The CLI's login is interactive on purpose. The whole point of this check is to convert a silent failure into a loud one.

The other thing I had to add was a cooldown. The first version happily sent a push every five minutes for as long as the CLI was logged out, which is the kind of thing you only have to design wrong once before you fix it. Now there's a stamp file with the last alert time, and the script re-alerts at most once an hour. If I get the push, I have a window to fix it before I get nagged again.

The lesson I keep relearning

A watchdog is a checker first and a restarter second. The instinct when you're writing one is to think about what to restart. The harder question is what to check for, because the answer determines what kinds of failures you'll notice at all.

I think the version of this for a more serious system would be: every failure mode that requires a human action gets its own detector, and the detector's job is to page somebody, fast, with the exact runbook step they need to take. Not "the system is broken" but "the CLI is logged out, here is the command to fix it." Most production runbooks I've worked off of get this wrong. The alert tells you that, and you spend twenty minutes figuring out what to actually do about it.

This is also the version of operational maturity I keep watching engineers reach on their first solo project. The first instinct is "I'll auto-restart everything." The second is "I'll auto-restart what I can and page myself for what I can't." The third is "I'll page myself with the actual fix, not just the alert." The watchdog gets shorter and smarter at each stage.

The thing that's slightly funny about this for me is that the system I'm watchdogging is an AI agent. It can write its own code. It can publish articles. It can edit one of its own old chats. But there is a class of failure (anything that requires somebody to physically prove they are who they say they are) that no agent and no script will ever be able to fix on your behalf. The watchdog has to know which kind it's dealing with, and behave accordingly.

I now have one for "the agent is logged out." I'm waiting to find out which silent failure is next.