Red-Teaming My Own Agent

I hired two AIs to break my agent framework.

Not consultants. Not penetration testers. Two adversarial AI personas — a Security Expert and a Devil’s Advocate — reviewing the entire NanoClaw codebase simultaneously. One running standard pen-test methodology, the other actively trying to subvert anything the first one called “secure.”

The brief was simple: find ways to escape container isolation, leak credentials, access data across groups, persist malicious behavior. Everything you’d expect from an actual security audit, except faster, cheaper, and with the bonus of watching them argue about Docker namespace semantics in real time.

I expected to find holes. I’d been building this system for weeks, adding security measures as I went, but never stepping back to ask whether they actually formed a coherent defense. That’s what the audit was for.

What the Audit Validated

The audit confirmed 20 security practices baked into the architecture. I’m going to walk through them in groups, because individually they’re just checklist items. Together they tell a story about how credentials move (or don’t move) through the system.

Credential handling

This is where the audit got most interested. Most agent frameworks hand the AI an environment variable with the API key and move on. NanoClaw has 4 layers between a secret and an agent’s ability to use it for anything unintended.

Selective credential passing. Only the Claude API keys are extracted from the environment. Other keys (GitHub, OpenAI, OpenRouter) never reach containers at all.
Secrets via stdin, not files. Credentials are passed via stdin and deleted from the input object before logging. Never written to mounted files.
SDK env separation. Secrets are loaded into a separate env object, not the main process environment. Bash subprocesses can’t see them.
Bash secret stripping. Every Bash command gets prefixed with an unset hook that scrubs API keys before execution.
Log redaction. Regex patterns catch Anthropic, GitHub, OpenAI, Slack, and Bearer tokens in all log output.
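
To make the log redaction item concrete, here is a minimal Python sketch of regex-based scrubbing. It is not NanoClaw's actual code: the patterns are illustrative approximations of common token shapes, and the names REDACTION_PATTERNS and redact are hypothetical.

```python
import re

# Illustrative patterns only (assumptions); real token formats vary by provider.
REDACTION_PATTERNS = [
    re.compile(r"sk-ant-[A-Za-z0-9_\-]{8,}"),        # Anthropic-style keys
    re.compile(r"gh[pousr]_[A-Za-z0-9]{20,}"),       # GitHub-style tokens
    re.compile(r"sk-[A-Za-z0-9_\-]{20,}"),           # OpenAI-style keys
    re.compile(r"xox[abprs]-[A-Za-z0-9\-]{10,}"),    # Slack-style tokens
    # Keep the word "Bearer" and redact only the token that follows it.
    re.compile(r"(?<=bearer )[A-Za-z0-9._~+/\-]+=*", re.IGNORECASE),
]


def redact(line: str) -> str:
    """Replace anything that looks like a credential before the line is logged."""
    for pattern in REDACTION_PATTERNS:
        line = pattern.sub("[REDACTED]", line)
    return line
```

Pattern matching like this only catches token shapes it knows about, which is why it works best as a last line of defense rather than the only one.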

Selective extraction → stdin transit → SDK env separation → Bash unset hook, with log redaction as a backstop for anything that still ends up in output. Each layer defends against a different vector. I didn’t design this as a 4-layer system. I just kept asking “how could this leak?” and adding a fix each time. The audit was the first time I saw them as layers.
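
The four layers can be sketched in a few small functions. This is an illustration under assumptions, not the framework's real implementation: the variable names in ALLOWED_SECRET_VARS, the payload shape, and every function name here are hypothetical.

```python
import json
import shlex

# Assumption: which variables count as "the Claude API keys" is a guess.
ALLOWED_SECRET_VARS = ("ANTHROPIC_API_KEY", "CLAUDE_CODE_OAUTH_TOKEN")


def extract_secrets(host_env: dict) -> dict:
    """Layer 1: copy only allowlisted keys; everything else stays on the host."""
    return {k: host_env[k] for k in ALLOWED_SECRET_VARS if k in host_env}


def build_stdin_payload(prompt: str, secrets: dict) -> str:
    """Layer 2: secrets travel in the stdin JSON, never in a mounted file."""
    return json.dumps({"prompt": prompt, "secrets": secrets})


def loggable(payload: dict) -> dict:
    """Return a copy of the payload without the secrets, safe to log."""
    return {k: v for k, v in payload.items() if k != "secrets"}


def sdk_env(base_env: dict, secrets: dict) -> dict:
    """Layer 3: build a separate env mapping for the SDK; base_env is untouched."""
    return {**base_env, **secrets}


def with_unset_prefix(command: str) -> str:
    """Layer 4: scrub the secret variables before the agent's command runs."""
    names = " ".join(shlex.quote(name) for name in ALLOWED_SECRET_VARS)
    return f"unset {names} 2>/dev/null; {command}"
```

Each function fails differently if it is missing: drop layer 1 and unrelated keys reach the container, drop layer 4 and a Bash subprocess that somehow inherits the variables can echo them.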

Isolation boundaries

Each agent group (my personal assistant, the fitness tracker, the news briefing agent, the blog writer) runs in its own container. The audit validated that the walls between them actually hold.

Per-group filesystem isolation. Each group only sees its own folder, sessions, and IPC directory. No cross-group mounts.
Per-group session isolation. Claude conversation sessions are stored per-group. Groups can’t read each other’s history.
IPC authorization. Non-main groups can only send messages to their own chat and schedule tasks for themselves.
Per-group tool allowlists. Every MCP tool checks whether it’s allowed for the calling group. Denied tools return errors, not silent failures.
Task visibility filtering. Non-main groups only see their own scheduled tasks, not other groups’.
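
The authorization rules above boil down to a handful of checks keyed on the calling group. A minimal sketch, assuming a group named "main" is the admin and using made-up tool names and allowlists:

```python
from dataclasses import dataclass

MAIN_GROUP = "main"  # hypothetical identifier for the admin group

# Hypothetical per-group tool allowlists.
TOOL_ALLOWLIST = {
    "main": {"send_message", "schedule_task", "register_group"},
    "fitness": {"send_message", "schedule_task"},
}


@dataclass(frozen=True)
class Task:
    id: int
    group: str


class ToolDenied(Exception):
    """Raised so a denied call surfaces as an error, not a silent no-op."""


def check_tool(group: str, tool: str) -> None:
    """Reject any tool that is not on the calling group's allowlist."""
    if tool not in TOOL_ALLOWLIST.get(group, set()):
        raise ToolDenied(f"{tool} is not allowed for group {group!r}")


def can_send(source_group: str, target_group: str) -> bool:
    """Non-main groups may only message their own chat."""
    return source_group == MAIN_GROUP or source_group == target_group


def visible_tasks(group: str, tasks: list[Task]) -> list[Task]:
    """Non-main groups only see tasks they own."""
    if group == MAIN_GROUP:
        return list(tasks)
    return [t for t in tasks if t.group == group]
```

An unknown group falls through to an empty allowlist, so the default is deny.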

Defense-in-depth

The “what if the first layer fails” practices. Some of these are belt-and-suspenders, and that’s the point.

Sensitive file overlay. .env and config files are overlaid with /dev/null for the main group. The agent reads an empty file instead of real secrets.
Container-side file blocking. Regex patterns block .env variants, SSH keys, GPG, AWS credentials, password managers, browser data.
External mount allowlist. The allowlist lives outside the project root, inaccessible to containers. Tamper-proof from the agent’s perspective.
Symlink resolution in mounts. Symlinks are resolved before validating against allowed roots. No escape via symlink chaining.
Ephemeral containers. The --rm flag removes each container when it exits, so nothing written to the container’s own filesystem carries over to the next run.
Non-root execution. Containers run as a non-root user.
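
Two of these are easy to show in code: resolving symlinks before checking a mount against the allowlist, and assembling a docker run command that is ephemeral, non-root, and shadows sensitive files with /dev/null. A sketch under assumptions: the function names, the /workspace mount point, and the 1000:1000 user are all hypothetical.

```python
import os
from pathlib import Path


def is_mount_allowed(requested: str, allowed_roots: list[str]) -> bool:
    """Resolve symlinks first, then check containment against the allowed roots."""
    real = Path(os.path.realpath(requested))
    for root in allowed_roots:
        real_root = Path(os.path.realpath(root))
        if real == real_root or real_root in real.parents:
            return True
    return False


def docker_run_args(image: str, project_root: str, shadowed: list[str]) -> list[str]:
    """Build a `docker run` argv: ephemeral, non-root, sensitive files shadowed."""
    args = [
        "docker", "run", "--rm", "-i",
        "--user", "1000:1000",              # assumed non-root uid:gid
        "-v", f"{project_root}:/workspace",
    ]
    for rel in shadowed:
        # Bind /dev/null over the file so reads inside the container come back empty.
        args += ["-v", f"/dev/null:/workspace/{rel}:ro"]
    return args + [image]
```

Checking the resolved path rather than the requested one is what defeats symlink chaining: a link that sits inside an allowed root but points outside it is judged by where it actually lands.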

Input and runtime hardening

The smaller things that don’t get blog posts written about them but prevent entire classes of attack.

XML escaping. All message content, sender names, and quoted text are escaped before reaching the agent.
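Parameterized SQL. Every query uses prepared statements with placeholders, so input is bound as data and never spliced into SQL text.
Trigger pattern escaping. The assistant name is regex-escaped before it goes into the trigger pattern, so it matches literally and can’t smuggle in regex syntax (including ReDoS-prone patterns).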
Temp file cleanup. Input files containing credentials are deleted after reading.
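
Three of these fit in a few lines each. The sketch below uses Python's standard library to show the general techniques; the message format, the @Name trigger convention, and the messages table are assumptions for illustration, not the project's real schema.

```python
import re
import sqlite3
from xml.sax.saxutils import escape


def format_message(sender: str, text: str) -> str:
    """Escape untrusted fields so they cannot open or close tags in the prompt."""
    quote = {'"': "&quot;"}
    return f'<message sender="{escape(sender, quote)}">{escape(text)}</message>'


def build_trigger(assistant_name: str) -> re.Pattern:
    """Match '@Name' literally at the start; metacharacters in the name are inert."""
    return re.compile(rf"^@{re.escape(assistant_name)}(?!\w)", re.IGNORECASE)


def messages_for_chat(conn: sqlite3.Connection, chat_id: str) -> list[tuple]:
    """The placeholder keeps chat_id as data; it is never part of the SQL text."""
    cur = conn.execute(
        "SELECT sender, body FROM messages WHERE chat_id = ?", (chat_id,)
    )
    return cur.fetchall()
```

In all three cases the untrusted value is handed to something that treats it as data (an escaper or a bound parameter) instead of being concatenated into a string that gets parsed later.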

The audit’s verdict: “The architecture is fundamentally sound — container isolation as the primary boundary, layered credential management, per-group scoping.”

BUDDY: For the record, I did not consent to this audit. I also passed it. You’re welcome.

What the Audit Surfaced

13 findings. 4 marked critical. I braced myself.

Then I started testing them.

The scariest one was a container symlink escape. The claim: an agent could create a symlink inside its container pointing to / on the host, then read through it to access your entire filesystem. If true, this would blow past every isolation boundary in the system. I sat with that for a minute before I opened a terminal.

I spun up a test container, created the symlink, tried to read through it. Got the container’s own files back, not the host’s. Docker namespace isolation means the kernel resolves symlink targets within the container’s mount namespace. The symlink points to / — but it’s the container’s /, not mine. The attack doesn’t work.

That set the pattern. API keys in git history? Not there (.env is gitignored, verified with git log --all --full-history). IPC symlink attacks? Same namespace barrier. Mount overlay ordering tricks? Docker resolves overlapping mounts by specificity, not declaration order.

The accepted-by-design findings were about tradeoffs I’d already made. The main group has read-write access to the project root (it’s the admin, that’s intentional). The agent can discover the Claude API token via env (it needs that token to function, and it’s already mitigated with ephemeral containers and stdin delivery).

12 findings resolved. One was real.

The One Real Fix

The SQLite database containing all chat history across every group was accessible to the main group’s agent.

Not urgent. The main group is trusted. Single-user system. But defense-in-depth says the agent shouldn’t be able to read cross-group chat history when it doesn’t need to. The agent had IPC tools for everything it actually used the database for (group registration, JID lookup, task scheduling). Direct database access was a leftover from before the IPC system existed.

I fixed it anyway:

Added the database to the /dev/null overlay list: a hard block at the Docker mount level that the agent can’t undo from inside the container
Wired up dormant file-blocking hooks into the Claude SDK’s PreToolUse events for Read, Write, and Edit — soft block as defense-in-depth
Cleaned up the agent’s instructions to use IPC workflows instead of direct SQL
Removed a dead reference to a config file that no longer exists
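
The soft block amounts to a path check that runs before a file tool does. Here is a minimal sketch of that check, with the SDK wiring left out because it is specific to the hook API. The deny patterns are illustrative, should_block is a hypothetical name, and the file_path input key is an assumption about how the file tools receive their target.

```python
import posixpath
import re

# Illustrative deny patterns (assumptions); the real list is not shown in this post.
BLOCKED_PATH_PATTERNS = [
    re.compile(r"(^|/)\.env(\.[^/]*)?$"),     # .env, .env.local, ...
    re.compile(r"(^|/)\.ssh(/|$)"),           # SSH keys and config
    re.compile(r"(^|/)\.gnupg(/|$)"),         # GPG keyrings
    re.compile(r"(^|/)\.aws/credentials$"),   # AWS credentials
    re.compile(r"\.(db|sqlite3?)$"),          # SQLite database files
]

GUARDED_TOOLS = {"Read", "Write", "Edit"}


def should_block(tool_name: str, tool_input: dict) -> bool:
    """Return True when a guarded file tool targets a blocked path."""
    if tool_name not in GUARDED_TOOLS:
        return False
    # Normalize first so "a/../.env" and "./.env" are judged by where they point.
    path = posixpath.normpath(str(tool_input.get("file_path", "")))
    return any(p.search(path) for p in BLOCKED_PATH_PATTERNS)
```

A check like this only covers the tools it guards. A shell command that reads the same file would sail past it, which is why the mount-level overlay is the hard block and this is the backup.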

The agent now has access to exactly what it needs. Nothing more.

Was it urgent? No. But you build defenses when you don’t need them so they’re there when you do.