Your AGENTS.md Needs a Feedback Loop

The Augment team ran controlled evals across real repositories and found that the same AGENTS.md could boost bug-fix quality by 25% while degrading complex feature work by 30% (same codebase, same module). Same file. A swing of roughly thirty points in either direction, depending on what you asked the agent to do.

Most people read that and ask: what should my AGENTS.md look like? That’s the wrong question. Or at least, it’s not the interesting one. The interesting one is: how does your AGENTS.md update itself?

Every Sunday morning, I run a distill skill across that week’s JSONL session files (tool call logs, assistant turns, and any corrections I issued mid-session). It takes maybe 20 minutes to run and produces a structured output sorted into three buckets.

The first bucket is skill candidates: patterns I repeated by hand twice or more during the week. If I’m copy-pasting the same command sequence into three separate sessions, that’s a skill waiting to be extracted. The second bucket is learnings.md candidates: corrections I had to issue to the agent twice. Not once. Twice. Once might be a one-off. Twice is a pattern. Those go into learnings.md so the agent stops making the same mistake I already corrected. The third bucket is AGENTS.md candidates: routing and discovery failures (instructions the agent ignored or visibly misread during the week).

The taxonomy matters:

  • Skills are repeated procedures.
  • learnings.md is repeated corrections.
  • AGENTS.md is routing and discovery failures.

That separation is the point. A repeated procedure becomes a skill. A repeated correction becomes learnings.md. A routing or discovery failure changes AGENTS.md. Conflate those, and the router turns into a junk drawer.
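
For a sense of what the bucketing actually does, here is a sketch of the pass in Python. It is not my distill skill verbatim: the JSONL field names are assumptions about what your session logs contain, the correction prefixes are a personal convention, and the threshold of two is the only load-bearing number.

```python
# distill_sketch.py — a minimal sketch of the weekly bucketing pass, not the skill itself.
# Assumes each session is a JSONL file whose lines are JSON objects shaped roughly like
# mine (a "message" with role/content, tool_use blocks carrying a Bash command);
# adjust the field names to whatever your agent actually logs.
import json
import time
from collections import Counter
from pathlib import Path


def events(files):
    for path in files:
        for line in path.read_text().splitlines():
            if line.strip():
                yield json.loads(line)


def texts(message):
    """Yield plain-text strings whether content is a bare string or a list of blocks."""
    content = message.get("content", [])
    if isinstance(content, str):
        yield content
    elif isinstance(content, list):
        for block in content:
            if isinstance(block, dict) and block.get("type") == "text":
                yield block.get("text", "")


def distill(files):
    commands = Counter()     # repeated procedures            -> skill candidates
    corrections = Counter()  # repeated corrections           -> learnings.md candidates
    routing = []             # ignored / misread instructions -> AGENTS.md candidates

    for event in events(files):
        message = event.get("message", {})
        content = message.get("content", [])
        for block in (content if isinstance(content, list) else []):
            # A command I ran by hand through the agent is a procedure; twice is a pattern.
            if isinstance(block, dict) and block.get("type") == "tool_use" and block.get("name") == "Bash":
                commands[block.get("input", {}).get("command", "")] += 1
        if message.get("role") == "user":
            for text in texts(message):
                t = text.strip().lower()
                # These prefixes are my own correction convention, not part of any log format.
                if t.startswith(("no,", "actually", "stop,")):
                    corrections[t[:120]] += 1
                # A correction that names AGENTS.md is usually a routing or discovery failure.
                if "agents.md" in t and ("ignored" in t or "missed" in t):
                    routing.append(t[:120])

    return {
        "skill_candidates":     [c for c, n in commands.items() if n >= 2],
        "learnings_candidates": [c for c, n in corrections.items() if n >= 2],
        "agents_md_candidates": routing,
    }


if __name__ == "__main__":
    cutoff = time.time() - 7 * 86400  # this week's sessions only
    week = [p for p in Path("~/.claude/projects").expanduser().glob("**/*.jsonl")
            if p.stat().st_mtime >= cutoff]
    print(json.dumps(distill(sorted(week)), indent=2))
```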

My first AGENTS.md was around 2,000 lines. I thought thoroughness was a virtue. Augment’s research confirms it isn’t: their analysis found that architecture overviews caused agents to load roughly 80,000 tokens of irrelevant context, cutting task completeness by 25%. Excessive warnings without matching solutions doubled task duration. Bigger isn’t more careful. It’s just slower and wrong in more directions at once.

My current AGENTS.md is around 150 lines. It’s structured as a router. The heavy lifting is in the skills: specific procedures that have survived real use. learnings.md catches the corrections. AGENTS.md points at the right doc for the right shape of task. Augment’s finding that 100-150 line files with focused references delivered 10-15% improvements across all metrics lines up with what I landed on, though I got there through iteration rather than evals. My evidence is anecdotal. The friction dropped. I can’t quantify the lift.
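
If you want a mechanical check that the file stays a router, something like the guard below works. The 150-line budget and the backtick-path convention are arbitrary choices of mine, not anything from the Augment study.

```python
# agents_md_guard.py — a sketch of a guard that could run by hand or as a pre-commit hook.
import re
import sys
from pathlib import Path

AGENTS = Path("AGENTS.md")
MAX_LINES = 150  # my router budget; pick your own

def main() -> int:
    text = AGENTS.read_text()
    problems = []

    if len(text.splitlines()) > MAX_LINES:
        problems.append(f"AGENTS.md is {len(text.splitlines())} lines; router budget is {MAX_LINES}")

    # A stale reference is exactly the discovery failure the router exists to prevent.
    for ref in re.findall(r"`([\w./-]+\.(?:md|py|sh))`", text):
        if not Path(ref).exists():
            problems.append(f"referenced file missing: {ref}")

    print("\n".join(problems) if problems else "AGENTS.md still looks like a router")
    return 1 if problems else 0

if __name__ == "__main__":
    sys.exit(main())
```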

Garry Tan has described a YC-scale version of this architecture: six thousand founder profiles, post-event NPS surveys feeding a self-rewriting matching skill. His is institutional and event-driven (a survey closes, the skill updates). Mine is personal and time-driven: Sunday rolls around, the distill skill runs. In both cases, the system turns observed failures into updated operating instructions. Same architectural shape. Different surface area and genuinely different stakes. At YC, a bad match wastes a founder’s time and shows up in NPS. At my desk, a bad AGENTS.md just means I correct the agent more often that week. The loop makes sense at both scales. What changes is whether the feedback signal is a survey or your own JSONL files.

Some people run distillation on a hook at session close. Reasonable approach: catch the pattern while it’s fresh. I haven’t moved to that yet. The Sunday cadence gives me a beat to actually read what surfaces before deciding if it belongs. There’s a difference between a pattern the agent should internalize and a pattern that only made sense in one session. The human in the loop matters when what’s being edited is the agent’s own instructions. I’d rather run the review once a week and make the call deliberately than have the loop auto-commit on close. If distillation surfaces a false positive and I wave it through, a bad instruction lands directly in AGENTS.md and compounds every session afterward, until I notice the agent doing something subtly wrong and trace it back to the source. The Sunday review is the check on that.
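
Mechanically, that means the distill output only ever stages candidates; promoting them into AGENTS.md or learnings.md is the Sunday decision. A sketch of that staging step, reusing the buckets from the earlier sketch (the report layout is illustrative):

```python
import datetime as dt
from pathlib import Path


def write_report(buckets: dict, out_dir: Path = Path("distill-reports")) -> Path:
    """Stage the week's candidates for manual review; never touch AGENTS.md directly."""
    out_dir.mkdir(exist_ok=True)
    report = out_dir / f"{dt.date.today().isoformat()}.md"
    lines = []
    for title, items in [
        ("Skill candidates", buckets.get("skill_candidates", [])),
        ("learnings.md candidates", buckets.get("learnings_candidates", [])),
        ("AGENTS.md candidates", buckets.get("agents_md_candidates", [])),
    ]:
        lines.append(f"## {title}")
        lines.extend(f"- {item}" for item in (items or ["(none this week)"]))
        lines.append("")
    report.write_text("\n".join(lines))
    return report
```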

Augment ran their evals across dozens of repositories with golden PR comparisons. I have one system and no control group. I can tell you the workflow feels better. I cannot tell you the size of the lift. If you’ve got session data and want to compare notes, I want to talk.

The Augment piece names the specific failure modes: architecture overviews that bloat context, warnings without matching solutions, and the discovery gap (nested READMEs get found 40% of the time; orphaned _docs/ folders under 10%). Worth reading for those alone.

The ±30 point swing exists whether or not you run a distillation loop. It exists because AGENTS.md is not a document. It’s a control surface for agent behavior.

And control surfaces drift.

The file you wrote six months ago reflects the failures you understood six months ago. The tasks changed. The model changed. Your workflows changed. The mistakes changed.

The quality of your AGENTS.md depends on how tightly it stays connected to recent failures. The maintenance loop is not administrative overhead. The maintenance loop is the product.