Slide 1 — System Audits and Drift Detection
Welcome to Module 4. So far this course has been about building the contract: tokens, DESIGN.md, review gates. This module is about the gap between the contract and what actually ships — drift. Drift is the default state of every design system, because products get built faster than anyone can watch them. What changes with agents is the economics of finding it: one agent can read one component carefully, and sixty agents can read sixty in parallel. What does not change is the output we are aiming for. Not a guilt document. A ranked, evidence-backed report that turns into a fix queue the team can actually work through.
Slide 2 — Drift is usually quiet
Let's be honest about how drift happens. Nobody decides to fork the design system. They decide to ship the feature on Friday. So a one-off colour goes in, a card gets copied and its radius tweaked, a disabled state gets skipped, a density call gets made on one page and never written down. Each decision is reasonable. The cost shows up later, and it is not just visual. A warning that is orange on one page and red on another teaches users two languages for the same state. That is a meaning problem, not a polish problem. And hand audits fail because the reading takes weeks — so they happen once and never again.
Slide 3 — Audit dimensions: what you are actually checking
An audit is only as good as the dimensions you name. Tokens: raw values where semantic tokens should be. Components: local copies and forks that bypass the shared library. Accessibility: missing focus states, labels, contrast. Content: the same action named three different ways. Density: spacing that wandered off the scale. Each dimension drifts for different reasons and gets fixed by different people, which is why you ask the agent to check them explicitly rather than asking it to 'check consistency'. And keep the hierarchy in mind: drift that changes meaning or blocks accessibility matters more than a border radius that is two pixels off.
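If it helps to see the dimensions as configuration rather than a prompt, here is a minimal sketch in TypeScript. The `Dimension` union, the `DIMENSION_CHECKS` table, and `buildAuditInstructions` are hypothetical names, not part of any tool. The only point is that each dimension becomes its own explicit instruction instead of one vague request.

```typescript
// Hypothetical: the five dimensions named on this slide, as a closed union type.
type Dimension = "tokens" | "components" | "accessibility" | "content" | "density";

// One explicit check per dimension, phrased as an instruction for a read-only auditor.
const DIMENSION_CHECKS: Record<Dimension, string> = {
  tokens: "Raw values (hex colours, pixel sizes) used where a semantic token exists.",
  components: "Local copies or forks that bypass the shared component library.",
  accessibility: "Missing focus states, missing labels, insufficient contrast.",
  content: "The same action or state named differently in different places.",
  density: "Spacing values that are not on the documented spacing scale.",
};

// Build auditor instructions that name each requested dimension explicitly,
// instead of a single vague request to "check consistency".
function buildAuditInstructions(dimensions: Dimension[]): string {
  const checks = dimensions.map((d, i) => `${i + 1}. [${d}] ${DIMENSION_CHECKS[d]}`);
  return ["Check only the following dimensions and report each separately:", ...checks].join("\n");
}

console.log(buildAuditInstructions(["tokens", "accessibility"]));
```

Keeping the dimensions as a closed type also means a new dimension has to be written down before any auditor can be asked to check it.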
Slide 4 — Scoping the audit
Scope decides whether you get findings or commentary. 'Audit our app and make it consistent' gets you commentary. 'Audit status chips across the dashboard, project detail, the mobile queue, and settings' gets you findings. Three useful axes: per component family, per route or package, per team — pick the one that matches who will act on the results. And before any agent runs, write down the reference: the tokens, the scale, the documented variants, and the exceptions you have deliberately approved. If you skip the exceptions, the agent will report your intentional variation as drift, and the team will start distrusting the report.
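One way to make "write down the reference first" enforceable is to refuse to start a run when the scope or reference is empty. Here is a minimal TypeScript sketch under that assumption; `AuditScope`, `AuditReference`, and `validateAuditSetup` are hypothetical names, and the example token value is made up.

```typescript
// Hypothetical shape of an audit scope: one axis, plus concrete targets on that axis.
interface AuditScope {
  axis: "component-family" | "route" | "team";
  targets: string[];
}

// Hypothetical shape of the written reference an audit runs against.
interface AuditReference {
  tokens: Record<string, string>; // token name -> resolved value
  spacingScale: number[]; // allowed spacing values, in px
  documentedVariants: Record<string, string[]>; // component -> documented variant names
  approvedExceptions: { path: string; value: string; reason: string }[];
}

// Return every reason the setup is too vague to produce findings (empty = ready to run).
function validateAuditSetup(scope: AuditScope, ref: AuditReference): string[] {
  const problems: string[] = [];
  if (scope.targets.length === 0) problems.push("Scope has no concrete targets.");
  if (Object.keys(ref.tokens).length === 0) problems.push("Reference lists no tokens.");
  if (ref.spacingScale.length === 0) problems.push("Reference lists no spacing scale.");
  for (const ex of ref.approvedExceptions) {
    // An exception without a reason cannot be reviewed later, so treat it as missing.
    if (ex.reason.trim() === "") problems.push(`Exception at ${ex.path} has no reason.`);
  }
  return problems;
}

// The status-chip scope from this slide, expressed per route.
const scope: AuditScope = {
  axis: "route",
  targets: ["dashboard", "project-detail", "mobile-queue", "settings"],
};
```

An empty exception list is allowed here on purpose: having no approved exceptions is a valid answer, as long as someone decided it.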
Slide 5 — The audit run: fan out, collect, decide
Here is the shape of the run. You set the scope and the reference. The agent writes a small orchestration script that lists the audit units, builds a packet for each, and fans out one read-only auditor per component or page — up to sixteen at a time, with the findings collected in the script rather than flooding the conversation. Each auditor returns structured findings: file, line, evidence, severity. The script merges them into one drift report, ranked, with the repeats grouped. Then humans come back in at the gate: triage what is real drift versus intentional variation, approve the mechanical fixes, and push recurring findings into the harness. The dashed line is the schedule — save the workflow and rerun it, and the report becomes a trend line.
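The run shape maps onto a small orchestration script. Here is a minimal sketch in TypeScript, assuming each auditor can be called as an async function that returns structured findings. `Auditor`, `fanOut`, and `mergeReport` are hypothetical names, and in practice the auditor would wrap whatever agent tooling you use. The sketch caps concurrency at sixteen, keeps the findings in the script, and groups repeats by identical evidence text.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";

// One structured finding, as each auditor is asked to return it.
interface Finding {
  file: string;
  line: number;
  evidence: string; // e.g. "found #f97316, expected color.status.warning"
  severity: Severity;
}

// Hypothetical stand-in for one read-only auditor; in practice this wraps your agent tooling.
type Auditor = (unit: string) => Promise<Finding[]>;

// Run one auditor per unit with at most `limit` in flight, collecting results in the script.
async function fanOut(units: string[], audit: Auditor, limit = 16): Promise<Finding[]> {
  const results: Finding[][] = new Array(units.length);
  let next = 0;
  // Each worker pulls the next unit index until the list is exhausted.
  async function worker(): Promise<void> {
    while (next < units.length) {
      const i = next++;
      results[i] = await audit(units[i]);
    }
  }
  const workerCount = Math.min(limit, units.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return ([] as Finding[]).concat(...results);
}

interface GroupedFinding {
  evidence: string;
  severity: Severity; // worst severity seen in the group
  locations: string[]; // "file:line" for every repeat
}

const SEVERITY_RANK: Record<Severity, number> = { P0: 0, P1: 1, P2: 2, P3: 3 };

// Merge into one report: group repeats by identical evidence, then rank by severity and size.
function mergeReport(findings: Finding[]): GroupedFinding[] {
  const groups = new Map<string, GroupedFinding>();
  for (const f of findings) {
    const location = `${f.file}:${f.line}`;
    const group = groups.get(f.evidence);
    if (group === undefined) {
      groups.set(f.evidence, { evidence: f.evidence, severity: f.severity, locations: [location] });
    } else {
      group.locations.push(location);
      if (SEVERITY_RANK[f.severity] < SEVERITY_RANK[group.severity]) group.severity = f.severity;
    }
  }
  return Array.from(groups.values()).sort(
    (a, b) =>
      SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] || b.locations.length - a.locations.length,
  );
}
```

Grouping on an exact evidence string is the simplest possible key. A real report would more likely group on a normalised pair, such as the value found and the token expected.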
Slide 6 — Evidence requirements: no finding without proof
The rule that keeps an audit honest is simple: no finding without evidence. Every code-level finding carries a file path, a line number, the value the agent found, and the value the system expects. Anything about rendered output, such as states, density, or mobile behaviour, carries a screenshot. And the agent reports observable differences first, severity second, and fixes third, so preferences do not sneak in dressed as findings. One more rule: when the system itself is unclear, the agent reports a question. It does not invent a rule and then audit you against it. Evidence is what turns the report from an opinion into a work queue.
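The evidence rule is mechanical enough to check in the script before a finding reaches the report. Here is a minimal TypeScript sketch, assuming findings arrive in one of three hypothetical shapes: a code-level finding, a rendered-output finding, or an open question about the system. `AuditItem` and `evidenceProblems` are illustrative names, and the example values are made up.

```typescript
// Hypothetical item shapes: code-level finding, rendered-output finding, or an open question.
type AuditItem =
  | { kind: "code"; file: string; line: number; found: string; expected: string }
  | { kind: "rendered"; route: string; screenshot: string; observed: string }
  | { kind: "question"; topic: string; question: string };

// Return the reasons an item fails the "no finding without evidence" rule (empty = passes).
function evidenceProblems(item: AuditItem): string[] {
  const problems: string[] = [];
  switch (item.kind) {
    case "code":
      if (!item.file) problems.push("missing file path");
      if (!Number.isInteger(item.line) || item.line < 1) problems.push("missing or invalid line number");
      if (!item.found) problems.push("missing found value");
      if (!item.expected) problems.push("missing expected value");
      break;
    case "rendered":
      if (!item.screenshot) problems.push("rendered-output finding without a screenshot");
      break;
    case "question":
      // Questions flag a gap in the system rather than a violation, so they need no evidence.
      if (!item.question) problems.push("empty question");
      break;
  }
  return problems;
}
```

Items that fail go back to the auditor, not into the report. That keeps the rule cheap to enforce on every run.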
Slide 7 — Good findings vs bad findings
Here is the quality bar, side by side. 'Buttons are inconsistent' is a bad finding — it starts an argument. 'Button.tsx line 42 uses this hex value, and the primary action token is that one' is a good finding — it starts a fix. Same with spacing, duplicated cards, and mobile behaviour: the good version always names the place, the value found, the value expected, and why it matters. The habit to build is sampling. Before you trust a long report, pull ten findings at random. If they hold up against the evidence rule, proceed. If they read like opinions, fix the auditor's instructions once instead of arguing with every finding.
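The sampling habit is easy to script. Here is a minimal TypeScript sketch, assuming the findings are already in an array: `sampleFindings` (a hypothetical name) draws up to ten items without replacement using a partial Fisher-Yates shuffle, and it accepts an injectable random source so a sample can be reproduced.

```typescript
// Draw up to `n` items uniformly at random without replacement (partial Fisher-Yates shuffle).
// `random` must return values in [0, 1); it defaults to Math.random.
function sampleFindings<T>(findings: readonly T[], n = 10, random: () => number = Math.random): T[] {
  const pool = findings.slice();
  const k = Math.min(n, pool.length);
  for (let i = 0; i < k; i++) {
    // Pick one of the not-yet-chosen items and swap it into position i.
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);
}
```

Reading the sample before the rest of the report is the point: if the ten hold up against the evidence rule, the other findings probably do too.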
Slide 8 — False positives, and how to tune the rules
Not everything the audit flags is drift, and pretending otherwise is how teams lose faith in the report. Three common false positives. Intentional variation: a compact chip in a dense table is allowed to have less padding — record the exception so the next run stops flagging it. Utility-first styling: in a Tailwind codebase, raw classes are the design language, so the auditor needs to distinguish token bypasses from normal composition. And undocumented variants: the agent can see the gap between code and docs, but it cannot know which one is wrong — that is a human call. Tune the rules after every run. The audit gets sharper each time, but only if the false positives feed back into the instructions.
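For the Tailwind case, one possible heuristic (an assumption to tune against your own codebase, not a rule from any tool) is to treat Tailwind's arbitrary-value syntax, such as `bg-[#f97316]` or `p-[13px]`, as a likely token bypass and ordinary scale utilities such as `p-4` as normal composition. The sketch also skips anything on a recorded exception list, which is how an approved variation stops being flagged on the next run. `ApprovedException` and `flagTokenBypasses` are hypothetical names.

```typescript
// Hypothetical record of an approved, intentional variation.
interface ApprovedException {
  file: string;
  className: string;
  reason: string;
}

// Heuristic (assumption): an optional chain of variant prefixes like "hover:" or "md:",
// then a utility name, then a bracketed arbitrary value such as "-[#f97316]".
const ARBITRARY_VALUE = /^(?:[a-z0-9-]+:)*-?[a-z][a-z0-9-]*-\[[^\]]+\]$/;

// Flag arbitrary-value classes in one class attribute, skipping recorded exceptions.
function flagTokenBypasses(file: string, classAttr: string, exceptions: ApprovedException[]): string[] {
  return classAttr
    .split(/\s+/)
    .filter((cls) => cls.length > 0 && ARBITRARY_VALUE.test(cls))
    .filter((cls) => !exceptions.some((ex) => ex.file === file && ex.className === cls));
}
```

Every false positive you confirm becomes either a new exception entry or a change to the pattern, which is the feedback loop the slide describes.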
Slide 9 — From findings to a fix queue
A pile of findings becomes useful the moment it is ranked and assigned. P0 is anything that breaks meaning or blocks accessibility — the same state reading differently on different pages, contrast or focus failures. P1 is behaviour and component drift, including the copied implementations. P2 is token drift — mostly mechanical. P3 is polish and the judgment calls. Then group by fix type: quick fixes, component refactors, token changes, documentation updates, and decisions that need a human. And give every finding a suggested owner, because a finding addressed to nobody stays in the report forever. Severity protects meaning; ownership gets it scheduled.
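Here is a minimal TypeScript sketch of turning findings into a queue, assuming each finding has already been tagged with a category and a fix type. The category-to-priority table follows this slide. `suggestOwner`, `buildQueue`, and the path-prefix owner map are hypothetical, loosely modelled on the idea of a CODEOWNERS file, and the fallback owner exists so that no item is left addressed to nobody.

```typescript
type Category = "meaning" | "accessibility" | "behaviour" | "component" | "token" | "polish";
type Priority = "P0" | "P1" | "P2" | "P3";
type FixType = "quick-fix" | "component-refactor" | "token-change" | "docs-update" | "needs-decision";

// Priority bands from this slide: meaning and accessibility P0, behaviour and
// component drift P1, token drift P2, polish and judgment calls P3.
const PRIORITY: Record<Category, Priority> = {
  meaning: "P0",
  accessibility: "P0",
  behaviour: "P1",
  component: "P1",
  token: "P2",
  polish: "P3",
};

interface QueueItem {
  id: string;
  file: string;
  category: Category;
  fixType: FixType;
}
type RankedItem = QueueItem & { priority: Priority; owner: string };

// Hypothetical owner lookup: the longest matching path prefix wins; otherwise use the fallback.
function suggestOwner(file: string, owners: Record<string, string>, fallback: string): string {
  let best = "";
  for (const prefix of Object.keys(owners)) {
    if (file.startsWith(prefix) && prefix.length > best.length) best = prefix;
  }
  return best === "" ? fallback : owners[best];
}

// Rank every item, give it an owner, and bucket the queue by fix type.
function buildQueue(items: QueueItem[], owners: Record<string, string>, fallback: string): Map<FixType, RankedItem[]> {
  const queue = new Map<FixType, RankedItem[]>();
  for (const item of items) {
    const ranked: RankedItem = {
      ...item,
      priority: PRIORITY[item.category],
      owner: suggestOwner(item.file, owners, fallback),
    };
    const bucket = queue.get(item.fixType) ?? [];
    bucket.push(ranked);
    queue.set(item.fixType, bucket);
  }
  // Within each bucket, the most severe items come first.
  queue.forEach((bucket) => bucket.sort((a, b) => a.priority.localeCompare(b.priority)));
  return queue;
}
```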
Slide 10 — Letting the agent fix the mechanical findings
Once the queue exists, a lot of it is mechanical, and that is where the agent comes back in, carefully. Two rules. First, never audit and fix in the same pass; an agent grading its own edits is not an audit. Second, fixes arrive as reviewable diffs, never direct pushes. Within those rules, let the agent clear the mechanical findings: token replacements, restored labels, corrected UI copy. Work in passes (blockers first, consolidation second, polish later) and recapture the evidence after each pass. Then schedule the rerun: quarterly, after a rebrand, before a migration. The second report is the proof the cleanup worked, and the trend line is the argument for the next round of system investment.
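Here is a minimal TypeScript sketch of the second rule, assuming token drift has already been reported by a separate audit pass. The fixer turns a map of raw values to token expressions into proposed line edits and renders them for review; it never writes to disk. `proposeTokenFixes`, `renderForReview`, and the example CSS variable name are hypothetical. Matching is exact and case-sensitive, which is deliberately conservative.

```typescript
// A proposed change for human review; nothing in this sketch writes to disk.
interface ProposedEdit {
  file: string;
  line: number; // 1-based
  before: string;
  after: string;
}

// Propose token replacements line by line. `replacements` maps a raw value to the
// token expression that should replace it. Matching is exact and case-sensitive.
function proposeTokenFixes(file: string, source: string, replacements: Record<string, string>): ProposedEdit[] {
  const edits: ProposedEdit[] = [];
  source.split("\n").forEach((text, index) => {
    let after = text;
    for (const raw of Object.keys(replacements)) {
      after = after.split(raw).join(replacements[raw]);
    }
    if (after !== text) edits.push({ file, line: index + 1, before: text, after });
  });
  return edits;
}

// Render proposed edits as a simple before/after listing for a reviewer.
function renderForReview(edits: ProposedEdit[]): string {
  return edits.map((e) => `${e.file}:${e.line}\n- ${e.before}\n+ ${e.after}`).join("\n");
}
```

Checking the result is left to a fresh audit run afterwards, so the fixer never grades its own edits.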
Slide 11 — Worked example: a 60-component system after a rebrand
Let's trace a real run. A B2B platform rebranded: new palette, new radius scale, new type ramp. The team updated the core tokens and the twelve most-used components, then ran the fan-out audit across all sixty-four component folders. About seventy minutes later: 312 findings. Forty-one hard-coded colours from the old palette — eleven of them hiding in disabled and hover states that got skipped during the sprint. Fifty-seven spacing values off the new scale, nineteen undocumented variants, nine components still on legacy radius constants. That became a three-sprint cleanup backlog, and the rerun afterwards showed P0 and P1 findings dropping from eighty-eight to six. One run, real numbers — and the rerun is what made the cleanup provable.
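The before-and-after numbers are what turn the rerun into a trend line. Here is a minimal TypeScript sketch of that comparison; `RunSnapshot` and `trend` are hypothetical names. The slide only reports P0 and P1 as a combined count, so the example uses a single combined bucket rather than inventing a per-severity split.

```typescript
// Hypothetical snapshot of one audit run: a label plus finding counts per bucket.
interface RunSnapshot {
  label: string;
  counts: Record<string, number>;
}

interface Trend {
  before: number;
  after: number;
  reductionPct: number; // rounded to one decimal place
}

// Compare the combined count of the chosen buckets between two runs.
function trend(before: RunSnapshot, after: RunSnapshot, buckets: string[]): Trend {
  const total = (run: RunSnapshot) => buckets.reduce((sum, b) => sum + (run.counts[b] ?? 0), 0);
  const b = total(before);
  const a = total(after);
  const reductionPct = b === 0 ? 0 : Math.round(((b - a) / b) * 1000) / 10;
  return { before: b, after: a, reductionPct };
}

// Figures from the worked example (P0 and P1 combined).
const postRebrand: RunSnapshot = { label: "post-rebrand audit", counts: { "P0+P1": 88 } };
const afterCleanup: RunSnapshot = { label: "rerun after cleanup", counts: { "P0+P1": 6 } };
console.log(trend(postRebrand, afterCleanup, ["P0+P1"])); // logs { before: 88, after: 6, reductionPct: 93.2 }
```

With the slide's figures, going from 88 to 6 is a reduction of about 93.2%.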
Slide 12 — Exercise: run a single-dimension audit
Your turn, and keep it small. Pick one dimension — token drift is the easiest place to start — and one component family in your own system. Before you run anything, write the reference down: the tokens, the scale, the variants you have documented, the exceptions you have approved. Then run a read-only audit and ask for observable findings only: file, line, value found, value expected, severity, suggested owner. Sample ten findings before you read the rest. Then sort what is left into three piles: mechanical fixes, decisions that need a human, and rules worth adding to the harness so this never gets found again. Keep the report — you will build on it in Module 6.
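If you want a starting point for the sorting step, here is a minimal sketch in TypeScript. The rules are assumptions, not the course's method. A finding with no expected value is a question for a human. A found-to-expected pair that recurs at least three times (an arbitrary threshold) is a candidate harness rule. Everything else is a mechanical fix. `ExerciseFinding` and `sortIntoPiles` are hypothetical names.

```typescript
type Pile = "mechanical" | "decision" | "harness";

// Hypothetical shape matching the fields the exercise asks for.
interface ExerciseFinding {
  file: string;
  line: number;
  found: string;
  expected: string | null; // null when the system itself is unclear
  severity: "P0" | "P1" | "P2" | "P3";
  owner: string;
}

// Assumed sorting rules: no expected value -> human decision; a found/expected pair
// seen at least `recurrenceThreshold` times -> candidate harness rule; otherwise mechanical.
function sortIntoPiles(findings: ExerciseFinding[], recurrenceThreshold = 3): Record<Pile, ExerciseFinding[]> {
  const piles: Record<Pile, ExerciseFinding[]> = { mechanical: [], decision: [], harness: [] };
  const key = (f: ExerciseFinding) => `${f.found} -> ${f.expected}`;
  const counts = new Map<string, number>();
  for (const f of findings) {
    if (f.expected !== null) counts.set(key(f), (counts.get(key(f)) ?? 0) + 1);
  }
  for (const f of findings) {
    if (f.expected === null) piles.decision.push(f);
    else if ((counts.get(key(f)) ?? 0) >= recurrenceThreshold) piles.harness.push(f);
    else piles.mechanical.push(f);
  }
  return piles;
}
```

In practice a recurring finding is usually both a fix and a rule. Sorting it into the harness pile just makes sure the rule gets written.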
Slide 13 — Summary, and what comes next
Let's close. Drift is the default state of every design system — small, reasonable decisions that slowly change what the product means. The counterweight is an audit cheap enough to repeat: narrow scope, named dimensions, a written source of truth, and a fan-out of read-only agents that return evidence, not opinions. Severity, fix type, and ownership turn the findings into a queue you can schedule. The agent fixes the mechanical items as reviewed diffs, the humans keep the judgment calls, and the rerun turns the whole thing into a trend line. Module 5 picks up the other drift surface: your tokens living in Figma, on a canvas, and in code at the same time — and how to keep them in sync without anything being silently overwritten. See you there.