Agentic Design School
Module 4 of 6
45–55 minutes

Design Systems for Agents

System Audits and Drift Detection

Drift is the default state of every design system. This module covers running audits at scale — tokens, components, accessibility, copy — with evidence per finding, and turning audit output into a prioritised fix queue rather than a guilt document.

Duration: 45–55 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Run a multi-dimension audit across a real codebase with an agent.
  • Require evidence — file, line, screenshot — for every finding.
  • Convert audit results into a prioritised, assignable fix queue.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

System Audits and Drift Detection

Design Systems for Agents · Module 4 of 6

  • Drift is the default; audits are the counterweight
  • Audit dimensions, scope, and evidence requirements
  • From findings to a prioritised, assignable fix queue
  • Letting the agent fix the mechanical findings — under review

An audit without evidence is a list of opinions. This module is about making drift visible, provable, and fixable.

Slide notes

Modules 1 to 3 built the system's contract: tokens that carry intent, a DESIGN.md the agent loads every session, and review gates that keep component taste deliberate. This module is about the gap between that contract and what actually ships. Drift is not a failure of discipline; it is what happens when a product is built faster than any one person can watch it. A one-off colour during a deadline, a copied component with a tweaked radius, a focus state that never got ported — none of these feels like a problem in the moment, which is exactly why nobody catches them.

The claim of this module is narrow and practical: agents change the economics of finding drift, not the politics of fixing it. One agent can read one component folder with full attention; sixty agents can read sixty folders in parallel. What used to be a heroic one-off project — weeks of designer and engineer reading time, ending in a slide deck nobody reruns — becomes something a team runs quarterly or after every rebrand.

Set the tone for the output early. The point of an audit is not a guilt document that lists everything wrong with the system. It is a decision artifact: ranked findings, each with evidence, each with a likely owner, feeding a fix plan the team can actually schedule. Everything in this module is in service of that artifact.

Narration for this slide

Welcome to Module 4. So far this course has been about building the contract: tokens, DESIGN.md, review gates. This module is about the gap between the contract and what actually ships — drift. Drift is the default state of every design system, because products get built faster than anyone can watch them. What changes with agents is the economics of finding it: one agent can read one component carefully, and sixty agents can read sixty in parallel. What does not change is the output we are aiming for. Not a guilt document. A ranked, evidence-backed report that turns into a fix queue the team can actually work through.

Slide 2 of 13 · 16:9

Drift is usually quiet

Drift rarely arrives as a redesign. It arrives as small, reasonable decisions that never make it back into the system.

  • A one-off colour or radius added during a deadline
  • A component copied and adjusted instead of extended
  • A focus or disabled state skipped in one implementation
  • A density decision made on one page and never documented

The cost is not just visual inconsistency. Drift changes product meaning: the same state starts looking different in different places.

Slide notes

The examples on the slide are deliberately mundane because that is what real drift looks like. Nobody decides to fork the design system; they decide to ship the feature on Friday. The copied card with a slightly different radius, the warning badge that is orange on one page and red on another, the disabled button that looks like a secondary button — each was a local decision that made sense under pressure.

Push past the aesthetic framing. The deeper cost is meaning: a status colour that differs across pages teaches users two different languages for the same state. A copied modal that omits focus management turns a shortcut into an accessibility regression. A button fork with a different touch target changes how reliable the product feels on mobile. Drift is a product-quality problem wearing a consistency costume.

This is also why hand audits fail. Reading every component folder, every consuming route, and every place a hex value was typed instead of a token takes weeks, so most teams do it once, produce a deck, and never repeat it. The drift continues the day after the deck is presented. The rest of this module is about making the audit cheap enough to repeat.

Narration for this slide

Let's be honest about how drift happens. Nobody decides to fork the design system. They decide to ship the feature on Friday. So a one-off colour goes in, a card gets copied and its radius tweaked, a disabled state gets skipped, a density call gets made on one page and never written down. Each decision is reasonable. The cost shows up later, and it is not just visual. A warning that is orange on one page and red on another teaches users two languages for the same state. That is a meaning problem, not a polish problem. And hand audits fail because the reading takes weeks — so they happen once and never again.

Slide 3 of 13 · 16:9

Audit dimensions: what you are actually checking

Most audits fail because they only inspect one layer. Ask the agent to check each dimension explicitly.

| Dimension | What drifts | Typical evidence |
| --- | --- | --- |
| Tokens | Raw values where semantic tokens should be | File and line with the value found and the expected token |
| Components | Local copies and forks bypassing shared components | Two implementations of the same pattern, props compared |
| Accessibility | Missing focus, labels, contrast, keyboard states | State screenshot plus the missing attribute or style |
| Content | Label case, terminology, button copy varying by page | The same action named three ways across routes |
| Density and layout | Spacing off the scale, padding and rhythm varying | Values found versus the documented spacing scale |

Meaning and accessibility drift outrank token and polish drift. The dimensions exist so severity can reflect that.

Slide notes

The table compresses two framings from the school's audit article: the audit dimensions you scope, and the four layers of drift — meaning, behaviour, implementation, presentation — that explain why the same visible inconsistency can have different causes and different owners. A file search can find raw hex values but cannot tell whether the UI still communicates the right priority; a screenshot review can spot inconsistent cards but miss that two implementations use different components underneath. You need both code-level and rendered-output checks.

Walk one example through the layers. Two status chips look different. If the difference is that escalated renders red on one page and orange on another, that is meaning drift and it is severe. If one chip lacks a disabled state, that is behaviour drift. If one is a local copy that bypasses the shared StatusChip, that is implementation drift, and the fix is consolidation. If only the padding differs, that is presentation drift, and it might even be intentional in a dense table.

The dimensions are also what stop the audit collapsing into vague commentary. 'Buttons are inconsistent' is not a finding. 'The escalated chip uses a raw red on the project detail route while the dashboard uses the semantic warning token' names the dimension, the evidence, and the likely fix path.

Narration for this slide

An audit is only as good as the dimensions you name. Tokens: raw values where semantic tokens should be. Components: local copies and forks that bypass the shared library. Accessibility: missing focus states, labels, contrast. Content: the same action named three different ways. Density: spacing that wandered off the scale. Each dimension drifts for different reasons and gets fixed by different people, which is why you ask the agent to check them explicitly rather than asking it to 'check consistency'. And keep the hierarchy in mind: drift that changes meaning or blocks accessibility matters more than a border radius that is two pixels off.

Slide 4 of 13 · 16:9

Scoping the audit

Do not ask an agent to audit the whole product at once unless the product is tiny. Useful audits start narrow.

  • Per component family: every status chip, every card, every button fork
  • Per route or package: one surface, every component on it
  • Per team: the surfaces one team owns, so findings land with their owner
  • Always name the source of truth: tokens, scales, documented variants, known exceptions

A scope is good when one agent can read every relevant file and state — and fixing the findings would change real product quality.

Slide notes

Scope is where audits are won or lost. 'Audit our app design and make it consistent' produces vague commentary; 'audit status chips across dashboard, project detail, mobile queue, and settings' produces findings. The scope should name the component family or surface, the routes where it appears, the source of truth to compare against, the screenshots or states to inspect, and the allowed output. The agent finds and classifies drift; it does not silently redesign the system while it audits.

The three scoping axes on the slide map to who acts on the result. Per component family is the strongest default because one family used in many places concentrates the most reusable findings. Per route or package suits release-driven work and marketing surfaces. Per team matters in larger organisations because a finding without a plausible owner just sits in the report.

The last bullet is the one teams skip: write the reference down before any agent runs. The token set, the spacing scale, the documented variants, and — critically — the known intentional exceptions. If a designer deliberately approved a compact chip variant for dense tables, say so in the scope, or the agent will dutifully report it as drift. Ambiguity in the reference becomes noise in the findings.
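One way to make the written reference concrete is a small audit packet that every auditor receives. The sketch below is illustrative only: the `AuditPacket` and `ApprovedException` names, the field layout, and the full spacing scale are assumptions (only the 16px and 24px steps, the token value, and the route names come from this module).

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApprovedException:
    """A deliberate variation the audit should not report as drift."""
    component: str  # e.g. "StatusChip"
    prop: str       # the property that is allowed to differ
    context: str    # where the variation is approved
    reason: str


@dataclass
class AuditPacket:
    """Scope plus source of truth, written down before any agent runs."""
    family: str
    routes: list[str]
    tokens: dict[str, str]        # semantic token name -> value
    spacing_scale_px: list[int]   # hypothetical scale for illustration
    documented_variants: list[str]
    exceptions: list[ApprovedException] = field(default_factory=list)

    def is_approved(self, component: str, prop: str, context: str) -> bool:
        """True when a difference matches a recorded intentional exception."""
        return any(
            e.component == component and e.prop == prop and e.context == context
            for e in self.exceptions
        )


packet = AuditPacket(
    family="StatusChip",
    routes=["dashboard", "project detail", "mobile queue", "settings"],
    tokens={"color.action.primary": "#2456C4"},
    spacing_scale_px=[4, 8, 16, 24, 32],
    documented_variants=["default", "compact"],
    exceptions=[
        ApprovedException("StatusChip", "padding", "dense table",
                          "designer-approved compact variant"),
    ],
)
```

With the exception recorded, a padding difference on the compact chip can be filtered out before it reaches the report, while a colour difference in the same context would still be flagged.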

Narration for this slide

Scope decides whether you get findings or commentary. 'Audit our app and make it consistent' gets you commentary. 'Audit status chips across the dashboard, project detail, the mobile queue, and settings' gets you findings. Three useful axes: per component family, per route or package, per team — pick the one that matches who will act on the results. And before any agent runs, write down the reference: the tokens, the scale, the documented variants, and the exceptions you have deliberately approved. If you skip the exceptions, the agent will report your intentional variation as drift, and the team will start distrusting the report.

Slide 5 of 13 · 16:9

The audit run: fan out, collect, decide

One orchestrating script dispatches one read-only auditor per unit, merges their findings, and the team decides at the gate.

Figure: Flow diagram of a design system audit run. A human-defined scope and reference feeds an agent-run fan-out script that dispatches one read-only auditor agent per component or page. Example auditor cards show evidence-backed findings: a raw colour in Button.tsx, an off-scale gap in pricing.tsx, and a duplicated card component. Findings merge into a drift report ranked P0 to P3 with file, line, and evidence per finding, which passes through a human review gate where mechanical fixes are approved as reviewed diffs and recurring findings become harness rules. A dashed line returns from the gate to the scope, showing the workflow rerun on a schedule as a trend line.
Scope and the review gate are human-led; the fan-out, the per-unit auditors, and the report assembly are agent-run. The dashed return line is the schedule: rerunning the saved workflow turns each report into a point on a trend line.

Findings live in the orchestration script, not in the chat. The human sees one merged, ranked report instead of sixty conversations.

Slide notes

Walk the diagram left to right. The scope and reference are human work — the previous slide. The fan-out is an orchestration script the agent writes: it enumerates the audit units with a glob so the list stays accurate as the system grows, builds one audit packet per unit, and dispatches one subagent per component folder or page, in batches. In Claude Code this runs as a dynamic workflow: the intermediate findings live in script variables rather than the main context, up to sixteen agents run concurrently, and an interrupted run resumes rather than starting over. As of June 2026 a single run can dispatch up to a thousand agents, which is more than any design system needs.
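The shape of the fan-out can be sketched in a few lines. This is not how Claude Code dispatches subagents: a thread pool and a `stub_auditor` function stand in for the real dispatch, and every name here is hypothetical. The point is only the structure: enumerate units by glob, cap concurrency, and keep results in script variables.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import Callable

MAX_CONCURRENT = 16  # concurrency ceiling quoted in the notes above


def enumerate_units(root: Path) -> list[Path]:
    """List one audit unit per component folder, rediscovered on every run."""
    return sorted(p for p in root.glob("*") if p.is_dir())


def fan_out(
    units: list[Path],
    auditor: Callable[[Path], list[dict]],
    max_concurrent: int = MAX_CONCURRENT,
) -> dict[str, list[dict]]:
    """Run one auditor call per unit and collect findings keyed by unit name."""
    findings_by_unit: dict[str, list[dict]] = {}
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        # pool.map preserves input order, so zip pairs each unit with its result.
        for unit, findings in zip(units, pool.map(auditor, units)):
            findings_by_unit[unit.name] = findings
    return findings_by_unit


def stub_auditor(unit: Path) -> list[dict]:
    """Hypothetical stand-in for a read-only subagent: lists files, edits nothing."""
    return [{"unit": unit.name, "file": f.name} for f in sorted(unit.glob("*.tsx"))]
```

Because the unit list comes from the file system rather than a hand-maintained list, a component added next quarter is audited without anyone editing the script.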

The per-unit auditor is deliberately narrow: it reads one folder, compares against the reference it was handed, and returns structured findings — file, line, category, severity, evidence, suggested fix — as JSON. Narrow agents produce findings that merge cleanly; broad agents produce essays that do not. Defining the auditor once in .claude/agents/ keeps behaviour consistent across runs.
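A minimal sketch of what "structured findings" could look like on the receiving side, assuming the auditor returns a JSON array of objects. The key names (for example `suggested_fix`) and the rejection messages are assumptions; the six fields themselves are the ones listed above.

```python
import json
from dataclasses import dataclass

SEVERITIES = ("P0", "P1", "P2", "P3")
REQUIRED = ("file", "line", "category", "severity", "evidence", "suggested_fix")


@dataclass(frozen=True)
class Finding:
    file: str
    line: int
    category: str
    severity: str
    evidence: str
    suggested_fix: str


def parse_findings(raw_json: str) -> tuple[list[Finding], list[str]]:
    """Split an auditor's JSON output into valid findings and rejection reasons."""
    accepted: list[Finding] = []
    rejected: list[str] = []
    for i, item in enumerate(json.loads(raw_json)):
        missing = [k for k in REQUIRED if item.get(k) in (None, "")]
        if missing:
            rejected.append(f"finding {i}: missing {', '.join(missing)}")
        elif item["severity"] not in SEVERITIES:
            rejected.append(f"finding {i}: unknown severity {item['severity']!r}")
        else:
            accepted.append(Finding(**{k: item[k] for k in REQUIRED}))
    return accepted, rejected


raw = json.dumps([
    {"file": "Button.tsx", "line": 42, "category": "tokens", "severity": "P1",
     "evidence": "uses #2D6CDF; color.action.primary is #2456C4",
     "suggested_fix": "replace the raw value with color.action.primary"},
    # An essay-style finding with no line number and no fix is rejected.
    {"file": "pricing.tsx", "category": "density", "severity": "P2",
     "evidence": "spacing feels random", "suggested_fix": ""},
])
findings, problems = parse_findings(raw)
```

Rejecting malformed findings at parse time keeps the merge step simple: anything that reaches the report already has a place, a severity, and evidence.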

The right side is where humans return. The merged report ranks findings and surfaces repeats; the review gate triages drift versus intentional variation, approves mechanical fixes, and pushes recurring findings into the harness. The dashed line back to scope is the part that makes this a practice rather than a project: save the orchestration to .claude/workflows/ and rerun it, so each report becomes a point on a trend line.
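The merge step might look something like the following sketch. It assumes each finding is a dict carrying a `rule` identifier so that repeats can be counted; that field, like the function name, is hypothetical.

```python
from collections import Counter

SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}


def merge_report(findings_by_unit: dict[str, list[dict]]) -> dict:
    """Flatten per-unit findings into one ranked list and surface repeated rules."""
    merged = [f for unit in sorted(findings_by_unit) for f in findings_by_unit[unit]]
    merged.sort(key=lambda f: (SEVERITY_ORDER[f["severity"]], f["file"], f["line"]))
    repeats = Counter(f["rule"] for f in merged)
    return {
        "findings": merged,
        # Rules that fire more than once are candidates for a harness rule.
        "repeated_rules": {rule: n for rule, n in repeats.items() if n > 1},
    }


by_unit = {
    "PromoCard": [
        {"file": "PromoCard.tsx", "line": 10, "severity": "P2", "rule": "raw-colour"},
    ],
    "Button": [
        {"file": "Button.tsx", "line": 42, "severity": "P1", "rule": "raw-colour"},
        {"file": "Button.tsx", "line": 7, "severity": "P3", "rule": "label-case"},
    ],
}
report = merge_report(by_unit)
```

Here `report["repeated_rules"]` contains only `raw-colour`, which is the kind of recurring finding the review gate would consider promoting into the harness.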

Narration for this slide

Here is the shape of the run. You set the scope and the reference. The agent writes a small orchestration script that lists the audit units, builds a packet for each, and fans out one read-only auditor per component or page — up to sixteen at a time, with the findings collected in the script rather than flooding the conversation. Each auditor returns structured findings: file, line, evidence, severity. The script merges them into one drift report, ranked, with the repeats grouped. Then humans come back in at the gate: triage what is real drift versus intentional variation, approve the mechanical fixes, and push recurring findings into the harness. The dashed line is the schedule — save the workflow and rerun it, and the report becomes a trend line.

Slide 6 of 13 · 16:9

Evidence requirements: no finding without proof

Every finding the agent reports must be something it can point to. Evidence is what separates an audit from a critique.

  • File path and line number for every code-level finding
  • The exact value found and the expected token or scale value
  • A screenshot for anything about rendered output, states, or responsive behaviour
  • Severity, likely source of drift, and a suggested fix owner
  • Uncertainty reported as a question — never invented as a system rule

Ask for observable differences first, then severity, then the recommended fix — in that order.

Slide notes

Evidence is the difference between an audit the team acts on and an audit the team argues with. A finding that says 'spacing feels random on the pricing page' creates a discussion about whose taste counts. A finding that says 'pricing.tsx uses 18px and 22px gaps; the documented scale is 16px and 24px' creates a fix. The wording matters doubly because findings become engineering tickets: a ticket that says 'make chips consistent' invites subjective rework, while one that names the file, the value, and the conflicting token is actionable without further negotiation.
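The pricing-page finding above is the kind of check that can be made mechanical. The sketch below is deliberately naive: it treats every integer pixel value in the text as spacing, which a real auditor would need to narrow, and the function name and sample source are hypothetical.

```python
import re

PX_VALUE = re.compile(r"\b(\d+)px\b")


def off_scale_spacing(path: str, source: str, scale_px: list[int]) -> list[dict]:
    """Report pixel values that are not on the documented scale, with evidence."""
    findings = []
    for line_no, text in enumerate(source.splitlines(), start=1):
        for match in PX_VALUE.finditer(text):
            value = int(match.group(1))
            if value not in scale_px:
                # Nearest scale step; ties resolve to the smaller step.
                nearest = min(scale_px, key=lambda s: (abs(s - value), s))
                findings.append({
                    "file": path,
                    "line": line_no,
                    "found": f"{value}px",
                    "expected": f"{nearest}px",
                })
    return findings


sample = "gap: 18px;\nmargin: 16px;\nrow-gap: 22px;"
spacing_findings = off_scale_spacing("pricing.tsx", sample, [16, 24])
```

Each result names the file, the line, the value found, and the nearest documented value, which is enough to write a ticket without a discussion about taste.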

Code-level evidence is the easy half. For anything about rendered output — states, density, responsive wrapping, contrast — require screenshots, captured at consistent viewports, covering the states that matter: normal, hover or focus, disabled, loading, empty, error, and at least one mobile width. Many of the worst findings live at the edges: a label that wraps before its icon at 390px, a disabled state with unreadable contrast, a focus ring that vanished in one theme.

The ordering rule on the slide — observable differences first, then severity, then fix — is a guard against the model turning preferences into findings. So is the last bullet: when the system itself is ambiguous, the agent should surface the question rather than inventing a rule and auditing against it. Screenshot extraction of your own live site, diffed against documented tokens, is a useful supplementary evidence source here: every mismatch is either undocumented drift or a token that exists only in someone's head.

Narration for this slide

The rule that keeps an audit honest is simple: no finding without evidence. Every code-level finding carries a file path, a line number, the value the agent found, and the value the system expects. Anything about rendered output — states, density, mobile behaviour — carries a screenshot. And the agent reports observable differences first, severity second, fixes third, in that order, so preferences do not sneak in dressed as findings. One more rule: when the system itself is unclear, the agent reports a question. It does not invent a rule and then audit you against it. Evidence is what turns the report from an opinion into a work queue.

Slide 7 of 13 · 16:9

Good findings vs bad findings

A vague finding creates a discussion. A precise finding creates a fix.

| Bad finding | Good finding |
| --- | --- |
| Buttons are inconsistent across the app | P1: Button.tsx line 42 uses #2D6CDF; the primary action token color.action.primary is #2456C4 |
| Spacing feels random on the pricing page | P2: pricing.tsx uses 18px and 22px gaps; the documented scale is 16px and 24px |
| There are too many card components | P1: PromoCard and OfferTile re-implement Card with different radii (12px vs 16px); consolidate on Card with a variant prop |
| Mobile looks off | P0: at 390px the status label wraps before its icon, breaking scan rhythm in the queue table |

Review a sample of findings before trusting the full report. If the samples read like opinions, tighten the auditor's instructions.

Slide notes

These pairs come straight from the school's published audit material, and the pattern is consistent: the good finding names the file or route, the value found, the value expected, and the impact, in one or two sentences. The bad finding names a feeling. Both might describe the same underlying drift; only one of them can be assigned, estimated, and closed.

The practical habit to teach here is sampling. Before anyone reads a three-hundred-finding report end to end, pull ten findings at random and check them against the evidence rule. If they hold, the report is probably trustworthy. If they read like opinions — 'the hierarchy feels weak', 'this page needs polish' — the problem is upstream in the auditor's instructions, and the fix is to tighten the subagent definition once rather than to argue with three hundred findings individually.
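The sampling habit is easy to script. In this sketch the evidence rule is reduced to "the finding has a file, a line, a found value, and an expected value"; the field names and helper names are assumptions, and a human still reads the sampled findings.

```python
import random
from typing import Optional

EVIDENCE_FIELDS = ("file", "line", "found", "expected")


def sample_for_review(findings: list[dict], k: int = 10,
                      seed: Optional[int] = None) -> list[dict]:
    """Pull up to k findings at random; pass a seed to make the sample repeatable."""
    rng = random.Random(seed)
    return rng.sample(findings, min(k, len(findings)))


def reads_like_opinion(finding: dict) -> bool:
    """True when any piece of required evidence is missing or empty."""
    return any(finding.get(key) in (None, "") for key in EVIDENCE_FIELDS)


def sample_holds(findings: list[dict], k: int = 10,
                 seed: Optional[int] = None) -> bool:
    """True when every sampled finding carries the required evidence."""
    return not any(reads_like_opinion(f) for f in sample_for_review(findings, k, seed))
```

If `sample_holds` returns False, the cheaper move is to tighten the auditor definition and rerun, not to clean up the report by hand.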

Note also what the good findings do not do: they do not prescribe a redesign. The finding about the duplicated cards recommends consolidation and names the variant path, but the decision about whether the system should support that variant remains a human call at the review gate. Findings prepare decisions; they do not make them.

Narration for this slide

Here is the quality bar, side by side. 'Buttons are inconsistent' is a bad finding — it starts an argument. 'Button.tsx line 42 uses this hex value, and the primary action token is that one' is a good finding — it starts a fix. Same with spacing, duplicated cards, and mobile behaviour: the good version always names the place, the value found, the value expected, and why it matters. The habit to build is sampling. Before you trust a long report, pull ten findings at random. If they hold up against the evidence rule, proceed. If they read like opinions, fix the auditor's instructions once instead of arguing with every finding.

Slide 8 of 13 · 16:9

False positives, and how to tune the rules

Some of what the audit flags is not drift. Handling that well is what keeps the team trusting the report.

  • Intentional variation: a compact table chip is not the same as a broken chip
  • Utility-first styling: a raw class is not automatically a token violation
  • Docs that are behind: an undocumented variant may be a documentation gap, not a defect
  • Tune by recording exceptions in the scope and tightening the auditor definition — not by ignoring the report

An audit is not a demand that every instance look identical. It separates intentional variation from accidental drift.

Slide notes

Every audit produces false positives, and how the team handles them determines whether the second run ever happens. The most common category is intentional variation: a compact chip in a dense table genuinely needs less padding than its detail-page sibling, and an agent comparing the two will flag it unless told otherwise. The semantic state, accessible name, colour token, and icon logic should still come from the system — but the padding difference is a decision, not drift. Record approved exceptions in the audit packet so future runs stop reporting them.

The second category is style architecture. In a Tailwind-style codebase, raw utility classes are the design language, not a violation of it. A raw rounded-full on a badge may be exactly what the system intends; a raw hex colour inside a route file is far more suspicious. The auditor's instructions should distinguish token bypasses from normal composition, or the report drowns in noise.
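A rough sketch of that distinction: flag raw hex colours, leave utility classes alone, and note whether the raw value happens to equal a documented token. The regex is a simplification (it would also match a four-letter word such as `#fade`), and the function name and sample markup are hypothetical.

```python
import re
from typing import Optional

# Matches 3, 4, 6, or 8 digit hex colours; deliberately simple.
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{8}|[0-9a-fA-F]{6}|[0-9a-fA-F]{3,4})\b")


def token_bypasses(source: str, token_values: dict[str, str]) -> list[dict]:
    """Find raw hex colours; utility classes such as rounded-full are not flagged."""
    by_value = {value.lower(): name for name, value in token_values.items()}
    hits = []
    for line_no, text in enumerate(source.splitlines(), start=1):
        for match in HEX_COLOUR.finditer(text):
            raw = match.group(0)
            matching: Optional[str] = by_value.get(raw.lower())
            hits.append({"line": line_no, "found": raw, "matching_token": matching})
    return hits


route_source = (
    '<span className="rounded-full px-2">New</span>\n'
    '<div style={{ color: "#2D6CDF" }} />\n'
    '<p style={{ color: "#2456c4" }} />'
)
hits = token_bypasses(route_source, {"color.action.primary": "#2456C4"})
```

A hit with a `matching_token` is a likely mechanical replacement; a hit with none is a value the system does not know about, which is a question for the review gate.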

The third is the docs-versus-code question: an agent can see that a variant exists in code and not in the documentation, but it cannot know whether the docs are behind or the variant should not exist. That distinction is a human triage call at the review gate. The general tuning loop: every false positive either becomes a recorded exception, a tightened rule in the auditor definition, or a documentation fix. What it should never become is a reason to stop running the audit.

Narration for this slide

Not everything the audit flags is drift, and pretending otherwise is how teams lose faith in the report. Three common false positives. Intentional variation: a compact chip in a dense table is allowed to have less padding — record the exception so the next run stops flagging it. Utility-first styling: in a Tailwind codebase, raw classes are the design language, so the auditor needs to distinguish token bypasses from normal composition. And undocumented variants: the agent can see the gap between code and docs, but it cannot know which one is wrong — that is a human call. Tune the rules after every run. The audit gets sharper each time, but only if the false positives feed back into the instructions.

Slide 9 of 13 · 16:9

From findings to a fix queue

Severity protects product meaning. Effort and ownership make the queue schedulable.

| Severity | What it covers | Typical owner |
| --- | --- | --- |
| P0 | Meaning breaks and accessibility blockers: same state reads differently, contrast or focus blocks use | Fix before anything else; system team plus the owning feature team |
| P1 | Behaviour and component drift: states work differently, local copies bypass the shared component | System team; consolidation may need a cross-team decision |
| P2 | Token drift: raw values and one-off variables replacing semantic tokens | Mostly mechanical; agent-fixable under review |
| P3 | Polish drift and open questions: spacing, radius, label case, intentional-variation calls | Batched into polish passes, or decided by a designer |

Group the queue into quick fixes, component refactors, token changes, documentation updates, and human decisions — then schedule it.

Slide notes

Severity exists to stop the team spending a day tuning border radius while a warning state is unreadable on mobile. The P0-to-P3 ladder here follows the school's audit article: meaning and accessibility breaks first, behaviour and component drift second, token drift third, polish and judgment calls last. Severity is an input to planning, not a plan — roadmap pressure, ownership, and effort still shape the order — but it keeps the conversation anchored to user impact rather than to whatever finding annoyed the loudest reviewer.

The second move is grouping by fix type, because different findings route to different work. Quick fixes — replace a raw value with the token, restore a missing label — can be batched and largely automated. Component refactors, like consolidating three forked buttons, need a decision about which props survive, which is a meeting, not a commit. Token changes should be rare: add a semantic token only when the current system genuinely cannot express a real need. Documentation updates record the approved variants the audit surfaced. Human decisions cover the places where the system itself is unclear.
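Grouping and ranking can be expressed as a small transformation over the findings. The sketch assumes each finding already carries `severity`, `fix_type`, and `owner` fields; those names, and the helper names, are illustrative.

```python
SEVERITY_ORDER = {"P0": 0, "P1": 1, "P2": 2, "P3": 3}
FIX_TYPES = (
    "quick fix",
    "component refactor",
    "token change",
    "documentation update",
    "human decision",
)


def build_fix_queue(findings: list[dict]) -> dict[str, list[dict]]:
    """Group findings by fix type, most severe first within each group."""
    queue: dict[str, list[dict]] = {fix_type: [] for fix_type in FIX_TYPES}
    # sorted() is stable, so findings of equal severity keep their report order.
    for finding in sorted(findings, key=lambda f: SEVERITY_ORDER[f["severity"]]):
        queue[finding["fix_type"]].append(finding)
    return queue


def unowned(findings: list[dict]) -> list[dict]:
    """Findings with no suggested owner, to be resolved at the review gate."""
    return [f for f in findings if not f.get("owner")]
```

Listing the unowned findings separately gives the review gate a short, explicit job: assign them or drop them.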

Ownership is the part that makes the queue real. A finding with a plausible owner gets scheduled; a finding addressed 'to the team' gets admired. Asking the auditor to suggest an owner per finding is cheap, and the review gate corrects the suggestions that are wrong.

Narration for this slide

A pile of findings becomes useful the moment it is ranked and assigned. P0 is anything that breaks meaning or blocks accessibility — the same state reading differently on different pages, contrast or focus failures. P1 is behaviour and component drift, including the copied implementations. P2 is token drift — mostly mechanical. P3 is polish and the judgment calls. Then group by fix type: quick fixes, component refactors, token changes, documentation updates, and decisions that need a human. And give every finding a suggested owner, because a finding addressed to nobody stays in the report forever. Severity protects meaning; ownership gets it scheduled.

Slide 10 of 13 · 16:9

Letting the agent fix the mechanical findings

The audit run is read-only. Fixing is a separate run, with write access, and a human reviewing every diff.

  • Never audit and fix in the same pass — it makes both harder to trust
  • Mechanical fixes first: token replacements, restored labels, corrected copies
  • Fixes arrive as reviewable diffs or pull requests, not direct pushes
  • Fix in passes — P0 and P1, then consolidation, then polish — and recapture evidence after each pass
  • Schedule the rerun: quarterly, after rebrands, before token migrations, or before each release

Rerunning the saved workflow after a cleanup turns the report into a trend line — far more persuasive than a single snapshot.

Slide notes

The separation between auditing and fixing is a trust decision, not a ceremony. An agent that changes code while it audits is grading its own homework, and the team can no longer tell whether the report describes the system or the agent's edits. Keep the audit read-only; run the fixes as a second pass with write access, the same fan-out shape if the volume justifies it, and human review on every diff — the same review gate discipline Module 3 established for new components applies to fixes.

Within that constraint, the mechanical share of the queue is genuinely large. Replacing a raw hex with the token it shadows, restoring a missing aria-label, correcting a copied class — these are exactly the changes agents do reliably and humans do resentfully. The judgment-heavy findings — consolidating forks, deciding whether a variant should exist — stay with people, with the agent preparing the evidence and the options.
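A mechanical fix that arrives as a diff, not as a write to disk, can be sketched with the standard library. The file name, the CSS custom property, and the function name are hypothetical; the example replaces a raw value that equals the token with a reference to that token.

```python
import difflib


def propose_token_fix(path: str, source: str, raw_value: str, token_ref: str) -> str:
    """Return a unified diff for review; nothing is written to disk."""
    fixed = source.replace(raw_value, token_ref)
    diff = difflib.unified_diff(
        source.splitlines(keepends=True),
        fixed.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


before = ".button {\n  color: #2456C4;\n  border-radius: 8px;\n}\n"
patch = propose_token_fix(
    "Button.css", before, "#2456C4", "var(--color-action-primary)"
)
```

An empty string means there was nothing to change, so the fix run can skip the file without opening an empty pull request.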

Scheduling is what turns this from a project into a practice. Save the working orchestration to .claude/workflows/ so the next run is a command rather than an afternoon of prompting, and pick triggers that match how your system changes: quarterly as a health check, after a rebrand to measure how much of it shipped, before a token migration to know what you are migrating from, or before releases as part of a visual QA sweep. A second run after the cleanup is also the cheapest way to prove the cleanup worked — falling severity counts are evidence, and evidence is what design system teams chronically lack when arguing for time.

Narration for this slide

Once the queue exists, a lot of it is mechanical — and that is where the agent comes back in, carefully. Two rules. First, never audit and fix in the same pass; an agent grading its own edits is not an audit. Second, fixes arrive as reviewable diffs, never direct pushes. Within those rules, let the agent clear the mechanical findings: token replacements, restored labels, corrected copies. Work in passes — blockers first, consolidation second, polish later — and recapture the evidence after each pass. Then schedule the rerun. Quarterly, after a rebrand, before a migration. The second report is the proof the cleanup worked, and the trend line is the argument for the next round of system investment.

Slide 11 of 13 · 16:9

Worked example: a 60-component system after a rebrand

A B2B platform rebranded — new palette, radius scale, and type ramp — updated the core tokens and the twelve most-used components, then ran the audit to see what was left.

| What happened | |
| --- | --- |
| The run | 64 read-only auditor agents, one per component folder, finished in about 70 minutes |
| Findings | 312 total: 41 hard-coded old-palette colours, 57 off-scale spacing values, 19 undocumented variants, 9 components on legacy radius constants |
| The detail that mattered | 11 of the hard-coded colours sat in disabled and hover states skipped during the rebrand sprint |
| What it became | A three-sprint cleanup backlog prioritised by severity counts |
| The rerun | P0 and P1 findings dropped from 88 to 6, the evidence in the rebrand completion review |

The numbers are from one traced run, not a benchmark. The pattern — fan out, evidence, severity, rerun — is the part that transfers.

Slide notes

This case is drawn from the school's published audit-at-scale workflow; present it as one traced run rather than a promise of what every team will see. The setup is the common one: the rebrand updated the visible surface — core tokens and the twelve most-used components — and the team needed to know how much of the long tail was still wearing the old brand. That is precisely the question hand audits never answer, because nobody re-reads sixty component folders after the launch pressure has passed.

The finding worth dwelling on is the eleven hard-coded colours hiding in disabled and hover states. Those are the states that get skipped under deadline and the states a quick visual scan never exercises — and they are the kind of finding that only shows up when the evidence requirement covers states, not just default renders. The nine components still importing legacy radius constants are the other quiet category: nothing looks broken until the constant is finally deleted and the build fails.

The ending is the operational point. The report became a three-sprint backlog because the findings carried severity, evidence, and owners; the rerun after cleanup gave the team a number — 88 P0 and P1 findings down to 6 — to put in front of stakeholders. Two adjacent cases from the same workflow are worth mentioning if there is time: a 28-page marketing site audited per page in under an hour, where the agency fixed the top twenty findings in one sprint, and three forked buttons that the report turned into one consolidation meeting instead of a quarter of intermittent debate.
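The before-and-after comparison is simple arithmetic, sketched here with the combined P0 and P1 counts from this case. The function names are hypothetical, and the per-severity comparison assumes each run is summarised as a mapping from severity to count.

```python
SEVERITIES = ("P0", "P1", "P2", "P3")


def compare_runs(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Change in finding count per severity; negative numbers mean improvement."""
    return {s: after.get(s, 0) - before.get(s, 0) for s in SEVERITIES}


def blocking_reduction(before_count: int, after_count: int) -> float:
    """Fraction by which the combined P0 and P1 count fell between two runs."""
    if before_count <= 0:
        raise ValueError("before_count must be positive")
    return (before_count - after_count) / before_count


# Combined P0 and P1 counts from the worked example: 88 before, 6 after.
drop = blocking_reduction(88, 6)
```

For this run the drop works out to roughly 93 percent, which is the single number a stakeholder review tends to remember.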

Narration for this slide

Let's trace a real run. A B2B platform rebranded: new palette, new radius scale, new type ramp. The team updated the core tokens and the twelve most-used components, then ran the fan-out audit across all sixty-four component folders. About seventy minutes later: 312 findings. Forty-one hard-coded colours from the old palette — eleven of them hiding in disabled and hover states that got skipped during the sprint. Fifty-seven spacing values off the new scale, nineteen undocumented variants, nine components still on legacy radius constants. That became a three-sprint cleanup backlog, and the rerun afterwards showed P0 and P1 findings dropping from eighty-eight to six. One run, real numbers — and the rerun is what made the cleanup provable.

Slide 12 of 13 · 16:9

Exercise: run a single-dimension audit

Pick one dimension and one component family in your own system, and run a small, read-only audit this week.

  • Choose one dimension — token drift is the easiest first target — and one component family or surface
  • Write the reference down first: tokens, scale, documented variants, known exceptions
  • Ask for observable findings only: file, line, value found, expected value, severity, suggested owner
  • Sample ten findings against the evidence rule before reading the rest
  • Sort what remains into mechanical fixes, decisions for a human, and rules to add to the harness

Keep the report. Module 5 uses the same source-of-truth thinking for token sync, and Module 6 turns this run into a scheduled loop.

Slide notes

Keep the exercise deliberately small: one dimension, one component family, read-only. The goal is not coverage; it is to experience the difference between an evidence-backed finding and an opinion, and to discover how much of the work is actually in writing the reference down. Most participants find that the hardest step is the second bullet — articulating what the system is supposed to be — which is itself a useful result, because ambiguity in the reference is exactly where their drift comes from.

Token drift is the recommended first dimension because the evidence is unambiguous: a raw value either matches a token or it does not. Accessibility and density audits need screenshots and state coverage, which is more setup than a first run warrants. If the participant has no codebase access, the field-note variant works: extract the rendered styles from their own live site and diff against the documented tokens — every mismatch is either undocumented drift or a token that exists only in someone's head.
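For participants who want to see what "matches a token or does not" looks like as a check, here is a minimal sketch. It assumes colours are written as six-digit hex values and that the reference is a simple map from value to token name; `auditTokenDrift`, the token name, and the sample source are all hypothetical. The output shape follows the fields the exercise asks for.

```typescript
// Shape of one observable finding, following the fields the exercise asks for.
type Severity = "P0" | "P1" | "P2" | "P3";

interface Finding {
  file: string;
  line: number; // 1-based line number in the file
  found: string;
  expected: string;
  severity: Severity;
  suggestedOwner: string;
}

// Matches six-digit hex colours such as #1D4ED8 (assumption: colours are written this way).
const HEX_COLOUR = /#[0-9a-fA-F]{6}\b/g;

function auditTokenDrift(
  file: string,
  source: string,
  tokensByValue: Record<string, string>, // lower-case hex value -> token name
  suggestedOwner: string
): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, index) => {
    const rawValues: string[] = text.match(HEX_COLOUR) || [];
    rawValues.forEach((raw) => {
      const token = tokensByValue[raw.toLowerCase()];
      findings.push({
        file,
        line: index + 1,
        found: raw,
        // When the reference is silent, report a question instead of inventing a rule.
        expected:
          token !== undefined
            ? `token ${token}`
            : "no matching token: question for the system owner",
        severity: "P2", // the module ranks token drift as P2
        suggestedOwner,
      });
    });
  });
  return findings;
}

// Hypothetical reference and component source, for illustration only.
const referenceTokens: Record<string, string> = { "#1d4ed8": "color.action.primary" };
const buttonSource = [
  "export const styles = {",
  "  color: 'var(--color-text-inverse)',",
  "  background: '#1D4ED8',",
  "  borderColor: '#ff8800',",
  "};",
].join("\n");

const driftFindings = auditTokenDrift("Button.tsx", buttonSource, referenceTokens, "design-systems");
```

The scan is read-only by construction: it returns findings and never touches the source. A real run would also need the approved exceptions from the written reference, which this sketch leaves out.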

The sorting step at the end is the bridge to the rest of the course. Mechanical fixes are the future automated passes; the human decisions are the review gate from Module 3 doing its job; and the rules worth adding to the harness are how this audit stops finding the same things twice. Ask participants to keep the report: Module 6 builds the scheduled, self-maintaining version of this loop, and a real first report makes that module concrete.
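The three piles can be sketched as a small triage function. The sorting rule here is an assumption made for illustration, not something the course prescribes: a finding with no expected value goes to a human, a value that repeats across at least three findings becomes a harness-rule candidate, and everything else is treated as mechanical. `triage` and `repeatThreshold` are hypothetical names.

```typescript
type Pile = "mechanical" | "humanDecision" | "harnessRule";

interface TriageInput {
  file: string;
  found: string;
  expected: string | null; // null when the written reference is silent
}

function triage(
  findings: TriageInput[],
  repeatThreshold: number = 3 // assumption: three repeats justify a rule
): Record<Pile, TriageInput[]> {
  // Count how often each raw value was found across the whole report.
  const counts: Record<string, number> = {};
  findings.forEach((f) => {
    counts[f.found] = (counts[f.found] || 0) + 1;
  });

  const piles: Record<Pile, TriageInput[]> = {
    mechanical: [],
    humanDecision: [],
    harnessRule: [],
  };
  findings.forEach((f) => {
    if (f.expected === null) {
      piles.humanDecision.push(f); // the system is unclear, so a person decides
    } else if ((counts[f.found] || 0) >= repeatThreshold) {
      piles.harnessRule.push(f); // recurring drift: worth a rule so it stops recurring
    } else {
      piles.mechanical.push(f); // one-off with a known expected value
    }
  });
  return piles;
}

// Hypothetical findings, for illustration only.
const piles = triage([
  { file: "Card.tsx", found: "#0a66c2", expected: "color.action.primary" },
  { file: "Chip.tsx", found: "#0a66c2", expected: "color.action.primary" },
  { file: "Tabs.tsx", found: "#0a66c2", expected: "color.action.primary" },
  { file: "Badge.tsx", found: "variant=ghost", expected: null },
  { file: "Modal.tsx", found: "14px", expected: "space.4" },
]);
```

Findings in the harness-rule pile still need fixing; the pile only marks them as the ones whose cause is worth encoding.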

Narration for this slide

Your turn, and keep it small. Pick one dimension — token drift is the easiest place to start — and one component family in your own system. Before you run anything, write the reference down: the tokens, the scale, the variants you have documented, the exceptions you have approved. Then run a read-only audit and ask for observable findings only: file, line, value found, value expected, severity, suggested owner. Sample ten findings before you read the rest. Then sort what is left into three piles: mechanical fixes, decisions that need a human, and rules worth adding to the harness so this never gets found again. Keep the report — you will build on it in Module 6.

Slide 13 of 13 · 16:9

Summary, and what comes next

  • Drift is the default: small reasonable decisions that change product meaning over time
  • Audit narrow scopes across named dimensions, against a written source of truth
  • No finding without evidence — file, line, value, screenshot — and no rule invented by the agent
  • Severity, fix type, and ownership turn findings into a schedulable queue, not a guilt document
  • Audit read-only, fix in reviewed passes, rerun on a schedule so the report becomes a trend line

Module 5 turns to the other drift surface: tokens living in three tools at once, and keeping them in sync without silent overwrites.

Slide notes

Recap by connecting the bullets rather than repeating them. Drift happens because the system is used faster than it is watched, so the counterweight has to be cheap enough to repeat — that is what the fan-out buys. Evidence is what makes the output trustworthy, severity and ownership are what make it actionable, and the read-only-audit-then-reviewed-fixes split is what keeps both honest. The audit also feeds the earlier modules: recurring findings become token rules from Module 1, DESIGN.md corrections from Module 2, and review-gate checks from Module 3.

Be clear about the limits one more time, because this is where overclaiming erodes trust: the audit finds drift, it does not decide what the system should be. Intentional variation, undocumented-versus-unwanted variants, and density disagreements between teams are design decisions that the report should carry as questions. Static reading also misses runtime-only issues — computed styles, theme switching, third-party widgets — which is why heavy releases pair this audit with a screenshot-based regression sweep.

Preview Module 5 concretely: most teams hold tokens in at least three places — a design tool, a canvas or spec, and code — and that multiplicity is its own drift surface. The module covers choosing a single source of truth, running sync as an agent job that reads, diffs, and proposes rather than overwrites, and reporting divergence instead of silently resolving it. The audit habits from this module carry straight across.
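A preview of the read, diff, propose shape, as a sketch only. It assumes each copy of the tokens can be exported as a flat name-to-value map and that one copy has already been chosen as the source of truth; `diffAgainstTruth` and the sample token names are hypothetical, and Module 5 may structure the job differently.

```typescript
type TokenSet = Record<string, string>;

interface Divergence {
  token: string;
  source: string; // which copy disagrees with the source of truth
  current: string | null; // null when the token is missing from that copy
  proposed: string; // value taken from the source of truth
}

// Reads every copy, compares it with the source of truth, and returns proposals.
// Nothing is written back: applying a proposal is a separate, reviewed step.
// Tokens that exist only in a copy are not reported by this sketch.
function diffAgainstTruth(truth: TokenSet, copies: Record<string, TokenSet>): Divergence[] {
  const divergences: Divergence[] = [];
  Object.keys(copies).forEach((source) => {
    const copy: TokenSet = copies[source] || {};
    Object.keys(truth).forEach((token) => {
      const proposed = truth[token];
      const current = copy[token];
      if (proposed !== undefined && current !== proposed) {
        divergences.push({
          token,
          source,
          current: current === undefined ? null : current,
          proposed,
        });
      }
    });
  });
  return divergences;
}

// Hypothetical token copies, for illustration only.
const truthTokens: TokenSet = { "color.action.primary": "#1d4ed8", "radius.md": "8px" };
const designToolTokens: TokenSet = { "color.action.primary": "#1d4ed8", "radius.md": "6px" };
const codeTokens: TokenSet = { "color.action.primary": "#1d4ed8" };

const divergences = diffAgainstTruth(truthTokens, {
  designTool: designToolTokens,
  code: codeTokens,
});
```

The return value is the divergence report. The inputs are left exactly as they were read, which is the "propose rather than overwrite" property in miniature.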

Narration for this slide

Let's close. Drift is the default state of every design system — small, reasonable decisions that slowly change what the product means. The counterweight is an audit cheap enough to repeat: narrow scope, named dimensions, a written source of truth, and a fan-out of read-only agents that return evidence, not opinions. Severity, fix type, and ownership turn the findings into a queue you can schedule. The agent fixes the mechanical items as reviewed diffs, the humans keep the judgment calls, and the rerun turns the whole thing into a trend line. Module 5 picks up the other drift surface: your tokens living in Figma, on a canvas, and in code at the same time — and how to keep them in sync without anything being silently overwritten. See you there.

Module transcript
Module 4, narrated slide by slide

Slide 1 · System Audits and Drift Detection

Welcome to Module 4. So far this course has been about building the contract: tokens, DESIGN.md, review gates. This module is about the gap between the contract and what actually ships — drift. Drift is the default state of every design system, because products get built faster than anyone can watch them. What changes with agents is the economics of finding it: one agent can read one component carefully, and sixty agents can read sixty in parallel. What does not change is the output we are aiming for. Not a guilt document. A ranked, evidence-backed report that turns into a fix queue the team can actually work through.

Slide 2 · Drift is usually quiet

Let's be honest about how drift happens. Nobody decides to fork the design system. They decide to ship the feature on Friday. So a one-off colour goes in, a card gets copied and its radius tweaked, a disabled state gets skipped, a density call gets made on one page and never written down. Each decision is reasonable. The cost shows up later, and it is not just visual. A warning that is orange on one page and red on another teaches users two languages for the same state. That is a meaning problem, not a polish problem. And hand audits fail because the reading takes weeks — so they happen once and never again.

Slide 3 · Audit dimensions: what you are actually checking

An audit is only as good as the dimensions you name. Tokens: raw values where semantic tokens should be. Components: local copies and forks that bypass the shared library. Accessibility: missing focus states, labels, contrast. Content: the same action named three different ways. Density: spacing that wandered off the scale. Each dimension drifts for different reasons and gets fixed by different people, which is why you ask the agent to check them explicitly rather than asking it to 'check consistency'. And keep the hierarchy in mind: drift that changes meaning or blocks accessibility matters more than a border radius that is two pixels off.

Slide 4 · Scoping the audit

Scope decides whether you get findings or commentary. 'Audit our app and make it consistent' gets you commentary. 'Audit status chips across the dashboard, project detail, the mobile queue, and settings' gets you findings. Three useful axes: per component family, per route or package, per team — pick the one that matches who will act on the results. And before any agent runs, write down the reference: the tokens, the scale, the documented variants, and the exceptions you have deliberately approved. If you skip the exceptions, the agent will report your intentional variation as drift, and the team will start distrusting the report.

Slide 5 · The audit run: fan out, collect, decide

Here is the shape of the run. You set the scope and the reference. The agent writes a small orchestration script that lists the audit units, builds a packet for each, and fans out one read-only auditor per component or page — up to sixteen at a time, with the findings collected in the script rather than flooding the conversation. Each auditor returns structured findings: file, line, evidence, severity. The script merges them into one drift report, ranked, with the repeats grouped. Then humans come back in at the gate: triage what is real drift versus intentional variation, approve the mechanical fixes, and push recurring findings into the harness. The dashed line is the schedule — save the workflow and rerun it, and the report becomes a trend line.
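As a sketch of the planning half of that script: list the units, build one read-only packet per unit, and split the packets into waves of at most sixteen. How each packet is actually dispatched depends on the agent runtime, so that part is left out. `planWaves`, the packet fields, the folder names, and the reference path are all hypothetical.

```typescript
interface AuditPacket {
  unit: string; // one component folder or page
  dimensions: string[]; // the named dimensions this auditor must check
  referencePath: string; // hypothetical path to the written reference
  readOnly: true; // auditors never edit; fixes happen in a separate pass
}

// Builds one packet per audit unit and groups the packets into waves
// so that no more than maxParallel auditors run at the same time.
function planWaves(
  units: string[],
  dimensions: string[],
  referencePath: string,
  maxParallel: number = 16
): AuditPacket[][] {
  if (maxParallel < 1) {
    throw new Error("maxParallel must be at least 1");
  }
  const packets = units.map(
    (unit): AuditPacket => ({ unit, dimensions, referencePath, readOnly: true })
  );
  const waves: AuditPacket[][] = [];
  for (let start = 0; start < packets.length; start += maxParallel) {
    waves.push(packets.slice(start, start + maxParallel));
  }
  return waves;
}

// Hypothetical unit list: sixty-four component folders.
const folderNames: string[] = [];
for (let i = 1; i <= 64; i++) {
  folderNames.push("components/unit-" + i);
}

const waves = planWaves(
  folderNames,
  ["tokens", "components", "accessibility", "content", "density"],
  "docs/audit-reference.md"
);
```

Sixty-four folders at sixteen per wave gives four full waves. Fixed waves are the simplest way to respect the cap; a rolling pool that starts a new auditor as soon as one finishes would use the slots better, at the cost of more orchestration code.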

Slide 6 · Evidence requirements: no finding without proof

The rule that keeps an audit honest is simple: no finding without evidence. Every code-level finding carries a file path, a line number, the value the agent found, and the value the system expects. Anything about rendered output — states, density, mobile behaviour — carries a screenshot. And the agent reports observable differences first, severity second, fixes third, in that order, so preferences do not sneak in dressed as findings. One more rule: when the system itself is unclear, the agent reports a question. It does not invent a rule and then audit you against it. Evidence is what turns the report from an opinion into a work queue.
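The evidence rule is mechanical enough to check in the collection script before a finding reaches the report. A minimal sketch, assuming findings arrive as loosely filled objects; `RawFinding`, `missingEvidence`, and the sample values are hypothetical.

```typescript
interface RawFinding {
  kind: "code" | "rendered"; // rendered covers states, density, mobile behaviour
  description: string;
  file?: string;
  line?: number;
  found?: string;
  expected?: string;
  screenshot?: string; // hypothetical path to a captured image
}

// Returns the names of the evidence fields a finding is missing.
// An empty array means the finding satisfies the evidence rule.
function missingEvidence(finding: RawFinding): string[] {
  const missing: string[] = [];
  if (finding.kind === "code") {
    if (!finding.file) missing.push("file");
    if (finding.line === undefined) missing.push("line");
    if (!finding.found) missing.push("found");
    if (!finding.expected) missing.push("expected");
  } else if (!finding.screenshot) {
    missing.push("screenshot");
  }
  return missing;
}

// Hypothetical findings, for illustration only.
const opinion: RawFinding = { kind: "code", description: "Buttons are inconsistent" };
const provable: RawFinding = {
  kind: "code",
  description: "Raw hex where the primary action token is expected",
  file: "Button.tsx",
  line: 42,
  found: "#0a66c2",
  expected: "color.action.primary",
};
const unshown: RawFinding = { kind: "rendered", description: "Disabled state looks off on mobile" };

const opinionGaps = missingEvidence(opinion);
const provableGaps = missingEvidence(provable);
const unshownGaps = missingEvidence(unshown);
```

Findings with gaps can be sent back to the auditor or dropped, but they should not be merged into the ranked report as they stand.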

Slide 7 · Good findings vs bad findings

Here is the quality bar, side by side. 'Buttons are inconsistent' is a bad finding — it starts an argument. 'Button.tsx line 42 uses this hex value, and the primary action token is that one' is a good finding — it starts a fix. Same with spacing, duplicated cards, and mobile behaviour: the good version always names the place, the value found, the value expected, and why it matters. The habit to build is sampling. Before you trust a long report, pull ten findings at random. If they hold up against the evidence rule, proceed. If they read like opinions, fix the auditor's instructions once instead of arguing with every finding.
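Pulling ten at random is worth scripting so the sample is not just the first ten rows of the report. A minimal sketch; `sampleFindings` is a hypothetical helper, and the random source is injectable only so the draw can be made repeatable when needed.

```typescript
// Draws up to `size` distinct items from the report without changing the report.
function sampleFindings<T>(
  findings: T[],
  size: number = 10,
  random: () => number = Math.random
): T[] {
  const pool = findings.slice(); // work on a copy so the report keeps its order
  const picked: T[] = [];
  while (picked.length < size && pool.length > 0) {
    // Clamp the index in case an injected random source returns exactly 1.
    const index = Math.min(pool.length - 1, Math.floor(random() * pool.length));
    pool.splice(index, 1).forEach((item) => picked.push(item));
  }
  return picked;
}

// Hypothetical report: finding ids 1 to 312.
const findingIds: number[] = [];
for (let id = 1; id <= 312; id++) {
  findingIds.push(id);
}

const spotCheck = sampleFindings(findingIds);
```

Because each draw removes the item from the pool, the same finding cannot be checked twice, and a report shorter than ten findings is returned whole.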

Slide 8 · False positives, and how to tune the rules

Not everything the audit flags is drift, and pretending otherwise is how teams lose faith in the report. Three common false positives. Intentional variation: a compact chip in a dense table is allowed to have less padding — record the exception so the next run stops flagging it. Utility-first styling: in a Tailwind codebase, raw classes are the design language, so the auditor needs to distinguish token bypasses from normal composition. And undocumented variants: the agent can see the gap between code and docs, but it cannot know which one is wrong — that is a human call. Tune the rules after every run. The audit gets sharper each time, but only if the false positives feed back into the instructions.
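Recording an exception so the next run stops flagging it can be as simple as a list the collection script consults. A minimal sketch under that assumption; `ApprovedException`, `applyExceptions`, the file names, and the padding values are hypothetical.

```typescript
interface DriftFinding {
  file: string;
  property: string;
  found: string;
}

interface ApprovedException {
  filePattern: RegExp; // no "g" flag, so repeated test() calls stay stateless
  property: string;
  allowedValues: string[];
  reason: string;
}

interface TunedReport {
  drift: DriftFinding[];
  suppressed: { finding: DriftFinding; reason: string }[];
}

// Splits findings into real drift and findings covered by a recorded exception.
// Suppressed findings are kept with their reason so the decision stays visible.
function applyExceptions(
  findings: DriftFinding[],
  exceptions: ApprovedException[]
): TunedReport {
  const tuned: TunedReport = { drift: [], suppressed: [] };
  findings.forEach((finding) => {
    const match = exceptions.filter(
      (exception) =>
        exception.property === finding.property &&
        exception.filePattern.test(finding.file) &&
        exception.allowedValues.indexOf(finding.found) !== -1
    )[0];
    if (match !== undefined) {
      tuned.suppressed.push({ finding, reason: match.reason });
    } else {
      tuned.drift.push(finding);
    }
  });
  return tuned;
}

// Hypothetical exception: the compact chip in dense tables may use less padding.
const approved: ApprovedException[] = [
  {
    filePattern: /CompactChip\.tsx$/,
    property: "padding",
    allowedValues: ["2px 6px"],
    reason: "Compact chip in dense tables, approved intentional variation",
  },
];

const tunedReport = applyExceptions(
  [
    { file: "table/CompactChip.tsx", property: "padding", found: "2px 6px" },
    { file: "Card.tsx", property: "padding", found: "2px 6px" },
  ],
  approved
);
```

The same value is suppressed in the chip and still reported in the card, because the exception is scoped to a place, not to a value.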

Slide 9 · From findings to a fix queue

A pile of findings becomes useful the moment it is ranked and assigned. P0 is anything that breaks meaning or blocks accessibility — the same state reading differently on different pages, contrast or focus failures. P1 is behaviour and component drift, including the copied implementations. P2 is token drift — mostly mechanical. P3 is polish and the judgment calls. Then group by fix type: quick fixes, component refactors, token changes, documentation updates, and decisions that need a human. And give every finding a suggested owner, because a finding addressed to nobody stays in the report forever. Severity protects meaning; ownership gets it scheduled.
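Ranking and assignment can be sketched as one pass over the findings. The severity levels and fix types are the ones named above; `QueueItem`, `buildQueue`, the ids, and the owner names are hypothetical, and the tie-break on id is only there to keep the order stable.

```typescript
type Severity = "P0" | "P1" | "P2" | "P3";
type FixType =
  | "quick-fix"
  | "component-refactor"
  | "token-change"
  | "docs-update"
  | "human-decision";

interface QueueItem {
  id: string;
  severity: Severity;
  fixType: FixType;
  owner: string | null; // null means nobody has been suggested yet
}

const SEVERITY_RANK: Record<Severity, number> = { P0: 0, P1: 1, P2: 2, P3: 3 };

// Returns the schedulable queue, most severe first, plus the findings
// that cannot be scheduled yet because they have no suggested owner.
function buildQueue(items: QueueItem[]): { queue: QueueItem[]; unowned: QueueItem[] } {
  const unowned = items.filter((item) => item.owner === null);
  const queue = items
    .filter((item) => item.owner !== null) // filter copies, so sort leaves the input alone
    .sort(
      (a, b) =>
        SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] ||
        (a.id < b.id ? -1 : a.id > b.id ? 1 : 0)
    );
  return { queue, unowned };
}

// Hypothetical findings, for illustration only.
const fixQueue = buildQueue([
  { id: "a", severity: "P2", fixType: "token-change", owner: "web-platform" },
  { id: "b", severity: "P0", fixType: "quick-fix", owner: "accessibility" },
  { id: "c", severity: "P1", fixType: "component-refactor", owner: null },
  { id: "d", severity: "P0", fixType: "human-decision", owner: "design-systems" },
]);
```

Keeping the unowned findings in their own list, instead of sorting them into the queue, makes the gap visible: they are the ones that would otherwise stay in the report forever.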

Slide 10 · Letting the agent fix the mechanical findings

Once the queue exists, a lot of it is mechanical — and that is where the agent comes back in, carefully. Two rules. First, never audit and fix in the same pass; an agent grading its own edits is not an audit. Second, fixes arrive as reviewable diffs, never direct pushes. Within those rules, let the agent clear the mechanical findings: token replacements, restored labels, corrected copies. Work in passes — blockers first, consolidation second, polish later — and recapture the evidence after each pass. Then schedule the rerun. Quarterly, after a rebrand, before a migration. The second report is the proof the cleanup worked, and the trend line is the argument for the next round of system investment.

Slide 11 · Worked example: a 60-component system after a rebrand

Let's trace a real run. A B2B platform rebranded: new palette, new radius scale, new type ramp. The team updated the core tokens and the twelve most-used components, then ran the fan-out audit across all sixty-four component folders. About seventy minutes later: 312 findings. Forty-one hard-coded colours from the old palette — eleven of them hiding in disabled and hover states that got skipped during the sprint. Fifty-seven spacing values off the new scale, nineteen undocumented variants, nine components still on legacy radius constants. That became a three-sprint cleanup backlog, and the rerun afterwards showed P0 and P1 findings dropping from eighty-eight to six. One run, real numbers — and the rerun is what made the cleanup provable.

Slide 12 · Exercise: run a single-dimension audit

Your turn, and keep it small. Pick one dimension — token drift is the easiest place to start — and one component family in your own system. Before you run anything, write the reference down: the tokens, the scale, the variants you have documented, the exceptions you have approved. Then run a read-only audit and ask for observable findings only: file, line, value found, value expected, severity, suggested owner. Sample ten findings before you read the rest. Then sort what is left into three piles: mechanical fixes, decisions that need a human, and rules worth adding to the harness so this never gets found again. Keep the report — you will build on it in Module 6.

Slide 13 · Summary, and what comes next

Let's close. Drift is the default state of every design system — small, reasonable decisions that slowly change what the product means. The counterweight is an audit cheap enough to repeat: narrow scope, named dimensions, a written source of truth, and a fan-out of read-only agents that return evidence, not opinions. Severity, fix type, and ownership turn the findings into a queue you can schedule. The agent fixes the mechanical items as reviewed diffs, the humans keep the judgment calls, and the rerun turns the whole thing into a trend line. Module 5 picks up the other drift surface: your tokens living in Figma, on a canvas, and in code at the same time — and how to keep them in sync without anything being silently overwritten. See you there.