Agentic Design School
Module 6 of 6
40–50 minutes

Design Systems for Agents

Systems That Maintain Themselves

The end state this course aims at: scheduled audits, automatic documentation, proposed fixes arriving as reviewable changes, and a small human rotation that approves rather than produces. Honest about what still needs people.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Design a maintenance loop where the system proposes its own fixes on a schedule.
  • Keep documentation generated from the source rather than written after the fact.
  • Define the human roles that remain: approval, taste, prioritisation, escalation.

Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Systems That Maintain Themselves

Design Systems for Agents · Module 6 of 6

  • Maintenance as a standing loop, not a quarterly project
  • Scheduled audits, PR checks, token migrations, regenerated docs
  • Proposed fixes as pull requests, never direct pushes
  • What still needs people: deprecations, taste shifts, breaking changes

Everything the previous five modules built — tokens, DESIGN.md, review gates, audits, sync — assembles here into a loop that runs without being asked.

Slide notes

This is the closing module of the course, and it is deliberately an assembly module rather than a new-technique module. Tokens that carry intent (Module 1), a DESIGN.md the agent reads (Module 2), review gates with evidence requirements (Module 3), audits with findings the team trusts (Module 4), and token sync that reports divergence (Module 5) are the parts. This module is about wiring them into a standing arrangement that runs on a schedule and on every change, with humans approving rather than producing.

Set expectations honestly at the start. The phrase "self-maintaining" needs an "almost" in front of it: the system can detect drift, propose fixes, regenerate documentation, and keep tokens propagated, but it cannot decide what counts as drift versus evolution, whether a deprecation is worth the breakage, or whether the new green is better. That "almost" matters, and the module spends real time on it.

If participants have not done the earlier modules, the loop will still make sense as a diagram, but the parts will feel abstract. The exercise at the end asks them to draft their own loop on one page, and the quality of that page tracks how much of the earlier infrastructure they actually have.

Narration for this slide

Welcome to the final module of Design Systems for Agents. Everything we have built so far — tokens that carry intent, a DESIGN.md the agent actually reads, review gates, audits, token sync — comes together here into one arrangement: a system that maintains itself, almost. The agent runs scheduled audits, checks every pull request, proposes fixes as reviewable changes, and keeps the documentation generated from source. The humans on the system shift from producing the maintenance work to approving it. The almost is important, and we will be honest about it: deprecations, breaking changes, and taste stay with people. Let's look at the loop.

Slide 2 of 13 · 16:9

Maintenance as a standing loop, not a quarterly project

Design systems do not fail at launch. They fail eighteen months later, one small unreviewed gap at a time.

  • Drift accumulates between changes: a hardcoded colour here, a one-off variant there
  • Each gap is small and defensible; together they erode trust in the system
  • The cause is structural: every decision lives on more surfaces than anyone remembers
  • Finding every surface is search and bookkeeping — exactly what agents are good at
  • The fix is cadence, not heroics: a loop that runs whenever the system changes, and on a schedule between changes

Drift is the default state of a design system. The counterweight is a loop that runs constantly, not a cleanup that runs annually.

Slide notes

Open with the failure pattern, because everyone in the room has lived it. The system launches well-documented and well-adopted, then decays quietly: a brand colour adjusted in CSS but not in the docs, a component that grew an off-system variant during a deadline week, a spacing value copied into four files because nobody knew there was a token. None of these is a scandal on its own. Together they are why the next person who asks "what colour is our accent?" gets three different answers depending on which file they open.

The important reframe is that this is not a discipline problem. People are good at making the design decision and bad at finding every place it has to land — the CSS variable, the theme mapping, the hover state, the prose in the documentation, the canvas, the onboarding deck. That is a search-and-bookkeeping problem, and search and bookkeeping is a fair description of what coding agents do best: find every reference, edit each one, regenerate what describes it, run the checks, and produce a diff a human can review.
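
The "find every reference" step can be sketched as a small scan. This is a minimal illustration, assuming tokens are declared as CSS custom properties and hardcoded colours appear as hex literals; the function name `find_hardcoded_colours` and the sample stylesheet are hypothetical, not part of the course's tooling.

```python
import re

# Assumption: colours appear as 3-, 4-, 6- or 8-digit hex literals.
# A sketch only: an id selector such as "#fade" would also match.
HEX_COLOUR = re.compile(r"#(?:[0-9a-fA-F]{3,4}){1,2}\b")
# Assumption: a token definition is a CSS custom property declaration.
TOKEN_DECLARATION = re.compile(r"^\s*--[\w-]+\s*:")


def find_hardcoded_colours(path: str, css: str) -> list[tuple[str, int, str]]:
    """Return (path, line number, colour) for hex colours used outside token declarations."""
    findings = []
    for number, line in enumerate(css.splitlines(), start=1):
        if TOKEN_DECLARATION.match(line):
            continue  # the token layer is where raw values are allowed to live
        for match in HEX_COLOUR.finditer(line):
            findings.append((path, number, match.group(0)))
    return findings


# Hypothetical stylesheet: one token, one compliant rule, one rule with drift.
sample = """\
:root {
  --color-accent: #0a7d4f;
}
.button { background: var(--color-accent); }
.banner { background: #0A7D4F; border-color: #fff; }
"""

for finding in find_hardcoded_colours("banner.css", sample):
    print(finding)
```

Each result carries a file and a line, which is the minimum a reviewer needs to check the finding rather than argue with it.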

The second reframe is cadence. The traditional response to drift is the quarterly or annual cleanup project — heroic, exhausting, and out of date within a month. The maintenance loop replaces the project with a standing arrangement: checks on every change, audits on a schedule, fixes proposed as they are found. Smaller batches, reviewed continuously, by people who are approving rather than excavating.

Narration for this slide

Here is the problem this module solves. Design systems rarely fail at launch — they fail later, quietly, as drift accumulates. A colour gets adjusted in the code but not the docs. A component grows a one-off variant during a deadline week. Each gap is small and defensible, and together they are why nobody fully trusts the system after eighteen months. The cause is structural: every design decision lives on more surfaces than anyone remembers, and humans are bad at finding all of them. That is search and bookkeeping, which is exactly what agents are good at. So the fix is not a heroic quarterly cleanup. It is a standing loop that runs on every change and on a schedule in between.

Slide 3 of 13 · 16:9

The maintenance loop: detect, propose, review, apply, log

Six stations, with a clear answer at each one to the question: who runs this?

Loop diagram of a self-maintaining design system. A human-owned source of truth — tokens, DESIGN.md, the audit schedule, and the human-only decision list — feeds agent-run detection: scheduled drift audits, design QA on every pull request, and token and accessibility checks. Detection feeds agent-run proposals: mechanical fixes opened as pull requests, token migrations on a branch, and documentation regenerated from source, each with a report of every file touched. Proposals pass through a human review gate where a person reads the diff, the gate output, and the render. Approved changes are applied and logged by the agent, and decisions that need judgment — deprecations, breaking changes, intentional variation, taste — go to a human decision step whose outcomes feed back into the source of truth along a dashed line.
Yellow-striped cards are agent-run and produce evidence: findings, diffs, gate output, reports. The dark gate and the wide decision card are human. Approved decisions return to the source of truth — tokens and DESIGN.md — never directly into components.

The agent runs detection, proposal, and application. Humans hold the review gate and the decisions. Removing either human station turns the loop into unsupervised drift at machine speed.

Slide notes

Walk the loop station by station and name the owner of each. The source of truth is human-owned: the token layer, DESIGN.md, the audit schedule, and — importantly — a written list of decisions that are human-only. Detection is agent-run and has two triggers: standing checks on every pull request that touches UI, and scheduled audits that sweep the whole surface for what per-change checks miss. Proposal is agent-run too, with one hard rule: everything arrives as a reviewable change — a pull request with evidence attached — never a direct push. The review gate is human: a person reads the diff, the gate output, and the rendered result. Apply and log is agent-run, but only after approval, and it includes re-running the gates and updating the change log. The decision station is human: deprecations, breaking changes, intentional variation, and taste, with the agent supplying the impact analysis.

Point at the dashed feedback line. Approved decisions go back into the source of truth — a token value changes, a rule is added to DESIGN.md — and the next cycle of the loop propagates them. Decisions never land directly in components, because that is exactly the bypass that created drift in the first place.

The failure mode to warn about is collapsing the two human stations. Teams under load start treating the review gate as a formality and the decision station as something the agent can handle because it sounds confident. The loop then keeps running and keeps producing technically compliant changes that nobody with taste has looked at, which is drift wearing a high-visibility vest.

Narration for this slide

Here is the whole module in one picture. On the left, the source of truth: tokens, DESIGN.md, the audit schedule, and a written list of decisions that stay human. The agent detects — scheduled drift audits, design QA on every pull request, token and accessibility checks. It proposes — fixes, migrations, and regenerated docs, always as pull requests with a report of what was touched, never direct pushes. A human reads the diff and the gate output at the review gate. Approved changes get applied and logged. And everything that needs judgment — deprecations, breaking changes, taste — goes to people, whose decisions return to the source of truth and start the loop again. The agent runs the work. You hold the gate and the decisions.

Slide 4 of 13 · 16:9

Scheduled audits and report cadence

Per-change checks catch what a diff introduces. Scheduled audits catch what accumulates between diffs.

  • Standing gates run on every change: token audit, type check, contrast and visual checks where they matter
  • The recurring audit runs on a calendar: monthly or quarterly, sized to how fast the product changes
  • Same audit brief every time, kept in the repo next to DESIGN.md, with the next run date in it
  • Findings arrive graded by severity, with file, line, and screenshot evidence — never as a guilt document
  • Reports route to a named owner; anything needing judgment goes to the decision station, not a backlog

Two defences, two cadences: gates on every change keep the baseline from eroding; the scheduled audit finds what no per-change rule was written to see.

Slide notes

The distinction to land is that these are two different activities, and conflating them weakens both. Standing gates — the token audit, the type check, the design QA pass on a pull request — are cheap, unambiguous, and constant. Their limitation is scope: they only check what someone thought to encode as a rule, and they only see the change in front of them. The recurring audit is a scheduled, whole-surface review asking whether the system as built still matches the system as intended — meaning drift, behaviour drift, the component that is technically token-compliant but no longer matches its documented purpose. Module 4 covered how to run that audit; what changes in the maintenance loop is only the trigger and the cadence: it stops being a heroic once-a-year cleanup and becomes a recurring brief.

Cadence should be sized to the product, not to ambition. Monthly works for a product shipping UI weekly; quarterly is fine for a slower system. The practical wiring is unglamorous: gate commands in CI so they cannot be skipped, the audit brief checked into the repository next to the system documentation with the next run date written in it, and findings opened as reviewable items with evidence attached rather than dumped into a document nobody owns.
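
The "next run date written in the brief" can be checked mechanically. A minimal sketch, assuming the brief stores the date in ISO format and the cadence as a whole number of months; `audit_is_due`, `following_run`, and the dates used are hypothetical.

```python
from datetime import date


def audit_is_due(next_run: date, today: date) -> bool:
    """The scheduled audit is due once the date written in the brief has arrived."""
    return today >= next_run


def following_run(last_run: date, cadence_months: int) -> date:
    """Next run date, cadence_months later (day clamped to 28 to keep the sketch simple)."""
    month_index = last_run.month - 1 + cadence_months
    year = last_run.year + month_index // 12
    month = month_index % 12 + 1
    return date(year, month, min(last_run.day, 28))


# Hypothetical value, as it might be read from the audit brief in the repo.
brief_next_run = date.fromisoformat("2025-03-01")
print(audit_is_due(brief_next_run, today=date(2025, 3, 3)))
print(following_run(date(2025, 3, 3), cadence_months=3))
```

Writing the new date back into the brief would itself be a reviewable change, like everything else the loop produces.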

The report format matters as much as the schedule. A finding without a file, a line, and a screenshot is an opinion, and opinions get argued with rather than fixed. A finding graded by severity with evidence and a proposed fix gets actioned in the same week. The audit module's evidence rules apply unchanged here; the only new requirement is a named human owner for the report, because a report routed to everyone is routed to no one.
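
One way to make the evidence rule checkable is to give a finding a fixed shape. A minimal sketch, assuming the P1 to P3 severity scale used later in this module; the `Finding` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

SEVERITIES = ("P1", "P2", "P3")


@dataclass
class Finding:
    summary: str
    severity: str
    file: Optional[str] = None
    line: Optional[int] = None
    screenshot: Optional[str] = None
    proposed_fix: Optional[str] = None
    owner: Optional[str] = None

    def missing_evidence(self) -> list[str]:
        """Names of the fields that stop this finding from being actionable."""
        missing = [name for name in ("file", "line", "screenshot", "owner")
                   if getattr(self, name) is None]
        if self.severity not in SEVERITIES:
            missing.append("severity")
        return missing


# Hypothetical finding with a location but no screenshot and no owner yet.
finding = Finding("Hardcoded accent colour", "P2", file="banner.css", line=5)
print(finding.missing_evidence())
```

A report step could refuse to open any finding whose `missing_evidence()` list is not empty.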

Narration for this slide

Detection has two rhythms. The first is per change: every pull request that touches UI runs the standing gates — the token audit, the type check, accessibility and visual checks on the screens the branch changed. That keeps the baseline from eroding one merge at a time. The second is the calendar: a recurring audit, monthly or quarterly depending on how fast your product moves, that sweeps the whole surface for what the per-change rules cannot see — meaning drift, behaviour drift, components that pass the checks but no longer match their documented intent. Keep the audit brief in the repository, put the next run date in it, and make sure every finding arrives with evidence, a severity, and a named owner. A report routed to everyone is routed to no one.

Slide 5 of 13 · 16:9

Documentation generated from source, not written after the fact

The most fragile surface in any system is the prose that describes it. No compiler reads it, and stale documentation gets believed long after it stops being true.

  • Reference docs — props, tokens, usage, examples — regenerate from component source and token files
  • Documentation is updated in the same change that makes it stale, never as a follow-up ticket
  • The agent drafts the doc update from the diff it just produced; the reviewer reads both together
  • Documentation that agents read is configuration: stale DESIGN.md actively instructs future changes to be wrong
  • Generated prose still needs an owner who reads it — plausible and subtly wrong is worse than a gap

The test for any sentence of system documentation: if it were wrong tomorrow, who would be misled, and how would they find out?

Slide notes

Documentation is the surface no gate protects. The token audit does not parse Markdown, the type checker does not read usage guidance, and the build does not fail when a prose sentence describes a colour that no longer exists. That is exactly why documentation drifts first and gets trusted longest: it looks authoritative, and looking authoritative is most of what it takes to be believed.

Two practices keep it honest. First, generate what can be generated: prop tables, token value tables, usage examples, and component inventories can be regenerated from source — by the agent, on a schedule or as part of any change that touches the source they describe. Second, for the prose that cannot be generated, enforce the same-change rule: documentation is updated in the same pull request that makes it stale, drafted by the agent from the diff it just produced, and reviewed together with the code. Module 2 set up DESIGN.md as the contract; this is how the contract stays true.
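
"Generate what can be generated" can be sketched for the simplest case, a token value table. This assumes tokens are available as a name-to-value mapping and that the reference table is kept as plain text; `render_token_table`, `docs_are_stale`, and the token names are hypothetical.

```python
def render_token_table(tokens: dict[str, str]) -> str:
    """Render a plain-text token reference, one 'name  value' row per token, sorted by name."""
    width = max((len(name) for name in tokens), default=0)
    rows = [f"{name.ljust(width)}  {value}" for name, value in sorted(tokens.items())]
    return "\n".join(rows)


def docs_are_stale(tokens: dict[str, str], committed_table: str) -> bool:
    """True when the committed table no longer matches what the source would generate."""
    return render_token_table(tokens) != committed_table.strip()


# Hypothetical token source and the table generated from it.
tokens = {"--space-2": "8px", "--color-accent": "#0a7d4f"}
committed = render_token_table(tokens)
print(committed)

# A value changes in the source but the committed table is not regenerated.
tokens["--color-accent"] = "#0b6e46"
print(docs_are_stale(tokens, committed))
```

Run as a gate, a staleness check like this turns "update the docs in the same change" from a habit into something a pull request can fail on.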

The stakes are higher in this course than in a human-only system, and it is worth saying why. In the workflows this course has built, DESIGN.md is not just read by people — it is loaded by the agent at the start of every session as an instruction set. Stale documentation in that world does not merely confuse a new hire; it actively instructs the next hundred agent runs to reproduce the wrong value. Documentation that agents read is configuration, and it deserves the same change discipline as code. The caveat to keep: generated documentation still needs a human owner who reads it, because confidently generated prose that is subtly wrong does more damage than an honest gap.

Narration for this slide

Now the surface nothing protects: documentation. No script fails when a sentence in DESIGN.md describes a colour that no longer exists, and authoritative-looking prose gets believed long after it stops being true. Two habits fix this. Generate what can be generated — prop tables, token values, usage examples — straight from the source, so they cannot drift. And for everything else, update the documentation in the same change that makes it stale, drafted by the agent from its own diff and reviewed together with the code. Remember what raises the stakes here: your agent reads DESIGN.md as instructions. Stale documentation does not just mislead people — it instructs every future run to be wrong. Treat docs the agent reads as configuration.

Slide 6 of 13 · 16:9

Proposed fixes as pull requests, never direct pushes

The rule that makes the whole loop trustworthy: every agent action produces a reviewable change with evidence attached.

  • Mechanical audit findings become fix PRs: token replacements, missing states, regenerated docs
  • Larger work — a token migration across hundreds of files — runs on a branch with a reviewer agent on every diff
  • Every proposal carries its evidence: what was searched, what was touched, what was checked and left alone
  • Gates re-run on the proposal itself, so the reviewer sees the audit and type check output, not a promise
  • Write access to main is the line: agents propose, humans merge

The agent is allowed to do almost everything except decide that its own work is done.

Slide notes

This is the structural rule the loop depends on, and it is worth being absolutist about it: nothing the agent produces lands on the main branch without passing through a human. Not because agent output is usually wrong — most of the mechanical fixes are fine — but because the moment changes flow into the system unreviewed, the team loses the ability to notice the ones that are not, and loses it precisely when volume is highest.

In practice the proposals come in three sizes. Small: individual audit findings fixed and opened as pull requests — a hardcoded colour replaced with its token, a missing focus state added, a stale prop table regenerated. Medium: the design QA findings on someone else's branch, posted as a review comment so the author fixes their own work. Large: a token migration across hundreds of files, run as an orchestrated workflow on a branch with a per-file reviewer agent rejecting any diff that goes beyond the approved mapping, and a visual recapture before the pull request opens. The course's related workflows — design QA on every PR and the design token migration — are the worked versions of the small and large cases; reuse them rather than rebuilding.

The evidence requirement is what keeps review cheap. A proposal that says what it searched, what it changed, what it checked and deliberately left alone, with the gate output attached, can be reviewed in minutes. A proposal that says "trust me, the checks pass" forces the reviewer to either re-do the investigation or rubber-stamp it, and rubber-stamping is the failure mode the next slides deal with.

Narration for this slide

Here is the rule that makes the loop safe to run: the agent proposes, humans merge. Every fix arrives as a pull request with its evidence attached — what was searched, what was changed, what was checked and left alone, plus the gate output from re-running the audits on the proposal itself. Small fixes are individual PRs: a hardcoded colour swapped for its token, a regenerated prop table. Big work, like a token migration across three hundred files, runs on a branch with a reviewer agent on every diff and a visual recapture at the end. Either way, the agent is allowed to do almost everything except decide that its own work is done. That decision is yours.

Slide 7 of 13 · 16:9

The human rotation: what approval actually involves

Approval is a job, not a formality. It needs a named person, bounded time, and a definition of what reading the change means.

  • A small rotation — one named approver per week or sprint — rather than diffuse team responsibility
  • Read the diff and the render, not just the green checks; gates make review cheap, not optional
  • Check scope: did the change stay inside the named tokens and surfaces it claimed?
  • Escalate, do not absorb: anything touching deprecation, breaking impact, or taste goes to the decision owner
  • Budget it honestly: roughly 30–60 minutes a day on an active system, and say so when staffing the rotation

The loop changes the human job from producing maintenance to approving it. Approving is still work, and unstaffed work does not happen.

Slide notes

Teams adopt the loop and then quietly fail to staff it. The proposals arrive, nobody is named as the approver, and the queue either stalls — which discredits the loop — or gets approved in bulk by whoever feels guiltiest, which is worse. The fix is boring: a rotation, with one named approver at a time, a bounded slot in their calendar, and an explicit definition of what approval means.

What it means, concretely: read the diff, not the summary of the diff. Look at the rendered result on at least one affected surface. Check that the change stayed inside its claimed scope — the propagation prompt's do-not-touch list exists precisely so a reviewer can verify it was respected. Read the gate output rather than trusting the green tick, and remember the gate's known blind spots. And when the change touches anything on the human-only list — a deprecation, a rename, a breaking impact, a taste call — escalate it to the decision owner instead of absorbing it into the approval.
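
The scope check in particular lends itself to a mechanical first pass. A minimal sketch, assuming the claimed scope and the do-not-touch list are expressed as glob patterns over file paths; `scope_violations` and the paths are hypothetical.

```python
from fnmatch import fnmatchcase


def scope_violations(touched: list[str], claimed: list[str], do_not_touch: list[str]) -> list[str]:
    """Files the change touched that are outside its claimed scope or on the do-not-touch list."""
    violations = []
    for path in touched:
        in_scope = any(fnmatchcase(path, pattern) for pattern in claimed)
        forbidden = any(fnmatchcase(path, pattern) for pattern in do_not_touch)
        if forbidden or not in_scope:
            violations.append(path)
    return violations


# Hypothetical proposal: two files in scope, one on the do-not-touch list.
touched = ["src/components/Button.css", "src/legacy/Checkout.css", "docs/DESIGN.md"]
claimed = ["src/components/*", "docs/DESIGN.md"]
do_not_touch = ["src/legacy/*"]
print(scope_violations(touched, claimed, do_not_touch))
```

An empty result does not replace reading the diff; it only tells the approver the change stayed where it said it would.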

Be honest about the cost when proposing this to a team. On an active system the approval rotation is somewhere between thirty and sixty minutes a day, more in a week with a migration in flight. That is dramatically less than the production work it replaces, but it is not zero, and pretending it is zero is how rotations collapse into rubber-stamping. The rotation is also where the system's taste gets exercised week to week, which is a reason to rotate it among people whose taste you want encoded — not to hand it permanently to whoever is most junior.

Narration for this slide

The loop only works if the human side is staffed. Set up a small rotation — one named approver per week or sprint — and define what approval means. It means reading the diff, not the summary. Looking at the render on at least one affected surface. Checking the change stayed inside the scope it claimed. Reading the gate output rather than trusting the green tick. And escalating anything that touches a deprecation, a breaking change, or a taste call to whoever owns those decisions, instead of absorbing it. Budget it honestly: on an active system this is thirty to sixty minutes a day. Far less than doing the maintenance by hand — but not zero, and pretending it is zero is how the rotation collapses.

Slide 8 of 13 · 16:9

Good delegation, bad delegation

The line is not how much the agent does. It is whether every action produces reviewable evidence and whether decisions still pass through people.

| | Good delegation | Bad delegation |
| --- | --- | --- |
| The change | Human picks the new value; agent finds every surface, edits, runs gates, reports | Agent decides a rename or a new variant on its own initiative |
| Merging | Agent opens the PR; a person reads the diff and the render before merge | Agent merges its own changes because the gates passed |
| Review | Reviewer reads the diff, the gate output, and the render | Reviewer approves on green checks without opening the diff |
| Docs | Regenerated in the same change, reviewed with the code | Left for a follow-up ticket that never happens |
| Scope | Prompt names surfaces, search steps, do-not-touch list, required report | Open-ended "fix the design system everywhere" instruction |

Delegating too little keeps the bookkeeping human. Delegating too much removes judgment. The evidence trail is what lets you sit safely in between.

Slide notes

The maintenance loop fails in two opposite directions, and teams usually only guard against one of them. Delegating too little — using the agent as a fancier find-and-replace while a human still does all the surface-finding and bookkeeping — wastes the capability and keeps maintenance unstaffed in practice. Delegating too much — letting changes flow into the system without anyone exercising judgment — produces the slow accumulation of technically compliant changes that nobody with taste has looked at in months. The system stays consistent and gets quietly worse.

Walk a couple of rows. The merging row is the bright line from two slides ago: gates exist to make review cheap, not optional, and an agent merging its own work because the checks passed is the automation equivalent of marking your own homework. The review row is the same failure on the human side: approving on green checks without opening the diff is a gate blind spot wearing a person's name. The scope row is the quiet one: open-ended instructions like "fix the design system everywhere" produce changes nobody can review, because there is no claimed scope to verify. The do-not-touch line in a propagation prompt is what lets a reviewer trust a small diff.

A useful test for any proposed delegation: is the thing being delegated checkable? Keeping components compliant with the documented system is checkable, so it delegates well. Deciding whether the system should allow a second accent colour is not checkable — it is a question about product intent — and the agent will answer it confidently anyway, which is exactly why it must not be asked to.

Narration for this slide

Delegation fails in two directions. Too little, and the agent is just a faster find-and-replace while you still do the bookkeeping. Too much, and changes flow into the system without judgment: it stays consistent and gets quietly worse. The table is the line between them. Good: you pick the new value, the agent finds every surface, edits, runs the gates, and reports; a person reads the diff before merge; docs regenerate in the same change; the prompt names its scope. Bad: the agent merges because the checks passed, the reviewer approves without opening the diff, the instruction was "fix everything". The test for any delegation is simple: is it checkable? Compliance is checkable. Whether the system should allow a second accent colour is not, and the agent will answer it confidently anyway.

Slide 9 of 13 · 16:9

Failure modes: rubber-stamping, alert fatigue, silent scope growth

Self-maintaining systems do not fail loudly. They fail by becoming background noise.

  • Rubber-stamping — approvals on green checks; counter with spot-audits of recently approved changes
  • Alert fatigue — too many low-severity findings; tune the rules, batch P3s, never page on advisory issues
  • Silent scope growth — the agent starts fixing things nobody asked it to watch; re-state scope in every brief
  • Gate blind-spot drift — the loop optimises for what its checks can see; review the check list itself quarterly
  • Unattended reliability — months of smooth runs erode attention exactly when volume makes review matter most

The dangerous failure is not a broken loop. It is a loop that keeps running after the humans have stopped looking.

Slide notes

These failure modes deserve their own slide because none of them looks like a failure while it is happening. Rubber-stamping looks like an efficient team clearing its review queue. Alert fatigue looks like a thorough audit. Scope growth looks like a helpful agent. The common thread is attention decay: the loop is designed around human attention at two stations, and every one of these failure modes is a way that attention quietly leaves.

The counters are specific. For rubber-stamping, periodically pull three recently approved changes and review them properly; if the second look finds things the first approval missed, the rotation needs more time or fewer proposals per day. For alert fatigue, treat the finding rules as a product: tune out the false positives, batch the advisory findings into the scheduled report instead of raising them per change, and reserve interruption for severities that genuinely block. For scope growth, re-state the scope in every recurring brief and have the reviewer check the do-not-touch list was respected — agents drift toward helpfulness, and helpfulness across an unbounded surface is indistinguishable from noise. For gate blind spots, remember the lesson from the audit and case-study work earlier in this course: a passing gate means the checks you wrote found nothing, not that nothing is wrong, so review the check list itself on a schedule.
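
Two of these counters can be sketched in a few lines: routing findings by severity, and drawing the spot-audit sample. Which severity interrupts and which waits is an assumption made for the example, as are the function names `route` and `spot_audit_sample`.

```python
import random
from typing import Optional


def route(severity: str) -> str:
    """Assumed routing: P1 interrupts, P2 is raised on the change, everything else is batched."""
    return {"P1": "interrupt", "P2": "raise-on-change"}.get(severity, "batch-into-report")


def spot_audit_sample(approved_changes: list[str], k: int = 3,
                      seed: Optional[int] = None) -> list[str]:
    """Pick up to k recently approved changes for a proper second review."""
    rng = random.Random(seed)
    return rng.sample(approved_changes, min(k, len(approved_changes)))


print(route("P3"))
# Hypothetical identifiers for recently approved changes.
print(spot_audit_sample(["change-a", "change-b", "change-c", "change-d"], seed=7))
```

Sampling at random matters: a reviewer who chooses which approvals to re-check will tend to choose the ones they already trust.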

The last bullet is the cultural one. The failure mode of automated maintenance is not a dramatic breakage; it is a slow accumulation of compliant changes nobody with taste has examined. The standing appointment that matters most is not the audit cadence — it is the recurring moment where a person looks at the actual product and asks whether the system is still serving it.

Narration for this slide

Self-maintaining systems fail quietly, so know the shapes in advance. Rubber-stamping: approvals on green checks — counter it by spot-auditing changes you recently approved. Alert fatigue: too many low-severity findings — tune the rules, batch the advisories, and never interrupt anyone for a P3. Silent scope growth: the agent starts fixing things nobody asked it to watch — restate the scope in every brief and check it was respected. Gate blind spots: the loop optimises for what its checks can see, so review the checks themselves quarterly. And the big one: months of smooth running erode attention exactly when the volume makes attention matter most. The dangerous failure is not a broken loop. It is a loop still running after the humans have stopped looking.

Slide 10 of 13 · 16:9

What still needs people

Most of what stays human stays human because it is a decision right, not a capability gap.

  • Deprecations, renames, and breaking changes — which teams you are willing to break, and on what timeline
  • Intentional variation versus drift — only the people who own the intent can answer
  • Taste shifts and direction — the system exists to encode taste, not to encode it away
  • Accessibility judgment — a passing token audit says nothing about contrast or legibility
  • Prioritisation and escalation — what gets fixed this quarter, and what gets argued about in person

Agents supply the impact analysis — every consumer, every surface a change touches. That changes how confidently humans decide. It does not change who decides.

Slide notes

It is tempting to present the remaining human work as temporary — that better models will eventually own deprecation and taste the way they own propagation. The honest position, and the one this course closes on, is that most of it is not a capability gap at all. Whether to deprecate a token is a question about which consuming teams you are willing to break and on what timeline; that is an organisational decision with an owner, not a pattern-matching problem. Whether a variation is drift or evolution is a question about intent, and only the people who own the intent can answer it. Whether the deeper green is better is taste, and taste is the thing the design system exists to encode — flattening it into a rule the agent applies is how systems become consistent and lifeless at the same time.

What the agent genuinely changes is the quality of the decision inputs. A deprecation decision made with a complete impact analysis — every consumer of the token, every surface a breaking change touches, the migration cost estimated from the actual codebase — is a different decision from one made on instinct and a grep. Demand that analysis from the loop; just do not let the analysis quietly become the decision.
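
The "every consumer of the token" part of that analysis is, at its simplest, a count of references per package. A minimal sketch, assuming tokens are consumed through CSS `var()` and that the first path segment names the package; `token_usages`, the token name, and the file contents are hypothetical.

```python
import re
from collections import Counter


def token_usages(token: str, files: dict[str, str]) -> Counter:
    """Count var(--token) references per package, taking the first path segment as the package."""
    # The trailing [,)] stops --space-legacy-1 from matching --space-legacy-10.
    reference = re.compile(r"var\(\s*" + re.escape(token) + r"\s*[,)]")
    usages = Counter()
    for path, source in files.items():
        count = len(reference.findall(source))
        if count:
            usages[path.split("/", 1)[0]] += count
    return usages


# Hypothetical file contents keyed by path.
files = {
    "web/a.css": "margin: var(--space-legacy-1); padding: var(--space-legacy-1, 4px);",
    "web/b.css": "gap: var(--space-legacy-10);",
    "admin/c.css": "gap: var( --space-legacy-1 );",
}
print(token_usages("--space-legacy-1", files))
```

The count informs the decision to deprecate or defer; it does not make it.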

Name the operational limits too. Gates have blind spots, and accessibility is the sharpest one in this course's own toolchain: nothing in a token audit checks contrast, so a system can be perfectly token-compliant and illegible. Add the specific checks where they matter and keep human judgment behind them. And the loop inherits the quality of its source of truth — an agent maintaining a badly structured system will faithfully propagate its problems everywhere, faster than before. Self-maintenance amplifies the system you have, which is why the first five modules came first.

Narration for this slide

So what still needs people? Mostly things that are decision rights, not capability gaps. Deprecations and breaking changes are about which teams you are willing to break and when — that has an owner, and it is not the agent. Whether a variation is drift or evolution is a question about intent. Taste is the thing the system exists to encode; do not let the loop flatten it. Accessibility needs its own checks and its own judgment — a passing token audit says nothing about contrast. What the agent does change is how well-informed those decisions are: it can hand you every consumer of a token and every surface a breaking change touches before you decide. Take the analysis. Keep the decision.

Slide 11 of 13 · 16:9

Worked example: a month of self-maintenance, reviewed

One mid-size product system, four weeks of the loop running, and what the humans actually had to do.

Loop activity | What the agent did | What the humans did
PR design QA (23 UI branches) | Captured changed screens, flagged 31 token and 4 accessibility findings | Authors fixed their own branches; 2 severity disputes settled in review
Scheduled drift audit (week 2) | Whole-surface sweep; 47 findings with evidence, graded P1–P3 | Owner confirmed P1s, demoted 3 findings as intentional variation
Mechanical fix PRs | 19 small PRs: token replacements, missing focus states, stale prop tables | Approver rotation merged 17, sent 2 back for scope creep
Accent token change | Propagated across CSS, components, and DESIGN.md; gates re-run; 3-line diff | Design lead chose the value; reviewer read the diff and render
Deprecation question | Impact analysis: 2 legacy spacing tokens, 64 usages across 3 packages | Deferred: migration cost not worth it this quarter; logged with the reasoning

Roughly a day of human time across the month, almost all of it judgment: approving, disputing severity, and one deliberate decision not to act.

Slide notes

This composite traces a month of the loop on a mid-size product system, drawn from the executed case studies behind this course's articles and workflows — the accent-token propagation run on this school's own files, the PR design QA gate adopted by a small product team, and the audit and migration workflows. The numbers are indicative of that scale of system, not a benchmark; say so when presenting it.

Walk the rows for the division of labour. The PR gate did the highest-volume work: twenty-three UI branches reviewed automatically, with findings going back to the branch authors rather than to the system team, which only got involved in two severity disputes. The scheduled audit produced the larger findings list, and the most important human action on it was demotion: three findings the owner marked as intentional variation, which is precisely the call an agent cannot make. The mechanical fix PRs are where the approval rotation earned its keep, including sending two back for scope creep: the agent had also reformatted a file while it was in there, which is exactly the drift-toward-helpfulness the failure-modes slide warned about. The accent token change is the propagation pattern from the source article: a human decision, a three-line diff, gates re-run, reviewed in one sitting.
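The scope-creep check the rotation applied by eye can be backed by a small mechanical one. This is a sketch under assumptions: the fix PR's brief declares its scope as a list of glob patterns, and the list of changed paths comes from the diff. The name `out_of_scope` and both input shapes are hypothetical.

```python
from fnmatch import fnmatchcase


def out_of_scope(changed_files, claimed_scope):
    """Return changed paths that match none of the scope patterns.

    changed_files: paths touched by the proposed change (from the diff).
    claimed_scope: glob patterns the brief said the change would stay
        inside (hypothetical convention). Note that fnmatch's "*" also
        matches "/" characters, so "src/components/*" covers nested files.
    """
    return sorted(
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in claimed_scope)
    )
```

An empty result only says the change stayed inside the paths it claimed. It cannot see an unrequested reformat inside an in-scope file, which is why reading the diff stays part of approval.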

The last row is the one to dwell on, because it is the least intuitive. The agent produced a complete impact analysis for deprecating two legacy spacing tokens, and the humans decided not to act this quarter — and logged why. A maintenance loop that can produce a documented decision not to change something is working exactly as designed. The total human cost across the month was on the order of a day, nearly all of it judgment rather than production. That is the trade the whole course has been building towards.

Narration for this slide

Let's make the loop concrete with a month on one mid-size system, drawn from the executed case studies behind this course. The PR gate reviewed twenty-three UI branches and sent thirty-five findings back to their authors. The scheduled audit produced forty-seven findings; the owner confirmed the serious ones and demoted three as intentional — a call only a human can make. Nineteen small fix PRs went through the approval rotation; two got sent back for scope creep. One accent token change propagated everywhere in a three-line diff. And one deprecation question got a full impact analysis and a deliberate decision not to act this quarter, logged with the reasoning. Total human cost: about a day, almost all of it judgment. That is the trade.

Slide 12 of 13 · 16:9

Exercise: draft your maintenance loop on one page

One page, your real system. The goal is a loop you could start running next week, not an aspirational diagram.

  • List every surface a design decision lives on: tokens, theme files, components, docs, canvases, decks
  • Name your standing gates and your audit cadence, with the next audit date written down
  • Decide what arrives as agent-proposed PRs in month one — start with mechanical audit findings only
  • Staff the approval rotation: who, how often it rotates, and the realistic time budget
  • Write the human-only decision list: deprecations, breaking changes, new tokens, intentional variation, taste

If a section of the page is hard to fill in, that is the module of this course to revisit before switching the loop on.

Slide notes

This exercise is the course's closing artifact, and it doubles as a diagnostic. Each line of the page maps to a module: the surface list and gates come from the token and DESIGN.md work in Modules 1 and 2, the review standards from Module 3, the audit cadence and evidence rules from Module 4, the sync surfaces from Module 5, and the rotation and decision list from this module. Where a participant cannot fill a section in, the gap is usually not in this module — it is in the prerequisite the loop assumes.
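The diagnostic reading of the page can be sketched as a lookup from unfilled sections to the modules they depend on. The section keys and the dictionary shape below are assumptions made for illustration; the section-to-module mapping follows the paragraph above.

```python
# Section keys are hypothetical names for the parts of the one-page loop;
# the module numbers follow the mapping described in the slide notes.
SECTION_TO_MODULES = {
    "surfaces": [1, 2],
    "standing_gates": [1, 2],
    "review_standards": [3],
    "audit_cadence": [4],
    "sync_surfaces": [5],
    "approval_rotation": [6],
    "human_only_decisions": [6],
}


def modules_to_revisit(page):
    """Return the module numbers behind every missing or empty section.

    page: mapping of section key -> whatever the participant wrote
        (a string or list); missing keys and empty values count as gaps.
    """
    gaps = [section for section in SECTION_TO_MODULES if not page.get(section)]
    return sorted({module for section in gaps for module in SECTION_TO_MODULES[section]})
```

A filled-in section is not necessarily a good one, so this only catches the blank spots; whether the rotation's time budget is realistic is still a conversation for the live session.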

Steer people towards starting smaller than they want to. Month one should be the standing gates plus mechanical fix PRs only — token replacements, regenerated tables, missing states. Add the scheduled audit in month two once the approval rotation has found its rhythm, and only then consider larger delegations like migrations. The most common way teams discredit the loop internally is switching everything on at once and drowning the rotation in week one.

The two sections worth pressing on in a live session are the rotation and the human-only list. The rotation needs names and a time budget, not a "team will review" statement; unstaffed review is the rubber-stamping slide waiting to happen. And the human-only decision list is the single page that makes the "almost" in self-maintaining trustworthy: if it is not written down, it will be decided ad hoc, under deadline, by whoever is closest, which is exactly how the system drifted in the first place.

Narration for this slide

Your closing exercise: one page, your real system. List every surface a design decision lives on — tokens, theme files, components, docs, canvases, even the onboarding deck. Name your standing gates and your audit cadence, and write down the date of the next audit. Decide what the agent proposes in month one — start with mechanical fixes only, and add the rest once the rotation finds its rhythm. Staff the approval rotation with names and an honest time budget. And write the human-only decision list: deprecations, breaking changes, new tokens, intentional variation, taste. If any section is hard to fill in, that points you at the module to revisit before you switch the loop on.

Slide 13 of 13 · 16:9

Summary, and where to go next

  • Drift is the default; the counterweight is a standing loop — detect, propose, review, apply, log — not a quarterly project
  • Detection runs on two rhythms: gates on every change, and scheduled whole-surface audits with evidence
  • Everything the agent produces arrives as a reviewable change with its evidence — proposals, never pushes
  • Documentation regenerates from source in the same change that makes it stale; docs the agent reads are configuration
  • Approval is staffed work, and deprecations, breaking changes, and taste remain decision rights, not capability gaps

That closes Design Systems for Agents. The natural next step in the curriculum is orchestration — running teams of agents across larger design programmes — or going deeper on review and critique practice.

Slide notes

Recap by walking the loop one last time, but spend the time on how the course's modules slot into it: tokens and DESIGN.md are the source of truth the loop reads and writes; the review gates from Module 3 are the human station every proposal passes through; the audit discipline from Module 4 is the detection step, just put on a schedule; the sync work from Module 5 is what keeps the loop's picture of the system true across tools; and this module added the cadence, the rotation, and the honest list of what stays human.

Close with the framing that the end state is not zero human effort — it is human effort relocated. The system team stops spending its weeks excavating drift and writing documentation after the fact, and spends a much smaller amount of time approving changes, settling severity disputes, and making the decisions that shape the system. The quality of that smaller effort is what the system's quality now depends on, which is why the rotation and the decision list are not afterthoughts.

For where to go next: learners who want to scale this pattern beyond one design system should look at the school's course on orchestrating design agent teams, which covers running multiple agents and larger programmes of work. Those who want to sharpen the judgment side of the loop — what good critique and review practice looks like when an agent is the producer — should look at the design review and critique course. And the related workflows referenced throughout this module — design QA on every PR, the design token migration, and the accessibility sweep — are the runnable starting points for the loop drafted in the exercise.

Narration for this slide

Let's close the course. Drift is the default state of every design system, and the counterweight is a standing loop: detect, propose, review, apply, log. Detection runs on every change and on a schedule. Everything the agent produces arrives as a reviewable change with evidence — proposals, never pushes. Documentation regenerates from source in the same change that makes it stale, because docs the agent reads are configuration. And the human work that remains is real work: a staffed approval rotation, and the decisions that stay yours — deprecations, breaking changes, and taste. From here, the natural next step is orchestration — running teams of agents across bigger programmes — or going deeper on review and critique. Thanks for taking the course, and good luck with the loop.

Module transcript
Module 6, narrated slide by slide

Slide 1: Systems That Maintain Themselves

Welcome to the final module of Design Systems for Agents. Everything we have built so far (tokens that carry intent, a DESIGN.md the agent actually reads, review gates, audits, token sync) comes together here into one arrangement: a system that maintains itself, almost. The agent runs scheduled audits, checks every pull request, proposes fixes as reviewable changes, and keeps the documentation generated from source. The humans on the system shift from producing the maintenance work to approving it. The "almost" is important, and we will be honest about it: deprecations, breaking changes, and taste stay with people. Let's look at the loop.

Slide 2: Maintenance as a standing loop, not a quarterly project

Here is the problem this module solves. Design systems rarely fail at launch — they fail later, quietly, as drift accumulates. A colour gets adjusted in the code but not the docs. A component grows a one-off variant during a deadline week. Each gap is small and defensible, and together they are why nobody fully trusts the system after eighteen months. The cause is structural: every design decision lives on more surfaces than anyone remembers, and humans are bad at finding all of them. That is search and bookkeeping, which is exactly what agents are good at. So the fix is not a heroic quarterly cleanup. It is a standing loop that runs on every change and on a schedule in between.

Slide 3: The maintenance loop: detect, propose, review, apply, log

Here is the whole module in one picture. On the left, the source of truth: tokens, DESIGN.md, the audit schedule, and a written list of decisions that stay human. The agent detects — scheduled drift audits, design QA on every pull request, token and accessibility checks. It proposes — fixes, migrations, and regenerated docs, always as pull requests with a report of what was touched, never direct pushes. A human reads the diff and the gate output at the review gate. Approved changes get applied and logged. And everything that needs judgment — deprecations, breaking changes, taste — goes to people, whose decisions return to the source of truth and start the loop again. The agent runs the work. You hold the gate and the decisions.

Slide 4: Scheduled audits and report cadence

Detection has two rhythms. The first is per change: every pull request that touches UI runs the standing gates — the token audit, the type check, accessibility and visual checks on the screens the branch changed. That keeps the baseline from eroding one merge at a time. The second is the calendar: a recurring audit, monthly or quarterly depending on how fast your product moves, that sweeps the whole surface for what the per-change rules cannot see — meaning drift, behaviour drift, components that pass the checks but no longer match their documented intent. Keep the audit brief in the repository, put the next run date in it, and make sure every finding arrives with evidence, a severity, and a named owner. A report routed to everyone is routed to no one.

Slide 5: Documentation generated from source, not written after the fact

Now the surface nothing protects: documentation. No script fails when a sentence in DESIGN.md describes a colour that no longer exists, and authoritative-looking prose gets believed long after it stops being true. Two habits fix this. Generate what can be generated — prop tables, token values, usage examples — straight from the source, so they cannot drift. And for everything else, update the documentation in the same change that makes it stale, drafted by the agent from its own diff and reviewed together with the code. Remember what raises the stakes here: your agent reads DESIGN.md as instructions. Stale documentation does not just mislead people — it instructs every future run to be wrong. Treat docs the agent reads as configuration.

Slide 6: Proposed fixes as pull requests, never direct pushes

Here is the rule that makes the loop safe to run: the agent proposes, humans merge. Every fix arrives as a pull request with its evidence attached — what was searched, what was changed, what was checked and left alone, plus the gate output from re-running the audits on the proposal itself. Small fixes are individual PRs: a hardcoded colour swapped for its token, a regenerated prop table. Big work, like a token migration across three hundred files, runs on a branch with a reviewer agent on every diff and a visual recapture at the end. Either way, the agent is allowed to do almost everything except decide that its own work is done. That decision is yours.

Slide 7: The human rotation: what approval actually involves

The loop only works if the human side is staffed. Set up a small rotation — one named approver per week or sprint — and define what approval means. It means reading the diff, not the summary. Looking at the render on at least one affected surface. Checking the change stayed inside the scope it claimed. Reading the gate output rather than trusting the green tick. And escalating anything that touches a deprecation, a breaking change, or a taste call to whoever owns those decisions, instead of absorbing it. Budget it honestly: on an active system this is thirty to sixty minutes a day. Far less than doing the maintenance by hand — but not zero, and pretending it is zero is how the rotation collapses.

Slide 8: Good delegation, bad delegation

Delegation fails in two directions. Too little, and the agent is just a faster find-and-replace while you still do the bookkeeping. Too much, and changes flow into the system without judgment: it stays consistent and gets quietly worse. The table is the line between them. Good: you pick the new value, the agent finds every surface, edits, runs the gates, and reports; a person reads the diff before merge; docs regenerate in the same change; the prompt names its scope. Bad: the agent merges because the checks passed; the reviewer approves without opening the diff; the instruction was "fix everything". The test for any delegation is simple: is it checkable? Compliance is checkable. Whether the system should allow a second accent colour is not, and the agent will answer it confidently anyway.

Slide 9: Failure modes: rubber-stamping, alert fatigue, silent scope growth

Self-maintaining systems fail quietly, so know the shapes in advance. Rubber-stamping: approvals on green checks — counter it by spot-auditing changes you recently approved. Alert fatigue: too many low-severity findings — tune the rules, batch the advisories, and never interrupt anyone for a P3. Silent scope growth: the agent starts fixing things nobody asked it to watch — restate the scope in every brief and check it was respected. Gate blind spots: the loop optimises for what its checks can see, so review the checks themselves quarterly. And the big one: months of smooth running erode attention exactly when the volume makes attention matter most. The dangerous failure is not a broken loop. It is a loop still running after the humans have stopped looking.

Slide 10: What still needs people

So what still needs people? Mostly things that are decision rights, not capability gaps. Deprecations and breaking changes are about which teams you are willing to break and when — that has an owner, and it is not the agent. Whether a variation is drift or evolution is a question about intent. Taste is the thing the system exists to encode; do not let the loop flatten it. Accessibility needs its own checks and its own judgment — a passing token audit says nothing about contrast. What the agent does change is how well-informed those decisions are: it can hand you every consumer of a token and every surface a breaking change touches before you decide. Take the analysis. Keep the decision.

Slide 11: Worked example: a month of self-maintenance, reviewed

Let's make the loop concrete with a month on one mid-size system, drawn from the executed case studies behind this course. The PR gate reviewed twenty-three UI branches and sent thirty-five findings back to their authors. The scheduled audit produced forty-seven findings; the owner confirmed the serious ones and demoted three as intentional — a call only a human can make. Nineteen small fix PRs went through the approval rotation; two got sent back for scope creep. One accent token change propagated everywhere in a three-line diff. And one deprecation question got a full impact analysis and a deliberate decision not to act this quarter, logged with the reasoning. Total human cost: about a day, almost all of it judgment. That is the trade.

Slide 12: Exercise: draft your maintenance loop on one page

Your closing exercise: one page, your real system. List every surface a design decision lives on — tokens, theme files, components, docs, canvases, even the onboarding deck. Name your standing gates and your audit cadence, and write down the date of the next audit. Decide what the agent proposes in month one — start with mechanical fixes only, and add the rest once the rotation finds its rhythm. Staff the approval rotation with names and an honest time budget. And write the human-only decision list: deprecations, breaking changes, new tokens, intentional variation, taste. If any section is hard to fill in, that points you at the module to revisit before you switch the loop on.

Slide 13: Summary, and where to go next

Let's close the course. Drift is the default state of every design system, and the counterweight is a standing loop: detect, propose, review, apply, log. Detection runs on every change and on a schedule. Everything the agent produces arrives as a reviewable change with evidence — proposals, never pushes. Documentation regenerates from source in the same change that makes it stale, because docs the agent reads are configuration. And the human work that remains is real work: a staffed approval rotation, and the decisions that stay yours — deprecations, breaking changes, and taste. From here, the natural next step is orchestration — running teams of agents across bigger programmes — or going deeper on review and critique. Thanks for taking the course, and good luck with the loop.