Agentic Design School

Orchestrating a Design Team of Agents: Patterns, Costs and Merge Pain

How to actually run a team of design agents once you have decided to split the work — decomposition patterns, what agent teams, subagents, worktrees, and concurrent canvases support today, a five-surface orchestration run executed for this article, and the complexity tax with the run's own numbers.

Last reviewed 2026-06-01

Section 1

You decided to split. Now you have a management job

This article assumes a decision has already been made: the design task in front of you is genuinely separable, valuable enough to pay for coordination, and you are going to run it across several agents. If that decision is still open, read the companion piece on when to use multiple design agents first — it carries the decision map, the separability test, and a deliberately small two-worker trace that shows what the artifacts look like at the smallest honest scale. Staying with one agent is the default there, and nothing in this article changes that. What changes here is the question. Not whether to split, but how to run the team once you have.

Running the team is a management job, and the manager is you. The orchestrator role — whether you play it yourself in one session or delegate it to a lead agent — is closer to a design lead than to a prompt writer. It decides the decomposition, writes the contract every worker reads, sets the gates that work must pass, and runs the merge review where the separate outputs become one coherent design again. The tools have caught up with this framing: Claude Code now ships an experimental agent-teams surface with a shared task list, messaging, and plan approval; subagents and git worktrees give you bounded workers and isolated checkouts; Codex builds parallelism into its CLI, cloud tasks, and desktop app; OpenPencil runs concurrent agents over regions of a shared canvas. None of them remove the two jobs that decide the outcome: choosing boundaries that are real, and treating the merge as the actual design review.

To keep this practical, the article is built around a run executed for it: a five-surface redesign-spec exercise on this site's own subject matter — home hero band, articles index card grid, article page header, newsletter signup band, plus a read-only QA worker — orchestrated on scratch files, with a deliberate overlap that forced a real conflict, a merge log that resolved it, and a single-agent baseline for comparison. One honesty note up front, repeated where it matters: the worker passes in that run executed sequentially, not as parallel processes, because the drafting environment could not spawn parallel workers. The artifacts, conflicts, and merge decisions are real and quoted; the wall-clock benefit of parallelism is estimated, and labeled as such wherever it appears.

Section 2

Three ways to cut the work: spatial, functional, pipeline

Decomposition is a design decision about boundaries, and there are only three shapes worth teaching. Spatial decomposition splits by place: pages, bands, dashboard panels, canvas regions. Each worker owns a surface, which means workers rarely touch the same files and parallelism is high — but the merged result has seams, and the seams are where it fails. Spacing rhythm drifts between sections, and two surfaces express the same signal two different ways, which is exactly the conflict class the case study below produced. Functional decomposition splits by skill: one worker owns structure, one owns type and copy, one owns accessibility, one runs QA. It maps nicely onto how design teams already think about roles, but every worker touches every surface a little, so the merge is the hardest of the three — style consistency is the casualty, and two domains will eventually want to edit the same component for different reasons. It works best when most of the functional workers are read-only reviewers rather than writers.

Pipeline decomposition splits by stage: a spec pass, a visual pass, an implementation pass, a QA pass, each handing an artifact to the next. It barely parallelizes, because stages wait on each other, but it merges almost for free — each stage's output is the next stage's input, so there is little to reconcile. What it buys is not speed but focus and gates: every handoff is a natural review point. The honest summary is that spatial buys parallelism and pays at the seams, functional buys specialization and pays in consistency, and pipeline buys checkpoints and pays in wall-clock time.

Mixing shapes is normal and usually right. The case study in this article is spatial for its four surface workers and functional for the fifth — a read-only QA worker layered across all of them. Whatever shape you pick, the per-worker rules do not change: one owner, one output file, an explicit list of exclusions, and an acceptance test the orchestrator can check without re-doing the work. If you cannot state the boundary of a slice in one sentence, the decomposition is not done; that test comes from the decision article and it applies with more force here, because at four or five workers a fuzzy boundary multiplies into four or five collisions instead of one.

Diagram: Decomposition decision board
Comparison board with three columns. Spatial decomposition splits by region or surface, parallelizes well, and merges on seams and duplicated signals; pick it when the boundaries are places. Functional decomposition splits by skill domain, parallelizes moderately, and merges on style consistency because every worker touches every surface a little; pick it when the boundaries are skills. Pipeline decomposition splits by stage, barely parallelizes, but merges easily because each stage's output is the next stage's input; pick it when the boundaries are stages. A footer notes that the case study mixes spatial workers with one functional read-only QA worker.

Choose the shape of the split by where the boundaries actually are: places, skills, or stages.

Section 3

Agent teams, in depth: what the lead actually coordinates (verified June 2026)

Claude Code's Agent Teams are the most complete documented version of the orchestrator pattern as a product surface, and they are explicitly experimental: disabled by default, switched on with an environment flag, and subject to change. A team is a lead session plus independent teammate sessions, and the coordination runs through two mechanisms you would otherwise build by hand. The first is a shared task list: tasks have pending, in-progress, and completed states, can declare dependencies on each other, and are claimed under a file lock so two teammates do not race for the same task. The second is a mailbox: messages are delivered automatically between sessions, teammates can write to each other by name rather than routing everything through the lead, and the lead gets notified when a teammate goes idle. Team and task state live in the user's home directory, not in the project, which is a deliberate signal that a team is a working arrangement rather than a project artifact.

Two more mechanics matter for design work. Teammate roles can reuse subagent definitions, so a visual-QA worker you already defined for single-session use can become a teammate with the same tool restrictions and model — though skills and MCP server frontmatter are not carried over, which matters if your design worker depends on a canvas MCP server. And the lead can require plan approval: a teammate works in a read-only planning mode until the lead approves its plan, and the lead approves or rejects autonomously against criteria you give it. That is an orchestration-time review gate — the equivalent of approving a worker's brief interpretation before any output exists. The documentation also names the failure mode every practitioner write-up confirms: the lead starts implementing tasks itself instead of waiting for its teammates, and you have to tell it not to. Practitioner guides describe a delegate mode toggle that restricts the lead to coordination only; that control is reported by practitioners rather than something verified hands-on for this article, so treat it as a thing to check in your own session rather than a documented guarantee.

The same documentation is refreshingly concrete about size and limits. The guidance is three to five teammates with five or six tasks each, and the phrase worth keeping is that three focused teammates often outperform five scattered ones. The limitations list is long enough to plan around: in-process teammates cannot be resumed after the lead session ends, task status can lag and block dependent tasks, there is one team per lead, no nested teams, the lead cannot be transferred, teammates inherit the lead's permissions at spawn time, and split-pane display needs tmux or iTerm2. None of these are reasons to avoid the surface; they are reasons to treat a team as a short-lived working session you set up, run, and shut down — not an environment you leave running. For designers, the practical reading is that the product now does the bookkeeping (tasks, messages, approvals), and everything that makes the output good still lives in the files you give it: the shared constraints, the briefs, and the merge protocol the rest of this article walks through.

  • Agent Teams are experimental and off by default; everything here is as documented in June 2026 and can change.
  • Coordination = a dependency-aware shared task list with file-locked claiming, plus a mailbox with teammate-to-teammate messaging and idle notifications.
  • Plan approval lets the lead hold teammates in read-only planning until their plan passes criteria you set.
  • Teammate roles can reuse subagent definitions (tools and model carry over; skills and MCP server settings do not).
  • Sizing guidance: 3–5 teammates, 5–6 tasks each; documented failure mode: the lead starts doing the work itself.
  • Team and task state live under the home directory, not the repo — a team is a session, not a project artifact.
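
The coordination mechanics in this section can be made concrete with a small sketch. The Python below is not Claude Code's implementation and does not reproduce its on-disk format; `Task`, `TaskBoard`, and the `.lock` file naming are hypothetical names chosen for illustration. It shows only the two ideas described above: a task becomes claimable once its dependencies are completed, and an exclusive file creation decides which teammate wins a claim.

```python
import os
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Task:
    task_id: str
    depends_on: list = field(default_factory=list)
    state: str = "pending"  # pending -> in_progress -> completed
    owner: Optional[str] = None


class TaskBoard:
    """Hypothetical shared task list; lock files live in lock_dir."""

    def __init__(self, lock_dir):
        self.lock_dir = lock_dir
        self.tasks = {}

    def add(self, task):
        self.tasks[task.task_id] = task

    def is_ready(self, task_id):
        # Claimable only while pending and with every dependency completed.
        task = self.tasks[task_id]
        return task.state == "pending" and all(
            self.tasks[dep].state == "completed" for dep in task.depends_on
        )

    def claim(self, task_id, teammate):
        if not self.is_ready(task_id):
            return False
        lock_path = os.path.join(self.lock_dir, f"{task_id}.lock")
        try:
            # O_EXCL makes creation fail if the lock file already exists,
            # so only one claimant can succeed.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        with os.fdopen(fd, "w") as handle:
            handle.write(teammate)
        task = self.tasks[task_id]
        task.state, task.owner = "in_progress", teammate
        return True

    def complete(self, task_id, teammate):
        task = self.tasks[task_id]
        if task.owner != teammate or task.state != "in_progress":
            raise ValueError(f"{teammate} does not hold {task_id}")
        task.state = "completed"
```

The exclusive-create flag lets the operating system arbitrate the race, so a second teammate asking for the same task gets a plain refusal instead of a duplicate claim. A dependent task such as a QA review stays unclaimable until the specs it reads are marked completed.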

Section 4

Workers, worktrees, and what each agent actually sees

Underneath any orchestration surface sit two questions you have to answer per worker: what context does it see, and where do its changes land. On the first, Claude Code subagents are precise in ways that matter for design work. Each subagent runs in its own context window with its own system prompt, tool restrictions, and permissions, and it returns only a summary to the caller — the orchestrating session never absorbs the worker's full transcript. The built-in Explore and Plan subagents skip the project's CLAUDE.md and the parent's git status to stay cheap, while other built-in and custom subagents load them; that distinction decides whether your design rules are actually in front of a given worker or merely near it. Subagents cannot spawn subagents, so the hierarchy stays one level deep, and they can run in the foreground or the background. The orchestration design consequence is simple: nothing a worker needs can be assumed ambient. If the worker must respect DESIGN.md, the brief says so and the worker type loads it.

On the second question, isolation on disk is what keeps parallel workers from fighting over files. Git worktrees are the standard answer: a separate checkout on its own branch, so a worker's edits never touch the checkout another worker is reading. Claude Code can create and enter worktrees itself, supports a hook for relocating them, and its docs explicitly position the three mechanisms against each other — subagents for lightweight delegation inside a session, worktrees for manual parallel sessions, agent teams for coordinated sessions — and document combining them, running a subagent inside a worktree so its changes land in an isolated checkout. Codex points the same direction from the other side of the fence: its CLI supports subagent-style parallel workflows, its cloud tasks run in hosted environments in the background — including a best-of-N shape where several attempts at the same task compete and you keep the best one, which is an alternative to decomposition rather than a form of it — and its desktop app creates a separate git worktree per agent thread automatically. Those Codex details come from its developer documentation as located by search rather than verified hands-on for this article, so treat the specifics as pointers to check rather than quoted behavior; the structural point stands across vendors, because worktree-per-worker has independently become both a product default and the most common hand-rolled pattern in practitioner write-ups.

For spec-level work like the case study below, file isolation can be as simple as one owned output file per worker — no worktrees needed, because nobody edits shared files at all. The escalation path is worth knowing in advance though: specs become component edits, component edits become parallel branches, and at that point ownership boundaries in briefs need their mechanical twin on disk. Briefs decide who is allowed to change what; worktrees make sure that even a worker that ignores its brief cannot quietly overwrite someone else's work.
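
For the point where specs do become parallel branches, the worktree-per-worker pattern is a handful of `git worktree add` calls. The sketch below only builds the commands, so it can be reviewed before anything touches the repository. The `agent/<worker>` branch naming, the `../worktrees` location, and the `main` base branch are assumptions for illustration, not conventions from the run or from any of the tools above.

```python
from pathlib import PurePosixPath


def worktree_commands(workers, root="../worktrees", base="main"):
    """Build one `git worktree add` command per worker.

    Each worker gets its own checkout directory and its own new branch,
    both derived from the worker name (hypothetical naming scheme).
    """
    commands = []
    for worker in workers:
        path = str(PurePosixPath(root) / worker)
        branch = f"agent/{worker}"
        # git worktree add -b <new-branch> <path> <commit-ish>
        commands.append(["git", "worktree", "add", "-b", branch, path, base])
    return commands
```

Each command list can be passed to `subprocess.run(cmd, check=True)` from the repository root, and `git worktree remove <path>` cleans up a checkout when the session ends.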

Section 5

OpenCode and OpenPencil: orchestration as config, and the concurrent canvas

OpenCode's official surface treats agents as project configuration: primary agents and subagents defined in opencode.json or per-file under .opencode/agents/, each with its own permissions for editing, shell access, and — through task permission globs — which other agents it is allowed to invoke. That is enough to express a permissioned orchestrator: a lead agent that can call the design workers you list and nothing else, review workers that cannot edit at all, implementation workers that must ask. What OpenCode does not currently ship is a coordinated team surface comparable to the one described above; that layer is community territory, visible in an open feature request for conductor-style teams in isolated workspaces, a written port of the agent-teams idea onto OpenCode, and plugin projects that wire an orchestrator persona to specialist agents on different models. Use those as evidence that the pattern transfers, not as features you can rely on — they are fast-moving, mostly single-maintainer projects, and this article cites them as examples rather than recommendations.

OpenPencil is the design-native expression of the same idea and the clearest example of spatial decomposition as a product behavior. Its concurrent agent teams have an orchestrator decompose a page into spatial sub-tasks, with multiple agents streaming output into different regions of the same vector canvas and per-member indicators showing who is working where. Its file format is the more interesting half for this article: the .op format is JSON designed for concurrent reads and writes, and the app ships a git panel with three-way merge and a conflict UI for design files — which makes it the only tool covered here whose design artifact assumes multiple writers rather than tolerating them. An MCP server and a CLI let external agents act as the team. Two cautions belong next to that description: it is reported from the repository as it stood at the start of June 2026, when the project's main branch had been quiet for about a month with a new version in progress, so re-check before you depend on it; and the run in this article did not execute inside OpenPencil — the case study below is plain files and plain agents, not a canvas session.

Step back from the individual tools and the operating model is the same everywhere: a coordinating context, bounded workers with their own context, a shared contract they all read, file-level ownership, and a merge step nobody automates away. The differences are where the plan lives — a conversation, a task list, a config file, a canvas orchestrator — and how workers communicate, from report-back-only to mailboxes to the file system. That is good news for designers, because it means the artifacts this article spends the rest of its time on are portable. Briefs, constraints, owned outputs, and merge logs work on every one of these platforms, and they are the part you keep when the experimental surfaces change.

Section 6

Case study: a five-surface redesign run, actually orchestrated

The run executed for this article is a redesign-spec exercise on four of this site's own surfaces: the home hero band, the articles index card grid, the article page header and meta block, and the newsletter signup band, plus a fifth, read-only QA worker. The deliverables were specifications, not code: five Markdown files an implementation pass could follow, produced entirely in a scratch folder that was deleted after the excerpts were captured. The decomposition is spatial, the surfaces are genuinely separable, and one overlap was left in deliberately: every content surface needs a position on how the site's last-reviewed signal is presented and which accent color it uses. That overlap exists to force the conflict class that spatial splits produce in real work, at a scale where several workers can collide at once rather than only two.

Execution, honestly described: one orchestrating session wrote the orchestration brief, the shared constraints, and the five worker briefs as real files. The five worker outputs were then produced as sequential bounded passes — each pass given only its own brief, the shared constraints, and DESIGN.md — rather than as parallel subagent processes, because this drafting environment could not spawn parallel workers. The orchestrating session then ran the merge review and wrote the merge log, and finally ran the same five-surface brief once more as a single-agent baseline for the cost comparison. Everything quoted in the next few sections is from those files, unedited apart from trimming. The wall-clock total for the orchestrated run was about 77 minutes as executed; the parallel figure quoted later is an estimate of what the worker phase would compress to, not a measurement.

The orchestration brief is the parent contract, and at team scale it earns its length. It states the user job once so no worker reinvents it, points everyone at the same constraints file, gives each worker a one-sentence ownership boundary, names the overlap the merge should expect to resolve, and writes the merge gates down before any work exists — because gates invented after the outputs arrive have a way of bending around whatever came back. This is the actual brief from the run.

orchestration-brief.md (from the run)
# Field-Guide Refresh Specs — Orchestration Brief

User job: a designer learning agentic workflows lands on the site, decides quickly
whether it is current and trustworthy, finds an article worth reading, and can tell
on every surface when the material was last reviewed.

Deliverable: five Markdown artifacts a later implementation pass can follow without
re-deciding structure — four redesign specs (home hero band, articles index card
grid, article page header and meta block, newsletter signup band) and one read-only
QA findings file. No production code is written or edited in this run.

Shared constraints: every worker reads shared-constraints.md and DESIGN.md before
proposing anything.

Ownership boundaries (one sentence each):
- Worker 1 owns the home hero band: headline, supporting copy, primary and secondary
  CTAs, and the proof strip under them.
- Worker 2 owns the articles index card grid: the repeated article card's fields,
  hierarchy, states, and the grid's responsive behavior.
- Worker 3 owns the article page header and meta block: title, summary, topics,
  tools, and the last-reviewed treatment on the article page itself.
- Worker 4 owns the newsletter signup band: tone, copy structure, form layout,
  and the expectation-setting line.
- Worker 5 (QA) owns nothing; it reads the other four outputs and DESIGN.md and
  reports findings only.

Known overlap (deliberate): every content surface needs a position on how the
"Last reviewed" signal is presented and which accent color it uses. Workers may
each propose one; the orchestrator resolves the collision at merge.

Merge gates:
1. Read every output in full before accepting anything.
2. List agreements, conflicts, and gaps in merge-log.md.
3. Resolve conflicts against DESIGN.md and the user job, not against whichever
   worker wrote the most.
4. Record rejected options and the reason.
5. Run the QA findings against the merged plan; log unowned gaps as open questions
   rather than inventing answers during the merge.
6. No worker output is implemented until the merged plan is approved by a human.

Screenshot: Five-surface orchestration run board
Sequence board with six lanes. The orchestrator lane writes the orchestration brief, shared constraints, and five worker briefs in about twenty-two minutes, passes a brief-approval gate, and later runs a fourteen-minute merge review that resolves two conflicts and logs two unowned gaps before a human approves the plan. Four worker lanes each produce one spec file for the hero band, card grid, article header, and newsletter band, the last needing one revision; a read-only QA lane reports five findings. A note explains the worker passes ran sequentially in this environment and that parallel workers would compress the middle phase toward the longest single pass.

The executed run as lanes: the orchestrator's contract and merge review, four worker passes, the read-only QA pass, and the gates between them.
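
The note about compressing the middle phase can be written as simple arithmetic. The sketch below is one way to estimate it, assuming the four surface passes could run fully in parallel and the QA pass still has to wait for all of them because it reads their outputs. Setup (22 minutes) and merge (14 minutes) are the figures reported for the run; the per-pass minutes are placeholders chosen to sum to the reported 77-minute total, not measurements.

```python
def sequential_minutes(setup, surface_passes, qa_pass, merge):
    """Wall-clock total when every pass runs one after another."""
    return setup + sum(surface_passes.values()) + qa_pass + merge


def parallel_estimate_minutes(setup, surface_passes, qa_pass, merge):
    """Estimate with surface passes in parallel; QA still runs after them."""
    return setup + max(surface_passes.values()) + qa_pass + merge


# Placeholder per-pass minutes for illustration only (not from the run).
placeholder_passes = {"hero": 8, "card-grid": 9, "article-header": 7, "newsletter": 11}
placeholder_qa = 6

sequential = sequential_minutes(22, placeholder_passes, placeholder_qa, 14)
estimate = parallel_estimate_minutes(22, placeholder_passes, placeholder_qa, 14)
```

The shape matters more than the placeholder values: setup, QA, and merge do not shrink with more workers, so the saving is capped by the gap between the sum of the surface passes and the longest one.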

Section 7

The shared constraints file is the team's design system entry point

At two workers you can get away with pasting constraints into each brief. At five, duplication is how drift starts: one brief gets edited, four do not, and the workers are now following different rules without anyone deciding that. The fix is one shared constraints file that every brief points at and no brief restates. It is deliberately short — it does not replace DESIGN.md, it tells every worker to read DESIGN.md and then adds only the run-specific rules: what kind of artifact to produce, which component names to reuse, which color semantics are non-negotiable, and the structural rule that makes the merge possible at all — each worker writes exactly one file and edits nothing else.

The constraints file is also where you encode the lessons of previous runs. The line about not introducing new colors exists because agents do it constantly and plausibly; the line requiring an explicit do-not list in every spec exists because exclusions are what make a spec checkable; the line requiring states and responsive behavior exists because those are the sections workers skip when nobody asks. One worker in this run violated the no-new-colors rule anyway — its first pass introduced an off-token panel tint and was sent back — which is worth noticing: shared constraints reduce violations, they do not eliminate them, and the acceptance check at the gate is what actually catches the remainder.

This is the file from the run, in full. It took about six of the twenty-two setup minutes and was read by every worker pass and by the QA worker, which used it as the checklist to review against.

shared-constraints.md (from the run)
# Shared Constraints (read before proposing anything)

- Read DESIGN.md first. Use its token and component names, not raw values and not
  invented primitives. Reuse PageHero, SectionBand, SectionHeading, AccentCard,
  SchoolBadge, ArrowLink, NewsletterForm, ArticleHeader, ArticleCard.
- Editorial density: flat bordered panels, 2px borders, 8px radius maximum, no
  decorative gradients, no nested cards, no marketing-hero treatment.
- Color semantics: deep school blue for CTAs and emphasis, warm school yellow for
  learning bands and badges, lab green reserved for workflow and proof-layer cues.
  Do not introduce new colors.
- Typography: serif display for headlines, sans for body, mono only inside code
  surfaces. No letter-spacing tricks.
- Every spec must state: structure, hierarchy, states (including empty or long
  content), responsive behavior, and an explicit "do not" list.
- Each worker writes exactly one file in outputs/ and edits nothing else.
- Specs must be implementable against the existing component system without new
  dependencies.
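
Two of the rules in this file are mechanical enough to check before the merge review starts: each worker writes exactly one file, and every spec states the five required parts. A minimal sketch of such a gate check follows, assuming each spec uses Markdown headings named after the required parts and that the orchestrator can list which files a pass changed. The function names and the heading convention are hypothetical, not taken from the run.

```python
REQUIRED_PARTS = ("structure", "hierarchy", "states", "responsive behavior", "do not")


def missing_parts(spec_text):
    """Return the required parts that no Markdown heading in the spec mentions."""
    headings = [
        line.lstrip("#").strip().lower().replace("-", " ")
        for line in spec_text.splitlines()
        if line.startswith("#")
    ]
    return [
        part for part in REQUIRED_PARTS
        if not any(part in heading for heading in headings)
    ]


def gate_check(owned_output, changed_files, spec_text):
    """Collect gate failures for one worker pass; an empty list means it passes."""
    problems = []
    if owned_output not in changed_files:
        problems.append(f"owned output not written: {owned_output}")
    for path in sorted(changed_files):
        if path != owned_output:
            problems.append(f"edited a file it does not own: {path}")
    for part in missing_parts(spec_text):
        problems.append(f"spec is missing a section: {part}")
    return problems
```

A check like this confirms only that the sections exist and the file boundary held. Whether a spec quietly introduces a new color is still a reading job for the QA pass and the orchestrator.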

Section 8

Worker briefs at team scale: small, bounded, and checkable

A worker brief at team scale looks almost identical to a worker brief at two-worker scale, and that is the point — the anatomy does not grow with the team, only the count does. Scope with explicit exclusions, the inputs the worker is allowed to read, one task, one output file, and acceptance criteria the orchestrator can check in a minute or two without redoing the work. What changes at five workers is how much weight the exclusions carry. With two workers, a boundary violation is an awkward overlap; with five, a single worker that wanders re-decides things three other workers own, and the merge review inherits a tangle. The card-grid brief below is typical of the run: most of its text is about what not to do.

Notice what the brief does not contain: design direction. It does not tell the worker what the card should look like, which fields matter most, or how the freshness signal should read. The taste lives in DESIGN.md and the shared constraints; the brief carries the boundary, the deliverable, and the acceptance test. That separation is what keeps briefs cheap — the five briefs in this run took roughly a quarter hour of the setup time between them — and cheap briefs matter because the alternative is skipping them, and skipped briefs are how you end up with five eager workers and no decomposition.

The QA worker's brief deserves a special mention because its boundary is behavioral rather than spatial: it owns nothing, reads everything, and is forbidden from proposing fixes. Findings only, each with a severity, the constraint it violates, and the evidence line. That restriction is not modesty — it keeps review independent of authorship, exactly like a design crit where the reviewer does not grab the pen. In this run the QA pass caught both conflicts the orchestrator would otherwise have had to find by close reading, plus a contrast check and the gaps no brief had assigned.

worker-briefs/02-articles-index-card-grid.md (from the run)
# Worker 2: Articles Index Card Grid Spec

Scope: the articles index card grid only — the repeated article card and the grid
that holds it. Do not specify the index page hero, the article detail page, or the
newsletter band. Do not write code.

Inputs: DESIGN.md, ../shared-constraints.md, ../orchestration-brief.md (your
boundary and the known overlap only).

Task: specify the article card (fields, order, emphasis, last-reviewed signal,
long-title and missing-summary behavior) and the grid (columns per breakpoint,
ordering, what happens with very few or very many articles).

Output: outputs/02-articles-index-card-grid-spec.md with structure, hierarchy,
states, responsive behavior, and a do-not list.

Acceptance criteria:
- Card builds on AccentCard and ArticleCard from DESIGN.md; badges use SchoolBadge.
- No new colors; no nested cards; radius within the 8px ceiling.
- Long-title and missing-summary behavior is explicit.
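
Because the anatomy of a brief stays fixed while the count grows, the set of briefs can be checked as data before any worker runs. The sketch below models a brief with the five parts named in this section and flags briefs with no exclusions, no acceptance criteria, or an output file another worker already owns. `WorkerBrief` and `brief_problems` are hypothetical names; the run itself used plain Markdown files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerBrief:
    worker: str
    scope: str
    exclusions: tuple   # what the worker must not specify or touch
    inputs: tuple       # files the worker is allowed to read
    output: str         # the single file the worker owns
    acceptance: tuple   # criteria the orchestrator can check quickly


def brief_problems(briefs):
    """Return human-readable problems found across a set of briefs."""
    problems = []
    output_owner = {}
    for brief in briefs:
        if not brief.exclusions:
            problems.append(f"{brief.worker}: no explicit exclusions")
        if not brief.acceptance:
            problems.append(f"{brief.worker}: no acceptance criteria")
        if brief.output in output_owner:
            problems.append(
                f"{brief.worker}: output {brief.output} already owned by "
                f"{output_owner[brief.output]}"
            )
        else:
            output_owner[brief.output] = brief.worker
    return problems


# The card-grid brief above, expressed as data.
card_grid = WorkerBrief(
    worker="worker-2",
    scope="articles index card grid",
    exclusions=("index page hero", "article detail page", "newsletter band", "code"),
    inputs=("DESIGN.md", "../shared-constraints.md", "../orchestration-brief.md"),
    output="outputs/02-articles-index-card-grid-spec.md",
    acceptance=(
        "builds on AccentCard and ArticleCard; badges use SchoolBadge",
        "no new colors, no nested cards, radius within the 8px ceiling",
        "long-title and missing-summary behavior is explicit",
    ),
)
```

Running the check over all five briefs before the first pass is a cheap way to catch a shared output file or a missing do-not list while it is still a one-line fix.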

Section 9

What came back, and where five workers collided

The four surface specs were structurally good on first pass — the briefs and constraints did their job for hierarchy, states, and component reuse — and the failures were exactly the kind a team produces rather than the kind a bad worker produces. The hero worker delivered a clean band structure and a sensible proof panel, then crossed its boundary: it specified a strip of three recently reviewed article cards including the card's internal anatomy, which belongs to the card-grid worker. The card-grid worker and the article-header worker each answered the deliberately overlapped question differently: the card spec marked recently reviewed articles with a lab-green pill, while the header spec used a school-yellow badge inside a 90-day window. Both are defensible alone; together they ship one signal with two meanings on adjacent surfaces, and the green version quietly violates the design system's color semantics. The newsletter worker produced the run's only revision — its first pass invented an off-token panel tint despite the constraint saying not to, and was sent back once.

The QA worker earned its place. Its findings file flagged both collisions as P1s with the evidence lines quoted, confirmed the newsletter revision had resolved the contrast question it would otherwise have raised, added a small consistency note about meta-row ordering, and — most usefully — named the decisions nobody owned: what the last-reviewed line says when an article has never been re-reviewed, and who owns the vertical rhythm where the hero band meets the newsletter band on the home page. Those gaps are not worker errors; they are decomposition errors, seams the orchestrator failed to assign. At two workers you can usually spot the seam yourself. At five, a read-only reviewer that does nothing but look for them is the cheapest insurance in the run.

One more failure is worth keeping because it is the canonical one: the orchestrating session caught itself starting to draft the hero spec before the worker pass had run, and had to discard that text. This is the same failure mode the agent-teams documentation warns about for lead agents — the lead starts implementing instead of delegating — and it is just as real when the lead is you. The discipline that fixes it is structural, not motivational: the orchestrator's deliverables are the contract and the merge log, and any design content it produces outside those is a sign the decomposition is being bypassed.

  • Worker 1 (hero): accepted for structure; crossed its boundary by re-specifying card anatomy — caught by QA as a P1.
  • Worker 2 (card grid): proposed a lab-green reviewed pill — token-semantics conflict with the design system and with Worker 3.
  • Worker 3 (article header): proposed a school-yellow badge inside a 90-day window — the proposal that survived the merge.
  • Worker 4 (newsletter): one revision after introducing an off-token tint; second pass accepted.
  • Worker 5 (QA): two P1s, one P2, two P3s, plus the two unowned gaps — findings only, no fixes.
  • Orchestrator: briefly started doing a worker's job itself; the text was discarded and the pass re-run from the brief.

Section 10

The merge is the real design review

Everything before the merge is preparation; the merge is where the design decisions actually get made. At team scale the merge review has a shape worth following in order. First an ownership audit: did every worker stay inside its boundary, and did anything arrive that two workers both claim? Second, conflict classification, because conflicts at this scale fall into recognizable classes — token-semantics conflicts, where two surfaces use the system's colors or type roles to mean different things; seam conflicts, where spacing, alignment, or band order breaks where two owned regions meet; duplicated-signal conflicts, where the same information is expressed twice in competing ways; and unowned-gap defects, the decisions nobody was assigned and therefore nobody made. Third, resolution against the constitution rather than the contributors: the design system and the user job decide, not whichever worker wrote the most or returned last. Fourth, a written record of what was rejected and why. The smaller companion article carries a general-purpose merge checklist; what changes at five workers is not the checklist but the volume — which is why classification and a written log stop being optional.
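
The first, second, and fourth steps of that review have a shape that is easy to pin down in code, even though the third step stays a human judgment. The sketch below is an illustration under assumed names (`ownership_audit`, `ConflictClass`, `MergeEntry`): it lists every item a worker defined without owning it, names the four conflict classes, and refuses a log entry that rejects an option without a reason.

```python
from dataclasses import dataclass
from enum import Enum


class ConflictClass(Enum):
    TOKEN_SEMANTICS = "token-semantics"
    SEAM = "seam"
    DUPLICATED_SIGNAL = "duplicated-signal"
    UNOWNED_GAP = "unowned-gap"


def ownership_audit(owners, defined_by):
    """Find boundary crossings.

    owners: item -> worker that owns it.
    defined_by: worker -> set of items its output defines.
    Returns (item, defining_worker, owning_worker) tuples; the owner is
    None when nobody was assigned the item at all.
    """
    crossings = []
    for worker in sorted(defined_by):
        for item in sorted(defined_by[worker]):
            owner = owners.get(item)
            if owner != worker:
                crossings.append((item, worker, owner))
    return crossings


@dataclass
class MergeEntry:
    title: str
    conflict_class: ConflictClass
    decision: str
    rejected: dict  # rejected option -> reason it was rejected

    def __post_init__(self):
        # A log entry is only durable if the decision and every rejection
        # carry a written reason.
        if not self.decision.strip():
            raise ValueError(f"{self.title}: a merge entry needs a decision")
        for option, reason in self.rejected.items():
            if not reason.strip():
                raise ValueError(
                    f"{self.title}: no reason recorded for rejecting {option!r}"
                )
```

Applied to this run, the audit would surface the hero worker defining card anatomy that belongs to the card-grid worker, and the entry type would reject a log line that drops the green pill without saying why.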

In this run the merge log resolved both conflicts in about fourteen minutes. The last-reviewed treatment went to one rule everywhere — a school-yellow badge inside a 90-day window, plain muted text otherwise — because the design system assigns badges to the yellow secondary and reserves green for workflow and proof cues, and because the 90-day window matches the cadence the hero's proof panel claims. The green pill was rejected on token semantics, not on quality. The hero's card strip was kept as a placement decision but stripped of its own card anatomy, which now references the card-grid spec — one owner per component definition. Both rejected options are recorded with reasons, which is what makes the log durable: the next agent or human who wonders why reviewed badges are not green can read the answer instead of relitigating it. The two unowned gaps were logged as open questions for a human, not answered by the orchestrator on the spot, because inventing answers during the merge is how scope quietly re-enters through the back door.
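
The merged rule is small enough to state as a function, which is a useful test of whether a merge decision is actually unambiguous. The sketch below assumes the 90-day window is inclusive and that the badge carries the same "Last reviewed" wording as the muted text; the merge log specifies neither detail, and the function and constant names are hypothetical. It requires a review date, so it deliberately does not answer the open question about articles that have never been re-reviewed.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
)
FRESH_WINDOW_DAYS = 90  # window from the merge decision; inclusive by assumption


def last_reviewed_treatment(reviewed, today):
    """Return (treatment, label) for the last-reviewed signal on any surface."""
    label = f"Last reviewed {MONTHS[reviewed.month - 1]} {reviewed.year}"
    age_days = (today - reviewed).days
    if 0 <= age_days <= FRESH_WINDOW_DAYS:
        return ("SchoolBadge", label)  # school-yellow badge inside the window
    return ("muted-text", label)       # plain muted text otherwise
```

Writing the rule down this way also exposes the choices the log left implicit, such as whether day 90 itself still counts as fresh, which are exactly the kind of detail an implementation pass would otherwise decide on its own.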

It is worth saying plainly that the merge is the part of the run you cannot delegate away, although you can get help with it. A reviewer agent over the merged plan is a reasonable second pass — that is essentially what the QA worker was — and the next section shows where automation can hold the gate. But choosing between the green pill and the yellow badge is a design decision with a reason attached, and the reason is the deliverable. A merge produced by concatenation, where every worker's output is accepted because none of it is technically wrong, is the most common way multi-agent design work fails while looking like it succeeded.

merge-log.md (from the run, conflicts and gaps section)
## Conflict 1 — the "Last reviewed" treatment (02 vs 03, flagged P1 by QA)
- Worker 2: lab-green pill in the card meta row, plain muted text after six months.
- Worker 3: school-yellow SchoolBadge within 90 days, muted text otherwise.
- Decision: one rule everywhere — school-yellow SchoolBadge when the review is
  within 90 days, plain muted "Last reviewed <Month YYYY>" otherwise. DESIGN.md
  assigns badges to the yellow secondary and reserves lab green for workflow and
  proof-layer cues; a per-article freshness mark is a badge, not a proof-layer
  surface. The 90-day window (03) wins over the six-month window (02) because it
  matches the review cadence the hero proof panel claims.
- Rejected: the lab-green pill (token semantics; signal would mean different
  things on adjacent surfaces) and the six-month threshold (inconsistent with the
  cadence claim).

## Conflict 2 — hero card strip (01, flagged P1 by QA)
- Worker 1 specified a three-card "recently reviewed" strip including card
  anatomy, which is Worker 2's surface.
- Decision: keep the strip as a placement decision only; the hero spec references
  "three ArticleCards as specified in 02" and loses its own field list. Card
  anatomy has exactly one owner.
- Rejected: a second card definition living inside the hero spec.

## Gaps at the seams (no owner — logged, not invented)
- Copy when an article has never been re-reviewed since publication.
- Home-page vertical rhythm and band-tone order between the hero band and the
  newsletter band.
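
The decision in Conflict 1 is small enough to write down as a rule, which is a useful check that the log leaves nothing ambiguous. The sketch below is one reading of it, not the run's implementation: the function name is hypothetical, the caller is assumed to supply the review age in whole days, and "within 90 days" is read as inclusive, which the log does not specify. The never-re-reviewed case is deliberately left undecided, because the log records it as an open question.

```shell
#!/usr/bin/env bash
# Sketch of the merged "Last reviewed" rule from Conflict 1.
# Assumptions: the input is the age of the last review in whole days, and
# "within 90 days" is inclusive (<= 90). Neither is stated in the log.

freshness_treatment() {
  local days_since_review="$1"

  # Never re-reviewed is an unowned gap in the log, so the sketch reports the
  # open question instead of inventing copy for it.
  if [ -z "$days_since_review" ]; then
    echo "open-question"
    return 0
  fi

  if [ "$days_since_review" -le 90 ]; then
    echo "badge"   # school-yellow SchoolBadge
  else
    echo "muted"   # plain muted "Last reviewed <Month YYYY>"
  fi
}
```

Writing the rule this way makes the boundary case visible: day 90 itself has to land on one side, and that is a decision to confirm with whoever owns the cadence claim.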

Section 11

The complexity tax, with this run's numbers

The honest way to talk about cost is to run the same work twice, so this run did. The single-agent baseline took the same five-surface brief in one continuous pass: roughly 28 minutes, one iteration, a perfectly usable combined spec, and — because one context held all four surfaces — a single consistent freshness rule decided implicitly along the way. The orchestrated run took about 77 minutes as executed: 22 of setup, 41 of sequential worker passes including one revision, and 14 of merge review. Had the worker passes run in parallel, the middle phase would compress toward the longest single pass and the total would land somewhere near 46 minutes — an estimate, since this environment ran them one after another. Token accounting was not available per pass, so the multiple is also an estimate: roughly three and a half to four times the baseline, driven by every worker re-reading the design system and constraints, plus the orchestration and merge text that simply does not exist in a single-agent run.
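
The wall-clock arithmetic in that paragraph can be checked in a few lines. This is a sketch using only the minutes quoted above; the variable names are illustrative, and the "implied longest pass" is simply what the 46-minute estimate leaves over once setup and merge are subtracted, not a figure the run measured.

```shell
#!/usr/bin/env bash
# Wall-clock figures quoted in the paragraph above, in minutes.
baseline=28
setup=22
workers_sequential=41
merge=14
parallel_estimate=46

# Orchestrated total as executed: setup + sequential worker passes + merge.
total=$((setup + workers_sequential + merge))

# Longest single worker pass implied by the parallel estimate (derived, not measured).
implied_longest_pass=$((parallel_estimate - setup - merge))

# Ratio helper; awk handles the decimal division that bash arithmetic cannot.
ratio() { LC_ALL=C awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

echo "as executed: ${total} min, $(ratio "$total" "$baseline")x baseline"
echo "parallel (estimate): ${parallel_estimate} min, $(ratio "$parallel_estimate" "$baseline")x baseline"
echo "implied longest pass: ${implied_longest_pass} min"
```

On these inputs the executed run is 77 minutes, or 2.75 times the baseline, and the parallel estimate is about 1.64 times. Both are wall-clock multiples; the token multiple is a separate estimate.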

What did the extra cost buy? Not output volume — the baseline covered all four surfaces. It bought the things the baseline structurally cannot produce: two conflicts surfaced as comparable proposals and resolved on the record before any implementation, an independent QA pass with findings that were not self-graded, explicit state coverage the baseline thinned out on two surfaces, a written log of rejected options, and a set of briefs that can be reused the next time these surfaces change. The baseline was faster precisely because it never had to defend its decisions; whether that is a saving or a debt depends on how much the surfaces matter and how many people will touch them next.

The external numbers point the same direction at larger scale. Anthropic's engineering write-up of its multi-agent research system reports such systems using around fifteen times the tokens of a single chat interaction, a research-workload figure quoted here for scale rather than as a design benchmark. The official cost guidance for agent teams says token use scales roughly with team size and that idle teammates keep consuming tokens until the team is cleaned up; the practical economy moves are smaller teams, cheaper models for workers, focused spawn prompts, and shutting things down. Put the run-level and the published numbers together and the rule of thumb for design work is unglamorous. Orchestration costs about half again to four times as much as just doing it: in this run, roughly 1.6 times the baseline's wall-clock had the workers run in parallel, 2.75 times as executed, and an estimated three and a half to four times the tokens. It pays off in surfaced conflicts, decision records, and reusable structure rather than in speed at small scale, and it only pays off in speed when the workers genuinely run in parallel on genuinely separable surfaces.

Table: Complexity-tax table
Table comparing the single-agent baseline and the orchestrated run. The baseline used one agent, about twenty-eight minutes, one pass, surfaced no conflicts, and used roughly one times the tokens; it was fastest but produced no decision record, thinner state coverage, and no contrast check. The orchestrated run used an orchestrator, four workers, and a QA worker as sequential passes, took about seventy-seven minutes with an estimated forty-six if parallel, twenty-two minutes of setup, six passes plus one revision, resolved two conflicts, caught one unowned gap plus a contrast check and a consistency rule before implementation, and used an estimated three and a half to four times the tokens; it bought a decision record, early conflicts, QA findings, and reusable briefs. A footnote band notes that token figures are estimates and carries the attributed Anthropic fifteen-times figure and Claude Code's guidance that team token use scales with team size and idle teammates keep consuming tokens.

The same five-surface brief, run twice on scratch artifacts. Wall-clock figures are approximate; token figures are labeled estimates.

Section 12

Coordination machinery: gates you can automate, and the ones you should not

Most of the coordination in a run like this is files — the contract, the constraints, the briefs, the owned outputs, the log. The machinery question is which gates can be enforced automatically rather than by the orchestrator remembering to check. Agent teams expose three hooks for exactly this: one fires when a teammate goes idle, one when a task is created, and one when a task is reported complete — and a hook that exits with a blocking code rejects the event and sends feedback to the teammate. For design work the task-completion hook is the interesting one: it can run a script against the worker's output before the task is allowed to close — check that the output file exists and is the only file touched, grep for hex values or color names that bypass the token set, require the do-not section, fail on forbidden component names. That is the run's acceptance criteria turned into a gate that does not depend on anyone's attention. Plan approval is the other automatable gate, sitting earlier: the lead holds each teammate in read-only planning until its plan satisfies criteria you wrote, which is the closest current product equivalent of approving a worker's interpretation of its brief before the work starts.

The sketch below shows the shape of a completion gate. Treat it as a sketch: the agent-teams surface is experimental, hook details can change, and the same check is just as useful run by hand or in CI against a pull request from a worktree. The point is the placement, not the platform — the cheap, mechanical parts of the merge review (token discipline, file ownership, required sections) move to the gate, so the human part of the merge review can spend its attention on the conflicts and the seams.

What should not be automated is the resolution itself. A hook can detect that two outputs both define a freshness treatment; it cannot decide which one the design system actually supports, and it certainly cannot decide that neither is right. The same goes for plan approval criteria that amount to taste. Automate detection and discipline; keep judgment in the merge log, with a name attached.

Design gate on task completion (sketch — verify hook details against current docs)
#!/usr/bin/env bash
# TaskCompleted-style gate: block completion until the worker's output passes
# the run's mechanical checks. Exit 0 to allow, exit 2 to block with feedback.
# Assumption: the output path arrives as the first argument; adapt this to
# however your hook actually receives task details.

OUTPUT_FILE="${1:-}"   # e.g. outputs/02-articles-index-card-grid-spec.md

fail() { echo "$1" >&2; exit 2; }

[ -n "$OUTPUT_FILE" ] || fail "Usage: $0 <owned-output-file>"
[ -f "$OUTPUT_FILE" ] || fail "Missing owned output file: $OUTPUT_FILE"

# One owner, one file: nothing else in outputs/ may have changed in this task.
# -F matches the file name literally, so dots in it are not regex wildcards.
CHANGED=$(git status --porcelain outputs/ | grep -vF -- "$(basename "$OUTPUT_FILE")" || true)
[ -z "$CHANGED" ] || fail "Worker touched files outside its ownership boundary: $CHANGED"

# Token discipline: no raw hex color values in a spec. Named colors are not
# checked here; that would need an explicit list of names to reject.
if grep -nE '#[0-9a-fA-F]{3,8}\b' "$OUTPUT_FILE" >&2; then
  fail "Raw color values found; use DESIGN.md token names"
fi

# Required sections: states, responsive behavior, and an explicit do-not list.
# This is a loose text match anywhere in the file, not a heading check.
for section in "States" "Responsive" "Do not"; do
  grep -qi -- "$section" "$OUTPUT_FILE" || fail "Spec is missing a required section: $section"
done

exit 0
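
The detection half of the split described above, noticing that two outputs both define the same signal, can be sketched separately from the gate. The function name is hypothetical, and two assumptions are doing real work: specs live as .md files in one directory, and a fixed phrase such as "Last reviewed" is a good enough proxy for "defines this signal". It only lists candidates and decides nothing.

```shell
#!/usr/bin/env bash
# Detection only: list the outputs that mention a signal, and flag the signal
# for merge review when more than one output mentions it.
# Assumption: a literal phrase match stands in for "defines this signal".

competing_signal() {
  local phrase="$1" dir="$2"
  local files count

  # -l prints matching file names only; -i ignores case; -F matches literally.
  files=$(grep -liF -- "$phrase" "$dir"/*.md 2>/dev/null || true)
  # Count non-empty lines, so an empty result counts as zero files.
  count=$(printf '%s\n' "$files" | grep -c . || true)

  if [ "$count" -gt 1 ]; then
    echo "REVIEW: \"$phrase\" appears in $count outputs:"
    printf '%s\n' "$files"
    return 1
  fi
  return 0
}
```

A call such as `competing_signal "Last reviewed" outputs` would return non-zero when several specs mention the phrase. A match means "a human should compare these", never "these conflict".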

Section 13

Good vs bad orchestration

The failure modes of orchestration are boring, predictable, and almost all visible in this one small run or in the documentation of the tools that support it. The lead does the work itself instead of delegating — the documented agent-teams failure, and the thing this run's orchestrator caught itself doing. Briefs without exclusions, so workers wander into each other's surfaces the way the hero worker did. Two workers given write access to the same file or the same component definition, which turns the merge from a review into a rescue. Overlaps left implicit instead of named, so the merge is surprised by conflicts it should have been waiting for. Merge by concatenation, where everything is accepted because nothing is individually wrong. Teams left running after the work is done, which on a hosted team surface is not just untidy but billed. And the quiet one: no single-agent baseline or prior expectation, so nobody can say afterward whether the orchestration bought anything.

Good orchestration is the same list inverted, and it is mostly writing. The orchestrator's output is a contract, a set of briefs, gate criteria, and a merge log; if it is producing design content outside those, the decomposition is being bypassed. Overlaps are named in advance so conflicts arrive as expected comparisons rather than surprises. Ownership is one sentence per worker and one owned output per worker, with worktrees underneath once outputs become edits. The QA role is read-only and independent. The merge resolves against the design system and the user job, records what it rejected, and leaves unowned gaps as open questions. And the team is sized like the documentation says and experience confirms: a few focused workers over many scattered ones, scaled to the number of genuinely separable surfaces rather than to enthusiasm.

Table: Good vs bad orchestration

1. Bad: the lead drafts surface specs itself while workers run
   Good: the lead's only outputs are the contract, briefs, gates, and the merge log

2. Bad: briefs describe the task but not the exclusions
   Good: every brief states what the worker must not touch, in writing

3. Bad: two workers can write to the same file or component definition
   Good: one owned output per worker; worktrees once outputs become edits

4. Bad: overlaps left implicit and discovered at merge
   Good: known overlaps named in the orchestration brief so the merge expects them

5. Bad: merge by concatenation, accepting everything that is not wrong
   Good: resolve against the design system and user job; record rejected options

6. Bad: the team keeps running after the work is done
   Good: shut the team down; idle workers keep consuming tokens

7. Bad: no baseline, so the benefit is asserted
   Good: a baseline run or a prior estimate, so the complexity tax is measured against something

Most orchestration failures are boundary and review failures, not tool failures.
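
Row 3 of the table is mechanical enough to check before any worker starts. The sketch below assumes ownership is written down as one "worker path" pair per line in a plain-text map. That file format and both function names are hypothetical, not something the run prescribes, and paths containing spaces would need a different separator.

```shell
#!/usr/bin/env bash
# Pre-flight check for "one owned output per worker": no path may have two owners.
# Assumption: the map holds one whitespace-separated "worker path" pair per line.

shared_outputs() {
  # Drop repeated identical pairs first, then print each path that is still
  # claimed more than once, meaning by more than one worker.
  awk 'NF >= 2 { print $1, $2 }' "$1" | sort -u | awk '{ print $2 }' | sort | uniq -d
}

check_ownership() {
  local dupes
  dupes=$(shared_outputs "$1")
  if [ -n "$dupes" ]; then
    echo "More than one owner for: $dupes" >&2
    return 1
  fi
}
```

Run against the map before the briefs go out, a non-zero exit from `check_ownership` means a boundary needs redrawing before the merge has to rescue it.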

Section 14

Limits, risks, and when to fall back to one agent mid-run

Some of this article's subject matter is experimental and labeled that way for a reason. Agent Teams are opt-in and changing; details quoted here were verified against the documentation at the start of June 2026 and should be re-checked before you build a workflow that depends on them. Codex's parallel surfaces were located through search rather than exercised hands-on for this run, OpenCode's team layer is community work rather than product, and OpenPencil's concurrent canvas could not be executed in this environment at all. The durable layer is the one that does not depend on any of them: briefs, constraints, owned outputs, gates, and a merge log are plain files and survive every product change.

Some work simply does not orchestrate. Taste loops — the polish pass where spacing, copy tone, and hierarchy are being tuned against each other — get worse when split, because every decision depends on the decisions around it; that is the strongest published counter-position to multi-agent work and it is consistent with what this run saw at the seams. Small tasks do not earn the setup bill; this run's twenty-two minutes of contract-writing would have been most of the total budget for a single-surface change. Debugging is genuinely harder across distributed traces: when the merged plan has a problem, the cause may live in any worker's pass, in the brief that shaped it, or in the constraint nobody wrote, and you will read more text finding it than you would have in one session's history. And budgets are real — token use grows with the size of the team, and the meter runs whether or not the split is helping.

Falling back is not failure, and it is easier than it feels mid-run. If workers keep colliding, if the same brief needs rewriting twice, or if the merge review starts to look like a rescue, the artifacts you have already produced are exactly what a single agent needs to finish the job well: hand one agent the orchestration brief, the shared constraints, the surviving outputs, and the merge log so far, and let it complete the work in one context. The escalation path runs in both directions, and the files are the thing that makes both directions cheap. Anti-patterns to watch for beyond that: orchestrating because the tooling is exciting rather than because the task has boundaries; scaling the worker count to the org chart instead of to the surfaces; treating worker output as finished design because it arrived in parallel; and skipping the baseline so the orchestration can never be found wanting.

  • Re-verify experimental surfaces before depending on them; the file-based artifacts are the part that does not churn.
  • Do not split taste loops, polish passes, or anything where every decision depends on its neighbors.
  • Expect debugging across traces to cost more than debugging one session's history.
  • Watch the budget: token use scales with team size, and idle workers keep consuming until shut down.
  • Fall back to one agent by handing it the briefs, constraints, outputs, and merge log produced so far.
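
The fallback hand-off described above is mostly file gathering, and can be sketched as a single function. The file name `orchestration-brief.md` and the function name are hypothetical; `shared-constraints.md`, `outputs/`, and `merge-log.md` follow the names used elsewhere in this article. Files that do not exist yet are skipped, not treated as errors, since a mid-run fallback may not have a merge log.

```shell
#!/usr/bin/env bash
# Fallback sketch: gather the run's files into one document for a single agent.
# Assumption: the run folder uses the file names below; missing files are skipped.

build_handoff() {
  local run_dir="$1" f
  for f in "$run_dir/orchestration-brief.md" \
           "$run_dir/shared-constraints.md" \
           "$run_dir"/outputs/*.md \
           "$run_dir/merge-log.md"; do
    [ -f "$f" ] || continue
    # Label each section with its path relative to the run folder.
    printf '===== %s =====\n' "${f#"$run_dir"/}"
    cat "$f"
    printf '\n'
  done
}
```

Redirecting the output to a file, for example `build_handoff my-run > handoff.md` with a hypothetical `my-run` folder, gives the single agent one document to read in the order the orchestrator wrote it.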

Section 15

A reusable orchestration kit

Everything this article executed reduces to a kit you can reuse the same day, on any platform that can read files. One folder per run. Inside it: an orchestration brief that states the user job, the deliverable, one-sentence ownership boundaries, the named overlaps, and the merge gates; a shared constraints file every brief points at; one brief and one owned output file per worker, with a read-only QA worker once you pass three or four writers; a merge log with sections for accepted items, conflicts with decisions and rejected options, gaps at the seams, and open questions; and run notes recording wall-clock per phase, iterations, what was caught at which gate, and what went wrong. The case study's artifacts are the worked example of every one of these, and the prompt below is the merge review condensed into something you can paste at the end of any run, including a run where the workers were people.
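
The folder layout described above can be scaffolded in a few lines. Treat this as one possible arrangement, not the run's actual layout: apart from `shared-constraints.md`, `outputs/`, and `merge-log.md`, every file and folder name here is hypothetical, as is the function name. Worker names are passed as arguments.

```shell
#!/usr/bin/env bash
# Scaffold sketch for one run folder.
# Assumption: briefs/, orchestration-brief.md, and run-notes.md are illustrative names.

scaffold_run() {
  local run_dir="$1"; shift
  mkdir -p "$run_dir/briefs" "$run_dir/outputs"

  # Shared files every brief points at, plus the run notes.
  touch "$run_dir/orchestration-brief.md" "$run_dir/shared-constraints.md" \
        "$run_dir/run-notes.md"

  # One brief and one owned output file per worker name given.
  local worker
  for worker in "$@"; do
    touch "$run_dir/briefs/$worker.md" "$run_dir/outputs/$worker.md"
  done

  # Merge log skeleton with the sections the kit calls for. Written only when
  # absent, so re-running never clobbers a log in progress.
  [ -e "$run_dir/merge-log.md" ] || printf '%s\n\n' \
    "## Accepted items" \
    "## Conflicts (decision and rejected options)" \
    "## Gaps at the seams" \
    "## Open questions" > "$run_dir/merge-log.md"
}
```

The empty merge log is created up front on purpose: its four headings are a reminder of what the review has to produce before the run counts as finished.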

The runbook order matters more than any individual artifact: write the contract before the briefs, the briefs before any worker runs, and the gates before any output exists; let the workers run without interference; review at the gate, not over the worker's shoulder; classify conflicts before resolving them; resolve against the system, record what you rejected, log what nobody owned; and shut the team down when the merge log is written. Then keep the briefs and the log and delete the rest — the durable value of an orchestrated run is the decisions and the structure, not the scratch output. That is also the honest summary of the whole exercise: orchestration is a writing discipline with scheduling attached, and the merge log is the design review you were going to need anyway, finally written down.

Merge-review prompt (paste into the orchestrating session after all outputs arrive)
Run the merge review for this orchestrated design run. Inputs: the orchestration
brief, shared-constraints.md, DESIGN.md, every file in outputs/, and the QA
findings file if present.

1. Ownership audit: for each output, confirm the worker stayed inside its stated
   boundary. List every place a worker specified another worker's surface.
2. Conflict classification: list every disagreement between outputs and classify
   it — token-semantics conflict, seam or alignment conflict, duplicated or
   competing signal, or unowned gap (a decision no brief assigned).
3. Resolution: resolve each conflict against DESIGN.md and the user job stated in
   the orchestration brief — never against which worker wrote more or answered
   last. State the decision and the reason.
4. Rejected options: record every rejected proposal with the reason it lost.
5. Gaps: list decisions nobody owned. Do not invent answers; log them as open
   questions for human review.
6. Output: write merge-log.md with sections for accepted items, conflicts and
   decisions, rejected options, gaps, and open questions. Do not edit any worker
   output and do not implement anything until the merged plan is approved.
