Section 1
You decided to split. Now you have a management job
This article assumes a decision has already been made: the design task in front of you is genuinely separable, valuable enough to pay for coordination, and you are going to run it across several agents. If that decision is still open, read the companion piece on when to use multiple design agents first — it carries the decision map, the separability test, and a deliberately small two-worker trace that shows what the artifacts look like at the smallest honest scale. Staying with one agent is the default there, and nothing in this article changes that. What changes here is the question. Not whether to split, but how to run the team once you have.
Running the team is a management job, and the manager is you. The orchestrator role — whether you play it yourself in one session or delegate it to a lead agent — is closer to a design lead than to a prompt writer. It decides the decomposition, writes the contract every worker reads, sets the gates that work must pass, and runs the merge review where the separate outputs become one coherent design again. The tools have caught up with this framing: Claude Code now ships an experimental agent-teams surface with a shared task list, messaging, and plan approval; subagents and git worktrees give you bounded workers and isolated checkouts; Codex builds parallelism into its CLI, cloud tasks, and desktop app; OpenPencil runs concurrent agents over regions of a shared canvas. None of them remove the two jobs that decide the outcome: choosing boundaries that are real, and treating the merge as the actual design review.
To keep this practical, the article is built around a run executed for it: a five-surface redesign-spec exercise on this site's own subject matter — home hero band, articles index card grid, article page header, newsletter signup band, plus a read-only QA worker — orchestrated on scratch files, with a deliberate overlap that forced a real conflict, a merge log that resolved it, and a single-agent baseline for comparison. One honesty note up front, repeated where it matters: the worker passes in that run executed sequentially, not as parallel processes, because the drafting environment could not spawn parallel workers. The artifacts, conflicts, and merge decisions are real and quoted; the wall-clock benefit of parallelism is estimated, and labeled as such wherever it appears.
Section 2
Three ways to cut the work: spatial, functional, pipeline
Decomposition is a design decision about boundaries, and there are only three shapes worth teaching. Spatial decomposition splits by place: pages, bands, dashboard panels, canvas regions. Each worker owns a surface, which means workers rarely touch the same files and parallelism is high — but the merged result has seams, and the seams are where it fails. Spacing rhythm drifts between sections, and two surfaces express the same signal two different ways, which is exactly the conflict class the case study below produced. Functional decomposition splits by skill: one worker owns structure, one owns type and copy, one owns accessibility, one runs QA. It maps nicely onto how design teams already think about roles, but every worker touches every surface a little, so the merge is the hardest of the three — style consistency is the casualty, and two domains will eventually want to edit the same component for different reasons. It works best when most of the functional workers are read-only reviewers rather than writers.
Pipeline decomposition splits by stage: a spec pass, a visual pass, an implementation pass, a QA pass, each handing an artifact to the next. It barely parallelizes, because stages wait on each other, but it merges almost for free — each stage's output is the next stage's input, so there is little to reconcile. What it buys is not speed but focus and gates: every handoff is a natural review point. The honest summary is that spatial buys parallelism and pays at the seams, functional buys specialization and pays in consistency, and pipeline buys checkpoints and pays in wall-clock time.
Mixing shapes is normal and usually right. The case study in this article is spatial for its four surface workers and functional for the fifth — a read-only QA worker layered across all of them. Whatever shape you pick, the per-worker rules do not change: one owner, one output file, an explicit list of exclusions, and an acceptance test the orchestrator can check without re-doing the work. If you cannot state the boundary of a slice in one sentence, the decomposition is not done; that test comes from the decision article and it applies with more force here, because at four or five workers a fuzzy boundary multiplies into four or five collisions instead of one.
Choose the shape of the split by where the boundaries actually are: places, skills, or stages.
Section 3
Agent teams, in depth: what the lead actually coordinates (verified June 2026)
Claude Code's Agent Teams are the most complete documented version of the orchestrator pattern as a product surface, and they are explicitly experimental: disabled by default, switched on with an environment flag, and subject to change. A team is a lead session plus independent teammate sessions, and the coordination runs through two mechanisms you would otherwise build by hand. The first is a shared task list — tasks have pending, in-progress, and completed states, can declare dependencies on each other, and are claimed with file-locked claiming so two teammates do not race for the same task. The second is a mailbox: messages are delivered automatically between sessions, teammates can write to each other by name rather than routing everything through the lead, and the lead gets notified when a teammate goes idle. Team and task state live in the user's home directory, not in the project, which is a deliberate signal that a team is a working arrangement rather than a project artifact.
Two more mechanics matter for design work. Teammate roles can reuse subagent definitions, so a visual-QA worker you already defined for single-session use can become a teammate with the same tool restrictions and model — though skills and MCP server frontmatter are not carried over, which matters if your design worker depends on a canvas MCP server. And the lead can require plan approval: a teammate works in a read-only planning mode until the lead approves its plan, and the lead approves or rejects autonomously against criteria you give it. That is an orchestration-time review gate — the equivalent of approving a worker's brief interpretation before any output exists. The documentation also names the failure mode every practitioner write-up confirms: the lead starts implementing tasks itself instead of waiting for its teammates, and you have to tell it not to. Practitioner guides describe a delegate mode toggle that restricts the lead to coordination only; that control is reported by practitioners rather than something verified hands-on for this article, so treat it as a thing to check in your own session rather than a documented guarantee.
The same documentation is refreshingly concrete about size and limits. The guidance is three to five teammates with five or six tasks each, and the phrase worth keeping is that three focused teammates often outperform five scattered ones. The limitations list is long enough to plan around: in-process teammates cannot be resumed after the lead session ends, task status can lag and block dependent tasks, there is one team per lead, no nested teams, the lead cannot be transferred, teammates inherit the lead's permissions at spawn time, and split-pane display needs tmux or iTerm2. None of these are reasons to avoid the surface; they are reasons to treat a team as a short-lived working session you set up, run, and shut down — not an environment you leave running. For designers, the practical reading is that the product now does the bookkeeping (tasks, messages, approvals), and everything that makes the output good still lives in the files you give it: the shared constraints, the briefs, and the merge protocol the rest of this article walks through.
- Agent Teams are experimental and off by default; everything here is as documented in June 2026 and can change.
- Coordination = a dependency-aware shared task list with file-locked claiming, plus a mailbox with teammate-to-teammate messaging and idle notifications.
- Plan approval lets the lead hold teammates in read-only planning until their plan passes criteria you set.
- Teammate roles can reuse subagent definitions (tools and model carry over; skills and MCP server settings do not).
- Sizing guidance: 3–5 teammates, 5–6 tasks each; documented failure mode: the lead starts doing the work itself.
- Team and task state live under the home directory, not the repo — a team is a session, not a project artifact.
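The shared task list described in this section can be modeled in a few lines, which helps when reasoning about why a dependent task sits blocked. This is a toy, in-memory sketch and not Claude Code's implementation: the `Task` and `TaskBoard` names are hypothetical, and the real surface coordinates claims across separate sessions with file locks, which this sketch does not attempt.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Task:
    name: str
    depends_on: List[str] = field(default_factory=list)
    status: str = "pending"  # pending -> in_progress -> completed
    owner: Optional[str] = None


class TaskBoard:
    """Toy in-memory task list (hypothetical; not the product's data model)."""

    def __init__(self, tasks: List[Task]) -> None:
        self.tasks: Dict[str, Task] = {task.name: task for task in tasks}

    def is_claimable(self, name: str) -> bool:
        # A task can be claimed only while pending and once every dependency is done.
        task = self.tasks[name]
        deps_done = all(self.tasks[dep].status == "completed" for dep in task.depends_on)
        return task.status == "pending" and deps_done

    def claim(self, name: str, teammate: str) -> bool:
        """Claim a task for one teammate; False means blocked or already taken."""
        if not self.is_claimable(name):
            return False
        task = self.tasks[name]
        task.status = "in_progress"
        task.owner = teammate
        return True

    def complete(self, name: str, teammate: str) -> None:
        # Only the teammate holding the task may close it.
        task = self.tasks[name]
        if task.status != "in_progress" or task.owner != teammate:
            raise ValueError(f"{teammate} does not hold {name}")
        task.status = "completed"


board = TaskBoard([
    Task("card-grid-spec"),
    Task("qa-pass", depends_on=["card-grid-spec"]),
])
print(board.claim("qa-pass", "qa"))               # False: dependency still pending
print(board.claim("card-grid-spec", "worker-2"))  # True
print(board.claim("card-grid-spec", "worker-3"))  # False: already claimed
board.complete("card-grid-spec", "worker-2")
print(board.claim("qa-pass", "qa"))               # True: dependency completed
```

The model also shows why lagging task status matters: until `complete` is recorded, every task that depends on it stays unclaimable, however finished the work actually is.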
Section 4
Workers, worktrees, and what each agent actually sees
Underneath any orchestration surface sit two questions you have to answer per worker: what context does it see, and where do its changes land. On the first, Claude Code subagents are precise in ways that matter for design work. Each subagent runs in its own context window with its own system prompt, tool restrictions, and permissions, and it returns only a summary to the caller — the orchestrating session never absorbs the worker's full transcript. The built-in Explore and Plan subagents skip the project's CLAUDE.md and the parent's git status to stay cheap, while other built-in and custom subagents load them; that distinction decides whether your design rules are actually in front of a given worker or merely near it. Subagents cannot spawn subagents, so the hierarchy stays one level deep, and they can run in the foreground or the background. The orchestration design consequence is simple: nothing a worker needs can be assumed ambient. If the worker must respect DESIGN.md, the brief says so and the worker type loads it.
On the second question, isolation on disk is what keeps parallel workers from fighting over files. Git worktrees are the standard answer: a separate checkout on its own branch, so a worker's edits never touch the checkout another worker is reading. Claude Code can create and enter worktrees itself, supports a hook for relocating them, and its docs explicitly position the three mechanisms against each other — subagents for lightweight delegation inside a session, worktrees for manual parallel sessions, agent teams for coordinated sessions — and document combining them, running a subagent inside a worktree so its changes land in an isolated checkout. Codex points the same direction from the other side of the fence: its CLI supports subagent-style parallel workflows, its cloud tasks run in hosted environments in the background — including a best-of-N shape where several attempts at the same task compete and you keep the best one, which is an alternative to decomposition rather than a form of it — and its desktop app creates a separate git worktree per agent thread automatically. Those Codex details come from its developer documentation as located by search rather than verified hands-on for this article, so treat the specifics as pointers to check rather than quoted behavior; the structural point stands across vendors, because worktree-per-worker has independently become both a product default and the most common hand-rolled pattern in practitioner write-ups.
For spec-level work like the case study below, file isolation can be as simple as one owned output file per worker — no worktrees needed, because nobody edits shared files at all. The escalation path is worth knowing in advance though: specs become component edits, component edits become parallel branches, and at that point ownership boundaries in briefs need their mechanical twin on disk. Briefs decide who is allowed to change what; worktrees make sure that even a worker that ignores its brief cannot quietly overwrite someone else's work.
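When the escalation to parallel branches does happen, worktree-per-worker is a small amount of scripting. The sketch below builds one `git worktree add` command per worker, each with its own branch and its own directory next to the main checkout. The `agents/<worker>` branch naming and the sibling-directory layout are assumptions made for illustration, not a convention taken from any of the tools above, and the commands only run when `dry_run` is turned off.

```python
import subprocess
from pathlib import Path
from typing import List


def worktree_command(repo: Path, worker: str) -> List[str]:
    """Build a `git worktree add` call: one new branch and one checkout per worker."""
    branch = f"agents/{worker}"                      # hypothetical naming scheme
    checkout = repo.parent / f"{repo.name}-{worker}"  # sibling directory, assumed layout
    return ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(checkout)]


def create_worktrees(repo: Path, workers: List[str], dry_run: bool = True) -> List[List[str]]:
    """Return the commands; execute them only when dry_run is False."""
    commands = [worktree_command(repo, worker) for worker in workers]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)  # raises if git reports an error
    return commands


for command in create_worktrees(Path("/work/site"), ["hero", "card-grid"]):
    print(" ".join(command))
```

Keeping the command builder separate from execution makes the plan reviewable before anything touches disk, which mirrors the brief-then-work order the rest of the run follows.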
Section 5
OpenCode and OpenPencil: orchestration as config, and the concurrent canvas
OpenCode's official surface treats agents as project configuration: primary agents and subagents defined in opencode.json or per-file under .opencode/agents/, each with its own permissions for editing, shell access, and — through task permission globs — which other agents it is allowed to invoke. That is enough to express a permissioned orchestrator: a lead agent that can call the design workers you list and nothing else, review workers that cannot edit at all, implementation workers that must ask. What OpenCode does not currently ship is a coordinated team surface comparable to the one described above; that layer is community territory, visible in an open feature request for conductor-style teams in isolated workspaces, a written port of the agent-teams idea onto OpenCode, and plugin projects that wire an orchestrator persona to specialist agents on different models. Use those as evidence that the pattern transfers, not as features you can rely on — they are fast-moving, mostly single-maintainer projects, and this article cites them as examples rather than recommendations.
OpenPencil is the design-native expression of the same idea and the clearest example of spatial decomposition as a product behavior. Its concurrent agent teams have an orchestrator decompose a page into spatial sub-tasks, with multiple agents streaming output into different regions of the same vector canvas and per-member indicators showing who is working where. Its file format is the more interesting half for this article: the .op format is JSON designed for concurrent reads and writes, and the app ships a git panel with three-way merge and a conflict UI for design files — which makes it the only tool covered here whose design artifact assumes multiple writers rather than tolerating them. An MCP server and a CLI let external agents act as the team. Two cautions belong next to that description: it is reported from the repository as it stood at the start of June 2026, when the project's main branch had been quiet for about a month with a new version in progress, so re-check before you depend on it; and the run in this article did not execute inside OpenPencil — the case study below is plain files and plain agents, not a canvas session.
Step back from the individual tools and the operating model is the same everywhere: a coordinating context, bounded workers with their own context, a shared contract they all read, file-level ownership, and a merge step nobody automates away. The differences are where the plan lives — a conversation, a task list, a config file, a canvas orchestrator — and how workers communicate, from report-back-only to mailboxes to the file system. That is good news for designers, because it means the artifacts this article spends the rest of its time on are portable. Briefs, constraints, owned outputs, and merge logs work on every one of these platforms, and they are the part you keep when the experimental surfaces change.
Section 6
Case study: a five-surface redesign run, actually orchestrated
The run executed for this article is a redesign-spec exercise on this site's own surfaces: the home hero band, the articles index card grid, the article page header and meta block, and the newsletter signup band, plus a fifth read-only QA worker. The deliverables were specifications, not code: five Markdown files an implementation pass could follow, produced entirely in a scratch folder that was deleted after the excerpts were captured. The decomposition is spatial, the surfaces are genuinely separable, and one overlap was left in deliberately: every content surface needs a position on how the site's last-reviewed signal is presented and which accent color it uses. That overlap exists to force the conflict class that spatial splits produce in real work, at a scale where several workers can collide on the same question rather than just two.
Execution, honestly described: one orchestrating session wrote the orchestration brief, the shared constraints, and the five worker briefs as real files. The five worker outputs were then produced as sequential bounded passes — each pass given only its own brief, the shared constraints, and DESIGN.md — rather than as parallel subagent processes, because this drafting environment could not spawn parallel workers. The orchestrating session then ran the merge review and wrote the merge log, and finally ran the same five-surface brief once more as a single-agent baseline for the cost comparison. Everything quoted in the next few sections is from those files, unedited apart from trimming. The wall-clock total for the orchestrated run was about 77 minutes as executed; the parallel figure quoted later is an estimate of what the worker phase would compress to, not a measurement.
The orchestration brief is the parent contract, and at team scale it earns its length. It states the user job once so no worker reinvents it, points everyone at the same constraints file, gives each worker a one-sentence ownership boundary, names the overlap the merge should expect to resolve, and writes the merge gates down before any work exists — because gates invented after the outputs arrive have a way of bending around whatever came back. This is the actual brief from the run.
# Field-Guide Refresh Specs — Orchestration Brief

User job: a designer learning agentic workflows lands on the site, decides quickly whether it is current and trustworthy, finds an article worth reading, and can tell on every surface when the material was last reviewed.

Deliverable: five Markdown artifacts a later implementation pass can follow without re-deciding structure — four redesign specs (home hero band, articles index card grid, article page header and meta block, newsletter signup band) and one read-only QA findings file. No production code is written or edited in this run.

Shared constraints: every worker reads shared-constraints.md and DESIGN.md before proposing anything.

Ownership boundaries (one sentence each):
- Worker 1 owns the home hero band: headline, supporting copy, primary and secondary CTAs, and the proof strip under them.
- Worker 2 owns the articles index card grid: the repeated article card's fields, hierarchy, states, and the grid's responsive behavior.
- Worker 3 owns the article page header and meta block: title, summary, topics, tools, and the last-reviewed treatment on the article page itself.
- Worker 4 owns the newsletter signup band: tone, copy structure, form layout, and the expectation-setting line.
- Worker 5 (QA) owns nothing; it reads the other four outputs and DESIGN.md and reports findings only.

Known overlap (deliberate): every content surface needs a position on how the "Last reviewed" signal is presented and which accent color it uses. Workers may each propose one; the orchestrator resolves the collision at merge.

Merge gates:
1. Read every output in full before accepting anything.
2. List agreements, conflicts, and gaps in merge-log.md.
3. Resolve conflicts against DESIGN.md and the user job, not against whichever worker wrote the most.
4. Record rejected options and the reason.
5. Run the QA findings against the merged plan; log unowned gaps as open questions rather than inventing answers during the merge.
6. No worker output is implemented until the merged plan is approved by a human.
The executed run as lanes: the orchestrator's contract and merge review, four worker passes, the read-only QA pass, and the gates between them.
Section 8
Worker briefs at team scale: small, bounded, and checkable
A worker brief at team scale looks almost identical to a worker brief at two-worker scale, and that is the point — the anatomy does not grow with the team, only the count does. Scope with explicit exclusions, the inputs the worker is allowed to read, one task, one output file, and acceptance criteria the orchestrator can check in a minute or two without redoing the work. What changes at five workers is how much weight the exclusions carry. With two workers, a boundary violation is an awkward overlap; with five, a single worker that wanders re-decides things three other workers own, and the merge review inherits a tangle. The card-grid brief below is typical of the run: most of its text is about what not to do.
Notice what the brief does not contain: design direction. It does not tell the worker what the card should look like, which fields matter most, or how the freshness signal should read. The taste lives in DESIGN.md and the shared constraints; the brief carries the boundary, the deliverable, and the acceptance test. That separation is what keeps briefs cheap — the five briefs in this run took roughly a quarter hour of the setup time between them — and cheap briefs matter because the alternative is skipping them, and skipped briefs are how you end up with five eager workers and no decomposition.
The QA worker's brief deserves a special mention because its boundary is behavioral rather than spatial: it owns nothing, reads everything, and is forbidden from proposing fixes. Findings only, each with a severity, the constraint it violates, and the evidence line. That restriction is not modesty — it keeps review independent of authorship, exactly like a design crit where the reviewer does not grab the pen. In this run the QA pass caught both conflicts the orchestrator would otherwise have had to find by close reading, plus a contrast check and the gaps no brief had assigned.
# Worker 2: Articles Index Card Grid Spec

Scope: the articles index card grid only — the repeated article card and the grid that holds it. Do not specify the index page hero, the article detail page, or the newsletter band. Do not write code.

Inputs: DESIGN.md, ../shared-constraints.md, ../orchestration-brief.md (your boundary and the known overlap only).

Task: specify the article card (fields, order, emphasis, last-reviewed signal, long-title and missing-summary behavior) and the grid (columns per breakpoint, ordering, what happens with very few or very many articles).

Output: outputs/02-articles-index-card-grid-spec.md with structure, hierarchy, states, responsive behavior, and a do-not list.

Acceptance criteria:
- Card builds on AccentCard and ArticleCard from DESIGN.md; badges use SchoolBadge.
- No new colors; no nested cards; radius within the 8px ceiling.
- Long-title and missing-summary behavior is explicit.
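The brief anatomy described in this section (scope, exclusions, inputs, one output file, acceptance criteria) is regular enough to hold as data and lint before any worker starts. The sketch below is one way to do that under stated assumptions: `WorkerBrief` and `brief_problems` are hypothetical names, and the three checks (exclusions present, acceptance criteria present, no two workers sharing an output file) are a minimal reading of the per-worker rules, not a complete validator.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class WorkerBrief:
    worker: str
    scope: str
    exclusions: Tuple[str, ...]
    inputs: Tuple[str, ...]
    output: str
    acceptance: Tuple[str, ...]


def brief_problems(briefs: List[WorkerBrief]) -> List[str]:
    """Flag briefs that skip exclusions or acceptance criteria, or share an output file."""
    problems: List[str] = []
    output_owner: Dict[str, str] = {}
    for brief in briefs:
        if not brief.exclusions:
            problems.append(f"{brief.worker}: no explicit exclusions")
        if not brief.acceptance:
            problems.append(f"{brief.worker}: no acceptance criteria")
        # The first brief to name an output file owns it; later claims are collisions.
        owner = output_owner.setdefault(brief.output, brief.worker)
        if owner != brief.worker:
            problems.append(f"{brief.worker}: output {brief.output} already owned by {owner}")
    return problems


card_grid = WorkerBrief(
    worker="worker-2",
    scope="the articles index card grid only",
    exclusions=("index page hero", "article detail page", "newsletter band", "code"),
    inputs=("DESIGN.md", "../shared-constraints.md", "../orchestration-brief.md"),
    output="outputs/02-articles-index-card-grid-spec.md",
    acceptance=("builds on AccentCard and ArticleCard", "no new colors", "long-title behavior explicit"),
)
print(brief_problems([card_grid]))  # []
```

A check like this costs seconds and catches the cheapest decomposition mistake, two workers pointed at the same file, before it becomes a merge problem.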
Section 9
What came back, and where five workers collided
The four surface specs were structurally good on first pass — the briefs and constraints did their job for hierarchy, states, and component reuse — and the failures were exactly the kind a team produces rather than the kind a bad worker produces. The hero worker delivered a clean band structure and a sensible proof panel, then crossed its boundary: it specified a strip of three recently reviewed article cards including the card's internal anatomy, which belongs to the card-grid worker. The card-grid worker and the article-header worker each answered the deliberately overlapped question differently: the card spec marked recently reviewed articles with a lab-green pill, while the header spec used a school-yellow badge inside a 90-day window. Both are defensible alone; together they ship one signal with two meanings on adjacent surfaces, and the green version quietly violates the design system's color semantics. The newsletter worker produced the run's only revision — its first pass invented an off-token panel tint despite the constraint saying not to, and was sent back once.
The QA worker earned its place. Its findings file flagged both collisions as P1s with the evidence lines quoted, confirmed the newsletter revision had resolved the contrast question it would otherwise have raised, added a small consistency note about meta-row ordering, and — most usefully — named the decisions nobody owned: what the last-reviewed line says when an article has never been re-reviewed, and who owns the vertical rhythm where the hero band meets the newsletter band on the home page. Those gaps are not worker errors; they are decomposition errors, seams the orchestrator failed to assign. At two workers you can usually spot the seam yourself. At five, a read-only reviewer that does nothing but look for them is the cheapest insurance in the run.
One more failure is worth keeping because it is the canonical one: the orchestrating session caught itself starting to draft the hero spec before the worker pass had run, and had to discard that text. This is the same failure mode the agent-teams documentation warns about for lead agents — the lead starts implementing instead of delegating — and it is just as real when the lead is you. The discipline that fixes it is structural, not motivational: the orchestrator's deliverables are the contract and the merge log, and any design content it produces outside those is a sign the decomposition is being bypassed.
- Worker 1 (hero): accepted for structure; crossed its boundary by re-specifying card anatomy — caught by QA as a P1.
- Worker 2 (card grid): proposed a lab-green reviewed pill — token-semantics conflict with the design system and with Worker 3.
- Worker 3 (article header): proposed a school-yellow badge inside a 90-day window — the proposal that survived the merge.
- Worker 4 (newsletter): one revision after introducing an off-token tint; second pass accepted.
- Worker 5 (QA): two P1s, one P2, two P3s, plus the two unowned gaps — findings only, no fixes.
- Orchestrator: briefly started doing a worker's job itself; the text was discarded and the pass re-run from the brief.
Section 10
The merge is the real design review
Everything before the merge is preparation; the merge is where the design decisions actually get made. At team scale the merge review has a shape worth following in order. First an ownership audit: did every worker stay inside its boundary, and did anything arrive that two workers both claim? Second, conflict classification, because conflicts at this scale fall into recognizable classes — token-semantics conflicts, where two surfaces use the system's colors or type roles to mean different things; seam conflicts, where spacing, alignment, or band order breaks where two owned regions meet; duplicated-signal conflicts, where the same information is expressed twice in competing ways; and unowned-gap defects, the decisions nobody was assigned and therefore nobody made. Third, resolution against the constitution rather than the contributors: the design system and the user job decide, not whichever worker wrote the most or returned last. Fourth, a written record of what was rejected and why. The smaller companion article carries a general-purpose merge checklist; what changes at five workers is not the checklist but the volume — which is why classification and a written log stop being optional.
In this run the merge log resolved both conflicts in about fourteen minutes. The last-reviewed treatment went to one rule everywhere — a school-yellow badge inside a 90-day window, plain muted text otherwise — because the design system assigns badges to the yellow secondary and reserves green for workflow and proof cues, and because the 90-day window matches the cadence the hero's proof panel claims. The green pill was rejected on token semantics, not on quality. The hero's card strip was kept as a placement decision but stripped of its own card anatomy, which now references the card-grid spec — one owner per component definition. Both rejected options are recorded with reasons, which is what makes the log durable: the next agent or human who wonders why reviewed badges are not green can read the answer instead of relitigating it. The two unowned gaps were logged as open questions for a human, not answered by the orchestrator on the spot, because inventing answers during the merge is how scope quietly re-enters through the back door.
It is worth saying plainly that the merge is the part of the run you cannot delegate away, although you can get help with it. A reviewer agent over the merged plan is a reasonable second pass — that is essentially what the QA worker was — and the next section shows where automation can hold the gate. But choosing between the green pill and the yellow badge is a design decision with a reason attached, and the reason is the deliverable. A merge produced by concatenation, where every worker's output is accepted because none of it is technically wrong, is the most common way multi-agent design work fails while looking like it succeeded.
## Conflict 1 — the "Last reviewed" treatment (02 vs 03, flagged P1 by QA)
- Worker 2: lab-green pill in the card meta row, plain muted text after six months.
- Worker 3: school-yellow SchoolBadge within 90 days, muted text otherwise.
- Decision: one rule everywhere — school-yellow SchoolBadge when the review is within 90 days, plain muted "Last reviewed <Month YYYY>" otherwise. DESIGN.md assigns badges to the yellow secondary and reserves lab green for workflow and proof-layer cues; a per-article freshness mark is a badge, not a proof-layer surface. The 90-day window (03) wins over the six-month window (02) because it matches the review cadence the hero proof panel claims.
- Rejected: the lab-green pill (token semantics; signal would mean different things on adjacent surfaces) and the six-month threshold (inconsistent with the cadence claim).

## Conflict 2 — hero card strip (01, flagged P1 by QA)
- Worker 1 specified a three-card "recently reviewed" strip including card anatomy, which is Worker 2's surface.
- Decision: keep the strip as a placement decision only; the hero spec references "three ArticleCards as specified in 02" and loses its own field list. Card anatomy has exactly one owner.
- Rejected: a second card definition living inside the hero spec.

## Gaps at the seams (no owner — logged, not invented)
- Copy when an article has never been re-reviewed since publication.
- Home-page vertical rhythm and band-tone order between the hero band and the newsletter band.
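The first two steps of the merge review, the ownership audit and the search for contested or unowned decisions, are mechanical enough to sketch in code. The example below assumes someone has already listed which decisions each output defines; that listing is the hard part and is not automated here. `audit_ownership` is a hypothetical helper, and the decision labels are shorthand for this run's surfaces rather than names taken from its files.

```python
from typing import Dict, List, Set, Tuple


def audit_ownership(
    owners: Dict[str, str],        # decision -> worker assigned to it
    defined: Dict[str, Set[str]],  # worker -> decisions its output actually defines
    required: Set[str],            # every decision the merged plan needs
) -> Tuple[List[Tuple[str, str, str]], Dict[str, List[str]], List[str]]:
    crossings: List[Tuple[str, str, str]] = []
    definers: Dict[str, List[str]] = {}
    for worker in sorted(defined):
        for decision in sorted(defined[worker]):
            definers.setdefault(decision, []).append(worker)
            owner = owners.get(decision)
            if owner is not None and owner != worker:
                crossings.append((worker, decision, owner))  # boundary crossed
    # Decisions defined by more than one worker need a recorded resolution.
    contested = {d: ws for d, ws in definers.items() if len(ws) > 1}
    # Required decisions nobody defined are unowned gaps: log them, do not invent answers.
    gaps = sorted(required - set(definers))
    return crossings, contested, gaps


owners = {
    "hero band": "worker-1",
    "card anatomy": "worker-2",
    "article header": "worker-3",
    "newsletter band": "worker-4",
}
defined = {
    "worker-1": {"hero band", "card anatomy"},
    "worker-2": {"card anatomy", "last-reviewed treatment"},
    "worker-3": {"article header", "last-reviewed treatment"},
    "worker-4": {"newsletter band"},
}
required = set(owners) | {
    "last-reviewed treatment",
    "never-re-reviewed copy",
    "hero/newsletter rhythm",
}

crossings, contested, gaps = audit_ownership(owners, defined, required)
print(crossings)
print(contested)
print(gaps)
```

The audit only says where to look. Which contested proposal survives, and why, is still written by hand in the merge log.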
Section 11
The complexity tax, with this run's numbers
The honest way to talk about cost is to run the same work twice, so this run did. The single-agent baseline took the same five-surface brief in one continuous pass: roughly 28 minutes, one iteration, a perfectly usable combined spec, and — because one context held all four surfaces — a single consistent freshness rule decided implicitly along the way. The orchestrated run took about 77 minutes as executed: 22 of setup, 41 of sequential worker passes including one revision, and 14 of merge review. Had the worker passes run in parallel, the middle phase would compress toward the longest single pass and the total would land somewhere near 46 minutes — an estimate, since this environment ran them one after another. Token accounting was not available per pass, so the multiple is also an estimate: roughly three and a half to four times the baseline, driven by every worker re-reading the design system and constraints, plus the orchestration and merge text that simply does not exist in a single-agent run.
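The arithmetic behind those figures is short enough to check directly. The block below uses only the minutes reported above; the one derived number is the worker-phase length that the 46-minute parallel estimate implies, which is computed here and was not separately measured.

```python
# Minutes reported for the run (approximate wall-clock figures).
baseline_min = 28
setup_min, worker_passes_min, merge_min = 22, 41, 14

orchestrated_min = setup_min + worker_passes_min + merge_min
print(orchestrated_min)  # 77

# The parallel total is an estimate; the worker phase it implies is derived here.
parallel_estimate_min = 46
implied_parallel_worker_phase = parallel_estimate_min - setup_min - merge_min
print(implied_parallel_worker_phase)  # 10

# Wall-clock multiples against the single-agent baseline.
print(round(orchestrated_min / baseline_min, 2))       # 2.75
print(round(parallel_estimate_min / baseline_min, 2))  # 1.64
```

Setup and merge together are 36 minutes, already more than the whole baseline, so parallelism alone cannot bring this run's wall-clock time under the single-agent figure.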
What did the extra cost buy? Not output volume — the baseline covered all four surfaces. It bought the things the baseline structurally cannot produce: two conflicts surfaced as comparable proposals and resolved on the record before any implementation, an independent QA pass with findings that were not self-graded, explicit state coverage the baseline thinned out on two surfaces, a written log of rejected options, and a set of briefs that can be reused the next time these surfaces change. The baseline was faster precisely because it never had to defend its decisions; whether that is a saving or a debt depends on how much the surfaces matter and how many people will touch them next.
The external numbers point the same direction at larger scale. Anthropic's engineering write-up of its multi-agent research system reports such systems using around fifteen times the tokens of a single chat interaction — a research-workload figure, quoted here for scale rather than as a design benchmark — and the official cost guidance for agent teams says token use scales roughly with team size, that idle teammates keep consuming tokens until the team is cleaned up, and that the practical economy moves are smaller teams, cheaper models for workers, focused spawn prompts, and shutting things down. Put the run-level and the published numbers together and the rule of thumb for design work is unglamorous: orchestration costs about half again to four times as much as just doing it, pays off in surfaced conflicts, decision records, and reusable structure rather than in speed at small scale, and only pays off in speed when the workers genuinely run in parallel on genuinely separable surfaces.
The same five-surface brief, run twice on scratch artifacts. Wall-clock figures are approximate; token figures are labeled estimates.
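The wall-clock figures above reduce to a few lines of arithmetic, which makes it easier to see which numbers were measured and which were derived. The sketch below uses only the minutes quoted in this section. The variable names are illustrative, and the 46-minute parallel total stays what it was in the text: an estimate, not a measurement.

```shell
#!/usr/bin/env bash
# Wall-clock minutes quoted in the cost section of this article.
baseline=28
setup=22
workers_sequential=41
merge=14
parallel_total_estimate=46   # the article's estimate, not a measurement

executed_total=$((setup + workers_sequential + merge))
# What the 46-minute estimate implies for the compressed worker phase.
implied_parallel_phase=$((parallel_total_estimate - setup - merge))

# Bash arithmetic is integer-only, so awk computes the ratios.
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f", a / b }'; }

executed_multiple=$(ratio "$executed_total" "$baseline")
parallel_multiple=$(ratio "$parallel_total_estimate" "$baseline")

echo "executed total: ${executed_total} min (${executed_multiple}x baseline)"
echo "parallel estimate: ${parallel_total_estimate} min (${parallel_multiple}x baseline)"
echo "implied parallel worker phase: ${implied_parallel_phase} min"
```

Note that even the parallel estimate stays above the baseline on wall-clock, which is consistent with the claim that a run this small does not pay off in speed.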
Section 12
Coordination machinery: gates you can automate, and the ones you should not
Most of the coordination in a run like this is files — the contract, the constraints, the briefs, the owned outputs, the log. The machinery question is which gates can be enforced automatically rather than by the orchestrator remembering to check. Agent teams expose three hooks for exactly this: one fires when a teammate goes idle, one when a task is created, and one when a task is reported complete — and a hook that exits with a blocking code rejects the event and sends feedback to the teammate. For design work the task-completion hook is the interesting one: it can run a script against the worker's output before the task is allowed to close — check that the output file exists and is the only file touched, grep for hex values or color names that bypass the token set, require the do-not section, fail on forbidden component names. That is the run's acceptance criteria turned into a gate that does not depend on anyone's attention. Plan approval is the other automatable gate, sitting earlier: the lead holds each teammate in read-only planning until its plan satisfies criteria you wrote, which is the closest current product equivalent of approving a worker's interpretation of its brief before the work starts.
The sketch below shows the shape of a completion gate. Treat it as a sketch: the agent-teams surface is experimental, hook details can change, and the same check is just as useful run by hand or in CI against a pull request from a worktree. The point is the placement, not the platform — the cheap, mechanical parts of the merge review (token discipline, file ownership, required sections) move to the gate, so the human part of the merge review can spend its attention on the conflicts and the seams.
What should not be automated is the resolution itself. A hook can detect that two outputs both define a freshness treatment; it cannot decide which one the design system actually supports, and it certainly cannot decide that neither is right. The same goes for plan approval criteria that amount to taste. Automate detection and discipline; keep judgment in the merge log, with a name attached.
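The detection half of that split can be mechanical, and it is worth seeing separately from the completion gate. What follows is a minimal sketch under stated assumptions: the topic string, the file names, and the demo contents are all made up for illustration, and the check only reports which outputs mention a named overlap. It decides nothing.

```shell
#!/usr/bin/env bash
# Detection only: list every worker output that mentions a named overlap topic.
# The topic string and the directory layout are assumptions for illustration.
detect_overlap() {
  local topic="$1" dir="$2"
  # -l prints file names only; -i ignores case; -F treats the topic literally.
  grep -liF -- "$topic" "$dir"/*.md 2>/dev/null | sort
}

# Demo on throwaway files with made-up content, so the sketch runs anywhere.
demo_dir=$(mktemp -d)
printf 'Freshness: show a "New" badge for 7 days.\n' > "$demo_dir/01-home-hero.md"
printf 'Freshness: show the publish date only.\n'    > "$demo_dir/02-card-grid.md"
printf 'States: default, hover, focus.\n'            > "$demo_dir/03-article-header.md"

overlapping=$(detect_overlap "freshness" "$demo_dir")
overlap_count=$(printf '%s\n' "$overlapping" | grep -c . || true)

if [ "$overlap_count" -gt 1 ]; then
  echo "Overlap on 'freshness' in $overlap_count outputs; log it for the merge review:"
  printf '%s\n' "$overlapping"
fi

rm -rf "$demo_dir"
```

The script stops at reporting on purpose. Which treatment wins, or whether neither does, still belongs in the merge log.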
#!/usr/bin/env bash
# TaskCompleted-style gate: block completion until the worker's output passes
# the run's mechanical checks. Exit 0 to allow, exit 2 to block with feedback.
OUTPUT_FILE="$1" # e.g. outputs/02-articles-index-card-grid-spec.md
fail() { echo "$1" >&2; exit 2; }
[ -f "$OUTPUT_FILE" ] || fail "Missing owned output file: $OUTPUT_FILE"
# One owner, one file: nothing else in outputs/ may have changed in this task.
CHANGED=$(git status --porcelain outputs/ | grep -vF -- "$(basename "$OUTPUT_FILE")" || true)
[ -z "$CHANGED" ] || fail "Worker touched files outside its ownership boundary: $CHANGED"
# Token discipline: no raw hex color values in a spec (color names would need a separate word list).
grep -nE "#[0-9a-fA-F]{3,8}\b" "$OUTPUT_FILE" >&2 && fail "Raw color values found; use DESIGN.md token names"
# Required sections: states, responsive behavior, and an explicit do-not list.
for section in "States" "Responsive" "Do not"; do
grep -qi "$section" "$OUTPUT_FILE" || fail "Spec is missing a required section: $section"
done
exit 0
Section 13
Good vs bad orchestration
The failure modes of orchestration are boring, predictable, and almost all visible in this one small run or in the documentation of the tools that support it. The lead does the work itself instead of delegating — the documented agent-teams failure, and the thing this run's orchestrator caught itself doing. Briefs without exclusions, so workers wander into each other's surfaces the way the hero worker did. Two workers given write access to the same file or the same component definition, which turns the merge from a review into a rescue. Overlaps left implicit instead of named, so the merge is surprised by conflicts it should have been waiting for. Merge by concatenation, where everything is accepted because nothing is individually wrong. Teams left running after the work is done, which on a hosted team surface is not just untidy but billed. And the quiet one: no single-agent baseline or prior expectation, so nobody can say afterward whether the orchestration bought anything.
Good orchestration is the same list inverted, and it is mostly writing. The orchestrator's output is a contract, a set of briefs, gate criteria, and a merge log; if it is producing design content outside those, the decomposition is being bypassed. Overlaps are named in advance so conflicts arrive as expected comparisons rather than surprises. Ownership is one sentence per worker and one owned output per worker, with worktrees underneath once outputs become edits. The QA role is read-only and independent. The merge resolves against the design system and the user job, records what it rejected, and leaves unowned gaps as open questions. And the team is sized like the documentation says and experience confirms: a few focused workers over many scattered ones, scaled to the number of genuinely separable surfaces rather than to enthusiasm.
- Good: the lead's only outputs are the contract, briefs, gates, and the merge log
- Good: every brief states what the worker must not touch, in writing
- Good: one owned output per worker; worktrees once outputs become edits
- Good: known overlaps named in the orchestration brief so the merge expects them
- Good: resolve against the design system and user job; record rejected options
- Good: shut the team down; idle workers keep consuming tokens
- Good: a baseline run or a prior estimate, so the complexity tax is measured against something
Most orchestration failures are boundary and review failures, not tool failures.
Section 14
Limits, risks, and when to fall back to one agent mid-run
Some of this article's subject matter is experimental and labeled that way for a reason. Agent Teams are opt-in and changing; details quoted here were verified against the documentation at the start of June 2026 and should be re-checked before you build a workflow that depends on them. Codex's parallel surfaces were located through search rather than exercised hands-on for this run, OpenCode's team layer is community work rather than product, and OpenPencil's concurrent canvas could not be executed in this environment at all. The durable layer is the one that does not depend on any of them: briefs, constraints, owned outputs, gates, and a merge log are plain files and survive every product change.
Some work simply does not orchestrate. Taste loops — the polish pass where spacing, copy tone, and hierarchy are being tuned against each other — get worse when split, because every decision depends on the decisions around it; that is the strongest published counter-position to multi-agent work and it is consistent with what this run saw at the seams. Small tasks do not earn the setup bill; this run's twenty-two minutes of contract-writing would have been most of the total budget for a single-surface change. Debugging is genuinely harder across distributed traces: when the merged plan has a problem, the cause may live in any worker's pass, in the brief that shaped it, or in the constraint nobody wrote, and you will read more text finding it than you would have in one session's history. And budgets are real — token use grows with the size of the team, and the meter runs whether or not the split is helping.
Falling back is not failure, and it is easier than it feels mid-run. If workers keep colliding, if the same brief needs rewriting twice, or if the merge review starts to look like a rescue, the artifacts you have already produced are exactly what a single agent needs to finish the job well: hand one agent the orchestration brief, the shared constraints, the surviving outputs, and the merge log so far, and let it complete the work in one context. The escalation path runs in both directions, and the files are the thing that makes both directions cheap. Anti-patterns to watch for beyond that: orchestrating because the tooling is exciting rather than because the task has boundaries; scaling the worker count to the org chart instead of to the surfaces; treating worker output as finished design because it arrived in parallel; and skipping the baseline so the orchestration can never be found wanting.
- Re-verify experimental surfaces before depending on them; the file-based artifacts are the part that does not churn.
- Do not split taste loops, polish passes, or anything where every decision depends on its neighbors.
- Expect debugging across traces to cost more than debugging one session's history.
- Watch the budget: token use scales with team size, and idle workers keep consuming until shut down.
- Fall back to one agent by handing it the briefs, constraints, outputs, and merge log produced so far.
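The fallback handoff described above is mostly file concatenation. A minimal sketch, assuming a run folder laid out with an orchestration brief, shared-constraints.md, an outputs/ directory, and a merge-log.md: the file name orchestration-brief.md, the function name, and the demo contents are hypothetical, so adjust them to however your run is organized.

```shell
#!/usr/bin/env bash
# Bundle the run's surviving artifacts into one handoff file for a single agent.
# File names are assumptions; the handoff file must live outside outputs/.
build_handoff() {
  local run_dir="$1" out="$2" f
  : > "$out"   # start from an empty handoff file
  for f in "$run_dir/orchestration-brief.md" \
           "$run_dir/shared-constraints.md" \
           "$run_dir"/outputs/*.md \
           "$run_dir/merge-log.md"; do
    [ -f "$f" ] || continue   # skip anything the run never produced
    printf '\n===== %s =====\n' "${f#"$run_dir"/}" >> "$out"
    cat "$f" >> "$out"
  done
}

# Demo on a throwaway run folder with made-up content.
run_dir=$(mktemp -d)
mkdir -p "$run_dir/outputs"
echo "User job and ownership boundaries." > "$run_dir/orchestration-brief.md"
echo "Token names only."                  > "$run_dir/shared-constraints.md"
echo "Hero spec."                         > "$run_dir/outputs/01-home-hero.md"
# No merge-log.md yet: in this demo the fallback happens before the merge review.

handoff="$run_dir/handoff.md"
build_handoff "$run_dir" "$handoff"
section_count=$(grep -c '^===== ' "$handoff")
echo "handoff sections: $section_count"
```

Missing files are skipped rather than treated as errors, because a mid-run fallback often happens before the merge log exists.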
Section 15
A reusable orchestration kit
Everything this article executed reduces to a kit you can reuse the same day, on any platform that can read files. One folder per run. Inside it: an orchestration brief that states the user job, the deliverable, one-sentence ownership boundaries, the named overlaps, and the merge gates; a shared constraints file every brief points at; one brief and one owned output file per worker, with a read-only QA worker once you pass three or four writers; a merge log with sections for accepted items, conflicts with decisions and rejected options, gaps at the seams, and open questions; and run notes recording wall-clock per phase, iterations, what was caught at which gate, and what went wrong. The case study's artifacts are the worked example of every one of these, and the prompt below is the merge review condensed into something you can paste at the end of any run, including a run where the workers were people.
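The folder described above can be scaffolded in a few lines. This is a sketch, not a prescribed layout: apart from shared-constraints.md, outputs/, and merge-log.md, every file and directory name here is a placeholder chosen for illustration, and the worker names in the demo are hypothetical.

```shell
#!/usr/bin/env bash
# Scaffold one run folder: contract files, one brief and one owned output per
# worker, a merge log with its section headings, and run notes.
scaffold_run() {
  local run_dir="$1" worker
  [ -n "$run_dir" ] || { echo "usage: scaffold_run <run-dir> <worker>..." >&2; return 1; }
  shift
  mkdir -p "$run_dir/briefs" "$run_dir/outputs"
  # touch creates missing files without wiping ones that already exist.
  touch "$run_dir/orchestration-brief.md" \
        "$run_dir/shared-constraints.md" \
        "$run_dir/run-notes.md"
  for worker in "$@"; do
    touch "$run_dir/briefs/$worker.md" "$run_dir/outputs/$worker-spec.md"
  done
  # Write the merge-log skeleton only once, so a rerun never erases decisions.
  [ -e "$run_dir/merge-log.md" ] || printf '# %s\n\n' \
    "Accepted items" "Conflicts and decisions" "Rejected options" \
    "Gaps" "Open questions" > "$run_dir/merge-log.md"
}

# Demo: two hypothetical writers in a throwaway location.
run_dir="$(mktemp -d)/run-demo"
scaffold_run "$run_dir" 01-home-hero 02-card-grid
brief_count=$(ls "$run_dir/briefs" | wc -l | tr -d ' ')
echo "briefs created: $brief_count"
```

Creating the merge log up front, with its headings already in place, is a small nudge toward writing rejected options and open questions down rather than reconstructing them afterward.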
The runbook order matters more than any individual artifact: write the contract before the briefs, the briefs before any worker runs, and the gates before any output exists; let the workers run without interference; review at the gate, not over the worker's shoulder; classify conflicts before resolving them; resolve against the system, record what you rejected, log what nobody owned; and shut the team down when the merge log is written. Then keep the briefs and the log and delete the rest — the durable value of an orchestrated run is the decisions and the structure, not the scratch output. That is also the honest summary of the whole exercise: orchestration is a writing discipline with scheduling attached, and the merge log is the design review you were going to need anyway, finally written down.
Run the merge review for this orchestrated design run. Inputs: the orchestration brief, shared-constraints.md, DESIGN.md, every file in outputs/, and the QA findings file if present.
1. Ownership audit: for each output, confirm the worker stayed inside its stated boundary. List every place a worker specified another worker's surface.
2. Conflict classification: list every disagreement between outputs and classify it as one of: token-semantics conflict, seam or alignment conflict, duplicated or competing signal, or unowned gap (a decision no brief assigned).
3. Resolution: resolve each conflict against DESIGN.md and the user job stated in the orchestration brief, never against which worker wrote more or answered last. State the decision and the reason.
4. Rejected options: record every rejected proposal with the reason it lost.
5. Gaps: list decisions nobody owned. Do not invent answers; log them as open questions for human review.
6. Output: write merge-log.md with sections for accepted items, conflicts and decisions, rejected options, gaps, and open questions. Do not edit any worker output and do not implement anything until the merged plan is approved.
Sources
Sources & further reading
- Claude Code Agent Teams documentation
The experimental team surface: shared task list, mailbox, plan approval, team hooks, sizing guidance, and limitations.
- Claude Code subagents documentation
Per-subagent context windows, tool restrictions, summary-only returns, and which subagents load project instructions.
- Claude Code worktrees documentation
Isolated checkouts per session or per subagent — the file-isolation layer under parallel work.
- Claude Code costs documentation
Includes agent-team token costs: usage scales with team size, idle teammates keep consuming, and the recommended economy moves.
- Codex cloud tasks
Hosted background and parallel Codex tasks, including best-of-N attempts at the same task.
- Codex CLI features
Current Codex CLI capabilities, including subagent workflows for parallelizing larger tasks.
- OpenCode agents documentation
Primary agents, subagents, and per-agent permissions including control over which agents may be invoked.
- Porting Claude Code's agent teams to OpenCode
Community write-up reproducing the team pattern on OpenCode — ecosystem work, not an official feature.
- OpenPencil
MIT-licensed AI-native vector tool with concurrent agent teams over canvas regions and a multi-writer .op format.
- Anthropic: How we built our multi-agent research system
Reports multi-agent research systems using roughly fifteen times the tokens of a single chat — the scale figure quoted in the cost section.
- Cognition: Don't Build Multi-Agents
The strongest published counter-position: parallel workers without shared context make conflicting assumptions on write-heavy work.
- LangChain: How and when to build multi-agent systems
Reconciles the positions around read-heavy versus write-heavy decomposability.
- Addy Osmani: Claude Code Swarms
Hands-on practitioner write-up on running agent teams at scale, including coordination and review observations.
- alexop.dev: From Tasks to Swarms
Practitioner walkthrough of moving from single-session task lists to agent teams.
- Builder.io: Claude agent teams explained
A skeptical practitioner take on when agent teams are and are not worth using.
- Mike Mason: AI coding agents in 2026 — coherence through orchestration, not autonomy
Argues that multi-agent quality comes from orchestration structure and review gates rather than agent autonomy.
- Particula: the worktree-per-worker pattern (oh-my-codex)
Community example of hand-rolled worktree-fleet orchestration outside the Anthropic ecosystem.
- Running multiple Codex agents: parallel orchestration patterns
Practitioner notes on scoping work per session so parallel agents produce complementary rather than competing changes.

