Agentic Design School
UX workflow
Intermediate

IA Audit and Card Sort Analysis

Two connected loops for information architecture work: an audit that crawls the navigation and content inventory with one agent per section, and a card sort and tree test analysis stage where agents compute agreement and label confusion while a human IA designer judges the tradeoffs.

Orchestration: Subagent fan-out per section plus analysis stage

Typical run: 2–3 hours per site or app area

Last reviewed: 2026-06-02


Section 1

Why IA work is two jobs, not one

Information architecture problems show up as small, constant frictions: people search instead of browsing, the same content lives in three places under three names, and the navigation describes the org chart better than it describes anyone's task. Fixing this is two distinct jobs that teams tend to blur together — diagnosing the structure you have, and testing the structure you might want.

The diagnosis is an audit: walk the navigation, inventory the content, and look at every label, depth level, duplicate, and orphan with the same level of attention. Almost nobody does this well by hand because the work is enormous and tedious: a 4,000-page site does not get a careful per-page review from a human; it gets a sample and a guess.

The test is a card sort or tree test, and its bottleneck is different: the data arrives as exports that need patient computation — agreement between participants, which cards travel together, where labels confuse people — before a designer can do the actual design work of proposing a structure. This workflow gives both jobs to agents in two connected loops, and keeps the judgment calls where they belong.

Section 2

When to reach for this workflow

Use the audit loop when a site or product area has grown by accretion and nobody can say with confidence what is in it: before a redesign, before a migration, before a navigation change, or when search logs and support tickets suggest people cannot find things. Use the analysis loop when you have run a card sort or tree test and the export is sitting unanalyzed, which is more common than anyone admits.

The two loops connect but do not require each other. An audit can stand alone as a content inventory with findings; a card sort analysis can stand alone if the inventory already exists. Run both when the goal is a restructure proposal you can defend.

  • A site or app area where depth, duplication, and label drift have accumulated over years.
  • A pending migration or redesign that needs a real inventory, not a sitemap from memory.
  • A completed open or closed card sort export that nobody has had time to analyze.
  • Tree test results that need to be compared against the current structure honestly.

Section 3

Loop one: build the inventory before judging it

The audit starts with a crawl, not an opinion. Use whatever produces a reliable URL list with titles and navigation context — a Screaming Frog export, a sitemap, a CMS export, or a small crawler script the agent writes against the site. The inventory needs, per page: URL, title, the navigation path that reaches it, depth, and the first few hundred words of body content so label-versus-content mismatches can be checked.

For an app rather than a site, the inventory is the screen and settings tree instead: every screen, where it sits in the navigation, and what it actually contains. The crawler is replaced by an export from the design system, the route table, or a manual walk that the agent helps structure.

Split the inventory by top-level section before the audit pass. Sections are the natural unit of fan-out, and they keep each audit agent's reading load small enough that page 300 gets the same attention as page 3.
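The split itself can be sketched in a few lines. This is a minimal illustration, not the article's own script: it assumes inventory rows have already been parsed into objects with a `url` field, and it treats the first URL path segment as the top-level section, which will not hold for sites whose navigation and URL structure disagree. The function name `splitBySection` and the "(root)" bucket are hypothetical.

```javascript
// Group inventory rows by the first path segment of each URL.
// Assumption: the first path segment corresponds to the top-level section.
function splitBySection(rows) {
  const sections = new Map()
  for (const row of rows) {
    const segments = new URL(row.url).pathname.split("/").filter(Boolean)
    // Pages at the site root (for example the home page) go into a "(root)" bucket.
    const section = segments[0] ?? "(root)"
    if (!sections.has(section)) sections.set(section, [])
    sections.get(section).push(row)
  }
  return sections
}

// Illustrative rows only; real rows come from inventory.csv.
const sampleRows = [
  { url: "https://www.example.edu/", title: "Home" },
  { url: "https://www.example.edu/admissions/fees", title: "Fees" },
  { url: "https://www.example.edu/admissions/apply", title: "Apply" },
  { url: "https://www.example.edu/research/labs/optics", title: "Optics lab" },
]

const sections = splitBySection(sampleRows)
for (const [name, pages] of sections) console.log(name + ": " + pages.length + " pages")
```

Where the navigation path is available in the inventory, grouping on its first element is likely a better key than the URL, since the audit is about the structure users see.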

build-inventory.mjs (crawl excerpt)
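import { writeFile } from "node:fs/promises"

// Builds inventory.csv from a sitemap: url, title, nav_path, depth, excerpt.
// For large sites, prefer a Screaming Frog or CMS export and skip the crawl.
const sitemap = await fetch("https://www.example.edu/sitemap.xml").then((r) => r.text())
const urls = [...sitemap.matchAll(/<loc>\s*(.*?)\s*<\/loc>/g)].map((m) => m[1])

// CSV quoting: wrap every cell in quotes and double any embedded quote.
const csvCell = (value) => '"' + String(value).replace(/"/g, '""') + '"'

const rows = [["url", "title", "nav_path", "depth", "excerpt"]]
for (const url of urls) {
  let html
  try {
    const res = await fetch(url)
    if (!res.ok) throw new Error("HTTP " + res.status)
    html = await res.text()
  } catch (err) {
    // One unreachable page should not abort the whole inventory.
    console.warn("Skipped " + url + ": " + err.message)
    continue
  }
  const title = (html.match(/<title[^>]*>(.*?)<\/title>/is) || [, ""])[1].trim()
  // Drop scripts and styles before stripping tags so the excerpt is body text only.
  const text = html.replace(/<(script|style)[\s\S]*?<\/\1>/gi, "").replace(/<[^>]+>/g, " ")
  const depth = new URL(url).pathname.split("/").filter(Boolean).length
  // nav_path stays empty here: a sitemap does not carry the navigation path.
  rows.push([url, title, "", depth, text.replace(/\s+/g, " ").trim().slice(0, 400)])
}

await writeFile("./inventory.csv", rows.map((r) => r.map(csvCell).join(",")).join("\n"))
console.log("Inventoried " + (rows.length - 1) + " of " + urls.length + " pages")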

Section 4

The orchestration pattern: fan out by section, then analyze

Both loops run as a Claude Code dynamic workflow: a JavaScript script that Claude writes and runs in the background, orchestrating subagents while intermediate results — the inventory rows, the per-section findings, the computed matrices — stay in script variables rather than in Claude's own context. With thousands of pages or dozens of card sort participants, that is the difference between a careful pass and a context window full of half-read CSV.

In the audit loop, one agent per section assesses labels, depth, duplication, orphan pages, and mismatches between what the navigation promises and what the page contains, and a merge step combines findings into one report. In the analysis loop, a script stage computes the numbers from the export — co-occurrence, agreement, tree test success and directness — and analysis agents describe what the numbers show and propose candidate structures for a human to evaluate.

Workflows can run up to 16 agents concurrently and up to 1,000 in a run, are resumable if a crawl or a long audit gets interrupted, and can be saved to .claude/workflows/ in the project (or ~/.claude/workflows/ for personal use) as a reusable command like /ia-audit. You trigger a run by including the word workflow in the prompt or via /effort ultracode, and the ia-section-auditor and sort-analyst subagent definitions live in .claude/agents/*.md.
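To make the fan-out shape concrete, here is a plain JavaScript sketch of bounded concurrency. It is not the workflow runtime's API, which this article does not show: `mapWithLimit` is a generic helper, and `auditSection` is a hypothetical stand-in for whatever call actually launches a subagent. The point it illustrates is that per-section results accumulate in a script variable and only a short summary is printed.

```javascript
// Run `worker` over `items` with at most `limit` calls in flight at once.
// Results come back in input order, whatever order the calls finish in.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length)
  let next = 0
  async function runner() {
    while (next < items.length) {
      const index = next++ // claimed synchronously, so no two runners share an index
      results[index] = await worker(items[index], index)
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, items.length) }, () => runner()))
  return results
}

// Hypothetical stand-in for "launch one audit agent for this section".
// A real run would hand the rows to a subagent; this only counts them.
async function auditSection(section) {
  return { section: section.name, pagesReviewed: section.rows.length, findings: [] }
}

// Illustrative sections only.
const demoSections = [
  { name: "admissions", rows: [{}, {}, {}] },
  { name: "research", rows: [{}] },
  { name: "library", rows: [{}, {}] },
]

// Findings stay in `perSection`; only a one-line summary reaches the console.
mapWithLimit(demoSections, 16, auditSection).then((perSection) => {
  const total = perSection.reduce((sum, s) => sum + s.pagesReviewed, 0)
  console.log(perSection.length + " sections audited, " + total + " pages reviewed")
})
```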

Diagram: IA audit and analysis loop

1. Build inventory
2. Audit per section
3. Merge findings
4. Run card sort or tree test
5. Compute agreement
6. Propose structures
7. Human IA review (feeds next cycle)

The audit findings shape what gets card sorted; the sort results loop back into the restructure proposal.

Section 5

The audit prompt

Give the audit agents the inventory, the current navigation, and a fixed set of questions per section. The discipline that keeps the output useful is the same as in any agent review: findings must cite specific URLs, and observations must be separated from recommendations. An audit that says the labels are inconsistent is a vibe; an audit that lists the four labels used for the same fee-payment page is a finding.

IA audit workflow prompt
Run this as a workflow.

Input: ./inventory.csv (url, title, nav_path, depth, excerpt) and ./current-nav.md describing the navigation as users see it.

Stage 1 - Section audit: Split the inventory by top-level section. For each section, launch one agent that reviews only that section's rows and reports, with URLs cited for every finding: (a) labels that do not describe the page content beneath them, (b) pages deeper than 4 levels and how they are reached, (c) near-duplicate pages covering the same task or topic, (d) orphan pages with no navigation path, (e) inconsistent naming for the same concept across the section.

Stage 2 - Merge: Combine the per-section findings into ia-audit.md, grouped by issue type, with counts per section and a table of the 20 deepest problem pages (rank by traffic instead only if a traffic column has been added to the inventory). Separate observations (what is true of the inventory) from recommendations (what might change), and keep recommendations to one short closing section.

Rules: every finding cites at least one URL from the inventory; do not assess pages that are not in the inventory; do not propose a new sitemap in this loop.
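The citation rule is mechanical enough to check in the merge step before anything is written to the report. The sketch below assumes each agent returns findings as objects with a `urls` array; that shape, and the name `findUncitedFindings`, are assumptions for illustration rather than a format the workflow prescribes.

```javascript
// Return the findings that break the rule "every finding cites an inventory URL".
// A finding fails if it cites nothing, or cites any URL the inventory does not contain.
function findUncitedFindings(findings, inventoryUrls) {
  const known = new Set(inventoryUrls)
  return findings.filter((finding) => {
    const cited = finding.urls ?? []
    return cited.length === 0 || cited.some((url) => !known.has(url))
  })
}

// Illustrative data only.
const inventoryUrls = [
  "https://www.example.edu/fees/pay",
  "https://www.example.edu/finance/payments",
]
const findings = [
  { type: "duplication", note: "Two pages document fee payment", urls: inventoryUrls },
  { type: "naming drift", note: "Labels are inconsistent", urls: [] },
  { type: "depth", note: "Page is too deep", urls: ["https://www.example.edu/not-in-inventory"] },
]

const rejected = findUncitedFindings(findings, inventoryUrls)
console.log(rejected.length + " findings sent back for citations")
for (const f of rejected) console.log("- " + f.type + ": " + f.note)
```

Rejected findings can be returned to the section agent for a second pass instead of being dropped silently.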

Section 6

The section auditor, defined once

The per-section agent is worth defining in .claude/agents/ rather than inline, because the questions it asks are stable across projects even when the inventory changes. The definition keeps the agent on observation rather than redesign — the restructure proposal belongs to the second loop, after the card sort data exists.

.claude/agents/ia-section-auditor.md
---
name: ia-section-auditor
description: Audits one section of a content inventory for label, depth, duplication, and orphan issues. Use during IA audit workflows.
tools: Read, Grep
model: sonnet
---

You audit exactly one section of a content inventory. You receive the inventory rows for your section and the current navigation description.

Report, citing URLs for every finding:
- Label-content mismatches: navigation or page titles that do not describe what the page contains.
- Depth: pages more than 4 levels deep, with the path that reaches them.
- Duplication: pages that cover the same task or topic; quote the overlapping titles.
- Orphans: pages with no navigation path in the inventory.
- Naming drift: the same concept named differently across the section.

Do not propose a new structure. Do not assess pages outside your section. If the section looks healthy, say so briefly rather than inventing findings.

Section 7

Loop two: card sort and tree test analysis

The second loop starts from an export: an OptimalSort-style CSV of which participant put which card in which named group, or a tree test export of task, path taken, and outcome. The first stage is computation, not language: a script builds the card-by-card co-occurrence matrix (how often each pair of cards ended up in the same group), per-card agreement scores, the group labels participants invented, and for tree tests the success, directness, and first-click numbers per task.

Keep the statistics honest by keeping them boring. The script computes percentages and matrices and writes them to files; the analysis agents describe what is in those files, name where the data is ambiguous or the sample is small, and propose two or three candidate structures with the evidence that supports each. Humans judge significance, weigh tradeoffs the data cannot see — brand language, regulatory naming, what the CMS can support — and pick or blend a structure.

Dendrograms are useful but they are an artifact of a clustering algorithm, not a fact about users. The workflow treats clustering output as data to inspect — printed as ordered cluster lists with the merge heights — rather than as a picture that settles arguments.
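A small sketch of what "clustering output as data" can look like. It assumes the co-occurrence matrix holds percentages keyed as `matrix[a][b]`, converts each to a distance of 1 minus the share, and runs a naive average-linkage merge. The function names and the four sample cards are hypothetical, and the brute-force loop is meant for the few dozen cards of a typical sort, not for large inputs.

```javascript
// Naive average-linkage clustering over a percentage co-occurrence matrix.
// Returns the merges in order, each with the height at which it happened.
function averageLinkage(cards, matrix) {
  const dist = (a, b) => 1 - (matrix[a]?.[b] ?? 0) / 100
  let clusters = cards.map((card) => [card])
  const merges = []
  while (clusters.length > 1) {
    let best = { i: 0, j: 1, height: Infinity }
    for (let i = 0; i < clusters.length; i++) {
      for (let j = i + 1; j < clusters.length; j++) {
        // Average linkage: mean distance over every cross-cluster pair.
        let total = 0
        for (const a of clusters[i]) for (const b of clusters[j]) total += dist(a, b)
        const height = total / (clusters[i].length * clusters[j].length)
        if (height < best.height) best = { i, j, height }
      }
    }
    const members = [...clusters[best.i], ...clusters[best.j]]
    merges.push({ height: best.height, members })
    clusters = clusters.filter((_, k) => k !== best.i && k !== best.j).concat([members])
  }
  return merges
}

// Clusters that exist if the tree is cut at `cut`: replay merges up to that height.
function cutAt(cards, merges, cut) {
  let clusters = cards.map((card) => [card])
  for (const merge of merges) {
    if (merge.height > cut) break
    const inMerge = new Set(merge.members)
    // Each existing cluster is wholly inside or outside the merge, so one member decides.
    clusters = clusters.filter((c) => !inMerge.has(c[0])).concat([merge.members])
  }
  return clusters
}

// Illustrative data only: two tight pairs with weak links between them.
const demoCards = ["Pay fees", "Scholarships", "Library hours", "Book a room"]
const demoMatrix = {
  "Pay fees": { Scholarships: 80, "Library hours": 10, "Book a room": 10 },
  Scholarships: { "Pay fees": 80, "Library hours": 10, "Book a room": 10 },
  "Library hours": { "Pay fees": 10, Scholarships: 10, "Book a room": 70 },
  "Book a room": { "Pay fees": 10, Scholarships: 10, "Library hours": 70 },
}

const merges = averageLinkage(demoCards, demoMatrix)
for (const m of merges) console.log(m.height.toFixed(2) + "  " + m.members.join(" | "))
console.log("Groups at cut 0.5: " + cutAt(demoCards, merges, 0.5).length)
```

Printing the merge list with heights makes the cut point visible as a choice: the same data yields a different number of groups at each height.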

analyse-card-sort.mjs (computation excerpt)
import { mkdir, readFile, writeFile } from "node:fs/promises"

// Reads an OptimalSort-style export: participant, card, group_label.
// Assumes plain comma-separated rows with no quoted commas; use a CSV parser otherwise.
const rows = (await readFile("./card-sort-export.csv", "utf8"))
  .split(/\r?\n/).slice(1).filter((line) => line.trim())
  .map((line) => {
    const [participant, card, group] = line.split(",").map((s) => s.trim())
    return { participant, card, group }
  })

const participants = [...new Set(rows.map((r) => r.participant))]
const cards = [...new Set(rows.map((r) => r.card))]

// Index each participant's placements once: participant -> (card -> group).
const placements = new Map(participants.map((p) => [p, new Map()]))
for (const r of rows) placements.get(r.participant).set(r.card, r.group)

// Co-occurrence: % of participants who placed each pair of cards in the same group.
const matrix = {}
for (const a of cards) {
  matrix[a] = {}
  for (const b of cards) {
    if (a === b) continue
    let together = 0
    for (const groups of placements.values()) {
      const ga = groups.get(a)
      const gb = groups.get(b)
      if (ga && gb && ga === gb) together++
    }
    matrix[a][b] = Math.round((together / participants.length) * 100)
  }
}

// Label frequency: how many participants used each group label.
// Counting participants rather than rows keeps large groups from inflating a label.
const labelUsers = {}
for (const r of rows) (labelUsers[r.group] ??= new Set()).add(r.participant)
const labels = Object.fromEntries(
  Object.entries(labelUsers).map(([label, users]) => [label, users.size])
)

// Create the output directory first so the writes do not fail on a fresh checkout.
await mkdir("./output", { recursive: true })
await writeFile("./output/co-occurrence.json", JSON.stringify(matrix, null, 2))
await writeFile("./output/group-labels.json", JSON.stringify(labels, null, 2))
console.log(participants.length + " participants, " + cards.length + " cards analyzed")
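The tree test side of the computation can be sketched the same way. The export shape here is an assumption, since tools differ: each attempt is taken to carry a `task`, a `path` array starting at the root, and an `outcome` of "success" or "failure". Directness is defined in this sketch as "no node visited twice", and first click as the second element of the path; both definitions, the `expectedFirstClick` map, and the function name are hypothetical and should be matched to what your tool actually reports.

```javascript
// Per-task success, directness, and first-click percentages from tree test attempts.
function treeTestMetrics(attempts, expectedFirstClick) {
  const byTask = new Map()
  for (const attempt of attempts) {
    if (!byTask.has(attempt.task)) byTask.set(attempt.task, [])
    byTask.get(attempt.task).push(attempt)
  }
  const pct = (count, total) => Math.round((count / total) * 100)
  const metrics = {}
  for (const [task, list] of byTask) {
    const success = list.filter((a) => a.outcome === "success").length
    // Direct: the path never revisits a node, so the participant did not backtrack.
    const direct = list.filter((a) => new Set(a.path).size === a.path.length).length
    // First click: path[0] is the root, path[1] is the first choice made.
    const firstClick = list.filter((a) => a.path[1] === expectedFirstClick[task]).length
    metrics[task] = {
      attempts: list.length,
      successPct: pct(success, list.length),
      directPct: pct(direct, list.length),
      firstClickPct: pct(firstClick, list.length),
    }
  }
  return metrics
}

// Illustrative data only.
const attempts = [
  { task: "Pay a fee", path: ["Home", "Money", "Pay fees"], outcome: "success" },
  { task: "Pay a fee", path: ["Home", "Study", "Home", "Money", "Pay fees"], outcome: "success" },
  { task: "Pay a fee", path: ["Home", "Money", "Scholarships"], outcome: "failure" },
  { task: "Pay a fee", path: ["Home", "Study", "Courses"], outcome: "failure" },
]

const metrics = treeTestMetrics(attempts, { "Pay a fee": "Money" })
console.log(JSON.stringify(metrics, null, 2))
```

Keeping the attempt count next to each percentage lets the analysis agents and the designer see how much data each figure rests on.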

Section 8

The analysis prompt

Once the numbers exist, the analysis agents read them alongside the audit findings from loop one. Insisting that every claim in the analysis names the number behind it is what stops the proposal stage from drifting into taste dressed up as data.

Card sort analysis workflow prompt
Run this as a workflow.

Input: ./output/co-occurrence.json, ./output/group-labels.json, the raw export at ./card-sort-export.csv, and ./ia-audit.md from the audit loop. The sort was an open card sort with the participant count stated in the export.

Stage 1 - Describe: One agent summarizes what the numbers show: card pairs with co-occurrence above 70 percent, cards with no strong partners (highest co-occurrence below 40 percent), participant group labels that recur, and labels applied to very different card sets by different participants (label confusion). Every claim names the figure it rests on.

Stage 2 - Propose: A second agent proposes 2-3 candidate top-level structures. Each candidate lists: its sections with suggested labels drawn from participant language where possible, which co-occurrence clusters support each section, which cards remain ambiguous, and which audit findings it would and would not resolve.

Stage 3 - Caveats: A third agent lists what this data cannot support: claims about findability (that needs a tree test), anything resting on fewer than 5 participants agreeing, and differences between candidates that the data does not distinguish.

Output: sort-analysis.md, candidate-structures.md, caveats.md. Do not recommend a single winner; present the candidates and tradeoffs for the IA designer to evaluate.
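The two thresholds in Stage 1 can be applied by the script before any agent reads the matrix, so the description stage starts from a short list instead of the full grid. This sketch assumes the `matrix[a][b]` percentage shape written by the computation script; the function name and the three sample cards are hypothetical.

```javascript
// Pull out the pairs and cards the Describe stage asks about:
// pairs above `strongAbove` percent, and cards whose best partner is below `weakBelow`.
function describeCooccurrence(matrix, strongAbove = 70, weakBelow = 40) {
  const cards = Object.keys(matrix)
  const strongPairs = []
  const weakCards = []
  for (const [i, a] of cards.entries()) {
    // Only look at cards after `a` so each pair is reported once.
    for (const b of cards.slice(i + 1)) {
      const pct = matrix[a][b] ?? 0
      if (pct > strongAbove) strongPairs.push({ pair: [a, b], pct })
    }
    const best = Math.max(0, ...cards.filter((b) => b !== a).map((b) => matrix[a][b] ?? 0))
    if (best < weakBelow) weakCards.push({ card: a, best })
  }
  strongPairs.sort((x, y) => y.pct - x.pct)
  return { strongPairs, weakCards }
}

// Illustrative data only.
const sampleMatrix = {
  "Pay fees": { Scholarships: 82, "Campus map": 12 },
  Scholarships: { "Pay fees": 82, "Campus map": 25 },
  "Campus map": { "Pay fees": 12, Scholarships: 25 },
}

const summary = describeCooccurrence(sampleMatrix)
for (const s of summary.strongPairs) console.log(s.pair.join(" + ") + ": " + s.pct + "%")
for (const w of summary.weakCards) console.log(w.card + ": best partner at " + w.best + "%")
```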

Section 9

Step by step through one engagement

Build or import the inventory and spot-check it: pick ten random URLs and confirm that the rows describe real pages. Run the audit loop and read the merged findings; expect this to take under an hour for most product areas and longer for very large sites where the crawl dominates. Use the audit to choose the 30 to 60 cards worth sorting; cards should be tasks and content people actually look for, not the current navigation labels read back to participants.

Run the sort or tree test with real participants using whatever tool the team has. When the export lands, run the analysis loop: computation first, description second, candidates third. Then do the IA design work — evaluate the candidates against the constraints the data cannot see, and where two candidates are genuinely close, plan a tree test to separate them rather than arguing.

Budget two to three hours of agent runtime and review per site or app area, excluding the days the human study itself takes. The agents compress the inventory drudgery and the export math; they do not compress recruitment.

Diagram: From audit to validated structure

1. Audit findings
2. Choose cards
3. Run study
4. Compute results
5. Candidate structures
6. Tree test the favorite

The audit chooses the cards, the sort produces the data, the analysis frames the candidates, and a tree test checks the winner.

Section 10

Case study: a university website with 4,000 pages

A university web team ran the audit loop across a 4,000-page site spread over eleven top-level sections, ahead of a CMS migration. The crawl and per-section audit took most of one afternoon, dominated by the inventory build. The merged findings reported 9 levels of depth at the deepest point — a scholarship-conditions page reached only through a 2019 news article — 312 pages with no navigation path, and the same fee-payment task documented on four pages owned by three departments under four different labels.

The audit changed the migration plan before any card sorting happened: roughly 700 pages were marked for archive review rather than migration, which the content team described as the single largest scope reduction of the project. The card sort that followed used 52 cards drawn from top tasks rather than the existing section names.

The honest caveat from the analysis stage mattered too: prospective students and current students grouped the same cards differently, and the sample of 11 prospective students was too small to design around alone. The team ran a follow-up tree test with prospective students before committing the top level.

Section 11

Case study: settings IA across three acquisitions

A B2B product's settings area had grown by accretion across three acquisitions, and the audit loop inventoried 214 settings screens across the merged products. The findings showed the same concept — user roles and permissions — configurable in three places with three permission models, and 40 percent of settings screens reachable only through a legacy admin URL that did not appear in the new navigation at all.

A closed card sort with 28 administrators followed, sorting 48 settings tasks into a proposed five-category structure. The analysis showed strong agreement (above 75 percent co-occurrence) for billing, security, and integrations, and genuine confusion in the middle: notification settings split almost evenly between a personal-preferences category and a workspace-administration category, which the caveats file correctly framed as a question about per-user versus per-workspace scope rather than a labeling problem.

The IA designer resolved it by splitting notification settings by scope rather than by topic — a design decision the data informed but did not make — and the proposal document cited the co-occurrence figures for the categories it kept and the audit findings for the screens it retired.

Section 12

Case study: an open card sort that contradicted the org chart

A government services team ran an open card sort with 42 participants on 40 service-related cards, with the existing navigation organized by the departments that delivered each service. The computed group labels told the story before any agent wrote a sentence of analysis: participants' most frequent group names were life events and tasks — moving house, starting a business, having a baby — and no participant group label matched a department name.

The analysis stage proposed two candidate structures, one task-based and one life-event-based, and noted that 9 of the 40 cards were ambiguous between them with co-occurrence spread thin. The caveats file flagged that 42 participants supported the overall direction confidently but could not settle the placement of those 9 cards, and recommended a tree test on the two candidates rather than further argument.

The tree test, run with 160 participants on the two structures, gave the task-based candidate an 81 percent average success rate against 64 percent for the existing department structure. The internal debate did not end because the data was unanswerable; it ended because the evidence trail from audit to sort to tree test was complete enough that the department-shaped navigation no longer had a defensible case.

Section 13

Good vs bad analysis output

The failure mode in IA analysis is confidence without numbers: structures recommended because they look clean, statistics implied but never computed, and participant language replaced by the team's internal vocabulary. The fix is mechanical — every claim names its figure, every candidate names its supporting clusters, and everything the data cannot distinguish is said out loud.

Table: Analysis quality comparison

Bad: Users clearly think of the site in terms of five categories
Good: Three clusters show co-occurrence above 75 percent across 42 participants; the remaining cards form no cluster above 48 percent and are listed as ambiguous

Bad: The label Resources tested poorly
Good: The label Resources was applied by 19 participants to card sets that share only 2 cards on average; it functions as a junk drawer rather than a category

Bad: We recommend the task-based structure (no evidence cited)
Good: Two candidates presented; the task-based one is supported by clusters 1, 2, and 4 and resolves 11 of 15 audit findings; 9 cards remain ambiguous and a tree test is proposed to separate the candidates

Bad: The dendrogram proves users want three top-level sections
Good: Average-linkage clustering at height 0.4 yields three groups; at 0.55 it yields five; the cut point is a design choice, listed for the IA designer to make

Computed, cited analysis can be checked; pattern-matched recommendations cannot.
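A figure like the one in the Resources example can be computed directly from the raw export. The sketch below is one way to do it, under stated assumptions: rows are `{ participant, card, group }` objects, labels are matched exactly as typed (no case folding or synonym merging), and "shared cards on average" means the mean intersection size over every pair of participants who used the label. The function name and sample rows are hypothetical.

```javascript
// For each group label used by two or more participants, report how many cards
// the participants' versions of that group share on average.
function labelOverlap(rows) {
  // label -> participant -> Set of cards placed under that label
  const byLabel = new Map()
  for (const { participant, card, group } of rows) {
    if (!byLabel.has(group)) byLabel.set(group, new Map())
    const sets = byLabel.get(group)
    if (!sets.has(participant)) sets.set(participant, new Set())
    sets.get(participant).add(card)
  }
  const report = []
  for (const [label, sets] of byLabel) {
    const cardSets = [...sets.values()]
    if (cardSets.length < 2) continue // nothing to compare against
    let shared = 0
    let pairs = 0
    for (let i = 0; i < cardSets.length; i++) {
      for (let j = i + 1; j < cardSets.length; j++) {
        shared += [...cardSets[i]].filter((card) => cardSets[j].has(card)).length
        pairs++
      }
    }
    report.push({ label, participants: cardSets.length, avgSharedCards: shared / pairs })
  }
  return report
}

// Illustrative data only.
const sortRows = [
  { participant: "p1", card: "Forms", group: "Resources" },
  { participant: "p1", card: "Maps", group: "Resources" },
  { participant: "p1", card: "Policies", group: "Resources" },
  { participant: "p2", card: "Forms", group: "Resources" },
  { participant: "p2", card: "Events", group: "Resources" },
  { participant: "p3", card: "Maps", group: "Resources" },
  { participant: "p3", card: "Forms", group: "Resources" },
  { participant: "p3", card: "Events", group: "What's on" },
]

const overlap = labelOverlap(sortRows)
for (const o of overlap) {
  console.log(o.label + ": " + o.participants + " participants, " + o.avgSharedCards.toFixed(1) + " shared cards on average")
}
```

A low average next to a high participant count is the signature of a catch-all label; whether that makes it a junk drawer is still the designer's call.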

Section 14

Limits: what this workflow cannot prove

Card sorts show how people group concepts; they do not show whether people can find things in a finished navigation, which is what tree tests and live findability metrics measure. The workflow keeps the two claims apart, and so should the report that goes to stakeholders.

Sample sizes in IA studies are usually modest, and agents will happily compute percentages on eight participants that read as more solid than they are. Statistical significance, whether a difference between candidates is meaningful, and whether a subgroup is large enough to design for are judgments the researcher and IA designer make. The agents compute and describe; they do not adjudicate.
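One cheap guard is to never emit a bare percentage. The sketch below formats every share with its underlying count and flags shares that rest on fewer than five agreeing participants, the same floor the caveats stage of the analysis prompt uses. The function name is hypothetical, and the flag is a prompt for human judgment, not a significance test.

```javascript
// Format a share as "38% (3 of 8)" and flag it when too few participants agree.
// The default floor of 5 mirrors the caveats stage of the analysis prompt.
function describeShare(agreeing, total, minAgreeing = 5) {
  const pct = total === 0 ? 0 : Math.round((agreeing / total) * 100)
  return {
    text: pct + "% (" + agreeing + " of " + total + ")",
    smallSample: agreeing < minAgreeing,
  }
}

const weak = describeShare(3, 8)
const solid = describeShare(30, 42)
console.log(weak.text + (weak.smallSample ? "  [small sample]" : ""))
console.log(solid.text + (solid.smallSample ? "  [small sample]" : ""))
```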

Finally, the audit describes the structure, not the strategy. Whether content should exist, which audience the site serves first, and what the organization is willing to rename for political reasons are decisions the data informs and humans own.

  • Cannot prove findability from a card sort; that requires a tree test or live data.
  • Cannot decide statistical significance or whether a small subgroup justifies a separate structure.
  • Cannot weigh brand, legal, or political naming constraints the data never sees.
  • Cannot decide what content deserves to exist; it can only show what is there and how people group it.

Section 15

The reusable IA workflow

Save the section-auditor and analyst definitions and the analysis script in the repository, and save the prompts to .claude/workflows/ so the next area gets audited with an /ia-audit command and the next export gets analyzed the same way as the last one. Consistency across runs is what lets you compare a section audited this quarter with the same section audited after the restructure ships.

IA audit and card sort workflow
1. Build the content inventory: crawl, sitemap, or CMS export with url, title, path, depth, excerpt.
2. Run the audit loop: one agent per section; findings cite URLs; merge into one report.
3. Choose 30-60 cards from top tasks and audit findings, not from current navigation labels.
4. Run the card sort or tree test with real participants in your usual tool.
5. Run the computation script on the export: co-occurrence, agreement, labels, task success.
6. Run the analysis loop: describe the numbers, propose 2-3 candidate structures, list caveats.
7. IA designer evaluates candidates against constraints the data cannot see; tree test close calls.
8. Archive the inventory, exports, computed files, and the decision so the next audit has a baseline.

