Agentic Design School
Module 3 of 5
40–50 minutes

Motion and Storytelling with Agents

Narrated Explainers and Decks

Explainers and decks generated from the same content source as the product or course material: scripts as first-class artifacts, narration and slide structure produced together, and the editing pass that keeps the result sounding like a person rather than a press release.

Duration: 40–50 minutes

Slides: 13 slides with notes and narration

Learning objectives

  • Treat the script as the primary artifact and generate slides and narration from it.
  • Keep one content source feeding written, slide, and video versions of the same material.
  • Edit generated narration for voice, pacing, and honesty before it ships.
Slide deck

Work through the module

Each slide is shown in its 16:9 frame, exactly as it appears in the video version. Open the notes under any slide for the longer explanation, and the narration if you prefer to read along.

Slide 1 of 13 · 16:9

Narrated Explainers and Decks

Motion and Storytelling with Agents · Module 3 of 5

  • The script is the design — slides and narration are generated from it
  • One content source feeding the article, the deck, and the video
  • Synthetic voice-over: how it works and what to disclose
  • The editing pass that keeps the result sounding like a person

This module is about the words. Module 2 covered how motion gets built; here the question is what it says, and in whose voice.

Slide notes

Position this module against the previous two. Module 1 made the case for video as an agent output; Module 2 covered the two working stacks — Remotion compositions and hyperframe-style HTML sequences — and the project structure that keeps them maintainable. This module deliberately steps back from the rendering machinery and spends its time on the content layer: the script, the slide structure derived from it, the narration voice, and the editing pass. Most quality problems in agent-produced explainers are content problems wearing a motion costume.

The phrase 'the script is the design' is the organising idea. In a traditional explainer pipeline the script, the slides, and the voice-over are produced by different people at different times, and they drift. In an agent pipeline, the script is the single artifact a human writes and edits with care, and everything else — slide structure, on-screen text, timing, narration audio, captions — is derived from it mechanically. That makes the script the highest-leverage place to spend design attention.

Flag the worked example early: this school's own course modules are written as slides plus per-slide narration scripts, and that exact format is what feeds the school's video pipeline. Participants are, in a literal sense, inside the worked example right now — the script being narrated over this slide is the same kind of artifact the module teaches them to write.

Narration for this slide

Welcome to Module 3. The last module was about how motion gets built — Remotion, hyperframe-style sequences, and the project structure around them. This module is about what the motion says. We are going to treat the script as the primary artifact: the thing a human writes and edits with care, from which the slides, the narration, the captions, and the timing are all derived. We will cover synthetic voice-over and what you should disclose about it, and the editing pass that keeps generated narration sounding like a person. And the worked example is close to home — this very module is a script-first artifact of exactly the kind we are about to discuss.

Slide 2 of 13 · 16:9

One source, three outputs

The article, the deck, and the narrated video are not three projects. They are three renderings of one content source.

  • The source: a structured document — sections, key claims, evidence, examples
  • Output one: the written article or course page, read at the reader's pace
  • Output two: the deck — presented live, or exported as HTML and editable PPTX
  • Output three: the narrated video — script segments, visuals, and a voice on a timeline
  • When the source changes, all three regenerate; none is maintained by hand

Maintain the source, not the outputs. The moment a fix is made in the deck or the video but not the source, the formats start lying to each other.

Slide notes

This is the structural argument of the module, and it is the same argument the rest of the course makes about motion generally: the artifact that matters is the diffable text source, and the rendered output is a build product. Applied to explainers and decks, the source is a structured document — for this school, a course module file with slides, speaker notes, and per-slide scripts; for a product team, it might be a release note, a spec, or a help-centre article. The article, the deck, and the video are derived from it.

The practical payoff is maintenance. Product material goes stale constantly — pricing, screenshots, feature names. When the deck and the video are hand-built, they decay silently and someone discovers the out-of-date claim in front of a customer. When they are generated from the source, updating the source and re-running the pipeline is the whole job. Huashu Design's slide-deck capability is a useful concrete reference here: it produces an HTML deck for browser presentation plus an editable PPTX, where the export translates real DOM styles into actual PowerPoint text frames rather than images — so even the 'editable in PowerPoint' requirement does not force you back to hand-maintained decks.

Be honest about the boundary: one source, three outputs does not mean one text pasted into three templates. The formats have different grammars — the article can sustain nuance and caveats, the deck compresses to claims and evidence, the video adds time and voice. The source has to carry enough structure that each output can make its own selection. That is what the next two slides cover.
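One way to picture "enough structure that each output can make its own selection" is a source record whose fields are read differently by each renderer. The sketch below is illustrative only: the `Section` fields and the three `render_*` functions are hypothetical names, not the school's actual module format.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Section:
    """One section of the content source (hypothetical shape)."""
    heading: str
    claim: str
    evidence: List[str] = field(default_factory=list)
    caveats: List[str] = field(default_factory=list)
    narration: str = ""


def render_article(section: Section) -> str:
    # The article can sustain nuance, so it keeps evidence and caveats.
    parts = [section.heading, section.claim, *section.evidence, *section.caveats]
    return "\n\n".join(parts)


def render_deck_slide(section: Section) -> List[str]:
    # The deck compresses to the claim and its evidence; caveats are left out.
    return [section.claim, *section.evidence]


def render_video_beat(section: Section) -> Dict[str, str]:
    # The video adds a voice: the claim goes on screen, the narration is spoken.
    return {"on_screen": section.claim, "narration": section.narration}
```

All three functions read the same record, so a correction made to `claim` reaches every output on the next run.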

Narration for this slide

Here is the structural idea. The article, the deck, and the narrated video are not three separate projects — they are three renderings of one content source. The source is a structured document: sections, claims, evidence, examples. The article renders it for reading. The deck renders it for presenting — and tools in this space can export both an HTML deck and an editable PowerPoint file from the same source. The video renders it onto a timeline with a voice. The rule that makes this work is simple and strict: maintain the source, never the outputs. The moment you fix something in the deck but not in the source, the formats start drifting apart, and they drift in front of an audience.

Slide 3 of 13 · 16:9

Writing scripts agents can act on

A script an agent can act on is structured like a screenplay, not like an essay: scenes, beats, and on-screen anchors.

  • Scenes: one section of the source becomes one scene with a stated purpose
  • Beats: one idea per beat — each beat becomes one slide or one shot
  • On-screen anchors: the exact words and numbers that appear on screen, written into the script
  • Narration says more than the screen shows; the screen never shows what narration ignores
  • 60–120 words per beat at conversational pace is roughly 25–50 seconds of audio

If the on-screen words are not in the script, the agent will invent them — and invented copy is where generated explainers go wrong first.

Slide notes

The format matters because the script is doing double duty: it is the narration a voice will read, and it is the specification the agent uses to generate visuals. An essay-shaped script gives the agent prose to summarise, and summarisation is where invention creeps in. A screenplay-shaped script — scenes, beats, anchors — gives the agent a one-to-one mapping: this beat becomes this slide, these words go on screen, this sentence gets read aloud.

The on-screen anchors deserve the most emphasis. The single biggest quality lever found in this school's motion work, and stated in the school's article on Remotion and Hyperframes, is whether the brief contains the actual words that will appear on screen. The same is true at the script level: numbers, product names, claims, and headings should be written into the script as anchors. An agent asked to 'pick out the key point' will pick a plausible one, which is not the same as the right one.
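As a rough illustration of the screenplay shape, the scenes, beats, and anchors can be held in a small data structure and checked before any visuals are generated. This is a sketch under assumptions: the `Scene` and `Beat` classes and the `beats_missing_anchors` helper are hypothetical names, not part of any tool named in this module.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Beat:
    idea: str        # the one idea this beat carries
    narration: str   # the sentence or sentences read aloud
    anchors: List[str] = field(default_factory=list)  # exact on-screen words


@dataclass
class Scene:
    purpose: str
    beats: List[Beat] = field(default_factory=list)


def beats_missing_anchors(scenes: List[Scene]) -> List[str]:
    """Return the idea of every beat that has no usable on-screen anchor."""
    missing = []
    for scene in scenes:
        for beat in scene.beats:
            # Blank strings do not count: the agent would still have to invent copy.
            if not any(anchor.strip() for anchor in beat.anchors):
                missing.append(beat.idea)
    return missing
```

A beat that shows up in the returned list is one where the agent would have to choose the on-screen words itself.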

The word-count guidance is practical, not aesthetic. At a conversational reading pace of roughly 150 words per minute, a 60–120 word beat produces about 24 to 48 seconds of narration (the slide rounds this to 25–50), which is about as long as a single visual can hold attention without changing. Beats much longer than that need splitting; beats much shorter than ten seconds tend to feel like a slideshow on fast-forward. This school's own per-slide scripts sit in exactly that 60–120 word band, which is one reason they convert to video shots cleanly.
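The sizing arithmetic can be sketched in a few lines. The 150 words-per-minute pace, the 120-word upper bound, and the ten-second floor come from the notes above; the function names are hypothetical, and the estimate is word count only, so measured TTS audio will differ.

```python
from typing import Optional

WORDS_PER_MINUTE = 150   # conversational pace quoted in the notes
MAX_WORDS_PER_BEAT = 120
MIN_SECONDS_PER_BEAT = 10


def estimated_seconds(script: str, wpm: int = WORDS_PER_MINUTE) -> float:
    """Estimate narration length from word count alone."""
    return len(script.split()) * 60 / wpm


def sizing_warning(script: str) -> Optional[str]:
    """Return a warning for beats outside the guidance, else None."""
    if len(script.split()) > MAX_WORDS_PER_BEAT:
        return "too long: split this beat"
    if estimated_seconds(script) < MIN_SECONDS_PER_BEAT:
        return "too short: may feel like a slideshow on fast-forward"
    return None
```

At 150 words per minute, 60 words comes to 24 seconds and 120 words to 48 seconds, which is where the rounded 25 to 50 second range comes from.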

Narration for this slide

So what does a script an agent can act on look like? It looks like a screenplay, not an essay. Scenes — one per section of the source, each with a stated purpose. Beats — one idea each, and each beat becomes one slide or one shot. And on-screen anchors: the exact words and numbers that will appear on screen, written into the script itself. That last one is the big quality lever. If the anchors are not in the script, the agent will invent them, and invented copy is where these things go wrong first. One sizing rule to remember: sixty to a hundred and twenty words per beat reads as roughly twenty-five to fifty seconds of narration — about as long as one visual can carry.

Slide 4 of 13 · 16:9

Slide structure generated from the script, not alongside it

Each beat type maps to a slide treatment. The mapping is mechanical on purpose — the judgment went into the script.

Beat in the script | Slide / shot treatment | What goes on screen
Hook or claim | Single statement slide | The claim, in the script's exact words
Definition or new term | Term-and-definition card | Term plus a one-sentence definition
Process or flow | Diagram, five nodes or fewer per shot | Node labels named in the script
Comparison or before/after | Two-column contrast | Column heads and three to five rows
Evidence or worked example | Table or annotated figure | The numbers and names, never the prose
Recap or close | Checklist or pull quote | Three to five lines, or the quotable sentence

Roughly thirty words on screen per beat is the ceiling. Numbers and names go on screen; full sentences go in the narration.

Slide notes

The phrase 'not alongside it' is doing real work in the title. The common failure is to write a script and then build a deck as a separate creative act — at which point the deck drifts from the script, the narration no longer matches what is on screen, and the video assembled from both feels subtly broken even when no single element is wrong. Generating the slide structure from the script keeps the two locked together: the beat's type determines the treatment, the beat's anchors become the on-screen text.

This table is a condensed version of the element-to-template mapping this school's own video pipeline uses to turn course modules into chapter videos: section headings become teaching beats, definitions get a term-and-definition card, diagrams are capped at about five nodes per shot and split when larger, tables are trimmed to roughly five rows, code is cut to the few lines the narration actually explains, and quotes become pull-quote shots. The cap of about thirty words on screen per beat comes from the same convention.
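A mechanical mapping of this kind can be as plain as a lookup table plus two limits. The sketch below is an assumption about how such a mapping might be written, not the school's pipeline code; the beat-type keys and helper names are hypothetical, while the five-node and thirty-word limits are the ones stated above.

```python
from typing import Dict, List

# Beat type -> slide or shot treatment, condensed from the table on this slide.
TREATMENTS: Dict[str, str] = {
    "claim": "single statement slide",
    "definition": "term-and-definition card",
    "process": "diagram",
    "comparison": "two-column contrast",
    "evidence": "table or annotated figure",
    "recap": "checklist or pull quote",
}

MAX_NODES_PER_SHOT = 5
MAX_WORDS_ON_SCREEN = 30


def treatment_for(beat_type: str) -> str:
    # An untyped beat raises KeyError: the fix belongs upstream, in the script.
    return TREATMENTS[beat_type]


def split_nodes(nodes: List[str], size: int = MAX_NODES_PER_SHOT) -> List[List[str]]:
    """Split a large diagram into consecutive shots of at most `size` nodes."""
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def over_word_cap(anchors: List[str], cap: int = MAX_WORDS_ON_SCREEN) -> bool:
    """True when the on-screen anchors for one beat exceed the word ceiling."""
    return sum(len(anchor.split()) for anchor in anchors) > cap
```

Nothing here makes a design decision. A beat with an unknown type fails loudly instead of being guessed at.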

The deeper point is about where judgment lives. Making the mapping mechanical is not an abdication of design — it is a relocation of design. The taste decisions happen in the script: what to claim, what to anchor, what to cut. The slide generation is then deliberately boring, which is exactly what makes it repeatable and reviewable. When a generated slide looks wrong, the fix is almost always upstream in the script, not downstream in the layout.

Narration for this slide

Once the script has typed beats, slide structure stops being a separate creative act. Each beat type maps to a treatment: a claim becomes a single statement slide, a definition becomes a term card, a process becomes a diagram with five nodes or fewer, a comparison becomes two columns, evidence becomes a small table, and the close becomes a checklist or a pull quote. The on-screen text comes straight from the script's anchors. Keep about thirty words on screen per beat as the ceiling — numbers and names on screen, full sentences in the narration. The mapping is mechanical on purpose. The judgment already happened when you wrote the script.

Slide 5 of 13 · 16:9

The narrated explainer pipeline

From a per-slide script to a reviewed MP4. The agent runs the middle; humans hold both ends.

Six-stage pipeline diagram. A human-written script with one segment per slide feeds text-to-speech and timing measurement, which produces a timeline mapping each narration segment to a visual. Visuals are rendered per segment from templates and brand tokens, then assembled into an MP4 with the music ducked under the narration and captions packaged alongside. A human review pass checks pacing, accuracy, voice, and captions, and a dashed feedback line returns from the review pass to the script, because fixes go to the script and timeline rather than the rendered video.
The script and the final review are human-led; TTS, timing, the timeline, per-segment visuals, and assembly are agent-run. The dashed line is the discipline that holds it together: fixes go to the script and the timeline, never to the MP4.

Narration drives duration. Each shot is as long as its narration segment plus breathing room — the visuals fit the words, not the other way around.

Slide notes

Walk the six stages and name the owner of each. Stage one is the script, one segment per slide or beat, written and edited by a human — the previous slides covered why. Stage two is text-to-speech and timing: the agent generates one audio file per segment and measures its duration; captions and the transcript are generated from the script text rather than transcribed back from the audio, so they are exact. Stage three is the timeline: a small, diffable file that maps each segment to its visual, with start times and durations derived from the measured audio. Stage four renders the visuals per segment from templates and brand tokens, with the on-screen text taken from the script's anchors. Stage five assembles the MP4: segments stitched in order, music ducked under the narration, captions packaged with the output. Stage six is a human watching the whole thing end to end.

The ordering principle worth calling out is that narration drives duration. Each shot lasts as long as its narration plus a little breathing room, never shorter than the visual's own animation needs. Pipelines that go the other way — fixed-length visuals with narration squeezed to fit — produce the rushed, breathless pacing people associate with cheap generated video. This is also how the Huashu narrated-animation pipeline is structured: voice generated first, durations measured, a timeline file generated from those durations, and the visuals driven from the timeline, with the music ducked under the voice in the final mix.
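The "narration drives duration" rule can be sketched as a small timeline builder. This is an illustration under assumptions: the `Shot` record, the `build_timeline` function, and the half-second default for breathing room are hypothetical, and the measured audio durations are taken as given.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Shot:
    segment_id: str
    start: float     # seconds from the start of the video
    duration: float  # seconds on screen


def build_timeline(
    measured_audio: List[Tuple[str, float]],
    breathing_room: float = 0.5,  # assumed value; tune per project
    min_animation: Optional[Dict[str, float]] = None,
) -> List[Shot]:
    """Derive shot starts and lengths from measured narration durations."""
    floors = min_animation or {}
    shots: List[Shot] = []
    cursor = 0.0
    for segment_id, narration_seconds in measured_audio:
        # Narration plus breathing room, never shorter than the animation needs.
        duration = max(narration_seconds + breathing_room, floors.get(segment_id, 0.0))
        shots.append(Shot(segment_id, cursor, duration))
        cursor += duration
    return shots
```

The visuals never appear as an input to the duration. They are rendered to fit whatever length the narration produced.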

The dashed feedback line is the same discipline as everywhere else in this course: when the review finds a problem, the fix is made in the script or the timeline and the video is re-rendered. Editing the MP4 directly would break the link between source and output and put you back in the world of hand-maintained video that this whole approach exists to escape.

Narration for this slide

Here is the pipeline end to end. A human writes the script, one segment per slide. The agent generates the voice — one audio file per segment — and measures how long each one runs. Those measured durations produce a timeline: a small text file mapping every segment to its visual. The visuals are rendered per segment from templates and brand tokens, with the on-screen words taken from the script. Then everything is assembled into an MP4, with the music ducked under the narration and captions packaged alongside. Finally, a human watches the whole thing. Two rules hold it together: narration drives duration, so the visuals fit the words — and fixes go to the script and the timeline, never to the rendered video.

Slide 6 of 13 · 16:9

Voice: keeping generated narration in the brand's register

Generated narration defaults to the average voice of the internet: enthusiastic, vague, and slightly too pleased with itself. The brand's register has to be specified, not hoped for.

  • Write the register down: sentence length, person, contractions, banned phrases, how claims are hedged
  • Put it in the harness or skill, not in each prompt — voice is a standing rule, not a request
  • Give the agent two or three paragraphs of real, approved writing as the reference
  • Ban the explainer clichés explicitly: "in today's fast-paced world", "seamlessly", "game-changing", stacked rhetorical questions
  • Read the script aloud before generating audio — the ear catches what the eye forgives

Voice drift is not a one-off bug to fix in review. It is the default, and it returns every run unless the register is encoded where the agent always sees it.

Slide notes

The failure mode here is well documented across agent design work and has a name in the open-source skill community: slop — the visual and verbal greatest common denominator of the training data. In narration it shows up as breathless enthusiasm, vague superlatives, marketing transitions, and a tone that sounds like every product launch video at once. None of it is wrong sentence by sentence, which is exactly why it survives a quick read; it is wrong in aggregate, because it carries no information about who is speaking.

The fix is the same instrument used everywhere else in this curriculum: encode the rule where the agent always sees it. A voice section in the project's harness or skill — sentence-length range, first or second person, contraction policy, how claims are hedged, a banned-phrase list, and two or three paragraphs of approved writing as a reference — does more for narration quality than any amount of per-prompt pleading. The Huashu skill takes the same approach for visuals, maintaining an explicit list of slop elements that are banned unless the brand itself uses them; the narration equivalent is a banned-phrase list plus a positive reference sample.
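Part of a written register can be checked mechanically before a human reads the script. The sketch below assumes a plain-text script and uses the banned phrases named on this slide; the function names are hypothetical, and the rhetorical-question check is a simple heuristic based on punctuation.

```python
import re
from typing import List

BANNED_PHRASES = [
    "in today's fast-paced world",
    "seamlessly",
    "game-changing",
]


def find_banned(script: str, banned: List[str] = BANNED_PHRASES) -> List[str]:
    """Return the banned phrases that appear in the script, case-insensitively."""
    text = script.lower().replace("\u2019", "'")  # treat curly apostrophes as straight
    return [phrase for phrase in banned if phrase in text]


def has_stacked_questions(script: str) -> bool:
    """True when two or more consecutive sentences end with a question mark."""
    sentences = re.split(r"(?<=[.!?])\s+", script.strip())
    run = 0
    for sentence in sentences:
        run = run + 1 if sentence.endswith("?") else 0
        if run >= 2:
            return True
    return False
```

A check like this only catches the phrases someone thought to list. The positive reference sample and the read-aloud pass still carry most of the weight.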

Reading the script aloud before generating audio is a cheap, old discipline that survives the move to synthetic voices. Sentences that are fine on the page (long subordinate clauses, three-item lists inside three-item lists) fall apart when spoken, and a synthetic voice will read them with perfect, oblivious fluency. The ear catches the problem in seconds. The English variant, for example Australian versus another standard spelling and phrasing, should also be settled at this layer, because the TTS model will pronounce whatever it is given.

Narration for this slide

Now the part that decides whether the result sounds like your organisation or like everyone's launch video at once: voice. Generated narration defaults to the average voice of the internet — enthusiastic, vague, slightly too pleased with itself. The fix is not better prompting on the day; it is writing the register down where the agent always sees it. Sentence length, person, contractions, how you hedge claims, and a banned-phrase list — no 'seamlessly', no 'game-changing', no stacked rhetorical questions. Give the agent a few paragraphs of real, approved writing as the reference. And read the script aloud before you generate any audio, because the ear catches what the eye forgives.

Slide 7 of 13 · 16:9

Synthetic voice-over, and what to disclose

Text-to-speech is good enough for explainers and course material, and it is part of what makes the pipeline repeatable. That raises questions a team should answer once, in policy, not per video.

  • Why synthetic: re-recording is free, the voice is consistent across updates, and captions stay exact
  • Where it falls short: emotional range, humour, and material where the speaker's identity is the point
  • Disclose when a reasonable viewer would assume a human: testimonials, named presenters, anything resembling a person's endorsement
  • Never clone a real person's voice without written consent — and revisit consent when the use changes
  • As of June 2026, AI-disclosure rules differ by platform and jurisdiction; check the destination's policy before publishing

The honest default is simple: if the viewer would feel misled on learning the voice was synthetic, say so up front.

Slide notes

Make the practical case before the ethical one, because both matter. Synthetic narration is what makes the one-source, three-outputs model maintainable: when the source changes, the narration regenerates in minutes, in the same voice, with no studio booking and no drift between the script and what was actually said. Captions and transcripts are exact because they come from the script rather than from transcription. A pipeline built on a human narrator is better in some ways — warmth, emphasis, credibility — and dramatically worse at being re-run every time the product changes. Many teams land on a hybrid: synthetic for the frequently-updated material, recorded humans for the flagship pieces.

Then be straightforward about disclosure. The line that matters is viewer expectation: an explainer narrated by an obviously generic presenter voice carries a different implicit claim than a video where a named person appears to speak, or a testimonial. Where a reasonable viewer would assume a human — and would feel misled to learn otherwise — disclose. A short line in the description or end card costs nothing and protects trust, which is the asset these videos exist to build. Voice cloning of real people is a separate, harder line: written consent, scoped to the use, revisited when the use changes.

Keep the regulatory claim conservative and dated. As of June 2026, platform rules and jurisdictional requirements for AI-generated content disclosure differ and keep moving; some platforms require labels for synthetic media in specific categories. The durable advice is to settle a team policy once — when to disclose, what wording, who signs off on voice choices — and to check the destination platform's current rules at publish time rather than relying on what was true last quarter.

Narration for this slide

Let's talk about the voice itself. Most of this pipeline runs on text-to-speech, and that is a feature: re-recording is free, the voice stays consistent across updates, and the captions are exact because they come from the script. Where synthetic voices fall short is emotional range, humour, and anything where the speaker's identity is the point. That brings up disclosure. The honest default is simple: if a reasonable viewer would assume a human and feel misled to learn otherwise, say so — a line in the description or the end card is enough. Never clone a real person's voice without written consent. And as of June 2026, platform rules on AI disclosure vary, so check the destination's policy before you publish, and settle your team's own policy once rather than per video.

Slide 8 of 13 · 16:9

The editing pass: pacing, emphasis, and cutting the filler

The script gets a human edit before audio is generated, and the assembled video gets a human watch before it ships. Neither pass is optional.

  • Cut the filler first: throat-clearing intros, restated headings, and any sentence that only says the next sentence is coming
  • Pacing: no automated check measures whether a viewer had time to read the slide — watch it at real speed
  • Emphasis: one idea per beat actually lands; three ideas per beat means none do
  • Honesty: remove claims the source does not support — generated transitions love to over-promise
  • Listen to the generated audio for mispronunciations, robotic phrasing, and sentences that need splitting

Agents reliably produce narration that is technically correct and slightly too fast, slightly too smooth, and slightly too confident. The editing pass exists to take all three down a notch.

Slide notes

There are two distinct passes and they catch different things. The script edit happens before any audio exists, which is when changes are cheapest: cut filler, tighten claims, fix the register, split sentences that will not survive being spoken. The video watch happens after assembly and is mostly about time: pacing, emphasis, and whether the on-screen text and the narration land together. Teams that skip the second pass because 'the script was already approved' ship videos where a dense diagram appears for four seconds while the narration glides past it.

The pacing point deserves the strongest wording because it is the failure no automated gate catches. Lint can find timing overlaps, frame checks can find dead animations, validators can confirm every segment is covered — nothing in the toolchain measures whether a human had time to read the headline before it left the screen. The school's motion article makes the same observation about feature clips: generated motion is reliably technically correct and slightly too fast. Budget a real-time watch for every video, not a scrub.

The honesty item is about generated connective tissue. When an agent writes transitions between beats, it tends to inflate: 'this changes everything', 'as we have seen, the results are dramatic'. Each phrase is small; the cumulative effect is a video that promises more than the source supports, which is exactly the press-release quality this module's summary warns against. The edit removes those claims or scales them back to what the evidence carries.

Narration for this slide

Two human passes hold the quality line. The first is on the script, before any audio exists. Cut the filler — the throat-clearing intros, the sentences that only announce the next sentence. Check every claim against the source, because generated transitions love to over-promise. Then generate the audio and do the second pass: watch the assembled video at real speed. This is where you catch pacing, and pacing is the failure no automated check finds — nothing in the toolchain knows whether a viewer had time to read the slide. Listen for mispronunciations and sentences that need splitting. The pattern to expect is consistent: agent narration runs slightly too fast, slightly too smooth, and slightly too confident. The edit takes all three down a notch.

Slide 9 of 13 · 16:9

Accessibility: captions, transcripts, and contrast in motion

Script-first pipelines make accessibility cheap, because the text already exists. That removes the usual excuse.

  • Captions generated from the script are exact — sync them per segment and ship them with every video
  • Publish the transcript alongside the video; it is the script, lightly formatted, and it is also the searchable version
  • On-screen text follows the same contrast rules as product UI — motion does not suspend WCAG
  • Hold text on screen long enough to be read twice at a normal reading pace
  • Avoid flashing and rapid full-screen changes; respect reduced-motion preferences in any interactive or embedded version

In a script-first pipeline, captions and transcripts are not extra work — they are the same text, packaged twice. Treat their absence as a failed gate, not a nice-to-have.

Slide notes

The structural advantage is worth stating plainly: in a script-first pipeline, the narration text exists before the audio does, so captions and transcripts are derivations rather than transcriptions. Captions generated from the script have no recognition errors; the only work is timing them to the audio segments, which the pipeline already measures for shot durations. The transcript is the script with headings. Both should be packaged with the render as a matter of routine — the school's article on this topic puts captions and narration pacing in the render gate, not in a retrofit, and that placement is the right one.
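Because the text and the segment timings already exist, a caption file is mostly string formatting. The sketch below writes WebVTT cues, one per narration segment, from script text and measured timings. The function names and the `(text, start, duration)` tuple shape are assumptions, and a real pipeline would probably split long segments into shorter cues.

```python
from typing import List, Tuple


def vtt_timestamp(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_vtt(segments: List[Tuple[str, float, float]]) -> str:
    """Build a WebVTT file body from (script_text, start, duration) tuples."""
    lines = ["WEBVTT", ""]
    for text, start, duration in segments:
        lines.append(f"{vtt_timestamp(start)} --> {vtt_timestamp(start + duration)}")
        lines.append(text)  # the script text itself, so no recognition errors
        lines.append("")
    return "\n".join(lines)
```

The same segment list can feed the transcript, which is why neither output needs a separate authoring step.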

On-screen text in motion is still text: contrast ratios, minimum sizes at the destination's actual playback size, and reading time all apply. A 28-pixel caption that passes on a desktop preview is unreadable on a phone where most of the audience will watch. The reading-time rule of thumb — long enough to read twice at normal pace — exists because viewers are also listening, glancing away, and reading captions; text that disappears the moment a fast reader finishes it has failed half the audience.
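The contrast part of this is checkable in code. The sketch below computes the WCAG 2.x contrast ratio between a text colour and its background from 8-bit sRGB values. The function names are hypothetical, and which threshold applies (for example 4.5:1 for normal-size text at level AA) is a policy choice the team still has to make.

```python
from typing import Tuple

RGB = Tuple[int, int, int]


def _linear(channel: int) -> float:
    # sRGB to linear light. WCAG 2.0/2.1 print the threshold as 0.03928;
    # the result is the same for 8-bit channel values.
    c = channel / 255
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: RGB) -> float:
    r, g, b = (_linear(value) for value in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(foreground: RGB, background: RGB) -> float:
    """WCAG contrast ratio, from 1.0 (identical) to 21.0 (black on white)."""
    l1, l2 = relative_luminance(foreground), relative_luminance(background)
    lighter, darker = max(l1, l2), min(l1, l2)
    return (lighter + 0.05) / (darker + 0.05)
```

Running this against the brand tokens used for on-screen text catches low-contrast pairs before rendering. It does not check size at the destination's playback size or reading time.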

Motion-specific harms need naming even in slide-style explainers: flashing content, rapid full-screen luminance changes, and aggressive parallax can be genuinely harmful for photosensitive and vestibular-sensitive viewers, not merely unpleasant. Rendered MP4s cannot respond to a reduced-motion preference, so restraint has to be designed in; HTML or embedded interactive versions can and should respect prefers-reduced-motion. Module 4 picks this thread up properly for motion inside the product.

Narration for this slide

Accessibility is where this pipeline pays for itself. Because the script exists before the audio, captions are exact — they come from the script, not from speech recognition — and the transcript is the same text again, lightly formatted and searchable. Ship both with every video, and treat their absence as a failed gate. On-screen text is still text: the same contrast rules as your product UI apply, and it has to stay up long enough to be read twice at a normal pace, because viewers are also listening. And avoid flashing or rapid full-screen changes — a rendered video cannot respond to a reduced-motion preference, so the restraint has to be designed in from the start.

Slide 10 of 13 · 16:9

Decks and exports: where the outputs actually go

The deck is the output most likely to leave your pipeline and live in someone else's hands. Export it in a form that survives the journey.

  • HTML deck: presented from a browser, stays connected to the source, regenerates on change
  • Editable PPTX: real text frames translated from the deck's styles — not screenshots pasted onto slides
  • Narrated video: the same beats with TTS audio, captions, and a music bed, delivered as MP4
  • Stills and PDF: the low-fidelity exports for review threads, docs, and print
  • Name the canonical version in the repo; every export carries a generated-from note

An export the audience can edit is also an export that can drift. Hand over editable files knowingly, with the source named, not as the default.

Slide notes

Decks deserve their own slide because they are the output with the messiest social life. Articles stay on your site and videos are consumed read-only, but decks get forwarded, presented by other people, and edited the night before a meeting by someone who has never seen your pipeline. The export strategy has to acknowledge that reality rather than pretend everyone will present from the canonical HTML version.

The export options are concrete and current. The HTML deck is the native output: presented from a browser, styled by the same tokens as everything else, regenerated when the source changes. For audiences that need PowerPoint, the Huashu Design toolchain demonstrates the standard worth holding any pipeline to — its export translates the deck's computed styles into actual PowerPoint text frames, so the receiving team can edit the words, rather than receiving images of slides pretending to be a deck. The narrated video version reuses the same beats with the TTS narration from earlier in this module. Stills and PDF cover review threads and documentation. As of June 2026 these export paths are skill- and script-based rather than polished products, so expect to keep a small amount of glue code in the repo.

The governance point matters more than the formats: every export should carry a note saying what it was generated from and when, and the repository should name the canonical version. Editable exports drift the moment they leave — that is their purpose — so the team needs to know which artifact to trust when the deck that comes back no longer matches the video. The answer should always be the source, which is the same answer this module started with.
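A generated-from note can be produced by the export step itself. The sketch below is one possible shape, not a convention from any tool named here: the function name, the wording of the note, and the use of a short SHA-256 digest to identify the source version are all assumptions.

```python
import hashlib
from datetime import date


def generated_from_note(source_path: str, source_text: str, generated_on: date) -> str:
    """Build a one-line provenance note to stamp onto every export."""
    # A short content digest lets a reader tell which source version was used.
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()[:12]
    return (
        f"Generated from {source_path} (sha256 {digest}) "
        f"on {generated_on.isoformat()}. Edit the source, not this file."
    )
```

When an edited deck comes back, comparing the digest in its note with the current source shows whether the source has changed since the export was made.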

Narration for this slide

A quick word on decks specifically, because the deck is the output most likely to leave your pipeline and live in someone else's hands. The native form is an HTML deck — presented from a browser, regenerated when the source changes. For audiences that need PowerPoint, the bar to hold any export to is real, editable text frames, not screenshots pasted onto slides. The narrated video reuses the same beats with the synthetic voice and captions. Stills and PDF cover everything else. And because an editable export is also an export that can drift, every file you hand over should say what it was generated from, and your repo should name the canonical version. When the edited deck comes back different, the source wins.

Slide 11 of 13 · 16:9

Worked example: a course module turned into a narrated explainer

This school's modules are written as slides plus per-slide scripts — the format you are watching right now. Here is what one module's trip through the video pipeline looks like.

| Stage | What happened | Where the human time went |
| --- | --- | --- |
| Source | 13 slides, each with a 60–120 word script, written for the course and reused as-is | Already paid for when the module was authored |
| Shot plan | Beats mapped to templates: opener, agenda, teaching beats, checkpoint, pull-quote close | Reviewing the mapping; splitting one oversized diagram beat |
| Narration + timing | TTS per segment at conversational pace; durations measured; shot length = narration plus breathing room | Listening for mispronunciations; respelling two product names |
| Visuals + assembly | Templates filled from the script's anchors and brand tokens; segments stitched; music ducked under the voice | None: agent-run against the existing template library |
| Validation | Automated check: every slide covered, no leftover placeholder text, durations cover the audio | Reading the report; zero errors required before review |
| Review | Watched end to end at real speed before publishing | The full runtime, plus one pacing fix routed back to a script |

The expensive artifact — the script — was written once, for the course. The video cost a shot plan, two pronunciation fixes, a validation report, and one real-time watch.

Slide notes

This trace comes from the school's own production setup: course modules are authored as slide definitions with speaker notes and a per-slide narration script, and a separate video pipeline turns each module into a chapter video using a template library, per-segment TTS, and an automated validator. The number of shots and the specific fixes vary by module; the shape of the run is stable and is what the table shows. Present it as a representative trace of a working pipeline, not a benchmark.

The details worth dwelling on are the ones that confirm the module's claims. The script was not written for the video — it was written as course material, and the video pipeline consumed it unchanged, which is the one-source argument made concrete. Shot length was derived from measured narration duration plus breathing room, never shorter than the template's own animation envelope, which is the narration-drives-duration rule. The validator is the automated gate: it checks that every slide of the source is covered by a shot, that no template placeholder text leaked into the output, and that each shot's duration covers its audio — and the convention is zero errors before a human watches it. The pacing problem found in review was fixed by editing one script segment and re-running, not by trimming the video.
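The duration rule and the three validator checks described above can be sketched in a few lines. This is a minimal illustration, not the school's actual validator: the `Shot` fields, the one-second breathing room default, and the placeholder patterns are all assumptions.

```python
import re
from dataclasses import dataclass


@dataclass
class Shot:
    slide: int               # slide number in the source that this shot covers
    on_screen_text: str      # text rendered into the template
    audio_seconds: float     # measured narration duration
    duration_seconds: float  # length of the shot on the timeline


def shot_length(audio_seconds: float, template_min_seconds: float,
                breathing_room: float = 1.0) -> float:
    """Narration plus breathing room, never shorter than the template's envelope."""
    return max(audio_seconds + breathing_room, template_min_seconds)


# Assumed placeholder conventions; a real template library defines its own.
PLACEHOLDER = re.compile(r"\{\{.*?\}\}|\bTODO\b|lorem ipsum", re.IGNORECASE)


def validate(slide_count: int, shots: list[Shot]) -> list[str]:
    """Return a list of errors; an empty list means the gate passes."""
    errors = []
    covered = {shot.slide for shot in shots}
    for number in range(1, slide_count + 1):
        if number not in covered:
            errors.append(f"slide {number}: no shot covers it")
    for shot in shots:
        if PLACEHOLDER.search(shot.on_screen_text):
            errors.append(f"slide {shot.slide}: placeholder text in output")
        if shot.duration_seconds < shot.audio_seconds:
            errors.append(f"slide {shot.slide}: shot shorter than its audio")
    return errors


shots = [
    Shot(1, "Narrated Explainers and Decks", 31.2, shot_length(31.2, 6.0)),
    Shot(2, "{{title}}", 28.0, 20.0),  # deliberately broken for the example
]
errors = validate(3, shots)
for line in errors:
    print(line)
```

In this example the report lists three errors: slide 3 is uncovered, and slide 2 both leaks a placeholder and is shorter than its audio. Under the zero-errors convention, none of this reaches a human reviewer until the list is empty.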

Be equally clear about what the table does not show. Building the template library and the pipeline itself was real up-front work, spread over many modules — this is the second-video-is-a-pipeline-question argument that Module 5 takes up properly. And the human review pass does not get cheaper per video: it costs the runtime every time, which is exactly why it is the gate worth protecting when everything around it gets faster.

Narration for this slide

Let's trace a real case: one of this school's course modules becoming a narrated explainer — and yes, the module you are watching is the same kind of artifact. The source already existed: thirteen slides, each with a sixty-to-one-twenty word script, written for the course. The pipeline mapped each beat to a template, generated the narration per segment, measured the durations, and set each shot's length from its narration plus breathing room. Visuals were filled from the script's anchors and brand tokens, stitched together, music ducked under the voice. An automated validator confirmed every slide was covered and nothing was left as placeholder text. The human time went to a shot-plan review, two pronunciation fixes, and one real-time watch — which caught a pacing problem, fixed in the script, not in the video.

Slide 12 of 13 · 16:9

Exercise: convert one document into a script outline

Take one existing document — a release note, a help article, a spec, a course page — and turn it into a script outline. Do not generate anything yet; this is a writing exercise.

  • Pick a document the audience genuinely needs explained, not the one that is easiest
  • Break it into scenes and beats: one idea per beat, 60–120 words of narration each, beat type named
  • Write the on-screen anchors for every beat — the exact words, numbers, and labels
  • Mark three beats where generated narration would most likely drift off-register, and write those three word-for-word
  • Note what you would disclose about the voice, and where the captions and transcript would be published

Keep the outline. Module 5's exercise asks you to define the recurring pipeline for a format — this outline is the first run of that format.

Slide notes

The exercise is deliberately a writing task with no tooling, for the same reason the module keeps insisting the script is the design: the part of this work that agents do not do well is exactly the part the exercise practises. Choosing what the document is really about, cutting it into beats, deciding what earns a place on screen, and writing narration in the organisation's own register are judgment calls. The generation steps that follow are mechanical by comparison.

The step that participants find hardest, and learn the most from, is writing the on-screen anchors. It forces a decision about what the viewer must see versus what they only need to hear, and it surfaces how much of the source document is connective prose that no output needs. The three word-for-word beats are the voice exercise in miniature: the beats most likely to drift are usually the opening hook, anything making a claim about value, and the close — which is also where press-release tone does the most damage.

If this is run as a group session, have people swap outlines and read each other's narration aloud. Hearing someone else read your words at speaking pace exposes pacing and register problems faster than any silent review, and it previews the experience of hearing a synthetic voice read the same words. The disclosure and accessibility notes at the end are small on purpose; the point is to make answering them a habit that happens at outline time, not at publish time.
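The exercise itself stays a writing task. Once the outline exists, though, its sizing rules are mechanical enough to check. The sketch below assumes a hypothetical `Beat` structure and takes its pace from the module's own sizing rule (60 words in roughly 25 seconds, so about 2.4 words per second); the thirty-word on-screen ceiling is the one the module gives for slide text.

```python
from dataclasses import dataclass, field

# Pace implied by the module's sizing rule: 60 words is roughly 25 seconds.
WORDS_PER_SECOND = 60 / 25


@dataclass
class Beat:
    beat_type: str                 # e.g. "claim", "definition", "process"
    narration: str                 # the words to be spoken
    anchors: list[str] = field(default_factory=list)  # exact on-screen text


def estimated_seconds(beat: Beat) -> float:
    """Rough narration length; the real pipeline measures the audio instead."""
    return len(beat.narration.split()) / WORDS_PER_SECOND


def check_beat(beat: Beat, min_words: int = 60, max_words: int = 120,
               max_on_screen_words: int = 30) -> list[str]:
    """Return sizing problems for one beat; an empty list means it fits."""
    problems = []
    words = len(beat.narration.split())
    if not min_words <= words <= max_words:
        problems.append(f"narration is {words} words, expected {min_words} to {max_words}")
    if not beat.anchors:
        problems.append("no on-screen anchors written")
    on_screen = sum(len(anchor.split()) for anchor in beat.anchors)
    if on_screen > max_on_screen_words:
        problems.append(f"{on_screen} words on screen, ceiling is {max_on_screen_words}")
    return problems


good = Beat("claim", " ".join(["word"] * 90), ["One source, three outputs"])
thin = Beat("claim", "Too short to carry a shot.")
print(check_beat(good), round(estimated_seconds(good), 1))
print(check_beat(thin))
```

A check like this only confirms the outline fits the format. Whether each beat carries one idea, and whether the anchors are the right words, remains the judgment the exercise is there to practise.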

Narration for this slide

Your turn. Pick one existing document — a release note, a help article, a spec, or a course page — and convert it into a script outline on paper. Break it into scenes and beats, one idea per beat, sixty to a hundred and twenty words of narration each, and name each beat's type. Then write the on-screen anchors: the exact words and numbers the viewer must see. Pick the three beats where generated narration would most likely drift off-register and write those three word-for-word yourself. Finally, note what you would disclose about the voice, and where the captions and transcript would live. Keep the outline — it becomes the first run of the pipeline you will define in Module 5.

Slide 13 of 13 · 16:9

Summary, and the bridge to motion in product

  • The script is the primary artifact: slides, narration, captions, and timing are derived from it
  • One source feeds the article, the deck, and the video — maintain the source, never the outputs
  • Narration drives duration; the pipeline measures the voice and fits the visuals to it
  • Synthetic voice-over makes the pipeline repeatable; disclose it where a viewer would otherwise assume a human
  • Two human passes hold the line: the script edit before audio, and the real-time watch before shipping

Module 4 turns from video to the product itself: motion inside the interface, and the duration, easing, and restraint rules an agent can apply without decorating everything that moves.

Slide notes

Recap by ownership rather than by sequence: humans own the script and the final watch; the agent owns text-to-speech, timing, the timeline, the per-segment visuals, and assembly; the automated gates own coverage, placeholders, and durations. Pacing, register, and honesty remain human judgments, and they remain the difference between an explainer that builds trust and one that merely exists.

The one-source discipline is the idea most worth restating, because it is the one teams break first under deadline pressure. The moment a fix lands in the deck or the MP4 but not in the source, the formats begin to disagree, and the disagreement is always discovered by an audience rather than by the team. The fix is cheap and procedural: name the canonical source, route every fix through it, and regenerate.
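The "route every fix through the source, then regenerate" procedure is easier to keep when something reports which outputs have fallen behind. The sketch below assumes each output records a digest of the source it was generated from; the file names and the manifest shape are hypothetical.

```python
import hashlib


def digest(text: str) -> str:
    """Content digest of the canonical source."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def stale_outputs(source_text: str, recorded: dict[str, str]) -> list[str]:
    """Outputs whose recorded source digest no longer matches the source."""
    current = digest(source_text)
    return sorted(name for name, seen in recorded.items() if seen != current)


# Hypothetical history: the source was fixed after the PowerPoint export.
source_v1 = "Narration drives duration."
source_v2 = "Narration drives duration, so the visuals fit the words."
recorded = {
    "deck.html": digest(source_v2),
    "deck.pptx": digest(source_v1),
    "explainer.mp4": digest(source_v2),
}
print(stale_outputs(source_v2, recorded))  # ['deck.pptx']
```

This catches outputs that lag behind the source. It cannot catch the opposite failure, a fix made by hand in an output, which is why the procedural rule still matters.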

Then set up Module 4 as a genuine change of subject. This module and the two before it were about motion as content — videos and decks that get watched. Module 4 is about motion as interface: transitions, micro-interactions, and state changes inside the product, where the questions are duration, easing, and restraint, and where the characteristic agent failure is not press-release narration but decorative animation applied to everything that moves. The instrument is the same — rules encoded where the agent always sees them — but the rules themselves are different, and that is where the course goes next.

Narration for this slide

Let's close. The script is the design: it is the artifact a human writes and edits with care, and the slides, narration, captions, and timing are all derived from it. One source feeds the article, the deck, and the video — maintain the source, never the outputs. Narration drives duration, so the visuals fit the words. Synthetic voice-over is what makes the pipeline repeatable, and the honest default is to disclose it wherever a viewer would otherwise assume a human. And two human passes hold the line: the script edit before any audio exists, and the real-time watch before anything ships. Module 4 changes the subject from video to the product itself — motion inside the interface, and the rules of duration, easing, and restraint that keep an agent from animating everything that moves. See you there.

Module transcript
Module 3, narrated slide by slide

Slide 1: Narrated Explainers and Decks

Welcome to Module 3. The last module was about how motion gets built — Remotion, hyperframe-style sequences, and the project structure around them. This module is about what the motion says. We are going to treat the script as the primary artifact: the thing a human writes and edits with care, from which the slides, the narration, the captions, and the timing are all derived. We will cover synthetic voice-over and what you should disclose about it, and the editing pass that keeps generated narration sounding like a person. And the worked example is close to home — this very module is a script-first artifact of exactly the kind we are about to discuss.

Slide 2: One source, three outputs

Here is the structural idea. The article, the deck, and the narrated video are not three separate projects — they are three renderings of one content source. The source is a structured document: sections, claims, evidence, examples. The article renders it for reading. The deck renders it for presenting — and tools in this space can export both an HTML deck and an editable PowerPoint file from the same source. The video renders it onto a timeline with a voice. The rule that makes this work is simple and strict: maintain the source, never the outputs. The moment you fix something in the deck but not in the source, the formats start drifting apart, and they drift in front of an audience.

Slide 3: Writing scripts agents can act on

So what does a script an agent can act on look like? It looks like a screenplay, not an essay. Scenes — one per section of the source, each with a stated purpose. Beats — one idea each, and each beat becomes one slide or one shot. And on-screen anchors: the exact words and numbers that will appear on screen, written into the script itself. That last one is the big quality lever. If the anchors are not in the script, the agent will invent them, and invented copy is where these things go wrong first. One sizing rule to remember: sixty to a hundred and twenty words per beat reads as roughly twenty-five to fifty seconds of narration — about as long as one visual can carry.

Slide 4: Slide structure generated from the script, not alongside it

Once the script has typed beats, slide structure stops being a separate creative act. Each beat type maps to a treatment: a claim becomes a single statement slide, a definition becomes a term card, a process becomes a diagram with five nodes or fewer, a comparison becomes two columns, evidence becomes a small table, and the close becomes a checklist or a pull quote. The on-screen text comes straight from the script's anchors. Keep about thirty words on screen per beat as the ceiling — numbers and names on screen, full sentences in the narration. The mapping is mechanical on purpose. The judgment already happened when you wrote the script.

Slide 5: The narrated explainer pipeline

Here is the pipeline end to end. A human writes the script, one segment per slide. The agent generates the voice — one audio file per segment — and measures how long each one runs. Those measured durations produce a timeline: a small text file mapping every segment to its visual. The visuals are rendered per segment from templates and brand tokens, with the on-screen words taken from the script. Then everything is assembled into an MP4, with the music ducked under the narration and captions packaged alongside. Finally, a human watches the whole thing. Two rules hold it together: narration drives duration, so the visuals fit the words — and fixes go to the script and the timeline, never to the rendered video.

Slide 6: Voice, keeping generated narration in the brand's register

Now the part that decides whether the result sounds like your organisation or like everyone's launch video at once: voice. Generated narration defaults to the average voice of the internet — enthusiastic, vague, slightly too pleased with itself. The fix is not better prompting on the day; it is writing the register down where the agent always sees it. Sentence length, person, contractions, how you hedge claims, and a banned-phrase list — no 'seamlessly', no 'game-changing', no stacked rhetorical questions. Give the agent a few paragraphs of real, approved writing as the reference. And read the script aloud before you generate any audio, because the ear catches what the eye forgives.

Slide 7: Synthetic voice-over, and what to disclose

Let's talk about the voice itself. Most of this pipeline runs on text-to-speech, and that is a feature: re-recording is free, the voice stays consistent across updates, and the captions are exact because they come from the script. Where synthetic voices fall short is emotional range, humour, and anything where the speaker's identity is the point. That brings up disclosure. The honest default is simple: if a reasonable viewer would assume a human and feel misled to learn otherwise, say so — a line in the description or the end card is enough. Never clone a real person's voice without written consent. And as of June 2026, platform rules on AI disclosure vary, so check the destination's policy before you publish, and settle your team's own policy once rather than per video.

Slide 8: The editing pass (pacing, emphasis, and cutting the filler)

Two human passes hold the quality line. The first is on the script, before any audio exists. Cut the filler — the throat-clearing intros, the sentences that only announce the next sentence. Check every claim against the source, because generated transitions love to over-promise. Then generate the audio and do the second pass: watch the assembled video at real speed. This is where you catch pacing, and pacing is the failure no automated check finds — nothing in the toolchain knows whether a viewer had time to read the slide. Listen for mispronunciations and sentences that need splitting. The pattern to expect is consistent: agent narration runs slightly too fast, slightly too smooth, and slightly too confident. The edit takes all three down a notch.

Slide 9: Accessibility (captions, transcripts, and contrast in motion)

Accessibility is where this pipeline pays for itself. Because the script exists before the audio, captions are exact — they come from the script, not from speech recognition — and the transcript is the same text again, lightly formatted and searchable. Ship both with every video, and treat their absence as a failed gate. On-screen text is still text: the same contrast rules as your product UI apply, and it has to stay up long enough to be read twice at a normal pace, because viewers are also listening. And avoid flashing or rapid full-screen changes — a rendered video cannot respond to a reduced-motion preference, so the restraint has to be designed in from the start.

Slide 10: Decks and exports, where the outputs actually go

A quick word on decks specifically, because the deck is the output most likely to leave your pipeline and live in someone else's hands. The native form is an HTML deck — presented from a browser, regenerated when the source changes. For audiences that need PowerPoint, the bar to hold any export to is real, editable text frames, not screenshots pasted onto slides. The narrated video reuses the same beats with the synthetic voice and captions. Stills and PDF cover everything else. And because an editable export is also an export that can drift, every file you hand over should say what it was generated from, and your repo should name the canonical version. When the edited deck comes back different, the source wins.

Slide 11: Worked example, a course module turned into a narrated explainer

Let's trace a real case: one of this school's course modules becoming a narrated explainer — and yes, the module you are watching is the same kind of artifact. The source already existed: thirteen slides, each with a sixty-to-one-twenty word script, written for the course. The pipeline mapped each beat to a template, generated the narration per segment, measured the durations, and set each shot's length from its narration plus breathing room. Visuals were filled from the script's anchors and brand tokens, stitched together, music ducked under the voice. An automated validator confirmed every slide was covered and nothing was left as placeholder text. The human time went to a shot-plan review, two pronunciation fixes, and one real-time watch — which caught a pacing problem, fixed in the script, not in the video.

Slide 12: Exercise, convert one document into a script outline

Your turn. Pick one existing document — a release note, a help article, a spec, or a course page — and convert it into a script outline on paper. Break it into scenes and beats, one idea per beat, sixty to a hundred and twenty words of narration each, and name each beat's type. Then write the on-screen anchors: the exact words and numbers the viewer must see. Pick the three beats where generated narration would most likely drift off-register and write those three word-for-word yourself. Finally, note what you would disclose about the voice, and where the captions and transcript would live. Keep the outline — it becomes the first run of the pipeline you will define in Module 5.

Slide 13: Summary, and the bridge to motion in product

Let's close. The script is the design: it is the artifact a human writes and edits with care, and the slides, narration, captions, and timing are all derived from it. One source feeds the article, the deck, and the video — maintain the source, never the outputs. Narration drives duration, so the visuals fit the words. Synthetic voice-over is what makes the pipeline repeatable, and the honest default is to disclose it wherever a viewer would otherwise assume a human. And two human passes hold the line: the script edit before any audio exists, and the real-time watch before anything ships. Module 4 changes the subject from video to the product itself — motion inside the interface, and the rules of duration, easing, and restraint that keep an agent from animating everything that moves. See you there.