Slide 1 — Narrated Explainers and Decks
Welcome to Module 3. The last module was about how motion gets built — Remotion, hyperframe-style sequences, and the project structure around them. This module is about what the motion says. We are going to treat the script as the primary artifact: the thing a human writes and edits with care, from which the slides, the narration, the captions, and the timing are all derived. We will cover synthetic voice-over and what you should disclose about it, and the editing pass that keeps generated narration sounding like a person. And the worked example is close to home — this very module is a script-first artifact of exactly the kind we are about to discuss.
Slide 2 — One source, three outputs
Here is the structural idea. The article, the deck, and the narrated video are not three separate projects — they are three renderings of one content source. The source is a structured document: sections, claims, evidence, examples. The article renders it for reading. The deck renders it for presenting — and tools in this space can export both an HTML deck and an editable PowerPoint file from the same source. The video renders it onto a timeline with a voice. The rule that makes this work is simple and strict: maintain the source, never the outputs. The moment you fix something in the deck but not in the source, the formats start drifting apart, and they drift in front of an audience.
Slide 3 — Writing scripts agents can act on
So what does a script an agent can act on look like? It looks like a screenplay, not an essay. Scenes — one per section of the source, each with a stated purpose. Beats — one idea each, and each beat becomes one slide or one shot. And on-screen anchors: the exact words and numbers that will appear on screen, written into the script itself. That last one is the big quality lever. If the anchors are not in the script, the agent will invent them, and invented copy is where these things go wrong first. One sizing rule to remember: sixty to a hundred and twenty words per beat reads as roughly twenty-five to fifty seconds of narration — about as long as one visual can carry.
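The sizing rule above works out to a pace of about 2.4 words per second (sixty words in twenty-five seconds). A minimal sketch of a pre-audio length check follows, assuming that pace holds. The function and constant names are illustrative, not part of any tool named in this module, and the estimate does not replace measuring the real audio later.

```python
# Pace implied by the sizing rule: 60 words in about 25 seconds.
# This is an approximation; a real voice will differ, so measure the audio later.
SECONDS_PER_WORD = 25 / 60

MIN_WORDS = 60
MAX_WORDS = 120


def word_count(narration: str) -> int:
    """Count whitespace-separated words in a beat's narration."""
    return len(narration.split())


def estimate_seconds(narration: str) -> float:
    """Rough spoken length of a beat, before any audio exists."""
    return word_count(narration) * SECONDS_PER_WORD


def check_beat(narration: str) -> str:
    """Return 'short', 'ok', or 'long' against the 60-to-120-word range."""
    n = word_count(narration)
    if n < MIN_WORDS:
        return "short"
    if n > MAX_WORDS:
        return "long"
    return "ok"
```

Running `check_beat` over every beat before generating audio is a cheap way to catch beats that one visual cannot carry.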
Slide 4 — Slide structure generated from the script, not alongside it
Once the script has typed beats, slide structure stops being a separate creative act. Each beat type maps to a treatment: a claim becomes a single statement slide, a definition becomes a term card, a process becomes a diagram with five nodes or fewer, a comparison becomes two columns, evidence becomes a small table, and the close becomes a checklist or a pull quote. The on-screen text comes straight from the script's anchors. Treat about thirty words on screen per beat as the ceiling: numbers and names go on screen, full sentences stay in the narration. The mapping is mechanical on purpose. The judgment already happened when you wrote the script.
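Because the mapping is mechanical, it can be written down as a lookup table. The sketch below is one possible shape, not the format of any particular tool: the `Beat` class, the type names, and the choice to count each anchor of a process beat as one diagram node are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Beat type -> slide treatment, as listed in the narration above.
TREATMENTS = {
    "claim": "single statement slide",
    "definition": "term card",
    "process": "diagram",
    "comparison": "two columns",
    "evidence": "small table",
    "close": "checklist or pull quote",
}

MAX_ON_SCREEN_WORDS = 30  # ceiling per beat
MAX_DIAGRAM_NODES = 5     # "five nodes or fewer"


@dataclass
class Beat:
    beat_type: str
    anchors: list[str]  # exact on-screen words, copied from the script


def plan_slide(beat: Beat) -> dict:
    """Map one typed beat to a treatment and flag anything over the limits."""
    if beat.beat_type not in TREATMENTS:
        raise ValueError(f"unknown beat type: {beat.beat_type}")
    problems = []
    words = sum(len(anchor.split()) for anchor in beat.anchors)
    if words > MAX_ON_SCREEN_WORDS:
        problems.append(f"{words} words on screen, ceiling is {MAX_ON_SCREEN_WORDS}")
    # Assumption: each anchor of a process beat is one node in the diagram.
    if beat.beat_type == "process" and len(beat.anchors) > MAX_DIAGRAM_NODES:
        problems.append(f"{len(beat.anchors)} nodes, limit is {MAX_DIAGRAM_NODES}")
    return {
        "treatment": TREATMENTS[beat.beat_type],
        "on_screen": list(beat.anchors),
        "problems": problems,
    }
```

An unknown beat type raises an error instead of falling back to a default treatment, which keeps the judgment in the script rather than in the renderer.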
Slide 5 — The narrated explainer pipeline
Here is the pipeline end to end. A human writes the script, one segment per slide. The agent generates the voice, one audio file per segment, and measures how long each one runs. Those measured durations produce a timeline: a small text file mapping every segment to its visual and its duration. The visuals are rendered per segment from templates and brand tokens, with the on-screen words taken from the script. Then everything is assembled into an MP4, with the music ducked under the narration and captions packaged alongside. Finally, a human watches the whole thing. Two rules hold it together. Narration drives duration, so the visuals fit the words. And fixes go to the script and the timeline, never to the rendered video.
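The measure-then-schedule step can be sketched in a few lines. This version assumes the narration files are WAV, so the standard library's `wave` module can read their length, and it writes the timeline as JSON. The field names, the file format, and the half-second of padding are assumptions for illustration, not what any specific pipeline uses.

```python
import json
import wave


def wav_seconds(path: str) -> float:
    """Measure the length of one narration file (WAV only in this sketch)."""
    with wave.open(path, "rb") as audio:
        return audio.getnframes() / audio.getframerate()


def build_timeline(segments, durations, padding=0.5):
    """Lay segments end to end; each shot lasts as long as its narration.

    segments:  list of (segment_id, visual) pairs in script order
    durations: dict of segment_id -> measured narration seconds
    padding:   breathing room added after each narration (assumed value)
    """
    timeline = []
    start = 0.0
    for segment_id, visual in segments:
        length = durations[segment_id] + padding
        timeline.append({
            "segment": segment_id,
            "visual": visual,
            "start": round(start, 3),
            "duration": round(length, 3),
        })
        start += length
    return timeline


def timeline_text(timeline) -> str:
    """Serialise the timeline as the small text file the pipeline edits."""
    return json.dumps(timeline, indent=2)
```

Note the direction of the dependency: the durations come from the audio, and the visuals are scheduled to fit them, never the reverse.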
Slide 6 — Voice: keeping generated narration in the brand's register
Now the part that decides whether the result sounds like your organisation or like everyone's launch video at once: voice. Generated narration defaults to the average voice of the internet — enthusiastic, vague, slightly too pleased with itself. The fix is not better prompting on the day; it is writing the register down where the agent always sees it. Sentence length, person, contractions, how you hedge claims, and a banned-phrase list — no 'seamlessly', no 'game-changing', no stacked rhetorical questions. Give the agent a few paragraphs of real, approved writing as the reference. And read the script aloud before you generate any audio, because the ear catches what the eye forgives.
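Part of a written-down register can be checked automatically before the read-aloud. The sketch below scans a script for banned phrases and for rhetorical questions stacked back to back. The two phrases come from the narration above; the function names, the naive sentence splitting, and the threshold of two questions in a row are assumptions.

```python
import re

# Starter list from the narration; a real list would live with the brand guide.
BANNED_PHRASES = ["seamlessly", "game-changing"]


def find_banned(script: str, banned=BANNED_PHRASES):
    """Return (line_number, phrase) for every banned phrase found."""
    hits = []
    for line_no, line in enumerate(script.splitlines(), start=1):
        for phrase in banned:
            pattern = r"\b" + re.escape(phrase) + r"\b"
            if re.search(pattern, line, re.IGNORECASE):
                hits.append((line_no, phrase))
    return hits


def has_stacked_questions(script: str, limit: int = 2) -> bool:
    """True if `limit` or more consecutive sentences end in a question mark."""
    # Naive sentence split on ., ! and ? which is enough for a first pass.
    sentences = re.findall(r"[^.!?]+[.!?]", script)
    run = 0
    for sentence in sentences:
        if sentence.strip().endswith("?"):
            run += 1
            if run >= limit:
                return True
        else:
            run = 0
    return False
```

A check like this only catches the listed phrases. It does not replace reading the script aloud, which is where tone problems actually surface.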
Slide 7 — Synthetic voice-over, and what to disclose
Let's talk about the voice itself. Most of this pipeline runs on text-to-speech, and that is a feature: re-recording is free, the voice stays consistent across updates, and the captions are exact because they come from the script. Where synthetic voices fall short is emotional range, humour, and anything where the speaker's identity is the point. That brings up disclosure. The honest default is simple: if a reasonable viewer would assume a human and feel misled to learn otherwise, say so — a line in the description or the end card is enough. Never clone a real person's voice without written consent. And as of June 2026, platform rules on AI disclosure vary, so check the destination's policy before you publish, and settle your team's own policy once rather than per video.
Slide 8 — The editing pass: pacing, emphasis, and cutting the filler
Two human passes hold the quality line. The first is on the script, before any audio exists. Cut the filler — the throat-clearing intros, the sentences that only announce the next sentence. Check every claim against the source, because generated transitions love to over-promise. Then generate the audio and do the second pass: watch the assembled video at real speed. This is where you catch pacing, and pacing is the failure no automated check finds — nothing in the toolchain knows whether a viewer had time to read the slide. Listen for mispronunciations and sentences that need splitting. The pattern to expect is consistent: agent narration runs slightly too fast, slightly too smooth, and slightly too confident. The edit takes all three down a notch.
Slide 9 — Accessibility: captions, transcripts, and contrast in motion
Accessibility is where this pipeline pays for itself. Because the script exists before the audio, captions are exact — they come from the script, not from speech recognition — and the transcript is the same text again, lightly formatted and searchable. Ship both with every video, and treat their absence as a failed gate. On-screen text is still text: the same contrast rules as your product UI apply, and it has to stay up long enough to be read twice at a normal pace, because viewers are also listening. And avoid flashing or rapid full-screen changes — a rendered video cannot respond to a reduced-motion preference, so the restraint has to be designed in from the start.
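Since the caption text is the script text, writing a caption file is mostly a matter of formatting timestamps. Here is a minimal sketch that emits WebVTT from script segments and their timings. It produces one cue per segment for brevity, which is a simplification: real captions would usually be split into shorter cues. The function names and the input shape are assumptions.

```python
def vtt_time(seconds: float) -> str:
    """Format seconds as a WebVTT timestamp, hh:mm:ss.mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def script_to_vtt(cues) -> str:
    """Build a WebVTT document from (start, duration, text) tuples.

    The text comes straight from the script, so no speech recognition
    is involved and the captions match the narration word for word.
    """
    lines = ["WEBVTT", ""]
    for start, duration, text in cues:
        lines.append(f"{vtt_time(start)} --> {vtt_time(start + duration)}")
        lines.append(text)
        lines.append("")
    return "\n".join(lines)
```

The same list of segment texts, joined with paragraph breaks, gives the transcript, so both deliverables come from one pass over the script.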
Slide 10 — Decks and exports: where the outputs actually go
A quick word on decks specifically, because the deck is the output most likely to leave your pipeline and live in someone else's hands. The native form is an HTML deck — presented from a browser, regenerated when the source changes. For audiences that need PowerPoint, the bar to hold any export to is real, editable text frames, not screenshots pasted onto slides. The narrated video reuses the same beats with the synthetic voice and captions. Stills and PDF cover everything else. And because an editable export is also an export that can drift, every file you hand over should say what it was generated from, and your repo should name the canonical version. When the edited deck comes back different, the source wins.
Slide 11 — Worked example: a course module turned into a narrated explainer
Let's trace a real case: one of this school's course modules becoming a narrated explainer — and yes, the module you are watching is the same kind of artifact. The source already existed: thirteen slides, each with a sixty-to-one-twenty word script, written for the course. The pipeline mapped each beat to a template, generated the narration per segment, measured the durations, and set each shot's length from its narration plus breathing room. Visuals were filled from the script's anchors and brand tokens, stitched together, music ducked under the voice. An automated validator confirmed every slide was covered and nothing was left as placeholder text. The human time went to a shot-plan review, two pronunciation fixes, and one real-time watch — which caught a pacing problem, fixed in the script, not in the video.
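The two checks the validator is described as making, coverage and leftover placeholder text, are simple to express. The sketch below is not the validator used in the worked example. The timeline shape, the `on_screen` field, and the placeholder patterns it looks for are assumptions chosen for illustration.

```python
import re

# Assumed placeholder markers; adjust to whatever the templates actually emit.
PLACEHOLDER = re.compile(
    r"\b(?:TODO|TBD|lorem ipsum)\b|\{\{.*?\}\}",
    re.IGNORECASE,
)


def validate(slide_ids, timeline):
    """Return a list of problems; an empty list means the gate passes.

    slide_ids: every slide id in the source, in order
    timeline:  list of dicts with 'segment' and optional 'on_screen' texts
    """
    errors = []
    covered = {entry["segment"] for entry in timeline}
    for slide_id in slide_ids:
        if slide_id not in covered:
            errors.append(f"{slide_id}: no timeline entry")
    for entry in timeline:
        for text in entry.get("on_screen", []):
            if PLACEHOLDER.search(text):
                errors.append(f"{entry['segment']}: placeholder text")
    return errors
```

A check like this confirms completeness, not quality. The pacing problem in the worked example was caught by the real-time watch, which no validator of this kind would replace.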
Slide 12 — Exercise: convert one document into a script outline
Your turn. Pick one existing document — a release note, a help article, a spec, or a course page — and convert it into a script outline on paper. Break it into scenes and beats, one idea per beat, sixty to a hundred and twenty words of narration each, and name each beat's type. Then write the on-screen anchors: the exact words and numbers the viewer must see. Pick the three beats where generated narration would most likely drift off-register and write those three word-for-word yourself. Finally, note what you would disclose about the voice, and where the captions and transcript would live. Keep the outline — it becomes the first run of the pipeline you will define in Module 5.
Slide 13 — Summary, and the bridge to motion in product
Let's close. The script is the design: it is the artifact a human writes and edits with care, and the slides, narration, captions, and timing are all derived from it. One source feeds the article, the deck, and the video — maintain the source, never the outputs. Narration drives duration, so the visuals fit the words. Synthetic voice-over is what makes the pipeline repeatable, and the honest default is to disclose it wherever a viewer would otherwise assume a human. And two human passes hold the line: the script edit before any audio exists, and the real-time watch before anything ships. Module 4 changes the subject from video to the product itself — motion inside the interface, and the rules of duration, easing, and restraint that keep an agent from animating everything that moves. See you there.