Product Lead · 2023 — 2024

AI Summaries for Session Replay

Replay review time reduction

~50%

First-time user activation lift

+17%

LLM · Structured Event Data · AI Summarisation · Experimentation Platform · Amplitude Analytics

Context

Amplitude Session Replay gives teams visibility into how users move through their product. After launch, over 50% of new customers never opened Replay. Among those who did, many dropped off because scrubbing through 10-to-15-minute recordings to find what mattered was too slow. Replay felt like a storage tool, not an insight tool.

Problem

New users couldn't find value fast enough. The workflow required watching replays end to end, which slowed time to first insight and made it hard to justify the tool during onboarding.

Approach

1. Built an evaluation framework before writing any prompts. We pulled real customer replays and defined what a good summary had to include: key user actions, meaningful friction points, and behavioral signals. That became our labeled eval set. We tracked groundedness, signal capture, and noise, and every prompt or model change ran against that dataset. When we found a new failure mode, we added it as a permanent test case.
2. Reduced hallucination risk by grounding the LLM on structured event data rather than raw video or loosely formatted logs. The model handled reasoning; we controlled the inputs and enforced a strict output format to keep summaries consistent and auditable.
3. Shipped the product changes: a Summary Card on every replay showing instant AI-generated insights, and updated onboarding so new users saw the summary first instead of being dropped into a raw recording.
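As an illustration of step 1, here is a minimal sketch of how a labeled eval case might be scored. The write-up names the three metrics but not their formulas, so the definitions below (set overlap on cited event IDs), the field names, and the pass thresholds are all assumptions, not the production framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    """One labeled replay from the eval set (field names are hypothetical)."""
    case_id: str
    session_event_ids: frozenset    # every event ID present in the replay
    expected_signal_ids: frozenset  # events a good summary must mention


@dataclass(frozen=True)
class Scores:
    groundedness: float    # share of cited events that exist in the replay
    signal_capture: float  # share of labeled signals the summary cited
    noise: float           # share of citations that are real but unlabeled


def score_summary(case, cited_event_ids):
    """Score one summary by the event IDs it cites (assumed metric definitions)."""
    cited = set(cited_event_ids)
    if not cited:
        # An empty summary cites nothing false, but also captures nothing.
        return Scores(1.0, 0.0 if case.expected_signal_ids else 1.0, 0.0)
    grounded = cited & case.session_event_ids
    captured = cited & case.expected_signal_ids
    expected_n = len(case.expected_signal_ids)
    return Scores(
        groundedness=len(grounded) / len(cited),
        signal_capture=len(captured) / expected_n if expected_n else 1.0,
        noise=len(grounded - case.expected_signal_ids) / len(cited),
    )


def failing_cases(cases, cited_by_case, min_groundedness=1.0, min_capture=0.8):
    """Return IDs of cases whose summary falls below the assumed thresholds."""
    failures = []
    for case in cases:
        s = score_summary(case, cited_by_case.get(case.case_id, set()))
        if s.groundedness < min_groundedness or s.signal_capture < min_capture:
            failures.append(case.case_id)
    return failures
```

Running `failing_cases` over the whole dataset after each prompt or model change gives the regression check described above; a newly discovered failure mode becomes one more `EvalCase`.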

What I shipped

✓ Summary Card surfaced on every session replay with AI-generated insight highlights
✓ Labeled eval dataset built from real customer replays, used to validate every model and prompt change
✓ Eval framework tracking groundedness, signal capture, and noise, with regression tests for known failure modes
✓ Updated onboarding flow prioritising the Summary Card as the activation entry point

Impact

· Replay review time dropped by approximately 50%: users could scan summaries and jump directly to important moments.
· First-time user activation increased 17% across key workflows.
· Users could quickly decide what to investigate without watching full sessions.

Learnings

LLM quality has to be treated like a product surface. If the summary is wrong one in ten times, trust drops fast. Defining a labeled eval set from real replays and testing every prompt change against it was critical — it forced us to measure groundedness and signal capture rather than relying on gut feel.
Grounding beats clever prompting. The biggest quality lift came from constraining the model to structured event data rather than raw video. Controlled inputs produced consistent, reliable outputs.
AI alone doesn't move activation — workflow does. The model was necessary but not sufficient. Activation moved when we changed the entry point and pushed users to the Summary Card first. The onboarding changes were just as important as the model quality.

One key tradeoff

Decision

Summarise structured event data vs. attempt direct video or transcript summarisation

Rationale

Structured events gave us a clean, auditable input format with far lower hallucination risk. Video or transcript summarisation would have been richer in theory but unpredictable in practice.

The tradeoff

We gave up some nuance — tone, emotional cues, visual context — that a transcript might have captured. For an analytics tool where precision matters more than colour, that was the right trade.
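To make the structured-event choice concrete, here is a minimal sketch of what "controlled inputs and a strict output format" could look like. The event fields (`id`, `t`, `type`, `target`), the prompt wording, and the JSON reply schema are hypothetical; the case study does not describe the actual schema, and the model call itself is left out.

```python
import json


def build_prompt(events):
    """Render structured events one per line, so the model sees only these facts."""
    lines = [f"{e['id']} | t={e['t']}s | {e['type']} | {e['target']}" for e in events]
    return (
        "Summarise this session using ONLY the events listed below.\n"
        'Reply with JSON of the form {"highlights": [{"text": "...", "event_id": "..."}]}.\n'
        + "\n".join(lines)
    )


def validate_summary(raw, events):
    """Parse the model reply and reject anything outside the strict format."""
    data = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(data, dict) or set(data) != {"highlights"}:
        raise ValueError("reply must be an object with exactly one key: highlights")
    if not isinstance(data["highlights"], list):
        raise ValueError("highlights must be a list")
    known_ids = {e["id"] for e in events}
    for item in data["highlights"]:
        if not isinstance(item, dict) or set(item) != {"text", "event_id"}:
            raise ValueError("each highlight needs exactly: text, event_id")
        if not isinstance(item["event_id"], str) or item["event_id"] not in known_ids:
            # A highlight pointing at an event we never sent is treated as ungrounded.
            raise ValueError(f"unknown event_id: {item['event_id']!r}")
    return data["highlights"]
```

Because every highlight must point at an event ID that was in the input, a reviewer can trace each sentence of the summary back to a recorded event, which is what makes the output auditable.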

Metric definitions

Replay review time: Median time from opening a session replay to the user taking an action or closing, measured via instrumentation on the replay player.

First-time user activation: Percentage of new Replay users who completed a defined activation event (e.g. jumping to a flagged moment or sharing a replay) within their first session.
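The two definitions above translate into short calculations. This is a sketch only: the record fields (`opened_at`, `ended_at`, `activated`) are hypothetical stand-ins for whatever the replay player instrumentation actually emits, and the numbers in the usage note are made up for illustration.

```python
from statistics import median


def median_review_seconds(views):
    """Median seconds from opening a replay to the first action or close.

    Each view is a dict with hypothetical 'opened_at' and 'ended_at'
    timestamps in seconds. Raises StatisticsError if views is empty.
    """
    return median(v["ended_at"] - v["opened_at"] for v in views)


def activation_rate(first_sessions):
    """Percentage of new users whose first session includes an activation event."""
    if not first_sessions:
        return 0.0
    activated = sum(1 for s in first_sessions if s["activated"])
    return 100.0 * activated / len(first_sessions)
```

For example, three views lasting 30, 90, and 60 seconds have a median review time of 60 seconds, and two activated users out of four first sessions give an activation rate of 50%.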