# Deterministic Replay
Replay lets teams reproduce any FleetForge run with the same inputs, tool responses, prompts, and model versions. It underpins regression debugging, ChangeOps release gates, and compliance audits.
## How Replay Works
- Seeded DAGs: Every run requires a positive seed. The scheduler and executors derive random choices (sampling temperature, tool routing, retries) from that seed, so replays follow the same path.
- Artifact capture: Inputs, outputs, prompts, tool calls, and policy decisions persist as artifacts. The UI and CLI surface diffs between live and replayed runs so drift is obvious.
- Executor determinism: LLM/tool executors record temperature, top-p, provider metadata, and any prompts resolved from packs. When replaying, the runtime rehydrates the executor with the same parameters.
- Data provenance: Memory adapters and context sources include versioned pointers so a replay pulls the same documents or tool artifacts.
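The seeded-DAG property above can be sketched in a few lines. This is a hypothetical illustration, not FleetForge's actual derivation: the `node_rng` helper and the hash scheme are assumptions, but they show how one positive seed can fan out into independent, reproducible random streams per DAG node.

```python
import hashlib
import random

def node_rng(run_seed: int, node_id: str) -> random.Random:
    """Derive a deterministic per-node RNG from the run seed.

    Hypothetical sketch: hashing the seed together with the node id
    gives each DAG node an independent stream, while the same
    (seed, node) pair always reproduces the same choices.
    """
    digest = hashlib.sha256(f"{run_seed}:{node_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# A replay with the original seed regenerates the exact sampling,
# routing, and retry decisions the live run made.
live = node_rng(42, "tool-router")
replay = node_rng(42, "tool-router")
print([live.random() for _ in range(3)] == [replay.random() for _ in range(3)])
```

Deriving per-node streams (rather than sharing one global RNG) keeps a replay stable even if unrelated nodes execute in a different order.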
## Replay Workflow
- Run `fleetforge-ctl runs get <RUN_ID>` (or use the UI “Replay” button) to spawn a replay with the original seed.
- Review the diff view; if behaviour diverged, inspect prompts, tool outputs, and policy decisions to locate the drift.
- Promote replays to eval suites (see `evals/`) so regressions fail CI.
- Archive replay artifacts as part of ChangeOps approvals when shipping new prompts or adapters.
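The diff-review step above boils down to comparing live and replayed artifacts key by key. A minimal sketch, assuming artifacts have been flattened to `{artifact_name: payload}` dictionaries (the function name and shape are illustrative, not the CLI's actual data model):

```python
def diff_artifacts(live: dict, replay: dict) -> list[str]:
    """Return the artifact names whose payloads drifted between
    a live run and its replay, so reviewers know where to look first."""
    drift = []
    for key in sorted(set(live) | set(replay)):
        if live.get(key) != replay.get(key):
            drift.append(key)
    return drift

live = {"prompt:plan": "v1", "tool:search": "[doc-7]", "output": "ship it"}
replay = {"prompt:plan": "v1", "tool:search": "[doc-9]", "output": "hold"}
print(diff_artifacts(live, replay))
```

An empty result means the replay reproduced the run; a non-empty one points at the prompts or tool outputs to inspect for drift.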
## Hardening deterministic replay
Seeds and checkpointed DAGs get most workloads close to deterministic parity, but regulated teams need additional controls before they trust replays for forensics. The ongoing hardening work covers four areas:
- Artifact sealing: every executor now captures request/response envelopes, signed capability tokens, and attestation IDs. Upcoming releases will seal outbound HTTP payloads and tool transcripts so replays can stub responses without hitting external systems. See `core/runtime/src/replay.rs` for the plumbing.
- Time & randomness virtualization: the runtime already derives randomness from the run seed; we are adding a per-run virtual clock so time-sensitive tools (cron, TTL caches) see the same timestamps during replay. The clock will be stored alongside the checkpoint metadata and surfaced in OTEL spans.
- Model + dataset pinning: prompts reference model versions today, but replays also need dataset snapshots (vector store digests, feature store versions, HTTP cache keys) plus explicit manifest pinning for each tool/model invocation. The AIBOM format is being extended to include these digests so replays can verify the same context sources and manifests before executing.
- External effect sandboxes: shell/HTTP/tool adapters will gain a `replay_mode=strict` flag that forces stubbed IO. When enabled, the runtime fails the replay if an adapter tries to reach an unsealed endpoint, making drift visible rather than silently tolerating it.
Each control is tracked in Status & Acceptance → Deterministic replay controls.
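The strict-replay and virtual-clock controls described above combine naturally in one replay context. The sketch below is hypothetical (class and field names are assumptions, not the Rust runtime's API), but it captures the intended semantics: sealed envelopes answer IO requests, a frozen clock answers time reads, and an un-sealed request fails the replay loudly.

```python
class StrictReplayContext:
    """Sketch of replay_mode=strict semantics (names hypothetical).

    Sealed request/response envelopes are replayed instead of doing
    live IO, and a per-run virtual clock pinned at checkpoint time
    replaces wall-clock reads for time-sensitive tools."""

    def __init__(self, envelopes: dict[str, str], checkpoint_time: float):
        self.envelopes = envelopes          # request key -> sealed response
        self.virtual_now = checkpoint_time  # frozen per-run clock

    def now(self) -> float:
        # Cron triggers and TTL caches see the same timestamps as the live run.
        return self.virtual_now

    def call(self, request_key: str) -> str:
        # Fail the replay rather than silently reach an external system.
        if request_key not in self.envelopes:
            raise RuntimeError(f"strict replay: no sealed envelope for {request_key}")
        return self.envelopes[request_key]

ctx = StrictReplayContext(
    {"GET https://api.example.com/v1/doc": '{"id": "doc-7"}'},
    checkpoint_time=1700000000.0,
)
print(ctx.call("GET https://api.example.com/v1/doc"))
```

Failing closed on an unknown request is the key design choice: drift surfaces as a hard error in the replay rather than as a quiet divergence discovered later in the diff view.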
## ChangeOps Integration
Change reviews require a replay demonstrating that the new code or prompt behaves as expected. The ChangeOps gate stores replay metadata alongside eval results, budgets, and novelty scores.
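The gate described above is, at its core, a conjunction over the stored review evidence. A hypothetical predicate (the function and parameter names are illustrative, not ChangeOps' actual schema):

```python
def changeops_gate(replay_drift: list[str], evals_passed: bool, budget_ok: bool) -> bool:
    """A change is promotable only when its required replay shows no
    drifted artifacts and the stored eval and budget results are green."""
    return not replay_drift and evals_passed and budget_ok

# A clean replay with green evals and budgets passes the gate;
# any drifted artifact blocks promotion.
print(changeops_gate([], evals_passed=True, budget_ok=True))
print(changeops_gate(["tool:search"], evals_passed=True, budget_ok=True))
```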
## Related Docs
- Concepts: Delivery guarantees
- Concept: ChangeOps governance
- Tutorial: Hello Fleet walkthrough
- Reference: Contracts phases
- Reference: Roadmap & Status → Deterministic replay controls