# Deterministic Replay
Replay lets teams reproduce any FleetForge run with the same inputs, tool responses, prompts, and model versions. It underpins regression debugging, ChangeOps release gates, and compliance audits.
## How Replay Works
- Seeded DAGs: Every run requires a positive seed. The scheduler and executors derive random choices (sampling temperature, tool routing, retries) from that seed, so replays follow the same path.
- Artifact capture: Inputs, outputs, prompts, tool calls, and policy decisions persist as artifacts. The UI and CLI surface diffs between live and replayed runs so drift is obvious.
- Executor determinism: LLM/tool executors record temperature, top-p, provider metadata, and any prompts resolved from packs. When replaying, the runtime rehydrates the executor with the same parameters.
- Data provenance: Memory adapters and context sources include versioned pointers so a replay pulls the same documents or tool artifacts.
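The seeded-DAG property above can be sketched in a few lines. This is a hypothetical illustration, not FleetForge's actual derivation: the `node_rng` helper and the hash scheme are assumptions, but they show how one positive seed can fan out into independent, reproducible random streams per DAG node.

```python
import hashlib
import random

def node_rng(run_seed: int, node_id: str) -> random.Random:
    """Derive a deterministic per-node RNG from the run seed.

    Hypothetical sketch: hashing the seed together with the node id
    gives each DAG node an independent stream, while the same
    (seed, node) pair always reproduces the same choices.
    """
    digest = hashlib.sha256(f"{run_seed}:{node_id}".encode()).digest()
    return random.Random(int.from_bytes(digest[:8], "big"))

# A replay with the original seed regenerates the exact sampling,
# routing, and retry decisions the live run made.
live = node_rng(42, "tool-router")
replay = node_rng(42, "tool-router")
print([live.random() for _ in range(3)] == [replay.random() for _ in range(3)])
```

Deriving per-node streams (rather than sharing one global RNG) keeps a replay stable even if unrelated nodes execute in a different order.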
## Replay Workflow
- Run `fleetforge-ctl runs get <RUN_ID>` (or use the UI “Replay” button) to spawn a replay with the original seed.
- Review the diff view; if behaviour diverged, inspect prompts, tool outputs, and policy decisions to locate the drift.
- Promote replays to eval suites (see `evals/`) so regressions fail CI.
- Archive replay artifacts as part of ChangeOps approvals when shipping new prompts or adapters.
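The diff-review step above boils down to comparing live and replayed artifacts key by key. A minimal sketch, assuming artifacts have been flattened to `{artifact_name: payload}` dictionaries (the function name and shape are illustrative, not the CLI's actual data model):

```python
def diff_artifacts(live: dict, replay: dict) -> list[str]:
    """Return the artifact names whose payloads drifted between
    a live run and its replay, so reviewers know where to look first."""
    drift = []
    for key in sorted(set(live) | set(replay)):
        if live.get(key) != replay.get(key):
            drift.append(key)
    return drift

live = {"prompt:plan": "v1", "tool:search": "[doc-7]", "output": "ship it"}
replay = {"prompt:plan": "v1", "tool:search": "[doc-9]", "output": "hold"}
print(diff_artifacts(live, replay))
```

An empty result means the replay reproduced the run; a non-empty one points at the prompts or tool outputs to inspect for drift.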
## Hardening deterministic replay
Seeds and checkpointed DAGs get most workloads close to deterministic parity, but regulated teams need additional controls before they trust replays for forensics. The ongoing hardening work covers four areas:
- Artifact sealing: every executor now captures request/response envelopes, signed capability tokens, and attestation IDs. Upcoming releases will seal outbound HTTP payloads and tool transcripts so replays can stub responses without hitting external systems. See `core/runtime/src/replay.rs` for the plumbing.
- Time & randomness virtualization: the runtime already derives randomness from the run seed; we are adding a per-run virtual clock so time-sensitive tools (cron, TTL caches) see the same timestamps during replay. The clock will be stored alongside the checkpoint metadata and surfaced in OTEL spans.
- Model + dataset pinning: prompts reference model versions today, but replays also need dataset snapshots (vector store digests, feature store versions, HTTP cache keys) plus explicit manifest pinning for each tool/model invocation. The AIBOM format is being extended to include these digests so replays can verify the same context sources and manifests before executing.
- External effect sandboxes: shell/HTTP/tool adapters will gain a `replay_mode=strict` flag that forces stubbed IO. When enabled, the runtime fails the replay if an adapter tries to reach an unsealed endpoint, making drift visible rather than silently tolerating it.
Each control is tracked in Status & Acceptance → Deterministic replay controls.
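The strict-replay and virtual-clock controls described above combine naturally in one replay context. The sketch below is hypothetical (class and field names are assumptions, not the Rust runtime's API), but it captures the intended semantics: sealed envelopes answer IO requests, a frozen clock answers time reads, and an un-sealed request fails the replay loudly.

```python
class StrictReplayContext:
    """Sketch of replay_mode=strict semantics (names hypothetical).

    Sealed request/response envelopes are replayed instead of doing
    live IO, and a per-run virtual clock pinned at checkpoint time
    replaces wall-clock reads for time-sensitive tools."""

    def __init__(self, envelopes: dict[str, str], checkpoint_time: float):
        self.envelopes = envelopes          # request key -> sealed response
        self.virtual_now = checkpoint_time  # frozen per-run clock

    def now(self) -> float:
        # Cron triggers and TTL caches see the same timestamps as the live run.
        return self.virtual_now

    def call(self, request_key: str) -> str:
        # Fail the replay rather than silently reach an external system.
        if request_key not in self.envelopes:
            raise RuntimeError(f"strict replay: no sealed envelope for {request_key}")
        return self.envelopes[request_key]

ctx = StrictReplayContext(
    {"GET https://api.example.com/v1/doc": '{"id": "doc-7"}'},
    checkpoint_time=1700000000.0,
)
print(ctx.call("GET https://api.example.com/v1/doc"))
```

Failing closed on an unknown request is the key design choice: drift surfaces as a hard error in the replay rather than as a quiet divergence discovered later in the diff view.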
## ChangeOps Integration
Change reviews require a replay demonstrating that the new code or prompt behaves as expected. The ChangeOps gate stores replay metadata alongside eval results, budgets, and novelty scores.
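The gate described above is, at its core, a conjunction over the stored review evidence. A hypothetical predicate (the function and parameter names are illustrative, not ChangeOps' actual schema):

```python
def changeops_gate(replay_drift: list[str], evals_passed: bool, budget_ok: bool) -> bool:
    """A change is promotable only when its required replay shows no
    drifted artifacts and the stored eval and budget results are green."""
    return not replay_drift and evals_passed and budget_ok

# A clean replay with green evals and budgets passes the gate;
# any drifted artifact blocks promotion.
print(changeops_gate([], evals_passed=True, budget_ok=True))
print(changeops_gate(["tool:search"], evals_passed=True, budget_ok=True))
```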
## Related Docs
- Concepts: Delivery guarantees
- Concept: ChangeOps governance
- Tutorial: Hello Fleet walkthrough
- Reference: Contracts phases
- Reference: Roadmap & Status → Deterministic replay controls