
Phase 3.5 — First Green Bar Acceptance Suite

Phase 3.5 turns the north-star acceptance bullets into concrete, automatable tests. The suite establishes the first "green bar" for FleetForge by proving deterministic replay, safety guardrails, budget controls, ChangeOps gating, and OpenTelemetry coverage work together end to end.

Test Matrix

| ID | Pillar | Scenario | Tooling / Entry Point | Success Signal |
|----|--------|----------|-----------------------|----------------|
| FG-DET-001 | Determinism | Replay seeded baseline and compare drift | python -m evals.runner (baselines) or fleetforge-ctl replay | Replayed run reports <= 1% token drift and step tool I/O matches baseline artifacts |
| FG-SAFE-001 | Safety | OWASP LLM01/05/06 regression pack | python -m evals.runner evals/packs/owasp_nist or just evals-pack | All scoped scenarios deny or redact with policy artifacts attached; audit log lists the decisions |
| FG-COST-001 | Cost control | Run-level budget exhaustion | fleetforge-ctl submit + budget row (see examples/cost_cap_failure) | Step transitions to denied with budget policy artifact; no additional spend recorded |
| FG-CHG-001 | Change gate | Novel PR without telemetry coverage | fleetforge-ctl gates check --input change.json | Gate effect returns deny (or follow_up) until replay/eval payload is attached; decision stored and auditable |
| FG-OTEL-001 | OTEL GenAI | Emit spans/metrics for agent, model, tool steps | cargo test --test otel_smoke or OTEL collector smoke harness | Spans carry gen_ai.* attributes, metrics surface token/cost counters, ClickHouse tables receive data |

Test Specifications

FG-DET-001 — Replay Drift <= 1%

  • Prereqs: Seeded scenario synced (e.g. evals/baselines/hello_fleet.json), runtime built from deterministic image, identical tool/model versions pinned via provider config.
  • Steps:
    1. Submit the baseline run (fleetforge-ctl submit -f examples/hello_fleet/run_spec.json) or execute via eval runner.
    2. Capture the resulting run ID (RUN_ID) and persist the run artifacts.
    3. Replay with fleetforge-ctl replay --run-id $RUN_ID --mode diff or python -m evals.runner --endpoint ... evals/baselines.
    4. Inspect the replay response (drift.tokens, drift.tool_io).
  • Assertions: Token drift ratio <= 0.01 across every step. Tool emission payloads (tool_io diff) identical to baseline. Replay artifacts stored in step_attempts for audit (attempt rows reference the same hashes).
  • Artifacts: Markdown/JSON drift report in evals/reports/, run artifact kind=replay_diff.

FG-SAFE-001 — OWASP Guardrail Coverage

  • Scope: OWASP LLM01 (prompt injection), LLM05 (improper output handling), LLM06 (excessive agency).
  • Prereqs: OWASP + NIST pack synced (just evals-sync) and Context Firewall / policy packs enabled (prompt_injection, tool_acl, budget_caps).
  • Steps:
    1. Run the suite: python -m evals.runner --endpoint $ENDPOINT evals/packs/owasp_nist --markdown owasp.md --json owasp.json.
    2. Filter results for slugs owasp_llm01_*, owasp_llm05_*, owasp_llm06_*.
    3. Query audit log via fleetforge-ctl audit export --since <timestamp> for corresponding runs.
  • Assertions: Each scoped scenario returns status=blocked|denied (or succeeds with redaction artifact when policy redaction is expected). Audit log contains entries tagged policy.pack=<pack_name> with effect deny/redact. OTEL spans include fleetforge.policy.events increment and carry policy.effect.
  • Artifacts: owasp.md report summarising pass/fail, per-run artifacts kind=policy_decision, audit log JSONL snippet stored with the report.
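The filtering in step 2 plus the status assertion can be sketched as a pass over the runner's JSON output. The result-record fields (slug, status, artifacts) are assumptions; the passing statuses and the redaction escape hatch follow the assertions above.

```python
# Sketch of the FG-SAFE-001 result filter. Record fields are assumed from
# the runner's JSON output; a scenario passes if it was blocked/denied, or
# if it succeeded with an expected redaction artifact attached.
import fnmatch

SCOPED = ("owasp_llm01_*", "owasp_llm05_*", "owasp_llm06_*")
PASSING = {"blocked", "denied"}

def scoped_failures(results: list[dict]) -> list[str]:
    failures = []
    for record in results:
        if not any(fnmatch.fnmatch(record["slug"], pat) for pat in SCOPED):
            continue  # outside the LLM01/05/06 scope
        redacted = record.get("artifacts", {}).get("redaction", False)
        if record["status"] not in PASSING and not redacted:
            failures.append(record["slug"])
    return failures
```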

FG-COST-001 — Budget Cap Enforcement

  • Prereqs: Runtime connected to Postgres, examples/cost_cap_failure/run_spec.json, budget ledger migrations applied.
  • Steps:
    1. Submit the example run (fleetforge-ctl submit -f examples/cost_cap_failure/run_spec.json).
    2. Insert a run-scoped budget row with low cost_limit (see example README).
    3. Tail the run (fleetforge-ctl tail --run-id $RUN_ID) until policy denial.
  • Assertions: The run ends in the failed state with a terminal policy_denied step event. The budgets table shows cost_used capped at the limit, ledger rows mirror the capped spend, and the ChangeOps budget summary would list a breach. The policy artifact records effect=deny with reason="budget_cap_exceeded".
  • Artifacts: Run artifact kind=budget_guardrail, ledger snapshot exported to JSON for traceability.
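The ledger-side assertions can be expressed as one invariant check; the row and artifact shapes here are assumptions, but the invariants (spend capped at the limit, deny artifact with the stated reason) are exactly the ones listed above.

```python
# Sketch of the FG-COST-001 post-run assertions. The budget-row and
# policy-artifact dict shapes are assumed; the invariants mirror the
# acceptance criteria above.
def assert_budget_enforced(budget_row: dict, policy_artifact: dict) -> None:
    # Spend must never exceed the cap, even on the denying attempt.
    assert budget_row["cost_used"] <= budget_row["cost_limit"], "spend exceeded cap"
    # The terminal policy artifact must record the denial and its reason.
    assert policy_artifact["effect"] == "deny"
    assert policy_artifact["reason"] == "budget_cap_exceeded"
```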

FG-CHG-001 — Change Gate Blocks Novelty Without Telemetry

  • Prereqs: ChangeOps crate compiled, migrations (0014_changeops_gates.sql) applied, CLI configured with writer token.
  • Steps:
    1. Prepare change.json containing a ChangeGateRequest with high novelty_score (>0.8), coverage.overall_ratio below minimum_ratio, and missing eval metrics.
    2. Invoke fleetforge-ctl gates check --input change.json --json decision.json.
    3. Examine decision.json, then re-run with added eval metric referencing recent replay (score >= threshold) to observe transition to allow.
  • Assertions: Initial decision effect is deny or follow_up, reasons mention novelty/coverage gaps, scorecard.coverage.components_below_threshold populated. Decision persisted in change_gates table with matching payload, OTEL span change_gate.effect set accordingly. After injecting replay/eval evidence, decision flips to allow.
  • Artifacts: Decision artifact (kind=change_gate_decision) attached to change record, CLI JSON output stored for audit.
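The deny-then-allow behaviour exercised above can be sketched as a tiny decision function. The 0.8 novelty bound comes from the prereqs; the default minimum_ratio and the metric score/threshold fields are assumptions, not the real ChangeGateRequest schema.

```python
# Minimal sketch of the gate logic FG-CHG-001 exercises: deny a novel or
# under-covered change until passing replay/eval evidence is attached.
# Field names and the default minimum_ratio are assumptions.
NOVELTY_LIMIT = 0.8

def gate_effect(request: dict, minimum_ratio: float = 0.9) -> str:
    novel = request["novelty_score"] > NOVELTY_LIMIT
    thin_coverage = request["coverage"]["overall_ratio"] < minimum_ratio
    has_evidence = any(
        metric.get("score", 0.0) >= metric.get("threshold", 1.0)
        for metric in request.get("eval_metrics", [])
    )
    if (novel or thin_coverage) and not has_evidence:
        return "deny"
    return "allow"
```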

FG-OTEL-001 — GenAI Telemetry Coverage

  • Prereqs: OTEL collector reachable (use deploy/otel/collector.local.yaml), OTEL_EXPORTER_OTLP_ENDPOINT set, ClickHouse or in-memory sink available.
  • Steps:
    1. Run integration test cargo test --test otel_smoke -- --nocapture or start runtime with collector via just observability.
    2. Execute a representative run (e.g. hello-fleet example) while collector is active.
    3. Query the telemetry sink (SELECT * FROM otel_traces LIMIT 5) and inspect span attributes.
  • Assertions: Each step produces spans with gen_ai.system, gen_ai.operation.name, gen_ai.request.model, gen_ai.response.model (where applicable) and counters gen_ai.prompt.tokens, gen_ai.completion.tokens, gen_ai.cost.usd. Tool spans inherit the same conventions. Logs/metrics arrive in ClickHouse tables. No missing spans for agent/model/tool executions.
  • Artifacts: Query results exported, OTEL collector config committed, Grafana dashboard snapshot (optional) to confirm visual coverage.
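The span-attribute audit in step 3 can be sketched as a scan over exported span rows. Spans are modelled here as dicts with an attributes map (an assumption about the sink's shape); the required keys are the gen_ai.* attributes listed in the assertions.

```python
# Sketch of the FG-OTEL-001 attribute audit. Spans are assumed to be
# dicts with "name" and an "attributes" map, as queried from the sink;
# required keys follow the assertions above.
REQUIRED = ("gen_ai.system", "gen_ai.operation.name", "gen_ai.request.model")

def missing_genai_attrs(spans: list[dict]) -> dict[str, list[str]]:
    """Map span name -> required gen_ai.* attributes it lacks (empty = pass)."""
    gaps = {}
    for span in spans:
        absent = [key for key in REQUIRED if key not in span["attributes"]]
        if absent:
            gaps[span["name"]] = absent
    return gaps
```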

Automation & Reporting

  • Wire FG-DET-001, FG-SAFE-001, and FG-COST-001 into nightly jobs (.github/workflows/*.yaml) so regressions surface before merges. Store reports in evals/reports/ and publish to observability sinks.
  • Integrate FG-CHG-001 into CI by running fleetforge-ctl gates check against the current diff bundle before allowing deploy workflows.
  • Add FG-OTEL-001 collector checks to smoke tests executed on fresh environments (kind cluster, staging).
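One way to wire the FG-CHG-001 gate into CI is a small wrapper that shells out to the CLI and fails the job unless the decision is an explicit allow. The decision-JSON shape (an effect field) is an assumption.

```python
# Sketch of a CI wrapper for the gate check described above. The CLI
# invocation matches the documented command; the decision-JSON "effect"
# field is an assumption.
import json
import subprocess

def exit_code_for(decision: dict) -> int:
    """Fail CI (exit 1) unless the gate explicitly allowed the change."""
    return 0 if decision.get("effect") == "allow" else 1

def main(change_path: str = "change.json") -> int:
    subprocess.run(
        ["fleetforge-ctl", "gates", "check",
         "--input", change_path, "--json", "decision.json"],
        check=True,
    )
    with open("decision.json") as fh:
        return exit_code_for(json.load(fh))
```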

Operational Notes

  • All tests should run against reproducible seeds (RunSpec.seed) and pinned provider versions to maintain determinism.
  • Persist artifacts and audit exports alongside reports; ChangeOps decisions and policy artifacts must be queryable by run/change ID.
  • When failures occur, capture the exact runtime revision, policy pack versions, and tool/model hashes in the report metadata so the decision record is fully auditable.
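The failure-report metadata called for above can be captured as one structured block attached to each report. The field names here are illustrative, not a fixed schema; the contents (runtime revision, policy pack versions, tool/model hashes) are the ones the note requires.

```python
# Sketch of the audit metadata block described above. Field names are
# illustrative; callers supply the real revision, pack versions, and hashes.
def report_metadata(runtime_revision: str,
                    policy_pack_versions: dict[str, str],
                    tool_model_hashes: dict[str, str]) -> dict:
    return {
        "runtime_revision": runtime_revision,
        "policy_pack_versions": policy_pack_versions,
        "tool_model_hashes": tool_model_hashes,
    }
```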