# Phase 3.5 — First Green Bar Acceptance Suite
Phase 3.5 turns the north-star acceptance bullets into concrete, automatable tests. The suite establishes the first "green bar" for FleetForge by proving deterministic replay, safety guardrails, budget controls, ChangeOps gating, and OpenTelemetry coverage work together end to end.
## Test Matrix
| ID | Pillar | Scenario | Tooling / Entry Point | Success Signal |
|---|---|---|---|---|
| FG-DET-001 | Determinism | Replay seeded baseline and compare drift | `python -m evals.runner` (baselines) or `fleetforge-ctl replay` | Replayed run reports <=1% token drift and step tool I/O matches baseline artifacts |
| FG-SAFE-001 | Safety | OWASP LLM01/05/06 regression pack | `python -m evals.runner evals/packs/owasp_nist` or `just evals-pack` | All scoped scenarios deny or redact with policy artifacts attached; audit log lists the decisions |
| FG-COST-001 | Cost control | Run-level budget exhaustion | `fleetforge-ctl submit` + budget row (see `examples/cost_cap_failure`) | Step transitions to denied with budget policy artifact; no additional spend recorded |
| FG-CHG-001 | Change gate | Novel PR without telemetry coverage | `fleetforge-ctl gates check --input change.json` | Gate effect returns deny (or follow_up) until a replay/eval payload is attached; decision stored and auditable |
| FG-OTEL-001 | OTEL GenAI | Emit spans/metrics for agent, model, and tool steps | `cargo test --test otel_smoke` or OTEL collector smoke harness | Spans carry `gen_ai.*` attributes, metrics surface token/cost counters, and ClickHouse tables receive data |
## Test Specifications
### FG-DET-001 — Replay Drift <= 1%

- Prereqs: Seeded scenario synced (e.g. `evals/baselines/hello_fleet.json`), runtime built from the deterministic image, identical tool/model versions pinned via provider config.
- Steps:
  - Submit the baseline run (`fleetforge-ctl submit -f examples/hello_fleet/run_spec.json`) or execute it via the eval runner.
  - Capture the resulting run ID (`RUN_ID`) and persist the run artifacts.
  - Replay with `fleetforge-ctl replay --run-id $RUN_ID --mode diff` or `python -m evals.runner --endpoint ... evals/baselines`.
  - Inspect the replay response (`drift.tokens`, `drift.tool_io`).
- Assertions: Token drift ratio <= 0.01 across every step. Tool emission payloads (`tool_io` diff) identical to baseline. Replay artifacts stored in `step_attempts` for audit (`attempt` rows reference the same hashes).
- Artifacts: Markdown/JSON drift report in `evals/reports/`, run artifact `kind=replay_diff`.
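The drift assertion above can be sketched as a small comparison routine. This is a minimal sketch, assuming each run artifact flattens to a list of step records shaped like `{"step_id", "tokens", "tool_io"}`; the real `replay_diff` schema may differ.

```python
def check_replay_drift(baseline_steps, replay_steps, max_drift=0.01):
    """Compare per-step token counts and tool I/O between baseline and replay.

    Returns a list of human-readable failure strings; an empty list means the
    replay satisfies the <=1% drift bar. The record shape is an assumption.
    """
    baseline = {s["step_id"]: s for s in baseline_steps}
    replay = {s["step_id"]: s for s in replay_steps}
    failures = []
    for step_id, base in baseline.items():
        rep = replay.get(step_id)
        if rep is None:
            failures.append(f"{step_id}: step missing from replay")
            continue
        # Drift ratio relative to the baseline token count (guard divide-by-zero).
        drift = abs(rep["tokens"] - base["tokens"]) / max(base["tokens"], 1)
        if drift > max_drift:
            failures.append(f"{step_id}: token drift {drift:.3f} exceeds {max_drift}")
        # Tool emissions must match byte-for-byte for the determinism bar.
        if rep["tool_io"] != base["tool_io"]:
            failures.append(f"{step_id}: tool I/O diverged from baseline")
    return failures
```

A nightly job could run this over the stored artifacts and fail the build when the returned list is non-empty.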
### FG-SAFE-001 — OWASP Guardrail Coverage

- Scope: OWASP LLM01 (prompt injection), LLM05 (model supply chain / output handling), LLM06 (excessive agency).
- Prereqs: OWASP + NIST pack synced (`just evals-sync`) and Context Firewall / policy packs enabled (`prompt_injection`, `tool_acl`, `budget_caps`).
- Steps:
  - Run the suite: `python -m evals.runner --endpoint $ENDPOINT evals/packs/owasp_nist --markdown owasp.md --json owasp.json`.
  - Filter results for slugs `owasp_llm01_*`, `owasp_llm05_*`, `owasp_llm06_*`.
  - Query the audit log via `fleetforge-ctl audit export --since <timestamp>` for the corresponding runs.
- Assertions: Each scoped scenario returns `status=blocked|denied` (or succeeds with a redaction artifact when policy redaction is expected). Audit log contains entries tagged `policy.pack=<pack_name>` with effect `deny`/`redact`. OTEL spans include a `fleetforge.policy.events` increment and carry `policy.effect`.
- Artifacts: `owasp.md` report summarising pass/fail, per-run artifacts `kind=policy_decision`, audit log JSONL snippet stored with the report.
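The slug filter plus status check can be automated over the `owasp.json` output. A minimal sketch, assuming each result is a dict like `{"slug", "status", "artifacts": [{"kind": ...}]}`; the real runner's JSON layout may differ.

```python
SCOPED_PREFIXES = ("owasp_llm01_", "owasp_llm05_", "owasp_llm06_")

def scoped_guardrail_failures(results, prefixes=SCOPED_PREFIXES):
    """Return slugs of scoped scenarios that neither blocked/denied nor redacted.

    A scenario passes if its status is blocked/denied, or if it succeeded but
    carries a policy-decision artifact (the expected-redaction case).
    """
    failures = []
    for result in results:
        if not result["slug"].startswith(prefixes):
            continue  # outside the LLM01/05/06 scope
        redacted = any(
            artifact.get("kind") == "policy_decision"
            for artifact in result.get("artifacts", [])
        )
        if result["status"] not in ("blocked", "denied") and not redacted:
            failures.append(result["slug"])
    return failures
```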
### FG-COST-001 — Budget Cap Enforcement

- Prereqs: Runtime connected to Postgres, `examples/cost_cap_failure/run_spec.json`, budget ledger migrations applied.
- Steps:
  - Submit the example run (`fleetforge-ctl submit -f examples/cost_cap_failure/run_spec.json`).
  - Insert a run-scoped budget row with a low `cost_limit` (see the example README).
  - Tail the run (`fleetforge-ctl tail --run-id $RUN_ID`) until the policy denial.
- Assertions: Run ends `failed` with terminal step event `policy_denied`. `budgets` table shows `cost_used` capped at the limit, ledger rows mirror the capped spend, and the ChangeOps budget summary would list a breach. Policy artifact records `effect=deny` with `reason="budget_cap_exceeded"`.
- Artifacts: Run artifact `kind=budget_guardrail`, ledger snapshot exported to JSON for traceability.
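The "no additional spend after denial" assertion amounts to checking that cumulative ledger spend never exceeds the cap. A minimal sketch over the exported ledger snapshot, assuming chronological rows with a `cost_usd` field (an illustrative name, not a confirmed column).

```python
def verify_budget_cap(ledger_rows, cost_limit):
    """Walk a run's ledger rows in order and confirm spend stays within the cap.

    Returns (ok, total): ok is False as soon as cumulative spend exceeds the
    limit, meaning spend was recorded after the cap should have denied it.
    """
    total = 0.0
    for row in ledger_rows:
        total += row["cost_usd"]
        # Small epsilon guards float rounding on an exactly-capped run.
        if total > cost_limit + 1e-9:
            return False, total
    return True, total
```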
### FG-CHG-001 — Change Gate Blocks Novelty Without Telemetry

- Prereqs: ChangeOps crate compiled, migrations (`0014_changeops_gates.sql`) applied, CLI configured with a writer token.
- Steps:
  - Prepare `change.json` containing a `ChangeGateRequest` with a high `novelty_score` (>0.8), `coverage.overall_ratio` below `minimum_ratio`, and missing eval metrics.
  - Invoke `fleetforge-ctl gates check --input change.json --json decision.json`.
  - Examine `decision.json`, then re-run with an added eval metric referencing a recent replay (score >= threshold) to observe the transition to `allow`.
- Assertions: Initial decision `effect` is `deny` or `follow_up`, `reasons` mention novelty/coverage gaps, and `scorecard.coverage.components_below_threshold` is populated. Decision persisted in the `change_gates` table with a matching payload, OTEL span `change_gate.effect` set accordingly. After injecting replay/eval evidence, the decision flips to `allow`.
- Artifacts: Decision artifact (`kind=change_gate_decision`) attached to the change record, CLI JSON output stored for audit.
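The deny-then-allow flow above can be illustrated with a toy decision function. This is a sketch, not the ChangeOps implementation: the field names mirror the prose (`novelty_score`, `coverage.overall_ratio`, `eval_metrics`), and the `minimum_ratio=0.9` default is an assumption for illustration.

```python
def gate_decision(request, novelty_threshold=0.8, minimum_ratio=0.9):
    """Toy approximation of the change-gate logic exercised by FG-CHG-001.

    Denies when a highly novel change lacks eval evidence or when coverage
    falls below the configured minimum; otherwise allows.
    """
    reasons = []
    if request["novelty_score"] > novelty_threshold and not request.get("eval_metrics"):
        reasons.append("high novelty_score without replay/eval evidence")
    if request["coverage"]["overall_ratio"] < minimum_ratio:
        reasons.append("coverage.overall_ratio below minimum_ratio")
    effect = "deny" if reasons else "allow"
    return {"effect": effect, "reasons": reasons}
```

Submitting the same request again with an eval metric attached and coverage restored flips the effect to `allow`, which is exactly the transition the test asserts.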
### FG-OTEL-001 — GenAI Telemetry Coverage

- Prereqs: OTEL collector reachable (use `deploy/otel/collector.local.yaml`), `OTEL_EXPORTER_OTLP_ENDPOINT` set, ClickHouse or an in-memory sink available.
- Steps:
  - Run the integration test `cargo test --test otel_smoke -- --nocapture`, or start the runtime with the collector via `just observability`.
  - Execute a representative run (e.g. the hello-fleet example) while the collector is active.
  - Query the telemetry sink (`SELECT * FROM otel_traces LIMIT 5`) and inspect span attributes.
- Assertions: Each step produces spans with `gen_ai.system`, `gen_ai.operation.name`, `gen_ai.request.model`, `gen_ai.response.model` (where applicable) and counters `gen_ai.prompt.tokens`, `gen_ai.completion.tokens`, `gen_ai.cost.usd`. Tool spans inherit the same conventions. Logs/metrics arrive in ClickHouse tables. No missing spans for agent/model/tool executions.
- Artifacts: Query results exported, OTEL collector config committed, Grafana dashboard snapshot (optional) to confirm visual coverage.
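The span-attribute assertion can be checked mechanically over rows read back from the sink. A minimal sketch, assuming spans deserialize to dicts like `{"name": ..., "attributes": {...}}`; the real `otel_traces` row layout may differ.

```python
# Attributes every model span must carry per the assertion above;
# gen_ai.response.model is "where applicable", so it is not required here.
REQUIRED_SPAN_ATTRS = ("gen_ai.system", "gen_ai.operation.name", "gen_ai.request.model")

def missing_genai_attributes(spans):
    """Map each non-compliant span name to the gen_ai.* attributes it lacks.

    An empty dict means every inspected span satisfies the convention.
    """
    report = {}
    for span in spans:
        attrs = span.get("attributes", {})
        missing = [key for key in REQUIRED_SPAN_ATTRS if key not in attrs]
        if missing:
            report[span["name"]] = missing
    return report
```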
## Automation & Reporting

- Wire FG-DET-001, FG-SAFE-001, and FG-COST-001 into nightly jobs (`.github/workflows/*.yaml`) so regressions surface before merges. Store reports in `evals/reports/` and publish them to observability sinks.
- Integrate FG-CHG-001 into CI by running `fleetforge-ctl gates check` against the current diff bundle before allowing deploy workflows.
- Add FG-OTEL-001 collector checks to the smoke tests executed on fresh environments (kind cluster, staging).
## Operational Notes

- All tests should run against reproducible seeds (`RunSpec.seed`) and pinned provider versions to maintain determinism.
- Persist artifacts and audit exports alongside reports; ChangeOps decisions and policy artifacts must be queryable by run/change ID.
- When failures occur, capture the exact runtime revision, policy pack versions, and tool/model hashes in the report metadata so the decision record is fully auditable.
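The failure-report metadata described above can be assembled into a stable, fingerprinted blob so later audits can verify it was not altered. Field names here are illustrative assumptions; adapt them to the real report schema.

```python
import hashlib
import json

def build_report_metadata(runtime_revision, policy_packs, tool_model_hashes, seed):
    """Assemble audit metadata for a failed run into a deterministic dict.

    policy_packs and tool_model_hashes are name -> version/hash mappings;
    sorting keys makes the serialized blob (and fingerprint) reproducible.
    """
    meta = {
        "runtime_revision": runtime_revision,
        "policy_packs": dict(sorted(policy_packs.items())),
        "tool_model_hashes": dict(sorted(tool_model_hashes.items())),
        "seed": seed,
    }
    blob = json.dumps(meta, sort_keys=True)
    # A content hash lets auditors confirm the metadata matches the report.
    meta["fingerprint"] = hashlib.sha256(blob.encode()).hexdigest()
    return meta
```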