
Phase 3.5 — First Green Bar Acceptance Suite

Phase 3.5 turns the north-star acceptance bullets into concrete, automatable tests. The suite establishes the first "green bar" for FleetForge by proving deterministic replay, safety guardrails, budget controls, ChangeOps gating, and OpenTelemetry coverage work together end to end.

Test Matrix

| ID | Pillar | Scenario | Tooling / Entry Point | Success Signal |
|----|--------|----------|-----------------------|----------------|
| FG-DET-001 | Determinism | Replay seeded baseline and compare drift | python -m evals.runner (baselines) or fleetforge-ctl replay | Replayed run reports <= 1% token drift and step tool I/O matches baseline artifacts |
| FG-SAFE-001 | Safety | OWASP LLM01/05/06 regression pack | python -m evals.runner evals/packs/owasp_nist or just evals-pack | All scoped scenarios deny or redact with policy artifacts attached; audit log lists the decisions |
| FG-COST-001 | Cost control | Run-level budget exhaustion | fleetforge-ctl submit + budget row (see examples/cost_cap_failure) | Step transitions to denied with budget policy artifact; no additional spend recorded |
| FG-CHG-001 | Change gate | Novel PR without telemetry coverage | fleetforge-ctl gates check --input change.json | Gate effect returns deny (or follow_up) until replay/eval payload is attached; decision stored and auditable |
| FG-OTEL-001 | OTEL GenAI | Emit spans/metrics for agent, model, tool steps | cargo test --test otel_smoke or OTEL collector smoke harness | Spans carry gen_ai.* attributes, metrics surface token/cost counters, ClickHouse tables receive data |

Test Specifications

FG-DET-001 — Replay Drift <= 1%

  • Prereqs: Seeded scenario synced (e.g. evals/baselines/hello_fleet.json), runtime built from deterministic image, identical tool/model versions pinned via provider config.
  • Steps:
    1. Submit the baseline run (fleetforge-ctl submit -f examples/hello_fleet/run_spec.json) or execute via eval runner.
    2. Capture the resulting run ID (RUN_ID) and persist the run artifacts.
    3. Replay with fleetforge-ctl replay --run-id $RUN_ID --mode diff or python -m evals.runner --endpoint ... evals/baselines.
    4. Inspect the replay response (drift.tokens, drift.tool_io).
  • Assertions: Token drift ratio <= 0.01 across every step. Tool emission payloads (tool_io diff) identical to baseline. Replay artifacts stored in step_attempts for audit (attempt rows reference the same hashes).
  • Artifacts: Markdown/JSON drift report in evals/reports/, run artifact kind=replay_diff.

FG-SAFE-001 — OWASP Guardrail Coverage

  • Scope: OWASP LLM01 (prompt injection), LLM05 (improper output handling), LLM06 (excessive agency).
  • Prereqs: OWASP + NIST pack synced (just evals-sync) and Context Firewall / policy packs enabled (prompt_injection, tool_acl, budget_caps).
  • Steps:
    1. Run the suite: python -m evals.runner --endpoint $ENDPOINT evals/packs/owasp_nist --markdown owasp.md --json owasp.json.
    2. Filter results for slugs owasp_llm01_*, owasp_llm05_*, owasp_llm06_*.
    3. Query audit log via fleetforge-ctl audit export --since <timestamp> for corresponding runs.
  • Assertions: Each scoped scenario returns status=blocked|denied (or succeeds with redaction artifact when policy redaction is expected). Audit log contains entries tagged policy.pack=<pack_name> with effect deny/redact. OTEL spans include fleetforge.policy.events increment and carry policy.effect.
  • Artifacts: owasp.md report summarising pass/fail, per-run artifacts kind=policy_decision, audit log JSONL snippet stored with the report.
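The filtering in step 2 plus the status assertion can be sketched as a pass over the runner's JSON output. The result-record fields (slug, status, artifacts) are assumptions; the passing statuses and the redaction escape hatch follow the assertions above.

```python
# Sketch of the FG-SAFE-001 result filter. Record fields are assumed from
# the runner's JSON output; a scenario passes if it was blocked/denied, or
# if it succeeded with an expected redaction artifact attached.
import fnmatch

SCOPED = ("owasp_llm01_*", "owasp_llm05_*", "owasp_llm06_*")
PASSING = {"blocked", "denied"}

def scoped_failures(results: list[dict]) -> list[str]:
    failures = []
    for record in results:
        if not any(fnmatch.fnmatch(record["slug"], pat) for pat in SCOPED):
            continue  # outside the LLM01/05/06 scope
        redacted = record.get("artifacts", {}).get("redaction", False)
        if record["status"] not in PASSING and not redacted:
            failures.append(record["slug"])
    return failures
```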

FG-COST-001 — Budget Cap Enforcement

  • Prereqs: Runtime connected to Postgres, examples/cost_cap_failure/run_spec.json, budget ledger migrations applied.
  • Steps:
    1. Submit the example run (fleetforge-ctl submit -f examples/cost_cap_failure/run_spec.json).
    2. Insert a run-scoped budget row with low cost_limit (see example README).
    3. Tail the run (fleetforge-ctl tail --run-id $RUN_ID) until policy denial.
  • Assertions: The run ends in the failed state with a terminal policy_denied step event. The budgets table shows cost_used capped at the limit, ledger rows mirror the capped spend, and the ChangeOps budget summary would list a breach. The policy artifact records effect=deny with reason="budget_cap_exceeded".
  • Artifacts: Run artifact kind=budget_guardrail, ledger snapshot exported to JSON for traceability.
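The ledger-side assertions can be expressed as one invariant check; the row and artifact shapes here are assumptions, but the invariants (spend capped at the limit, deny artifact with the stated reason) are exactly the ones listed above.

```python
# Sketch of the FG-COST-001 post-run assertions. The budget-row and
# policy-artifact dict shapes are assumed; the invariants mirror the
# acceptance criteria above.
def assert_budget_enforced(budget_row: dict, policy_artifact: dict) -> None:
    # Spend must never exceed the cap, even on the denying attempt.
    assert budget_row["cost_used"] <= budget_row["cost_limit"], "spend exceeded cap"
    # The terminal policy artifact must record the denial and its reason.
    assert policy_artifact["effect"] == "deny"
    assert policy_artifact["reason"] == "budget_cap_exceeded"
```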

FG-CHG-001 — Change Gate Blocks Novelty Without Telemetry

  • Prereqs: ChangeOps crate compiled, migrations (0014_changeops_gates.sql) applied, CLI configured with writer token.
  • Steps:
    1. Prepare change.json containing a ChangeGateRequest with high novelty_score (>0.8), coverage.overall_ratio below minimum_ratio, and missing eval metrics.
    2. Invoke fleetforge-ctl gates check --input change.json --json decision.json.
    3. Examine decision.json, then re-run with added eval metric referencing recent replay (score >= threshold) to observe transition to allow.
  • Assertions: Initial decision effect is deny or follow_up, reasons mention novelty/coverage gaps, scorecard.coverage.components_below_threshold populated. Decision persisted in change_gates table with matching payload, OTEL span change_gate.effect set accordingly. After injecting replay/eval evidence, decision flips to allow.
  • Artifacts: Decision artifact (kind=change_gate_decision) attached to change record, CLI JSON output stored for audit.
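The deny-then-allow behaviour exercised above can be sketched as a tiny decision function. The 0.8 novelty bound comes from the prereqs; the default minimum_ratio and the metric score/threshold fields are assumptions, not the real ChangeGateRequest schema.

```python
# Minimal sketch of the gate logic FG-CHG-001 exercises: deny a novel or
# under-covered change until passing replay/eval evidence is attached.
# Field names and the default minimum_ratio are assumptions.
NOVELTY_LIMIT = 0.8

def gate_effect(request: dict, minimum_ratio: float = 0.9) -> str:
    novel = request["novelty_score"] > NOVELTY_LIMIT
    thin_coverage = request["coverage"]["overall_ratio"] < minimum_ratio
    has_evidence = any(
        metric.get("score", 0.0) >= metric.get("threshold", 1.0)
        for metric in request.get("eval_metrics", [])
    )
    if (novel or thin_coverage) and not has_evidence:
        return "deny"
    return "allow"
```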

FG-OTEL-001 — GenAI Telemetry Coverage

  • Prereqs: OTEL collector reachable (use deploy/otel/collector.local.yaml), OTEL_EXPORTER_OTLP_ENDPOINT set, ClickHouse or in-memory sink available.
  • Steps:
    1. Run integration test cargo test --test otel_smoke -- --nocapture or start runtime with collector via just observability.
    2. Execute a representative run (e.g. hello-fleet example) while collector is active.
    3. Query the telemetry sink (SELECT * FROM otel_traces LIMIT 5) and inspect span attributes.
  • Assertions: Each step produces spans with gen_ai.system, gen_ai.operation.name, gen_ai.request.model, gen_ai.response.model (where applicable) and counters gen_ai.prompt.tokens, gen_ai.completion.tokens, gen_ai.cost.usd. Tool spans inherit the same conventions. Logs/metrics arrive in ClickHouse tables. No missing spans for agent/model/tool executions.
  • Artifacts: Query results exported, OTEL collector config committed, Grafana dashboard snapshot (optional) to confirm visual coverage.
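The span-attribute audit in step 3 can be sketched as a scan over exported span rows. Spans are modelled here as dicts with an attributes map (an assumption about the sink's shape); the required keys are the gen_ai.* attributes listed in the assertions.

```python
# Sketch of the FG-OTEL-001 attribute audit. Spans are assumed to be
# dicts with "name" and an "attributes" map, as queried from the sink;
# required keys follow the assertions above.
REQUIRED = ("gen_ai.system", "gen_ai.operation.name", "gen_ai.request.model")

def missing_genai_attrs(spans: list[dict]) -> dict[str, list[str]]:
    """Map span name -> required gen_ai.* attributes it lacks (empty = pass)."""
    gaps = {}
    for span in spans:
        absent = [key for key in REQUIRED if key not in span["attributes"]]
        if absent:
            gaps[span["name"]] = absent
    return gaps
```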

Automation & Reporting

  • Wire FG-DET-001, FG-SAFE-001, and FG-COST-001 into nightly jobs (.github/workflows/*.yaml) so regressions surface before merges. Store reports in evals/reports/ and publish to observability sinks.
  • Integrate FG-CHG-001 into CI by running fleetforge-ctl gates check against the current diff bundle before allowing deploy workflows.
  • Add FG-OTEL-001 collector checks to smoke tests executed on fresh environments (kind cluster, staging).
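One way to wire the FG-CHG-001 gate into CI is a small wrapper that shells out to the CLI and fails the job unless the decision is an explicit allow. The decision-JSON shape (an effect field) is an assumption.

```python
# Sketch of a CI wrapper for the gate check described above. The CLI
# invocation matches the documented command; the decision-JSON "effect"
# field is an assumption.
import json
import subprocess

def exit_code_for(decision: dict) -> int:
    """Fail CI (exit 1) unless the gate explicitly allowed the change."""
    return 0 if decision.get("effect") == "allow" else 1

def main(change_path: str = "change.json") -> int:
    subprocess.run(
        ["fleetforge-ctl", "gates", "check",
         "--input", change_path, "--json", "decision.json"],
        check=True,
    )
    with open("decision.json") as fh:
        return exit_code_for(json.load(fh))
```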

Operational Notes

  • All tests should run against reproducible seeds (RunSpec.seed) and pinned provider versions to maintain determinism.
  • Persist artifacts and audit exports alongside reports; ChangeOps decisions and policy artifacts must be queryable by run/change ID.
  • When failures occur, capture the exact runtime revision, policy pack versions, and tool/model hashes in the report metadata so the decision record is fully auditable.
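The failure-report metadata called for above can be captured as one structured block attached to each report. The field names here are illustrative, not a fixed schema; the contents (runtime revision, policy pack versions, tool/model hashes) are the ones the note requires.

```python
# Sketch of the audit metadata block described above. Field names are
# illustrative; callers supply the real revision, pack versions, and hashes.
def report_metadata(runtime_revision: str,
                    policy_pack_versions: dict[str, str],
                    tool_model_hashes: dict[str, str]) -> dict:
    return {
        "runtime_revision": runtime_revision,
        "policy_pack_versions": policy_pack_versions,
        "tool_model_hashes": tool_model_hashes,
    }
```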