Delivery Guarantees
FleetForge treats delivery as a first-class concern: every run must complete deterministically, surface actionable signals when it cannot, and provide the telemetry operators need to prove SLO compliance. This page explains the building blocks behind those guarantees.
Guarantees
- Durable orchestration: Runs and steps persist to Postgres with idempotent lifecycle transitions. The scheduler resumes in-flight work after restarts and respects per-step deadlines.
- Transactional outbox: Lifecycle events stream through a Postgres transactional outbox into Kafka/Redpanda. Operators can run the forwarder in at-least-once mode by default or enable Kafka transactions for exactly-once semantics.
- Backpressure-aware scheduling: Budget and inflight caps (max_inflight_*) prevent the runtime from oversubscribing workers. Deferred steps emit step_deferred events so operators see the queue backlog.
- Deterministic retries: Steps declare max_attempts, exponential backoff, and optional compensation steps. Checkpoints (StepExecutionResult::with_checkpoint) make retries idempotent even with external tool calls.
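The retry guarantee can be illustrated with a minimal Python sketch. It is not FleetForge's implementation: RetryPolicy, run_with_retries, the backoff parameters, and the dictionary-based checkpoint are all hypothetical stand-ins, and only the max_attempts name comes from the text above. The point is that a step which records its completed side effects in a checkpoint can be retried without repeating them.

```python
import time
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    # Only max_attempts is named in the docs; the other fields are assumptions.
    max_attempts: int = 3
    base_delay_s: float = 0.5
    factor: float = 2.0
    max_delay_s: float = 30.0

    def delay_for(self, attempt: int) -> float:
        """Backoff to wait after the given 1-based failed attempt."""
        return min(self.base_delay_s * self.factor ** (attempt - 1), self.max_delay_s)


def run_with_retries(step, policy, checkpoint, sleep=time.sleep):
    """Call step(checkpoint) until it succeeds or attempts run out.

    The step is expected to record finished side effects in `checkpoint`
    so that a later attempt can skip them.
    """
    last_error = None
    for attempt in range(1, policy.max_attempts + 1):
        try:
            return step(checkpoint)
        except Exception as exc:
            last_error = exc
            if attempt < policy.max_attempts:
                sleep(policy.delay_for(attempt))
    raise RuntimeError(f"step failed after {policy.max_attempts} attempts") from last_error


charges = []  # stands in for an external tool with a non-repeatable side effect


def charge_then_notify(checkpoint):
    """Example step: charge once, then notify (which fails twice before succeeding)."""
    if "charge_id" not in checkpoint:
        charges.append("charge-1")
        checkpoint["charge_id"] = "charge-1"  # checkpoint the side effect
    checkpoint["notify_calls"] = checkpoint.get("notify_calls", 0) + 1
    if checkpoint["notify_calls"] < 3:
        raise ConnectionError("notifier unavailable")
    return checkpoint["charge_id"]
```

Calling run_with_retries(charge_then_notify, RetryPolicy(), {}) retries the notification twice, while the charge happens only once because the second and third attempts find charge_id in the checkpoint.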
Runtime Components
- core/runtime/ – State machine, backpressure gates, and step lifecycle transitions.
- core/bus/ – Outbox forwarder that publishes events to Kafka/Redpanda with optional transactions (transactional.id).
- core/telemetry/ – Emits OpenTelemetry spans/metrics/logs so delivery metrics (queue lag, retries, budgets) appear in ClickHouse and Grafana.
- core/ctl/ – CLI helpers such as fleetforge-ctl delivery-mode for inspecting forwarder configuration.
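To make the outbox forwarder's role concrete, here is a minimal Python sketch of the transactional outbox pattern in general. It is not the code in core/bus/: in-memory SQLite stands in for Postgres, a plain callback stands in for the Kafka/Redpanda producer, and every table, column, and function name is hypothetical.

```python
import json
import sqlite3


def open_store() -> sqlite3.Connection:
    # In-memory SQLite stands in for Postgres; the schema is an assumption.
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE steps (id TEXT PRIMARY KEY, state TEXT NOT NULL);
        CREATE TABLE outbox (
            seq INTEGER PRIMARY KEY AUTOINCREMENT,
            payload TEXT NOT NULL,
            published INTEGER NOT NULL DEFAULT 0
        );
        """
    )
    return conn


def transition(conn, step_id, new_state):
    """Write the state change and its lifecycle event in one transaction."""
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            "INSERT INTO steps (id, state) VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET state = excluded.state",
            (step_id, new_state),
        )
        event = {"step_id": step_id, "state": new_state}
        conn.execute("INSERT INTO outbox (payload) VALUES (?)", (json.dumps(event),))


def forward_pending(conn, publish):
    """Publish unpublished rows in order and return how many were sent.

    A row is marked only after publish() returns, so a crash in between
    re-sends it on the next pass: at-least-once delivery.
    """
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE published = 0 ORDER BY seq"
    ).fetchall()
    sent = 0
    for seq, payload in rows:
        publish(json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE seq = ?", (seq,))
        sent += 1
    return sent
```

Because the event row commits together with the state change, the forwarder never publishes an event for a transition that was rolled back, and it never loses an event for one that committed.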
Operator Workflow
- Queue health: Monitor fleetforge.queue.lag_seconds and Grafana’s run latency panels (deploy/otel/dashboards/observability.json). Set alerts for rising p95 queue lag.
- Budget adherence: Use fleetforge.budget.* metrics to ensure inflight runs stay within token/cost caps.
- Replay parity: When investigating incidents, replay the run with the same seed to confirm deterministic behaviour. Drift suggests an external system changed (provider version, tool output, data source).
- Forwarder mode: For regulated workloads, enable the transactional forwarder (transactional.id=..., consumers with isolation.level=read_committed) to get exactly-once semantics.
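The replay parity check in the workflow above can be sketched in Python as a comparison of per-step outputs between the original run and its replay. This is an illustration only: toy_run is a hypothetical stand-in for a real run, and run_digest and first_drift are assumed helper names, not FleetForge APIs.

```python
import hashlib
import json
import random


def run_digest(step_outputs):
    """Stable SHA-256 over a run's ordered step outputs."""
    canonical = json.dumps(step_outputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def first_drift(original, replay):
    """Index of the first step whose output differs, or None if the runs match."""
    for index, (a, b) in enumerate(zip(original, replay)):
        if a != b:
            return index
    if len(original) != len(replay):
        return min(len(original), len(replay))
    return None


def toy_run(seed, tool_version="v1"):
    # Hypothetical run: the seed drives sampling, while tool_version models
    # an external dependency that can change between run and replay.
    rng = random.Random(seed)
    return [
        {"step": "sample", "value": rng.randint(0, 10**6)},
        {"step": "tool", "value": f"{tool_version}:{rng.randint(0, 10**6)}"},
    ]
```

With the same seed and tool version the digests match. Replaying with tool_version="v2" keeps the first step identical and makes first_drift return 1, pointing at the step where the external change surfaced.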
Related Docs
- Tutorial: Hello Fleet walkthrough
- Reference: Code structure
- How-to: Deploy with Helm
- Reference: Guardrails
Telemetry & compatibility roadmap
FleetForge leans on OpenTelemetry’s emerging GenAI semantic conventions for
trust.* attributes, but those keys are still evolving. The authoritative
policy (versioning, dual-emission windows, environment variables) now lives in
docs/reference/telemetry-compat.md. Highlights:
- Runtime releases carry a trust.semconv.version attribute that records which OTEL proposal version the spans follow, while translators in core/telemetry/ emit both the current keys and the previous stable set for one release.
- Operators can pin a workspace to a specific schema with FLEETFORGE_TRUST_SEMCONV_PIN=vX.Y while collectors upgrade, and collectors should set OTEL_SEMCONV_STABILITY_OPT_IN=gen-ai so the GenAI schema is decoded correctly.
- The observability guide under Roadmap & Status → Telemetry versioning policy documents the supported versions, exactly-once caveats (Kafka transactions stop at the outbox boundary), and how to pin sinks that require deterministic delivery.
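The dual-emission and pinning behaviour described in these highlights can be sketched as follows. The version labels and key renames are invented for illustration (the real ones are defined in docs/reference/telemetry-compat.md), and translate and pin_from_env are hypothetical names; only trust.semconv.version and FLEETFORGE_TRUST_SEMCONV_PIN come from the text.

```python
import os

CURRENT_VERSION = "v0.2"   # hypothetical version label
PREVIOUS_VERSION = "v0.1"  # hypothetical version label
RENAMED_KEYS = {           # current key -> previous stable key (both invented)
    "trust.policy.decision": "trust.decision",
    "trust.policy.id": "trust.policy",
}


def pin_from_env(environ=os.environ):
    """Read the optional schema pin, e.g. FLEETFORGE_TRUST_SEMCONV_PIN=v0.1."""
    return environ.get("FLEETFORGE_TRUST_SEMCONV_PIN") or None


def translate(attrs, pin=None):
    """Return span attributes for the pinned schema, or both sets if unpinned."""
    if pin not in (None, CURRENT_VERSION, PREVIOUS_VERSION):
        raise ValueError(f"unsupported schema pin: {pin}")
    out = {}
    for key, value in attrs.items():
        previous_key = RENAMED_KEYS.get(key)
        if previous_key is None:
            out[key] = value  # key is identical in both versions
            continue
        if pin != PREVIOUS_VERSION:
            out[key] = value  # current key
        if pin != CURRENT_VERSION:
            out[previous_key] = value  # previous stable key
    out["trust.semconv.version"] = (
        PREVIOUS_VERSION if pin == PREVIOUS_VERSION else CURRENT_VERSION
    )
    return out
```

Unpinned, a renamed attribute appears under both keys for the overlap release. Pinning to either version narrows the output to that version's keys.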
This roadmap keeps the “operate fleets like infrastructure” promise credible as the wider ecosystem standardizes on GenAI telemetry, with readiness tracked on Status & Acceptance → Telemetry compatibility + transparency roadmap.
Budget & SLO scorecards (FinOps)
FleetForge already emits fleetforge.budget.*, fleetforge.cost.*, and
fleetforge.slo.* metrics per run/workspace. The FinOps scorecard feature
productizes that telemetry:
- Workspace dashboards: The console surfaces token/cost vs SLO attainment per workspace, per policy gate, and per adapter so budget owners see where overruns occur.
- Export connectors: Built-in exports (Snowflake, BigQuery, Looker, CSV) stream the aggregated metrics nightly so Finance/Analytics teams can plug the numbers into existing reports.
- Gate integration: ChangeOps gates can now require “budget delta < X%” or “SLO attainment ≥ tier target” before approving a merge, aligning engineering and FinOps workflows.
Configure via FLEETFORGE_SCORECARDS_EXPORT_SNOWFLAKE_URL (and equivalents for
other sinks). Scorecard status lives in
Status & Acceptance → Budget/SLO scorecards.
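As a worked example of the gate conditions above ("budget delta < X%", "SLO attainment ≥ tier target"), the following Python sketch evaluates both checks against a scorecard. The Scorecard fields and the function names are assumptions made for illustration, not the ChangeOps gate API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Scorecard:
    # Hypothetical shape of the aggregated metrics a gate would read.
    baseline_cost: float    # cost before the change
    candidate_cost: float   # cost with the change applied
    slo_attainment: float   # fraction of runs meeting the SLO, 0.0 to 1.0


def budget_delta_pct(card: Scorecard) -> float:
    """Relative cost change versus the baseline, in percent."""
    if card.baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (card.candidate_cost - card.baseline_cost) * 100.0 / card.baseline_cost


def gate_passes(
    card: Scorecard,
    max_budget_delta_pct: Optional[float] = None,
    slo_target: Optional[float] = None,
) -> bool:
    """Apply whichever conditions the gate declares; omitted ones are skipped."""
    if max_budget_delta_pct is not None and not budget_delta_pct(card) < max_budget_delta_pct:
        return False
    if slo_target is not None and not card.slo_attainment >= slo_target:
        return False
    return True
```

For a baseline of 100.0 and a candidate of 104.0 the budget delta is 4%, so a gate with a 5% limit passes while a gate with a 4% limit does not, because the comparison is strict.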
Delivery mode switch & playbooks
FleetForge exposes delivery guarantees explicitly so operators know which level of durability is configured:
- Transactional outbox (default): Postgres writes + outbox forwarders provide at-least-once semantics per step. Use this mode when downstream sinks can tolerate duplicates or you need minimal latency.
- Kafka transactions: Enable transactional.id on the forwarder and set consumers to isolation.level=read_committed to achieve end-to-end exactly-once between the runtime and Kafka consumers. This requires hardened connectors and comes with slightly higher latency.
- Delivery mode switch: The CLI (fleetforge-ctl delivery-mode) and console show the active mode and let operators request a change. Mode changes emit artifacts + telemetry so the history is auditable.
- Operational playbooks: docs/how-to/delivery-playbooks.md (coming soon) describes how to switch modes safely, monitor backpressure, and recover from forwarder failures.
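The difference between the two modes comes down to a few Kafka client properties. The Python sketch below builds producer and consumer property maps for each mode. The property names (bootstrap.servers, acks, transactional.id, group.id, isolation.level, enable.auto.commit) are standard Kafka client settings. The DeliveryMode enum, the function names, and the example values are hypothetical and do not reflect how fleetforge-ctl or the forwarder are actually configured.

```python
from enum import Enum


class DeliveryMode(Enum):
    # Hypothetical labels for the two modes described above.
    AT_LEAST_ONCE = "at-least-once"
    EXACTLY_ONCE = "exactly-once"


def producer_config(mode, bootstrap_servers, transactional_id=None):
    """Kafka producer properties for the forwarder in the given mode."""
    config = {"bootstrap.servers": bootstrap_servers, "acks": "all"}
    if mode is DeliveryMode.EXACTLY_ONCE:
        if not transactional_id:
            raise ValueError("exactly-once mode needs a stable transactional.id")
        # A transactional producer is also idempotent; the id must stay the
        # same across restarts of one forwarder instance.
        config["transactional.id"] = transactional_id
    return config


def consumer_config(mode, bootstrap_servers, group_id):
    """Kafka consumer properties for a downstream reader in the given mode."""
    config = {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": False,  # commit offsets only after processing
    }
    if mode is DeliveryMode.EXACTLY_ONCE:
        # Hide records from open or aborted transactions.
        config["isolation.level"] = "read_committed"
    return config
```

In at-least-once mode the consumer keeps its client default isolation level and should deduplicate on an event key, since a forwarder restart can republish rows that were already sent.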
Documenting and instrumenting these flows keeps the scheduler’s guarantees transparent to SREs and auditors.