Delivery Guarantees

FleetForge treats delivery as a first-class concern: every run must complete deterministically, surface actionable signals when it cannot, and provide the telemetry operators need to prove SLO compliance. This page explains the building blocks behind those guarantees.

Guarantees

  • Durable orchestration: Runs and steps persist to Postgres with idempotent lifecycle transitions. The scheduler resumes in-flight work after restarts and respects per-step deadlines.
  • Transactional outbox: Lifecycle events stream through a Postgres transactional outbox into Kafka/Redpanda. The forwarder runs in at-least-once mode by default; operators can enable Kafka transactions to get exactly-once semantics for consumers that read with isolation.level=read_committed.
  • Backpressure-aware scheduling: Budget and inflight caps (max_inflight_*) prevent the runtime from oversubscribing workers. Deferred steps emit step_deferred events so operators see the queue backlog.
  • Deterministic retries: Steps declare max_attempts, exponential backoff, and optional compensation steps. Checkpoints (StepExecutionResult::with_checkpoint) make retries idempotent even with external tool calls.
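
The retry and checkpoint behaviour described above can be illustrated with a minimal Python sketch. It is not FleetForge's implementation: StepState, backoff_delay, and run_step are hypothetical names, and the in-memory state stands in for the Postgres-backed step record. The point is that once the external call has succeeded and its result is checkpointed, later attempts reuse the checkpoint instead of repeating the side effect.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class StepState:
    """Hypothetical per-step record (FleetForge persists step state in Postgres)."""
    attempts: int = 0
    checkpoint: Optional[str] = None  # saved result of the external call
    delays: List[float] = field(default_factory=list)


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Exponential backoff without jitter: base * 2^(attempt - 1), capped."""
    return min(cap, base * 2 ** (attempt - 1))


def run_step(
    state: StepState,
    call_tool: Callable[[], str],
    finish: Callable[[str], str],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = lambda seconds: None,
) -> str:
    """Run a step with bounded retries, skipping the tool call once checkpointed."""
    while state.attempts < max_attempts:
        state.attempts += 1
        try:
            if state.checkpoint is None:
                # The external side effect runs only until it succeeds once.
                state.checkpoint = call_tool()
            return finish(state.checkpoint)
        except Exception:
            if state.attempts == max_attempts:
                raise  # attempts exhausted: surface the failure
            delay = backoff_delay(state.attempts)
            state.delays.append(delay)
            sleep(delay)
    raise RuntimeError("no attempts left for this step")
```

Passing sleep as a parameter keeps the sketch testable; a real scheduler would more likely persist the next attempt time than block a worker.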

Runtime Components

  • core/runtime/ – State machine, backpressure gates, and step lifecycle transitions.
  • core/bus/ – Outbox forwarder that publishes events to Kafka/Redpanda with optional transactions (transactional.id).
  • core/telemetry/ – Emits OpenTelemetry spans/metrics/logs so delivery metrics (queue lag, retries, budgets) appear in ClickHouse and Grafana.
  • core/ctl/ – CLI helpers such as fleetforge-ctl delivery-mode for inspecting forwarder configuration.
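
As a rough illustration of the backpressure gates that core/runtime/ is described as owning, the following Python sketch admits steps only while an inflight cap has room and records a step_deferred event otherwise. InflightGate, its methods, and the step_started event name are assumptions made for this example; only step_deferred and the max_inflight_* naming come from this page.

```python
from collections import deque


class InflightGate:
    """Hypothetical gate: admit a step only while inflight < max_inflight."""

    def __init__(self, max_inflight: int) -> None:
        self.max_inflight = max_inflight
        self.inflight = set()
        self.deferred = deque()
        self.events = []  # stand-in for emitted lifecycle events

    def submit(self, step_id: str) -> bool:
        """Start the step if there is capacity, otherwise defer it."""
        if len(self.inflight) < self.max_inflight:
            self.inflight.add(step_id)
            self.events.append(("step_started", step_id))  # assumed event name
            return True
        self.deferred.append(step_id)
        self.events.append(("step_deferred", step_id))
        return False

    def complete(self, step_id: str) -> None:
        """Release a slot and promote the oldest deferred step, if any."""
        self.inflight.discard(step_id)
        if self.deferred and len(self.inflight) < self.max_inflight:
            promoted = self.deferred.popleft()
            self.inflight.add(promoted)
            self.events.append(("step_started", promoted))
```

Promoting deferred steps in FIFO order is one simple policy; the actual scheduler may also weigh budgets and deadlines.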

Operator Workflow

  1. Queue health: Monitor fleetforge.queue.lag_seconds and Grafana’s run latency panels (deploy/otel/dashboards/observability.json). Set alerts for rising p95 queue lag.
  2. Budget adherence: Use fleetforge.budget.* metrics to ensure inflight runs stay within token/cost caps.
  3. Replay parity: When investigating incidents, replay the run with the same seed to confirm deterministic behaviour. Drift suggests an external system changed (provider version, tool output, data source).
  4. Forwarder mode: For regulated workloads, enable the transactional forwarder (transactional.id=..., consumers with isolation.level=read_committed) to get exactly-once semantics.
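
The replay-parity check in step 3 can be pictured with a small Python sketch. It assumes a hypothetical run whose only source of nondeterminism is a seeded random generator; run_steps, digest, and first_drift are illustrative names, not FleetForge APIs. Comparing digests gives a quick yes/no answer, and first_drift points at the first step where the replay diverged.

```python
import hashlib
import json
import random
from typing import List, Optional


def run_steps(seed: int, steps: int = 5) -> List[dict]:
    """Hypothetical run: every random choice flows from the seed."""
    rng = random.Random(seed)
    return [{"step": i, "choice": rng.randint(0, 9999)} for i in range(steps)]


def digest(events: List[dict]) -> str:
    """Stable fingerprint of a run's event sequence."""
    payload = json.dumps(events, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def first_drift(original: List[dict], replay: List[dict]) -> Optional[int]:
    """Index of the first differing step, or None when the runs match."""
    for index, (a, b) in enumerate(zip(original, replay)):
        if a != b:
            return index
    if len(original) != len(replay):
        return min(len(original), len(replay))
    return None
```

If the digests differ even though the seed is identical, the step at first_drift is where to look for a changed provider version, tool output, or data source.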

Telemetry & compatibility roadmap

FleetForge leans on OpenTelemetry’s emerging GenAI semantic conventions for trust.* attributes, but those keys are still evolving. The authoritative policy (versioning, dual-emission windows, environment variables) now lives in docs/reference/telemetry-compat.md. Highlights:

  • Runtime releases carry a trust.semconv.version attribute that records which OpenTelemetry proposal version the spans follow. For one release, translators in core/telemetry/ emit both the current keys and the previous stable set.
  • Operators can pin a workspace to a specific schema with FLEETFORGE_TRUST_SEMCONV_PIN=vX.Y while collectors upgrade, and collectors should set OTEL_SEMCONV_STABILITY_OPT_IN=gen-ai so the GenAI schema is decoded correctly.
  • The observability guide under Roadmap & Status → Telemetry versioning policy documents the supported versions, exactly-once caveats (Kafka transactions stop at the outbox boundary), and how to pin sinks that require deterministic delivery.
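
The dual-emission window mentioned in the highlights can be sketched as follows. This is an assumption-heavy illustration in Python: the rename table and both trust.* key names are invented for the example, and dual_emit is not a function from core/telemetry/. Only the trust.semconv.version attribute comes from this page.

```python
from typing import Dict

# Hypothetical rename table: current key -> previous stable key.
HYPOTHETICAL_RENAMES: Dict[str, str] = {
    "trust.decision.score": "trust.score",
}


def dual_emit(
    span_attrs: dict,
    semconv_version: str,
    renames: Dict[str, str] = HYPOTHETICAL_RENAMES,
) -> dict:
    """Return a copy of the attributes carrying both current and previous keys."""
    out = dict(span_attrs)
    for current, previous in renames.items():
        # Mirror the value under the old key unless it is already present.
        if current in span_attrs and previous not in out:
            out[previous] = span_attrs[current]
    # Record which convention version the current keys follow.
    out["trust.semconv.version"] = semconv_version
    return out
```

Consumers that still read the previous key keep working during the window, while upgraded consumers can use trust.semconv.version to decide which keys to trust.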

This roadmap keeps the “operate fleets like infrastructure” promise credible as the wider ecosystem standardizes on GenAI telemetry, with readiness tracked on Status & Acceptance → Telemetry compatibility + transparency roadmap.

Budget & SLO scorecards (FinOps)

FleetForge already emits fleetforge.budget.*, fleetforge.cost.*, and fleetforge.slo.* metrics per run/workspace. The FinOps scorecard feature productizes that telemetry:

  • Workspace dashboards: The console surfaces token/cost vs SLO attainment per workspace, per policy gate, and per adapter so budget owners see where overruns occur.
  • Export connectors: Built-in exports (Snowflake, BigQuery, Looker, CSV) stream the aggregated metrics nightly so Finance/Analytics teams can plug the numbers into existing reports.
  • Gate integration: ChangeOps gates can now require “budget delta < X%” or “SLO attainment ≥ tier target” before approving a merge, aligning engineering and FinOps workflows.
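
To make the two gate conditions concrete, here is a minimal Python sketch of how "budget delta < X%" and "SLO attainment ≥ tier target" might be evaluated together. The function names, parameters, and the choice of a baseline-relative percentage are assumptions for illustration; the actual gate configuration is not described on this page.

```python
def budget_delta_pct(baseline_cost: float, candidate_cost: float) -> float:
    """Percentage change in cost relative to the baseline."""
    if baseline_cost <= 0:
        raise ValueError("baseline_cost must be positive")
    return (candidate_cost - baseline_cost) / baseline_cost * 100.0


def gate_passes(
    baseline_cost: float,
    candidate_cost: float,
    slo_attainment: float,
    max_delta_pct: float,
    slo_target: float,
) -> bool:
    """Approve only if the cost increase is under the cap and the SLO target is met."""
    within_budget = budget_delta_pct(baseline_cost, candidate_cost) < max_delta_pct
    meets_slo = slo_attainment >= slo_target
    return within_budget and meets_slo
```

For example, with a 5% cap and a 0.99 target, a change that raises cost from 100 to 104 at 0.995 attainment passes, while one that raises cost to 110 does not.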

Configure via FLEETFORGE_SCORECARDS_EXPORT_SNOWFLAKE_URL (and equivalents for other sinks). Scorecard status lives in Status & Acceptance → Budget/SLO scorecards.

Delivery mode switch & playbooks

FleetForge exposes delivery guarantees explicitly so operators know which level of durability is configured:

  • Transactional outbox (default): Postgres writes + outbox forwarders provide at-least-once semantics per step. Use this mode when downstream sinks can tolerate duplicates or you need minimal latency.
  • Kafka transactions: Enable transactional.id on the forwarder and set consumers to isolation.level=read_committed to achieve end-to-end exactly-once between the runtime and Kafka consumers. This mode requires hardened connectors and adds slightly higher latency.
  • Delivery mode switch: The CLI (fleetforge-ctl delivery-mode) and console show the active mode and let operators request a change. Mode changes emit artifacts + telemetry so the history is auditable.
  • Operational playbooks: docs/how-to/delivery-playbooks.md (coming soon) describes how to switch modes safely, monitor backpressure, and recover from forwarder failures.
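
The default at-least-once mode is easier to reason about with a toy model. The Python sketch below uses an in-memory SQLite database in place of Postgres and a plain callback in place of Kafka; the table layout, function names, and event ID format are all assumptions for illustration. It shows why duplicates arise (the forwarder can crash after publishing but before marking the row as sent) and how a consumer that deduplicates on event ID tolerates them.

```python
import sqlite3
from typing import Callable


def open_db() -> sqlite3.Connection:
    """Create hypothetical steps and outbox tables in an in-memory database."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE steps (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    db.execute(
        "CREATE TABLE outbox ("
        "seq INTEGER PRIMARY KEY AUTOINCREMENT, "
        "event_id TEXT UNIQUE NOT NULL, "
        "payload TEXT NOT NULL, "
        "sent INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def complete_step(db: sqlite3.Connection, step_id: str) -> None:
    """Write the state change and its event in one transaction."""
    with db:  # commits on success, rolls back on error
        db.execute("INSERT OR REPLACE INTO steps VALUES (?, 'succeeded')", (step_id,))
        db.execute(
            "INSERT OR IGNORE INTO outbox (event_id, payload) VALUES (?, ?)",
            (f"{step_id}:succeeded", f"step {step_id} succeeded"),
        )


def forward(
    db: sqlite3.Connection,
    publish: Callable[[str, str], None],
    crash_before_ack: bool = False,
) -> None:
    """Publish unsent rows in order, then mark each one as sent."""
    rows = db.execute(
        "SELECT seq, event_id, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, event_id, payload in rows:
        publish(event_id, payload)
        if crash_before_ack:
            return  # simulated crash: the row stays unsent and is republished later
        with db:
            db.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))


class DedupConsumer:
    """Consumer that tolerates duplicates by remembering event IDs."""

    def __init__(self) -> None:
        self.seen = set()
        self.applied = []

    def handle(self, event_id: str, payload: str) -> None:
        if event_id in self.seen:
            return  # duplicate delivery, already applied
        self.seen.add(event_id)
        self.applied.append(payload)
```

Calling forward once with crash_before_ack=True and then again normally publishes the same event twice, yet the consumer applies it once. Kafka transactions aim to remove that duplicate for read_committed consumers, at the latency cost noted above.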

Documenting and instrumenting these flows keeps the scheduler’s guarantees transparent to SREs and auditors.