
Set Up ClickHouse & Grafana

This guide connects FleetForge’s OpenTelemetry stream to ClickHouse and Grafana so operators can inspect latency, policy, and budget dashboards.

1. Run the local collector (optional)

just observability

The helper spins up ClickHouse and the OTEL Collector using deploy/otel/collector.local.yaml. The collector listens on http://localhost:4317.

Point the runtime (or just demo) at the collector:

export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317

Browse raw telemetry tables at http://localhost:8123.

2. Configure the collector in other environments

  • Start from deploy/compose/otel-collector-config.yaml to fan OTLP data into ClickHouse via the community exporter.
  • Create the telemetry database manually (for example, CREATE DATABASE telemetry;). The exporter creates tables automatically.
  • Expose the Prometheus scrape endpoint (:9464) if you plan to alert on runtime metrics.
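For orientation, a minimal collector pipeline along those lines might look like the sketch below. Field names follow the community ClickHouse exporter's conventions, but the endpoints and database values are illustrative; the canonical settings live in deploy/compose/otel-collector-config.yaml.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  clickhouse:
    # Illustrative DSN; point at your ClickHouse native endpoint.
    endpoint: tcp://clickhouse:9000
    database: telemetry
  prometheus:
    # Scrape endpoint used for runtime-metric alerting.
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp]
      exporters: [clickhouse, prometheus]
```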

3. Materialize analytics tables

The deploy/clickhouse/materialized_views.sql script provisions the analytics tables that power FleetForge’s observability dashboards. It builds on the raw telemetry.run_events and telemetry.step_events tables produced by the OTEL exporter and maintains the following derived datasets:

  • telemetry.run_latency_metrics – provides MTTR/SLO slices and latency histograms
  • telemetry.step_retry_heatmap – exposes retry attempt distributions per step kind/status
  • telemetry.budget_burndown – tracks reserved/used tokens & cost per workspace/app
  • telemetry.policy_decision_rates – aggregates allow/deny/modify counts for guardrails
  • telemetry.tool_error_taxonomy – attributes failure rates to tools/providers

Apply the script with clickhouse-client --multiquery < materialized_views.sql after the exporter has created the base tables.
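The canonical DDL lives in deploy/clickhouse/materialized_views.sql; as a rough illustration of the shape of these views, a latency rollup might resemble the sketch below. The column names (workspace, finished_at, duration_ms, status) are assumptions, not the actual schema produced by the exporter.

```sql
-- Illustrative only: column names are assumptions; the real definitions
-- live in deploy/clickhouse/materialized_views.sql.
CREATE MATERIALIZED VIEW IF NOT EXISTS telemetry.run_latency_metrics
ENGINE = AggregatingMergeTree()
ORDER BY (workspace, bucket)
AS
SELECT
    workspace,
    toStartOfHour(finished_at) AS bucket,
    quantileState(0.95)(duration_ms) AS p95_latency_state,
    countIfState(status = 'success') AS success_state,
    countState() AS total_state
FROM telemetry.run_events
GROUP BY workspace, bucket;
```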

4. Import the Grafana dashboard

  • Provision Grafana with the bundled config in deploy/grafana/provisioning.
  • Mount the dashboards from deploy/otel/dashboards/ into /etc/grafana/dashboards.
  • Add a ClickHouse datasource with UID clickhouse pointing at your telemetry endpoint.
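If you provision the datasource from a file rather than the UI, a sketch of the entry follows. The jsonData field names depend on the ClickHouse datasource plugin and its version, so treat the values below as a starting point, not a verified contract.

```yaml
apiVersion: 1
datasources:
  - name: ClickHouse
    type: grafana-clickhouse-datasource
    uid: clickhouse          # must match the UID the dashboards expect
    jsonData:
      host: clickhouse       # illustrative telemetry endpoint
      port: 9000
      protocol: native
      defaultDatabase: telemetry
```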

The FleetForge Observability dashboard surfaces:

  1. Run MTTR & success SLO panels
  2. Step retry heatmaps
  3. Budget burn-down charts
  4. Policy violation burst timelines
  5. Tool flake-rate tables

Use Grafana’s alerting UI to attach thresholds—trigger incidents when run success rate dips, policy denies surge, or tool flakes exceed an SLO.

5. Alert on budgets & queue health

The runtime now publishes dedicated OpenTelemetry metrics for budgets and queue pressure. The most relevant time series are:

  • fleetforge.budget.tokens_remaining / fleetforge.budget.cost_remaining – remaining tokens/cost after each step when a runtime budget is active.
  • fleetforge.budget.tokens_projected_burn / fleetforge.budget.cost_projected_burn – projected usage (used + reserved) before the step executes.
  • fleetforge.queue.lag_seconds – how long a step waited in the scheduler queue before running.
  • fleetforge.step.retries – counter of retry attempts driven by the scheduler.

When scraped by Prometheus the metric names are normalised (dots become underscores) and histograms expose the usual _sum, _count, and _bucket series.
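The dot-to-underscore normalisation can be sketched with a small helper; this is an approximation of the Prometheus name-sanitisation rule (any character outside the allowed set becomes an underscore), not the exact exporter implementation.

```python
import re

def prometheus_name(otel_name: str) -> str:
    """Approximate the OTLP -> Prometheus metric-name mapping:
    every character outside [a-zA-Z0-9_] (dots included) is
    replaced with an underscore."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", otel_name)

print(prometheus_name("fleetforge.budget.tokens_remaining"))
# fleetforge_budget_tokens_remaining
```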

deploy/otel/collector.local.yaml and the Docker compose collector config now expose these metrics via the built-in Prometheus exporter on :9464. Point a Prometheus/Alertmanager pair at that endpoint and add alerting rules such as:

groups:
  - name: fleetforge-budget
    rules:
      - alert: FleetForgeBudgetLow
        expr: (fleetforge_budget_tokens_remaining_sum / fleetforge_budget_tokens_remaining_count) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Budget low for FleetForge run ({{ $labels.run_id }})"
          description: "Remaining tokens dropped below 100 for 5 minutes."

Swap the threshold/for-window to match your policy. For projected exhaustion alerts use fleetforge_budget_tokens_projected_burn_sum / fleetforge_budget_tokens_projected_burn_count compared against the known limit. Queue pressure alerts can rely on histogram_quantile(0.95, fleetforge_queue_lag_seconds_bucket) (e.g. trigger when the 95th percentile exceeds a few seconds).
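A queue-pressure rule along the same lines might look like the sketch below; the 5-second threshold and 10-minute window are illustrative and should be tuned to your latency SLO.

```yaml
groups:
  - name: fleetforge-queue
    rules:
      - alert: FleetForgeQueueLagHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(fleetforge_queue_lag_seconds_bucket[5m]))) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FleetForge scheduler queue lag p95 above 5s"
          description: "Steps are waiting too long in the scheduler queue."
```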

6. Fan out traces to third-party platforms

Set FLEETFORGE_TRACE_EXPORTERS=langsmith,langfuse,phoenix (any comma-separated combination) to fan out every outbox event to your preferred observability platforms while Kafka/ClickHouse remain the system of record. Exporters are distributed as optional plugins: drop a manifest such as plugins/exporters/langsmith.toml containing name = "langsmith" (and optionally enabled = true) on disk before turning on the env var. Without a manifest the runtime ignores the exporter and continues using the OTEL collector as the blessed default. Each exporter expects service-specific credentials:

  • LangSmith – LANGSMITH_API_KEY (optional: LANGSMITH_API_URL, LANGSMITH_DATASET_ID)
  • Langfuse – LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY (optional: LANGFUSE_BASE_URL)
  • Phoenix – PHOENIX_API_KEY (optional: PHOENIX_BASE_URL, PHOENIX_WORKSPACE)
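Putting the pieces together for LangSmith, the manifest described above is just:

```toml
# plugins/exporters/langsmith.toml
name = "langsmith"
enabled = true
```

Pair it with the environment: export FLEETFORGE_TRACE_EXPORTERS=langsmith and set LANGSMITH_API_KEY before starting the runtime.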

The runtime delivers the canonical run/step envelope (including trace metadata) to the configured APIs using reqwest over HTTPS. Failures are logged as warnings but do not block the primary Kafka/ClickHouse pipeline.

7. Kafka / Outbox configuration checklist

FleetForge’s outbox forwarder delivers events via Kafka/Redpanda. By default it operates in at-least-once mode; you must supply the correct producer and consumer settings to reach effectively-exactly-once semantics.

  1. Enable producer transactions

    • Set transactional.id=<unique-worker-id> in BusConfig (or via client_props).
    • Keep enable.idempotence=true (the forwarder sets this automatically).
    • This mirrors Confluent’s recommended transactional outbox flow (a single DB write plus one Kafka transaction), avoiding dual writes and unlocking exactly-once delivery when paired with read_committed consumers.
  2. Harden consumers

    • Use isolation.level=read_committed so aborted transactions are filtered.
    • Continue de-duplicating on the outbox key (event.id) as defense in depth.
  3. Run the configuration smoke test

    • Execute just check-transactional-forwarder (or the equivalent CI step) with your configuration. The test confirms that the forwarder detects the transactional ID and that transactional() is enabled; failures surface misconfigurations before production.
  4. Fallback behaviour

    • Without a transactional ID the forwarder logs a warning and remains in at-least-once mode. Consumers must treat events idempotently in this mode.
  5. Monitor delivery metrics

    • Export the forwarder’s metrics (e.g., send success/failure counters) to observe retry storms or broker issues while tuning your deployment.
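The de-duplication on event.id from step 2 can be sketched as below. The event shape and in-memory seen-set are assumptions for illustration; in production the seen-id store would be durable (a DB table or Redis, say) so duplicates survive consumer restarts.

```python
from typing import Callable

class IdempotentConsumer:
    """De-duplicate outbox events on their unique id (the outbox key)
    before handing them to the real handler."""

    def __init__(self, handler: Callable[[dict], None]) -> None:
        self.handler = handler
        self.seen: set[str] = set()  # durable store in production

    def consume(self, event: dict) -> bool:
        event_id = event["id"]
        if event_id in self.seen:
            return False              # duplicate delivery: drop it
        self.seen.add(event_id)
        self.handler(event)
        return True

processed = []
consumer = IdempotentConsumer(processed.append)
consumer.consume({"id": "evt-1", "kind": "run.started"})
consumer.consume({"id": "evt-1", "kind": "run.started"})  # redelivery
print(len(processed))  # 1
```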