# Set Up ClickHouse & Grafana
This guide connects FleetForge’s OpenTelemetry stream to ClickHouse and Grafana so operators can inspect latency, policy, and budget dashboards.
## 1. Run the local collector (optional)
```bash
just observability
```
The helper spins up ClickHouse and the OTEL Collector using
`deploy/otel/collector.local.yaml`. The collector listens on
`http://localhost:4317`.
Point the runtime (or `just demo`) at the collector:

```bash
export OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
```
Browse raw telemetry tables at http://localhost:8123.
## 2. Configure the collector in other environments
- Start from `deploy/compose/otel-collector-config.yaml` to fan OTLP data into ClickHouse via the community exporter.
- Create the `telemetry` database manually (for example, `CREATE DATABASE telemetry;`). The exporter creates tables automatically.
- Expose the Prometheus scrape endpoint (`:9464`) if you plan to alert on runtime metrics.
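For orientation, a collector pipeline along these lines covers the pieces above; the hostnames and exporter options are illustrative, and the bundled `deploy/compose/otel-collector-config.yaml` remains the source of truth:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317        # runtime sends OTLP here

exporters:
  clickhouse:
    endpoint: tcp://clickhouse:9000   # adjust to your ClickHouse host
    database: telemetry               # created manually, see above
  prometheus:
    endpoint: 0.0.0.0:9464            # scrape endpoint for runtime metrics

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [clickhouse]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```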
## 3. Materialize analytics tables
The `deploy/clickhouse/materialized_views.sql` script provisions the analytics
tables that power FleetForge’s observability dashboards. It builds on the raw
`telemetry.run_events` and `telemetry.step_events` tables produced by the OTEL
exporter and maintains the following derived datasets:
| Table | Purpose |
|---|---|
| `telemetry.run_latency_metrics` | Provides MTTR/SLO slices and latency histograms |
| `telemetry.step_retry_heatmap` | Exposes retry attempt distributions per step kind/status |
| `telemetry.budget_burndown` | Tracks reserved/used tokens & cost per workspace/app |
| `telemetry.policy_decision_rates` | Aggregates allow/deny/modify counts for guardrails |
| `telemetry.tool_error_taxonomy` | Attributes failure rates to tools/providers |
Apply the script with `clickhouse-client --multiquery < materialized_views.sql`
after the exporter has created the base tables.
## 4. Import the Grafana dashboard
- Provision Grafana with the bundled config in `deploy/grafana/provisioning`.
- Mount the dashboards from `deploy/otel/dashboards/` into `/etc/grafana/dashboards`.
- Add a ClickHouse datasource with UID `clickhouse` pointing at your telemetry endpoint.
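As a starting point, a datasource provisioning file roughly like the sketch below pins the `clickhouse` UID. The plugin type and connection fields are assumptions and vary by ClickHouse plugin version, so check your plugin’s provisioning docs before using it:

```yaml
apiVersion: 1
datasources:
  - name: ClickHouse
    uid: clickhouse                        # must match the UID the dashboard expects
    type: grafana-clickhouse-datasource    # assumes the official ClickHouse plugin
    jsonData:
      host: clickhouse                     # your ClickHouse hostname
      port: 9000                           # native protocol port
      defaultDatabase: telemetry
```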
The FleetForge Observability dashboard surfaces:
- Run MTTR & success SLO panels
- Step retry heatmaps
- Budget burn-down charts
- Policy violation burst timelines
- Tool flake-rate tables
Use Grafana’s alerting UI to attach thresholds—trigger incidents when run success rate dips, policy denies surge, or tool flakes exceed an SLO.
## 5. Alert on budgets & queue health
The runtime now publishes dedicated OpenTelemetry metrics for budgets and queue pressure. The most relevant time series are:
- `fleetforge.budget.tokens_remaining` / `fleetforge.budget.cost_remaining` – remaining tokens/cost after each step when a runtime budget is active.
- `fleetforge.budget.tokens_projected_burn` / `fleetforge.budget.cost_projected_burn` – projected usage (used + reserved) before the step executes.
- `fleetforge.queue.lag_seconds` – how long a step waited in the scheduler queue before running.
- `fleetforge.step.retries` – counter of retry attempts driven by the scheduler.
When scraped by Prometheus the metric names are normalised (dots become
underscores) and histograms expose the usual `_sum`, `_count`, and `_bucket`
series.
`deploy/otel/collector.local.yaml` and the Docker Compose collector config now
expose these metrics via the built-in Prometheus exporter on `:9464`. Point a
Prometheus/Alertmanager pair at that endpoint.
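If Prometheus is not already scraping the collector, a minimal scrape config along these lines works; the job name and target hostname are illustrative and depend on how you deploy the collector:

```yaml
scrape_configs:
  - job_name: fleetforge-otel-collector
    static_configs:
      - targets: ["otel-collector:9464"]   # wherever the collector's Prometheus exporter is reachable
```

With scraping in place, add alerting rules such as: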
```yaml
groups:
  - name: fleetforge-budget
    rules:
      - alert: FleetForgeBudgetLow
        expr: (fleetforge_budget_tokens_remaining_sum / fleetforge_budget_tokens_remaining_count) < 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Budget low for FleetForge run ({{ $labels.run_id }})"
          description: "Remaining tokens dropped below 100 for 5 minutes."
```
Swap the threshold and `for` window to match your policy. For projected exhaustion
alerts, use
`fleetforge_budget_tokens_projected_burn_sum / fleetforge_budget_tokens_projected_burn_count`
compared against the known limit. Queue pressure alerts can rely on
`histogram_quantile(0.95, fleetforge_queue_lag_seconds_bucket)` (e.g. trigger
when the 95th percentile exceeds a few seconds).
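For instance, a queue pressure rule along these lines could be appended to the `rules:` list above; the 5-second threshold and the `rate`/`sum by (le)` windowing are illustrative choices, not FleetForge defaults:

```yaml
      - alert: FleetForgeQueueLagHigh
        expr: histogram_quantile(0.95, sum by (le) (rate(fleetforge_queue_lag_seconds_bucket[5m]))) > 5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "FleetForge scheduler queue lag p95 above 5s"
          description: "Steps waited more than 5 seconds at the 95th percentile for 10 minutes."
```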
## 6. Fan out traces to third-party platforms
Set `FLEETFORGE_TRACE_EXPORTERS=langsmith,langfuse,phoenix` (any comma-separated
combination) to fan out every outbox event to your preferred observability
platforms while Kafka/ClickHouse remain the system of record. Exporters are
distributed as optional plugins: drop a manifest such as
`plugins/exporters/langsmith.toml` containing `name = "langsmith"` (and
optionally `enabled = true`) on disk before turning on the env var. Without a
manifest the runtime ignores the exporter and continues using the OTEL collector
as the blessed default. Each exporter expects service-specific credentials:
- LangSmith – `LANGSMITH_API_KEY` (optional: `LANGSMITH_API_URL`, `LANGSMITH_DATASET_ID`)
- Langfuse – `LANGFUSE_PUBLIC_KEY`, `LANGFUSE_SECRET_KEY` (optional: `LANGFUSE_BASE_URL`)
- Phoenix – `PHOENIX_API_KEY` (optional: `PHOENIX_BASE_URL`, `PHOENIX_WORKSPACE`)
The runtime delivers the canonical run/step envelope (including trace metadata)
to the configured APIs using `reqwest` over HTTPS. Failures are logged as
warnings but do not block the primary Kafka/ClickHouse pipeline.
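For example, a minimal manifest enabling the LangSmith exporter, using only the fields described above, might look like:

```toml
# plugins/exporters/langsmith.toml
name = "langsmith"
enabled = true
```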
## 7. Kafka / Outbox configuration checklist
FleetForge’s outbox forwarder delivers events via Kafka/Redpanda. By default it operates in at-least-once mode; you must supply the correct producer and consumer settings to reach effectively-exactly-once semantics.
- Enable producer transactions
  - Set `transactional.id=<unique-worker-id>` in `BusConfig` (or via `client_props`); a config sketch follows this list.
  - Keep `enable.idempotence=true` (the forwarder sets this automatically).
  - This mirrors Confluent’s recommended transactional outbox flow (a single DB write plus a Kafka transaction), avoiding dual writes and unlocking exactly-once delivery when paired with correctly configured consumers.
- Harden consumers
  - Use `isolation.level=read_committed` so aborted transactions are filtered out.
  - Continue de-duplicating on the outbox key (`event.id`) as defense in depth.
- Run the configuration smoke test
  - Execute `just check-transactional-forwarder` (or the equivalent CI step) with your configuration. The test confirms that the forwarder detects the transactional ID and that `transactional()` is enabled; failures surface misconfigurations before production.
- Fallback behaviour
  - Without a transactional ID the forwarder logs a warning and remains in at-least-once mode. Consumers must treat events idempotently in this mode.
- Monitor delivery metrics
  - Export the forwarder’s metrics (e.g., send success/failure counters) to observe retry storms or broker issues while tuning your deployment.
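The exact `BusConfig` schema is project-specific, so treat the following as an illustrative sketch of the checklist’s settings rather than a drop-in config. The property names themselves (`transactional.id`, `enable.idempotence`, `isolation.level`) are standard Kafka client properties:

```yaml
# Illustrative only: how the producer/consumer properties from the checklist
# might be expressed if your deployment passes raw client properties through.
forwarder:
  client_props:
    transactional.id: outbox-forwarder-01   # unique per forwarder instance
    enable.idempotence: "true"               # the forwarder also sets this automatically
consumers:
  client_props:
    isolation.level: read_committed          # filter events from aborted transactions
```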