Skip to main content

Untrusted Context Policy Pack

FleetForge ships a prompt-injection (PI) policy pack that inspects prompts, tool inputs, and model outputs before they cross trust boundaries. The pack is implemented in Rego and compiled to WebAssembly so it can execute inside the deterministic WasmPolicyEngine.

Looking for ready-made guardrail bundles? See Policy presets.

Capabilities

The PI pack scans inbound and outbound context for:

  • Override attempts such as “ignore previous instructions” or “disregard the system message”
  • Hidden instructions embedded in Markdown/HTML, export/exfil phrasing, or forced tool routing (e.g. “run this shell command”)
  • Suspicious prompt scaffolding (excessive external links, credential leakage, “for research purposes only” jail-break cues)

The decision contract mirrors other FleetForge policies:

EffectWhen it triggers
denyHigh-confidence injections (explicit overrides, exfiltration)
redactSuspicious phrases or link-heavy payloads (content replaced with [filtered-prompt-injection])
allowClean content

Building the Wasm bundle

The repository embeds core/policy/packs/prompt_injection/policy.wasm. Regenerate it after updating policy.rego:

cd core/policy/packs/prompt_injection
opa build \
-t wasm \
-e fleetforge/prompt_injection/decision \
policy.rego \
-o policy.wasm

Commit both policy.rego and the resulting policy.wasm so the runtime can load the bundle through WasmPolicyEngine.

If the Wasm module cannot be initialised, FleetForge logs a warning and falls back to a lightweight heuristic detector (the same behaviour seen in development when opa is unavailable).

Requires OPA v0.57.0+ because the policy relies on the structured decision API.

Runtime integration

  • PolicyBundle::for_policy(...) (see core/runtime/src/guardrails.rs) builds the bundle per step based on policy.guardrails. If Wasm initialisation fails, the runtime falls back to a heuristic prompt-injection guardrail and logs a warning.
  • Guardrails pass JSON patches back to the executor so the scheduler can redact messages or outputs before model/tool execution.
  • Memory adapters call the same bundle when reading or writing values, ensuring untrusted snippets cannot be reintroduced without filtering.

Operator visibility

  • Each guardrail evaluation now produces a policy_decisions artifact per step. The artifact captures the boundary, effect, policy-specific decisions, applied JSON patches, and the originating trust metadata.
  • Step outputs (and failure payloads) include a policy_decisions summary with counts, effects, sampled reasons/previews, and a link to the artifact. Tap and run-detail views surface this summary so operators can audit redactions or denials without replaying the run.
  • Artifact metadata is marked as derived trust sourced from the policy firewall, making it safe to downstream into audit or SIEM pipelines.

Evaluations & red-teaming

  • The evals/untrusted_context/ suite replays three attack classes: hidden directives (HTML/data URLs), disallowed HTTP egress, and memory replay. Each scenario asserts both no_injection and no_pii on the sanitized outputs.
  • Run the pack locally with python -m evals.runner --endpoint http://127.0.0.1:50051 evals/untrusted_context (or just evals-untrusted) before shipping guardrail changes.
  • CI blocks merges when any scenario surfaces injection/PII matches, keeping the Context Firewall regression-free.

Alignment with industry guidance

  • OWASP LLM Top 10 (LLM01 & LLM02) – Prompt safe-merge, indirect-injection scanning, and output guardrails apply OWASP’s mitigations before/after every boundary, preventing instruction override and insecure tool execution. OWASP Foundation
  • NIST GenAI Profile (AI 600-1) – Provenance metadata, per-step policy enforcement, and policy-decision artifacts provide traceability and defense-in-depth across the ingest→process→egress pipeline. NIST Publications
  • OPA/Rego ➜ Wasm – Policies compile to portable WebAssembly so the same guardrails run in CI, staging, and production with auditable Rego sources. openpolicyagent.org
  • Isolation posture – Wasmtime protects in-process policy execution, while Docker (read-only, --network none) and optional Firecracker microVMs provide hardened sandboxes for tools. Amazon Web Services, Inc. docs.wasmtime.dev bytecodealliance.org
  • Indirect injection defenses – HTTP proxy sanitisation and policy packs mirror Microsoft Prompt Shields / Azure Content Safety guidance for hostile web content. Microsoft

For deployments that need stricter review, point FLEETFORGE_PROMPT_INJECTION_WASM to a custom-built module and call prompt_injection_from_path from core/policy::packs.

Guardrail flags

For the full list of guardrail flags (policy.guardrails[]) and environment variables that shape policy behaviour, see the Guardrail reference. This page focuses on the prompt-injection pack and how it interacts with the Context Firewall.

HTTP proxy tool

  • Steps of type http invoke the hardened egress proxy (HttpProxyExecutor).
  • All requests must match egress_http_allowlist:* guardrails (or the FLEETFORGE_HTTP_ALLOWLIST environment variable). Entries may include optional path prefixes (e.g., corp.example.com/docs).
  • Default response limits: 512 KiB and textual content types; override via guardrails or environment (FLEETFORGE_HTTP_MAX_BYTES, FLEETFORGE_HTTP_CONTENT_TYPES).
  • Responses run through the Context Firewall (IngressDocument + EgressTool) for PII/prompt-injection scanning before they are returned to the caller.
  • HTML content is sanitised (scripts/styles removed, hidden directives stripped) and normalised to Markdown before the PI policy executes.

Firecracker tool sandbox

  • Set policy.sandbox = "firecracker" (or add tool_sandbox:firecracker to policy.guardrails) to run a tool step via the Firecracker microVM sandbox.
  • Configure the shim that launches Firecracker via FLEETFORGE_FIRECRACKER_SHIM (and optional FLEETFORGE_FIRECRACKER_TIMEOUT_SECS). When unset, the runtime falls back to the Docker toolbox.