Untrusted Context Policy Pack
FleetForge ships a prompt-injection (PI) policy pack that inspects prompts, tool
inputs, and model outputs before they cross trust boundaries. The pack is
implemented in Rego and compiled to WebAssembly so it can execute inside the
deterministic WasmPolicyEngine.
Looking for ready-made guardrail bundles? See Policy presets.
Capabilities
The PI pack scans inbound and outbound context for:
- Override attempts such as “ignore previous instructions” or “disregard the system message”
- Hidden instructions embedded in Markdown/HTML, export/exfil phrasing, or forced tool routing (e.g. “run this shell command”)
- Suspicious prompt scaffolding (excessive external links, credential leakage, “for research purposes only” jail-break cues)
The decision contract mirrors other FleetForge policies:
| Effect | When it triggers |
|---|---|
deny | High-confidence injections (explicit overrides, exfiltration) |
redact | Suspicious phrases or link-heavy payloads (content replaced with [filtered-prompt-injection]) |
allow | Clean content |
Building the Wasm bundle
The repository embeds core/policy/packs/prompt_injection/policy.wasm. Regenerate
it after updating policy.rego:
cd core/policy/packs/prompt_injection
opa build \
-t wasm \
-e fleetforge/prompt_injection/decision \
policy.rego \
-o policy.wasm
Commit both policy.rego and the resulting policy.wasm so the runtime can load
the bundle through WasmPolicyEngine.
If the Wasm module cannot be initialised, FleetForge logs a warning and falls
back to a lightweight heuristic detector (the same behaviour seen in
development when opa is unavailable).
Requires OPA v0.57.0+ because the policy relies on the structured decision API.
Runtime integration
PolicyBundle::for_policy(...)(seecore/runtime/src/guardrails.rs) builds the bundle per step based onpolicy.guardrails. If Wasm initialisation fails, the runtime falls back to a heuristic prompt-injection guardrail and logs a warning.- Guardrails pass JSON patches back to the executor so the scheduler can redact messages or outputs before model/tool execution.
- Memory adapters call the same bundle when reading or writing values, ensuring untrusted snippets cannot be reintroduced without filtering.
Operator visibility
- Each guardrail evaluation now produces a
policy_decisionsartifact per step. The artifact captures the boundary, effect, policy-specific decisions, applied JSON patches, and the originating trust metadata. - Step outputs (and failure payloads) include a
policy_decisionssummary with counts, effects, sampled reasons/previews, and a link to the artifact. Tap and run-detail views surface this summary so operators can audit redactions or denials without replaying the run. - Artifact metadata is marked as derived trust sourced from the policy firewall, making it safe to downstream into audit or SIEM pipelines.
Evaluations & red-teaming
- The
evals/untrusted_context/suite replays three attack classes: hidden directives (HTML/data URLs), disallowed HTTP egress, and memory replay. Each scenario asserts bothno_injectionandno_piion the sanitized outputs. - Run the pack locally with
python -m evals.runner --endpoint http://127.0.0.1:50051 evals/untrusted_context(orjust evals-untrusted) before shipping guardrail changes. - CI blocks merges when any scenario surfaces injection/PII matches, keeping the Context Firewall regression-free.
Alignment with industry guidance
- OWASP LLM Top 10 (LLM01 & LLM02) – Prompt safe-merge, indirect-injection scanning, and output guardrails apply OWASP’s mitigations before/after every boundary, preventing instruction override and insecure tool execution. OWASP Foundation
- NIST GenAI Profile (AI 600-1) – Provenance metadata, per-step policy enforcement, and policy-decision artifacts provide traceability and defense-in-depth across the ingest→process→egress pipeline. NIST Publications
- OPA/Rego ➜ Wasm – Policies compile to portable WebAssembly so the same guardrails run in CI, staging, and production with auditable Rego sources. openpolicyagent.org
- Isolation posture – Wasmtime protects in-process policy execution, while Docker (read-only,
--network none) and optional Firecracker microVMs provide hardened sandboxes for tools. Amazon Web Services, Inc. docs.wasmtime.dev bytecodealliance.org - Indirect injection defenses – HTTP proxy sanitisation and policy packs mirror Microsoft Prompt Shields / Azure Content Safety guidance for hostile web content. Microsoft
For deployments that need stricter review, point
FLEETFORGE_PROMPT_INJECTION_WASM to a custom-built module and call
prompt_injection_from_path from core/policy::packs.
Guardrail flags
For the full list of guardrail flags (policy.guardrails[]) and environment
variables that shape policy behaviour, see the
Guardrail reference. This page focuses on the prompt-injection
pack and how it interacts with the Context Firewall.
HTTP proxy tool
- Steps of type
httpinvoke the hardened egress proxy (HttpProxyExecutor). - All requests must match
egress_http_allowlist:*guardrails (or theFLEETFORGE_HTTP_ALLOWLISTenvironment variable). Entries may include optional path prefixes (e.g.,corp.example.com/docs). - Default response limits: 512 KiB and textual content types; override via guardrails
or environment (
FLEETFORGE_HTTP_MAX_BYTES,FLEETFORGE_HTTP_CONTENT_TYPES). - Responses run through the Context Firewall (
IngressDocument+EgressTool) for PII/prompt-injection scanning before they are returned to the caller. - HTML content is sanitised (scripts/styles removed, hidden directives stripped) and normalised to Markdown before the PI policy executes.
Firecracker tool sandbox
- Set
policy.sandbox = "firecracker"(or addtool_sandbox:firecrackertopolicy.guardrails) to run a tool step via the Firecracker microVM sandbox. - Configure the shim that launches Firecracker via
FLEETFORGE_FIRECRACKER_SHIM(and optionalFLEETFORGE_FIRECRACKER_TIMEOUT_SECS). When unset, the runtime falls back to the Docker toolbox.