I wanted a small, honest implementation of the GenAI governance shape in code: a gateway component in front of every LLM call that applies policy first, optionally scrubs prompts and responses, and emits metrics—without pretending to be enterprise inline inspection. This repo is a Rust gRPC MVP with keyword blocklists and rate limits, regex redaction, Prometheus counters, and pluggable providers (OpenAI, Anthropic, mock).

Industry collateral often uses the same vocabulary—visibility, inline policy, sensitive data in prompts and answers—see, for example, Zscaler's pages on securing generative AI and on AI Guardrails. No affiliation with Zscaler; this is neither an endorsement nor a capability comparison.

Problem

When many clients talk straight to a provider API, you get recurring failure modes: no single choke point for policy, accidental or careless PII in prompts or model output, abuse and cost spikes, and weak signals for operators who need to know what was allowed, blocked, or altered.

A gateway in front of the provider gives you that choke point: enforce rules before the model runs, redact or block on the way in and out, and emit metrics so you are not flying blind.

This repo implements that as an MVP in Rust (Tokio, gRPC/tonic, pluggable backends): keyword blocklist, fixed-window rate limit, regex redaction—not a full DLP catalog or ML classifiers, but the same narrative line: inspect and govern the path, then call the model.

Why Rust and gRPC for this kind of gateway

The gateway sits inline: if that hop jitters, people stop trusting “govern every call.” Rust buys predictable latency in the enforcement path—no GC pauses while you scan and rewrite text—and memory safety while doing it. gRPC with Protobuf gives a versioned request/response contract (SecureCompletionRequest / SecureCompletionResponse), compact wire encoding, and generated server stubs so callers share one schema instead of ad hoc JSON that drifts quietly as fields change. The same surface extends cleanly to server streaming when you want token-by-token replies without a bespoke HTTP contract per client.

Architecture

The one-liner version:

Client → gRPC Gateway (Rust) → Policy Pipeline → Pluggable LLM Provider → Response + Metrics

The diagram below is inspired by the “at a glance” story common in GenAI security collateral (visibility, inline control, data-in-motion)—for example Zscaler’s Gen AI Security at-a-glance PDF—but redrawn for this open-source MVP only. It is not a depiction of Zscaler’s product or deployment model.

Architecture diagram

The three stacked stages echo the access · data · visibility framing used in GenAI security “at a glance” sheets: one inline choke point, with metrics as the separate HTTP scrape surface (port 8080), not an extra hop on the gRPC path.

Components:

  • Gateway: Tokio async server. gRPC on port 50051 for SecureCompletion, HTTP on 8080 for Prometheus /metrics only.
  • Policy before the model: Keyword blocklist and per-user rate limit run inside PolicyEngine before any LLM call.
  • Redaction: Regex-based scrubbing on the prompt and/or response in the gRPC handler when enabled (not part of the allow/block decision).
  • Providers: ChatCompletionProvider trait. Swap via LLM_PROVIDER env: OpenAI, Anthropic, or in-process mock (no HTTP; see README for ghz benchmarks).
  • Observability: Prometheus metrics (gateway_total_requests, gateway_blocked_requests, gateway_allowed_requests, gateway_provider_errors_total, gateway_request_latency_seconds, gateway_tokens_used_total).
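As a concrete illustration, a scrape of the HTTP /metrics endpoint would return these counters in the Prometheus text exposition format, roughly as below (values and HELP strings are invented; only the metric names come from the repo):

```text
# HELP gateway_total_requests Total SecureCompletion requests received.
# TYPE gateway_total_requests counter
gateway_total_requests 128
# HELP gateway_blocked_requests Requests blocked by policy before the LLM call.
# TYPE gateway_blocked_requests counter
gateway_blocked_requests 7
# HELP gateway_request_latency_seconds End-to-end request latency.
# TYPE gateway_request_latency_seconds histogram
gateway_request_latency_seconds_bucket{le="0.005"} 96
gateway_request_latency_seconds_bucket{le="+Inf"} 121
gateway_request_latency_seconds_sum 0.42
gateway_request_latency_seconds_count 121
```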

Implementation

Async and gRPC

  • Tokio for async runtime. gRPC server uses tonic; HTTP uses axum.
  • Protobuf defines SecureCompletionRequest (user_id, prompt) and SecureCompletionResponse with SecureCompletionDecision enum (ALLOWED / BLOCKED), plus response text and reason. tonic-build compiles proto to Rust at build time.
  • Dual servers: gRPC and HTTP run on separate ports. HTTP serves Prometheus scrape only; chat flows through gRPC only.
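A minimal .proto matching the names above might look like the sketch below. The service and package names follow the grpcurl invocation later in the post; field numbers and exact layout in the repo may differ.

```proto
syntax = "proto3";
package ai_security;

service AiSecurityGateway {
  rpc SecureCompletion (SecureCompletionRequest) returns (SecureCompletionResponse);
}

message SecureCompletionRequest {
  string user_id = 1;
  string prompt  = 2;
}

enum SecureCompletionDecision {
  ALLOWED = 0;
  BLOCKED = 1;
}

message SecureCompletionResponse {
  SecureCompletionDecision decision = 1;
  string response = 2; // model output (possibly redacted); empty when blocked
  string reason   = 3; // why a request was blocked or altered
}
```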

Policies

  • Keywords: BANNED_KEYWORDS env (comma-separated). Case-insensitive match; blocks before LLM call.
  • Rate limit: In-memory fixed window per user_id (counter resets after the window elapses). Configurable via RATE_LIMIT_REQUESTS and RATE_LIMIT_WINDOW_SECS; RATE_LIMIT_MAX_TRACKED_USERS caps how many distinct IDs are tracked (eviction when full).
  • Redaction: Regex-based. Built-in patterns for email, API keys, SSN, credit cards, private IPs. Custom patterns via REDACT_CUSTOM_PATTERNS (JSON file path); each custom rule’s id must also appear in REDACT_PATTERNS to run. Runs on prompt (before LLM) and response (before client).
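The pre-LLM checks can be sketched in a few dozen lines of std-only Rust. Names here (PolicyEngine, Decision, check) are illustrative, not the repo's actual API, and the real code reads its thresholds from the environment variables listed above:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Outcome of the policy pass; the repo models this as a Protobuf enum.
#[derive(Debug, PartialEq)]
enum Decision {
    Allowed,
    Blocked(String), // reason string returned to the caller
}

struct PolicyEngine {
    banned_keywords: Vec<String>,             // from BANNED_KEYWORDS, lowercased
    limit: u32,                               // RATE_LIMIT_REQUESTS
    window: Duration,                         // RATE_LIMIT_WINDOW_SECS
    windows: HashMap<String, (Instant, u32)>, // user_id -> (window start, count)
}

impl PolicyEngine {
    fn check(&mut self, user_id: &str, prompt: &str) -> Decision {
        // 1. Case-insensitive keyword blocklist, before any LLM call.
        let lower = prompt.to_lowercase();
        if let Some(kw) = self.banned_keywords.iter().find(|k| lower.contains(k.as_str())) {
            return Decision::Blocked(format!("banned keyword: {kw}"));
        }
        // 2. Fixed-window rate limit: the counter resets once the window elapses.
        let now = Instant::now();
        let entry = self.windows.entry(user_id.to_string()).or_insert((now, 0));
        if now.duration_since(entry.0) >= self.window {
            *entry = (now, 0); // window elapsed: start a fresh one
        }
        if entry.1 >= self.limit {
            return Decision::Blocked("rate limit exceeded".into());
        }
        entry.1 += 1;
        Decision::Allowed
    }
}

fn main() {
    let mut engine = PolicyEngine {
        banned_keywords: vec!["secret".into()],
        limit: 2,
        window: Duration::from_secs(60),
        windows: HashMap::new(),
    };
    assert_eq!(
        engine.check("user-1", "tell me a SECRET"),
        Decision::Blocked("banned keyword: secret".into())
    );
    assert_eq!(engine.check("user-1", "hello"), Decision::Allowed);
    assert_eq!(engine.check("user-1", "hello again"), Decision::Allowed);
    assert!(matches!(engine.check("user-1", "third"), Decision::Blocked(_)));
    println!("policy sketch ok");
}
```

Note the order matters: both checks run before the provider call, so a blocked request never spends tokens, and redaction runs separately on traffic that was already allowed.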

Pluggable Providers

Each provider implements ChatCompletionProvider. from_config() reads LLM_PROVIDER and instantiates OpenAI, Anthropic, or an in-process mock that returns a fixed string with no HTTP—useful for ghz runs that isolate gRPC, policy, and redaction from real LLM latency (see README Gateway-only benchmark).
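The seam looks roughly like the sketch below. The repo's real trait is async and HTTP-backed for OpenAI and Anthropic; this std-only, synchronous version just shows the shape of the trait, the mock, and the from_config() dispatch:

```rust
// Illustrative stand-in for the repo's ChatCompletionProvider trait.
trait ChatCompletionProvider {
    fn complete(&self, prompt: &str) -> Result<String, String>;
}

// In-process mock: fixed reply, no HTTP. This is what makes gateway-only
// benchmarks possible without spending tokens.
struct MockProvider;

impl ChatCompletionProvider for MockProvider {
    fn complete(&self, _prompt: &str) -> Result<String, String> {
        Ok("mock response".to_string())
    }
}

// Stand-in for from_config(): select a backend from the LLM_PROVIDER value.
fn from_config(provider: &str) -> Result<Box<dyn ChatCompletionProvider>, String> {
    match provider {
        "mock" => Ok(Box::new(MockProvider)),
        // "openai" | "anthropic" would construct HTTP-backed providers here.
        other => Err(format!("unknown LLM_PROVIDER: {other}")),
    }
}

fn main() {
    let provider = from_config("mock").expect("mock should always construct");
    assert_eq!(provider.complete("hi").unwrap(), "mock response");
    println!("provider sketch ok");
}
```

A new backend is then exactly what the post claims: one new type implementing the trait, plus one branch in the dispatch.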

Running It

export OPENAI_API_KEY=sk-...
cargo run
grpcurl -plaintext -import-path proto -proto ai_security_gateway.proto \
  -d '{"user_id":"user-1","prompt":"Say hello"}' \
  localhost:50051 ai_security.AiSecurityGateway/SecureCompletion

Docker (Dockerfile, docker-compose.yml) and Kubernetes manifests (k8s/) support local images and kind/minikube-style deploys. For cluster runs, the README covers loading the image, creating the API-key Secret before the pods start when using OpenAI or Anthropic (otherwise the container exits with an "OPENAI_API_KEY must be set" error), restarting the rollout after ConfigMap or Secret changes, and port-forward smoke tests with grpcurl and /metrics.

Gateway-only load check

With LLM_PROVIDER=mock and the ghz commands in the README, you can stress gRPC + policy + redaction without spending tokens. Latency and RPS depend on your machine and concurrency; turn keyword checks and redaction back on when you want those paths included. For long runs with a single user_id, raise RATE_LIMIT_REQUESTS and clear BANNED_KEYWORDS as the README describes so you are measuring the stack, not the default rate limit.

What worked

  1. Separate ports for gRPC and HTTP: Prometheus /metrics on HTTP; chat only on gRPC. No gRPC-Web or transcoding in this MVP.
  2. Keywords and rate limits before the LLM call. Redaction on allowed traffic mirrors the “sensitive data in prompts and answers” theme—bidirectional scrub, separate from the allow/block decision.
  3. Trait-based providers: new backend = new type + from_config() branch.
  4. In-memory rate limit is enough for one replica; multiple replicas need a shared store (e.g. Redis).

Next steps

  • Streaming: gRPC server-streaming for token-by-token responses.
  • Distributed rate limiting: Redis-backed for horizontal scaling.
  • More backends: Vertex AI, Azure OpenAI, Ollama.
  • Jailbreak / prompt-injection classifiers: Closer to the guardrails pages’ “inspect before harm” story than a static keyword list (still out of scope for this MVP).
  • Response caching: Cache by prompt hash to reduce LLM calls.
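The caching idea can be sketched with std alone: hash the prompt, look the digest up before calling the provider. Everything here is hypothetical (a real version would need eviction, a TTL, and a stable hash rather than DefaultHasher), but it shows the shape:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Illustrative prompt-hash cache; not part of the repo yet.
struct ResponseCache {
    entries: HashMap<u64, String>,
}

impl ResponseCache {
    fn key(prompt: &str) -> u64 {
        let mut h = DefaultHasher::new();
        prompt.hash(&mut h);
        h.finish()
    }

    // Return the cached response, or compute one via `call_llm` and store it.
    fn get_or_insert_with<F: FnOnce() -> String>(&mut self, prompt: &str, call_llm: F) -> String {
        self.entries
            .entry(Self::key(prompt))
            .or_insert_with(call_llm)
            .clone()
    }
}

fn main() {
    let mut cache = ResponseCache { entries: HashMap::new() };
    let mut calls = 0;
    let a = cache.get_or_insert_with("Say hello", || { calls += 1; "hello".into() });
    let b = cache.get_or_insert_with("Say hello", || { calls += 1; "hello".into() });
    assert_eq!(a, b);
    assert_eq!(calls, 1); // the second lookup hit the cache; no provider call
    println!("cache sketch ok");
}
```

In the gateway, the lookup would sit after policy and prompt redaction (so only allowed, scrubbed prompts are cached) and before the provider call.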

Repository: github.com/sprider/rust-grpc-ai-security-gateway