The false dichotomy in enterprise AI is this: either you send everything to the cloud and accept the compliance risk, or you run everything locally and accept the capability constraints. The Router Pattern dissolves this binary. It gives you both—cloud capability where it's safe, local control where it matters.
This isn't theoretical. Organizations are deploying hybrid architectures today that route queries based on data sensitivity classification, dynamically choosing between local sovereign models and cloud APIs. In a typical deployment, roughly 80% of queries go to fast, capable cloud endpoints, while the 20% containing sensitive data stay entirely local.
The cost savings are significant. The compliance posture is defensible. And the user experience is seamless—employees don't know or care which model answered their question.
The key insight: Most enterprise queries don't contain sensitive data. "Summarize this public earnings report" and "What are best practices for project management?" don't need sovereign infrastructure. But "Analyze this patient's test results" absolutely does.
The Architecture
The Router Pattern consists of four components: the Classifier, the Router, the Model Pool, and the Response Normalizer. Each plays a specific role in the request lifecycle.
1. The Classifier
Every incoming query first hits the Classifier. This component analyzes the request and assigns a sensitivity classification. Classifications typically map to your existing data governance framework—if you already classify documents as Public, Internal, Confidential, and Restricted, your AI classifier should use the same taxonomy.
Classification happens through multiple mechanisms:
Pattern matching: Regular expressions catch obvious sensitive data—SSNs, credit card numbers, medical record numbers. This is fast and deterministic but catches only structured data.
Named entity recognition: NER models identify patient names, company names, case numbers, and other entities that indicate sensitivity. This catches unstructured sensitive references.
Contextual classification: A lightweight local model evaluates the full query context. "What's the weather?" is Public. "What's the weather at the facility where we're conducting the clinical trial?" is Confidential—even though neither contains PII.
Source tagging: If the query references documents from a sensitive system (the HR database, the legal matter management system), it inherits that system's classification regardless of content.
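The first and last of these mechanisms can be sketched in a few lines. Everything here, including the pattern list, the source-system tags, and the four-level taxonomy, is an illustrative assumption rather than a production ruleset; the key idea is that the most restrictive signal wins:

```python
import re

# Classification tiers, least to most restrictive (hypothetical taxonomy).
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# Pattern matching: fast, deterministic, catches structured sensitive data.
PATTERNS = {
    "Restricted": [
        re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # US SSN
        re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # possible payment card number
    ],
    "Confidential": [
        re.compile(r"\bMRN[-: ]?\d{6,}\b", re.I),  # medical record number
    ],
}

# Source tagging: queries referencing these systems inherit their classification.
SENSITIVE_SOURCES = {"hr_database": "Confidential", "legal_matters": "Restricted"}

def classify(query: str, source_systems=()) -> str:
    """Return the most restrictive classification triggered by any mechanism."""
    level = 0  # start at Public
    for cls, patterns in PATTERNS.items():
        if any(p.search(query) for p in patterns):
            level = max(level, LEVELS.index(cls))
    for src in source_systems:
        inherited = SENSITIVE_SOURCES.get(src)
        if inherited:
            level = max(level, LEVELS.index(inherited))
    return LEVELS[level]
```

In a real deployment the NER and contextual layers would raise `level` the same way, so the mechanisms compose without special-casing.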
Classification Latency Budget
Classification must be fast—under 50ms for pattern matching, under 200ms for full contextual analysis. If classification takes too long, users will route around the system. Design for speed, and default to the more restrictive classification when uncertain.
2. The Router
Once classified, the query hits the Router. This component maintains a routing table that maps classifications to model endpoints. The simplest implementation:
| Classification | Primary Route | Fallback |
|---|---|---|
| Public | Cloud API (GPT-4, Claude) | Local Llama 3 |
| Internal | Cloud API with Enterprise Agreement | Local Llama 3 |
| Confidential | Local Llama 3 70B | Queue (no cloud fallback) |
| Restricted | Air-gapped Local Model | Reject |
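The table above translates naturally into a declarative routing structure. The endpoint names below are placeholders, not real deployment identifiers; the point is that the no-cloud-fallback rule for sensitive tiers is data, not code:

```python
# Hypothetical routing table mirroring the one above.
ROUTES = {
    "Public":       {"primary": "cloud:gpt-4",            "fallback": "local:llama3-8b"},
    "Internal":     {"primary": "cloud:gpt-4-enterprise", "fallback": "local:llama3-8b"},
    "Confidential": {"primary": "local:llama3-70b",       "fallback": "queue"},
    "Restricted":   {"primary": "airgap:local-model",     "fallback": "reject"},
}

def route(classification: str, primary_healthy: bool = True) -> str:
    """Resolve a classification to an endpoint, or to a queue/reject sentinel."""
    entry = ROUTES[classification]
    return entry["primary"] if primary_healthy else entry["fallback"]
```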
But routing decisions aren't just about sensitivity. The Router also considers:
Latency requirements: Real-time applications might prefer a faster local model over a higher-quality cloud model that adds 500ms of network latency.
Cost optimization: High-volume, low-complexity queries might route to cheaper endpoints even when cloud is available.
Capability matching: Code generation queries might route to models fine-tuned for coding. Medical queries might route to clinical models.
Load balancing: When local GPU capacity is constrained, lower-priority queries might route to cloud to preserve local capacity for sensitive workloads.
3. The Model Pool
The Model Pool is your inventory of available inference endpoints. A typical enterprise deployment includes:
Cloud APIs: OpenAI, Anthropic, Google—whatever your enterprise agreements cover. These provide frontier capability for non-sensitive workloads.
Local large models: Llama 3 70B, Mixtral 8x22B, or equivalent. These handle the bulk of sensitive workloads with near-frontier quality.
Local small models: Llama 3 8B, Mistral 7B, Phi-3. Fast, cheap, good enough for many tasks. Useful for classification, simple queries, and fallback.
Specialized models: Domain-specific fine-tunes for legal, medical, financial, or coding tasks. These may be local or cloud-hosted depending on sensitivity.
The Model Pool abstracts these endpoints behind a consistent interface. The Router doesn't care whether it's calling OpenAI or a local vLLM instance—the API contract is identical.
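One way to express that abstraction, sketched here with Python's structural typing; the `Completion` shape and stub endpoint are assumptions standing in for real cloud and vLLM adapters:

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Completion:
    text: str
    model_id: str

class ModelEndpoint(Protocol):
    """The contract every endpoint implements, cloud or local."""
    model_id: str
    def complete(self, prompt: str, max_tokens: int = 512) -> Completion: ...

class LocalStub:
    """Stand-in for a local inference adapter (e.g. a vLLM client)."""
    def __init__(self, model_id: str):
        self.model_id = model_id
    def complete(self, prompt: str, max_tokens: int = 512) -> Completion:
        return Completion(text=f"[{self.model_id}] reply to: {prompt}",
                          model_id=self.model_id)

class ModelPool:
    """Registry mapping route names to interchangeable endpoints."""
    def __init__(self):
        self._endpoints: dict[str, ModelEndpoint] = {}
    def register(self, name: str, endpoint: ModelEndpoint):
        self._endpoints[name] = endpoint
    def complete(self, name: str, prompt: str) -> Completion:
        return self._endpoints[name].complete(prompt)
```

Because the Router only ever sees `ModelEndpoint`, swapping a cloud API for a local instance is a registration change, not a code change.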
4. The Response Normalizer
Different models return different response formats. The Response Normalizer ensures consistent output regardless of which model handled the request. It also:
Strips model-specific artifacts: Some models include disclaimers, formatting quirks, or metadata that should be normalized.
Enforces output policies: If your governance requires responses to include source attribution or confidence scores, the Normalizer adds them.
Logs for audit: Every response is logged with full lineage—which model, which classification, which routing decision. This creates the audit trail compliance requires.
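A minimal sketch of those three responsibilities together; the disclaimer phrases, the attribution policy, and the log schema are all illustrative assumptions:

```python
import time

# Model-specific artifacts to strip (hypothetical examples).
DISCLAIMER_PREFIXES = ("As an AI language model,", "I'm just an AI,")

audit_log = []  # in production: durable, append-only storage

def normalize(raw_text: str, model_id: str, classification: str, route: str) -> str:
    text = raw_text.strip()
    # 1. Strip model-specific artifacts.
    for prefix in DISCLAIMER_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].lstrip()
    # 2. Enforce an output policy: every response carries source attribution.
    normalized = f"{text}\n\n[answered by: {model_id}]"
    # 3. Log full lineage for the audit trail.
    audit_log.append({
        "ts": time.time(),
        "model": model_id,
        "classification": classification,
        "route": route,
    })
    return normalized
```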
Classification Strategies
The Classifier is the linchpin. Get it wrong, and sensitive data leaks to cloud endpoints. Get it right, and you have a defensible, efficient hybrid architecture.
The Conservative Default
Start conservative. When in doubt, classify as Confidential and route locally. False positives (treating non-sensitive data as sensitive) cost you latency and compute. False negatives (treating sensitive data as non-sensitive) cost you compliance violations.
The asymmetry is clear: over-classify by default, then tune toward efficiency as you gain confidence in your classifier.
User-Driven Classification
Consider allowing users to override classification upward (marking queries as more sensitive than detected) but not downward. If a user knows their query contains information the classifier missed, they can force local routing. But they can't force cloud routing for something the classifier flagged.
Anti-pattern: Letting users classify their own queries without system verification. Users will classify everything as "Public" to get faster responses. Classification must be system-enforced, not user-selected.
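The upward-only rule reduces to taking the maximum of the two sensitivity levels, assuming the ordered four-tier taxonomy used throughout:

```python
# Tiers ordered least to most restrictive (assumed taxonomy).
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

def apply_user_override(system_cls: str, user_cls: str) -> str:
    """Users may raise sensitivity but never lower it below the system's verdict."""
    return LEVELS[max(LEVELS.index(system_cls), LEVELS.index(user_cls))]
```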
Context Window Awareness
Classification must consider the full context window, not just the current query. If a conversation started with sensitive data, subsequent queries in that conversation inherit the sensitivity—even if individual messages look innocuous.
"What's the prognosis?" is meaningless without context. In a conversation that started with patient records, it's highly sensitive. The Classifier must track conversation state.
Failure Modes and Mitigations
Classifier Bypass
Sophisticated users might attempt to phrase sensitive queries in ways that evade classification. "Hypothetically, if a patient had these lab values..." contains real patient data wrapped in hypothetical framing.
Mitigation: Train classifiers on adversarial examples. Include "hypothetical" framings, encoded data, and other evasion attempts in your training data.
Cascade Failures
When local models are overloaded, what happens? If all routes fail, does the query get dropped? Does it queue indefinitely? Does it fall back to cloud?
Mitigation: Define explicit fallback behavior per classification. Confidential queries queue rather than cloud-fallback. Public queries cloud-fallback immediately. Make these policies explicit and configurable.
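Making those policies explicit might look like a small dispatch table consulted only on overload; the policy names and queue are illustrative:

```python
from collections import deque

# Assumed fallback policies per classification.
FALLBACK_POLICY = {
    "Public": "cloud",        # fail over to cloud immediately
    "Internal": "cloud",
    "Confidential": "queue",  # wait for local capacity; never leave the boundary
    "Restricted": "reject",
}

pending = deque()  # queued sensitive queries awaiting local capacity

def on_local_overload(query: str, classification: str) -> str:
    """Decide what happens to a query when every local route is saturated."""
    policy = FALLBACK_POLICY[classification]
    if policy == "cloud":
        return "routed:cloud"
    if policy == "queue":
        pending.append((classification, query))
        return "queued"
    return "rejected"
```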
Classification Drift
Classification accuracy degrades over time as vocabulary changes, new sensitive data types emerge, and user behavior evolves. A classifier trained on 2024 data might miss 2025 sensitive patterns.
Mitigation: Continuous evaluation. Sample classified queries for human review. Track false positive and false negative rates. Retrain classifiers quarterly.
Implementation Patterns
Gateway Pattern
Deploy the Router as an API gateway that sits in front of all AI endpoints. Applications don't call models directly—they call the gateway, which handles classification and routing transparently.
Benefits: Centralized policy enforcement, consistent logging, single point of configuration. All AI traffic flows through one control plane.
Sidecar Pattern
For applications that need tighter integration, deploy the Router as a sidecar. Each application instance gets its own routing logic, but policy is managed centrally.
Benefits: Lower latency (no extra network hop), better failure isolation. Tradeoff: More complex deployment, harder to update routing logic.
Embedded Pattern
For maximum performance, embed routing logic directly in the application. The application itself classifies and routes, calling models directly.
Benefits: Lowest latency, most flexibility. Tradeoff: Hardest to govern, requires each application to implement correctly.
The SIA Router Implementation
The Sovereign Intelligence Architecture includes a reference Router implementation that handles the common patterns and edge cases.
Multi-Layer Classification
Pattern matching, NER, and contextual analysis combined. Configurable classification rules that map to your data governance taxonomy.
Pluggable Model Pool
Standard interface for any model endpoint—cloud APIs, local inference, or custom deployments. Add new models without changing routing logic.
Policy-Based Routing
Declarative routing rules that consider sensitivity, latency, cost, and capability. Change routing behavior through configuration, not code.
Audit-Ready Logging
Complete request lineage: classification, routing decision, model selection, response metadata. Ready for compliance review.
The Economic Case
Let's model a typical enterprise deployment: 100,000 queries per day, average 1,000 tokens per query.
Pure cloud approach: At $0.01 per 1K tokens (GPT-4 Turbo input), that's $1,000/day or $365,000/year. Plus the compliance risk of sending everything to external APIs.
Pure local approach: Requires significant GPU infrastructure to handle peak load. At 100K queries/day, you need 4-8 high-end GPUs running continuously. Amortized capex plus power plus maintenance: roughly $200,000/year. But some queries that could use cloud endpoints are forced through expensive local infrastructure.
Router Pattern: If 80% of queries are non-sensitive and can route to cloud, you need local infrastructure only for 20K queries/day. That's 1-2 GPUs instead of 4-8. Local costs drop to $50,000/year. Cloud costs for the 80% are $292,000/year. Total: $342,000/year—comparable to pure cloud, but with full compliance for sensitive data.
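The arithmetic above can be checked directly. The token price, the 80/20 split, and the $50,000 local figure are the assumptions stated in the text, not market data:

```python
QUERIES_PER_DAY = 100_000
TOKENS_PER_QUERY = 1_000
CLOUD_PRICE_PER_1K_TOKENS = 0.01  # USD (GPT-4 Turbo input, per the text)

def annual_cloud_cost(cloud_share: float) -> float:
    """Annual cloud spend for the fraction of queries routed to cloud."""
    daily_tokens_k = QUERIES_PER_DAY * cloud_share * TOKENS_PER_QUERY / 1_000
    return daily_tokens_k * CLOUD_PRICE_PER_1K_TOKENS * 365

pure_cloud = annual_cloud_cost(1.0)          # $365,000/year
hybrid = annual_cloud_cost(0.8) + 50_000     # $292,000 cloud + $50,000 local
```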
The real savings come from risk reduction. One HIPAA violation can cost more than a decade of infrastructure. The Router Pattern provides the economic efficiency of cloud with the compliance posture of local.
The bottom line: You don't have to choose between cloud convenience and sovereign control. The Router Pattern gives you both—route by sensitivity, pay for what you need, and maintain a defensible compliance posture.
Ready to implement hybrid routing?
The SIA methodology includes reference architectures and implementation guidance for the Router Pattern tailored to your regulatory environment.
Start a Conversation →