By Sasha Volkov, Developer Advocate, osStudio
BYOK LLM Architecture: 3 Patterns for Bring-Your-Own-Key Products
_Letting users bring their own AI provider keys is no longer optional for serious B2B products — but the architecture choices are subtle. osFoundry has run all three major BYOK patterns in production: a centralized gateway, an embedded-SDK pass-through, and a hybrid that does both. Each has different cost, latency, and trust implications. This piece walks through the trade-offs and shows when each pattern fits, with concrete numbers from running a multi-tenant LLM platform._
Why BYOK at all
Three forces push products toward BYOK whether they want it or not:
**Procurement.** Enterprise buyers have already negotiated rates with OpenAI, Anthropic, or their preferred cloud. They want to spend that committed budget, not pay you a markup on top. A 2025 Gartner snapshot found that 64% of enterprise GenAI buyers already hold direct provider contracts. Re-billing them through your account is friction.
**Trust.** Regulated industries — healthcare, legal, finance — often have data-processing agreements with specific providers and cannot route through a third party. BYOK lets the customer's data hit *their* OpenAI tenant, governed by *their* DPA.
**Cost passthrough.** If you charge $0.05 per message and the underlying tokens cost $0.04, your margin is fragile. BYOK lets the customer absorb token volatility directly while you charge a platform fee. osFoundry does this with a 7.5% fee on top of provider cost; the tokens themselves are billed against the customer's own provider account, not ours.
The counter-argument — that BYOK is operational overhead users don't want — is real for prosumer tools. For B2B, it's table stakes.
The 3 architecture patterns
**Pattern 1: Gateway.** Every model call goes through your servers. You hold the customer's key in escrow, decrypt it per request, sign the upstream call. You see every prompt and response.
- **Pro:** centralized observability, easy retries/fallback, single point for rate limiting, server-side prompt templates.
- **Con:** you take on a compliance burden, because every prompt and response transits your infra. Customers in regulated industries may reject this.
- **Latency cost:** +20-80ms per call vs direct.
**Pattern 2: Embedded SDK.** The client (browser, desktop app, mobile) holds the key and calls the provider directly. Your servers never see the key or the payload.
- **Pro:** minimal trust surface, no compliance burden, lowest latency.
- **Con:** no server-side observability, no fallback orchestration, key visible in client memory (a problem in browser contexts).
- **Latency cost:** baseline, no proxy hop.
**Pattern 3: Hybrid.** Key stored server-side, but the actual model call originates client-side. Server issues short-lived signed tokens or per-call ephemeral credentials.
- **Pro:** centralized key management with direct client-to-provider calls.
- **Con:** more complex; requires providers that support delegated credentials (OpenAI's `session-key` pattern today, with Anthropic support via OAuth still to come).
- **Latency cost:** +10-30ms for the token mint, then direct.
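The token-mint step in the hybrid pattern can be sketched with a short-lived HMAC-signed token. This is a minimal illustration under stated assumptions, not any provider's delegated-credential API: `mint_token`, `verify_token`, the claim names, and the 60-second default lifetime are all hypothetical, and whatever component honors the token would need to share the signing secret.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    """URL-safe base64 as an ASCII string."""
    return base64.urlsafe_b64encode(data).decode("ascii")


def _sign(secret: bytes, payload: str) -> str:
    """HMAC-SHA256 signature over the encoded payload."""
    return _b64(hmac.new(secret, payload.encode("ascii"), hashlib.sha256).digest())


def mint_token(secret: bytes, tenant_id: str, ttl_seconds: int = 60,
               now: Optional[float] = None) -> str:
    """Server side: issue a token that expires ttl_seconds after issuance."""
    issued = time.time() if now is None else now
    claims = {"tenant": tenant_id, "exp": issued + ttl_seconds}
    payload = _b64(json.dumps(claims, sort_keys=True).encode("utf-8"))
    return f"{payload}.{_sign(secret, payload)}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[dict]:
    """Return the claims if the signature is valid and unexpired, else None."""
    current = time.time() if now is None else now
    try:
        payload, signature = token.split(".")
    except ValueError:
        return None  # not in "payload.signature" form
    if not hmac.compare_digest(signature, _sign(secret, payload)):
        return None  # tampered or signed with a different secret
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if claims["exp"] <= current:
        return None  # expired
    return claims
```

The long-lived provider key never leaves the server in this sketch; the client only ever holds a credential that stops working after the TTL.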
Key storage — don't log them
Whatever pattern you pick, keys need an envelope-encrypted store, not a plain DB column. The minimum bar:
- **At-rest encryption with a KMS-managed key.** Per-tenant data keys, master key in HSM. Never store the raw provider key in a column accessible to your app role.
- **Decrypt only at use.** The plaintext key lives in process memory for the duration of the upstream call, then is zeroed. Don't hold it in a long-lived cache.
- **Audit every access.** Every decrypt operation gets a row in an append-only audit log with `who`, `when`, `which key`, `which request`. If the audit log volume scares you, your access pattern is wrong.
- **Never log the key.** This sounds obvious. It is the #1 way keys leak. Sentry, Datadog, CloudWatch — all of them have caught raw API keys because someone logged a request object verbatim. Use a serializer that strips `authorization` headers and any field named `*_key`.
A leaked customer OpenAI key is a major incident, the kind that ends in a formal incident report and a very uncomfortable customer call. The engineering hours spent on a strict secrets discipline are nothing compared to the alternative.
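The "never log the key" rule can be enforced with a redacting serializer that runs before anything reaches the logging pipeline. A minimal sketch, assuming request objects are plain dicts and lists; `redact`, `is_sensitive`, and the `[REDACTED]` placeholder are illustrative names, and the two patterns are the ones named above.

```python
import fnmatch
from typing import Any

REDACTED = "[REDACTED]"
# Patterns from the text: authorization headers and any field named *_key.
SENSITIVE_PATTERNS = ("authorization", "*_key")


def is_sensitive(name: str) -> bool:
    """Case-insensitive match of a field or header name against the patterns."""
    lowered = name.lower()
    return any(fnmatch.fnmatchcase(lowered, pattern) for pattern in SENSITIVE_PATTERNS)


def redact(value: Any) -> Any:
    """Return a copy of value with sensitive fields masked, recursing into containers."""
    if isinstance(value, dict):
        return {
            key: REDACTED if is_sensitive(str(key)) else redact(item)
            for key, item in value.items()
        }
    if isinstance(value, (list, tuple)):
        return [redact(item) for item in value]
    return value
```

Calling `redact(request_dict)` before handing the object to a logger leaves the original untouched and masks matching fields at any nesting depth. It does not catch keys embedded inside free-text strings, so it complements rather than replaces the rule of not logging request objects verbatim.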
Routing and fallback
Real BYOK products rarely use just one provider. A customer might have keys for OpenAI (primary), Anthropic (fallback for outages), and a self-hosted vLLM (for private data). The router decides which to call.
The routing surface has three axes:
- **Capability** — does the request need vision? function calling? 200K context? Some providers can't serve it, route around them.
- **Cost** — at the same capability tier, route to the cheapest provider the customer has a key for.
- **Health** — if Anthropic returned 5xx for the last N requests, deprioritize for the next 60s. Standard circuit breaker.
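The three axes above can be combined into a single selection function. This is a rough sketch, not osFoundry's router: the `Provider` fields, the failure threshold of 3 (standing in for "N"), and the per-token cost figure are assumptions chosen for illustration. Only the 60-second cooldown comes from the text.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

FAILURE_THRESHOLD = 3    # assumed value for "N" consecutive failures
COOLDOWN_SECONDS = 60.0  # deprioritization window from the text


@dataclass
class Provider:
    name: str
    capabilities: FrozenSet[str]
    cost_per_1k_tokens: float      # placeholder number, not a real price
    consecutive_failures: int = 0
    open_until: float = 0.0        # circuit is "open" until this timestamp


def record_result(provider: Provider, ok: bool, now: float) -> None:
    """Update circuit-breaker state after an upstream call."""
    if ok:
        provider.consecutive_failures = 0
        return
    provider.consecutive_failures += 1
    if provider.consecutive_failures >= FAILURE_THRESHOLD:
        provider.open_until = now + COOLDOWN_SECONDS


def choose_provider(providers: List[Provider], required: FrozenSet[str],
                    now: float) -> Optional[Provider]:
    """Capability filters, health deprioritizes, cost breaks ties."""
    capable = [p for p in providers if required <= p.capabilities]
    # Sort healthy providers first (False < True), then cheapest first.
    capable.sort(key=lambda p: (p.open_until > now, p.cost_per_1k_tokens))
    return capable[0] if capable else None
```

Note that an unhealthy provider is sorted last rather than excluded, which matches "deprioritize": if it is the only provider that can serve the request, it is still tried.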
Fallback chains should fail *forward*, not backward: if Claude is down, fail over to GPT rather than returning an error and asking the user to retry. But fallbacks have a tax: the user's first-token latency now includes the failed attempt's timeout. Cap retries aggressively (1 retry, 3s timeout, then bail) or your worst-case latency triples.
The gateway pattern makes this easy. The embedded SDK pattern makes it hard — the client has to implement routing logic, which means every client gets out of date when you add a provider.
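On the gateway side, a fail-forward chain with a hard retry cap might look like the sketch below. It assumes each provider is wrapped in a callable that accepts a timeout in seconds and raises on failure; `call_with_fallback` and `AllProvidersFailed` are hypothetical names, and "1 retry" is read here as one primary attempt plus one fallback attempt.

```python
from typing import Callable, List, Sequence


class AllProvidersFailed(Exception):
    """Raised when every permitted attempt has failed."""


def call_with_fallback(calls: Sequence[Callable[[float], str]],
                       timeout_s: float = 3.0,
                       max_retries: int = 1) -> str:
    """Try providers in priority order, failing forward to the next one.

    At most max_retries + 1 attempts are made, so worst-case added latency
    is bounded by (max_retries + 1) * timeout_s.
    """
    errors: List[Exception] = []
    for call in calls[: max_retries + 1]:
        try:
            return call(timeout_s)
        except Exception as exc:  # any upstream failure moves to the next provider
            errors.append(exc)
    raise AllProvidersFailed(errors)
```

With the defaults, a fully down primary costs the user at most one 3-second timeout before the fallback is tried, and the call gives up after two attempts even if more providers are configured.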
Observability without phoning home
BYOK products need to tell customers what their money is being spent on without seeing the actual prompts. Three telemetry layers:
**Layer 1 — counts.** Number of calls, by model, by tool, by user. Always safe to collect, never contains content.
**Layer 2 — token counts.** Input and output token counts per call. Sensitive only in extreme cases (e.g., "input was 50K tokens" leaks corpus size). Generally fine.
**Layer 3 — content.** The actual prompts and responses. In a strict BYOK deployment, this never leaves the customer's machine or their cloud tenant. osFoundry's embedded-SDK path takes this approach for sensitive customers — the desktop app stores a local trace that the customer can opt to share when they file a bug, but the cloud never sees it.
The hard case is debugging. "My agent gave a wrong answer" is impossible to triage from token counts alone. The middle ground we've landed on: redacted traces that hash inputs/outputs with a per-tenant salt, plus a per-session opt-in that lets the customer share unredacted traces for support.
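A redacted trace record of that kind could be built as follows. This sketch uses HMAC-SHA256 keyed with the tenant salt, which is one reasonable way to implement a salted hash rather than necessarily the scheme osFoundry uses; the function and field names are illustrative.

```python
import hashlib
import hmac


def _digest(tenant_salt: bytes, text: str) -> str:
    """Keyed hash of the content; the same text and salt always give the same digest."""
    return hmac.new(tenant_salt, text.encode("utf-8"), hashlib.sha256).hexdigest()


def redact_trace(tenant_salt: bytes, prompt: str, response: str,
                 input_tokens: int, output_tokens: int) -> dict:
    """Build a trace record that carries hashes and counts but no content."""
    return {
        "prompt_hash": _digest(tenant_salt, prompt),
        "response_hash": _digest(tenant_salt, response),
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
    }
```

Within one tenant, identical prompts produce identical hashes, so support can spot repeated or changed inputs across a session. Because the salt differs per tenant, the same prompt from two tenants yields unrelated hashes and cannot be correlated.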
Pricing models — passthrough vs markup vs flat
Three pricing shapes for BYOK products, with the trade-offs:
**Pure passthrough.** You charge zero on tokens — they're billed to the customer's provider account. You charge a flat platform fee (per seat, per workspace, per feature). Cleanest legally, hardest to grow ARPU.
**Percentage markup.** You charge X% on top of token cost as a platform fee. osFoundry uses 7.5%. The advantage: revenue scales with usage, naturally aligning your incentives with the customer's value. The risk: customers see the line item and ask "why am I paying you a markup when it's *my* key?" — the answer has to be the platform value (orchestration, observability, agents).
**Flat per-call.** You charge $0.001 per LLM call regardless of size. Predictable for buyers, terrible margin for you on long-context calls.
The shape that wins depends on your wedge. Tools that primarily provide orchestration value (LangChain-likes, agent platforms) tend toward percentage markup. Tools that primarily provide a UI (chat clients, IDE extensions) tend toward flat per-seat with pure BYOK passthrough.
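To see how the three shapes diverge, it helps to run the same month of usage through each. The sketch below uses the 7.5% and $0.001 figures from this section; the seat count, seat price, token spend, and call volume in the usage note are made-up inputs.

```python
PLATFORM_MARKUP_RATE = 0.075  # percentage markup quoted in this section
FLAT_PER_CALL_PRICE = 0.001   # flat per-call price quoted in this section


def passthrough_fee(seats: int, price_per_seat: float) -> float:
    """Pure passthrough: flat platform fee, nothing charged on tokens."""
    return seats * price_per_seat


def markup_fee(provider_token_cost: float, rate: float = PLATFORM_MARKUP_RATE) -> float:
    """Percentage markup: platform fee scales with the customer's token spend."""
    return provider_token_cost * rate


def per_call_fee(call_count: int, price: float = FLAT_PER_CALL_PRICE) -> float:
    """Flat per-call: fee depends on call volume, not on call size."""
    return call_count * price


def compare(seats: int, price_per_seat: float,
            provider_token_cost: float, call_count: int) -> dict:
    """Platform revenue for the same usage under each pricing shape."""
    return {
        "passthrough": passthrough_fee(seats, price_per_seat),
        "markup": markup_fee(provider_token_cost),
        "per_call": per_call_fee(call_count),
    }
```

For a hypothetical workspace with 10 seats at $20 each, $1,000 of monthly token spend, and 200,000 calls, `compare(10, 20.0, 1000.0, 200_000)` works out to roughly $200 under passthrough, $75 under markup, and $200 per-call. Doubling the token spend moves only the markup figure.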
Migrating from one-vendor lock to BYOK
If you started with a single provider account paying for everyone, the migration is non-trivial. The steps that worked for us:
1. **Add the schema first** — provider table, key table (encrypted), per-user/workspace bindings — without using it. Ship to prod, verify nothing broke.
2. **Build the routing layer with a single hard-coded provider.** Functionally identical to before, but the path is now "router → provider call," not "direct provider call."
3. **Add an admin UI for keys.** Customers can paste a key; the routing layer prefers their key if present, falls back to the platform key.
4. **Migrate cohorts.** Enterprise customers first (they want it). Then mid-market. Prosumer/free-tier can stay on the platform key forever if your business model supports it.
5. **Sunset the platform key on paid tiers.** Eventually, the platform key only serves free-tier and trial users. That's the steady state.
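Steps 3 and 5 come down to one key-resolution rule in the routing layer. A minimal sketch, assuming keys have already been decrypted and are looked up by workspace ID; `resolve_key`, the tier names, and the `sunset_done` flag are hypothetical.

```python
from typing import Mapping, Tuple

# Tiers that may keep using the platform key after the sunset (step 5).
PLATFORM_KEY_TIERS = frozenset({"free", "trial"})


class NoKeyAvailable(Exception):
    """Raised when a workspace has no usable key."""


def resolve_key(workspace_id: str, tier: str,
                customer_keys: Mapping[str, str],
                platform_key: str,
                sunset_done: bool = False) -> Tuple[str, str]:
    """Return (source, key), preferring the customer's own key.

    Before the sunset, every tier may fall back to the platform key.
    After it, only free and trial tiers may.
    """
    customer_key = customer_keys.get(workspace_id)
    if customer_key:
        return ("customer", customer_key)
    if sunset_done and tier not in PLATFORM_KEY_TIERS:
        raise NoKeyAvailable(f"workspace {workspace_id} must supply its own key")
    return ("platform", platform_key)
```

Returning the source alongside the key lets the billing and audit paths record which account paid for a call, which is useful while cohorts are mid-migration.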
Frequently asked questions
**What is BYOK in the context of LLMs?** osFoundry defines BYOK (Bring Your Own Key) as a product architecture where customers supply their own API keys for AI providers (OpenAI, Anthropic, Google, or self-hosted models) and the product routes calls through those keys rather than a centralized platform account. The customer is billed by the provider directly; the product charges a platform fee on top.
**Which BYOK architecture pattern is best?** There is no single best BYOK pattern. The gateway pattern fits when you need centralized observability and fallback orchestration. The embedded SDK pattern fits regulated customers who can't route data through a third party. The hybrid pattern fits when you need centralized key management but want direct client-to-provider calls. osFoundry runs all three depending on the customer's compliance posture.
**How should I store user API keys securely?** Store BYOK API keys with envelope encryption: a KMS-managed master key wrapping per-tenant data keys, with plaintext only existing in process memory during the upstream call. Audit every decrypt operation. Strip any header or field matching `*_key` or `authorization` from your logging pipeline. A leaked customer provider key is a major incident; the engineering discipline to prevent it is mandatory.
**What latency does a BYOK gateway add?** A BYOK gateway typically adds 20-80ms per call versus a direct provider call, depending on geographic proximity, TLS reuse, and whether key decryption hits a KMS roundtrip. The embedded SDK pattern adds zero latency because the client calls the provider directly. The hybrid pattern adds 10-30ms for short-lived credential minting, then runs at direct-call speed.
**How do BYOK products charge customers?** BYOK products typically use one of three pricing models: pure passthrough (flat platform fee, zero on tokens), percentage markup (osFoundry uses 7.5% on top of provider cost), or flat per-call pricing. Percentage markup scales naturally with usage but requires the platform to deliver clear orchestration value. Flat per-seat with passthrough is common for prosumer tools where simplicity wins.
**Can a BYOK product see customer prompts?** It depends on the architecture pattern. In a gateway pattern, prompts and responses transit the platform's infrastructure and are observable. In an embedded SDK or strict-hybrid pattern, prompts go client-to-provider directly and the platform never sees them. osFoundry's most privacy-sensitive customers use the embedded-SDK path with local-only traces that they opt to share when filing bugs.
**How do BYOK products handle provider fallback?** BYOK fallback chains route through capability, cost, and health filters: capability eliminates providers that can't serve the request (no vision, no 200K context), cost picks the cheapest remaining option, and circuit breakers deprioritize providers returning recent 5xx errors. Cap retries at one attempt with a 3-second timeout to avoid tripling worst-case latency on a fully-down provider.