TUTORIAL · 2026-03-26
Migrate From Vercel AI SDK to a BYOK, Self-Hostable Stack
The Vercel AI SDK is fine until you need portable keys, custom routing, or a deploy target that is not Vercel. This guide maps its core primitives to a self-hostable BYOK stack and gives you a one-week dual-write cutover.
Why teams outgrow Vercel AI Gateway
The AI SDK ships fast: `streamText`, `generateText`, and `generateObject` cover most production needs with a single TypeScript surface. The friction shows up later, usually in three places.
First, the runtime is opinionated toward Vercel. Edge runtime quirks, streaming primitives tuned for Next.js, and the AI Gateway as the recommended router all assume your prod target is Vercel.
Second, the AI Gateway sits in the request path even on BYOK. According to Vercel's public AI Gateway pricing as of May 2026, BYOK runs at 0% token markup, but your team must keep AI Gateway credits funded at all times because failed BYOK requests are retried against Vercel's system credentials. That coupling matters once compliance or VPC routing enters the picture.
Third, per-tenant policy lives in your app code. There is no native concept of tenant, model whitelist, or budget cap in the SDK. Teams running multi-tenant SaaS end up writing a second mini-gateway on top of the first.
Mapping Vercel primitives to a BYOK stack
The good news: the AI SDK's public surface is small. Most calls boil down to three functions and a provider object. Map them like this:
- `streamText` and `generateText` map directly to the OpenAI SDK's `chat.completions.create`, with `stream: true` and `stream: false` respectively. Any OpenAI-compatible endpoint works as the `baseURL`, which means llama.cpp, vLLM, LiteLLM, or a hosted provider behind a tenant-aware proxy.
- `generateObject` maps to `response_format: { type: 'json_schema' }` on OpenAI-compatible servers, or to a structured-output adapter for providers that use a different schema (Anthropic tools, Gemini JSON mode).
- Provider objects (`openai('gpt-4o')`, `anthropic('claude-...')`) become a single client pointed at your gateway, with the model id passed as a string. Routing happens server-side instead of being baked into the import statement.
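As a rough illustration of the `generateObject` mapping above, the sketch below translates one canonical schema request into the provider-specific fields that enforce it. The names `StructuredRequest` and `structuredOutputFields` are hypothetical, and the field shapes reflect the public OpenAI, Anthropic, and Gemini request formats as commonly documented; check each provider's current reference before relying on them.

```typescript
// Hypothetical canonical shape the app sends to the gateway.
type JsonSchema = Record<string, unknown>;

interface StructuredRequest {
  name: string;       // schema name; reused as the Anthropic tool name
  schema: JsonSchema; // JSON Schema for the expected object
}

type Provider = "openai" | "anthropic" | "gemini";

// Returns only the fields that enforce the schema; the caller merges them
// into the rest of the provider request body.
function structuredOutputFields(
  provider: Provider,
  req: StructuredRequest,
): Record<string, unknown> {
  switch (provider) {
    case "openai":
      // OpenAI-compatible servers: response_format with a json_schema payload.
      return {
        response_format: {
          type: "json_schema",
          json_schema: { name: req.name, schema: req.schema },
        },
      };
    case "anthropic":
      // Anthropic: declare a tool whose input is the schema and force its use.
      return {
        tools: [
          {
            name: req.name,
            description: `Return a ${req.name} object`,
            input_schema: req.schema,
          },
        ],
        tool_choice: { type: "tool", name: req.name },
      };
    case "gemini":
      // Gemini: JSON MIME type plus a response schema. Gemini accepts a
      // subset of schema keywords, so some schemas may need trimming first.
      return {
        generationConfig: {
          responseMimeType: "application/json",
          responseSchema: req.schema,
        },
      };
  }
}
```

Keeping this translation in the gateway means the app only ever sends the canonical shape.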
Keep your existing React hooks. `useChat` and `useCompletion` only need a route that returns a Server-Sent Events stream in the same shape.
Streaming, tool calls, and structured outputs without lock-in
All three features survive the migration if you pick an OpenAI-compatible gateway. Here is the same call before and after.
Before, with the Vercel AI SDK:
```ts
import { streamText } from 'ai';
import { openai } from '@ai-sdk/openai';

const result = await streamText({
  model: openai('gpt-4o'),
  messages,
  tools: { search: searchTool },
});
return result.toDataStreamResponse();
```
After, with the OpenAI SDK pointed at your own gateway:
```ts
import OpenAI from 'openai';

const client = new OpenAI({ baseURL: process.env.GATEWAY_URL, apiKey: tenantKey });
const stream = await client.chat.completions.create({
  model: 'gpt-4o',
  messages,
  tools: [searchToolSchema],
  stream: true,
});
```
Tool calls arrive as incremental fragments in `choices[0].delta.tool_calls`, with the `arguments` string split across chunks. Structured outputs use `response_format`. Both are standard OpenAI-spec fields that LiteLLM, vLLM, and most managed providers honor.
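Because streamed tool calls arrive in pieces, the handler has to reassemble them before executing anything. A minimal sketch, assuming OpenAI-style fragments keyed by `index`; the `ToolCallDelta` and `ToolCall` types and the `accumulateToolCalls` helper are illustrative names, not part of any SDK:

```typescript
// Minimal shape of one streamed tool-call fragment in an OpenAI-style chunk.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

interface ToolCall {
  id: string;
  name: string;
  arguments: string; // raw JSON text; only parseable once the stream has ended
}

// Merge fragments by `index`. The id and name typically arrive on the first
// fragment; `arguments` arrives as string pieces that must be concatenated.
function accumulateToolCalls(deltas: ToolCallDelta[]): ToolCall[] {
  const calls: ToolCall[] = [];
  for (const d of deltas) {
    const call = (calls[d.index] ??= { id: "", name: "", arguments: "" });
    if (d.id) call.id = d.id;
    if (d.function?.name) call.name = d.function.name;
    if (d.function?.arguments) call.arguments += d.function.arguments;
  }
  // Drop any holes left by indexes that never appeared.
  return calls.filter(Boolean);
}
```

In a real handler the deltas would be collected from each chunk of the `stream` iterator, and `JSON.parse` would run on `arguments` only after the final chunk.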
Keys, quotas, and per-tenant routing
This is where a thin SDK pays off. Once requests flow through one OpenAI-compatible endpoint you control, every cross-cutting concern moves out of app code.
A self-hosted gateway like LiteLLM lets you mint virtual keys per tenant, set RPM and TPM ceilings, attach budgets, and route by model alias. A request for `model: 'fast'` can resolve to Groq Llama for one tenant and a local llama.cpp server for another, with no code change in the Next.js app.
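To make the alias idea concrete, here is a small sketch of per-tenant resolution. It is not LiteLLM's configuration format; the `RoutingTable` type, the `resolveUpstream` helper, the tenant ids, hostnames, and model ids are all placeholders invented for illustration.

```typescript
// Hypothetical routing table: tenant id -> model alias -> upstream target.
interface Upstream {
  baseURL: string; // OpenAI-compatible endpoint to forward to
  model: string;   // concrete model id sent upstream
}

type RoutingTable = Record<string, Record<string, Upstream>>;

// Placeholder hosts and model ids; substitute your own.
const routes: RoutingTable = {
  "tenant-a": {
    fast: { baseURL: "https://hosted-provider.example/v1", model: "hosted-llama" },
  },
  "tenant-b": {
    fast: { baseURL: "http://llama-cpp.internal:8080/v1", model: "local-llama" },
  },
};

// Resolve the alias the app sent into a concrete upstream for this tenant.
function resolveUpstream(table: RoutingTable, tenantId: string, alias: string): Upstream {
  const upstream = table[tenantId]?.[alias];
  if (!upstream) {
    throw new Error(`no route for tenant "${tenantId}" and model "${alias}"`);
  }
  return upstream;
}
```

The app keeps sending `model: 'fast'`; only the table changes when a tenant moves to a different backend.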
osFoundry takes the same pure-passthrough BYOK posture and adds per-tenant routing rules, streaming, tool calls, and structured outputs in one runtime you can self-host. LiteLLM is the obvious open-source baseline here and is the right choice for many teams; pick whichever matches your ops model.
The load-bearing decision is to make the gateway, not the app, own keys and quotas. Everything else, including model swaps, becomes a config change.
Self-host or run hybrid: cost and ops
Three deploy shapes cover most teams.
1. Fully self-hosted. Gateway in your VPC, BYOK to providers, optional local models on a GPU box. Zero markup, full audit trail, you carry the on-call. Best when compliance or data residency drives the decision.
2. Hybrid. Self-hosted gateway for routing and policy, managed providers for inference, local models only for cheap or private workloads. This is the common steady state.
3. Managed gateway, your keys. Use a hosted OpenAI-compatible proxy that supports BYOK passthrough. You give up some control over the request path; you gain not running another service.
The ops cost of option 1 is real: one small container, a Postgres database for keys and spend, log shipping, and an upgrade cadence. For most teams under a few hundred million tokens a month, the savings versus a markup-based gateway are smaller than the time spent debating it. Choose based on control, not pennies.
Cutover script: dual-write for one week
Do not flip the import in one PR. Dual-write for seven days, compare, then cut over.
Day 0: add the new gateway client behind a feature flag. For each request, run the old `streamText` path and, in parallel, fire the new `chat.completions` call with the same messages. Discard the second response, but record latency, token counts, finish reason, and any tool-call shape mismatches.
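One way to structure the Day 0 shadow call is a wrapper that serves the old path to the user and records the new path on the side. This is a sketch under assumptions: `dualWrite`, `CallResult`, and `ShadowRecord` are hypothetical names, and the two callbacks stand in for your existing `streamText` handler and the new `chat.completions` call.

```typescript
// Hypothetical summary of one model call, reduced to the fields being compared.
interface CallResult {
  finishReason: string;
  totalTokens: number;
}

interface ShadowRecord {
  shadowLatencyMs: number;
  finishReasonMatches?: boolean;
  tokenDelta?: number; // shadow tokens minus primary tokens
  error?: string;      // set when the shadow call failed
}

// Run both paths in parallel, return only the primary result, and hand a
// comparison record to `record`. Shadow failures never reach the user.
async function dualWrite<T extends CallResult>(
  primary: () => Promise<T>,
  shadow: () => Promise<CallResult>,
  record: (r: ShadowRecord) => void,
): Promise<T> {
  const started = Date.now();
  // Promise.resolve().then(...) also captures synchronous throws from `shadow`.
  const shadowSettled = Promise.resolve()
    .then(shadow)
    .then(
      (value) => ({ value, error: undefined, latencyMs: Date.now() - started }),
      (err: unknown) => ({ value: undefined, error: String(err), latencyMs: Date.now() - started }),
    );

  const result = await primary();

  // Compare off the request path; the shadow response itself is discarded.
  void shadowSettled
    .then((s) => {
      if (s.value !== undefined) {
        record({
          shadowLatencyMs: s.latencyMs,
          finishReasonMatches: s.value.finishReason === result.finishReason,
          tokenDelta: s.value.totalTokens - result.totalTokens,
        });
      } else {
        record({ shadowLatencyMs: s.latencyMs, error: s.error });
      }
    })
    .catch(() => {
      // A failing recorder must not surface as an unhandled rejection.
    });

  return result;
}
```

Note that user-facing latency is governed by the primary call only; the shadow call can finish later without delaying the response.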
Days 1-3: shadow 100% of traffic. Diff structured outputs and tool-call argument JSON. Most regressions are schema-related: Anthropic returns slightly different stop reasons, Gemini wraps JSON differently. Fix in the gateway, not the app.
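For the tool-call argument diff, comparing raw strings produces false positives whenever key order differs. A minimal structural diff, assuming both sides are valid JSON; `diffJson` and `diffToolArgs` are hypothetical helper names:

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Return the paths at which two JSON values differ, ignoring object key order.
function diffJson(a: Json, b: Json, path = "$"): string[] {
  if (a === b) return [];
  const aIsObj = typeof a === "object" && a !== null;
  const bIsObj = typeof b === "object" && b !== null;
  // Differing primitives, or mismatched kinds (object vs array vs primitive).
  if (!aIsObj || !bIsObj || Array.isArray(a) !== Array.isArray(b)) return [path];

  const out: string[] = [];
  if (Array.isArray(a) && Array.isArray(b)) {
    const n = Math.max(a.length, b.length);
    for (let i = 0; i < n; i++) {
      if (i >= a.length || i >= b.length) out.push(`${path}[${i}]`);
      else out.push(...diffJson(a[i], b[i], `${path}[${i}]`));
    }
    return out;
  }

  const ao = a as { [key: string]: Json };
  const bo = b as { [key: string]: Json };
  // Union of keys from both sides, without duplicates.
  const keys = Object.keys(ao).concat(Object.keys(bo).filter((k) => !(k in ao)));
  for (const k of keys) {
    if (!(k in ao) || !(k in bo)) out.push(`${path}.${k}`);
    else out.push(...diffJson(ao[k], bo[k], `${path}.${k}`));
  }
  return out;
}

// Convenience wrapper for the raw `arguments` strings of two tool calls.
function diffToolArgs(oldArgs: string, newArgs: string): string[] {
  return diffJson(JSON.parse(oldArgs) as Json, JSON.parse(newArgs) as Json);
}
```

An empty result means the two calls are structurally identical; any returned path points at the field to normalize in the gateway.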
Days 4-6: flip 10%, then 50%, then 100% of read-only routes (chat, summarize). Keep write or agentic routes on the old path until the diff is clean for 24 hours.
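The percentage flip works best when it is deterministic, so a given tenant or session stays on one path as the percentage rises. A sketch using an FNV-1a-style hash over the key's UTF-16 code units; `useNewPath` and its parameters are hypothetical, and any stable hash would do:

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units. Not cryptographic; it only
// needs to spread keys evenly and be stable across processes.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned
}

// Decide whether a request takes the new gateway path.
// `key` should be stable per tenant or session; `percent` is 0 to 100.
function useNewPath(key: string, percent: number, readOnlyRoute: boolean): boolean {
  // Write and agentic routes stay on the old path regardless of percentage.
  if (!readOnlyRoute) return false;
  return fnv1a(key) % 100 < percent;
}
```

Because the bucket depends only on the key, anyone included at 10% is still included at 50% and 100%.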
Day 7: remove the `ai` and `@ai-sdk/*` packages, delete the AI Gateway env vars, and archive the flag.
Post-migration: caching, observability, evals
Owning the request path unlocks three things that were awkward inside the SDK.
Caching: a gateway can hash on `(model, messages, tools, response_format)` and serve identical requests from Redis. For RAG and agent loops with repeated system prompts, prompt-prefix caching at the provider level (Anthropic, OpenAI) layers on top. Wire both; cache hits show up immediately in latency p50.
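A cache key along those lines might be derived as follows. This is a sketch, not a gateway's actual implementation: `canonicalize`, `cacheKey`, and `CacheableRequest` are illustrative names, and the key deliberately covers only the four fields named above, so sampling parameters such as temperature are ignored.

```typescript
import { createHash } from "node:crypto";

// Hypothetical subset of a chat request that participates in the cache key.
interface CacheableRequest {
  model: string;
  messages: unknown[];
  tools?: unknown[];
  response_format?: unknown;
}

// Serialize with sorted object keys so logically equal requests hash equally.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null";
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .filter((k) => obj[k] !== undefined) // treat undefined like a missing key
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// SHA-256 over the canonical form of (model, messages, tools, response_format).
function cacheKey(req: CacheableRequest): string {
  const material = canonicalize({
    model: req.model,
    messages: req.messages,
    tools: req.tools,
    response_format: req.response_format,
  });
  return createHash("sha256").update(material).digest("hex");
}
```

The resulting hex string can be used directly as a Redis key, typically with a tenant prefix so tenants never share entries.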
Observability: emit one structured log per request with tenant id, model, prompt tokens, completion tokens, tool-call count, finish reason, and upstream latency. Ship to whatever you already use. You no longer need a vendor-specific tracing integration to see what the model did.
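The per-request log line could be built from a non-streaming OpenAI-style response roughly like this. `buildRequestLog` and `ChatCompletionLike` are hypothetical names, and the output field names are one possible convention rather than a standard:

```typescript
// Minimal view of an OpenAI-style chat completion, limited to logged fields.
interface ChatCompletionLike {
  model: string;
  choices: {
    finish_reason: string | null;
    message: { tool_calls?: unknown[] };
  }[];
  usage?: { prompt_tokens: number; completion_tokens: number };
}

// Build one JSON log line per request; ship it with whatever logger you use.
function buildRequestLog(
  tenantId: string,
  res: ChatCompletionLike,
  upstreamLatencyMs: number,
): string {
  const choice = res.choices[0];
  return JSON.stringify({
    tenant_id: tenantId,
    model: res.model,
    // Some servers omit usage; log null rather than a misleading zero.
    prompt_tokens: res.usage?.prompt_tokens ?? null,
    completion_tokens: res.usage?.completion_tokens ?? null,
    tool_call_count: choice?.message.tool_calls?.length ?? 0,
    finish_reason: choice?.finish_reason ?? null,
    upstream_latency_ms: upstreamLatencyMs,
  });
}
```

For streamed responses the same fields would be filled in from the accumulated chunks once the stream ends.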
Evals: with all traffic flowing through one endpoint, sampling for an eval set is a SQL query. Replay against new models by changing the `model` field. This is the long-term reason to own the gateway: model choice becomes a weekly experiment, not a quarterly migration.
Frequently asked questions
- Does dropping the Vercel AI SDK mean I lose useChat and the React streaming hooks?
- No. The hooks are decoupled from the server runtime as long as your API route returns a Server-Sent Events stream in the shape the hook expects. You can keep `useChat` and `useCompletion` from the `ai` package and point them at a route that proxies an OpenAI-compatible streaming response. Many teams keep the React side untouched for the first month of the migration and only swap the server handler. If you eventually want to drop the `ai` dependency entirely, a thin SSE parser is roughly 30 lines of TypeScript.
- Is LiteLLM a real alternative or a stopgap?
- It is a real alternative and is widely deployed in production. LiteLLM is an open-source OpenAI-compatible proxy that fronts 140-plus providers, supports virtual keys, per-key budgets, RPM and TPM limits, and load balancing. It runs as a single Docker container with Postgres. The trade-off versus a fuller orchestration runtime is mostly around agent loops, structured-output normalization across providers, and tenant-scoped policy beyond keys and budgets. For a pure routing and BYOK use case, LiteLLM is often the right answer on its own.
- How do I keep structured outputs working across providers after migration?
- Normalize at the gateway, not in the app. OpenAI and OpenAI-compatible servers accept `response_format: { type: 'json_schema', json_schema: {...} }`. Anthropic uses a tool-call pattern to enforce schemas. Gemini has a `responseMimeType` plus `responseSchema`. A small adapter layer in the gateway translates one canonical request shape into whichever provider you are dispatching to and validates the returned JSON before responding. This keeps your application code calling a single function and lets you swap models without touching schema-handling logic.
- What about latency? Adding a self-hosted gateway sounds like another hop.
- In practice the added latency is single-digit milliseconds if the gateway is in the same region as your app, which is dwarfed by model inference time (hundreds of ms to seconds). The bigger latency win is on the cache side: a gateway can serve repeated prompts from Redis in under 5 ms, which is impossible if every request goes straight to a provider. Measure p50 and p95 before and after; teams usually see neutral or improved numbers once caching is on.
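The first answer above mentions a thin SSE parser. A minimal sketch of one, assuming an OpenAI-compatible stream where each event is a single `data:` line of JSON and the stream ends with `data: [DONE]`. It does not implement the full SSE specification (multi-line data, named events) or the AI SDK's own stream protocol, and `createSseParser` is a hypothetical name.

```typescript
// Incremental parser: feed it decoded text chunks as they arrive.
// `onData` receives each parsed JSON payload; `onDone` fires on "[DONE]".
function createSseParser(
  onData: (payload: unknown) => void,
  onDone: () => void,
): (chunk: string) => void {
  let buffer = "";
  return function feed(chunk: string): void {
    buffer += chunk;
    let newline = buffer.indexOf("\n");
    while (newline !== -1) {
      // Take one complete line and tolerate CRLF line endings.
      const line = buffer.slice(0, newline).replace(/\r$/, "");
      buffer = buffer.slice(newline + 1);
      newline = buffer.indexOf("\n");

      // Skip blank separators, comments, and fields other than data.
      if (!line.startsWith("data:")) continue;
      const payload = line.slice("data:".length).trimStart();
      if (payload === "[DONE]") onDone();
      else onData(JSON.parse(payload));
    }
    // Any partial line stays in `buffer` until the next chunk completes it.
  };
}
```

In a browser this would typically be fed from `response.body` through a `TextDecoder`, one decoded chunk at a time.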
Sources