Home / Features / Run any model
Run any AI model with osFoundry — local, cloud, or self-hosted
BYOK any cloud API, run open weights on your laptop, or deploy a dedicated GPU endpoint — all from one workspace.
osFoundry is a hybrid AI orchestration platform that runs any AI model from a single workspace — open-weight Llama, Qwen, or Mistral on your laptop; Claude, GPT, or Gemini through your own API keys; and dedicated GPU endpoints in our cloud for reserved capacity. Switch backends mid-conversation, never get locked into a single provider, and pay only for the seconds your model actually runs.
Quick answer
- Run open-weight models locally with osFoundry’s on-device inference runtime — no token cost, no data leaves your machine.
- Bring your own API keys (BYOK) for Anthropic, OpenAI, Google, Mistral, Together, and any OpenAI-compatible endpoint.
- Deploy dedicated GPU endpoints in osFoundry cloud for reserved throughput on the open-weight model of your choice.
- Route requests across all three modes from one chat — switch local ↔ cloud ↔ self-host without leaving the conversation.
- No markup on tokens — your provider account is billed directly.
What it is
Most AI tools force a single backend: a hosted chat product, a single model API, or a self-host you maintain alone. osFoundry treats local inference, cloud APIs, and self-hosted endpoints as three interchangeable backends behind one chat surface, one config layer, and one billing surface. The same prompt can hit a local 8B model for low-latency triage, a Claude Sonnet API for hard reasoning, and a self-hosted Llama 70B for sensitive data — all in one conversation.
Key capabilities
- Local inference with quantised open-weight models (Q4 to FP16) on Apple Silicon and NVIDIA GPUs.
- BYOK to any provider with an OpenAI-compatible API — keys live in your encrypted keychain.
- Per-request model dispatch driven by user-configurable routing rules in osStudio.
- Hot-swap LoRA adapters at inference time without restarting the model.
- Inference server fleet view — pool capacity across local boxes, cloud endpoints, and self-hosted GPUs.
- Fall-back chains: try local first, fail over to cloud if the model isn’t loaded.
How to do it in osFoundry
- Pick a model — Browse the catalog at /community/models and /community/api-models — 76,000+ open weights and 364 hosted API models, with cross-links between dual-nature ones (e.g. Llama 3.1 70B is both).
- Wire it up — For BYOK: paste your provider key into the key dialog and assign the model to a Maestro role. For local: hit Install on the model page. For self-host: deploy a GPU endpoint from the Servers tab.
- Use it — Chat with it directly, call invokeAI from a Room App, or hit it as an HTTP endpoint from your own services — same model, same routing, three interfaces.
How osFoundry compares
| Capability | osFoundry | Most other tools |
|---|
| Backends | Local + cloud + self-hosted, switchable per request. | Single backend, vendor-locked. |
| Token markup | None — direct provider pricing. | 20–100% markup on hosted tokens. |
| Privacy mode | Local-only mode — no traffic ever leaves the device. | Always cloud-bound. |
| Model count | 76K open + 364 API + your self-hosted weights. | A handful of curated models. |
Use cases
- Solo developer: Run Llama 3.1 8B locally for everyday coding chat. Switch to Claude Sonnet for tough refactors. Same chat thread.
- Privacy-first team: Force all sensitive prompts to local models; allow public-info prompts to use cloud APIs. Routing rules enforce the policy.
- Heavy-volume startup: Self-host Mixtral 8x22B on a reserved A100 for 80% of traffic; burst to GPT-4o for the hard 20%.
Inference server fleet
Aggregate capacity across local machines, BYOK endpoints, and self-hosted GPUs into a single addressable pool. Maestro routes per request based on availability and configured priorities.
Frequently asked questions
Can I use osFoundry without buying any credits?
Yes. BYOK and local inference both work without any osFoundry credit purchase — you pay your own provider for cloud usage, and local inference is free.
Does osFoundry mark up cloud API tokens?
No. BYOK passes your traffic directly to your provider account. We charge only for our own cloud-hosted services (GPU endpoints, app hosting, storage).
Which providers can I BYOK to?
Anthropic, OpenAI, Google (Vertex + AI Studio), Mistral, Together, Groq, DeepSeek, Cohere, and any OpenAI-compatible endpoint. New providers are added via the connector library.
What hardware do I need to run open-weight models locally?
A consumer GPU with 16 GB VRAM runs 7–13B models well at Q4. 24 GB handles 30B models. 70B+ models need an A100/H100 80 GB or quantisation tradeoffs.
Can I switch models mid-conversation?
Yes. Each turn can use a different model. Maestro’s routing rules in osStudio let you switch automatically based on prompt content.
How is a self-hosted endpoint different from local inference?
Local inference runs on your own machine. A self-hosted endpoint runs on a dedicated GPU you provision in osFoundry cloud — reserved capacity, no rate limits, accessed over your private network.
Does osFoundry support image, audio, and video models too?
Yes. The catalog includes 76K open-weight models across chat, image, audio, video, and embedding. BYOK works for hosted image/audio providers (DALL·E, Midjourney via Replicate, ElevenLabs, etc.).
Can I run osFoundry fully offline?
Yes — install the desktop app, download a local model, and disable cloud routes. Local-first mode is a first-class workspace setting.
Pricing
Local inference: free (your hardware). BYOK: your provider’s pricing, no markup. osFoundry-hosted GPU endpoints: per-second of GPU time, see pricing for current rates.
Related features