Self-host LLMs in osFoundry — open weights, no vendor lock-in
osFoundry self-hosts any open-weight LLM (Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, GPT-OSS) with full control over weights, runtime, and routing. Run on your local hardware, on a dedicated GPU endpoint in osFoundry cloud, or on your own infrastructure. The model is registered in your workspace catalog and routable from Maestro the moment it’s loaded.
Quick answer
- Self-host any of the 76K open-weight models indexed in the catalog.
- Three hosting options: local hardware, an osFoundry cloud GPU endpoint, or your own GPU server.
- Model is workspace-routable the moment it loads.
- Full data control — weights and prompts never leave your scope.
Key capabilities
- 76K open-weight models indexed and installable in one click.
- Built-in inference server (no Ollama, no manual llama.cpp setup).
- Quantisation at install: pick Q4 for a smaller memory footprint and lower cost, FP16 for full precision.
- Hot-swap LoRA adapters on a base model — many specialised variants on one GPU.
- Workspace-wide routing — same model handle, three possible backends.
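The Q4 versus FP16 choice above mostly comes down to how much memory the weights need. A rough sketch of that arithmetic, assuming memory is simply parameters times bits per weight (the function name is illustrative and not part of osFoundry; real 4-bit formats store extra scale data, and the KV cache and activations need memory on top of this):

```python
def weight_memory_gib(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GiB.

    Assumption: every parameter is stored at exactly `bits_per_weight` bits.
    Runtime overhead (KV cache, activations) is not included.
    """
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Compare the two precisions named above for a 70B-parameter model.
for label, bits in [("FP16", 16), ("Q4", 4)]:
    print(f"70B @ {label}: ~{weight_memory_gib(70, bits):.0f} GiB")
# 70B @ FP16: ~130 GiB
# 70B @ Q4: ~33 GiB
```

Under this estimate, a 70B model at FP16 does not fit in the memory of a single 80 GB GPU, while the Q4 variant does, which is why the precision picked at install can decide which hardware is viable.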
How to do it in osFoundry
- Browse and pick a model — Open /community/models, filter to open-weight, pick the size that fits your target hardware.
- Choose where to host — Local (free, your hardware), osFoundry cloud GPU endpoint (per-second billing), or your own GPU server (free; you manage infra).
- Install — One click. The platform pulls the weights, applies the quantisation you picked, loads into the inference server.
- Use it — The model is now a routable handle in Maestro and every Room App. Switch to it per request or via osStudio routing rules.
How osFoundry compares
| Capability | osFoundry | Most other tools |
|---|---|---|
| Setup time | Minutes — one-click install. | Hours of llama.cpp / vLLM / Triton setup. |
| Hardware | Local, our cloud, or yours — interchangeable. | Pick one venue, commit. |
| Routing post-install | Automatic — model is a workspace handle. | Manual API wiring in your code. |
| Quantisation | Pick at install; switch later. | Convert weights manually with separate tooling. |
Use cases
- Privacy-sensitive industry: A healthcare, legal, or finance team self-hosts Llama 3.1 70B on an internal A100, so prompts and outputs never leave the org perimeter.
- High-volume SaaS: Run Mixtral 8x22B on a reserved H100 for 80% of traffic; burst to a cloud API for the hard 20%. Per-token cost drops by 60%.
- Researcher: Test 12 candidate base models locally before picking one for fine-tuning. Free, fast iteration without hosted API bills.
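The 60% figure in the high-volume SaaS case can be sanity-checked with a little arithmetic. The sketch below assumes traffic shares are measured in tokens and that the burst traffic costs the same per token as the hosted API's average price; both function names are hypothetical and only illustrate the calculation:

```python
def blended_saving(self_hosted_share: float, cost_ratio: float) -> float:
    """Fractional saving versus sending all traffic to the hosted API.

    cost_ratio = self-hosted per-token cost / hosted API per-token cost.
    Assumption: the remaining traffic is billed at the normal API price.
    """
    if not 0 <= self_hosted_share <= 1:
        raise ValueError("self_hosted_share must be between 0 and 1")
    blended = self_hosted_share * cost_ratio + (1 - self_hosted_share)
    return 1 - blended


def required_cost_ratio(self_hosted_share: float, target_saving: float) -> float:
    """Cost ratio needed to reach a target saving at a given traffic share."""
    if not 0 < self_hosted_share <= 1:
        raise ValueError("self_hosted_share must be in (0, 1]")
    # Solve 1 - (s * r + (1 - s)) = target for r, giving r = 1 - target / s.
    return 1 - target_saving / self_hosted_share


print(f"{required_cost_ratio(0.8, 0.6):.2f}")  # 0.25
print(f"{blended_saving(0.8, 0.25):.2f}")      # 0.60
```

Under these assumptions, a 60% drop with 80% of traffic self-hosted implies that the self-hosted per-token cost is about a quarter of the hosted price. If the hard 20% is routed to a pricier model, the real saving would be lower.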
Frequently asked questions
What models can I self-host on osFoundry?
Any of the 76K open-weight models indexed at /community/models — Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, GPT-OSS, and more.
Do I need to fine-tune to self-host?
No. Self-hosting just means running the base model under your control. Fine-tuning is optional (LoRA flow available).
Is self-hosting cheaper than BYOK to a hosted API?
For high volume, yes. A reserved A100 amortises across millions of tokens at a lower per-token cost than hosted pricing.
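Whether a reserved GPU beats hosted pricing depends on throughput and utilisation. A minimal sketch of the per-token arithmetic follows; the function name and every number in it are placeholders rather than osFoundry rates or measured throughput:

```python
def self_hosted_cost_per_million(
    gpu_cost_per_hour: float,
    tokens_per_second: float,
    utilisation: float,
) -> float:
    """Cost per one million tokens on a GPU billed by the hour.

    utilisation is the fraction of each hour the GPU spends serving
    requests (0 < utilisation <= 1). Idle time still costs money.
    """
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


# Placeholder inputs: 2.00 per GPU-hour, 1000 tokens/s aggregate throughput.
for utilisation in (1.0, 0.5, 0.1):
    cost = self_hosted_cost_per_million(2.00, 1000, utilisation)
    print(f"utilisation {utilisation:.0%}: {cost:.2f} per million tokens")
```

Comparing the result with your hosted API's per-million-token price shows where the break-even sits. At low utilisation the reserved GPU's idle hours dominate, which is why the answer above is qualified with "for high volume".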
Can I bring my own quantised weights?
Yes — upload a .safetensors or .gguf file and osFoundry registers it as a custom model.
What licences apply when I self-host?
The base model’s licence. Each model page in the catalog has a licence explainer (commercial-use / restricted / research-only).
Can the same model be hosted in two places at once?
Yes. The same model handle can have a local backend and a cloud-endpoint backend simultaneously. Routing rules decide which backend serves each request.
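To make the idea of one handle with several backends concrete, here is a hedged sketch of what such a rule could look like. It is not osFoundry's routing API: the `Backend` class, the `pick_backend` function, and the queue-depth threshold are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical description of one backend behind a model handle."""
    name: str
    healthy: bool
    queue_depth: int


def pick_backend(backends: list[Backend], max_queue: int = 8) -> Backend:
    """Pick a backend for one request.

    Backends are listed in priority order. The first healthy backend with
    a queue shorter than max_queue wins; if all are busy, fall back to the
    healthy backend with the shortest queue.
    """
    for backend in backends:
        if backend.healthy and backend.queue_depth < max_queue:
            return backend
    healthy = [b for b in backends if b.healthy]
    if not healthy:
        raise RuntimeError("no healthy backend for this model handle")
    return min(healthy, key=lambda b: b.queue_depth)


# Prefer the local backend, spill to the cloud endpoint when local is busy.
backends = [
    Backend("local", healthy=True, queue_depth=12),
    Backend("cloud-endpoint", healthy=True, queue_depth=1),
]
print(pick_backend(backends).name)  # cloud-endpoint
```

Listing the local backend first keeps requests on hardware that costs nothing extra, and the cloud endpoint only absorbs overflow.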
Pricing
Local self-hosting: free (your hardware, your electricity). osFoundry cloud GPU endpoint: billed per second of GPU time at A10 / A100 / H100 rates. Your own GPU server: no osFoundry charge; you pay your infra provider.