Self-host LLMs in osFoundry — open weights, no vendor lock-in

osFoundry self-hosts any open-weight LLM (Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, GPT-OSS) with full control over weights, runtime, and routing. Run on your local hardware, on a dedicated GPU endpoint in osFoundry cloud, or on your own infrastructure. The model is registered in your workspace catalog and routable from Maestro the moment it’s loaded.

Quick answer

Self-host any of the 76K open-weight models indexed in the catalog.
Three hosting options: local hardware, an osFoundry cloud GPU endpoint, or your own GPU server.
Model is workspace-routable the moment it loads.
Full data control — weights and prompts never leave your scope.

Key capabilities

76K open-weight models indexed and installable in one click.
Built-in inference server (no Ollama, no manual llama.cpp setup).
Quantisation at install: pick Q4 for a smaller, cheaper memory footprint or FP16 for full precision.
Hot-swap LoRA adapters on a base model — many specialised variants on one GPU.
Workspace-wide routing — same model handle, three possible backends.

How to do it in osFoundry

Browse and pick a model — Open /community/models, filter to open-weight, pick the size that fits your target hardware.
Choose where to host — Local (free, your hardware), osFoundry cloud GPU endpoint (per-second billing), or your own GPU server (free; you manage infra).
Install — One click. The platform pulls the weights, applies the quantisation you picked, loads into the inference server.
Use it — The model is now a routable handle in Maestro and every Room App. Switch to it per request or via osStudio routing rules.
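A quick way to judge "the size that fits your target hardware" is to estimate the memory the weights alone need at a given precision. The sketch below is illustrative arithmetic, not an osFoundry API: the function names and the 20% headroom for KV cache and activations are assumptions, and real 4-bit formats usually store slightly more than 4 bits per weight because of scaling metadata.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes).

    One billion parameters at 8 bits per weight is 1 GB, so the estimate
    is simply params_billions * bits_per_weight / 8.
    """
    if params_billions <= 0 or bits_per_weight <= 0:
        raise ValueError("params_billions and bits_per_weight must be positive")
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float,
         gpu_memory_gb: float, headroom: float = 0.2) -> bool:
    """True if the weights plus a headroom fraction fit in GPU memory.

    headroom is an assumed allowance for KV cache and activations;
    the right value depends on context length and batch size.
    """
    needed = weight_memory_gb(params_billions, bits_per_weight) * (1 + headroom)
    return needed <= gpu_memory_gb
```

Under these assumptions, a 70B model needs about 140 GB at FP16 and about 35 GB at 4 bits, so `fits(70, 4, 80)` holds for an 80 GB card while `fits(70, 16, 80)` does not.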

How osFoundry compares

Capability	osFoundry	Most other tools
Setup time	Minutes — one-click install.	Hours of llama.cpp / vLLM / Triton setup.
Hardware	Local, our cloud, or yours — interchangeable.	Pick one venue, commit.
Routing post-install	Automatic — model is a workspace handle.	Manual API wiring in your code.
Quantisation	Pick at install; switch later.	Convert weights manually with separate tooling.

Use cases

Privacy-sensitive industry: A healthcare, legal, or finance team self-hosts a quantised Llama 3.1 70B on an internal A100, so prompts and outputs never leave the org perimeter.
High-volume SaaS: Run Mixtral 8x22B on a reserved H100 for 80% of traffic; burst to a cloud API for the hard 20%. Per-token cost drops by 60%.
Researcher: Test 12 candidate base models locally before picking one for fine-tuning. Free, fast iteration without hosted API bills.
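The traffic split in the High-volume SaaS use case is a weighted average, which can be sketched as follows. The function names and the prices in the usage note are hypothetical and are not osFoundry figures; the actual saving depends entirely on your own self-hosted and hosted API costs.

```python
def blended_cost(self_hosted_share: float,
                 self_hosted_cost: float,
                 api_cost: float) -> float:
    """Average cost per token (or per million tokens) for split traffic.

    self_hosted_share is the fraction of traffic served by the
    self-hosted model; the remainder goes to the hosted API.
    Both costs must use the same unit.
    """
    if not 0.0 <= self_hosted_share <= 1.0:
        raise ValueError("self_hosted_share must be between 0 and 1")
    return (self_hosted_share * self_hosted_cost
            + (1.0 - self_hosted_share) * api_cost)


def saving_vs_api(self_hosted_share: float,
                  self_hosted_cost: float,
                  api_cost: float) -> float:
    """Fractional saving compared with sending all traffic to the API."""
    if api_cost <= 0:
        raise ValueError("api_cost must be positive")
    return 1.0 - blended_cost(self_hosted_share, self_hosted_cost, api_cost) / api_cost
```

For example, if the self-hosted cost were a quarter of the API price (2.5 versus 10.0 per million tokens, both made-up numbers), an 80/20 split would give a blended cost of roughly 4.0 and a saving of roughly 60%.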

Frequently asked questions

What models can I self-host on osFoundry?

Any of the 76K open-weight models indexed at /community/models — Llama, Qwen, Mistral, Mixtral, DeepSeek, Phi, GPT-OSS, and more.

Do I need to fine-tune to self-host?

No. Self-hosting just means running the base model under your control. Fine-tuning is optional (LoRA flow available).

Is self-hosting cheaper than BYOK to a hosted API?

For high volume, yes. The fixed cost of a reserved A100 is amortised across millions of tokens, which brings the per-token cost below hosted API pricing.
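Whether the volume is "high enough" can be checked with a rough break-even calculation. This is a sketch under stated assumptions: the hourly GPU price, the sustained throughput, and the hosted API price are all inputs you supply, and none of the names below come from osFoundry.

```python
def cost_per_million_tokens(hourly_cost: float,
                            tokens_per_second: float,
                            utilisation: float) -> float:
    """Effective cost per million tokens on a reserved GPU.

    utilisation is the fraction of each hour the GPU spends serving
    tokens at the given throughput (0 < utilisation <= 1).
    """
    if tokens_per_second <= 0 or not 0.0 < utilisation <= 1.0:
        raise ValueError("throughput must be positive and utilisation in (0, 1]")
    tokens_per_hour = tokens_per_second * 3600 * utilisation
    return hourly_cost / tokens_per_hour * 1_000_000


def break_even_utilisation(hourly_cost: float,
                           tokens_per_second: float,
                           api_cost_per_million: float) -> float:
    """Utilisation at which the reserved GPU matches the hosted API price.

    A result above 1.0 means the GPU cannot beat the API price even
    when fully busy at this throughput.
    """
    if tokens_per_second <= 0 or api_cost_per_million <= 0:
        raise ValueError("throughput and API price must be positive")
    return hourly_cost * 1_000_000 / (tokens_per_second * 3600 * api_cost_per_million)
```

With made-up inputs of 3.6 per hour and 1,000 tokens per second, the GPU costs about 2.0 per million tokens at 50% utilisation, and it breaks even with a hosted price of 4.0 per million tokens at about 25% utilisation.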

Can I bring my own quantised weights?

Yes — upload a .safetensors or .gguf file and osFoundry registers it as a custom model.
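Before uploading, it can be worth confirming that a file really is in the format its extension claims. The sketch below relies only on the published file layouts: GGUF files begin with the four bytes `GGUF`, and safetensors files begin with an 8-byte little-endian header length followed by a JSON header. The function name is hypothetical, and whether osFoundry performs a similar check on upload is not stated here.

```python
import struct

GGUF_MAGIC = b"GGUF"


def sniff_weight_format(head: bytes) -> str:
    """Guess the weight format from the first bytes of a file.

    Returns "gguf", "safetensors", or "unknown". This is a cheap
    sanity check on the header only, not a full validation.
    """
    # GGUF: the file starts with the ASCII magic "GGUF".
    if head[:4] == GGUF_MAGIC:
        return "gguf"
    # safetensors: unsigned 64-bit little-endian header size,
    # then a JSON object, so byte 8 should be an opening brace.
    if len(head) >= 9:
        (header_len,) = struct.unpack("<Q", head[:8])
        if header_len > 0 and head[8:9] == b"{":
            return "safetensors"
    return "unknown"


def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as f:
        return sniff_weight_format(f.read(16))
```

Calling `sniff_file` on a local path before upload catches the common case of a mislabelled or truncated download.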

What licences apply when I self-host?

The base model’s licence. Each model page in the catalog has a licence explainer (commercial-use / restricted / research-only).

Can the same model be hosted in two places at once?

Yes. The same model handle can have a local backend and a cloud-endpoint backend simultaneously. Routing rules decide which backend serves each request.
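To make the idea of one handle with two backends concrete, here is a minimal sketch of a routing decision. It is not osStudio's rule syntax: the `Backend` and `Request` types, the `sensitive` flag, and the prompt-length threshold are all hypothetical, chosen only to illustrate how a rule might pick a backend per request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backend:
    name: str            # e.g. "local" or "cloud-endpoint"
    healthy: bool = True


@dataclass(frozen=True)
class Request:
    prompt: str
    sensitive: bool = False  # hypothetical flag: must stay on local hardware


def choose_backend(request: Request, local: Backend, cloud: Backend,
                   max_local_chars: int = 4000) -> Backend:
    """Pick a backend for one request under three illustrative rules."""
    # Rule 1: sensitive requests never leave the local backend.
    if request.sensitive:
        if not local.healthy:
            raise RuntimeError("sensitive request but local backend is down")
        return local
    # Rule 2: short prompts go local when the local backend is available.
    if local.healthy and len(request.prompt) <= max_local_chars:
        return local
    # Rule 3: everything else goes to the cloud endpoint, falling back
    # to local if the cloud endpoint is unavailable.
    if cloud.healthy:
        return cloud
    if local.healthy:
        return local
    raise RuntimeError("no healthy backend for this model handle")
```

The caller only ever names the model handle; which of the two backends answers is decided per request by rules of this kind.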

Pricing

Local self-hosting: free (your hardware, your electricity). osFoundry cloud GPU endpoint: billed per second of GPU time at A10 / A100 / H100 rates. Your own GPU server: no osFoundry charge; you pay your infrastructure provider.