What is On-device Inference?

On-device inference runs an LLM directly on the user’s own hardware, such as a laptop or phone, without any network call. osFoundry’s built-in inference server supports Apple Silicon (via Metal) and NVIDIA GPUs (via CUDA) for open-weight models.

Detail

On-device inference has three big upsides: no per-token cost, no network latency, and no data leaving the device. The limits are that model size is constrained by available VRAM (or unified memory on Apple Silicon) and that speed is bounded by the device’s compute and memory bandwidth. A 7B model runs fast on a modern Mac; a 70B model typically needs an A100-class GPU.

Quantisation to roughly 4 or 5 bits per weight (Q4, Q5) is usually what makes larger models fit into consumer VRAM, at the cost of some loss in output quality.
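
As a rough illustration of why quantisation matters, the sketch below estimates the memory needed for model weights at different precisions. It is a back-of-the-envelope calculation, not osFoundry’s own sizing logic: the function names, the 20% overhead allowance for the KV cache and runtime buffers, and the flat 4 and 5 bits per weight for Q4 and Q5 are all assumptions made for illustration. Real quantisation formats store extra scaling metadata, so actual sizes are somewhat larger.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead_fraction: float = 0.2) -> bool:
    """Check whether weights plus an assumed overhead fit in the given memory.

    overhead_fraction is a hypothetical allowance for the KV cache and
    runtime buffers; the real figure depends on context length and runtime.
    """
    needed = weight_memory_gb(params_billion, bits_per_weight) * (1 + overhead_fraction)
    return needed <= memory_gb


if __name__ == "__main__":
    # Assumed nominal bit widths; actual Q4/Q5 formats use slightly more.
    precisions = [("FP16", 16), ("Q5", 5), ("Q4", 4)]
    for name, params in [("7B", 7), ("70B", 70)]:
        for label, bits in precisions:
            size = weight_memory_gb(params, bits)
            print(f"{name} at {label}: about {size:.1f} GB of weights")
```

Under these assumptions, a 7B model shrinks from about 14 GB at FP16 to about 3.5 GB at Q4, while a 70B model still needs about 35 GB at Q4, which is beyond most consumer GPUs.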

How osFoundry approaches On-device Inference

osFoundry’s desktop app includes the inference server, and any open-weight model can be installed with one click. Capable models such as Llama 3.1 8B and Qwen 2.5 14B run smoothly on consumer hardware.

Related terms

self-hosting
quantization
local-first
no-leak-llm

Related features

local-llm-inference
self-host-llms