What is On-device Inference?
On-device inference runs an LLM directly on the user’s hardware (laptop, phone) without any network call. osFoundry’s built-in inference server supports Apple Silicon (Metal) and NVIDIA (CUDA) for open-weight models.
Detail
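On-device inference has three main advantages: no per-token cost, no network round-trip latency, and no prompt data leaving the device. Its limits are set by the hardware: model size is constrained by available VRAM (or unified memory on Apple Silicon), and generation speed by the device's compute and memory bandwidth. A 7B model runs fast on a modern Mac; a 70B model typically needs an A100-class GPU.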
Quantisation (for example Q4 or Q5, which store roughly 4 or 5 bits per weight instead of 16) is essential for fitting larger models into consumer VRAM.
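The effect of quantisation on memory can be estimated with simple arithmetic: parameters times bits per weight, divided by 8 to get bytes. The sketch below is a rough back-of-the-envelope estimate, not part of osFoundry. The function names and the 1.2 overhead factor (an allowance for the KV cache and activations) are assumptions chosen for illustration, and real quantised formats usually use slightly more than their nominal bit width.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         memory_gb: float, overhead: float = 1.2) -> bool:
    """Rough check: do the weights plus an assumed overhead fit in memory_gb?

    The overhead factor is a hypothetical allowance for the KV cache and
    activations; actual usage depends on context length and runtime.
    """
    return weight_memory_gb(params_billion, bits_per_weight) * overhead <= memory_gb


if __name__ == "__main__":
    for name, params in [("7B", 7), ("14B", 14), ("70B", 70)]:
        fp16 = weight_memory_gb(params, 16)
        q4 = weight_memory_gb(params, 4)
        print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at nominal 4-bit")
```

Under these assumptions a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at nominal 4-bit, while a 70B model still needs about 35 GB for weights alone at 4-bit, which is why it stays out of reach of most consumer devices.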
How osFoundry approaches On-device Inference
osFoundry’s desktop app includes the inference server and offers one-click install for any open-weight model. Capable models such as Llama 3.1 8B and Qwen 2.5 14B run smoothly on consumer hardware.