What is On-device Inference?

On-device inference runs an LLM directly on the user's own hardware (laptop, phone) without any network call. osFoundry's built-in inference server supports Apple Silicon (Metal) and NVIDIA (CUDA) for open-weight models.

Detail

On-device inference has three major upsides: zero per-token cost, zero network latency, and zero data leakage, since prompts and outputs never leave the device. The limits: model size is constrained by VRAM (or by unified memory on Apple Silicon), and speed is limited by the device. A 7B model runs fast on a modern Mac; a 70B model needs an A100-class GPU.

Quantisation (Q4, Q5) is essential for fitting larger models into consumer VRAM.
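The arithmetic behind these limits can be sketched roughly as follows. This is a back-of-the-envelope estimate, not how osFoundry sizes models: the function names `weight_memory_gb` and `fits` and the `headroom` factor are hypothetical, Q4 and Q5 are treated as exactly 4 and 5 bits per weight (real quantised formats usually carry some extra bits for scales), and only the weights are counted, so the KV cache and runtime overhead come on top.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits(params_billion: float, bits_per_weight: float,
         vram_gb: float, headroom: float = 0.8) -> bool:
    """Rough check: do the weights fit in a fraction of the available memory?

    `headroom` is an assumed safety margin that leaves space for the
    KV cache, activations, and whatever else is using the device.
    """
    return weight_memory_gb(params_billion, bits_per_weight) <= vram_gb * headroom


if __name__ == "__main__":
    # Compare 16-bit weights against Q5 and Q4 for three model sizes.
    for size in (7, 14, 70):
        for label, bits in (("FP16", 16), ("Q5", 5), ("Q4", 4)):
            gb = weight_memory_gb(size, bits)
            print(f"{size}B @ {label}: ~{gb:.1f} GB of weights")
```

Under these assumptions a 7B model drops from about 14 GB at 16-bit to about 3.5 GB at Q4, which is why quantisation decides whether a model fits on a consumer device at all.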

How osFoundry approaches On-device Inference

osFoundry's desktop app includes the inference server, with one-click install for any open-weight model. Quality models such as Llama 3.1 8B and Qwen 2.5 14B run smoothly on consumer hardware.