What is Mixture of Experts?

Abbreviation: MoE

Mixture of Experts (MoE) is an LLM architecture in which only a subset of the model's parameters is activated for each token, giving the capacity of a large model at roughly the per-token compute cost of a smaller one. Mixtral 8x7B and 8x22B are popular MoE models in osFoundry’s catalog.

Detail

Standard (dense) transformers activate every parameter for every token. MoE models instead route each token through a learned gating layer that picks a small subset of "expert" sub-networks. Mixtral 8x22B, for example, has about 141B total parameters (less than 8 × 22B = 176B, because the experts share the attention layers), but only ~39B are active per token. Inference compute is therefore comparable to a ~40B dense model, while quality is closer to that of a much larger one.
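The routing step can be illustrated with a minimal sketch of top-k gating, assuming a softmax over the selected gate scores. The names `route_token` and `moe_layer`, the toy experts, and the hard-coded `gate_logits` are all hypothetical; in a real model the gate scores come from a learned linear layer applied to the token's hidden state, and the experts are feed-forward networks.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(gate_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    top = ranked[:k]
    # Normalise only over the selected experts so the weights sum to 1.
    weights = softmax([gate_logits[i] for i in top])
    return list(zip(top, weights))


def moe_layer(token, gate_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_token(gate_logits, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out


# Toy setup: 8 "experts", each just scales its input by a different constant.
experts = [lambda x, s=s: [s * v for v in x] for s in range(1, 9)]
token = [1.0, -2.0]
gate_logits = [0.1, 2.0, -1.0, 0.3, 1.0, 0.0, -0.5, 0.2]  # stand-in for a learned gate

routes = route_token(gate_logits, k=2)
output = moe_layer(token, gate_logits, experts, k=2)
print(routes)
print(output)
```

Only the two selected experts run for this token; the other six contribute no compute, which is where the per-token savings come from.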

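MoE models are memory-hungry to host, because every expert must stay loaded even though only a few run for any given token, but they are cheap to run per token. That makes them a good fit for GPU endpoints with enough VRAM to hold the full set of weights.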
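A rough back-of-the-envelope calculation makes the trade-off concrete. It assumes Mistral AI's published figures for Mixtral 8x22B (about 141B total and about 39B active parameters) and counts weight storage only; the KV cache, activations, and runtime overhead are ignored, so real VRAM needs will be higher. The helper `weight_memory_gb` is a hypothetical name used for illustration.

```python
def weight_memory_gb(params_billion, bytes_per_param):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    # params_billion * 1e9 params * bytes_per_param / 1e9 bytes per GB
    return params_billion * bytes_per_param


TOTAL_B = 141   # total parameters in billions (assumed published figure)
ACTIVE_B = 39   # parameters active per token in billions (assumed published figure)

for label, bytes_per_param in [("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    resident = weight_memory_gb(TOTAL_B, bytes_per_param)
    touched = weight_memory_gb(ACTIVE_B, bytes_per_param)
    print(f"{label}: ~{resident:.1f} GB of weights resident, "
          f"~{touched:.1f} GB of weights used per token")
```

The resident figure determines how much VRAM the endpoint needs, while the per-token figure is what drives compute and memory bandwidth for each generated token.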

Related terms

parameters
self-hosting

Related features

self-host-llms
gpu-endpoint