By Hisako Aoyagi, Localization Engineer · GUIDE
Picking an Embedding Model for Multilingual RAG (CJK + Latin)
_Most teams pick an embedding model by glancing at the English MTEB leaderboard and shipping it — then watch retrieval quality collapse the moment a Japanese or Chinese document enters the corpus. osFoundry runs retrieval across English, Japanese, and Chinese in the same workspace, and the failure modes are subtle: tokenizer mismatches, dimension trade-offs, false neighbors across scripts. This piece walks through what actually works in production, with named models — voyage-3, bge-m3, mxbai-embed-large — and the testing methodology that catches problems before users do._
The naive English-only mistake
The default path looks like this: you read "OpenAI text-embedding-3-small is great," you plug it in, English retrieval feels fine, you ship. Six weeks later a Japanese-speaking customer files a bug — searches in Japanese return mostly English results, even when the corpus has Japanese matches.
The failure has two roots:
**Root 1: training data skew.** Most embedding models are trained on a corpus that's 70-90% English, and cross-lingual alignment is an afterthought. The vectors for "machine learning" (English) and "機械学習" (Japanese) might sit at a cosine distance of 0.3, while the vectors for two unrelated Japanese sentences sit only 0.1 apart, because the model has compressed all Japanese text into a narrow region of the embedding space. Similarity scores then stop separating relevant Japanese documents from irrelevant ones.
**Root 2: tokenizer pathology.** BPE tokenizers built primarily on English have few CJK entries in their vocabulary, so they fall back to byte-level pieces. Each kanji or kana takes three bytes in UTF-8, so a 20-character Japanese sentence might consume 60+ tokens. You hit context limits faster, embedding quality degrades, and cost-per-document can rise about 3×.
Neither problem shows up on the English MTEB leaderboard. You need different evals.
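Both roots can be checked cheaply before committing to a model. The sketch below is a minimal, dependency-free illustration: `utf8_bytes_per_char` gives the worst-case tokens per character if a tokenizer falls back to raw bytes, and `mean_pairwise_cosine` flags a collapsed region when unrelated sentences in one language score close to 1.0. Both helper names are hypothetical, and the vectors are assumed to come from whichever model you are evaluating.

```python
import math
from itertools import combinations

def utf8_bytes_per_char(text: str) -> float:
    """Average UTF-8 bytes per character.

    Under pure byte fallback this is the worst-case tokens per character.
    """
    return len(text.encode("utf-8")) / len(text)

def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_pairwise_cosine(vectors) -> float:
    """Mean cosine similarity over all distinct pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

print(utf8_bytes_per_char("machine learning"))  # 1.0
print(utf8_bytes_per_char("機械学習"))            # 3.0
```

Running `mean_pairwise_cosine` on embeddings of a few dozen unrelated Japanese sentences, and again on unrelated English ones, gives a rough comparison: a much higher Japanese mean is the signature of Root 1.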
Tokenizer-aware MTEB
MTEB (Massive Text Embedding Benchmark) is the standard — but the original MTEB is 90% English tasks. For multilingual work, the relevant benchmarks are:
- **MTEB Multilingual**: covers 50+ languages, with retrieval, classification, and clustering tasks. Models like BGE-M3, multilingual-e5, and voyage-multilingual-2 publish scores here.
- **JMTEB**: Japanese-specific, 16 datasets across several task types. Catches models that fake multilingual support but tank on real Japanese retrieval.
- **C-MTEB**: Chinese-specific, similar shape to JMTEB.
- **MIRACL**: retrieval across 18 languages, including Japanese and Chinese, with queries and documents in the same language. It is the public benchmark closest to real multilingual RAG, but it does not measure cross-lingual retrieval (query in language A, doc in language B); that case needs your own test set.
Before picking a model, look up its **MIRACL score for each language you serve**. A model with a 70+ English MTEB score but a MIRACL Japanese score of 35 will fail your users. The gap between those numbers is the real risk.
Shared vs per-language embeddings
Architecture choice: one embedding model for all languages, or one per language?
**Shared model.** Single index, single model, all documents in one vector space. Pros: simpler ops, queries can match across languages, no language-detection step. Cons: cross-lingual quality is bounded by the model's worst language.
**Per-language model.** Separate index per language, language-detection at query time, route to the right index. Pros: each index can use the best-in-class model for that language (e.g., Japanese-specific GLuCoSE for JA, voyage-3 for EN). Cons: complex ops, no cross-lingual matching, language detection adds latency and is wrong ~3% of the time on short queries.
The right answer is almost always **shared**, with one caveat: if your corpus is dominated (>80%) by one non-English language and English is the minority, a per-language setup may be worth it. Otherwise, pick a strong multilingual model like BGE-M3 or voyage-multilingual-2 and accept that each individual language won't be quite as good as a language-specific specialist.
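Part of why per-language routing is fragile shows up in even the simplest detector. The sketch below is a deliberately naive script-based heuristic (a hypothetical `guess_language` helper, not a real language-identification library): a query containing kana is Japanese, but a short query made only of Han characters could be Japanese or Chinese, so the router has no safe index to pick.

```python
def guess_language(query: str) -> str:
    """Naive script-based guess; illustration only, not a production detector."""
    # Hiragana (U+3040-U+309F) and Katakana (U+30A0-U+30FF) appear in Japanese, not Chinese.
    has_kana = any("\u3040" <= ch <= "\u30ff" for ch in query)
    # CJK Unified Ideographs are shared by Japanese kanji and Chinese hanzi.
    has_han = any("\u4e00" <= ch <= "\u9fff" for ch in query)
    if has_kana:
        return "ja"
    if has_han:
        return "ja-or-zh"  # ambiguous: script alone cannot decide
    return "en"  # assumption: everything else is treated as English

for q in ["機械学習とは", "機械学習", "machine learning"]:
    print(q, "->", guess_language(q))
```

A shared multilingual index sidesteps the ambiguous middle case entirely, because the query never has to be routed.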
Dimension trade-offs
Modern embedding models offer dimensions from 256 to 4096. The trade-off:
- **256-384 dim:** fastest, cheapest storage (~1-1.5KB per vector as float32), good for high-recall first-stage retrieval. Quality drop of ~5-10% vs full dimension on hard queries.
- **768-1024 dim:** sweet spot for most production systems. ~3-4KB per vector. Quality close to full dimension at a fraction of the storage cost.
- **1536-4096 dim:** highest quality, ~6-16KB per vector. Storage and search latency become the constraint at scale.
For a corpus of 10 million chunks:
- 384-dim @ 1.5KB = 15 GB
- 1024-dim @ 4KB = 40 GB
- 3072-dim @ 12KB = 120 GB
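These totals come from one multiplication, assuming float32 values (4 bytes each) and counting raw vectors only, with no index or metadata overhead. A small sketch, using a hypothetical helper name, makes it easy to rerun for your own corpus size or for quantized storage. The results land slightly above the rounded figures in the list because they use exact byte counts and decimal gigabytes.

```python
def raw_vector_storage_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in decimal GB (1 GB = 1e9 bytes), excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

for dim in (384, 1024, 3072):
    print(f"{dim}-dim: {raw_vector_storage_gb(10_000_000, dim):.1f} GB")
# 384-dim: 15.4 GB
# 1024-dim: 41.0 GB
# 3072-dim: 122.9 GB
```

Passing `bytes_per_value=1` models int8 quantization, which cuts each figure to a quarter.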
Matryoshka-trained models (voyage-3, mxbai-embed-large-v1) pack most of the signal into the leading dimensions, so a vector can be cut short and still be useful. Store the 1024-dim version, search a re-normalized 384-dim prefix for first-stage recall, then rescore the shortlist with the full vector. This is the pattern we ship by default.
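As a rough illustration of that two-stage pattern, here is a dependency-free sketch. It uses brute-force search over toy 4-dim vectors; in a real system the truncated prefixes would be precomputed and held in an ANN index. All function names are hypothetical.

```python
import math

def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def truncate(vec, dim):
    """Keep the leading `dim` values, then re-normalize the shorter vector."""
    return normalize(vec[:dim])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def two_stage_search(query, docs, short_dim, shortlist_size, top_k):
    """Return indices of the top_k docs: coarse pass on prefixes, rescoring on full vectors."""
    q_short, q_full = truncate(query, short_dim), normalize(query)
    # Stage 1: cheap recall using only the leading dimensions.
    shortlist = sorted(
        range(len(docs)),
        key=lambda i: dot(q_short, truncate(docs[i], short_dim)),
        reverse=True,
    )[:shortlist_size]
    # Stage 2: rescore the shortlist with the full-dimension vectors.
    ranked = sorted(shortlist, key=lambda i: dot(q_full, normalize(docs[i])), reverse=True)
    return ranked[:top_k]

# Toy 4-dim vectors standing in for 1024-dim embeddings.
docs = [[1.0, 0.0, 1.0, 0.0], [1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
print(two_stage_search([1.0, 0.0, 1.0, 0.0], docs, short_dim=2, shortlist_size=2, top_k=1))  # [0]
```

In the toy data the first two documents tie on the 2-dim prefix, and only the full vectors separate them, which is the job of the second stage.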
Reranking is what saves you
Here's the secret that turns mediocre multilingual retrieval into good retrieval: **a cross-encoder reranker covers a lot of sins**.
First-stage retrieval (bi-encoder embedding similarity) gives you top-100 candidates. A cross-encoder reranker — which jointly encodes query+document and produces a relevance score — reorders those 100 to give you the top-10 you actually return.
For multilingual RAG, the rerankers that matter:
- **bge-reranker-v2-m3** — open-weight, strong on Chinese and Japanese, free to self-host. Adds ~50ms for 100 candidates.
- **voyage-rerank-2** — hosted, BYOK-compatible, strong cross-lingual performance.
- **Cohere rerank-multilingual-v3** — hosted, mature, 100+ languages.
Why it works: the cross-encoder attends over the query and the document together, so it can resolve ambiguity the bi-encoder missed. An English query that should find a Japanese document is exactly the case where this helps most: the bi-encoder's cross-lingual alignment is only approximate, while the reranker compares the two texts token by token.
Reranking adds ~50-200ms of latency. Almost always worth it.
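The shape of the rerank step is simple enough to sketch without committing to any vendor API. In the example below, `rerank` takes an injectable `score_fn`; in production that function would wrap a cross-encoder such as one of the rerankers listed above. The `toy_overlap_score` scorer is a placeholder invented for this illustration so the code runs on its own, and it is not a substitute for a real model.

```python
from typing import Callable, List, Sequence

def rerank(query: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float], top_k: int = 10) -> List[str]:
    """Score every (query, document) pair jointly and keep the best top_k."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]

def toy_overlap_score(query: str, doc: str) -> float:
    """Placeholder scorer: Jaccard overlap of character sets. NOT a cross-encoder."""
    q, d = set(query), set(doc)
    return len(q & d) / len(q | d) if q | d else 0.0

candidates = ["天気予報", "machine learning intro", "機械学習の入門"]
print(rerank("機械学習", candidates, toy_overlap_score, top_k=1))  # ['機械学習の入門']
```

Note that the toy scorer cannot match the English candidate to the Japanese query at all; that cross-lingual match is precisely what a multilingual cross-encoder adds.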
Practical recommendations
If I were starting a new multilingual RAG system today, here's what I'd pick by tier:
**Tier A — managed, BYOK-friendly, highest quality:**
- Embedder: `voyage-3-large` or `voyage-multilingual-2` (1024-dim, strong MIRACL scores).
- Reranker: Voyage `rerank-2` or Cohere `rerank-multilingual-v3.0`.
- Cost: ~$0.18 per million input tokens for embedding, ~$2 per 1k searches for reranking.
**Tier B — open-weight, self-hosted, no per-call cost:**
- Embedder: `BAAI/bge-m3` (1024-dim, multilingual, dense + sparse + ColBERT modes in one model).
- Reranker: `BAAI/bge-reranker-v2-m3`.
- Cost: GPU time only. A single A10G handles ~1000 embeddings/sec.
**Tier C — local, on-device, no GPU:**
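- Embedder: `mxbai-embed-large-v1` (1024-dim, Matryoshka, ~340M params, runs on CPU at ~50/sec). It was trained primarily on English, so measure its CJK recall on your own eval set before committing.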
- Reranker: skip, or use `jinaai/jina-reranker-v2-base-multilingual` (~280M params).
For osFoundry's managed embedding proxy, we default to voyage-3 + bge-reranker-v2-m3. The combination scores within 2 points of the best published numbers on JMTEB and MIRACL while staying cheap enough for chunking at scale.
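To turn the Tier A prices into a budget, the arithmetic can be sketched as below. The corpus size, the 500 tokens per chunk, and the 50,000 searches are illustrative assumptions rather than measurements, and the helper names are hypothetical; substitute current list prices before relying on the output.

```python
def embedding_index_cost_usd(num_chunks: int, tokens_per_chunk: int,
                             usd_per_million_tokens: float) -> float:
    """One-time cost to embed the whole corpus."""
    return num_chunks * tokens_per_chunk / 1_000_000 * usd_per_million_tokens

def rerank_cost_usd(num_searches: int, usd_per_thousand_searches: float) -> float:
    """Cost of reranking, billed per search."""
    return num_searches / 1_000 * usd_per_thousand_searches

# 10 million chunks of 500 tokens at $0.18 per million tokens.
print(round(embedding_index_cost_usd(10_000_000, 500, 0.18), 2))  # 900.0
# 50,000 reranked searches at $2 per 1,000 searches.
print(round(rerank_cost_usd(50_000, 2.0), 2))                     # 100.0
```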
Testing on your own data
Public benchmarks are necessary but insufficient. Your corpus has domain vocabulary, your queries have a specific shape, your users speak particular languages. Build a small evaluation set:
1. **Collect 50-100 real queries** from your users (or write them yourself based on the corpus). Make sure the language mix matches your traffic — if you're 60% EN / 30% JA / 10% ZH, your eval should be too.
2. **For each query, hand-label the top 3-5 "ideal" documents** from your corpus. This is the painful step. Two hours of labeling is the price of admission.
3. **Run each candidate model and measure NDCG@10 or Recall@10.** Compare against your current setup.
4. **Don't trust a model that wins by less than 3 points.** On an eval set this small, a gap that size is within noise and can flip on a different sample of queries.
The teams that skip step 2 and rely entirely on MTEB scores are the ones that ship surprises to production. Two hours of labeling now saves a week of "why is retrieval bad?" debugging later.
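Steps 3 and 4 fit in a few lines of dependency-free code. The sketch below assumes binary relevance (a document is either in the hand-labeled ideal set or not) and unique document IDs in each ranking. The paired bootstrap at the end is one common way to judge whether a small gap is noise; it is offered as an option here, not as the method used to produce any number in this guide. All function names are hypothetical.

```python
import math
import random

def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Share of the labeled relevant docs that appear in the top k."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary gains: rewards relevant docs more when ranked higher."""
    relevant = set(relevant_ids)
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k]) if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg else 0.0

def paired_bootstrap_win_rate(scores_a, scores_b, n_resamples=1000, seed=0):
    """Fraction of resampled query sets on which model B's total beats model A's.

    scores_a[i] and scores_b[i] must be the two models' scores on the same query.
    """
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        if sum(scores_b[i] for i in idx) > sum(scores_a[i] for i in idx):
            wins += 1
    return wins / n_resamples

# One query: the model returned d3, d1, d9; the labeled ideal docs are d1 and d2.
print(recall_at_k(["d3", "d1", "d9"], ["d1", "d2"]))  # 0.5
```

Compute the per-query metric for both models, then feed the two score lists to `paired_bootstrap_win_rate`. A win rate near 0.5 means the models are indistinguishable on this eval set; reporting the metrics separately per language also shows whether a win comes only from English queries.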
Frequently asked questions
- What is the best embedding model for multilingual RAG?
- For multilingual RAG covering English plus CJK languages, the strongest options are voyage-multilingual-2 (managed, 1024-dim, top MIRACL scores) and BAAI/bge-m3 (open-weight, self-hostable, dense+sparse+ColBERT in one model). For local on-device use without a GPU, mxbai-embed-large-v1 with Matryoshka truncation is the practical pick. Always pair with a multilingual reranker like bge-reranker-v2-m3.
- Why do English-only embedding models fail on Japanese?
- English-trained embedding models fail on Japanese for two reasons: training data skew compresses all Japanese text into a narrow region of the embedding space, and BPE tokenizers built on English mangle CJK characters into long byte sequences. A 20-character Japanese sentence may consume 60+ tokens, hitting context limits and degrading quality. Use multilingual-trained models like BGE-M3 or voyage-multilingual-2 instead.
- Should I use one embedding model or one per language?
- Use one shared multilingual model for almost all cases. A per-language setup adds language-detection latency, removes cross-lingual matching, and requires complex routing — only worth it when one non-English language dominates over 80% of the corpus. A strong shared model like BGE-M3 or voyage-multilingual-2 is the right default.
- What embedding dimension should I use?
- 768-1024 dimensions is the production sweet spot for most multilingual RAG systems: close to full-dimension quality at a fraction of the storage cost. Matryoshka-trained models like voyage-3 and mxbai-embed-large let you store the full 1024-dim vector, search a 384-dim prefix for fast first-stage retrieval, then rescore with the full vector. Budget about 4KB per vector at 1024 dim (float32).
- Does reranking improve multilingual retrieval?
- Reranking dramatically improves multilingual retrieval. A cross-encoder reranker like bge-reranker-v2-m3, voyage-rerank-2, or Cohere rerank-multilingual-v3 jointly encodes the query and each candidate document, resolving cross-lingual ambiguity the bi-encoder missed. Expect 50-200ms of added latency for the top-100-to-top-10 rerank step. Almost always worth it for multilingual corpora.
- What benchmarks should I check for multilingual embeddings?
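- Check MIRACL for retrieval quality in each language you serve; it covers 18 languages, including Japanese and Chinese, with queries and documents in the same language, and is the closest public test to real multilingual RAG. Also check MTEB Multilingual for general coverage and the language-specific benchmarks JMTEB (Japanese) and C-MTEB (Chinese). MIRACL does not cover cross-lingual retrieval (query in language A, document in language B), so test that case on your own data. A model with a high English MTEB score but a low MIRACL score will fail multilingual users.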
- How much does multilingual embedding cost at scale?
- Managed multilingual embedding via voyage-3 or Cohere costs around $0.18 per million input tokens, plus roughly $2 per 1,000 reranking searches. For a corpus of 10 million 500-token chunks, that's about $900 for the initial index, with ongoing query cost in the cents range. Self-hosted bge-m3 on a single A10G GPU handles ~1,000 embeddings per second at GPU-time-only cost.