TUTORIAL · 2026-02-12

Build a Production RAG Pipeline Without LangChain (2026)

You can ship a production-grade RAG pipeline in a few hundred lines of code by composing provider SDKs, pgvector, and a reranker directly. Skip LangChain's abstractions until you have a concrete need they actually solve.

Why teams are unbundling LangChain in 2026

By 2026, removing LangChain from production has become a recognizable engineering pattern, with public postmortems from teams that once evangelized it. The complaint set is consistent: layered abstractions that obscure what is actually sent to the model, frequent breaking changes between minor versions, and debugging sessions that turn into spelunking expeditions through wrapper classes.

The underlying shift is that provider SDKs got good. The OpenAI, Anthropic, and Google SDKs now ship first-class streaming, structured outputs, tool calls, and batching. Voyage, Cohere, and Jina expose clean REST endpoints for embeddings and reranking. Postgres with pgvector handles ANN search up to roughly 10M vectors without a dedicated vector database.

For most RAG workloads, the framework you wanted in 2023 is now four function calls and a SQL query. The reasonable default in 2026 is to start with raw SDKs, add a thin pipeline abstraction when you feel real pain, and treat heavyweight frameworks as opt-in.

The five stages every RAG pipeline needs

Every production RAG system, regardless of framework, decomposes into the same five stages. Naming them explicitly makes the code easier to test and replace piecewise.

Ingest: load source documents, normalize encoding, strip boilerplate.
Chunk: split into retrieval units with stable IDs and source metadata.
Embed and index: encode chunks into a vector store, alongside a lexical index for hybrid search.
Retrieve and rerank: pull a wide candidate set, then narrow with a cross-encoder reranker.
Generate and cite: assemble a prompt with retrieved context and return answers with source attribution.

Keep each stage as a pure function with typed inputs and outputs. The retrieval stage should not know which LLM will generate; the generation stage should not know which embedding model was used. This separation is what frameworks promise and rarely deliver, because they couple stages through opaque chain objects. Writing it yourself takes an afternoon and removes a category of upgrade risk.
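One way to picture that separation is a handful of small types and plain callables. This is a minimal sketch, not a prescribed design: the names `Chunk`, `Answer`, `Retriever`, `Generator`, and `answer_query` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Chunk:
    id: str
    doc_id: str
    text: str


@dataclass(frozen=True)
class Answer:
    text: str
    cited_chunk_ids: tuple[str, ...]


# Each stage is a plain callable with typed inputs and outputs.
# (Hypothetical aliases; none of them imports or references another stage.)
Retriever = Callable[[str], list[Chunk]]          # query -> ranked chunks
Generator = Callable[[str, list[Chunk]], Answer]  # (query, context) -> answer


def answer_query(query: str, retrieve: Retriever, generate: Generator) -> Answer:
    """Compose retrieval and generation without either knowing the other."""
    return generate(query, retrieve(query))
```

Because the stages only meet inside `answer_query`, a test can pass in stub callables for either side and swap the real embedder or LLM without touching the other stage.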

Chunking strategies: semantic, recursive, and agentic

Chunking is where most RAG quality is won or lost. Three patterns dominate in 2026.

Recursive character splitting on a hierarchy of separators (paragraphs, sentences, then characters) is the baseline. It is fast, deterministic, and good enough for prose. Semantic chunking embeds candidate splits and merges adjacent chunks whose embeddings are close, producing topically coherent units at higher ingest cost. Agentic chunking asks a small LLM to propose split points for structured documents like contracts or transcripts, where headings and turn boundaries matter more than character count.

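def recursive_chunk(text, max_chars=1200, overlap=150):
    seps = ["\n\n", "\n", ". ", " "]
    def split(s, depth=0):
        if len(s) <= max_chars:
            return [s]
        if depth == len(seps):
            # No separator left: hard-cut so no piece exceeds max_chars.
            return [s[i:i + max_chars] for i in range(0, len(s), max_chars)]
        parts, pieces = [], s.split(seps[depth])
        for i, p in enumerate(pieces):
            # Re-attach the separator (dropped by str.split) to all but the last piece.
            parts.extend(split(p + seps[depth] if i < len(pieces) - 1 else p, depth + 1))
        return parts
    # Greedily pack adjacent pieces so chunks approach max_chars instead of staying tiny.
    chunks = [""]
    for piece in split(text):
        if chunks[-1] and len(chunks[-1]) + len(piece) > max_chars:
            chunks.append("")
        chunks[-1] += piece
    # Append the head of the next chunk to each chunk as trailing overlap.
    return [c + n[:overlap] for c, n in zip(chunks, chunks[1:])] + [chunks[-1]]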

Start recursive, measure, then graduate to semantic only on the document classes where evaluation shows it pays.
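If evaluation does point toward semantic chunking, the merge step can be sketched as below. This is one simple variant under stated assumptions, not a reference implementation: it compares each piece only with the piece immediately before it, the `embed` callable stands in for whatever embedding model you use, and the `threshold` of 0.8 is an arbitrary placeholder you would tune.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_merge(
    pieces: list[str],
    embed: Callable[[str], Sequence[float]],  # hypothetical: wraps your embedding model
    threshold: float = 0.8,                   # assumed value; tune on your own data
    max_chars: int = 1200,
) -> list[str]:
    """Merge each piece into the current chunk when it is similar to the previous piece."""
    if not pieces:
        return []
    merged = [pieces[0]]
    prev_vec = embed(pieces[0])
    for piece in pieces[1:]:
        vec = embed(piece)
        fits = len(merged[-1]) + 1 + len(piece) <= max_chars
        if fits and cosine(prev_vec, vec) >= threshold:
            merged[-1] = merged[-1] + " " + piece
        else:
            merged.append(piece)  # topic shift or size limit: start a new chunk
        prev_vec = vec
    return merged
```

The extra ingest cost is visible here: every candidate piece is embedded once just to decide the boundaries, before the final chunks are embedded for indexing.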

Embeddings and rerankers: Voyage, BGE, and Cohere

The 2026 picture for retrievers is clearer than it was a year ago. Voyage AI, now part of MongoDB, ships voyage-3-large as a strong general-purpose dense model and released the voyage-4 family in early 2026 with a mixture-of-experts variant aimed at the top of the RTEB leaderboard. Cohere's embed-v4 is the other production frontrunner. For open-weight self-hosting, BAAI's bge-m3 remains the default: a single model supporting dense, sparse, and multi-vector retrieval across 100-plus languages, with an 8192-token context.

For reranking, Cohere rerank-v3.5 is the workhorse: one multilingual model, 4096-token chunks, and roughly 80-150 ms p50 latency on typical payloads. Voyage rerank-2 is competitive and integrates cleanly if you already use Voyage embeddings.

The practical rule: pick one dense embedder, one reranker, and freeze the pair behind an interface. Swapping later costs a reindex, not a rewrite.
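A minimal sketch of such an interface, using structural typing so vendor adapters need no shared base class. The names `Embedder`, `Reranker`, `RetrievalModels`, and `top_documents` are hypothetical, and the `version` field is an added assumption: a tag you could store next to each vector to tell which rows need reindexing after a swap.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


class Embedder(Protocol):
    def embed(self, texts: Sequence[str]) -> list[list[float]]:
        """Return one vector per input text."""
        ...


class Reranker(Protocol):
    def rerank(self, query: str, documents: Sequence[str], top_n: int) -> list[int]:
        """Return indices into documents, best first, at most top_n of them."""
        ...


@dataclass(frozen=True)
class RetrievalModels:
    """The embedder/reranker pair that the rest of the pipeline depends on."""
    embedder: Embedder
    reranker: Reranker
    version: str  # assumption: stored with each vector to detect stale rows


def top_documents(models: RetrievalModels, query: str,
                  documents: Sequence[str], top_n: int = 8) -> list[str]:
    """Rerank documents through the interface, never through a vendor SDK directly."""
    order = models.reranker.rerank(query, documents, top_n)
    return [documents[i] for i in order]
```

A Voyage or Cohere adapter then only has to implement one method each, and tests can substitute deterministic fakes.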

Wiring retrieval, generation, and citations

With Postgres and pgvector you get hybrid search in straightforward SQL, and citations follow from the IDs you store with each chunk. Store chunks with their document ID, embedding, and a tsvector for lexical recall. Retrieve a wide candidate set, rerank, then pass the top N into the LLM with explicit source IDs the model is instructed to cite.

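```python
import os

import cohere
import psycopg
import voyageai

# Assumption: the connection string comes from DATABASE_URL; the original leaves DSN undefined.
DSN = os.environ["DATABASE_URL"]
vo, co = voyageai.Client(), cohere.Client()  # API keys are read from the environment


def retrieve(query, k=40, top_n=8):
    qvec = vo.embed([query], model="voyage-3-large").embeddings[0]
    with psycopg.connect(DSN) as conn:
        # Pull a wide candidate set by cosine distance. The embedding is sent as a
        # text literal and cast, because psycopg has no built-in vector adapter.
        rows = conn.execute(
            "SELECT id, doc_id, text FROM chunks "
            "ORDER BY embedding <=> %s::vector LIMIT %s",
            (str(qvec), k),
        ).fetchall()
    if not rows:
        return []
    # Narrow the candidates with the cross-encoder, keeping the original row tuples.
    ranked = co.rerank(model="rerank-v3.5", query=query,
                       documents=[r[2] for r in rows], top_n=top_n)
    return [rows[r.index] for r in ranked.results]
```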
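The query above ranks by dense similarity only. To bring in the tsvector side, one common option is to run a second, lexical query and merge the two ID lists with reciprocal rank fusion before reranking. The article does not name a fusion method, so treat this as a swapped-in technique; the `tsv` column name and the constant `k=60` are assumptions.

```python
from collections import defaultdict

# Hypothetical lexical query; assumes the chunks table has a tsvector column named tsv.
LEXICAL_SQL = (
    "SELECT id FROM chunks "
    "WHERE tsv @@ plainto_tsquery('english', %s) "
    "ORDER BY ts_rank(tsv, plainto_tsquery('english', %s)) DESC LIMIT %s"
)


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to an ID's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], str(cid)))
```

Fusion works on ranks rather than raw scores, so cosine distances and `ts_rank` values never need to be put on a common scale.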

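For pgvector, default to an HNSW index with m=16 and ef_construction=64, and always pair `ORDER BY embedding <=> $1` with a `LIMIT` so the planner can use the index. Add tenant or recency filters in the `WHERE` clause, but note that pgvector applies them after the approximate index scan, so a selective filter can return fewer rows than the `LIMIT`; partial indexes, partitioning, or iterative index scans (pgvector 0.8.0 and later) address that. Pass retrieved chunks to the LLM with their IDs and instruct the model to emit citation markers; resolve those markers back to source URLs in a postprocess step.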
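The citation round trip can be sketched in a few lines. This assumes a numeric `[n]` marker format and a `url_for` mapping from chunk ID to source URL; both are illustrative choices, not something the pipeline requires.

```python
import re

CITATION = re.compile(r"\[(\d+)\]")


def build_context(chunks):
    """chunks: list of (chunk_id, text). Label each chunk with a [n] marker for the prompt."""
    return "\n\n".join(f"[{i}] {text}" for i, (_, text) in enumerate(chunks, start=1))


def resolve_citations(answer, chunks, url_for):
    """Map [n] markers in the answer to source URLs, in order of first appearance.

    Markers that point outside the retrieved set are ignored, which guards
    against the model inventing a citation number.
    """
    urls = []
    for match in CITATION.finditer(answer):
        n = int(match.group(1))
        if 1 <= n <= len(chunks):
            url = url_for[chunks[n - 1][0]]
            if url not in urls:
                urls.append(url)
    return urls
```

Short numeric markers are easier for a model to reproduce exactly than long chunk IDs, and the position-to-ID mapping stays on your side of the call.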

Evaluation: hit rate, MRR, and faithfulness

A RAG pipeline without an evaluation harness is a guess. Build the harness before you tune anything. The minimal kit is a labeled query set of 100 to 500 examples covering the question types you actually expect, plus three metrics.

Hit rate at K answers whether the correct chunk made it into the candidate set, which isolates retrieval quality. Mean reciprocal rank captures how high the right chunk ranked, which is what the reranker is paid to improve. Faithfulness, scored by an LLM judge prompted to compare each generated claim against the cited chunks, captures whether the model hallucinated past its context.
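The two retrieval metrics are small enough to write by hand. A minimal sketch, assuming `retrieved` holds one ranked list of chunk IDs per query and `relevant` holds the labeled correct IDs for the same queries in the same order:

```python
def hit_rate_at_k(retrieved, relevant, k):
    """Fraction of queries with at least one relevant chunk in the top k results."""
    if not retrieved:
        return 0.0
    hits = sum(1 for got, want in zip(retrieved, relevant) if set(got[:k]) & set(want))
    return hits / len(retrieved)


def mean_reciprocal_rank(retrieved, relevant):
    """Average of 1 / rank of the first relevant chunk; a query with no hit scores 0."""
    if not retrieved:
        return 0.0
    total = 0.0
    for got, want in zip(retrieved, relevant):
        wanted = set(want)
        for rank, chunk_id in enumerate(got, start=1):
            if chunk_id in wanted:
                total += 1.0 / rank
                break
    return total / len(retrieved)
```

Computing hit rate on the pre-rerank candidates and MRR on the post-rerank order keeps the two stages' contributions separate.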

Run the harness on every change: a new chunk size, a different embedder, a prompt edit. Plot the three metrics over time in the same dashboard. When a change improves MRR but tanks faithfulness, the reranker is surfacing distractors and the generation prompt needs guardrails, not more retrieval tuning.
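A simple way to make "run on every change" enforceable is a check that compares a candidate run with a stored baseline. The sketch below assumes each run is a dict of metric name to score where higher is better; the function name and the `tolerance` of 0.01 are placeholders.

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return {metric: (baseline, candidate)} for metrics that dropped by more than tolerance.

    Raises KeyError if the candidate run is missing a metric the baseline has.
    """
    return {
        name: (baseline[name], candidate[name])
        for name in baseline
        if candidate[name] < baseline[name] - tolerance
    }
```

For example, a change that lifts MRR from 0.70 to 0.78 while faithfulness falls from 0.95 to 0.85 comes back as `{"faithfulness": (0.95, 0.85)}`, which is the trade-off described above.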

When to graduate to a managed orchestrator

Hand-rolled pipelines stay maintainable as long as one team owns them and the stage count is small. The point at which a managed orchestrator pays off is when non-engineers need to tune retrieval, when you are running many pipelines in parallel for different document classes, or when you want versioned configs and A/B routing without a redeploy.

At that point, the choice is between heavyweight frameworks like LlamaIndex or Haystack, and config-driven platforms that expose retrieval stages as declarative units. osStudio, the no-code orchestration editor in osFoundry, takes the second approach: the five stages above are first-class config objects with managed Voyage embeddings and reranking behind a single proxy, so you keep BYOK on the LLM side and avoid framework lock-in on the retrieval side.

The useful question is not framework versus no-framework. It is whether your pipeline configuration deserves to be code, config, or a UI, and that answer changes as the team grows.

Frequently asked questions

Do I actually need a vector database for RAG?

Probably not until you cross several million vectors. Postgres with pgvector handles up to roughly 10 million vectors comfortably with an HNSW index, hybrid search via tsvector, and standard SQL filters for tenancy and recency. You inherit backups, replication, and access control from the database you already operate. Dedicated vector databases like Qdrant, Milvus, or Weaviate become worth their operational cost when you need sharded billion-scale indexes, specialized filtering at very high QPS, or features like multi-vector retrieval that pgvector does not yet match. Start with pgvector and migrate only when a measured bottleneck forces it.

Is a reranker really worth the latency?

For most retrieval-heavy workloads, yes. A cross-encoder reranker like Cohere rerank-v3.5 or Voyage rerank-2 typically adds 80 to 200 milliseconds at p50 on a candidate set of 40 to 100 chunks, and it consistently lifts mean reciprocal rank by 10 to 30 percent over dense retrieval alone. The latency is hidden behind whatever LLM call follows, which usually dominates the user-visible budget. Skip the reranker only for latency-critical autocomplete-style flows under 200 milliseconds end to end, or when your candidate set is already small enough that the dense ranking is reliable.

How do I handle chunk overlap without bloating storage?

Use small overlaps of 10 to 15 percent of the chunk size, stored as character offsets into the source document rather than duplicated text. At retrieval time you can optionally fetch the neighboring chunk for context expansion. This keeps the index lean and avoids paying for the same tokens twice during embedding. If you need richer context for generation, a parent-document pattern works well: index small chunks for retrieval precision, then resolve hits back to their parent section before passing to the LLM. The parent lookup is a single indexed query and adds negligible latency.

What is the simplest way to evaluate RAG quality?

Build a labeled set of 100 to 200 representative queries with the chunk IDs that should be retrieved. Compute hit rate at K and mean reciprocal rank from the retrieval stage alone, which isolates retriever quality from generation noise. For faithfulness, use an LLM judge prompted to compare each generated claim against the cited chunks and score grounded versus unsupported. Run the three numbers on every meaningful change and treat regressions as blockers. This three-metric setup catches the majority of RAG regressions in practice and avoids the trap of optimizing one number while silently hurting another.
