Hybrid RAG Search: BM25 + Embeddings [Deep Dive 2026]
Bottom Line
Hybrid RAG works best when lexical recall and semantic recall are fused instead of forced to compete. Use BM25 for exact terms, dense vectors for intent, and RRF to combine both without brittle score tuning.
Key Takeaways
- BM25 is still the best first pass for IDs, acronyms, code tokens, and rare keywords.
- all-MiniLM-L6-v2 outputs 384-dimensional embeddings, which map cleanly to Elasticsearch dense_vector.
- RRF merges BM25 and kNN rankings without manual score normalization or weighted math.
- Use k=20 and num_candidates=100 as a solid local starting point for small corpora.
Retrieval-augmented generation fails more often from weak retrieval than weak generation. A single retriever usually misses either exact-match signals or semantic intent, which is why production systems increasingly combine BM25 with dense vector search. In this tutorial, you will build a compact hybrid RAG pipeline on Elasticsearch, index 384-dimensional embeddings, fuse lexical and semantic results with RRF, and hand the final context to your LLM.
- BM25 catches literal matches that embeddings often underweight.
- Dense vectors recover meaning when users paraphrase or use different terminology.
- RRF combines both rankings without needing score normalization.
- Hybrid retrieval usually improves recall before you touch prompt engineering.
Bottom Line
If your RAG stack must handle both exact terms and natural-language paraphrases, hybrid retrieval is the safer default. Keep BM25, add vectors, and fuse rankings with RRF before spending time on model-side fixes.
| Dimension | BM25 | Vector embeddings | Edge |
|---|---|---|---|
| Exact identifiers | Strong on SKUs, class names, error codes | Can miss literal rarity | BM25 |
| Paraphrase handling | Weak when wording shifts | Strong on semantic similarity | Embeddings |
| Cold-start simplicity | Built in on text fields | Needs model + vector field | BM25 |
| Recall on messy user queries | Often brittle | Usually more forgiving | Embeddings |
| Operational cost | Lower | Higher due to embedding generation | BM25 |
| Best production pattern | Good alone for keyword search | Good alone for semantic search | Hybrid |
Why Hybrid Search Wins
When to choose each retrieval mode
Elasticsearch uses BM25 as the default similarity for text fields, while dense_vector fields support kNN retrieval for embeddings. Those two signals are useful for different failure modes, which is why combining them tends to outperform either one alone in RAG pipelines.
Choose BM25 when:
- Your corpus contains IDs, filenames, command flags, API paths, or stack traces.
- Users search with precise domain terms and expect literal matching.
- You need cheap baseline relevance with minimal indexing complexity.
Choose vector embeddings when:
- Users paraphrase heavily or ask conceptual questions.
- Your documents and queries use different vocabulary for the same idea.
- You want stronger recall across FAQs, policies, and narrative text.
Prerequisites
- A running Elasticsearch cluster that supports dense_vector fields and the RRF retriever.
- Python 3.11+.
- `pip install elasticsearch sentence-transformers`.
- A small knowledge base with short chunks of text and a stable document ID.
- A chat model you already use for final answer generation.
For the embedding model, this tutorial uses sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors. That matters because your Elasticsearch mapping must use the same dims value.
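If you want to double-check that value before writing the mapping, the model reports its own output size. A quick sketch, assuming the weights download on first use:

```python
from sentence_transformers import SentenceTransformer

# First call downloads the model weights from the Hugging Face Hub.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# This value must match "dims" in the Elasticsearch mapping below.
print(model.get_sentence_embedding_dimension())  # 384
```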
Step 1: Create the Index
Create one text field for lexical retrieval and one vector field for semantic retrieval. We will keep the mapping intentionally small.
```
PUT kb-hybrid
{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title": { "type": "text" },
      "content": { "type": "text" },
      "url": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```
Why this mapping works
- `content` is indexed for BM25 matching.
- `content_vector` is indexed for fast kNN search.
- Cosine is the natural fit for normalized sentence embeddings.
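If you prefer to create the index from Python instead of the REST console, here is a minimal sketch using the same mapping, assuming a local unsecured cluster at localhost:9200:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Same mapping as the PUT request above: one BM25 text field, one kNN vector field.
es.indices.create(
    index="kb-hybrid",
    mappings={
        "properties": {
            "doc_id": {"type": "keyword"},
            "title": {"type": "text"},
            "content": {"type": "text"},
            "url": {"type": "keyword"},
            "content_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```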
Step 2: Index Documents and Embeddings
Now generate embeddings and bulk-index both text and vectors into the same document. Keeping lexical and vector signals on one record simplifies retrieval and source attribution later.
```python
from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    {
        "doc_id": "rag-1",
        "title": "RAG overview",
        "content": "Retrieval-augmented generation adds external context before answer generation.",
        "url": "https://example.com/rag-overview"
    },
    {
        "doc_id": "rag-2",
        "title": "BM25 basics",
        "content": "BM25 is a lexical ranking algorithm that scores documents based on term frequency and inverse document frequency.",
        "url": "https://example.com/bm25-basics"
    },
    {
        "doc_id": "rag-3",
        "title": "Vector search basics",
        "content": "Dense embeddings improve recall for semantically similar queries that do not share exact wording.",
        "url": "https://example.com/vector-search-basics"
    }
]

# Embed each chunk; normalize so cosine similarity behaves the same at query time.
for doc in docs:
    doc["content_vector"] = model.encode(doc["content"], normalize_embeddings=True).tolist()

# Bulk-index text and vector on the same record to simplify retrieval and attribution.
actions = [
    {"_index": "kb-hybrid", "_id": doc["doc_id"], "_source": doc}
    for doc in docs
]
helpers.bulk(es, actions)

# Make the new documents visible to search immediately.
es.indices.refresh(index="kb-hybrid")
```
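As a quick sanity check, assuming the same `es` client, confirm all three documents are searchable:

```python
# refresh() above made the writes visible, so the count should report 3.
print(es.count(index="kb-hybrid")["count"])
```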
Chunking guidance
- Keep chunks small enough to stay topically coherent (a minimal splitter is sketched after this list).
- Store titles and URLs so the generator can cite sources.
- Normalize embeddings consistently at index and query time.
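For illustration, a minimal word-window splitter. This sketch assumes whitespace tokenization is good enough for a first pass; production chunkers usually split on sentences or headings instead:

```python
def chunk_words(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows so each chunk stays topically small."""
    words = text.split()
    step = max_words - overlap  # assumes overlap < max_words
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```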
Step 3: Run Hybrid Retrieval for RAG
This is the core move: run a standard text query and a kNN vector query in parallel, then fuse them with RRF. Elasticsearch's RRF retriever defaults rank_constant to 60; setting a smaller explicit value while testing makes the fusion easier to reason about.
```python
from textwrap import dedent

query = "How does hybrid retrieval improve RAG recall?"
query_vector = model.encode(query, normalize_embeddings=True).tolist()

# One request, two retrievers: BM25 over `content` and kNN over `content_vector`,
# fused by reciprocal rank so neither score scale dominates.
resp = es.search(
    index="kb-hybrid",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "match": {
                                "content": query
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 20,
            "rank_constant": 20
        }
    },
    size=5,
    source=["title", "content", "url"]
)

# Assemble the retrieved passages into a citable context block.
hits = resp["hits"]["hits"]
context = "\n\n".join(
    f"Title: {h['_source']['title']}\nSource: {h['_source']['url']}\nPassage: {h['_source']['content']}"
    for h in hits
)

prompt = dedent(f"""
Use the context to answer the question.
If the context is insufficient, say so explicitly.

Question:
{query}

Context:
{context}
""")
print(prompt)
```
What RRF is doing here
- The standard retriever produces a lexical ranking using BM25.
- The knn retriever produces a semantic ranking from the vector field.
- RRF merges the ranked lists without assuming both scores live on the same scale; a pure-Python sketch of the formula follows.
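Here is that sketch: a pure-Python version of the RRF formula, where each document scores sum of 1 / (rank_constant + rank) across retrievers. The doc IDs are illustrative:

```python
def rrf_fuse(rankings: list[list[str]], rank_constant: int = 20) -> list[str]:
    """Fuse ranked doc-ID lists: each list contributes 1 / (rank_constant + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["rag-1", "rag-2", "rag-3"]  # lexical order
knn_ranking = ["rag-3", "rag-1", "rag-2"]   # semantic order

# rag-1 wins: ranked 1st lexically and 2nd semantically.
print(rrf_fuse([bm25_ranking, knn_ranking]))  # ['rag-1', 'rag-3', 'rag-2']
```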
Verify, Troubleshoot, and Next Steps
Verification and expected output
Your search response should return a mixed set of hits where some results are strong lexical matches and others are semantic matches. The assembled prompt should contain 5 passages or fewer, each with a title, source URL, and chunk text.
```
Title: RAG overview
Source: https://example.com/rag-overview
Passage: Retrieval-augmented generation adds external context before answer generation.

Title: Vector search basics
Source: https://example.com/vector-search-basics
Passage: Dense embeddings improve recall for semantically similar queries that do not share exact wording.
```
That output is enough to validate the retrieval layer before you plug the prompt into your generation client.
Troubleshooting: top 3 issues
- Dimension mismatch: if Elasticsearch rejects vectors, your field `dims` does not match the embedding model output. all-MiniLM-L6-v2 uses 384.
- Weak semantic results: increase `num_candidates`, inspect chunk boundaries, and make sure query embeddings are normalized the same way as document embeddings.
- Exact terms disappear: your chunks may be too large, or your lexical query may be too loose. Test the BM25 branch by itself before blaming fusion (see the sketch below).
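To isolate each branch, run the two retrievers as standalone queries. This sketch reuses the `es` client, `query`, and `query_vector` from Step 3:

```python
# BM25 branch only: should surface exact-term matches.
lexical = es.search(
    index="kb-hybrid",
    query={"match": {"content": query}},
    size=5,
    source=["title"],
)

# kNN branch only: should surface paraphrase matches.
semantic = es.search(
    index="kb-hybrid",
    knn={
        "field": "content_vector",
        "query_vector": query_vector,
        "k": 20,
        "num_candidates": 100,
    },
    size=5,
    source=["title"],
)

print([h["_source"]["title"] for h in lexical["hits"]["hits"]])
print([h["_source"]["title"] for h in semantic["hits"]["hits"]])
```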
What's next
- Add metadata filters so hybrid retrieval respects tenant, product, or time boundaries (sketched after this list).
- Measure Recall@k and answer groundedness on a fixed evaluation set.
- Introduce reranking only after hybrid retrieval is stable.
- Store chunk IDs in answers so you can trace every generated claim back to source text.
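For the first item, a sketch of tenant filtering. The `tenant_id` keyword field is hypothetical (it is not in the mapping above); both branches apply the same filter so neither can leak cross-tenant documents into the fused ranking:

```python
# Hypothetical tenant_id keyword field, assumed added to the mapping.
tenant_filter = {"term": {"tenant_id": "acme"}}

resp = es.search(
    index="kb-hybrid",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    # Lexical branch: wrap the match in a bool filter.
                    "standard": {
                        "query": {
                            "bool": {
                                "must": {"match": {"content": query}},
                                "filter": tenant_filter,
                            }
                        }
                    }
                },
                {
                    # Semantic branch: the knn retriever accepts a filter directly.
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100,
                        "filter": tenant_filter,
                    }
                },
            ],
            "rank_window_size": 20,
            "rank_constant": 20,
        }
    },
    size=5,
)
```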
Frequently Asked Questions
How do I combine BM25 and vector search for RAG?
Run a standard BM25 text query and a kNN vector query in parallel, then fuse the two ranked lists with RRF, as shown in Step 3.

Why is BM25 still useful if I already have embeddings?
BM25 catches literal matches such as IDs, acronyms, code tokens, and rare keywords that embeddings often underweight.

What causes Elasticsearch dense_vector dimension errors?
The dense_vector mapping must use the exact number of dimensions produced by the embedding model. For sentence-transformers/all-MiniLM-L6-v2, that value is 384.

Should I average BM25 and vector scores instead of using RRF?
BM25 and vector scores live on different scales, so weighted averaging requires brittle normalization and tuning. RRF fuses the rankings directly, which is why this tutorial uses it.