Hybrid RAG Search: BM25 + Embeddings [Deep Dive 2026]

AI Engineering

Dillip Chowdary · Tech Entrepreneur & Innovator · May 03, 2026 · 9 min read

Bottom Line

Hybrid RAG works best when lexical recall and semantic recall are fused instead of forced to compete. Use BM25 for exact terms, dense vectors for intent, and RRF to combine both without brittle score tuning.

Key Takeaways

  • BM25 is still the best first pass for IDs, acronyms, code tokens, and rare keywords.
  • all-MiniLM-L6-v2 outputs 384-dimensional embeddings, which map cleanly to Elasticsearch dense_vector.
  • RRF merges BM25 and kNN rankings without manual score normalization or weighted math.
  • Use k=20 and num_candidates=100 as a solid local starting point for small corpora.

Retrieval-augmented generation fails more often from weak retrieval than weak generation. A single retriever usually misses either exact-match signals or semantic intent, which is why production systems increasingly combine BM25 with dense vector search. In this tutorial, you will build a compact hybrid RAG pipeline on Elasticsearch, index 384-dimensional embeddings, fuse lexical and semantic results with RRF, and hand the final context to your LLM.

  • BM25 catches literal matches that embeddings often underweight.
  • Dense vectors recover meaning when users paraphrase or use different terminology.
  • RRF combines both rankings without needing score normalization.
  • Hybrid retrieval usually improves recall before you touch prompt engineering.

Bottom Line

If your RAG stack must handle both exact terms and natural-language paraphrases, hybrid retrieval is the safer default. Keep BM25, add vectors, and fuse rankings with RRF before spending time on model-side fixes.

BM25 vs. vector embeddings at a glance

  • Exact identifiers: BM25 is strong on SKUs, class names, and error codes; embeddings can miss literal rarity. Edge: BM25.
  • Paraphrase handling: BM25 is weak when wording shifts; embeddings are strong on semantic similarity. Edge: embeddings.
  • Cold-start simplicity: BM25 is built in on text fields; embeddings need a model plus a vector field. Edge: BM25.
  • Recall on messy user queries: BM25 is often brittle; embeddings are usually more forgiving. Edge: embeddings.
  • Operational cost: BM25 is lower; embeddings cost more due to embedding generation. Edge: BM25.
  • Best production pattern: each works alone for keyword or semantic search respectively. Edge: hybrid.

Why Hybrid Search Wins

When to choose each retrieval mode

Elasticsearch uses BM25 as the default similarity for text fields, while dense_vector fields support kNN retrieval for embeddings. Those two signals are useful for different failure modes, which is why combining them tends to outperform either one alone in RAG pipelines.

Choose BM25 when:

  • Your corpus contains IDs, filenames, command flags, API paths, or stack traces.
  • Users search with precise domain terms and expect literal matching.
  • You need cheap baseline relevance with minimal indexing complexity.

Choose vector embeddings when:

  • Users paraphrase heavily or ask conceptual questions.
  • Your documents and queries use different vocabulary for the same idea.
  • You want stronger recall across FAQs, policies, and narrative text.

Pro tip: Before embedding internal docs, scrub secrets and PII. A quick pass through the Data Masking Tool helps you avoid indexing sensitive raw text into your retrieval layer.

Prerequisites

  • A running Elasticsearch cluster that supports dense_vector fields and the RRF retriever.
  • Python 3.11+.
  • pip install elasticsearch sentence-transformers.
  • A small knowledge base with short chunks of text and a stable document ID.
  • A chat model you already use for final answer generation.

For the embedding model, this tutorial uses sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors. That matters because your Elasticsearch mapping must use the same dims value.
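A small guard can catch dimension mismatches before any indexing happens. This is a sketch, not part of the official setup: `get_sentence_embedding_dimension()` is the standard sentence-transformers accessor for a model's output size, and `check_embedding_dims` is a helper name introduced here.

```python
# Sketch: fail fast if the embedding model's output size does not match the
# dims value you plan to use in the dense_vector mapping. Works with any
# object exposing get_sentence_embedding_dimension(), as SentenceTransformer
# models do.

EXPECTED_DIMS = 384  # must equal "dims" in the Elasticsearch mapping

def check_embedding_dims(model, expected=EXPECTED_DIMS):
    """Raise early instead of letting Elasticsearch reject vectors later."""
    dims = model.get_sentence_embedding_dimension()
    if dims != expected:
        raise ValueError(f"model outputs {dims} dims, mapping expects {expected}")
    return dims
```

Run it once at startup, right after loading the model, so a swapped-in model with a different output size fails loudly instead of poisoning the index.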

Step 1: Create the Index

Create one text field for lexical retrieval and one vector field for semantic retrieval. We will keep the mapping intentionally small.

PUT kb-hybrid
{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title":  { "type": "text" },
      "content": { "type": "text" },
      "url": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}

Why this mapping works

  • content is indexed for BM25 matching.
  • content_vector is indexed for fast kNN search.
  • cosine is the natural fit for normalized sentence embeddings.
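The same mapping can be created from Python instead of the REST console. This is a minimal sketch assuming an elasticsearch-py 8.x client; `KB_MAPPINGS` and `create_kb_index` are helper names introduced here, while the field definitions mirror the DSL above exactly.

```python
# Sketch: the mapping above, expressed as a dict for the Python client.

KB_MAPPINGS = {
    "properties": {
        "doc_id": {"type": "keyword"},
        "title": {"type": "text"},
        "content": {"type": "text"},
        "url": {"type": "keyword"},
        "content_vector": {
            "type": "dense_vector",
            "dims": 384,  # must match the embedding model's output size
            "index": True,
            "similarity": "cosine",
        },
    }
}

def create_kb_index(es, index_name="kb-hybrid"):
    """Create the hybrid index if it does not already exist."""
    if not es.indices.exists(index=index_name):
        es.indices.create(index=index_name, mappings=KB_MAPPINGS)
```

Keeping the mapping in one dict makes it easy to assert `dims` against the model's output size in tests or CI.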

Step 2: Index Documents and Embeddings

Now generate embeddings and bulk-index both text and vectors into the same document. Keeping lexical and vector signals on one record simplifies retrieval and source attribution later.

from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    {
        "doc_id": "rag-1",
        "title": "RAG overview",
        "content": "Retrieval-augmented generation adds external context before answer generation.",
        "url": "https://example.com/rag-overview"
    },
    {
        "doc_id": "rag-2",
        "title": "BM25 basics",
        "content": "BM25 is a lexical ranking algorithm that scores documents based on term frequency and inverse document frequency.",
        "url": "https://example.com/bm25-basics"
    },
    {
        "doc_id": "rag-3",
        "title": "Vector search basics",
        "content": "Dense embeddings improve recall for semantically similar queries that do not share exact wording.",
        "url": "https://example.com/vector-search-basics"
    }
]

for doc in docs:
    doc["content_vector"] = model.encode(doc["content"], normalize_embeddings=True).tolist()

actions = [
    {"_index": "kb-hybrid", "_id": doc["doc_id"], "_source": doc}
    for doc in docs
]

helpers.bulk(es, actions)
es.indices.refresh(index="kb-hybrid")

Chunking guidance

  • Keep chunks small enough to stay topically coherent.
  • Store titles and URLs so the generator can cite sources.
  • Normalize embeddings consistently at index and query time.
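A chunker that follows the guidance above can be sketched as a fixed-size word window with overlap. The sizes here (120-word chunks, 20-word overlap) are illustrative defaults rather than tuned values; adjust them against your own corpus.

```python
# Sketch: overlapping word-window chunking so each chunk stays topically
# small while no sentence fragment is stranded at a hard boundary.

def chunk_text(text, chunk_words=120, overlap_words=20):
    """Split text into overlapping word windows."""
    step = chunk_words - overlap_words
    if step <= 0:
        raise ValueError("chunk_words must exceed overlap_words")
    words = text.split()
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        if window:
            chunks.append(" ".join(window))
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each chunk produced this way gets its own `doc_id`, title, and URL before embedding, so the generator can still cite sources at the chunk level.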

Step 3: Run Hybrid Retrieval for RAG

This is the core move: run a standard text query and a kNN vector query in parallel, then fuse them with RRF. Elasticsearch's rank_constant defaults to 60; setting a smaller explicit value while testing makes the fused ranking easier to reason about.

from textwrap import dedent

query = "How does hybrid retrieval improve RAG recall?"
query_vector = model.encode(query, normalize_embeddings=True).tolist()

resp = es.search(
    index="kb-hybrid",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "match": {
                                "content": query
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 20,
            "rank_constant": 20
        }
    },
    size=5,
    source=["title", "content", "url"]
)

hits = resp["hits"]["hits"]
context = "\n\n".join(
    f"Title: {h['_source']['title']}\nSource: {h['_source']['url']}\nPassage: {h['_source']['content']}"
    for h in hits
)

prompt = dedent(f"""
Use the context to answer the question.
If the context is insufficient, say so explicitly.

Question:
{query}

Context:
{context}
""")

print(prompt)

What RRF is doing here

  • The standard retriever produces a lexical ranking using BM25.
  • The knn retriever produces a semantic ranking from the vector field.
  • RRF merges the ranked lists without assuming both scores live on the same scale.

Watch out: Do not average raw BM25 and vector scores unless you have a strong reason and measured calibration. Their score distributions are different; rank fusion is usually safer to start with.
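As a mental model, RRF can be sketched in a few lines of plain Python: each document's fused score is the sum of 1 / (rank_constant + rank) over every ranked list it appears in, with 1-based ranks. The rankings below are illustrative, and `rrf_fuse` is a helper name introduced here, not an Elasticsearch API.

```python
# Sketch: reciprocal rank fusion over two ranked lists of document IDs.
# rank_constant=20 mirrors the explicit value used in the query above
# (Elasticsearch's own default is 60).

def rrf_fuse(rankings, rank_constant=20):
    """score(d) = sum over lists of 1 / (rank_constant + rank(d)), rank 1-based."""
    scores = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["rag-2", "rag-1", "rag-3"]  # illustrative lexical order
knn_ranking = ["rag-1", "rag-3", "rag-2"]   # illustrative semantic order
fused = rrf_fuse([bm25_ranking, knn_ranking])
```

Notice that only ranks matter: a document that is merely second in both lists can beat one that is first in a single list, which is exactly the robustness you want when the two scorers live on different scales.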

Verify, Troubleshoot, and Next Steps

Verification and expected output

Your search response should return a mixed set of hits where some results are strong lexical matches and others are semantic matches. The assembled prompt should contain 5 passages or fewer, each with a title, source URL, and chunk text.

Title: RAG overview
Source: https://example.com/rag-overview
Passage: Retrieval-augmented generation adds external context before answer generation.

Title: Vector search basics
Source: https://example.com/vector-search-basics
Passage: Dense embeddings improve recall for semantically similar queries that do not share exact wording.

That output is enough to validate the retrieval layer before you plug the prompt into your generation client. If you want to quickly clean or standardize snippets before testing prompt assembly, a pass through TechBytes' Code Formatter can help when your sample payloads get noisy.

Troubleshooting: top 3 issues

  1. Dimension mismatch: if Elasticsearch rejects vectors, your field dims does not match the embedding model output. all-MiniLM-L6-v2 uses 384.
  2. Weak semantic results: increase num_candidates, inspect chunk boundaries, and make sure query embeddings are normalized the same way as document embeddings.
  3. Exact terms disappear: your chunks may be too large, or your lexical query may be too loose. Test the BM25 branch by itself before blaming fusion.
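Per item 3, each branch can be tested in isolation before blaming fusion. These helpers are a sketch assuming an elasticsearch-py 8.x client; the index and field names match the tutorial's mapping, and the function names are our own.

```python
# Sketch: run the lexical and semantic branches separately to debug fusion.

def bm25_only(es, query, index="kb-hybrid", size=5):
    """Lexical branch alone: a plain BM25 match query on the content field."""
    return es.search(index=index, query={"match": {"content": query}}, size=size)

def knn_only(es, query_vector, index="kb-hybrid", size=5):
    """Semantic branch alone: kNN over the dense_vector field."""
    return es.search(
        index=index,
        knn={
            "field": "content_vector",
            "query_vector": query_vector,
            "k": 20,
            "num_candidates": 100,
        },
        size=size,
    )
```

If the BM25 branch alone already misses an exact term, the problem is chunking or analysis, not RRF; if only the fused result misses it, revisit rank_window_size.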

What's next

  • Add metadata filters so hybrid retrieval respects tenant, product, or time boundaries.
  • Measure Recall@k and answer groundedness on a fixed evaluation set.
  • Introduce reranking only after hybrid retrieval is stable.
  • Store chunk IDs in answers so you can trace every generated claim back to source text.
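The first item above, metadata filtering, can be sketched by adding the same filter clause to both branches of the retriever. The `tenant` keyword field is hypothetical here (it is not in the tutorial's mapping), and `hybrid_retriever` is a helper name introduced for illustration.

```python
# Sketch: the hybrid retriever from Step 3 with a tenant filter applied to
# both the BM25 branch (bool filter) and the kNN branch (knn filter).
# The "tenant" field is hypothetical; add it to your mapping first.

def hybrid_retriever(query, query_vector, tenant):
    tenant_filter = {"term": {"tenant": tenant}}
    return {
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "bool": {
                                "must": [{"match": {"content": query}}],
                                "filter": [tenant_filter],
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100,
                        "filter": tenant_filter,
                    }
                },
            ],
            "rank_window_size": 20,
            "rank_constant": 20,
        }
    }
```

Filtering inside each branch, rather than post-filtering fused hits, keeps k and rank_window_size meaningful because every candidate already belongs to the right tenant.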

Frequently Asked Questions

How do I combine BM25 and vector search for RAG?
Use one lexical query and one vector query, then fuse the ranked results with RRF. This avoids manual score normalization and is usually the fastest way to improve recall in a RAG pipeline.
Why is BM25 still useful if I already have embeddings?
Embeddings are good at semantic similarity, but they can underperform on exact identifiers like error codes, product names, API routes, and acronyms. BM25 remains the stronger signal for literal matching, which is common in engineering corpora.
What causes Elasticsearch dense_vector dimension errors?
Your dense_vector mapping must use the exact number of dimensions produced by the embedding model. For sentence-transformers/all-MiniLM-L6-v2, that value is 384.
Should I average BM25 and vector scores instead of using RRF?
Usually no, at least not first. Raw lexical and vector scores are not naturally calibrated to the same scale, so naive averaging can create unstable rankings; RRF is more robust for an initial production setup.
