Hybrid RAG Search: BM25 + Embeddings [Deep Dive 2026]
Bottom Line
Hybrid RAG works best when lexical recall and semantic recall are fused instead of forced to compete. Use BM25 for exact terms, dense vectors for intent, and RRF to combine both without brittle score tuning.
Key Takeaways
- BM25 is still the best first pass for IDs, acronyms, code tokens, and rare keywords.
- all-MiniLM-L6-v2 outputs 384-dimensional embeddings, which map cleanly to Elasticsearch dense_vector.
- RRF merges BM25 and kNN rankings without manual score normalization or weighted math.
- Use k=20 and num_candidates=100 as a solid local starting point for small corpora.
Retrieval-augmented generation fails more often from weak retrieval than weak generation. A single retriever usually misses either exact-match signals or semantic intent, which is why production systems increasingly combine BM25 with dense vector search. In this tutorial, you will build a compact hybrid RAG pipeline on Elasticsearch, index 384-dimensional embeddings, fuse lexical and semantic results with RRF, and hand the final context to your LLM.
- BM25 catches literal matches that embeddings often underweight.
- Dense vectors recover meaning when users paraphrase or use different terminology.
- RRF combines both rankings without needing score normalization.
- Hybrid retrieval usually improves recall before you touch prompt engineering.
Bottom Line
If your RAG stack must handle both exact terms and natural-language paraphrases, hybrid retrieval is the safer default. Keep BM25, add vectors, and fuse rankings with RRF before spending time on model-side fixes.
| Dimension | BM25 | Vector embeddings | Edge |
|---|---|---|---|
| Exact identifiers | Strong on SKUs, class names, error codes | Can miss literal rarity | BM25 |
| Paraphrase handling | Weak when wording shifts | Strong on semantic similarity | Embeddings |
| Cold-start simplicity | Built in on text fields | Needs model + vector field | BM25 |
| Recall on messy user queries | Often brittle | Usually more forgiving | Embeddings |
| Operational cost | Lower | Higher due to embedding generation | BM25 |
| Best production pattern | Good alone for keyword search | Good alone for semantic search | Hybrid |
Why Hybrid Search Wins
When to choose each retrieval mode
Elasticsearch uses BM25 as the default similarity for text fields, while dense_vector fields support kNN retrieval for embeddings. Those two signals are useful for different failure modes, which is why combining them tends to outperform either one alone in RAG pipelines.
Choose BM25 when:
- Your corpus contains IDs, filenames, command flags, API paths, or stack traces.
- Users search with precise domain terms and expect literal matching.
- You need cheap baseline relevance with minimal indexing complexity.
Choose vector embeddings when:
- Users paraphrase heavily or ask conceptual questions.
- Your documents and queries use different vocabulary for the same idea.
- You want stronger recall across FAQs, policies, and narrative text.
Prerequisites
- A running Elasticsearch cluster that supports dense_vector fields and the RRF retriever.
- Python 3.11+.
- `pip install elasticsearch sentence-transformers`.
- A small knowledge base with short chunks of text and a stable document ID.
- A chat model you already use for final answer generation.
For the embedding model, this tutorial uses sentence-transformers/all-MiniLM-L6-v2, which produces 384-dimensional vectors. That matters because your Elasticsearch mapping must use the same dims value.
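If you want to double-check that value before writing the mapping, the model reports its own output size. A quick sketch, assuming the weights download on first use:

```python
from sentence_transformers import SentenceTransformer

# First call downloads the model weights from the Hugging Face Hub.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# This value must match "dims" in the Elasticsearch mapping below.
print(model.get_sentence_embedding_dimension())  # 384
```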
Step 1: Create the Index
Create one text field for lexical retrieval and one vector field for semantic retrieval. We will keep the mapping intentionally small.
```
PUT kb-hybrid
{
  "mappings": {
    "properties": {
      "doc_id": { "type": "keyword" },
      "title": { "type": "text" },
      "content": { "type": "text" },
      "url": { "type": "keyword" },
      "content_vector": {
        "type": "dense_vector",
        "dims": 384,
        "index": true,
        "similarity": "cosine"
      }
    }
  }
}
```
Why this mapping works
- `content` is indexed for BM25 matching.
- `content_vector` is indexed for fast kNN search.
- Cosine is the natural fit for normalized sentence embeddings.
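If you prefer to create the index from Python instead of the REST console, here is a minimal sketch using the same mapping, assuming a local unsecured cluster at localhost:9200:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Same mapping as the PUT request above: one BM25 text field, one kNN vector field.
es.indices.create(
    index="kb-hybrid",
    mappings={
        "properties": {
            "doc_id": {"type": "keyword"},
            "title": {"type": "text"},
            "content": {"type": "text"},
            "url": {"type": "keyword"},
            "content_vector": {
                "type": "dense_vector",
                "dims": 384,
                "index": True,
                "similarity": "cosine",
            },
        }
    },
)
```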
Step 2: Index Documents and Embeddings
Now generate embeddings and bulk-index both text and vectors into the same document. Keeping lexical and vector signals on one record simplifies retrieval and source attribution later.
```python
from elasticsearch import Elasticsearch, helpers
from sentence_transformers import SentenceTransformer

es = Elasticsearch("http://localhost:9200")
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    {
        "doc_id": "rag-1",
        "title": "RAG overview",
        "content": "Retrieval-augmented generation adds external context before answer generation.",
        "url": "https://example.com/rag-overview"
    },
    {
        "doc_id": "rag-2",
        "title": "BM25 basics",
        "content": "BM25 is a lexical ranking algorithm that scores documents based on term frequency and inverse document frequency.",
        "url": "https://example.com/bm25-basics"
    },
    {
        "doc_id": "rag-3",
        "title": "Vector search basics",
        "content": "Dense embeddings improve recall for semantically similar queries that do not share exact wording.",
        "url": "https://example.com/vector-search-basics"
    }
]

# Embed each chunk; normalize so cosine similarity behaves the same at query time.
for doc in docs:
    doc["content_vector"] = model.encode(doc["content"], normalize_embeddings=True).tolist()

# Bulk-index text and vector on the same record to simplify retrieval and attribution.
actions = [
    {"_index": "kb-hybrid", "_id": doc["doc_id"], "_source": doc}
    for doc in docs
]
helpers.bulk(es, actions)

# Make the new documents visible to search immediately.
es.indices.refresh(index="kb-hybrid")
```
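As a quick sanity check, assuming the same `es` client, confirm all three documents are searchable:

```python
# refresh() above made the writes visible, so the count should report 3.
print(es.count(index="kb-hybrid")["count"])
```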
Chunking guidance
- Keep chunks small enough to stay topically coherent (a minimal splitter is sketched after this list).
- Store titles and URLs so the generator can cite sources.
- Normalize embeddings consistently at index and query time.
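For illustration, a minimal word-window splitter. This sketch assumes whitespace tokenization is good enough for a first pass; production chunkers usually split on sentences or headings instead:

```python
def chunk_words(text: str, max_words: int = 120, overlap: int = 20) -> list[str]:
    """Split text into overlapping word windows so each chunk stays topically small."""
    words = text.split()
    step = max_words - overlap  # assumes overlap < max_words
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), step)]
```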
Step 3: Run Hybrid Retrieval for RAG
This is the core move: run a standard text query and a kNN vector query in parallel, then fuse them with RRF. Elasticsearch's RRF retriever defaults rank_constant to 60; setting a smaller explicit value while testing makes the fusion easier to reason about.
```python
from textwrap import dedent

query = "How does hybrid retrieval improve RAG recall?"
query_vector = model.encode(query, normalize_embeddings=True).tolist()

# One request, two retrievers: BM25 over `content` and kNN over `content_vector`,
# fused by reciprocal rank so neither score scale dominates.
resp = es.search(
    index="kb-hybrid",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    "standard": {
                        "query": {
                            "match": {
                                "content": query
                            }
                        }
                    }
                },
                {
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100
                    }
                }
            ],
            "rank_window_size": 20,
            "rank_constant": 20
        }
    },
    size=5,
    source=["title", "content", "url"]
)

# Assemble the retrieved passages into a citable context block.
hits = resp["hits"]["hits"]
context = "\n\n".join(
    f"Title: {h['_source']['title']}\nSource: {h['_source']['url']}\nPassage: {h['_source']['content']}"
    for h in hits
)

prompt = dedent(f"""
Use the context to answer the question.
If the context is insufficient, say so explicitly.

Question:
{query}

Context:
{context}
""")
print(prompt)
```
What RRF is doing here
- The standard retriever produces a lexical ranking using BM25.
- The knn retriever produces a semantic ranking from the vector field.
- RRF merges the ranked lists without assuming both scores live on the same scale; a pure-Python sketch of the formula follows.
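Here is that sketch: a pure-Python version of the RRF formula, where each document scores sum of 1 / (rank_constant + rank) across retrievers. The doc IDs are illustrative:

```python
def rrf_fuse(rankings: list[list[str]], rank_constant: int = 20) -> list[str]:
    """Fuse ranked doc-ID lists: each list contributes 1 / (rank_constant + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rank_constant + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranking = ["rag-1", "rag-2", "rag-3"]  # lexical order
knn_ranking = ["rag-3", "rag-1", "rag-2"]   # semantic order

# rag-1 wins: ranked 1st lexically and 2nd semantically.
print(rrf_fuse([bm25_ranking, knn_ranking]))  # ['rag-1', 'rag-3', 'rag-2']
```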
Verify, Troubleshoot, and Next Steps
Verification and expected output
Your search response should return a mixed set of hits where some results are strong lexical matches and others are semantic matches. The assembled prompt should contain 5 passages or fewer, each with a title, source URL, and chunk text.
```
Title: RAG overview
Source: https://example.com/rag-overview
Passage: Retrieval-augmented generation adds external context before answer generation.

Title: Vector search basics
Source: https://example.com/vector-search-basics
Passage: Dense embeddings improve recall for semantically similar queries that do not share exact wording.
```
That output is enough to validate the retrieval layer before you plug the prompt into your generation client.
Troubleshooting: top 3 issues
- Dimension mismatch: if Elasticsearch rejects vectors, your field `dims` does not match the embedding model output. all-MiniLM-L6-v2 uses 384.
- Weak semantic results: increase `num_candidates`, inspect chunk boundaries, and make sure query embeddings are normalized the same way as document embeddings.
- Exact terms disappear: your chunks may be too large, or your lexical query may be too loose. Test the BM25 branch by itself before blaming fusion (see the sketch below).
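To isolate each branch, run the two retrievers as standalone queries. This sketch reuses the `es` client, `query`, and `query_vector` from Step 3:

```python
# BM25 branch only: should surface exact-term matches.
lexical = es.search(
    index="kb-hybrid",
    query={"match": {"content": query}},
    size=5,
    source=["title"],
)

# kNN branch only: should surface paraphrase matches.
semantic = es.search(
    index="kb-hybrid",
    knn={
        "field": "content_vector",
        "query_vector": query_vector,
        "k": 20,
        "num_candidates": 100,
    },
    size=5,
    source=["title"],
)

print([h["_source"]["title"] for h in lexical["hits"]["hits"]])
print([h["_source"]["title"] for h in semantic["hits"]["hits"]])
```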
What's next
- Add metadata filters so hybrid retrieval respects tenant, product, or time boundaries (sketched after this list).
- Measure Recall@k and answer groundedness on a fixed evaluation set.
- Introduce reranking only after hybrid retrieval is stable.
- Store chunk IDs in answers so you can trace every generated claim back to source text.
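For the first item, a sketch of tenant filtering. The `tenant_id` keyword field is hypothetical (it is not in the mapping above); both branches apply the same filter so neither can leak cross-tenant documents into the fused ranking:

```python
# Hypothetical tenant_id keyword field, assumed added to the mapping.
tenant_filter = {"term": {"tenant_id": "acme"}}

resp = es.search(
    index="kb-hybrid",
    retriever={
        "rrf": {
            "retrievers": [
                {
                    # Lexical branch: wrap the match in a bool filter.
                    "standard": {
                        "query": {
                            "bool": {
                                "must": {"match": {"content": query}},
                                "filter": tenant_filter,
                            }
                        }
                    }
                },
                {
                    # Semantic branch: the knn retriever accepts a filter directly.
                    "knn": {
                        "field": "content_vector",
                        "query_vector": query_vector,
                        "k": 20,
                        "num_candidates": 100,
                        "filter": tenant_filter,
                    }
                },
            ],
            "rank_window_size": 20,
            "rank_constant": 20,
        }
    },
    size=5,
)
```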
Frequently Asked Questions
How do I combine BM25 and vector search for RAG?
Run a standard BM25 text query and a kNN vector query in parallel, then fuse the two ranked lists with RRF, as shown in Step 3.

Why is BM25 still useful if I already have embeddings?
BM25 catches literal matches such as IDs, acronyms, code tokens, and rare keywords that embeddings often underweight.

What causes Elasticsearch dense_vector dimension errors?
The dense_vector mapping must use the exact number of dimensions produced by the embedding model. For sentence-transformers/all-MiniLM-L6-v2, that value is 384.

Should I average BM25 and vector scores instead of using RRF?
BM25 and vector scores live on different scales, so weighted averaging requires brittle normalization and tuning. RRF fuses the rankings directly, which is why this tutorial uses it.