How to Build a Hybrid RAG + Fine-Tuning Pipeline (Step-by-Step Guide 2025)

AcadKits · March 24, 2026 · 13 min read

Most teams pick one and regret it. RAG gives your model fresh knowledge. Fine-tuning gives it the right behavior. The best production LLM systems use both — here is exactly how to build one from scratch.


In this article

  1. What "hybrid" actually means
  2. When does a hybrid pipeline make sense?
  3. The architecture explained
  4. Step 1 — Build the RAG layer first
  5. Step 2 — Identify your fine-tuning gaps
  6. Step 3 — Fine-tune with QLoRA
  7. Step 4 — Wire everything together
  8. Cost and performance in real numbers
  9. 3 mistakes that will break your pipeline
  10. Frequently asked questions

Most teams treat RAG and fine-tuning as an either/or decision. They pick one, ship it, and spend the next three months wondering why the model still gets things wrong. The truth is they solve completely different problems. RAG controls what the model knows at the moment you ask it a question. Fine-tuning controls how it behaves when it answers. Run one without the other and you are leaving a lot of quality on the table.

This guide walks you through building a hybrid pipeline that combines both — with real Python code you can copy, a cost breakdown that makes the business case clear, and an honest look at where teams go wrong so you can skip those mistakes entirely.

What "hybrid" actually means

A hybrid RAG + fine-tuning pipeline is not some exotic, over-engineered architecture. It is just two proven systems running together, each doing what it does best.

Think of it like hiring a new employee. Fine-tuning is the onboarding — you train them on your processes, your tone, your output format, your style guide. RAG is giving them access to the internal wiki so they can look up current information before responding. You need both. An employee who knows how to behave but has no access to current information will confidently give you outdated answers. An employee who has all the documents but was never trained on how to communicate will give you technically correct responses that miss the mark in every other way.

The hybrid pipeline gets you an LLM that knows the right things and responds the right way. That combination is what separates demo-quality AI from production-quality AI.

Key insight

Fine-tuning teaches a model how to think and respond. RAG teaches it what to think about right now. You need both layers for a system that is accurate, current, and well-behaved.

When does a hybrid pipeline make sense?

Not every project needs this level of complexity. A hybrid pipeline earns its cost when you are dealing with at least two of the following conditions at the same time:

  • Your data changes regularly — product docs, internal wikis, legal updates, pricing tables. RAG keeps answers current without retraining.
  • You need a specific output format or tone — JSON schemas, brand voice, medical templates, code that follows your internal style guide. Fine-tuning locks this in reliably.
  • Mistakes are costly — legal, medical, financial, or compliance use cases where a confidently wrong answer causes real damage.
  • You have labeled examples — at least 500 input/output pairs showing exactly how the model should respond. Without these, fine-tuning adds cost without adding value.

If you are still in the prototype phase, start with RAG only. Get it into production, log real queries for two weeks, and then look at where the model fails consistently. Fine-tune specifically to fix those patterns — nothing broader. A focused dataset of 800 targeted examples beats a general dataset of 10,000 almost every time.

The architecture explained

Before writing code, it helps to see how the components connect at inference time.

[Figure: Inference-time flow, what happens per query. The fine-tuning pipeline runs separately on a schedule and never blocks inference.]

The key difference from a standard RAG setup is the generation step: instead of sending retrieved chunks to a generic base model, you are sending them to a model that has already been trained to handle your specific domain, format, and tone. The base model might write a technically correct answer; your fine-tuned model writes the right answer in exactly the format your users and downstream systems expect.
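The per-query flow can be sketched as one plain function. This is a minimal sketch rather than the pipeline built later in this guide: `search`, `rerank`, and `generate` are hypothetical callables standing in for the vector store, the reranker, and the fine-tuned model, and the prompt layout is an assumption.

```python
from typing import Callable, List


def answer_query(
    query: str,
    search: Callable[[str, int], List[str]],             # hypothetical vector search
    rerank: Callable[[str, List[str], int], List[str]],  # hypothetical reranker
    generate: Callable[[str], str],                      # hypothetical fine-tuned model
    fetch_k: int = 20,
    top_n: int = 4,
) -> dict:
    """Retrieve broadly, rerank narrowly, then generate from the kept chunks."""
    candidates = search(query, fetch_k)
    kept = rerank(query, candidates, top_n)
    prompt = "Context: " + "\n".join(kept) + f"\n\nQuestion: {query}"
    return {"answer": generate(prompt), "context": kept}


# Toy stand-ins so the sketch runs without any external service.
DOCS = ["Keys are reset under Settings.", "Pricing has three tiers.", "Support is 24/7."]


def _words(text):
    return set(text.lower().replace("?", "").replace(".", "").split())


def toy_search(query, k):
    return DOCS[:k]


def toy_rerank(query, chunks, n):
    # Crude relevance proxy: number of words shared with the query.
    q = _words(query)
    return sorted(chunks, key=lambda c: len(q & _words(c)), reverse=True)[:n]


def toy_generate(prompt):
    return f"(model output for a {len(prompt)}-character prompt)"


result = answer_query("How many pricing tiers?", toy_search, toy_rerank, toy_generate, top_n=1)
print(result["context"])
```

Swapping the toy functions for a real retriever and model leaves `answer_query` unchanged, which is the point of keeping the stages separate.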

Step 1 — Build the RAG layer first

Resist the urge to build everything at once. Get retrieval working and deployed before you touch fine-tuning. This gives you a baseline, surfaces real user queries, and generates the logs you will later use to build your fine-tuning dataset.

Index your documents

The chunking strategy matters more than most teams realize. A chunk size of 512 tokens with 64-token overlap is a strong starting point for most document types — large enough to preserve context, small enough to stay precise.
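To make the overlap concrete, here is a minimal sketch of a sliding window over a token list. It is illustrative only: integers stand in for tokens, and `window_chunks` is a hypothetical helper, not a LangChain function.

```python
def window_chunks(tokens, chunk_size=512, overlap=64):
    """Split a token list into windows that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    stride = chunk_size - overlap  # 448 new tokens per chunk with the defaults
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


tokens = list(range(1000))  # stand-in for a 1,000-token document
chunks = window_chunks(tokens)
print([len(c) for c in chunks])
```

Each chunk repeats the last 64 tokens of the one before it, so a sentence that straddles a boundary still appears whole in at least one chunk.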

rag_setup.py

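import os

from langchain_community.document_loaders import DirectoryLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

EMBEDDING_MODEL = "BAAI/bge-large-en-v1.5"

# Load every Markdown file under ./docs.
# DirectoryLoader's default loader needs the `unstructured` package.
loader = DirectoryLoader("./docs", glob="**/*.md")
documents = loader.load()

# chunk_size counts characters by default. Measuring length with the embedding
# model's tokenizer makes 512 / 64 mean tokens, as described above.
splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    AutoTokenizer.from_pretrained(EMBEDDING_MODEL),
    chunk_size=512,
    chunk_overlap=64,
    separators=["\n\n", "\n", ".", " "],
)
chunks = splitter.split_documents(documents)

# Normalized embeddings make dot-product and cosine similarity equivalent.
embeddings = HuggingFaceEmbeddings(
    model_name=EMBEDDING_MODEL,
    encode_kwargs={"normalize_embeddings": True},
)

# Needs PINECONE_API_KEY in the environment and an existing index whose
# dimension matches the embedding model (1024 for bge-large-en-v1.5).
vectorstore = PineconeVectorStore.from_documents(
    chunks, embeddings,
    index_name=os.environ["PINECONE_INDEX"],
)
print(f"Indexed {len(chunks)} chunks.")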


Pro tip

Use BAAI/bge-large-en-v1.5 instead of OpenAI's embedding API. It performs comparably on most retrieval benchmarks and eliminates a per-query API cost that compounds fast at scale.

Add a reranker to filter noise

The reranker is the single highest-leverage upgrade you can make to a basic RAG pipeline. Vector similarity search is fast but blunt — it finds chunks that are semantically related to the query, not chunks that actually answer it. Without a reranker, irrelevant paragraphs regularly make it into the context window and cause hallucinations.

reranker.py

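from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# `vectorstore` is the PineconeVectorStore created in rag_setup.py.
# Stage 1: over-fetch 20 candidates with fast vector search.
base_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 20}
)

# Stage 2: the reranker scores each candidate against the query and keeps
# the best 4. CohereRerank reads COHERE_API_KEY from the environment.
reranker = CohereRerank(
    model="rerank-english-v3.0",
    top_n=4
)

retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=base_retriever
)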

Step 2 — Identify your fine-tuning gaps

Run RAG with a base model for two weeks and log every query and response. Then go through those logs and ask one question: what is the model consistently getting wrong that better retrieval would not fix?

You will typically find patterns like these:

  • Wrong tone or voice — the model writes corporate-neutral when your product needs something warmer
  • Inconsistent formatting — tables, lists, and code blocks look different from response to response
  • Unnecessary hedging — the model adds disclaimers that hurt the user experience
  • Schema non-compliance — JSON output does not reliably match your required structure
  • Domain reasoning gaps — incorrect conclusions even when the right context is provided

These are behavior problems, not retrieval problems. Fine-tuning solves behavior problems. Document every failure category and make sure every category has representative examples in your training set before you start training.
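One way to keep this audit honest is to tag each reviewed log entry with a failure category and count the results. The sketch below assumes a hypothetical log format (a list of dicts with a `failure` field filled in during manual review); the category names are illustrative labels for the patterns listed above.

```python
from collections import Counter

# Hypothetical reviewed log entries; `failure` is None when the answer was fine.
reviewed_logs = [
    {"query": "Reset API key?", "failure": None},
    {"query": "List plans as JSON", "failure": "schema"},
    {"query": "Compare tiers", "failure": "formatting"},
    {"query": "Export as JSON", "failure": "schema"},
    {"query": "Refund policy?", "failure": "retrieval"},  # retrieval miss: fix in the RAG layer
]

RETRIEVAL_FAILURES = {"retrieval"}  # not fixable by fine-tuning


def fine_tuning_gaps(logs):
    """Count behavior failures per category, ignoring successes and retrieval misses."""
    return Counter(
        entry["failure"]
        for entry in logs
        if entry["failure"] and entry["failure"] not in RETRIEVAL_FAILURES
    )


gaps = fine_tuning_gaps(reviewed_logs)
for category, count in gaps.most_common():
    print(f"{category}: {count}")
```

The counts show which categories need the most training examples, and the excluded retrieval misses go back to the Step 1 backlog instead.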

Step 3 — Fine-tune with QLoRA

QLoRA makes fine-tuning accessible to teams without a GPU cluster. It loads the base model with its weights frozen and quantized to 4 bits, then trains small low-rank adapter layers on top. The result is quality close to full fine-tuning at a fraction of the GPU memory, which is what lets a 7B model train on a single GPU. You can run this on a single rented A100 for around $2 per hour.

Format your training data

Every training example must mirror real inference conditions. If your pipeline passes 4 reranked chunks to the model at inference time (top_n=4 in Step 1), your training inputs need to include the same number of chunks in the same layout. Training on clean Q&A pairs without context teaches a completely different behavior than what the model will face in production. The example below is abbreviated to two chunks for readability.

training_data.jsonl

{
  "instruction": "You are a technical support assistant. Answer using the provided context only. Be concise and direct. No disclaimers.",
  "input": "Context: [retrieved chunk 1]\n[retrieved chunk 2]\n\nQuestion: How do I reset my API key?",
  "output": "Go to Settings → API → Regenerate Key. Your old key is invalidated immediately — update your app config before regenerating to avoid downtime."
}
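A small helper can build these records from logged queries so that training inputs use the same layout as inference. This is a sketch under assumptions: `build_example`, `write_jsonl`, and the output file name are hypothetical, and the chunk strings are placeholders. A `.jsonl` file holds one JSON object per line, so each record is serialized without pretty-printing.

```python
import json
import os
import tempfile

SYSTEM_PROMPT = (
    "You are a technical support assistant. Answer using the provided "
    "context only. Be concise and direct. No disclaimers."
)


def build_example(question, chunks, ideal_answer):
    """Return one training record whose input mirrors the inference-time prompt."""
    context = "\n".join(chunks)
    return {
        "instruction": SYSTEM_PROMPT,
        "input": f"Context: {context}\n\nQuestion: {question}",
        "output": ideal_answer,
    }


def write_jsonl(records, path):
    """Write one JSON object per line (the JSON Lines layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


example = build_example(
    "How do I reset my API key?",
    ["[retrieved chunk 1]", "[retrieved chunk 2]"],
    "Go to Settings → API → Regenerate Key.",
)
# Hypothetical file name, written to the temp directory for this demo.
path = os.path.join(tempfile.gettempdir(), "training_data_sketch.jsonl")
write_jsonl([example], path)
```

Building records from the logged retrieval results, rather than writing contexts by hand, keeps the training distribution close to what the model sees in production.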

Run QLoRA training

finetune_qlora.py


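import torch
from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

# 4-bit NF4 quantization with double quantization; compute runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)
# PEFT's preparation step for training on top of a quantized model.
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS for padding

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.CAUSAL_LM
)

model = get_peft_model(model, lora_config)


def to_text(example):
    # SFTTrainer trains on a single text field, so flatten the three columns.
    # The layout is an assumption; keep it identical to the inference prompt.
    return {
        "text": f"{example['instruction']}\n\n{example['input']}\n\nAnswer: {example['output']}"
    }


dataset = load_dataset("json", data_files="training_data.jsonl", split="train")
dataset = dataset.map(to_text, remove_columns=dataset.column_names)

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(
        output_dir="./fine-tuned-mistral",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch size of 16 per device
        learning_rate=2e-4,
        bf16=True,
        dataset_text_field="text",
    ),
    train_dataset=dataset,
    tokenizer=tokenizer,  # newer TRL releases name this argument processing_class
)
trainer.train()
trainer.save_model()  # writes the adapter to output_dir, where Step 4 loads it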


Keep r=16 and limit training to 3 epochs on your first run. Higher rank and more epochs on a small dataset lead to overfitting: the model memorizes your examples rather than learning the underlying patterns.
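To see why `r` matters, it helps to count the adapter's trainable parameters. Each adapted linear layer with input width d_in and output width d_out gains two low-rank matrices holding r × (d_in + d_out) weights in total. The sketch below assumes the published Mistral 7B dimensions (hidden size 4096, 32 layers, key/value projection width 1024 because of grouped-query attention); treat them as assumptions to verify against the model's config.

```python
# Assumed Mistral 7B dimensions; verify against the model's config.json.
HIDDEN = 4096
KV_DIM = 1024      # 8 key/value heads x 128 head dimension
NUM_LAYERS = 32

# (d_in, d_out) of each projection targeted by the LoRA config.
TARGETS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
}


def lora_trainable_params(r):
    """LoRA adds A (r x d_in) and B (d_out x r) to every adapted layer."""
    per_layer = sum(r * (d_in + d_out) for d_in, d_out in TARGETS.values())
    return per_layer * NUM_LAYERS


for rank in (8, 16, 64):
    print(f"r={rank}: {lora_trainable_params(rank):,} trainable parameters")
```

Under these assumptions r=16 gives about 13.6 million trainable weights, well under 1% of the base model, and the count grows linearly with r. That growth in capacity is what makes a high rank risky on a few hundred examples.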

Step 4 — Wire everything together

The final step is connecting your fine-tuned adapter to the RAG retrieval chain. The only meaningful change from a standard RAG setup is swapping the base LLM for your fine-tuned version.

hybrid_pipeline.py

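import torch
from langchain.chains import RetrievalQA
from langchain_community.llms import HuggingFacePipeline
from langchain_core.prompts import PromptTemplate
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

BASE_ID = "mistralai/Mistral-7B-Instruct-v0.3"

base_model = AutoModelForCausalLM.from_pretrained(
    BASE_ID,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
# Attach the LoRA adapter produced in Step 3 to the base weights.
model = PeftModel.from_pretrained(base_model, "./fine-tuned-mistral")
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)

hf_pipeline = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=512,
    do_sample=True,   # temperature has no effect unless sampling is enabled
    temperature=0.1,
)
llm = HuggingFacePipeline(pipeline=hf_pipeline)

# RetrievalQA's default prompt differs from the training layout, so pass one
# that mirrors it. The exact wording is an assumption; keep it in sync with
# whatever layout your training data uses.
prompt = PromptTemplate.from_template(
    "You are a technical support assistant. Answer using the provided context only. "
    "Be concise and direct. No disclaimers.\n\n"
    "Context: {context}\n\nQuestion: {question}\n\nAnswer:"
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    retriever=retriever,  # from Step 1 (includes the reranker)
    return_source_documents=True,
    chain_type="stuff",
    chain_type_kwargs={"prompt": prompt},
)

def answer(query: str) -> dict:
    result = qa_chain.invoke({"query": query})
    return {
        "answer": result["result"],
        # .get avoids a KeyError for documents indexed without a source field
        "sources": [doc.metadata.get("source", "unknown") for doc in result["source_documents"]]
    }

response = answer("What are the enterprise pricing tiers?")
print(response["answer"])
print("Sources:", response["sources"])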

1. Index documents on a schedule

Run rag_setup.py once to populate your vector store. Re-run it nightly or weekly as your documents change — a stale index defeats the entire purpose of RAG.

2. Retrain adapters periodically, not continuously

Retrain your LoRA adapter monthly or when you accumulate 200+ new labeled examples. Continuous training adds infrastructure complexity without proportional quality gains.

3. Always A/B test before swapping into production

Run the new adapter against your eval set before replacing the old one. A 5% accuracy regression in the wrong output category can be worse than not updating at all.

4. Log everything from day one

Store every query, retrieved context, and model response. This becomes your future training data, your quality audit trail, and your early-warning system for regressions.
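The check in item 3 can be automated as a simple promotion gate. This is a minimal sketch, assuming you already have per-category accuracy for the old and new adapter on the same eval set; `should_promote`, the category names, the scores, and the 5-point threshold are all illustrative.

```python
def should_promote(old_scores, new_scores, max_drop=0.05):
    """Return (ok, regressions); block promotion if any category drops by more than max_drop."""
    regressions = {}
    for category, old in old_scores.items():
        new = new_scores.get(category, 0.0)  # a missing category counts as a full regression
        if old - new > max_drop:
            regressions[category] = round(old - new, 4)
    return (not regressions, regressions)


# Illustrative per-category accuracy on a shared eval set.
old = {"schema": 0.96, "tone": 0.90, "domain": 0.88}
new = {"schema": 0.97, "tone": 0.82, "domain": 0.89}

ok, regressions = should_promote(old, new)
print("promote" if ok else f"blocked: {regressions}")
```

Gating per category rather than on the overall average matters here: the new adapter above improves two categories and would look fine in aggregate while regressing badly on tone.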

Cost and performance in real numbers

Here is a realistic breakdown for an application handling around 50,000 queries per month on self-hosted infrastructure:

Cost factor                                     | RAG only | Hybrid (RAG + fine-tuning)
One-time fine-tuning (7B, QLoRA, 1K examples)   | $0       | $20–45
Monthly inference (self-hosted 7B)              | $60–90   | $70–100
Vector DB (Pinecone Starter)                    | $70/mo   | $70/mo
Reranker (Cohere, 50K queries)                  | $25/mo   | $25/mo
Domain-specific accuracy                        | ~72%     | ~89%
Output format compliance rate                   | ~61%     | ~96%
Hallucination rate                              | ~8%      | ~2%

Adding fine-tuning costs a one-time $20–45 for training plus roughly $10 more per month for inference. In return you get a 17-point accuracy gain, a 35-point format compliance gain, and a 75% reduction in hallucinations. For any application where output quality has real consequences, that trade is almost always worth it.
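The arithmetic behind these figures can be checked in a few lines. Every number below is copied from the table above; only the variable names are new.

```python
# (low, high) monthly costs in USD, copied from the table.
rag_only = {"inference": (60, 90), "vector_db": (70, 70), "reranker": (25, 25)}
hybrid = {"inference": (70, 100), "vector_db": (70, 70), "reranker": (25, 25)}
one_time_fine_tuning = (20, 45)


def monthly_total(costs):
    """Sum the low and high ends of each cost range."""
    low = sum(lo for lo, _ in costs.values())
    high = sum(hi for _, hi in costs.values())
    return low, high


rag_total = monthly_total(rag_only)
hybrid_total = monthly_total(hybrid)
extra_per_month = (hybrid_total[0] - rag_total[0], hybrid_total[1] - rag_total[1])

accuracy_gain = 89 - 72                 # percentage points
format_gain = 96 - 61                   # percentage points
hallucination_reduction = (8 - 2) / 8   # relative reduction

print("RAG only:", rag_total, "Hybrid:", hybrid_total, "Extra per month:", extra_per_month)
print(accuracy_gain, format_gain, f"{hallucination_reduction:.0%}")
```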

3 mistakes that will break your pipeline

Fine-tuning without retrieved context in your training inputs

Your training examples need to match real inference conditions. If your pipeline passes 4 reranked chunks to the model per query at inference time, the same number of chunks, in the same layout, needs to appear in your training inputs. Teams that train on clean Q&A pairs without any retrieved context are teaching a completely different behavior than what the model will face in production. The result is a model that has learned to answer from memory, not from your documents.

Never updating the vector store

The entire point of RAG is that your model stays current without retraining. That advantage disappears the moment you stop updating your index. Set up a pipeline that automatically detects changed or new documents and re-indexes them. Even a simple nightly cron job that re-crawls your docs folder is enough to start. Stale indexes are the number one reason RAG systems gradually drift toward confident wrong answers over time.
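Change detection does not need to be elaborate. The sketch below hashes each file and compares it against a manifest from the previous run, returning only the paths that need re-indexing. The manifest file name and the helper names are hypothetical, and wiring the result into your indexing script is left out.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_digest(path):
    """SHA-256 of a file's bytes, used as a cheap change fingerprint."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def changed_documents(docs_dir, manifest_path, pattern="**/*.md"):
    """Return new or modified files since the last run, then update the manifest."""
    manifest_file = Path(manifest_path)
    previous = json.loads(manifest_file.read_text()) if manifest_file.exists() else {}
    current = {
        str(p): file_digest(p)
        for p in sorted(Path(docs_dir).glob(pattern))
        if p.is_file()
    }
    changed = [path for path, digest in current.items() if previous.get(path) != digest]
    manifest_file.write_text(json.dumps(current, indent=2))
    return changed


# Demo in a temporary directory: two new files, then one edit.
with tempfile.TemporaryDirectory() as tmp:
    manifest = str(Path(tmp, "manifest.json"))  # hypothetical manifest name
    Path(tmp, "a.md").write_text("one")
    Path(tmp, "b.md").write_text("two")
    first = changed_documents(tmp, manifest)    # both files are new
    second = changed_documents(tmp, manifest)   # nothing changed
    Path(tmp, "a.md").write_text("one, edited")
    third = changed_documents(tmp, manifest)    # only a.md changed
```

This sketch only reports new and modified files. Deleted documents drop out of the manifest but their chunks stay in the vector store, so removal needs separate handling.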

Skipping the reranker to save cost

Teams skip the reranker to cut costs and end up paying a much higher price in answer quality. Without it, irrelevant paragraphs regularly make it into the model's context window, and even a well-trained model will produce poor answers when the context it is given is mostly noise. The Cohere reranker at 50,000 queries costs around $25 per month — the cheapest quality upgrade in this entire stack.

Frequently asked questions

Can I use RAG and fine-tuning at the same time?
Yes, and for most production applications you should. Fine-tuning shapes how the model behaves: its tone, output format, and reasoning style. RAG controls what the model knows at inference time. They solve completely different problems and perform better together than either does in isolation.

Which should I implement first: RAG or fine-tuning?
Almost always RAG first. It is cheaper, faster to set up, and gives you a working system within days. Once you have run it in production for a few weeks and identified where it consistently fails, you introduce fine-tuning to fix those specific gaps. Never add fine-tuning complexity until you understand exactly what problem it is solving.

Does fine-tuning improve RAG retrieval accuracy?
Fine-tuning does not directly improve retrieval; that is the vector search layer's job. What it does improve is how the model uses retrieved chunks: a fine-tuned model is better at ignoring irrelevant context, synthesizing information across multiple chunks, and citing sources correctly when the prompt requires it.

How much does a hybrid RAG + fine-tuning pipeline cost to run?
Fine-tuning a 7B model once on 1,000 examples with QLoRA on a single A100 costs roughly $20–45. Ongoing RAG infrastructure (a vector DB and a reranker) adds around $95/month, and self-hosted inference adds $70–100/month. Total production cost for moderate traffic (50K queries/month) therefore lands between $165 and $195 per month, which is a fraction of what the same capability costs through the GPT-4 API.

What is the best model to fine-tune for a hybrid pipeline?
Mistral 7B Instruct v0.3 and Llama 3 8B Instruct are the most practical choices in 2025. Both fine-tune efficiently with QLoRA on a single GPU and perform close to GPT-4 on narrow domain tasks after tuning. If you want lower infrastructure maintenance, fine-tuning GPT-3.5 Turbo through OpenAI's platform is also a strong option.

What if my fine-tuned model still hallucinates even with RAG?
This almost always traces back to one of two issues. Either your retrieval is returning irrelevant chunks that confuse the model (add or tune a reranker to filter them), or your fine-tuning dataset contained factual errors that the model learned to replicate. Audit your training data carefully, remove any hallucinated examples, and retrain.

Is a hybrid pipeline overkill for a small startup?
Usually yes. Start with RAG only. A hybrid pipeline makes sense when you have already shipped RAG to production, identified specific failure patterns that better retrieval cannot fix, and have at least 500–1,000 labeled training examples in hand. Do not add fine-tuning complexity before those conditions are met.

❓ Frequently Asked Questions

What is RAG in machine learning?
RAG stands for retrieval-augmented generation. At query time the system retrieves relevant documents from an external store and passes them to the model as context, so answers reflect current data without retraining.
What is fine-tuning in machine learning?
Fine-tuning involves taking a pre-trained model and adjusting its parameters to fit your specific use case. This approach is useful when you have a small dataset and want to leverage the knowledge that the pre-trained model has learned from a larger dataset.
When should I use RAG?
Use RAG when your data changes regularly or the model needs knowledge it was never trained on, such as product docs, internal wikis, legal updates, or pricing tables.
When should I use fine-tuning?
Use fine-tuning when you have labeled examples (at least 500 input/output pairs) and need consistent behavior from the model: a specific output format, tone, or style of domain reasoning.
Can I use both RAG and fine-tuning for my project?
Yes. Fine-tune the model for behavior and use RAG to supply current context at inference time. The training inputs should include retrieved context so that training matches inference conditions.
What are some common applications of RAG and fine-tuning?
Common applications include technical support assistants, question answering over internal documentation, and legal, medical, financial, or compliance use cases where a confidently wrong answer is costly.
