Open source RAG evaluation framework

Stop guessing which RAG strategy actually works.

Upload your documents. Run 5 retrieval strategies in parallel. Get an AI-scored leaderboard in under 10 minutes. Open source and self-hostable.

⚡ 9 seconds per evaluation · 💰 $0.002 average cost · 🔒 PII scrubbing built in · ⭐ Open source on GitHub

Which strategy should I use?

Naive RAG? Hybrid search? Reranking? HyDE? Every team guesses and hopes. Nobody measures.

Why is my RAG failing?

Your system scores 68% but you have no idea if it's the chunking, the embeddings, or the model. Debugging takes weeks.

How do I prove it works?

Regulators and stakeholders want evidence. You have vibes. That gap costs contracts.

How it works

From document to decision in 4 steps

1

Upload your documents

PDF or any text document. Stored securely. Chunked and embedded automatically.
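The automatic chunking step typically uses a sliding window with overlap so context isn't cut mid-thought. A minimal sketch; the sizes and function name are illustrative, not the app's actual defaults:

```python
def chunk(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into fixed-size windows; consecutive chunks share `overlap` chars."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "word " * 200          # a 1000-character stand-in document
pieces = chunk(doc, size=200, overlap=40)
print(len(pieces), len(pieces[0]))  # → 6 200
```

Each chunk would then be embedded and stored in pgvector alongside its source document.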

2

Auto-generate test questions

AI reads your documents and creates realistic Q&A pairs automatically. PII scrubbed before storage.

3

5 strategies run in parallel

LangGraph agents benchmark Naive RAG, Hybrid BM25, Cohere Rerank, HyDE, and Parent-Child simultaneously.

4

Know exactly what to deploy

AI judge scores each strategy. Failure attribution tells you why the others lost.
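The fan-out in step 3 can be sketched with asyncio. The five strategy names come from this page; the function names and dummy scoring below are illustrative, not the project's real LangGraph API:

```python
import asyncio

STRATEGIES = ["naive", "hybrid_bm25", "cohere_rerank", "hyde", "parent_child"]

async def run_strategy(name: str, question: str) -> dict:
    """Stand-in for one retrieval agent; a real run retrieves chunks and answers."""
    await asyncio.sleep(0)  # simulates I/O-bound retrieval + generation
    return {"strategy": name, "answer": f"[{name}] answer to: {question}"}

async def benchmark(question: str) -> list[dict]:
    # Fan out: all five strategies run concurrently, like the parallel agents.
    results = await asyncio.gather(*(run_strategy(s, question) for s in STRATEGIES))
    # Fan in: an AI judge would score each answer; a dummy score stands in here.
    return [dict(r, score=1.0) for r in results]

leaderboard = asyncio.run(benchmark("What is the refund policy?"))
print([r["strategy"] for r in leaderboard])  # gather preserves submission order
```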

Self-hosting

Self-host in minutes

Clone the repo and run locally in under 5 minutes. No vendor lock-in. Your data stays yours.

1

Clone the repo

git clone https://github.com/tanmaykaushik451/rag-eval-app
cd rag-eval-app

2

Add your API keys

cp .env.example .env
# Add your keys:
# OpenRouter, Cohere, AWS S3
# Neon PostgreSQL

3

Run it

pip install -r requirements.txt
uvicorn backend.main:app
cd frontend && npm run dev

Full setup guide in the README →

5 strategies

5 strategies. One winner. No guessing.

Most teams pick one and hope. We test all five in 9 seconds.

Baseline

Naive RAG

Pure vector similarity search. Fast, simple, often not enough.

When it wins: Short, focused documents

Most Popular

Hybrid BM25 + Vector

Combines keyword and semantic search using Reciprocal Rank Fusion.

When it wins: Technical docs with specific terms
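Reciprocal Rank Fusion itself is only a few lines: a document's fused score is the sum of 1 / (k + rank) over every ranking it appears in, with k = 60 the conventional constant. A self-contained sketch (the document IDs are illustrative):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists: each doc scores sum(1 / (k + rank)) across rankers."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits   = ["doc3", "doc1", "doc7"]   # keyword ranking
vector_hits = ["doc1", "doc5", "doc3"]   # semantic ranking
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# → ['doc1', 'doc3', 'doc5', 'doc7']
```

Documents near the top of both lists (like doc1) beat documents that score well in only one.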

Most Accurate

Cohere Rerank

Cross-encoder re-scores 20 candidates by true relevance.

When it wins: Long policy documents
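The reranking step can be sketched generically: a first-pass retriever returns many candidates, then a cross-encoder scores each (query, document) pair jointly and keeps the best few. The word-overlap scorer below is only a stand-in for the actual Cohere Rerank API:

```python
def cross_score(query: str, doc: str) -> float:
    """Dummy joint scorer; a real cross-encoder reads query and doc together."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
    # Re-score all first-stage candidates (e.g. 20 from vector search)
    # and keep only the most relevant few for the generator.
    return sorted(candidates, key=lambda d: cross_score(query, d), reverse=True)[:top_n]

candidates = [
    "Our office hours are 9 to 5.",
    "The data retention policy keeps records for seven years.",
    "Retention policy details: records are kept for seven years.",
    "Lunch menus rotate weekly.",
]
print(rerank("what is the data retention policy", candidates, top_n=2))
```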

Most Innovative

HyDE

Generates a hypothetical answer first, then searches with that embedding.

When it wins: Questions worded differently than docs
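A minimal sketch of the HyDE flow, with the LLM and embedder stubbed out (every name and document here is illustrative):

```python
def fake_llm(question: str) -> str:
    # A real system asks an LLM to draft a plausible answer first.
    return "Refunds are issued within 30 days of purchase."

def fake_embed(text: str) -> set:
    # Crude stand-in for an embedding: a bag of lowercase words.
    return set(text.lower().split())

def similarity(a: set, b: set) -> float:
    return len(a & b) / max(len(a | b), 1)

docs = {
    "policy": "Refunds are issued within 30 days of purchase.",
    "intro":  "Welcome to our product documentation.",
}

def hyde_search(question: str) -> str:
    hypothetical = fake_llm(question)      # step 1: draft a hypothetical answer
    query_vec = fake_embed(hypothetical)   # step 2: embed the draft, not the question
    return max(docs, key=lambda d: similarity(query_vec, fake_embed(docs[d])))

print(hyde_search("Can I get my money back?"))  # → policy
```

The question shares no words with the policy document, but the hypothetical answer does, which is exactly the gap HyDE closes.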

Best Context

Parent-Child

Matches small chunks but returns surrounding context for richer answers.

When it wins: Multi-paragraph answers
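The small-to-big idea fits in a few lines: score the small child chunks, then hand the whole parent passage to the generator. The data and word-overlap matching below are illustrative (a real system matches children by vector similarity):

```python
parents = {
    "p1": "Section 4: Refunds. Refunds are issued within 30 days. "
          "Shipping costs are not refundable. Contact support to start a claim.",
}
# Small chunks index back to their parent passage.
children = [
    {"parent": "p1", "text": "Refunds are issued within 30 days."},
    {"parent": "p1", "text": "Shipping costs are not refundable."},
]

def retrieve(query: str) -> str:
    # Match on the precise child chunk...
    q = set(query.lower().split())
    best = max(children, key=lambda c: len(q & set(c["text"].lower().split())))
    # ...but return the full parent passage for richer generation context.
    return parents[best["parent"]]

print(retrieve("when are refunds issued"))
```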

Use cases

Who this is built for

Any team shipping RAG in production faces the same problem. Here is how different industries use RAG Eval.

Example use case

Financial Institution

OSFI-compliant RAG evaluation

A Canadian bank building an internal policy assistant for compliance documents needs to prove systematic evaluation to regulators before going live.

  • Auto-generates test questions from policy docs
  • PII scrubbed before any cloud processing
  • 5-strategy benchmark across full corpus
  • One-click audit ZIP for OSFI review
  • CI/CD gate prevents quality regressions
Example use case

AI-Native Startup

Shipping RAG without breaking production

A developer tools company building an AI assistant over their documentation needs to pick the right retrieval strategy before launch and protect quality after.

  • Benchmarks all 5 strategies in one run
  • Failure attribution shows exactly why Naive RAG misses technical terminology
  • Baseline locked after first evaluation
  • GitHub Action blocks regressions on every PR
Example use case

Healthcare Platform

Patient-safe RAG with full audit trail

A healthcare platform building clinical decision support needs evaluation that never exposes patient data and produces evidence for clinical governance review.

  • Self-hostable: runs entirely on your servers
  • Zero data leaves your network
  • Presidio PII scrubbing on all test questions
  • Full audit export for clinical governance
  • Batch evaluation across hundreds of questions

Features

Everything you need. Nothing you don't.

Built for production teams, not for demos.

Failure Attribution

Pinpoints exactly why each strategy failed: embedding weakness, retrieval logic, or generation layer.

CI/CD Regression Gates

GitHub Action blocks merges when RAG quality drops below your baseline. Never ship a silent regression.
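The gate logic itself is simple: compare the latest evaluation score against a locked baseline and fail the build on a regression. A sketch under stated assumptions; the function name, threshold, and tolerance are illustrative, not the Action's actual interface:

```python
def quality_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True when the current score is within `tolerance` of the locked baseline."""
    return current >= baseline - tolerance

# In CI the baseline comes from a locked file and `current` from the latest
# evaluation run; a False result exits non-zero and blocks the merge.
print(quality_gate(0.84, baseline=0.82))  # → True  (improvement passes)
print(quality_gate(0.75, baseline=0.82))  # → False (regression is blocked)
```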

Synthetic Test Generator

Auto-generates realistic Q&A pairs from your documents. No manual labeling required.

PII Scrubbing

Microsoft Presidio automatically detects and removes names, emails, and sensitive data before processing.
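For illustration, here is a minimal regex stand-in for what Presidio-style scrubbing does; real Presidio uses NLP models and catches far more entity types (names, addresses, and so on):

```python
import re

# Detect a few PII patterns and replace them with entity placeholders.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}

def scrub(text: str) -> str:
    for entity, pattern in PATTERNS.items():
        text = pattern.sub(f"<{entity}>", text)
    return text

print(scrub("Contact jane.doe@example.com or 555-123-4567 for details."))
# → Contact <EMAIL> or <PHONE> for details.
```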

Audit Export

One-click ZIP with questions.csv, scrub_report.json, and a PDF summary formatted for regulatory review.

LangSmith Tracing

Every parallel agent run fully traced. See cost, latency, and token usage per strategy in real time.

Tech stack

Built on the stack you already trust

FastAPI · PostgreSQL + pgvector · LangGraph · LangSmith · Cohere · Hugging Face · AWS S3 · Next.js · Microsoft Presidio

Built with async Python, LangGraph parallel agents, pgvector for vector storage, and Microsoft Presidio for PII detection. Every evaluation traced in LangSmith. Open architecture: bring your own models, swap any component.

Open source. Self-hostable. Free.

Clone the repo, add your API keys, and run your first evaluation in under 5 minutes.