How I Improved beyin's Retrieval Quality

April 3, 2026

beyin is a local RAG tool I built that works both as a direct query tool and through MCP with AI agents. You point it at YouTube videos or web pages, it transcribes and indexes the content, and then makes that knowledge queryable later in a useful way.

If you want to take a look at the project itself, the repo is here: github.com/buralog/beyin.

The initial implementation worked. But "worked" is vague. I wanted to know how well it worked, and more importantly, where it was failing.

This article is the story of that investigation: the benchmark I built, what I found, the fixes I tried, the ones that backfired, and what actually made a difference.

The Test Content: A Turkish Software Podcast

The content I used for benchmarking was a Turkish YouTube playlist, a software podcast where two developers discuss technical topics: software testing, career growth in the industry, music equipment for home studios, front-end frameworks, and patterns like the Composition API.

The language mix made it a meaningful real-world test. The podcast is in Turkish, but the technical vocabulary is almost entirely English: framework names, product names, acronyms, and brand names. Queries naturally mix both languages, like "Composition API kullanımı" (using the Composition API) or "MIDI ve ses ekipmanları" (MIDI and audio equipment). This is exactly the kind of content that exposes weaknesses in embedding models.

I built a pack from 8 episodes, totaling hundreds of transcript chunks stored in ChromaDB.

How the Benchmark Works

What gets measured

The benchmark runs a set of natural language queries against the pack and checks whether the correct episode's chunks appear in the top results. The metrics are:

Hit@k: does the correct chunk appear in the top-k results? Hit@1 means it was the very first result. Hit@8 means it appeared somewhere in the top 8. Higher k = easier bar to clear.

MRR (Mean Reciprocal Rank): a single number summarizing ranking quality. If the correct result is rank 1, MRR contribution is 1.0. If it's rank 2, it's 0.5. If it's rank 5, it's 0.2. Averaged across all queries. MRR punishes buried correct answers even when they technically "hit" within the top-8 window.

Latency: mean query time in milliseconds, measured across all queries.
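Both ranking metrics reduce to the 1-based rank of the correct chunk in each query's result list. A minimal sketch of how they're computed (function names are mine, not beyin's actual code):

```python
def hit_at_k(rank, k):
    """rank is the 1-based position of the correct chunk, or None for a miss."""
    return rank is not None and rank <= k

def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries; a complete miss contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Three queries: correct chunk at rank 1, at rank 5, and a complete miss.
ranks = [1, 5, None]
print(hit_at_k(ranks[1], 8))        # rank 5 still counts as a Hit@8
print(mean_reciprocal_rank(ranks))  # (1.0 + 1/5 + 0) / 3 ≈ 0.4
```

Note how the rank-5 query "hits" at k=8 but contributes only 0.2 to MRR, which is exactly how MRR punishes buried answers.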

Two datasets

I ran two sets of queries:

  • Standard dataset (24 queries): Natural questions about episode topics in Turkish and English
  • Hard dataset (24 queries): Specifically designed to target content that speech-to-text models commonly mishear, including product model numbers, acronyms, mixed-language technical terms, and brand names

Run 1: Baseline

The initial beyin implementation used:

  • whisper-small for audio transcription
  • all-MiniLM-L6-v2 as the embedding model

all-MiniLM-L6-v2 is the most common default in the sentence-transformers ecosystem. Fast, compact, and well-documented, it was a reasonable starting point.

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |

At first glance, 92% Hit@8 looks acceptable. But two Turkish queries completely missed; the correct episode didn't appear in the top 8 at all:

  • "müzik ekipmanları ve ses düzeni" (music equipment and sound setup)
  • "yazılım sektöründe kariyer gelişimi" (career growth in the software industry)

Both misses were Turkish queries. All English queries hit within the top 8. This was the first signal: the embedding model might have a language gap.

Run 2: Diagnosis - Is the content even indexed?

Before trying any fix, I needed to answer a simpler question: are the correct chunks even in the index?

I ran the same queries but retrieved n=20 results instead of n=8, to see where the missed chunks actually ranked.
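The diagnostic itself is just "retrieve deep, then find where the expected episode first appears." A sketch of that check (the helper name and result shape are mine, for illustration):

```python
def rank_of_episode(results, expected_episode):
    """1-based rank of the first chunk from the expected episode, or None."""
    for i, chunk in enumerate(results, start=1):
        if chunk["episode"] == expected_episode:
            return i
    return None

# Toy top-20 result list: the correct episode first appears at position 18,
# mirroring the "müzik ekipmanları" finding below.
results = (
    [{"episode": "ep-other"}] * 17
    + [{"episode": "ep-music"}]
    + [{"episode": "ep-other"}] * 2
)
print(rank_of_episode(results, "ep-music"))  # -> 18
```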

Findings:

  • "müzik ekipmanları ve ses düzeni" (music equipment and sound setup) -> correct chunk at rank 18
  • "yazılım sektöründe kariyer gelişimi" (career growth in the software industry) -> correct chunk at rank 13

The content was there. The model had indexed it correctly. It was just ranked too poorly to surface in the top 8.

This was an important diagnostic step. The problem wasn't missing data or chunking issues. It was a ranking quality problem. The fix needed to be about the embedding model, not the data pipeline.

Run 3: The Obvious Fix That Backfired

The standard retrieval improvement playbook says: add a cross-encoder reranker. Retrieve more candidates with the bi-encoder (fast, approximate), then rerank with a slower but more accurate cross-encoder model.

I tried cross-encoder/ms-marco-MiniLM-L-6-v2, a popular cross-encoder trained on the MS MARCO passage ranking dataset, which consists of English Bing search queries.

Setup: retrieve 20 candidates with all-MiniLM-L6-v2, rerank to top 8 with the cross-encoder.
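The retrieve-then-rerank shape looks roughly like this. In the real run the scorer was a sentence-transformers CrossEncoder loaded from cross-encoder/ms-marco-MiniLM-L-6-v2; here it's stubbed with a toy term-overlap score so the sketch stays self-contained and runnable:

```python
def rerank(query, candidates, score_fn, top_k=8):
    """Re-score bi-encoder candidates with a stronger scorer, keep the best top_k."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:top_k]

def overlap_score(query, text):
    """Stand-in for the cross-encoder's relevance score: fraction of
    query terms that appear in the candidate text."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)

candidates = [
    {"text": "unit testing and test culture in software teams"},
    {"text": "home studio audio gear and MIDI controllers"},
]
top = rerank("software testing culture", candidates, overlap_score, top_k=1)
print(top[0]["text"])  # the testing chunk wins on term overlap
```

The failure mode described below lives entirely inside `score_fn`: if the scorer can't judge Turkish text, the sort order it imposes is noise.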

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |
| 3 - ms-marco reranker | all-MiniLM-L6-v2 + ms-marco | 71% | 79% | 83% | 83% | 0.760 | 92.8ms |

Hit@1 improved by 9 percentage points. MRR improved. But Hit@8 dropped from 92% to 83% because the reranker pushed correct Turkish chunks further down than they were before.

"software testing önemi ve test yazma" (the importance of software testing and writing tests) went from rank 1 in the baseline to a complete miss after reranking.

The cross-encoder was trained on English search data, so when applied to Turkish transcript chunks it had no basis for assessing semantic relevance. It introduced systematic scoring errors. A language-mismatched reranker isn't neutral; it's actively harmful.

Latency also jumped 5.5× from 16.8ms to 92.8ms.

Run 4: Trying a Multilingual Reranker

I replaced ms-marco with BAAI/bge-reranker-base, a multilingual cross-encoder with Turkish support. Same setup: retrieve 20 candidates, rerank to top 8.

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |
| 3 - ms-marco reranker | all-MiniLM-L6-v2 + ms-marco | 71% | 79% | 83% | 83% | 0.760 | 92.8ms |
| 4 - BGE reranker | all-MiniLM-L6-v2 + BGE | 62% | 75% | 79% | 92% | 0.717 | 186.4ms |

The BGE reranker fixed one of the two misses. "müzik ekipmanları" (music equipment) moved to rank 6, but "kariyer gelişimi" (career growth) still missed. MRR improved only marginally over baseline (+0.012). Latency also exploded to a 186.4ms mean, with p95 at 767ms: 11× slower than baseline for a near-zero gain.

The lesson: rerankers work best when the bi-encoder retrieves noisy candidates and a stronger model can separate signal from noise. When the bi-encoder itself is the problem, and it can't measure Turkish semantic similarity at all, a reranker can't fully compensate. You're asking the reranker to fix bad input.

Run 5: The Real Fix

Instead of patching around the bi-encoder's weakness, I asked a different question: what if I used a model that actually understands Turkish?

paraphrase-multilingual-mpnet-base-v2 is a 768-dimensional model trained on parallel corpora across 50+ languages, including Turkish. I tested it with an in-memory re-index: fetched all chunks from ChromaDB, re-embedded them with the new model, and queried with cosine similarity. No pack rebuild was needed for this test.
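The in-memory test is conceptually just "re-embed everything, then rank by cosine similarity." A toy sketch of that ranking step, with hand-written 3-d vectors standing in for the real 768-d multilingual-mpnet embeddings (in the actual test they came from SentenceTransformer-encoded ChromaDB chunks):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def query_in_memory(query_vec, chunks, top_k=8):
    """Rank pre-embedded chunks by cosine similarity to the query vector."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return ranked[:top_k]

# Toy "embeddings" for two episodes; ids and vectors are made up.
chunks = [
    {"id": "music-ep", "vec": [0.9, 0.1, 0.0]},
    {"id": "career-ep", "vec": [0.1, 0.9, 0.1]},
]
top = query_in_memory([1.0, 0.0, 0.0], chunks, top_k=1)
print(top[0]["id"])  # -> music-ep
```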

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |
| 3 - ms-marco reranker | all-MiniLM-L6-v2 + ms-marco | 71% | 79% | 83% | 83% | 0.760 | 92.8ms |
| 4 - BGE reranker | all-MiniLM-L6-v2 + BGE | 62% | 75% | 79% | 92% | 0.717 | 186.4ms |
| 5 - multilingual bi-encoder* | multilingual-mpnet | 83% | 96% | 100% | 100% | 0.892 | 99.6ms |

* Includes re-embedding 696 chunks at test time, so latency is not representative of production speed. See Run 7.

Zero misses. Both previously missing Turkish queries now ranked first. This was the largest gain across the entire experiment series, and it came from a single model change.

Turkish Hit@8 went from 88% to 100%. English stayed at 100%.

The outcome itself was not surprising. On mixed Turkish-English content, you would expect a multilingual model to outperform a more English-oriented default. What made the benchmark useful was showing how strongly that choice dominated the results, while more complex fixes added little or made things worse.

Run 6: Does Adding a Reranker on Top Help More?

With the strong multilingual bi-encoder in place, I tested whether the BGE reranker could squeeze out additional gains: bi-encoder retrieves 20 candidates, reranker selects top 8.

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |
| 3 - ms-marco reranker | all-MiniLM-L6-v2 + ms-marco | 71% | 79% | 83% | 83% | 0.760 | 92.8ms |
| 4 - BGE reranker | all-MiniLM-L6-v2 + BGE | 62% | 75% | 79% | 92% | 0.717 | 186.4ms |
| 5 - multilingual bi-encoder* | multilingual-mpnet | 83% | 96% | 100% | 100% | 0.892 | 99.6ms |
| 6 - multilingual + BGE* | multilingual-mpnet + BGE | 71% | 92% | 96% | 96% | 0.821 | 118.6ms |

* Both runs used the same in-memory re-index setup.

Worse across every metric. Hit@1 dropped 12pp from Run 5. MRR dropped 0.071. "software career growth and testing practices" went from rank 1 to a miss.

When the bi-encoder already retrieves high-quality, well-ranked candidates, adding a reranker introduces noise. It second-guesses correct rankings. The best pipeline is the simplest one.

Run 7: Production Rebuild

Runs 5 and 6 used in-memory re-indexing, which included model loading and re-embedding 696 chunks per test run. That 99.6ms latency figure was misleading.

Run 7 was a proper full rebuild: pack rebuilt from scratch with whisper-medium transcription and paraphrase-multilingual-mpnet-base-v2 embeddings stored in ChromaDB.

| Run | Config | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 - baseline | all-MiniLM-L6-v2 | 62% | 75% | 83% | 92% | 0.705 | 16.8ms |
| 3 - ms-marco reranker | all-MiniLM-L6-v2 + ms-marco | 71% | 79% | 83% | 83% | 0.760 | 92.8ms |
| 4 - BGE reranker | all-MiniLM-L6-v2 + BGE | 62% | 75% | 79% | 92% | 0.717 | 186.4ms |
| 5 - multilingual bi-encoder* | multilingual-mpnet | 83% | 96% | 100% | 100% | 0.892 | 99.6ms |
| 6 - multilingual + BGE* | multilingual-mpnet + BGE | 71% | 92% | 96% | 96% | 0.821 | 118.6ms |
| 7 - production rebuild | multilingual-mpnet (whisper-medium) | 83% | 92% | 96% | 100% | 0.885 | 19.5ms |

* Runs 5 and 6 used the in-memory re-index setup.

Production query latency with the multilingual model is 19.5ms, barely different from the baseline 16.8ms. ChromaDB vector search is fast regardless of embedding dimensionality. The in-memory test overhead came from re-embedding, not querying.

The tiny MRR drop (0.892 -> 0.885) compared to the in-memory test is explained by slightly different chunking. whisper-medium produced 640 chunks vs 696 from whisper-small, so the transcription boundaries changed slightly.

Runs 8 and 9: Does Whisper Model Size Affect Retrieval?

I had switched to whisper-medium for Run 7 because medium produces cleaner transcriptions, especially for content with English technical terms in Turkish speech. But I wanted to know: does this actually affect retrieval quality, or just the readability of the transcript text?

I built a harder query set specifically targeting content that whisper-small commonly mishears: product model numbers, acronyms, mixed Turkish/English technical terms, and brand names. Then I rebuilt the pack with whisper-small and ran the hard benchmark, rebuilt it again with whisper-medium, and re-ran it.

| Whisper | Hit@1 | Hit@3 | Hit@5 | Hit@8 | MRR | Latency |
| --- | --- | --- | --- | --- | --- | --- |
| small | 79% | 92% | 92% | 96% | 0.854 | 19.2ms |
| medium | 79% | 88% | 92% | 96% | 0.851 | 20.5ms |

Statistically identical. The single miss on both runs was a semantics problem: a query about TDD methodology that targeted an episode about career growth in software and testing culture, not TDD itself. Not a transcription issue.

paraphrase-multilingual-mpnet-base-v2 is robust enough to transcription noise that even whisper-small's garbled technical terms don't meaningfully hurt retrieval. The multilingual embedding space captures semantic intent even when specific terms are slightly mangled.

But whisper-medium still matters. The LLM reads the raw transcript text to generate answers. When technical terms like framework names, product model numbers, and API names are transcribed correctly, the generated answers are more accurate and precise. That improvement is real, it is just not captured by Hit@k or MRR. For content like this podcast, with dense technical vocabulary in a non-English language, medium is worth the extra build time.

The Normalization Detour

After these benchmark runs, I audited the core pipeline for any remaining improvements. One thing I noticed: the embedding vectors weren't normalized. The norm of a paraphrase-multilingual-mpnet-base-v2 embedding is approximately 2.83, not 1.0.

Conventional wisdom says: normalize vectors and use cosine similarity for semantic search. L2 distance on unnormalized vectors can be dominated by magnitude differences rather than direction.
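A contrived two-dimensional example makes the magnitude effect concrete: the two metrics can rank the same pair of documents differently when vector lengths differ (these vectors are made up for illustration, not model output):

```python
import math

def l2(a, b):
    """Euclidean (L2) distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine similarity: direction only, magnitude ignored."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

q = [1.0, 0.0]
a = [0.9, 0.1]  # slightly off-direction, small magnitude
b = [3.0, 0.0]  # exact direction, large magnitude

# Cosine prefers b (perfect direction); L2 prefers a (closer in raw space).
print(cosine(q, b) > cosine(q, a))  # True
print(l2(q, a) < l2(q, b))          # True
```

Which ordering is "right" depends on whether the model's output magnitudes carry signal, which is exactly what the benchmark below ends up deciding empirically.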

So I tested it:

  • Added normalize_embeddings=True when building the index
  • Added normalize_embeddings=True when embedding queries
  • Switched ChromaDB's distance metric to cosine (hnsw:space: cosine)

Results on the standard dataset: MRR dropped from 0.885 to 0.873, Hit@3 dropped from 92% to 83%.

Worse. Reverted everything.

The empirical answer overrides the theoretical intuition: for this specific model, L2 on raw unnormalized vectors performs better than normalized cosine. The magnitude of paraphrase-multilingual-mpnet-base-v2's output vectors appears to carry useful signal that cosine discards. Whether this is by design or a quirk of how the model was trained is unclear, but the benchmark result was unambiguous.

What Was Actually Wrong, and What's Still Wrong

Before the improvements, retrieval was noisy for several concrete reasons. It's worth being specific about which ones I fixed and which ones remain.

Fixed

Cross-language retrieval. The podcast is in Turkish, but the technical vocabulary is almost entirely English: framework names, product names, acronyms. all-MiniLM-L6-v2 was trained primarily on English data and couldn't measure Turkish semantic similarity accurately. Swapping to paraphrase-multilingual-mpnet-base-v2 fixed this entirely. This was the dominant problem.

Named entity corruption. whisper-small regularly mishears English proper nouns spoken with a Turkish accent. Technical terms, product model numbers, and framework names would come back garbled in the transcript text, making the LLM's answers imprecise or wrong. Switching to whisper-medium substantially reduces this. "Martin Folder" becomes "Martin Fowler" again.

English words in Turkish speech. When a Turkish speaker switches to English mid-sentence, which happens constantly in a software podcast, whisper-small struggles to follow the language switch cleanly. whisper-medium handles this transition much better, producing cleaner transcripts for the mixed-language segments that dominate this content.

Still present

Conversational chunks mixing topics. The content is a podcast, with two people talking and meandering through subjects. Chunking is time-based, so a single chunk can cover three unrelated topics in 60 seconds. When a query targets one of those topics, the chunk scores lower because only a third of it is relevant. This would require smarter topic-aware or sentence-boundary chunking to fix properly.
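One possible direction, sketched under the assumption that transcript segments carry start/end timestamps: close a chunk only once the time budget is spent and the current segment ends a sentence, so a thought is never split mid-sentence. This is illustrative, not what beyin does today:

```python
def chunk_at_sentence_boundaries(segments, max_seconds=60.0):
    """Greedily group transcript segments, closing a chunk only on a
    segment that ends a sentence once the time budget is exceeded."""
    chunks, current, start = [], [], None
    for seg in segments:
        if start is None:
            start = seg["start"]
        current.append(seg["text"])
        over_budget = seg["end"] - start >= max_seconds
        ends_sentence = seg["text"].rstrip().endswith((".", "?", "!"))
        if over_budget and ends_sentence:
            chunks.append(" ".join(current))
            current, start = [], None
    if current:  # flush whatever is left at the end of the episode
        chunks.append(" ".join(current))
    return chunks

# Toy segments: the first chunk runs past 60s so it can end on a full sentence.
segments = [
    {"start": 0, "end": 40, "text": "We talked about testing"},
    {"start": 40, "end": 70, "text": "and why culture matters."},
    {"start": 70, "end": 95, "text": "Now, about audio gear."},
]
print(chunk_at_sentence_boundaries(segments, max_seconds=60))
```

A topic-aware version would additionally compare embeddings of adjacent chunks and split where similarity drops, but even the sentence-boundary rule avoids chunks that cut a point in half.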

Broad questions returning loosely related chunks. The index is permissive rather than strict. When there's no direct match, it returns the closest thing it has rather than saying "I don't know." A narrow, concrete question like "What does he say about Unit of Work pattern?" works well. A broad synthesis question like "What does he think about the future of the software industry?" returns loosely related chunks that happen to share vocabulary. This is a fundamental RAG limitation, not a bug.

Residual ASR noise. whisper-medium reduces transcription errors but doesn't eliminate them. Uncommon proper nouns, very fast speech, and background noise will still leave some chunks garbled.

The practical implication hasn't changed entirely: narrow, concrete questions still work better than broad synthesis questions. But the language mismatch issue, which was causing clean Turkish queries to miss completely, is gone.

What Actually Made a Difference

The bi-encoder model is the dominant lever. One config line change, swapping all-MiniLM-L6-v2 for paraphrase-multilingual-mpnet-base-v2, eliminated all retrieval misses, improved MRR by 26%, improved Hit@1 by 21 percentage points, and added only ~3ms to query latency.

Everything else was either neutral or negative:

  • English cross-encoder reranker: actively harmful for Turkish content
  • Multilingual cross-encoder reranker: marginal gain, 11× latency cost, regresses when bi-encoder is strong
  • Normalizing embeddings + cosine distance: worse than L2 on raw vectors for this model
  • Larger whisper model: improves answer quality, has no measurable effect on retrieval

The final configuration for beyin: paraphrase-multilingual-mpnet-base-v2 as the default embedding model, whisper-medium as the default transcription model. No reranker.

For multilingual content, or honestly any content where the default English-optimized embedding model might struggle, the model choice matters far more than any retrieval architecture complexity layered on top of it.
