Day 9 · Overreliance on reranker (No.5, No.6)

PSBigBig
most teams flip on a shiny reranker and the offline chart jumps. then real traffic arrives and the lift melts. if the base space is unhealthy, a reranker only hides the pain. this writeup is the minimal path to prove that, fix the base, then keep reranking as light polish.

a quick story to set context


we had a product faq bot. cross-encoder reranker looked great on 30 handpicked questions. in prod, small paraphrases flipped answers. reading traces showed citations pointing to generic intros, not the exact span. turning off rerank exposed the truth. the raw top-k almost never covered the right section. geometry was wrong. chunks were messy. we were living in No.5 and occasionally No.6 when synthesis tried to “fill in” gaps.

60-second ablation that tells you the truth


  1. run the same question twice
    1.1 retriever only
    1.2 retriever then reranker


  2. record three numbers
    coverage of the target section in top-k
    ΔS(question, retrieved)
    citations per atomic claim


  3. label
    low coverage without rerank that “magically” improves only after rerank → No.5 Semantic ≠ Embedding
    coverage ok but prose still drifts or merges extra claims → No.6 Logic Collapse


  4. stability
    ask three paraphrases. if labels or answers alternate, the chain is unstable. reranker is masking the base failure.

rules of thumb (the sketch below wires these checks together)
coverage before rerank ≥ 0.70
ΔS ≤ 0.45 for stable chains
one valid citation per atomic claim
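
a minimal sketch of the ablation loop, assuming hypothetical retrieve(q, k), rerank(q, candidates), and embed(text) callables plus a small set of gold section ids per question. it records coverage before and after rerank, ΔS, and a first-pass label; citation checks come later with per_claim_ok.

Code:
import numpy as np

def _delta_s(q_vec, ctx_vec):
    # same probe as the utility at the end of this post: 1 - cosine
    q = q_vec / np.linalg.norm(q_vec)
    c = ctx_vec / np.linalg.norm(ctx_vec)
    return float(1.0 - np.dot(q, c))

def ablate(question, paraphrases, gold_ids, retrieve, rerank, embed, k=10):
    rows = []
    for q in [question] + list(paraphrases):
        pool = retrieve(q, k=50)                 # wider candidate pool of (snippet_id, text)
        base = pool[:k]                          # retriever-only top-k
        rr = rerank(q, pool)[:k]                 # reranked top-k from the same pool
        denom = max(1, min(k, len(gold_ids)))
        cov_base = sum(1 for sid, _ in base if sid in gold_ids) / denom
        cov_rr = sum(1 for sid, _ in rr if sid in gold_ids) / denom
        ds = _delta_s(embed(q), embed(" ".join(t for _, t in base)))
        if cov_base < 0.70 and cov_rr >= 0.70:
            label = "No.5 Semantic != Embedding"     # rerank is masking bad geometry
        elif cov_base >= 0.70:
            label = "check No.6 Logic Collapse"      # coverage ok, read the prose for drift
        else:
            label = "base retrieval failing"
        rows.append({"q": q, "cov_base": round(cov_base, 2), "cov_rr": round(cov_rr, 2),
                     "delta_s": round(ds, 2), "label": label})
    return rows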

what overreliance looks like in traces

  • base top-k rarely contains the true span. reranker promotes “sounds right” text
  • small header or boilerplate chunks dominate retrieval candidates
  • cosine vs L2 setup is mixed across shards. norms inconsistent
  • offline tables show nice MRR but human readers cannot match citations to spans
  • with rerank off, answers alternate across runs on paraphrases
  • model “repairs” missing evidence instead of pausing for it

root causes to check first

  • metric and normalization mismatch between corpus and queries (a quick norm check follows this list)
  • chunking to embedding contract missing. no stable snippet id, section id, offsets
  • vectorstore fragmentation. near-duplicates split the same fact across ids
  • reranker objective favors generic summaries over tight claim-aligned spans
  • eval set is tiny and biased toward reranker behavior
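
the first root cause is the cheapest to check. a quick diagnostic sketch, assuming you can pull a sample of corpus and query vectors as numpy arrays; the variable names are illustrative.

Code:
import numpy as np

def norm_report(vecs, name):
    # row-wise L2 norm stats for a sample of embeddings
    n = np.linalg.norm(np.asarray(vecs, dtype="float32"), axis=1)
    print(f"{name}: min={n.min():.3f} mean={n.mean():.3f} max={n.max():.3f}")

# usage (illustrative): norm_report(corpus_sample, "corpus")
#                       norm_report(query_sample, "queries")
# if one side sits at 1.000 and the other is spread out, build and query
# are not sharing a normalization policy and cosine vs L2 will disagree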

minimal fix path


goal: make the base space trustworthy, then keep reranking as a gentle, auditable layer.

  1. align metric and normalization. keep one metric policy across build and query. for cosine-style retrieval, L2-normalize both sides and use a consistent index.

Code:
from sklearn.preprocessing import normalize
Z = normalize(Z, axis=1).astype("float32")   # corpus
Q = normalize(Q, axis=1).astype("float32")   # queries
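
a minimal sketch of the consistent-index half, assuming FAISS; on unit vectors inner product equals cosine, so build and query share one metric. any store works, the point is a single metric policy end to end.

Code:
import faiss

d = Z.shape[1]                      # Z, Q are the normalized float32 arrays above
index = faiss.IndexFlatIP(d)        # inner product on unit vectors == cosine
index.add(Z)
scores, ids = index.search(Q, 10)   # top-10 candidate ids per query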

  2. enforce the chunk → embed contract
    mask boilerplate, keep window sizes consistent with your model, emit snippet_id, section_id, offsets, tokens.
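
a minimal sketch of that contract, field names illustrative rather than any fixed schema: every embedded chunk carries enough metadata to trace a citation back to an exact span.

Code:
def make_chunk_record(doc_id, section_id, seq, text, start, end):
    # one embeddable unit plus the metadata needed to audit citations later
    return {
        "snippet_id": f"{doc_id}:{section_id}:{seq}",
        "section_id": section_id,
        "offsets": [start, end],        # character offsets into the source document
        "tokens": len(text.split()),    # rough count, swap in your real tokenizer
        "text": text,
    }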


  3. add a coverage gate before rerank
    if base coverage is below 0.70, do not rerank. return a short bridge plan that asks for a better retrieval pass or more context.

Code:
def coverage_ok(candidates, target_ids, k=10, th=0.70):
    # share of expected snippet ids that show up in the top-k candidates
    hits = sum(1 for i in candidates[:k] if i in target_ids)
    denom = max(1, min(k, len(target_ids)))   # cap by k so short target lists are not penalized
    return hits / float(denom) >= th
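
wiring the gate in, a sketch with hypothetical retrieve() and rerank() callables; target_ids are the sections you expect for this question (from your eval set or router). when base coverage misses the bar, skip the reranker and return a bridge plan instead of a polished-looking answer.

Code:
def gated_answer(question, target_ids, retrieve, rerank, k=10):
    candidates = retrieve(question, k=k)          # list of snippet ids
    if not coverage_ok(candidates, target_ids, k=k):
        return {"status": "bridge",
                "plan": "coverage below 0.70: widen retrieval, fix chunking, or ask for more context",
                "candidates": candidates[:k]}
    return {"status": "ok", "ranked": rerank(question, candidates)[:k]}
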
  4. lock cite-then-explain. fail fast when any claim lacks in-scope citations.

Code:
def per_claim_ok(payload, allowed):
    # payload: list of claim dicts, each carrying a "citations" list
    # allowed: in-scope snippet ids for this answer
    bad = [i for i,c in enumerate(payload)
           if not c.get("citations") or not set(c["citations"]) <= set(allowed)]
    return {"ok": not bad, "bad_claims": bad}
  5. keep reranking for span alignment only. prefer claim-aligned spans over generic summaries, and record rerank scores next to citations for auditing.
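
a small sketch of that audit trail, field names illustrative: keep the rerank score next to each cited span so a reviewer can see why a span won.

Code:
def audit_rows(question, reranked):
    # reranked: list of (snippet_id, span_text, score) tuples from the reranker
    return [{"question": question,
             "snippet_id": sid,
             "span": span[:200],                       # preview only
             "rerank_score": round(float(score), 4)}
            for sid, span, score in reranked]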

when minimal is not enough

  • rebuild the index from clean embeddings with a single metric policy
  • retrain IVF or PQ codebooks after dedup and boilerplate masking
  • collapse near-duplicates before indexing
  • add a sparse leg and fuse simply when exact terms matter (a fusion sketch follows this list)
  • if you must cross-encode, cap its influence and keep the base candidate set healthy
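
for the sparse leg, a minimal fusion sketch using reciprocal rank fusion; it needs no score calibration between the dense and sparse sides, and the constant 60 is the commonly used default rather than a tuned value. BM25 or any keyword retriever can supply sparse_ids.

Code:
def rrf_fuse(dense_ids, sparse_ids, k=60, top_n=10):
    # reciprocal rank fusion: each id scores 1/(k + rank) in every list it appears in
    scores = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]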

tiny utilities you can paste


base vs rerank lift


Code:
def lift_at_k(gt_ids, base_ids, rr_ids, k=10):
    base_hit = int(any(x in gt_ids for x in base_ids[:k]))
    rr_hit   = int(any(x in gt_ids for x in rr_ids[:k]))
    return {"base_hit": base_hit, "rr_hit": rr_hit, "lift": rr_hit - base_hit}

neighbor overlap sanity


Code:
def overlap_at_k(a_ids, b_ids, k=20):
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)   # healthy spaces sit well below 0.35

minimal ΔS probe


Code:
import numpy as np
def delta_s(q, r):
    # 1 - cosine similarity between the question and retrieved-context embeddings
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r)
    return float(1.0 - np.dot(q, r))

acceptance before you call it fixed

  • base top-k covers the target section at 0.70 or higher
  • ΔS at or below 0.45 across three paraphrases
  • every claim has an in-scope citation id
  • reranker provides positive lift without being required for correctness (the sketch below folds these checks into one gate)
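
a combined gate, sketched on top of the utilities in this post plus the numbers you already collect per question; it just turns the four bullets into one pass/fail record.

Code:
def acceptance(cov_base, ds_values, claims_ok, lift):
    # cov_base: base top-k coverage; ds_values: ΔS across three paraphrases
    # claims_ok: output of per_claim_ok; lift: output of lift_at_k
    checks = {
        "coverage":  cov_base >= 0.70,
        "delta_s":   max(ds_values) <= 0.45,
        "citations": claims_ok["ok"],
        "rerank":    lift["base_hit"] == 1 and lift["rr_hit"] == 1,   # base finds it, rerank does not hurt
    }
    return {"pass": all(checks.values()), "checks": checks}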

tldr


rerankers are polish, not crutches. fix metric and normalization, fix chunk contracts, demand coverage and citations, then let the reranker nudge spans into place. call it No.5 when geometry is wrong, and No.6 when synthesis still drifts after coverage is healthy.

full writeup and the rest of the series live here
Problem Map article series
