Day 9 · Overreliance on reranker (No.5, No.6)

PSBigBig
most teams flip on a shiny reranker and the offline chart jumps. then real traffic arrives and the lift melts. if the base space is unhealthy, a reranker only hides the pain. this writeup is the minimal path to prove that, fix the base, then keep reranking as light polish.

a quick story to set context


we had a product faq bot. cross-encoder reranker looked great on 30 handpicked questions. in prod, small paraphrases flipped answers. reading traces showed citations pointing to generic intros, not the exact span. turning off rerank exposed the truth. the raw top-k almost never covered the right section. geometry was wrong. chunks were messy. we were living in No.5 and occasionally No.6 when synthesis tried to “fill in” gaps.

60-second ablation that tells you the truth


  1. run the same question twice
    1.1 retriever only
    1.2 retriever then reranker


  2. record three numbers
    coverage of the target section in top-k
    ΔS(question, retrieved)
    citations per atomic claim


  3. label
    low coverage without rerank that “magically” improves only after rerank → No.5 Semantic ≠ Embedding
    coverage ok but prose still drifts or merges extra claims → No.6 Logic Collapse


  4. stability
    ask three paraphrases. if labels or answers alternate, the chain is unstable. reranker is masking the base failure.

rules of thumb (the sketch below wires these checks together)
coverage before rerank ≥ 0.70
ΔS ≤ 0.45 for stable chains
one valid citation per atomic claim
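
a minimal sketch of the ablation loop, assuming hypothetical retrieve(q, k), rerank(q, candidates), and embed(text) callables plus a small set of gold section ids per question. it records coverage before and after rerank, ΔS, and a first-pass label; citation checks come later with per_claim_ok.

Code:
import numpy as np

def _delta_s(q_vec, ctx_vec):
    # same probe as the utility at the end of this post: 1 - cosine
    q = q_vec / np.linalg.norm(q_vec)
    c = ctx_vec / np.linalg.norm(ctx_vec)
    return float(1.0 - np.dot(q, c))

def ablate(question, paraphrases, gold_ids, retrieve, rerank, embed, k=10):
    rows = []
    for q in [question] + list(paraphrases):
        pool = retrieve(q, k=50)                 # wider candidate pool of (snippet_id, text)
        base = pool[:k]                          # retriever-only top-k
        rr = rerank(q, pool)[:k]                 # reranked top-k from the same pool
        denom = max(1, min(k, len(gold_ids)))
        cov_base = sum(1 for sid, _ in base if sid in gold_ids) / denom
        cov_rr = sum(1 for sid, _ in rr if sid in gold_ids) / denom
        ds = _delta_s(embed(q), embed(" ".join(t for _, t in base)))
        if cov_base < 0.70 and cov_rr >= 0.70:
            label = "No.5 Semantic != Embedding"     # rerank is masking bad geometry
        elif cov_base >= 0.70:
            label = "check No.6 Logic Collapse"      # coverage ok, read the prose for drift
        else:
            label = "base retrieval failing"
        rows.append({"q": q, "cov_base": round(cov_base, 2), "cov_rr": round(cov_rr, 2),
                     "delta_s": round(ds, 2), "label": label})
    return rows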

what overreliance looks like in traces

  • base top-k rarely contains the true span. reranker promotes “sounds right” text
  • small header or boilerplate chunks dominate retrieval candidates
  • cosine vs L2 setup is mixed across shards. norms inconsistent
  • offline tables show nice MRR but human readers cannot match citations to spans
  • with rerank off, answers alternate across runs on paraphrases
  • model “repairs” missing evidence instead of pausing for it

root causes to check first

  • metric and normalization mismatch between corpus and queries (a quick norm check follows this list)
  • chunking to embedding contract missing. no stable snippet id, section id, offsets
  • vectorstore fragmentation. near-duplicates split the same fact across ids
  • reranker objective favors generic summaries over tight claim-aligned spans
  • eval set is tiny and biased toward reranker behavior
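
the first root cause is the cheapest to check. a quick diagnostic sketch, assuming you can pull a sample of corpus and query vectors as numpy arrays; the variable names are illustrative.

Code:
import numpy as np

def norm_report(vecs, name):
    # row-wise L2 norm stats for a sample of embeddings
    n = np.linalg.norm(np.asarray(vecs, dtype="float32"), axis=1)
    print(f"{name}: min={n.min():.3f} mean={n.mean():.3f} max={n.max():.3f}")

# usage (illustrative): norm_report(corpus_sample, "corpus")
#                       norm_report(query_sample, "queries")
# if one side sits at 1.000 and the other is spread out, build and query
# are not sharing a normalization policy and cosine vs L2 will disagree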

minimal fix path


goal: make the base space trustworthy, then keep reranking as a gentle, auditable layer.

  1. align metric and normalization. keep one metric policy across build and query. for cosine-style retrieval, L2-normalize both sides and use a consistent index.

Code:
from sklearn.preprocessing import normalize
Z = normalize(Z, axis=1).astype("float32")   # corpus
Q = normalize(Q, axis=1).astype("float32")   # queries
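
a minimal sketch of the consistent-index half, assuming FAISS; on unit vectors inner product equals cosine, so build and query share one metric. any store works, the point is a single metric policy end to end.

Code:
import faiss

d = Z.shape[1]                      # Z, Q are the normalized float32 arrays above
index = faiss.IndexFlatIP(d)        # inner product on unit vectors == cosine
index.add(Z)
scores, ids = index.search(Q, 10)   # top-10 candidate ids per query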

  2. enforce the chunk → embed contract
    mask boilerplate, keep window sizes consistent with your model, emit snippet_id, section_id, offsets, tokens.
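
a minimal sketch of that contract, field names illustrative rather than any fixed schema: every embedded chunk carries enough metadata to trace a citation back to an exact span.

Code:
def make_chunk_record(doc_id, section_id, seq, text, start, end):
    # one embeddable unit plus the metadata needed to audit citations later
    return {
        "snippet_id": f"{doc_id}:{section_id}:{seq}",
        "section_id": section_id,
        "offsets": [start, end],        # character offsets into the source document
        "tokens": len(text.split()),    # rough count, swap in your real tokenizer
        "text": text,
    }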


  3. add a coverage gate before rerank
    if base coverage is below 0.70, do not rerank. return a short bridge plan that asks for a better retrieval pass or more context.

Code:
def coverage_ok(candidates, target_ids, k=10, th=0.70):
    # share of expected snippet ids that show up in the top-k candidates
    hits = sum(1 for i in candidates[:k] if i in target_ids)
    denom = max(1, min(k, len(target_ids)))   # cap by k so short target lists are not penalized
    return hits / float(denom) >= th
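
wiring the gate in, a sketch with hypothetical retrieve() and rerank() callables; target_ids are the sections you expect for this question (from your eval set or router). when base coverage misses the bar, skip the reranker and return a bridge plan instead of a polished-looking answer.

Code:
def gated_answer(question, target_ids, retrieve, rerank, k=10):
    candidates = retrieve(question, k=k)          # list of snippet ids
    if not coverage_ok(candidates, target_ids, k=k):
        return {"status": "bridge",
                "plan": "coverage below 0.70: widen retrieval, fix chunking, or ask for more context",
                "candidates": candidates[:k]}
    return {"status": "ok", "ranked": rerank(question, candidates)[:k]}
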
  4. lock cite-then-explain. fail fast when any claim lacks in-scope citations.

Code:
def per_claim_ok(payload, allowed):
    # payload: list of claim dicts, each carrying a "citations" list
    # allowed: in-scope snippet ids for this answer
    bad = [i for i,c in enumerate(payload)
           if not c.get("citations") or not set(c["citations"]) <= set(allowed)]
    return {"ok": not bad, "bad_claims": bad}
  5. keep reranking for span alignment only. prefer claim-aligned spans over generic summaries, and record rerank scores next to citations for auditing.
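
a small sketch of that audit trail, field names illustrative: keep the rerank score next to each cited span so a reviewer can see why a span won.

Code:
def audit_rows(question, reranked):
    # reranked: list of (snippet_id, span_text, score) tuples from the reranker
    return [{"question": question,
             "snippet_id": sid,
             "span": span[:200],                       # preview only
             "rerank_score": round(float(score), 4)}
            for sid, span, score in reranked]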

when minimal is not enough

  • rebuild the index from clean embeddings with a single metric policy
  • retrain IVF or PQ codebooks after dedup and boilerplate masking
  • collapse near-duplicates before indexing
  • add a sparse leg and fuse simply when exact terms matter (a fusion sketch follows this list)
  • if you must cross-encode, cap its influence and keep the base candidate set healthy
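
for the sparse leg, a minimal fusion sketch using reciprocal rank fusion; it needs no score calibration between the dense and sparse sides, and the constant 60 is the commonly used default rather than a tuned value. BM25 or any keyword retriever can supply sparse_ids.

Code:
def rrf_fuse(dense_ids, sparse_ids, k=60, top_n=10):
    # reciprocal rank fusion: each id scores 1/(k + rank) in every list it appears in
    scores = {}
    for ids in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ids):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]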

tiny utilities you can paste


base vs rerank lift


Code:
def lift_at_k(gt_ids, base_ids, rr_ids, k=10):
    base_hit = int(any(x in gt_ids for x in base_ids[:k]))
    rr_hit   = int(any(x in gt_ids for x in rr_ids[:k]))
    return {"base_hit": base_hit, "rr_hit": rr_hit, "lift": rr_hit - base_hit}

neighbor overlap sanity


Code:
def overlap_at_k(a_ids, b_ids, k=20):
    a, b = set(a_ids[:k]), set(b_ids[:k])
    return len(a & b) / float(k)   # healthy spaces sit well below 0.35

minimal ΔS probe


Code:
import numpy as np
def delta_s(q, r):
    # 1 - cosine similarity between the question and retrieved-context embeddings
    q = q / np.linalg.norm(q)
    r = r / np.linalg.norm(r)
    return float(1.0 - np.dot(q, r))

acceptance before you call it fixed

  • base top-k covers the target section at 0.70 or higher
  • ΔS at or below 0.45 across three paraphrases
  • every claim has an in-scope citation id
  • reranker provides positive lift without being required for correctness (the sketch below folds these checks into one gate)
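
a combined gate, sketched on top of the utilities in this post plus the numbers you already collect per question; it just turns the four bullets into one pass/fail record.

Code:
def acceptance(cov_base, ds_values, claims_ok, lift):
    # cov_base: base top-k coverage; ds_values: ΔS across three paraphrases
    # claims_ok: output of per_claim_ok; lift: output of lift_at_k
    checks = {
        "coverage":  cov_base >= 0.70,
        "delta_s":   max(ds_values) <= 0.45,
        "citations": claims_ok["ok"],
        "rerank":    lift["base_hit"] == 1 and lift["rr_hit"] == 1,   # base finds it, rerank does not hurt
    }
    return {"pass": all(checks.values()), "checks": checks}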

tldr


rerankers are polish, not crutches. fix metric and normalization, fix chunk contracts, demand coverage and citations, then let the reranker nudge spans into place. call it No.5 when geometry is wrong, and No.6 when synthesis still drifts after coverage is healthy.

full writeup and the rest of the series live here
Problem Map article series
