25 chunking tricks for RAG that devs actually use

Not another theory-filled blog post: just real chunking strategies that work (or blew up in testing).

Chunking is the unsung hero of RAG pipelines. Done right, it makes your LLM feel sharp and reliable. Done wrong, it spits out hallucinations like “The Eiffel Tower is a vegetable.”

This guide skips the fluff. No recycled diagrams. No “just use LangChain lol.” Just 25 real-world chunking strategies, tested in messy PDFs, gnarly codebases, and live projects.

Some work like magic. Some failed hilariously. You’ll get both.

Let’s chunk smart.

Table of contents

  1. What even is chunking in RAG?
  2. Why bad chunking ruins everything
  3. Core chunking strategies (the classics)
  4. Semantic-aware chunking (AI knows best)
  5. Code and technical docs chunking
  6. Multimodal chunking (images, tables, PDFs)
  7. Application-based chunking (custom jobs)
  8. Tools and libraries that help
  9. Chunking experiments that worked (or hilariously failed)
  10. Clean-up tips and sanity checks
  11. Conclusion + resources + outro

What even is “chunking” in RAG?


Chunking is how you prep documents for retrieval in a RAG system. Since LLMs can’t process entire files (too long, too chaotic), you split your data into smaller, meaningful pieces called chunks that are easy to index, embed, and search.

These chunks get converted into vector embeddings and stored in a vector database like Pinecone, Weaviate, or FAISS. When a user asks a question, the RAG pipeline uses semantic search to fetch the most relevant chunks, then passes those into the LLM as context.

Think of it like Git diffs: instead of comparing whole files, you break them into meaningful diffs that are easier to track, understand, and manage. The same applies to chunking. You’re optimizing for relevance and retrievability, not just size.

When do you actually need chunking?


You need chunking any time you:

  • Feed long-form documents (PDFs, markdown, Notion exports, API docs) into RAG
  • Want to stay within LLM context limits (OpenAI models have ~16k–128k token windows, but that’s not unlimited)
  • Plan to embed and search content in a vector store
  • Work with semi-structured or messy data like scraped web pages, logs, or support chats

Basically: if your LLM’s input context isn’t short, clean, and already scoped, you need chunking. And the way you chunk it affects the final answer the LLM gives. Every time.

Why bad chunking ruins everything


Chunking sounds simple until you realize bad chunks are why your RAG system keeps returning answers that read like a chatbot having a stroke.

The problem? Loss of context. If your chunk cuts off mid-thought or splits a concept in half, the LLM loses the thread. Too much overlap and it becomes repetitive. Not enough? It becomes clueless. And if you’re blindly chunking by word count or tokens without considering the content, yeah, you’re gonna get garbage out.

Real talk: I once built a quick RAG prototype on company policy docs. I used basic 500-token chunks with zero overlap and no cleanup. When asked “What’s the parental leave policy?”, it confidently responded with “Refer to Section 3.1 and the attached ketchup bottle.” No idea where that came from.

That’s what happens when context boundaries are broken, semantically unrelated content is stuck together, and keywords are present without meaning.

Chunking isn’t just about size; it’s about sense.

Core chunking strategies (the classics)


These are the OG methods most people start with. They’re simple, reliable, and work surprisingly well for many cases if you know when to use them.

1. Fixed-size chunking (tokens or words)

  • Break text into equally sized blocks (e.g., 500 tokens or 300 words)
  • Easy to implement with tokenizer tools like [tiktoken](https://github.com/openai/tiktoken)
  • Works well for uniform data like logs or chat transcripts

Downside: Splits can land mid-sentence or mid-paragraph, killing context.
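
A minimal fixed-size chunker built on tiktoken might look like this (the cl100k_base encoding and the 500-token default are example choices, not gospel):

```python
import tiktoken

def fixed_size_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of at most max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```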

2. Sentence-level chunking

  • Split based on sentence boundaries using punctuation or NLP libs like spaCy
  • Chunks make more grammatical sense, improving coherence

Downside: Chunks may be too short to be useful alone.
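
A sketch with spaCy that also softens that downside by packing consecutive sentences into one chunk up to a size budget (max_chars is an assumption to tune):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def sentence_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Group consecutive sentences into chunks of up to max_chars characters."""
    chunks, current = [], ""
    for sent in nlp(text).sents:
        if current and len(current) + len(sent.text) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```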

3. Paragraph-level chunking

  • Treat each paragraph as one chunk
  • Great for structured docs like wikis, blog posts, manuals

Downside: Some paragraphs are massive; others too tiny. Inconsistent retrieval quality.
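
One way to even out those sizes is to merge tiny paragraphs into their neighbor; a rough sketch (min_chars is an arbitrary threshold):

```python
def paragraph_chunks(text: str, min_chars: int = 200) -> list[str]:
    """One chunk per paragraph, folding tiny paragraphs into the previous one."""
    chunks: list[str] = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Headings and one-liners get glued to the chunk before them
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] += "\n\n" + para
        else:
            chunks.append(para)
    return chunks
```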

4. Sliding window chunking

  • Create overlapping chunks: chunk A (tokens 1–500), chunk B (tokens 400–900), etc.
  • Improves continuity and recall at the cost of more chunks

Downside: Increases storage and retrieval load.
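
A sliding window is just a stride smaller than the window; with tiktoken (sizes are illustrative):

```python
import tiktoken

def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Overlapping token windows: each chunk starts (size - overlap) tokens after the last."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), step)]
```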

5. Recursive chunking

  • Try one method (like paragraph splits), then fall back to smaller units if the chunk is too big
  • Supported by tools like LangChain RecursiveCharacterTextSplitter

Use case: When your data is uneven and you want graceful degradation.
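
The core idea fits in one short function, no framework required (the separators and character limit here are illustrative; LangChain’s RecursiveCharacterTextSplitter is a more polished take on the same pattern):

```python
def recursive_chunks(text: str, max_chars: int = 1000,
                     separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                chunks.extend(recursive_chunks(part, max_chars, separators))
            return [c for c in chunks if c.strip()]
    # No separator worked: hard-split as a last resort
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```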

These methods are simple, but they still form the foundation for most chunking pipelines. Don’t dismiss them just because they’re old-school; they’re predictable and fast.

Semantic-aware chunking (AI knows best)


Classic chunking works fine… until your content stops being regular. Think messy blogs, technical docs, or transcripts. That’s where semantic chunking steps in: splitting text based on meaning, not just size.

Here are a few strategies devs actually use when they want their chunks to make sense:

6. Sentence similarity-based chunking

  • Use cosine similarity between sentences to group semantically related ideas
  • Great for narratives, tutorials, or anything with logical flow
  • Often done using sentence transformers like all-MiniLM

Ideal when you want each chunk to contain a full idea or sub-topic.
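
A minimal sketch with sentence-transformers: embed every sentence, then start a new chunk whenever adjacent sentences drift apart (the 0.5 threshold is a knob to tune on your own data):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    """New chunk wherever consecutive sentences fall below the similarity threshold."""
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Dot product of normalized vectors == cosine similarity
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```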

7. Topic segmentation (TextTiling, Topical BERT)

  • Use NLP models to detect topic shifts and create boundaries
  • TextTiling is classic, [Topical Segmenter](https://github.com/zelandiya/topical-segmentation) is newer
  • Super useful for long-form content with multiple subtopics (e.g., research papers)
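
TextTiling ships with NLTK, so it’s cheap to try (long_document stands in for your own text, and the algorithm expects blank-line paragraph breaks):

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")  # TextTiling relies on NLTK's stopword list

tt = TextTilingTokenizer()
topic_chunks = tt.tokenize(long_document)  # one chunk per detected topic segment
```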

8. Graph-based segmentation

  • Build a semantic graph of sentences, find natural break points where topic flow changes
  • More complex but can give amazing results on academic or deeply structured content

Feels overkill for FAQs, goldmine for research-heavy domains.

9. Embedding-aware chunking

  • Use the embedding distance between chunks to decide where to split
  • Detects shifts in meaning at vector level
  • Can be paired with a threshold-based splitter or adaptive window
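
A sketch of an adaptive-threshold splitter: embed your candidate pieces (sentences or paragraphs), measure the distance between neighbors, and break wherever it spikes. The 90th-percentile cutoff is an assumption to tune:

```python
import numpy as np

def embedding_breakpoints(embeddings: np.ndarray, percentile: float = 90) -> list[int]:
    """Return indices where consecutive-piece distance jumps above the cutoff."""
    # Cosine distance between each pair of consecutive embeddings
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1 - np.sum(norms[:-1] * norms[1:], axis=1)
    cutoff = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d > cutoff]
```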

Semantic chunking costs more compute, but pays off in quality. You’re not just throwing chunks into a vector DB; you’re feeding the LLM coherent, complete thoughts.

Chunking for code and technical docs


Trying to chunk code like it’s prose? Congrats, you just created the worst autocomplete ever.

Code has a different structure. You can’t just slice by character count or paragraph breaks; that’ll split function() declarations, orphan braces, and destroy context. Same with docs like Swagger specs, Markdown-heavy READMEs, or API manuals.

Here’s how to chunk code intelligently:

10. Function-based chunking

  • Extract one function per chunk (def, function, etc.)
  • Keeps logic together and makes retrieval relevant
  • Works great in Python, JavaScript, Go, etc.
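
In Python, the ast module does the heavy lifting. This sketch also covers #11’s class-level grouping via ast.ClassDef, and ast.get_docstring(node) gets you #13’s docstring-first variant:

```python
import ast

def function_chunks(source: str) -> list[str]:
    """One chunk per top-level function or class, using Python's own parser."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's exact source span (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```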

11. Class/module-based chunking

  • Group related functions under a class or file
  • Useful for class-heavy languages like Java, TypeScript, or C++

Watch out for massive files; they may still need fallback chunking.

12. Comment-to-code chunking

  • Chunk together comment blocks + the function or code they describe
  • Improves semantic search, since comments often contain user-facing terms

13. Docstring-first chunking

  • Prioritize docstrings and chunk surrounding logic with them
  • LLMs love docstrings; they’re like cheat codes for retrieval

Bonus: Markdown-aware splitting

  • For docs like README.md, chunk by headers (##) and subheaders (###)
  • Keeps logical sections together and helps with clean context injection
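
In recent LangChain versions this is one splitter away (readme_text stands in for your own Markdown string):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "section"), ("###", "subsection")]
)
docs = splitter.split_text(readme_text)  # each chunk carries its header path as metadata
```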

TL;DR: Treat code like code. If you split logic in half, you confuse the LLM and yourself.

Multimodal chunking (images, tables, PDFs)


Not all data is neat UTF-8 text. Sometimes you’re chunking scanned PDFs, HTML pages, tables, graphs, or slides that were exported by someone who hates future developers.

Here’s how to avoid turning your RAG pipeline into a dumpster fire when handling mixed content:

14. Table-aware chunking

  • Extract tables as complete units, not as rows or raw text
  • Use pandas, pdfplumber, or Camelot to preserve structure
  • You want to retain the layout and headers for context
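
A sketch with pdfplumber that keeps each table intact as one pipe-delimited chunk, header row and all:

```python
import pdfplumber

def table_chunks(pdf_path: str) -> list[str]:
    """Serialize every table in the PDF as its own chunk."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                # Cells can be None; join rows with a visible delimiter
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                chunks.append("\n".join(rows))
    return chunks
```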

15. OCR + layout-aware splitting

  • Use OCR tools like [Tesseract](https://github.com/tesseract-ocr/tesseract) or LayoutLM for scanned docs
  • Preserve visual blocks (columns, headers, footers)
  • Especially useful for resumes, contracts, invoices

Pro tip: PDFs aren’t real documents; they’re screenshots with trust issues.
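
A minimal OCR pass with pytesseract (the filename is a placeholder); for real layout awareness you’d group the word-level boxes that image_to_data returns into visual blocks:

```python
import pytesseract
from PIL import Image

# Plain text extraction; needs the tesseract binary installed
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
```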

16. Image/document section segmentation

  • Use vision encoders (e.g., CLIP, Donut) to segment visual content into logical regions
  • Chunk each region as a self-contained “semantic image chunk”
  • Works well for UI screenshots, infographics, slide decks

17. Hybrid chunking for HTML

  • Chunk by <h1>, <h2>, <p>, <table>, etc.
  • Retain DOM structure, skip nav/ads
  • Use tools like BeautifulSoup, html2text, or trafilatura
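
A rough header-based HTML chunker with BeautifulSoup: strip nav and boilerplate, then cut a new chunk at every heading:

```python
from bs4 import BeautifulSoup

def html_chunks(html: str) -> list[str]:
    """One chunk per heading section: the heading plus everything up to the next one."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "script", "style", "footer"]):  # drop nav/ads/scripts
        tag.decompose()
    chunks, current = [], []
    for el in soup.find_all(["h1", "h2", "h3", "p", "table", "li"]):
        if el.name in ("h1", "h2", "h3") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(el.get_text(" ", strip=True))
    if current:
        chunks.append("\n".join(current))
    return chunks
```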

Chunking isn’t just about size here; it’s about preserving layout, hierarchy, and meaning. If the model can’t “see” what’s grouped together, it won’t reason about it.

Application-based chunking (custom jobs)


Some data just refuses to behave. A generic chunker won’t cut it when you’re dealing with legal contracts, medical records, customer support chats, or anything with domain-specific structure. Here, you chunk with context in mind, not just structure or size.

18. Legal documents → Section-aware chunking

  • Use heading detection (e.g., “Section 4.2 Liability”) to split at logical contract sections
  • Retain clause numbers and sub-clauses together
  • Don’t split mid-paragraph; bad legal advice awaits
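
Heading detection can be as simple as a regex split; the “Section 4.2” pattern below is an assumption and will need tuning per contract style:

```python
import re

# Zero-width split: break *before* each "Section N(.N...)" heading
SECTION_RE = re.compile(r"(?m)^(?=Section\s+\d+(?:\.\d+)*\b)")

def legal_chunks(contract_text: str) -> list[str]:
    """Split at section headings so clauses stay with their sub-clauses."""
    return [p.strip() for p in SECTION_RE.split(contract_text) if p.strip()]
```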

19. Medical data → Entity + sentence-level

  • Combine sentence-level splits with key medical entities (e.g., drugs, symptoms, dosages)
  • Use clinical NLP models like scispaCy
  • Crucial for preserving diagnostic relationships

20. Customer support → Intent-based chunking

  • Chunk by resolved intents: each Q&A pair or resolved case is one unit
  • Group similar problems with their resolutions
  • Improves retrieval accuracy for chatbot or ticket-based RAG

21. Research papers → Abstract + section chunking

  • Chunk by “Abstract,” “Methods,” “Results,” etc.
  • Preserve citations as a unit with their explanation
  • Tools like [Grobid](https://github.com/kermitt2/grobid) can help structure papers

When you’re working in a specific domain, chunking must reflect how humans navigate the content, not just what fits in 500 tokens.

Tools and libraries that help


Look, you can write your own chunker from scratch, regex and all. Or you can use one of these libraries that already figured it out, tested it, and wrapped it in a function with a name like split_documents().

Here are some tools devs actually use:

22. LangChain

  • Offers RecursiveCharacterTextSplitter, TokenTextSplitter, and more
  • Handles fallback strategies and overlapping chunks
  • Good starting point for structured and unstructured text
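
Typical usage looks like this (document_text stands in for your own string; the sizes are defaults to tune):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(document_text)
```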

23. LlamaIndex (formerly GPT Index)

  • High-level abstractions for chunking + indexing
  • Comes with SentenceSplitter, SemanticSplitter, and support for PDFs, HTML, Markdown, etc.
  • Especially nice when paired with retrieval eval tools
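
In recent llama-index versions, a sentence-aware splitter is similarly terse (the import path has moved between releases, so treat this as a sketch):

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # document_text is your own string
```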

24. Haystack

  • Modular RAG framework with preprocessing pipelines
  • Use it with Hugging Face models or OpenAI
  • Strong PDF and OCR support for chunking weird docs

25. OpenAI Tokenizer Viewer

  • Helps visualize how text will be tokenized before you chunk
  • Use this to debug token splits in fixed-size strategies
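
If you’d rather debug in code than in a browser, tiktoken shows you the same token boundaries:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode("Chunking isn't just about size")])
```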

Honorable mentions

  • spaCy – great for sentence parsing
  • [pdfplumber](https://github.com/jsvine/pdfplumber) – table-aware PDF parsing
  • [trafilatura](https://github.com/adbar/trafilatura) – web text + HTML cleaner with structural awareness

These tools don’t just save time; they prevent you from burning hours on chunking edge cases that someone else already solved.

Chunking experiments that worked (or hilariously failed)


Chunking isn’t just theory; it’s trial, error, and screaming into your terminal.

Here are a few real stories (yes, tested on actual projects) that show what can go right or spectacularly wrong.

Worked: Recursive + overlap = clean dev wiki

  • Chunked a Notion wiki using recursive paragraph split, fallback to sentence, with 20% overlap
  • Retrieval relevance jumped ~30% with fewer hallucinations
  • Overlap helped with questions that landed between topics

Failed: Fixed-token chunks on legal docs

  • Tried 500-token chunks with no structure awareness
  • Retrieval kept surfacing “Definitions” section for everything
  • Outputs: legally impressive nonsense

Worked: Markdown-aware chunking for tech blog archive

  • Split by headers (##, ###) in Markdown
  • Each chunk became a clean, isolated sub-topic
  • Made fine-tuned RAG assistant way more accurate for tutorials

Failed: Sentence-level chunking for source code

  • Chunked Python scripts sentence by sentence (why? idk, it was late)
  • Retrieval returned “def”, “return”, and “if” as separate answers
  • Worse than useless: it was confident and wrong

Some of these taught us more than any tutorial ever could. Chunking is like tuning hyperparameters: you don’t really know what works until it breaks.

Clean-up tips and sanity checks


You’ve chunked your docs. They’re in the vector DB. You run your first query…

…and the results make less sense than a regex written under duress.

Before blaming the LLM, check your chunk hygiene. Here’s how:

1. Watch for cutoff chunks


Chunks ending mid-sentence or mid-thought? That’s how you get hallucinations. Use sentence or semantic splitters if you’re seeing garbage completions.

2. Tune your overlap percentage


Too little overlap = broken context. Too much = bloated retrieval with duplicate answers.
Start with 10–20% overlap and adjust based on output quality.

3. Validate with actual retrieval tests


Don’t just chunk and ship. Use a few test questions and verify:

  • Are the right chunks being retrieved?
  • Do answers change if chunking config changes?

Use tools like:

  • llama-index retriever playground
  • [RAGAS](https://github.com/explodinggradients/ragas) for retrieval evals

4. Clean your input before you chunk


Remove:

  • Extra whitespace
  • Table of contents
  • Footers and headers (especially in PDFs)
  • Boilerplate sections repeated across docs

Garbage in = garbage chunked = garbage out.
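
A pre-chunk cleanup pass can be a handful of regexes; the boilerplate patterns are yours to supply per corpus:

```python
import re

def clean_before_chunking(text: str, boilerplate: list[str]) -> str:
    """Strip repeated headers/footers and normalize whitespace before chunking."""
    for pattern in boilerplate:  # e.g., r"Page \d+ of \d+", a company footer line
        text = re.sub(pattern, "", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse excess blank lines
    return text.strip()
```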

These last-mile details are what separate a “cool prototype” from an actually useful RAG system.

Conclusion: Chunk smarter, not harder


If you made it this far, congrats: you now know more about chunking than 95% of devs throwing PDFs at vector DBs and praying for answers.

To recap:

  • There’s no “one right” chunking method.
  • Classic strategies still work if you know when to use them.
  • Semantic, code-aware, and layout-aware chunking can massively improve retrieval.
  • Bad chunking quietly ruins everything, so test often and tune like a backend config.

You don’t need a PhD to chunk well; you just need to stop slicing everything at 500 tokens like it’s 2022.
