25 chunking tricks for RAG that devs actually use

Not another theory-filled blog post: just real chunking strategies that work (or blew up in testing).

Chunking is the unsung hero of RAG pipelines. Done right, it makes your LLM feel sharp and reliable. Done wrong, it spits out hallucinations like “The Eiffel Tower is a vegetable.”

This guide skips the fluff. No recycled diagrams. No “just use LangChain lol.” Just 25 real-world chunking strategies, tested in messy PDFs, gnarly codebases, and live projects.

Some work like magic. Some failed hilariously. You’ll get both.

Let’s chunk smart.

Table of contents

  1. What even is chunking in RAG?
  2. Why bad chunking ruins everything
  3. Core chunking strategies (the classics)
  4. Semantic-aware chunking (AI knows best)
  5. Code and technical docs chunking
  6. Multimodal chunking (images, tables, PDFs)
  7. Application-based chunking (custom jobs)
  8. Tools and libraries that help
  9. Chunking experiments that worked (or hilariously failed)
  10. Clean-up tips and sanity checks
  11. Conclusion + resources + outro

What even is “chunking” in RAG?


Chunking is how you prep documents for retrieval in a RAG system. Since LLMs can’t process entire files (too long, too chaotic), you split your data into smaller, meaningful pieces called chunks that are easy to index, embed, and search.

These chunks get converted into vector embeddings and stored in a vector database like Pinecone, Weaviate, or FAISS. When a user asks a question, the RAG pipeline uses semantic search to fetch the most relevant chunks, then passes those into the LLM as context.

Think of it like Git diffs: instead of comparing whole files, you break them into meaningful diffs that are easier to track, understand, and manage. The same applies to chunking. You’re optimizing for relevance and retrievability, not just size.

When do you actually need chunking?


You need chunking any time you:

  • Feed long-form documents (PDFs, markdown, Notion exports, API docs) into RAG
  • Want to stay within LLM context limits (OpenAI models have ~16k–128k token windows, but that’s not unlimited)
  • Plan to embed and search content in a vector store
  • Work with semi-structured or messy data like scraped web pages, logs, or support chats

Basically: if your LLM’s input context isn’t short, clean, and already scoped, you need chunking. And the way you chunk it affects the final answer the LLM gives. Every time.

Why bad chunking ruins everything


Chunking sounds simple until you realize bad chunks are why your RAG system keeps returning answers that read like a chatbot having a stroke.

The problem? Loss of context. If your chunk cuts off mid-thought or splits a concept in half, the LLM loses the thread. Too much overlap and it becomes repetitive. Not enough? It becomes clueless. And if you’re blindly chunking by word count or tokens without considering the content, yeah, you’re gonna get garbage out.

Real talk: I once built a quick RAG prototype on company policy docs. I used basic 500-token chunks with zero overlap and no cleanup. When asked “What’s the parental leave policy?”, it confidently responded with “Refer to Section 3.1 and the attached ketchup bottle.” No idea where that came from.

That’s what happens when context boundaries are broken, semantically unrelated content is stuck together, and keywords are present without meaning.

Chunking isn’t just about size; it’s about sense.

Core chunking strategies (the classics)


These are the OG methods most people start with. They’re simple, reliable, and work surprisingly well for many cases if you know when to use them.

1. Fixed-size chunking (tokens or words)

  • Break text into equally sized blocks (e.g., 500 tokens or 300 words)
  • Easy to implement with tokenizer tools like [tiktoken](https://github.com/openai/tiktoken)
  • Works well for uniform data like logs or chat transcripts

Downside: Splits can land mid-sentence or mid-paragraph, killing context.
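
A minimal fixed-size chunker built on tiktoken might look like this (the cl100k_base encoding and the 500-token default are example choices, not gospel):

```python
import tiktoken

def fixed_size_chunks(text: str, max_tokens: int = 500) -> list[str]:
    """Split text into chunks of at most max_tokens tokens."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```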

2. Sentence-level chunking

  • Split based on sentence boundaries using punctuation or NLP libs like spaCy
  • Chunks make more grammatical sense, improving coherence

Downside: Chunks may be too short to be useful alone.
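
A sketch with spaCy that also softens that downside by packing consecutive sentences into one chunk up to a size budget (max_chars is an assumption to tune):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def sentence_chunks(text: str, max_chars: int = 800) -> list[str]:
    """Group consecutive sentences into chunks of up to max_chars characters."""
    chunks, current = [], ""
    for sent in nlp(text).sents:
        if current and len(current) + len(sent.text) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += sent.text + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks
```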

3. Paragraph-level chunking

  • Treat each paragraph as one chunk
  • Great for structured docs like wikis, blog posts, manuals

Downside: Some paragraphs are massive; others too tiny. Inconsistent retrieval quality.
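
One way to even out those sizes is to merge tiny paragraphs into their neighbor; a rough sketch (min_chars is an arbitrary threshold):

```python
def paragraph_chunks(text: str, min_chars: int = 200) -> list[str]:
    """One chunk per paragraph, folding tiny paragraphs into the previous one."""
    chunks: list[str] = []
    for para in text.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        # Headings and one-liners get glued to the chunk before them
        if chunks and len(chunks[-1]) < min_chars:
            chunks[-1] += "\n\n" + para
        else:
            chunks.append(para)
    return chunks
```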

4. Sliding window chunking

  • Create overlapping chunks: chunk A (tokens 1–500), chunk B (tokens 400–900), etc.
  • Improves continuity and recall at the cost of more chunks

Downside: Increases storage and retrieval load.
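
A sliding window is just a stride smaller than the window; with tiktoken (sizes are illustrative):

```python
import tiktoken

def sliding_window_chunks(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Overlapping token windows: each chunk starts (size - overlap) tokens after the last."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i : i + size]) for i in range(0, len(tokens), step)]
```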

5. Recursive chunking

  • Try one method (like paragraph splits), then fall back to smaller units if the chunk is too big
  • Supported by tools like LangChain RecursiveCharacterTextSplitter

Use case: When your data is uneven and you want graceful degradation.
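
The core idea fits in one short function, no framework required (the separators and character limit here are illustrative; LangChain’s RecursiveCharacterTextSplitter is a more polished take on the same pattern):

```python
def recursive_chunks(text: str, max_chars: int = 1000,
                     separators: tuple[str, ...] = ("\n\n", "\n", ". ", " ")) -> list[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        if sep in text:
            chunks: list[str] = []
            for part in text.split(sep):
                chunks.extend(recursive_chunks(part, max_chars, separators))
            return [c for c in chunks if c.strip()]
    # No separator worked: hard-split as a last resort
    return [text[i : i + max_chars] for i in range(0, len(text), max_chars)]
```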

These methods are simple, but they still form the foundation for most chunking pipelines. Don’t dismiss them just because they’re old-school; they’re predictable and fast.

Semantic-aware chunking (AI knows best)


Classic chunking works fine… until your content stops being regular. Think messy blogs, technical docs, or transcripts. That’s where semantic chunking steps in: splitting text based on meaning, not just size.

Here are a few strategies devs actually use when they want their chunks to make sense:

6. Sentence similarity-based chunking

  • Use cosine similarity between sentences to group semantically related ideas
  • Great for narratives, tutorials, or anything with logical flow
  • Often done using sentence transformers like all-MiniLM

Ideal when you want each chunk to contain a full idea or sub-topic.
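
A minimal sketch with sentence-transformers: embed every sentence, then start a new chunk whenever adjacent sentences drift apart (the 0.5 threshold is a knob to tune on your own data):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def similarity_chunks(sentences: list[str], threshold: float = 0.5) -> list[str]:
    """New chunk wherever consecutive sentences fall below the similarity threshold."""
    emb = model.encode(sentences, normalize_embeddings=True)
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        # Dot product of normalized vectors == cosine similarity
        if float(np.dot(emb[i - 1], emb[i])) < threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```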

7. Topic segmentation (TextTiling, Topical BERT)

  • Use NLP models to detect topic shifts and create boundaries
  • TextTiling is classic, [Topical Segmenter](https://github.com/zelandiya/topical-segmentation) is newer
  • Super useful for long-form content with multiple subtopics (e.g., research papers)
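
TextTiling ships with NLTK, so it’s cheap to try (long_document stands in for your own text, and the algorithm expects blank-line paragraph breaks):

```python
import nltk
from nltk.tokenize import TextTilingTokenizer

nltk.download("stopwords")  # TextTiling relies on NLTK's stopword list

tt = TextTilingTokenizer()
topic_chunks = tt.tokenize(long_document)  # one chunk per detected topic segment
```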

8. Graph-based segmentation

  • Build a semantic graph of sentences, find natural break points where topic flow changes
  • More complex but can give amazing results on academic or deeply structured content

Feels overkill for FAQs, goldmine for research-heavy domains.

9. Embedding-aware chunking

  • Use the embedding distance between chunks to decide where to split
  • Detects shifts in meaning at vector level
  • Can be paired with a threshold-based splitter or adaptive window
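
A sketch of an adaptive-threshold splitter: embed your candidate pieces (sentences or paragraphs), measure the distance between neighbors, and break wherever it spikes. The 90th-percentile cutoff is an assumption to tune:

```python
import numpy as np

def embedding_breakpoints(embeddings: np.ndarray, percentile: float = 90) -> list[int]:
    """Return indices where consecutive-piece distance jumps above the cutoff."""
    # Cosine distance between each pair of consecutive embeddings
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1 - np.sum(norms[:-1] * norms[1:], axis=1)
    cutoff = np.percentile(dists, percentile)
    return [i + 1 for i, d in enumerate(dists) if d > cutoff]
```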

Semantic chunking costs more compute, but pays off in quality. You’re not just throwing chunks into a vector DB; you’re feeding the LLM coherent, complete thoughts.

Chunking for code and technical docs


Trying to chunk code like it’s prose? Congrats, you just created the worst autocomplete ever.

Code has a different structure. You can’t just slice by character count or paragraph breaks; that’ll split function() declarations, orphan braces, and destroy context. Same with docs like Swagger specs, Markdown-heavy READMEs, or API manuals.

Here’s how to chunk code intelligently:

10. Function-based chunking

  • Extract one function per chunk (def, function, etc.)
  • Keeps logic together and makes retrieval relevant
  • Works great in Python, JavaScript, Go, etc.
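
In Python, the ast module does the heavy lifting. This sketch also covers #11’s class-level grouping via ast.ClassDef, and ast.get_docstring(node) gets you #13’s docstring-first variant:

```python
import ast

def function_chunks(source: str) -> list[str]:
    """One chunk per top-level function or class, using Python's own parser."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # lineno/end_lineno give the node's exact source span (Python 3.8+)
            chunks.append("\n".join(lines[node.lineno - 1 : node.end_lineno]))
    return chunks
```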

11. Class/module-based chunking

  • Group related functions under a class or file
  • Useful for class-heavy languages like Java, TypeScript, or C++

Watch out for massive files; they may still need fallback chunking.

12. Comment-to-code chunking

  • Chunk together comment blocks + the function or code they describe
  • Improves semantic search, since comments often contain user-facing terms

13. Docstring-first chunking

  • Prioritize docstrings and chunk surrounding logic with them
  • LLMs love docstrings; they’re like cheat codes for retrieval

Bonus: Markdown-aware splitting

  • For docs like README.md, chunk by headers (##) and subheaders (###)
  • Keeps logical sections together and helps with clean context injection
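
In recent LangChain versions this is one splitter away (readme_text stands in for your own Markdown string):

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("##", "section"), ("###", "subsection")]
)
docs = splitter.split_text(readme_text)  # each chunk carries its header path as metadata
```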

TL;DR: Treat code like code. If you split logic in half, you confuse the LLM and yourself.

Multimodal chunking (images, tables, PDFs)


Not all data is neat UTF-8 text. Sometimes you’re chunking scanned PDFs, HTML pages, tables, graphs, or slides that were exported by someone who hates future developers.

Here’s how to avoid turning your RAG pipeline into a dumpster fire when handling mixed content:

14. Table-aware chunking

  • Extract tables as complete units, not as rows or raw text
  • Use pandas, pdfplumber, or Camelot to preserve structure
  • You want to retain the layout and headers for context
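
A sketch with pdfplumber that keeps each table intact as one pipe-delimited chunk, header row and all:

```python
import pdfplumber

def table_chunks(pdf_path: str) -> list[str]:
    """Serialize every table in the PDF as its own chunk."""
    chunks = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                # Cells can be None; join rows with a visible delimiter
                rows = [" | ".join(cell or "" for cell in row) for row in table]
                chunks.append("\n".join(rows))
    return chunks
```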

15. OCR + layout-aware splitting

  • Use OCR tools like [Tesseract](https://github.com/tesseract-ocr/tesseract) or LayoutLM for scanned docs
  • Preserve visual blocks (columns, headers, footers)
  • Especially useful for resumes, contracts, invoices

Pro tip: PDFs aren’t real documents; they’re screenshots with trust issues.
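
A minimal OCR pass with pytesseract (the filename is a placeholder); for real layout awareness you’d group the word-level boxes that image_to_data returns into visual blocks:

```python
import pytesseract
from PIL import Image

# Plain text extraction; needs the tesseract binary installed
text = pytesseract.image_to_string(Image.open("scanned_page.png"))
```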

16. Image/document section segmentation

  • Use vision encoders (e.g., CLIP, Donut) to segment visual content into logical regions
  • Chunk each region as a self-contained “semantic image chunk”
  • Works well for UI screenshots, infographics, slide decks

17. Hybrid chunking for HTML

  • Chunk by <h1>, <h2>, <p>, <table>, etc.
  • Retain DOM structure, skip nav/ads
  • Use tools like BeautifulSoup, html2text, or trafilatura
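
A rough header-based HTML chunker with BeautifulSoup: strip nav and boilerplate, then cut a new chunk at every heading:

```python
from bs4 import BeautifulSoup

def html_chunks(html: str) -> list[str]:
    """One chunk per heading section: the heading plus everything up to the next one."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["nav", "script", "style", "footer"]):  # drop nav/ads/scripts
        tag.decompose()
    chunks, current = [], []
    for el in soup.find_all(["h1", "h2", "h3", "p", "table", "li"]):
        if el.name in ("h1", "h2", "h3") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(el.get_text(" ", strip=True))
    if current:
        chunks.append("\n".join(current))
    return chunks
```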

Chunking isn’t just about size here; it’s about preserving layout, hierarchy, and meaning. If the model can’t “see” what’s grouped together, it won’t reason about it.

Application-based chunking (custom jobs)


Some data just refuses to behave. A generic chunker won’t cut it when you’re dealing with legal contracts, medical records, customer support chats, or anything with domain-specific structure. Here, you chunk with context in mind, not just structure or size.

18. Legal documents → Section-aware chunking

  • Use heading detection (e.g., “Section 4.2 Liability”) to split at logical contract sections
  • Retain clause numbers and sub-clauses together
  • Don’t split mid-paragraph; bad legal advice awaits
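
Heading detection can be as simple as a regex split; the “Section 4.2” pattern below is an assumption and will need tuning per contract style:

```python
import re

# Zero-width split: break *before* each "Section N(.N...)" heading
SECTION_RE = re.compile(r"(?m)^(?=Section\s+\d+(?:\.\d+)*\b)")

def legal_chunks(contract_text: str) -> list[str]:
    """Split at section headings so clauses stay with their sub-clauses."""
    return [p.strip() for p in SECTION_RE.split(contract_text) if p.strip()]
```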

19. Medical data → Entity + sentence-level

  • Combine sentence-level splits with key medical entities (e.g., drugs, symptoms, dosages)
  • Use clinical NLP models like scispaCy
  • Crucial for preserving diagnostic relationships

20. Customer support → Intent-based chunking

  • Chunk by resolved intents: each Q&A pair or resolved case is one unit
  • Group similar problems with their resolutions
  • Improves retrieval accuracy for chatbot or ticket-based RAG

21. Research papers → Abstract + section chunking

  • Chunk by “Abstract,” “Methods,” “Results,” etc.
  • Preserve citations as a unit with their explanation
  • Tools like [Grobid](https://github.com/kermitt2/grobid) can help structure papers

When you’re working in a specific domain, chunking must reflect how humans navigate the content, not just what fits in 500 tokens.

Tools and libraries that help


Look, you can write your own chunker from scratch, regex and all. Or you can use one of these libraries that already figured it out, tested it, and wrapped it in a function with a name like split_documents().

Here are some tools devs actually use:

22. LangChain

  • Offers RecursiveCharacterTextSplitter, TokenTextSplitter, and more
  • Handles fallback strategies and overlapping chunks
  • Good starting point for structured and unstructured text
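
Typical usage looks like this (document_text stands in for your own string; the sizes are defaults to tune):

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=100)
chunks = splitter.split_text(document_text)
```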

23. LlamaIndex (formerly GPT Index)

  • High-level abstractions for chunking + indexing
  • Comes with SentenceSplitter, SemanticSplitter, and support for PDFs, HTML, Markdown, etc.
  • Especially nice when paired with retrieval eval tools
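
In recent llama-index versions, a sentence-aware splitter is similarly terse (the import path has moved between releases, so treat this as a sketch):

```python
from llama_index.core.node_parser import SentenceSplitter

splitter = SentenceSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_text(document_text)  # document_text is your own string
```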

24. Haystack

  • Modular RAG framework with preprocessing pipelines
  • Use it with Hugging Face models or OpenAI
  • Strong PDF and OCR support for chunking weird docs

25. OpenAI Tokenizer Viewer

  • Helps visualize how text will be tokenized before you chunk
  • Use this to debug token splits in fixed-size strategies
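
If you’d rather debug in code than in a browser, tiktoken shows you the same token boundaries:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
print([enc.decode([t]) for t in enc.encode("Chunking isn't just about size")])
```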

Honorable mentions

  • spaCy – great for sentence parsing
  • [pdfplumber](https://github.com/jsvine/pdfplumber) – table-aware PDF parsing
  • [trafilatura](https://github.com/adbar/trafilatura) – web text + HTML cleaner with structural awareness

These tools don’t just save time; they prevent you from burning hours on chunking edge cases that someone else already solved.

Chunking experiments that worked (or hilariously failed)


Chunking isn’t just theory; it’s trial, error, and screaming into your terminal.

Here are a few real stories (yes, tested on actual projects) that show what can go right or spectacularly wrong.

Worked: Recursive + overlap = clean dev wiki

  • Chunked a Notion wiki using recursive paragraph split, fallback to sentence, with 20% overlap
  • Retrieval relevance jumped ~30% with fewer hallucinations
  • Overlap helped with questions that landed between topics

Failed: Fixed-token chunks on legal docs

  • Tried 500-token chunks with no structure awareness
  • Retrieval kept surfacing “Definitions” section for everything
  • Outputs: legally impressive nonsense

Worked: Markdown-aware chunking for tech blog archive

  • Split by headers (##, ###) in Markdown
  • Each chunk became a clean, isolated sub-topic
  • Made fine-tuned RAG assistant way more accurate for tutorials

Failed: Sentence-level chunking for source code

  • Chunked Python scripts sentence by sentence (why? idk, it was late)
  • Retrieval returned “def”, “return”, and “if” as separate answers
  • Worse than useless: it was confident and wrong

Some of these taught us more than any tutorial ever could. Chunking is like tuning hyperparameters: you don’t really know what works until it breaks.

Clean-up tips and sanity checks


You’ve chunked your docs. They’re in the vector DB. You run your first query…

…and the results make less sense than a regex written under duress.

Before blaming the LLM, check your chunk hygiene. Here’s how:

1. Watch for cutoff chunks


Chunks ending mid-sentence or mid-thought? That’s how you get hallucinations. Use sentence or semantic splitters if you’re seeing garbage completions.

2. Tune your overlap percentage


Too little overlap = broken context. Too much = bloated retrieval with duplicate answers.
Start with 10–20% overlap and adjust based on output quality.

3. Validate with actual retrieval tests


Don’t just chunk and ship. Use a few test questions and verify:

  • Are the right chunks being retrieved?
  • Do answers change if chunking config changes?

Use tools like:

  • llama-index retriever playground
  • [RAGAS](https://github.com/explodinggradients/ragas) for retrieval evals

4. Clean your input before you chunk


Remove:

  • Extra whitespace
  • Table of contents
  • Footers and headers (especially in PDFs)
  • Boilerplate sections repeated across docs

Garbage in = garbage chunked = garbage out.
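
A pre-chunk cleanup pass can be a handful of regexes; the boilerplate patterns are yours to supply per corpus:

```python
import re

def clean_before_chunking(text: str, boilerplate: list[str]) -> str:
    """Strip repeated headers/footers and normalize whitespace before chunking."""
    for pattern in boilerplate:  # e.g., r"Page \d+ of \d+", a company footer line
        text = re.sub(pattern, "", text)
    text = re.sub(r"[ \t]+", " ", text)     # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)  # collapse excess blank lines
    return text.strip()
```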

These last-mile details are what separate a “cool prototype” from an actually useful RAG system.

Conclusion: Chunk smarter, not harder


If you made it this far, congrats: you now know more about chunking than 95% of devs throwing PDFs at vector DBs and praying for answers.

To recap:

  • There’s no “one right” chunking method.
  • Classic strategies still work if you know when to use them.
  • Semantic, code-aware, and layout-aware chunking can massively improve retrieval.
  • Bad chunking quietly ruins everything, so test often and tune like a backend config.

You don’t need a PhD to chunk well; you just need to stop slicing everything at 500 tokens like it’s 2022.
