How to chunk code at semantic boundaries when a single AST node exceeds the chunk size limit?


I'm building a code indexing tool for LLMs using tree-sitter in Go. The goal is to split source files into chunks (~600 lines) that respect function/class boundaries for better LLM context.

When a single function or switch statement exceeds my chunk size limit, I'm forced to split mid-construct:

// Chunk 1 ends mid-switch:
	case "type_declaration":
		for i := 0; i < int(node.ChildCount()); i++ {
...

// Chunk 2 continues:
			}
		}

	case "const_declaration":
...

This breaks semantic coherence—the LLM sees a partial switch statement without knowing the full context.

Current approach:

func getSymbolBoundaries(symbols []Symbol, totalLines int) []symbolBoundary {
	// Sort symbols by line, split at function starts
}

What I've considered:
1. Overlapping chunks - Include context from previous chunk (wastes tokens)
2. Recursive splitting - Descend into child nodes when parent exceeds limit
3. Hard truncation - Just split and add // ... continued markers
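For what it's worth, here is a minimal sketch of option 2 with option 3 as the fallback: recurse into children when a node exceeds the limit, pack consecutive siblings into chunks, and hard-split only at oversized leaves. The `node`, `chunk`, and `chunkNode` names are hypothetical; `node` is a stand-in for tree-sitter's `*sitter.Node`, which exposes the same data via `Type()`, `StartPoint().Row`, `EndPoint().Row`, and `Child(i)`.

```go
package main

import "fmt"

// node is a hypothetical stand-in for tree-sitter's *sitter.Node,
// carrying only what the algorithm needs.
type node struct {
	typ      string
	start    int // first source line, inclusive
	end      int // last source line, inclusive
	children []node
}

type chunk struct{ start, end int }

func (n node) lineCount() int { return n.end - n.start + 1 }

// chunkNode splits n into line ranges of at most maxLines, preferring
// child-node boundaries. A node that fits becomes (part of) one chunk;
// an oversized node is split between its children; an oversized leaf
// is hard-split as a last resort.
func chunkNode(n node, maxLines int) []chunk {
	if n.lineCount() <= maxLines {
		return []chunk{{n.start, n.end}}
	}
	if len(n.children) == 0 {
		// Graceful degradation: no semantic boundary left to use.
		var out []chunk
		for s := n.start; s <= n.end; s += maxLines {
			e := s + maxLines - 1
			if e > n.end {
				e = n.end
			}
			out = append(out, chunk{s, e})
		}
		return out
	}
	var out []chunk
	cur := chunk{n.start, n.start - 1} // empty accumulator
	flush := func() {
		if cur.end >= cur.start {
			out = append(out, cur)
		}
	}
	for _, c := range n.children {
		switch {
		case c.lineCount() > maxLines:
			// Child is itself too big: recurse into it.
			flush()
			out = append(out, chunkNode(c, maxLines)...)
			cur = chunk{c.end + 1, c.end}
		case c.end-cur.start+1 > maxLines:
			// Child doesn't fit in the current chunk: start a new one.
			flush()
			cur = chunk{c.start, c.end}
		default:
			cur.end = c.end
		}
	}
	flush()
	// Attach trailing lines (e.g. a closing brace) to the last chunk.
	if len(out) > 0 && out[len(out)-1].end < n.end {
		out[len(out)-1].end = n.end
	}
	return out
}

func main() {
	// A 20-line function whose body contains four ~5-line statements.
	fn := node{"function_declaration", 1, 20, []node{
		{"stmt", 2, 6, nil}, {"stmt", 7, 11, nil},
		{"stmt", 12, 16, nil}, {"stmt", 17, 19, nil},
	}}
	fmt.Println(chunkNode(fn, 10)) // [{1 6} {7 16} {17 20}]
}
```

One design choice worth noting: attaching trailing lines to the last chunk can push it slightly over the limit, which is usually preferable to emitting a one-line chunk containing only a closing brace.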

What's the idiomatic way to handle this with tree-sitter? Is there a pattern for "best-effort semantic chunking" that gracefully degrades when constructs are too large?
