I'm building a code indexing tool for LLMs using tree-sitter in Go. The goal is to split source files into chunks (~600 lines) that respect function/class boundaries for better LLM context.
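For context, the symbol-collection step looks roughly like the sketch below. It's a minimal stand-in rather than the real code: it assumes the smacker/go-tree-sitter bindings, and `extractSymbols`, the `Symbol` shape, and the handful of node types are simplified for the question.

```go
import (
	"context"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/golang"
)

// Symbol is a stand-in for whatever the index actually stores: a top-level
// declaration and the (0-based, inclusive) line range it covers.
type Symbol struct {
	Kind      string
	StartLine int
	EndLine   int
}

// extractSymbols parses Go source and records the line ranges of top-level
// declarations so chunk boundaries can snap to them.
func extractSymbols(src []byte) ([]Symbol, error) {
	parser := sitter.NewParser()
	parser.SetLanguage(golang.GetLanguage())

	tree, err := parser.ParseCtx(context.Background(), nil, src)
	if err != nil {
		return nil, err
	}
	defer tree.Close()

	root := tree.RootNode()
	var symbols []Symbol
	for i := 0; i < int(root.ChildCount()); i++ {
		child := root.Child(i)
		switch child.Type() {
		case "function_declaration", "method_declaration",
			"type_declaration", "const_declaration":
			symbols = append(symbols, Symbol{
				Kind:      child.Type(),
				StartLine: int(child.StartPoint().Row),
				EndLine:   int(child.EndPoint().Row),
			})
		}
	}
	return symbols, nil
}
```

Chunk boundaries are then snapped to the nearest symbol start that keeps a chunk under the ~600-line budget.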
When a single function or switch statement exceeds my chunk size limit, I'm forced to split mid-construct:
```
// Chunk 1 ends mid-switch: "case \"type_declaration\":\n\t\tfor i := 0; i < int(node.ChildCount()); i++ {\n..."
// Chunk 2 continues:       "\t\t\t}\n\t\t}\n\n\tcase \"const_declaration\":\n..."
```

This breaks semantic coherence: the LLM sees a partial switch statement without knowing the full context.
Current approach:
```go
func getSymbolBoundaries(symbols []Symbol, totalLines int) []symbolBoundary {
	// Sort symbols by line, split at function starts
}
```

What I've considered:
1. Overlapping chunks - Include context from the previous chunk (wastes tokens)
2. Recursive splitting - Descend into child nodes when the parent exceeds the limit (roughly the sketch after this list)
3. Hard truncation - Just split and add `// ... continued` markers
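For option 2, the shape I'm imagining is roughly the sketch below. It's illustrative only, again written against the smacker/go-tree-sitter bindings; `splitNode` and `maxLines` are placeholder names, the ranges are 0-based, and it doesn't yet re-attach the parent's own lines (its header and closing brace) or the gaps between unmerged siblings.

```go
import sitter "github.com/smacker/go-tree-sitter"

// splitNode returns (start, end) line ranges for chunks under node, descending
// into named children whenever the node alone spans more than maxLines.
func splitNode(node *sitter.Node, maxLines int) [][2]int {
	start := int(node.StartPoint().Row)
	end := int(node.EndPoint().Row)

	// Base case: the whole construct fits in one chunk.
	if end-start+1 <= maxLines {
		return [][2]int{{start, end}}
	}

	// Oversized node with no named children (huge literal, generated blob):
	// fall back to a hard split on line count.
	if node.NamedChildCount() == 0 {
		var chunks [][2]int
		for s := start; s <= end; s += maxLines {
			e := s + maxLines - 1
			if e > end {
				e = end
			}
			chunks = append(chunks, [2]int{s, e})
		}
		return chunks
	}

	// Otherwise recurse into children, greedily extending the previous chunk
	// while the combined span still fits under maxLines.
	var chunks [][2]int
	for i := 0; i < int(node.NamedChildCount()); i++ {
		for _, r := range splitNode(node.NamedChild(i), maxLines) {
			if n := len(chunks); n > 0 && r[1]-chunks[n-1][0]+1 <= maxLines {
				chunks[n-1][1] = r[1]
			} else {
				chunks = append(chunks, r)
			}
		}
	}
	return chunks
}
```

The appeal is that an oversized switch would degrade into per-case chunks rather than an arbitrary mid-case cut, with a hard line split reserved for genuinely huge leaf nodes.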
What's the idiomatic way to handle this with tree-sitter? Is there a pattern for "best-effort semantic chunking" that gracefully degrades when constructs are too large?
