I'm building a code indexing tool for LLMs using tree-sitter in Go. The goal is to split source files into chunks (~600 lines) that respect function/class boundaries for better LLM context.
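For context, the symbol-collection step looks roughly like the sketch below. It's a minimal stand-in rather than the real code: it assumes the smacker/go-tree-sitter bindings, and `extractSymbols`, the `Symbol` shape, and the handful of node types are simplified for the question.

```go
import (
	"context"

	sitter "github.com/smacker/go-tree-sitter"
	"github.com/smacker/go-tree-sitter/golang"
)

// Symbol is a stand-in for whatever the index actually stores: a top-level
// declaration and the (0-based, inclusive) line range it covers.
type Symbol struct {
	Kind      string
	StartLine int
	EndLine   int
}

// extractSymbols parses Go source and records the line ranges of top-level
// declarations so chunk boundaries can snap to them.
func extractSymbols(src []byte) ([]Symbol, error) {
	parser := sitter.NewParser()
	parser.SetLanguage(golang.GetLanguage())

	tree, err := parser.ParseCtx(context.Background(), nil, src)
	if err != nil {
		return nil, err
	}
	defer tree.Close()

	root := tree.RootNode()
	var symbols []Symbol
	for i := 0; i < int(root.ChildCount()); i++ {
		child := root.Child(i)
		switch child.Type() {
		case "function_declaration", "method_declaration",
			"type_declaration", "const_declaration":
			symbols = append(symbols, Symbol{
				Kind:      child.Type(),
				StartLine: int(child.StartPoint().Row),
				EndLine:   int(child.EndPoint().Row),
			})
		}
	}
	return symbols, nil
}
```

Chunk boundaries are then snapped to the nearest symbol start that keeps a chunk under the ~600-line budget.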
When a single function or switch statement exceeds my chunk size limit, I'm forced to split mid-construct:
```
// Chunk 1 ends mid-switch: "case \"type_declaration\":\n\t\tfor i := 0; i < int(node.ChildCount()); i++ {\n..."
// Chunk 2 continues:       "\t\t\t}\n\t\t}\n\n\tcase \"const_declaration\":\n..."
```

This breaks semantic coherence: the LLM sees a partial switch statement without knowing the full context.
Current approach:
```go
func getSymbolBoundaries(symbols []Symbol, totalLines int) []symbolBoundary {
	// Sort symbols by line, split at function starts
}
```

What I've considered:
1. Overlapping chunks - Include context from the previous chunk (wastes tokens)
2. Recursive splitting - Descend into child nodes when the parent exceeds the limit (roughly the sketch after this list)
3. Hard truncation - Just split and add `// ... continued` markers
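For option 2, the shape I'm imagining is roughly the sketch below. It's illustrative only, again written against the smacker/go-tree-sitter bindings; `splitNode` and `maxLines` are placeholder names, the ranges are 0-based, and it doesn't yet re-attach the parent's own lines (its header and closing brace) or the gaps between unmerged siblings.

```go
import sitter "github.com/smacker/go-tree-sitter"

// splitNode returns (start, end) line ranges for chunks under node, descending
// into named children whenever the node alone spans more than maxLines.
func splitNode(node *sitter.Node, maxLines int) [][2]int {
	start := int(node.StartPoint().Row)
	end := int(node.EndPoint().Row)

	// Base case: the whole construct fits in one chunk.
	if end-start+1 <= maxLines {
		return [][2]int{{start, end}}
	}

	// Oversized node with no named children (huge literal, generated blob):
	// fall back to a hard split on line count.
	if node.NamedChildCount() == 0 {
		var chunks [][2]int
		for s := start; s <= end; s += maxLines {
			e := s + maxLines - 1
			if e > end {
				e = end
			}
			chunks = append(chunks, [2]int{s, e})
		}
		return chunks
	}

	// Otherwise recurse into children, greedily extending the previous chunk
	// while the combined span still fits under maxLines.
	var chunks [][2]int
	for i := 0; i < int(node.NamedChildCount()); i++ {
		for _, r := range splitNode(node.NamedChild(i), maxLines) {
			if n := len(chunks); n > 0 && r[1]-chunks[n-1][0]+1 <= maxLines {
				chunks[n-1][1] = r[1]
			} else {
				chunks = append(chunks, r)
			}
		}
	}
	return chunks
}
```

The appeal is that an oversized switch would degrade into per-case chunks rather than an arbitrary mid-case cut, with a hard line split reserved for genuinely huge leaf nodes.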
What's the idiomatic way to handle this with tree-sitter? Is there a pattern for "best-effort semantic chunking" that gracefully degrades when constructs are too large?
