the chunking problem that breaks rag

most chunking approaches split text at a fixed size, typically every 256-1024 tokens with around 20% overlap. because the split points ignore content, this creates a cascade of issues: math formulas get divided mid-equation, code blocks get chopped in half, quotes are separated from their surrounding context, and semantically related content ends up scattered across unrelated chunks.
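
a minimal sketch of fixed-size chunking makes the failure mode concrete. it assumes whitespace-separated words stand in for tokens and uses a tiny window so the effect is visible; `fixed_size_chunks` and the sample text are illustrative names, not part of any library.

```python
def fixed_size_chunks(tokens, size=8, overlap=2):
    """Split a token list into windows of `size` tokens sharing `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# prose with a small function embedded in it (whitespace words as "tokens")
text = (
    "The gcd function is short: def gcd(a, b): while b: "
    "a, b = b, a % b return a and it runs fast"
)
chunks = fixed_size_chunks(text.split(), size=8, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

the first window ends at `b):`, so the `def` line is cut off from the loop body that follows it. the overlap repeats two tokens in the next window but does not bring the rest of the function back.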

semantic chunking tries to fix this by computing embedding similarity between sentences, but classical approaches have a critical flaw: they only examine adjacent sentences. when encountering semantically different content (like a code block embedded in a tutorial), they immediately create a new chunk, even when that content clearly relates to the surrounding text.
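
the adjacent-only behaviour can be sketched with hand-made vectors. the 2-d vectors below are invented for illustration (first axis "prose-like", second axis "code-like"), and `split_on_adjacent` is a hypothetical name for the classical approach, not a function from a specific library.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def split_on_adjacent(vectors, threshold=0.7):
    """Group sentence indices, splitting wherever two neighbours differ."""
    if not vectors:
        return []
    groups = [[0]]
    for i in range(1, len(vectors)):
        if cosine(vectors[i - 1], vectors[i]) >= threshold:
            groups[-1].append(i)   # same topic as the previous sentence
        else:
            groups.append([i])     # dissimilar neighbour: start a new chunk
    return groups


# tutorial prose, more prose, a code block, then prose about the same topic
vectors = [(0.9, 0.1), (0.85, 0.2), (0.1, 0.9), (0.8, 0.3)]
print(split_on_adjacent(vectors))
```

with these vectors the code block (index 2) ends up alone, and the prose after it (index 3) is cut off from the prose before it, because only neighbouring pairs are ever compared.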

chunk-it-pro: semantic double-pass merging

chunk-it-pro implements semantic double-pass merging: it first creates many small mini-chunks, then merges them based on broader semantic relationships. this two-phase approach can capture context that a single pass over adjacent sentences misses.

pass 1: conservative chunking

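the first pass is controlled by an appending threshold. it splits the text into sentences and computes the cosine similarity between the current chunk and each new sentence. a sentence that meets the threshold is appended to the current chunk; otherwise a new chunk starts. the result is many small, semantically cohesive chunks.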

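def first_pass(sentences, appending_threshold=0.8):
    # embed() and cosine_similarity() are assumed helpers: embed() maps a
    # list of sentences to one vector, cosine_similarity() returns a float
    if not sentences:
        return []
    chunks = []
    current_chunk = [sentences[0]]
    for i in range(1, len(sentences)):
        similarity = cosine_similarity(
            embed(current_chunk),
            embed([sentences[i]])
        )
        if similarity >= appending_threshold:
            # similar enough: grow the current chunk
            current_chunk.append(sentences[i])
        else:
            # topic shift: close the current chunk and start a new one
            chunks.append(current_chunk)
            current_chunk = [sentences[i]]
    # flush the last chunk, which the loop itself never appends
    chunks.append(current_chunk)
    return chunks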
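
the first-pass snippet calls `embed` and `cosine_similarity` without defining them. one way to try it without an embedding model, purely as an assumption for experimentation, is a bag-of-words stand-in like the sketch below. a real pipeline would swap in a sentence-embedding model here.

```python
import math
import re
from collections import Counter


def embed(sentences):
    """Toy embedding (assumption): word counts over a list of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(re.findall(r"[a-z0-9]+", sentence.lower()))
    return counts


def cosine_similarity(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is not similar to anything
    return dot / (norm_a * norm_b)


same = cosine_similarity(embed(["gcd of two numbers"]), embed(["gcd of two numbers"]))
different = cosine_similarity(embed(["gcd of two numbers"]), embed(["bake the bread"]))
print(same, different)
```

word counts only reward shared vocabulary, so an appending threshold of 0.8 will usually be too strict with this stand-in; expect to lower it when experimenting.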

pass 2: intelligent merging with lookahead

the second pass looks ahead two chunks to detect semantic relationships despite intervening "different" content. this is the key innovation -- it can bridge across code blocks, equations, or other interruptions in the text flow.

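def second_pass(chunks, merging_threshold=0.7):
    # chunks: lists of sentences from the first pass; embed() and
    # cosine_similarity() are the same assumed helpers as in pass 1
    merged_chunks = []
    i = 0
    while i < len(chunks):
        current = chunks[i]
        # lookahead: compare with the chunk two positions ahead
        if i + 2 < len(chunks):
            similarity_skip_one = cosine_similarity(
                embed(current), embed(chunks[i + 2])
            )
            if similarity_skip_one >= merging_threshold:
                # bridge the gap and absorb the intervening chunk as well
                merged = current + chunks[i + 1] + chunks[i + 2]
                merged_chunks.append(merged)
                i += 3
                continue
        # fallback: ordinary merge with the directly adjacent chunk
        if i + 1 < len(chunks):
            similarity_adjacent = cosine_similarity(
                embed(current), embed(chunks[i + 1])
            )
            if similarity_adjacent >= merging_threshold:
                merged = current + chunks[i + 1]
                merged_chunks.append(merged)
                i += 2
                continue
        # no match either way: keep the chunk unchanged
        merged_chunks.append(current)
        i += 1
    return merged_chunks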

why the lookahead matters

consider a document about the euclidean algorithm with three chunks:

  • chunk 1: explanation of the euclidean algorithm concept
  • chunk 2: python implementation (code block)
  • chunk 3: analysis of the algorithm's time complexity

classical single-pass chunking creates 3 separate chunks because the code block is semantically different from the prose. the second pass detects that chunks 1 and 3 are semantically related and merges all three into a single coherent chunk, preserving the full explanation-code-analysis flow.
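
the decision for this example can be traced with numbers. the three vectors below are invented for illustration (first axis "algorithm prose", second axis "code"); real embeddings have hundreds of dimensions, but the comparison logic is the same.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# hypothetical embeddings for the three chunks
concept = (0.9, 0.1)      # chunk 1: explanation of the algorithm
code = (0.1, 0.9)         # chunk 2: python implementation
complexity = (0.8, 0.3)   # chunk 3: time-complexity analysis
merging_threshold = 0.7

adjacent = cosine(concept, code)        # chunk 1 vs chunk 2
skip_one = cosine(concept, complexity)  # chunk 1 vs chunk 3 (lookahead)

if skip_one >= merging_threshold:
    decision = "merge chunks 1, 2 and 3"
elif adjacent >= merging_threshold:
    decision = "merge chunks 1 and 2"
else:
    decision = "keep chunk 1 on its own"

print(round(adjacent, 2), round(skip_one, 2), decision)
```

the adjacent similarity comes out at about 0.22, well below the threshold, while the skip-one similarity is about 0.97, so the lookahead check fires and the code block is carried along in the merged chunk.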

threshold methods

choosing the right threshold is critical. chunk-it-pro supports three methods for automatically determining thresholds from the data:

percentile threshold

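uses the nth percentile of the cosine distances between consecutive sentences (cosine distance is 1 minus cosine similarity, so larger values suggest a sharper topic shift). a good default for most documents.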

import numpy as np

def percentile_threshold(distances, percentile=90):
    return np.percentile(distances, percentile)
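
to see what such a threshold does on data, here is a standard-library sketch. `percentile_of` is a hypothetical stand-in that interpolates between neighbouring sorted values, the sample distances are made up, and treating every distance above the threshold as a boundary is an assumption about how the value would be applied.

```python
import statistics


def percentile_of(values, percentile):
    """Integer percentile (1-99), interpolated between neighbouring sorted values."""
    cut_points = statistics.quantiles(values, n=100, method="inclusive")
    return cut_points[percentile - 1]


# invented cosine distances between consecutive sentences
distances = [0.05, 0.08, 0.10, 0.12, 0.60, 0.07, 0.09, 0.55, 0.11, 0.06]

threshold = percentile_of(distances, 90)
# assumption: a gap counts as a chunk boundary when it exceeds the threshold
boundaries = [i for i, d in enumerate(distances) if d > threshold]
print(threshold, boundaries)
```

with ten distances, the 90th percentile falls between the two largest values, so only the single sharpest shift (index 4) is kept as a boundary. lowering the percentile produces more, smaller chunks.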

gradient threshold

finds the largest jump between neighbouring values in the sorted distances. useful when there is a clear separation between intra-topic and inter-topic distances.

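import numpy as np

def gradient_threshold(distances):
    # needs at least two distances, otherwise there is no gap to measure
    sorted_distances = np.sort(distances)
    gradients = np.diff(sorted_distances)
    # index of the largest jump between neighbouring sorted values
    max_gradient_idx = np.argmax(gradients)
    # lower edge of that jump, i.e. the largest distance below the gap
    return sorted_distances[max_gradient_idx]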
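
a small worked example shows which value this picks. the sketch below redoes the same steps with the standard library and returns both edges of the largest gap; `largest_gap` and the sample distances are illustrative assumptions.

```python
def largest_gap(distances):
    """Return (lower_edge, upper_edge) of the biggest jump in the sorted values."""
    ordered = sorted(distances)
    if len(ordered) < 2:
        raise ValueError("need at least two distances to find a gap")
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    idx = max(range(len(gaps)), key=gaps.__getitem__)
    return ordered[idx], ordered[idx + 1]


# invented distances: eight small intra-topic values, two large inter-topic ones
distances = [0.05, 0.08, 0.10, 0.12, 0.60, 0.07, 0.09, 0.55, 0.11, 0.06]
lower, upper = largest_gap(distances)
print(lower, upper)
```

the biggest jump is between 0.12 and 0.55. returning the lower edge, as the function above does, means a boundary test has to be strict (`distance > threshold`); using the midpoint of the two edges would be an alternative design choice.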

local maxima threshold

identifies peaks in the sequence of distances, in document order, treating each peak as a potential chunk boundary. best for documents with clearly delineated sections.

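import numpy as np
from scipy.signal import argrelextrema

def local_maxima_threshold(distances, window_size=3):
    # convert once so that index-array lookup also works for plain lists
    distances = np.asarray(distances)
    # indices strictly greater than window_size neighbours on each side
    maxima_indices = argrelextrema(
        distances,
        np.greater,
        order=window_size
    )[0]
    # one distance value per peak; each peak marks a candidate boundary
    return distances[maxima_indices]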
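
how peaks become chunk boundaries is easier to see with positions than with values. the sketch below is a standard-library approximation of the peak search (it is not scipy's implementation) and assumes that `distances[i]` is the gap between sentence `i` and sentence `i + 1`; `local_peaks` and `split_at_peaks` are hypothetical helper names.

```python
def local_peaks(distances, window_size=1):
    """Indices whose value is strictly greater than all neighbours in the window."""
    peaks = []
    for i in range(1, len(distances) - 1):  # the two end points are never peaks
        left = distances[max(0, i - window_size):i]
        right = distances[i + 1:i + 1 + window_size]
        if all(distances[i] > d for d in left + right):
            peaks.append(i)
    return peaks


def split_at_peaks(sentences, peaks):
    """Cut the sentence list after every peak position."""
    groups, start = [], 0
    for p in peaks:
        groups.append(sentences[start:p + 1])
        start = p + 1
    groups.append(sentences[start:])
    return groups


sentences = ["s0", "s1", "s2", "s3", "s4", "s5"]
distances = [0.1, 0.6, 0.15, 0.5, 0.1]  # invented gaps between neighbours

peaks = local_peaks(distances, window_size=1)
print(peaks, split_at_peaks(sentences, peaks))
```

a larger `window_size` demands that a peak dominate more of its surroundings, which suppresses minor bumps and yields fewer, larger chunks.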

integration with rag pipelines

the integration path is straightforward: process the document through the double-pass chunker, generate an embedding for each merged chunk, index the embeddings in your vector store, and retrieve as usual at query time. because the merged chunks keep explanation, code, and analysis together, retrieved passages carry more of their surrounding context.
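
the steps after chunking can be sketched end to end. everything here is a stand-in chosen for illustration: `toy_embed` is a bag-of-words placeholder for an embedding model, `InMemoryIndex` is a hypothetical replacement for a vector store, and `merged_chunks` is hand-written sample output rather than the result of running the chunker.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Placeholder embedding (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class InMemoryIndex:
    """Minimal stand-in for a vector store: linear scan over stored vectors."""

    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (chunk_text, vector) pairs

    def add(self, chunk_sentences):
        text = " ".join(chunk_sentences)
        self.entries.append((text, self.embed(text)))

    def search(self, query, top_k=1):
        query_vector = self.embed(query)
        ranked = sorted(
            self.entries, key=lambda e: cosine(query_vector, e[1]), reverse=True
        )
        return [text for text, _ in ranked[:top_k]]


# sample merged chunks, each a list of sentences
merged_chunks = [
    [
        "The Euclidean algorithm computes the greatest common divisor.",
        "def gcd(a, b): return a if b == 0 else gcd(b, a % b)",
        "It runs in logarithmic time.",
    ],
    ["Vector stores index embeddings for nearest-neighbour search."],
]

index = InMemoryIndex(toy_embed)
for chunk in merged_chunks:
    index.add(chunk)

top = index.search("how does the euclidean algorithm compute the greatest common divisor")
print(top[0])
```

the query only matches the prose sentence, but because explanation, code, and complexity note were merged into one chunk, the code comes back with it.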

the improvement is most pronounced on:

  • technical documentation with interleaved code examples
  • academic papers with equations embedded in explanatory text
  • mixed-content documents where code, tables, and prose alternate frequently

alternatives and comparison

langchain's recursive text splitter is deterministic but context-blind -- it splits on characters and has no understanding of semantic boundaries. semantic-text-splitter does single-pass splitting only, missing cross-chunk relationships. llamaindex's semantic chunker computes similarity but lacks the lookahead mechanism that bridges across dissimilar content.

chunk-it-pro's advantage is specifically the lookahead in the second pass. by examining chunks two positions ahead, it can detect when surrounding context belongs together even when an intervening chunk (like a code block or equation) would normally trigger a split.