colqwenrag: rag without ocr, for legal papers

the problem with text-only rag on documents

the standard rag pipeline for documents follows a familiar path: pdf to ocr to chunking to text embeddings to vector search. each step introduces information loss. ocr misreads characters. chunking splits tables across boundaries. text embeddings discard layout, column structure, and visual relationships entirely. by the time a legal document reaches the vector store, its tables have become garbled rows of text, its multi-column layouts have been linearized into nonsense, and its charts have vanished.

for legal ai papers specifically, this is a severe limitation. these documents are dense with structured visual information: comparison tables, workflow diagrams, mathematical formulations, multi-column layouts, and embedded figures that carry meaning no text extraction pipeline can preserve.

vision-language embeddings: skip the pipeline

colqwenrag takes a different approach: embed the document page images directly. no ocr. no chunking. no text extraction. the colqwen2 vision-language model processes each page as an image and produces embeddings that encode both the textual content and the visual structure.

the backbone is qwen2-vl, a vision-language model that understands document layout natively. it sees the page as a human would -- text in columns, tables with aligned cells, figures with captions, headers establishing hierarchy. the model produces 128-dimensional embeddings for each visual patch of the page, creating a rich multi-vector representation of the document.

late interaction matching

colqwenrag uses late interaction for retrieval, following the colbert paradigm adapted for vision. instead of compressing an entire page into a single vector, the model produces one embedding per visual patch. at query time, each query token embedding is compared against every patch embedding for a page; for each query token only its maximum similarity is kept, and these per-token maxima are summed to give the page score.

# late interaction (maxsim) scoring
import torch

def score_page(query_embeddings: torch.Tensor, page_patch_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, 128)
    # page_patch_embeddings: (num_patches, 128)
    # pairwise dot products, shape (num_query_tokens, num_patches)
    similarity_matrix = query_embeddings @ page_patch_embeddings.T
    # best-matching patch per query token, then sum over query tokens
    max_similarities = similarity_matrix.max(dim=1).values
    return max_similarities.sum()

this late interaction approach preserves fine-grained matching -- a query about a specific table cell can match against the exact patch containing that cell, even if the rest of the page is unrelated. single-vector approaches lose this granularity.
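
to make the granularity point concrete, here is a toy comparison in plain python. the 2-dimensional vectors, the page names, and the mean-pooled baseline are all assumptions made up for illustration; none of them come from colqwen2 or from the dataset. page_a has one patch that matches the query exactly and three unrelated patches, while page_b is only vaguely related everywhere.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_vecs, patch_vecs):
    """late interaction: best patch per query token, summed over tokens."""
    return sum(max(dot(q, p) for p in patch_vecs) for q in query_vecs)

def mean_pool(vecs):
    """collapse a list of vectors into one by averaging each dimension."""
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def pooled_score(query_vecs, patch_vecs):
    """single-vector baseline: compare the averaged query to the averaged page."""
    return dot(mean_pool(query_vecs), mean_pool(patch_vecs))

# toy data (hypothetical): one query token, four patches per page
query = [[1.0, 0.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
page_b = [[0.6, 0.8]] * 4

for name, page in (("page_a", page_a), ("page_b", page_b)):
    print(name, maxsim_score(query, page), pooled_score(query, page))
```

under these toy numbers, late interaction ranks page_a first because its single matching patch dominates the score, while mean pooling dilutes that patch and prefers page_b.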

binary quantization for scale

the multi-vector representation is expensive in raw form. each page produces hundreds of 128-dimensional float32 vectors. binary quantization compresses each dimension to a single bit, achieving 32x memory compression with minimal retrieval quality loss.
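
a minimal sketch of the idea in plain python, assuming sign-based thresholding at zero and counting agreeing bits as the similarity between binary codes. the helper names binarize and sign_agreement are hypothetical, and qdrant's internal implementation may differ.

```python
def binarize(vec):
    """pack the sign of each dimension into one bit of a python int."""
    bits = 0
    for x in vec:
        bits = (bits << 1) | (1 if x > 0 else 0)
    return bits

def sign_agreement(a_bits, b_bits, dim):
    """count the dimensions where two binary codes carry the same bit."""
    return dim - bin(a_bits ^ b_bits).count("1")

DIM = 128
float32_bytes = DIM * 4                      # 512 bytes per vector
binary_bytes = DIM // 8                      # 16 bytes per vector
compression = float32_bytes // binary_bytes  # 32

a = binarize([0.3, -0.1, 0.7, -0.9])   # bits 1010
b = binarize([0.2, -0.4, -0.5, -0.1])  # bits 1000
print(sign_agreement(a, b, 4))         # 3 of the 4 dimensions agree
```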

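each float32 patch vector takes 512 bytes (128 dimensions x 4 bytes), so with hundreds of patches on each of the 4,245 pages in the legal ai paper dataset, the unquantized index grows large quickly. binary quantization shrinks each vector to 16 bytes, so the full index fits comfortably in memory, making retrieval fast and deployment practical.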
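
a back-of-the-envelope estimate shows the scale. the 700 patches per page used below is an assumed figure chosen only for illustration (the text says only "hundreds"), and the estimate counts raw vector bytes, ignoring index and payload overhead.

```python
def index_bytes(pages, patches_per_page, dim, bits_per_dim):
    """raw vector storage in bytes, ignoring index and payload overhead."""
    return pages * patches_per_page * dim * bits_per_dim // 8

PAGES = 4245             # dataset size stated in the text
PATCHES_PER_PAGE = 700   # assumption, for illustration only
DIM = 128

float32_size = index_bytes(PAGES, PATCHES_PER_PAGE, DIM, 32)
binary_size = index_bytes(PAGES, PATCHES_PER_PAGE, DIM, 1)

print(f"float32: {float32_size / 1e9:.2f} GB")  # 1.52 GB under this assumption
print(f"binary:  {binary_size / 1e6:.1f} MB")   # 47.5 MB under this assumption
```

a different patch count scales both numbers linearly, but the 32x ratio between them stays the same.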

qdrant vector database

the embeddings are stored in qdrant, which supports multi-vector documents and binary quantization natively. each document page is stored as a single point whose multivector holds all of that page's patch embeddings, and qdrant handles the late interaction scoring at query time.

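from qdrant_client import QdrantClient
from qdrant_client.models import (
    BinaryQuantization,
    BinaryQuantizationConfig,
    Distance,
    MultiVectorComparator,
    MultiVectorConfig,
    VectorParams,
)

# assumes a qdrant instance is reachable on the default local port
client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="legal_ai_papers",
    vectors_config={
        # named vector holding one 128-d embedding per page patch
        "colqwen": VectorParams(
            size=128,
            distance=Distance.COSINE,
            # score with maxsim across the patch embeddings of a point
            multivector_config=MultiVectorConfig(
                comparator=MultiVectorComparator.MAX_SIM
            ),
            # store 1 bit per dimension
            quantization_config=BinaryQuantization(binary=BinaryQuantizationConfig()),
        )
    },
)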

retrieval accuracy

on the legal ai paper dataset (4,245 pages across multiple papers), colqwenrag achieves over 80% exact page retrieval accuracy. given a natural language question about content on a specific page, the system retrieves that exact page as the top result more than 4 out of 5 times.

this is measured strictly -- the correct page must be rank 1, not just in the top 5. for a vision-only system with no text extraction, this is a strong result. the model is finding the right page based on visual understanding of the content, not keyword matching.
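
the strict metric can be sketched as follows, with recall@k shown alongside for contrast. the function names and the toy rankings are hypothetical; they are not the evaluation code or the results behind the figure above.

```python
def top1_accuracy(rankings, gold_pages):
    """fraction of queries whose rank-1 result is the correct page."""
    hits = sum(1 for ranked, gold in zip(rankings, gold_pages) if ranked[:1] == [gold])
    return hits / len(gold_pages)

def recall_at_k(rankings, gold_pages, k):
    """fraction of queries whose correct page appears anywhere in the top k."""
    hits = sum(1 for ranked, gold in zip(rankings, gold_pages) if gold in ranked[:k])
    return hits / len(gold_pages)

# toy data (hypothetical): ranked page ids per query, and the correct page for each
rankings = [
    ["p12", "p3", "p40"],
    ["p7", "p8", "p9"],
    ["p5", "p21", "p2"],
    ["p30", "p31", "p1"],
]
gold_pages = ["p12", "p8", "p5", "p99"]

print(top1_accuracy(rankings, gold_pages))   # 0.5
print(recall_at_k(rankings, gold_pages, 3))  # 0.75
```

the second query shows why the distinction matters: the correct page is retrieved at rank 2, which counts toward recall@3 but not toward exact page accuracy.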

why this matters for legal ai

legal documents have properties that make them particularly hostile to text-only rag:

complex table structures with merged cells, nested headers, and alignment that carries semantic meaning
multi-column layouts where text extraction produces interleaved columns that destroy reading order
mathematical formulations and logical notation that ocr frequently corrupts
comparison charts and workflow diagrams that exist only as visual elements
footnotes, margin annotations, and cross-references that depend on spatial position

vision-based retrieval sidesteps all of these problems. the model sees the page as rendered, with all visual structure intact. a question about "the accuracy numbers in table 3" can match against the actual visual table, not a garbled text extraction of it.

the tradeoff is compute cost -- vision models are heavier than text embedding models, and the multi-vector representation requires more storage than single-vector text embeddings. but for domains where document structure matters, the quality improvement justifies the cost.