the problem with text-only rag on documents
the standard rag pipeline for documents follows a familiar path: pdf to ocr to chunking to text embeddings to vector search. each step introduces information loss. ocr misreads characters. chunking splits tables across boundaries. text embeddings discard layout, column structure, and visual relationships entirely. by the time a legal document reaches the vector store, its tables have become garbled rows of text, its multi-column layouts have been linearized into nonsense, and its charts have vanished.
for legal ai papers specifically, this is a severe limitation. these documents are dense with structured visual information: comparison tables, workflow diagrams, mathematical formulations, multi-column layouts, and embedded figures that carry meaning no text extraction pipeline can preserve.
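as a small illustration of the chunking failure mode, the sketch below runs a naive fixed-size character chunker over a toy extracted table. the table contents, the `chunk_text` helper, and the 40-character chunk size are all hypothetical, chosen only to make the split easy to see; real pipelines use larger chunks but hit the same boundary problem.

```python
def chunk_text(text: str, chunk_size: int) -> list[str]:
    """naive fixed-size chunker: cut the text every `chunk_size` characters."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


# toy ocr output for a small table (hypothetical data, not from the dataset)
extracted = (
    "model | precision | recall\n"
    "baseline | 0.71 | 0.64\n"
    "proposed | 0.83 | 0.79\n"
)

chunks = chunk_text(extracted, chunk_size=40)

# find which chunk holds the column names and which holds the last data row
header_chunk = next(i for i, c in enumerate(chunks) if "precision" in c)
last_row_chunk = next(i for i, c in enumerate(chunks) if "0.79" in c)

for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

the header row and the final data row land in different chunks, so a retriever that returns the second chunk sees numbers with no column names attached.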
vision-language embeddings: skip the pipeline
colqwenrag takes a different approach: embed the document page images directly. no ocr. no chunking. no text extraction. the colqwen2 vision-language model processes each page as an image and produces embeddings that encode both the textual content and the visual structure.
the backbone is qwen2-vl, a vision-language model that understands document layout natively. it sees the page as a human would -- text in columns, tables with aligned cells, figures with captions, headers establishing hierarchy. the model produces 128-dimensional embeddings for each visual patch of the page, creating a rich multi-vector representation of the document.
late interaction matching
colqwenrag uses late interaction for retrieval, following the colbert paradigm adapted for vision. instead of compressing an entire page into a single vector, the model produces one embedding per visual patch. at query time, each query token embedding is matched against all patch embeddings for a page, and the maximum similarity scores are summed.
```python
# late interaction scoring
import torch


def score_page(query_embeddings: torch.Tensor, page_patch_embeddings: torch.Tensor) -> torch.Tensor:
    # query_embeddings: (num_query_tokens, 128)
    # page_patch_embeddings: (num_patches, 128)
    similarity_matrix = query_embeddings @ page_patch_embeddings.T
    # max similarity per query token, then sum over query tokens
    max_similarities = similarity_matrix.max(dim=1).values
    return max_similarities.sum()
```
this late interaction approach preserves fine-grained matching -- a query about a specific table cell can match against the exact patch containing that cell, even if the rest of the page is unrelated. single-vector approaches lose this granularity.
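to make the granularity argument concrete, here is a minimal numpy sketch comparing late interaction with a single-vector baseline on toy data. everything in it is an assumption made for illustration: the 4-dimensional vectors stand in for the 128-dimensional embeddings, the two pages are hand-built, and `maxsim_score` and `pooled_score` are hypothetical helper names, with mean pooling chosen as one simple way to get a single vector per page.

```python
import numpy as np


def maxsim_score(query: np.ndarray, patches: np.ndarray) -> float:
    """late interaction: best-matching patch per query token, summed."""
    sim = query @ patches.T  # (num_query_tokens, num_patches)
    return float(sim.max(axis=1).sum())


def pooled_score(query: np.ndarray, patches: np.ndarray) -> float:
    """single-vector baseline: cosine similarity of mean-pooled embeddings."""
    q, p = query.mean(axis=0), patches.mean(axis=0)
    return float(q @ p / (np.linalg.norm(q) * np.linalg.norm(p)))


e = np.eye(4)  # toy unit vectors standing in for real embeddings
query = np.stack([e[0], e[1]])  # two query tokens

# page a: two patches match the query tokens exactly, four are unrelated
page_a = np.stack([e[0], e[1], e[2], e[2], e[2], e[2]])
# page b: every patch is loosely related to the query, none matches exactly
loose = (e[0] + e[1] + e[3]) / np.sqrt(3)
page_b = np.stack([loose] * 6)

pages = {"page_a": page_a, "page_b": page_b}
maxsim_ranking = sorted(pages, key=lambda k: maxsim_score(query, pages[k]), reverse=True)
pooled_ranking = sorted(pages, key=lambda k: pooled_score(query, pages[k]), reverse=True)

print("maxsim:", maxsim_ranking)
print("pooled:", pooled_ranking)
```

in this toy setup late interaction ranks page_a first because two of its patches match the query tokens exactly, while the pooled baseline prefers page_b because the unrelated patches on page_a dilute its average.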
binary quantization for scale
the multi-vector representation is expensive in raw form. each page produces hundreds of 128-dimensional float32 vectors. binary quantization compresses each dimension to a single bit, achieving 32x memory compression with minimal retrieval quality loss.
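the idea can be sketched in a few lines of numpy. this is an illustration of sign-bit quantization in general, not a description of how qdrant implements it internally; the random vectors, the 700-patch page, and the `binarize` and `hamming_distance` helpers are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# hypothetical page: 700 patch embeddings of width 128, stored as float32
patches = rng.standard_normal((700, 128)).astype(np.float32)


def binarize(vectors: np.ndarray) -> np.ndarray:
    """keep only the sign of each dimension, packed 8 dimensions per byte."""
    return np.packbits(vectors > 0, axis=-1)


def hamming_distance(a: np.ndarray, b: np.ndarray) -> int:
    """number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(a, b)).sum())


packed = binarize(patches)  # 128 bits -> 16 bytes per patch
compression = patches.nbytes / packed.nbytes

# similar vectors should keep similar codes; unrelated ones should not
original = patches[0]
near_copy = original + 0.1 * rng.standard_normal(128).astype(np.float32)
unrelated = patches[1]
d_near = hamming_distance(binarize(original), binarize(near_copy))
d_far = hamming_distance(binarize(original), binarize(unrelated))

print(f"compression: {compression:.0f}x, near: {d_near} bits, far: {d_far} bits")
```

each 128-dimensional float32 vector shrinks from 512 bytes to 16, which is where the 32x figure comes from. the distance check at the end shows why retrieval can survive the compression: a slightly perturbed vector differs in far fewer bits than an unrelated one.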
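each patch vector costs 512 bytes as float32 and 16 bytes after binary quantization. at 4,245 pages in the legal ai paper dataset, with hundreds of patch vectors per page, the unquantized index runs to hundreds of megabytes or more. with binary quantization, the full index fits comfortably in memory, making retrieval fast and deployment practical.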
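a rough back-of-the-envelope estimate shows the scale. the page count and embedding width come from the text above, but the patches-per-page figure is an assumption: the source only says "hundreds", so 700 is a placeholder to swap for a measured value. `index_size_bytes` is a hypothetical helper.

```python
def index_size_bytes(pages: int, patches_per_page: int, dim: int, bits_per_dim: int) -> int:
    """raw vector storage only: no index structures, payloads, or metadata."""
    return pages * patches_per_page * dim * bits_per_dim // 8


PAGES = 4245            # dataset size stated in the text
DIM = 128               # embedding width stated in the text
PATCHES_PER_PAGE = 700  # ASSUMPTION: placeholder for "hundreds" of patches

float32_bytes = index_size_bytes(PAGES, PATCHES_PER_PAGE, DIM, bits_per_dim=32)
binary_bytes = index_size_bytes(PAGES, PATCHES_PER_PAGE, DIM, bits_per_dim=1)

print(f"float32: {float32_bytes / 1e9:.2f} GB")
print(f"binary:  {binary_bytes / 1e6:.1f} MB")
```

under these assumptions the raw vectors take about 1.52 GB as float32 and about 47.5 MB after binary quantization. the estimate excludes index overhead, payloads, and any original vectors kept alongside the quantized ones for rescoring.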
qdrant vector database
the embeddings are stored in qdrant, which supports multi-vector documents and binary quantization natively. each document page is stored as a single point whose vector is the full set of patch embeddings for that page, and qdrant handles the late interaction scoring at query time.
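```python
from qdrant_client import QdrantClient, models

# assumes a qdrant instance listening on the default local port
client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="legal_ai_papers",
    vectors_config={
        "colqwen": models.VectorParams(
            size=128,  # width of each patch embedding
            distance=models.Distance.COSINE,
            # one point holds many patch vectors, compared with max-sim
            multivector_config=models.MultiVectorConfig(
                comparator=models.MultiVectorComparator.MAX_SIM
            ),
            # keep a 1-bit-per-dimension copy of every vector for search
            quantization_config=models.BinaryQuantization(
                binary=models.BinaryQuantizationConfig()
            ),
        )
    },
)
```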
retrieval accuracy
on the legal ai paper dataset (4,245 pages across multiple papers), colqwenrag achieves over 80% exact page retrieval accuracy. given a natural language question about content on a specific page, the system retrieves that exact page as the top result more than 4 out of 5 times.
this is measured strictly -- the correct page must be rank 1, not just in the top 5. for a vision-only system with no text extraction, this is a strong result. the model is finding the right page based on visual understanding of the content, not keyword matching.
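the difference between the strict and lenient metrics is easy to see in code. the sketch below uses made-up rankings for five questions, not results from the actual evaluation, and `hit_rate_at_k` is a hypothetical helper name.

```python
def hit_rate_at_k(ranked_pages: list[list[int]], gold_pages: list[int], k: int) -> float:
    """fraction of questions whose correct page appears in the top k results."""
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_pages, gold_pages))
    return hits / len(gold_pages)


# hypothetical retrieval output: page ids in ranked order for five questions
ranked = [[12, 7, 3], [5, 40, 41], [9, 8, 2], [30, 31, 33], [4, 1, 6]]
gold = [12, 40, 9, 30, 4]  # the page each question was written about

strict = hit_rate_at_k(ranked, gold, k=1)   # correct page must be rank 1
lenient = hit_rate_at_k(ranked, gold, k=3)  # correct page anywhere in the top 3

print(f"rank-1 accuracy: {strict:.0%}, top-3 hit rate: {lenient:.0%}")
```

in this made-up example the second question's correct page sits at rank 2, so it counts under the lenient metric but not under the strict one. the strict number is the one reported above.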
why this matters for legal ai
legal documents have properties that make them particularly hostile to text-only rag:
- complex table structures with merged cells, nested headers, and alignment that carries semantic meaning
- multi-column layouts where text extraction produces interleaved columns that destroy reading order
- mathematical formulations and logical notation that ocr frequently corrupts
- comparison charts and workflow diagrams that exist only as visual elements
- footnotes, margin annotations, and cross-references that depend on spatial position
vision-based retrieval sidesteps all of these problems. the model sees the page as rendered, with all visual structure intact. a question about "the accuracy numbers in table 3" can match against the actual visual table, not a garbled text extraction of it.
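the multi-column failure mode from the list above can be reproduced with a toy example. the two columns of text are invented for illustration, and the "naive" extraction is a deliberately simple model of a tool that reads each visual line straight across the page.

```python
# two columns as they appear side by side on the page (invented text)
left_column = ["the court held that", "the clause was void", "for uncertainty."]
right_column = ["damages are capped", "at the contract value", "under section 4."]

# naive extraction: read each visual line left to right across both columns
naive = " ".join(f"{left} {right}" for left, right in zip(left_column, right_column))

# intended reading order: finish the left column, then start the right
correct = " ".join(left_column + right_column)

print("naive:  ", naive)
print("correct:", correct)
```

the naive output interleaves the two sentences, so neither survives as a contiguous span for a text embedding to capture. a model that embeds the rendered page never has to reconstruct reading order in the first place.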
the tradeoff is compute cost -- vision models are heavier than text embedding models, and the multi-vector representation requires more storage than single-vector text embeddings. but for domains where document structure matters, the quality improvement justifies the cost.