fine-tuning embeddings until legal-rag stops missing

motivation

generic embedding models are trained on broad internet corpora. they work well enough for general-purpose retrieval, but they fall apart on specialized domains like legal text. regulatory documents are dense with cross-references, defined terms that carry precise legal meaning, and domain-specific shorthand that generic models simply have not seen enough of during pretraining. a phrase like "regulation 11(1) of the sebi (lodr) regulations, 2015" is semantically rich to a legal practitioner but opaque to a model trained mostly on wikipedia and reddit.

the result is poor retrieval precision. queries about specific regulatory provisions return tangentially related chunks instead of the exact clauses that matter. for a legal rag system, this is not a minor inconvenience -- it is a correctness failure. so the question becomes: can we fine-tune existing embedding models on domain-specific legal data and get meaningfully better retrieval without massive compute budgets?

models and dataset

three base models were selected for fine-tuning, each representing a different architecture and parameter count:

bge-base-en-v1.5 -- a compact, well-regarded english embedding model from baai
snowflake-arctic-embed-m-v2.0 -- a mid-sized model with strong out-of-the-box retrieval performance
multilingual-e5-large-instruct -- a larger instruction-tuned model with multilingual support

the training dataset was constructed by scraping publicly available sebi (securities and exchange board of india) pdf documents. these cover regulations, circulars, orders, and guidelines -- exactly the kind of text a legal rag system needs to handle. from these documents, question-answer pairs were generated using gpt-4o-mini, producing 1,456 training samples and 162 held-out test samples. the dataset is available on huggingface as axondendriteplus/legal-rag-embedding-dataset.

matryoshka representation learning

one of the key techniques applied was matryoshka representation learning. the idea is simple but powerful: instead of training embeddings at a single fixed dimension, you train them to be useful at multiple truncated dimensions simultaneously. the model learns that the first 64 dimensions should carry the most important information, the first 128 should carry more, and so on up to the full 768.

this is done by wrapping the base loss function in a matryoshka loss that evaluates the embedding at each target dimension:

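from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MatryoshkaLoss, MultipleNegativesRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")
base_loss = MultipleNegativesRankingLoss(model)
# evaluate the base loss at every truncation point, weighted equally
loss = MatryoshkaLoss(
    model, base_loss,
    matryoshka_dims=[768, 512, 256, 128, 64], matryoshka_weights=[1, 1, 1, 1, 1],
)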

equal weights across all dimensions force the model to produce useful representations at every truncation point. the underlying loss is MultipleNegativesRankingLoss, a contrastive objective that treats every other document in the batch as a negative: it pulls each query toward its own matching document and pushes it away from the documents paired with the other queries.
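to make the combination concrete, here is a minimal sketch of the two losses in plain python, with no model involved. it is an illustration under stated assumptions, not the library's implementation: the function names and the toy vectors are made up, and the scale of 20 on the cosine similarities is an assumed value chosen to resemble common defaults.

```python
import math

def normalize(vec):
    """rescale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def in_batch_contrastive_loss(queries, docs, scale=20.0):
    """cross-entropy over scaled cosine similarities. docs[i] is the positive
    for queries[i]; every other doc in the batch acts as a negative.
    inputs are expected to be unit-length vectors."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [scale * sum(a * b for a, b in zip(q, d)) for d in docs]
        peak = max(logits)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_norm - logits[i]
    return total / len(queries)

def matryoshka_loss(queries, docs, dims, weights):
    """weighted sum of the base loss at each truncated dimension."""
    loss = 0.0
    for dim, weight in zip(dims, weights):
        q_trunc = [normalize(q[:dim]) for q in queries]
        d_trunc = [normalize(d[:dim]) for d in docs]
        loss += weight * in_batch_contrastive_loss(q_trunc, d_trunc)
    return loss

# toy batch of two (query, document) pairs with 4-d vectors (hypothetical values)
queries = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
docs = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]
print(matryoshka_loss(queries, docs, dims=[4, 2], weights=[1, 1]))
```

because each truncation contributes its own term, a model that only separates pairs at the full dimension still pays a penalty at the shorter prefixes, which is what pushes the discriminative signal toward the leading dimensions.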

training configuration

all three models were trained with the same hyperparameter setup: 4 epochs, batch size 32, gradient accumulation steps of 16 (giving an effective batch size of 512), learning rate 2e-5 with cosine scheduling, bf16 mixed precision, and flash attention enabled where supported. training was done using the sentence-transformers library with the matryoshka + contrastive loss combination described above.
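the arithmetic behind these settings is worth spelling out. the sketch below assumes a single device, no warmup, and that a partial final accumulation window still triggers an optimizer update; none of these are stated above, so treat the update counts as approximate. cosine_lr is a hypothetical helper that follows the standard cosine decay formula.

```python
import math

per_device_batch = 32
grad_accum_steps = 16
train_samples = 1456
epochs = 4
peak_lr = 2e-5

effective_batch = per_device_batch * grad_accum_steps
micro_batches_per_epoch = math.ceil(train_samples / per_device_batch)
# assumption: a partial final accumulation window still counts as one update
updates_per_epoch = math.ceil(micro_batches_per_epoch / grad_accum_steps)
total_updates = updates_per_epoch * epochs

def cosine_lr(step, total_steps, peak):
    """cosine decay from `peak` at step 0 down to 0 at `total_steps` (no warmup)."""
    return 0.5 * peak * (1.0 + math.cos(math.pi * step / total_steps))

print("effective batch size:", effective_batch)
print("optimizer updates in total:", total_updates)
for step in range(total_updates + 1):
    print(step, f"{cosine_lr(step, total_updates, peak_lr):.2e}")
```

under these assumptions the whole run amounts to roughly a dozen optimizer updates, since 1,456 samples fill fewer than three effective batches of 512 per epoch.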

the complete pipeline can be run with a few lines:

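import os

from finetune_embed import EmbeddingFineTuner  # module from the accompanying repository

fine_tuner = EmbeddingFineTuner(
    model_id="BAAI/bge-base-en-v1.5",
    dataset_id="axondendriteplus/legal-rag-embedding-dataset",
)
# read the huggingface token from the environment instead of hard-coding it
results = fine_tuner.run_complete_pipeline(hf_token=os.environ["HF_TOKEN"])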

results

the key finding is about dimensional efficiency. at 128 dimensions, the fine-tuned models lose only 7.41% retrieval accuracy compared to the full 768-dimension baseline -- while requiring 6x less storage and delivering proportionally faster similarity search. at 64 dimensions, the fine-tuned models can actually outperform the untuned base model at its full 768 dimensions, a 12x storage saving.
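using a shorter embedding at query time needs no extra machinery: keep the leading dimensions and re-normalize before comparing. the sketch below shows that step in plain python. the helper names and the 8-d toy vectors are made up for illustration and are not model output.

```python
import math

def truncate_and_normalize(vec, dim):
    """keep the first `dim` values and rescale to unit length."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """cosine similarity of two unit-length vectors is their dot product."""
    return sum(x * y for x, y in zip(a, b))

# toy 8-d "embeddings" (hypothetical values)
query = [0.9, 0.1, 0.3, 0.0, 0.05, 0.02, 0.01, 0.03]
doc_a = [0.8, 0.2, 0.4, 0.1, 0.00, 0.05, 0.02, 0.01]
doc_b = [0.1, 0.9, 0.0, 0.3, 0.04, 0.01, 0.06, 0.02]

for dim in (8, 4, 2):
    q = truncate_and_normalize(query, dim)
    sim_a = cosine(q, truncate_and_normalize(doc_a, dim))
    sim_b = cosine(q, truncate_and_normalize(doc_b, dim))
    print(dim, round(sim_a, 3), round(sim_b, 3))
```

in this toy case doc_a stays ahead of doc_b at every truncation because the distinguishing values sit in the leading dimensions, which is the property matryoshka training is meant to encourage.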

this has significant practical implications. vector databases scale in cost with dimensionality. if you can serve 128d embeddings instead of 768d embeddings with only a small quality loss, your index is 6x smaller, your search is faster, and your infrastructure costs drop accordingly. for a production legal rag system handling millions of document chunks, this matters.
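a quick back-of-the-envelope calculation shows the scale of the saving. the sketch assumes float32 vectors in a flat index and ignores metadata and any graph or quantization overhead; the corpus size of one million chunks is a hypothetical figure.

```python
def index_size_bytes(num_vectors, dim, bytes_per_value=4):
    """raw size of a flat float32 index, ignoring metadata and index overhead."""
    return num_vectors * dim * bytes_per_value

num_chunks = 1_000_000  # hypothetical corpus size
full_size = index_size_bytes(num_chunks, 768)

for dim in (768, 128, 64):
    size = index_size_bytes(num_chunks, dim)
    print(f"{dim}d: {size / 1e9:.3f} GB, {full_size / size:.0f}x reduction")
```

real vector databases add their own overhead on top of the raw vectors, so the absolute numbers will differ, but the ratio between dimensions carries over for the vector payload itself.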

the fine-tuned models also showed substantial improvements at full dimensionality across all three architectures, confirming that even small domain-specific datasets (under 1,500 samples) can meaningfully shift embedding quality when the data is representative of the target domain.

artifacts

the fine-tuned models are available on huggingface:

axondendriteplus/Legal-Embed-bge-base-en-v1.5
axondendriteplus/Legal-Embed-snowflake-arctic-embed-m-v2.0

the training dataset is at axondendriteplus/legal-rag-embedding-dataset. the training code, evaluation scripts, and full results are in the accompanying repository.

takeaways

domain-specific embedding fine-tuning is underrated. it requires modest compute, small datasets, and straightforward training loops -- but the retrieval quality gains are real and measurable. matryoshka training adds almost no overhead and gives you a free dimensionality knob to trade off quality against storage and speed. if you are building rag on specialized text and still using off-the-shelf embeddings, you are leaving significant performance on the table.