background

deepseek is a chinese ai lab backed by the quantitative hedge fund high-flyer, and deepseek-r1 is its open-weight reasoning model. despite the lab's unconventional origin, r1 reported benchmark results comparable to openai's o1 on reasoning tasks -- and the weights were released publicly. the most interesting part is not the model itself but the training methodology: a clear, reproducible recipe for making language models reason.

two versions, two philosophies

deepseek released two variants that illustrate fundamentally different approaches to building reasoning capability:

  • r1-zero -- trained with pure reinforcement learning (rl) on the base model, with no supervised fine-tuning (sft). the model learned to reason entirely through trial and error, discovering chain-of-thought patterns on its own. the results were impressive on benchmarks, but the reasoning traces were often hard to read, switching languages mid-thought or presenting valid logic in an unreadable format.
  • r1 -- trained with sft first, then rl on top. the sft stage gave the model a foundation of coherent, readable reasoning patterns. the rl stage then refined and improved that reasoning. the result was both correct and human-readable.

the lesson is important: pure rl can discover reasoning, but sft + rl produces reasoning that is actually usable. correctness without coherence is not useful in practice.

chain-of-thought reasoning

the core mechanism is chain-of-thought (cot) reasoning. instead of mapping directly from input to output, the model generates intermediate reasoning steps -- a visible thought process that breaks complex problems into manageable pieces. the model identifies what it knows, what it needs to figure out, considers approaches, checks its work, and arrives at a conclusion.

this mimics how humans solve hard problems. we do not look at a complex equation and produce an answer instantly. we decompose, substitute, simplify, check. chain-of-thought training teaches models to do the same. the practical benefit is twofold: the model makes fewer errors on complex tasks, and when it does make errors, the reasoning trace makes them diagnosable.
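as an illustration of why a visible trace makes errors diagnosable, the trace can be separated from the final answer mechanically. this is a minimal sketch, assuming the trace is wrapped in <think> ... </think> tags (the convention the released r1 models follow); the split_reasoning helper and the sample text are hypothetical and exist only for this example.

```python
import re
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    reasoning: str  # text inside the think tags ("" if absent)
    answer: str     # everything after the closing tag


# assumption: the reasoning trace is wrapped in <think>...</think>
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> ParsedOutput:
    """separate the visible thought process from the final answer."""
    match = THINK_RE.search(text)
    if match is None:
        # no trace found: treat the whole output as the answer
        return ParsedOutput(reasoning="", answer=text.strip())
    return ParsedOutput(
        reasoning=match.group(1).strip(),
        answer=text[match.end():].strip(),
    )


# hypothetical model output for "solve 3x + 5 = 20"
sample = (
    "<think>\n"
    "known: 3x + 5 = 20. subtract 5: 3x = 15. divide by 3: x = 5.\n"
    "check: 3 * 5 + 5 = 20. ok.\n"
    "</think>\n"
    "x = 5"
)
parsed = split_reasoning(sample)
print("reasoning:", parsed.reasoning)
print("answer:", parsed.answer)
```

if the answer were wrong, the reasoning field would show which step (decompose, substitute, simplify, check) went off track.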

reinforcement learning components

the rl training pipeline has several key components:

reward functions

reward functions provide the incentive signal that drives learning. for mathematical reasoning, the reward is straightforward: did the model get the correct answer? for more open-ended tasks, reward functions evaluate reasoning quality, format compliance, and answer correctness. the danger is reward hacking -- the model finds shortcuts that score well on the reward function but produce useless outputs. deepseek mitigates this by favoring simple rule-based rewards (answer correctness, output format) where it can, and by adding a kl divergence constraint.
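a rule-based reward for math problems might look like the sketch below. it assumes the <think> tag convention of the released r1 models and exact-match answers; the function names and the 0.5 format weight are hypothetical, not deepseek's actual values.

```python
import re

# assumption: a well-formed output is a non-empty <think> block
# followed by a non-empty final answer
FORMAT_RE = re.compile(r"^<think>.+?</think>\s*\S.*$", re.DOTALL)


def format_reward(output: str) -> float:
    """1.0 if the output follows the expected layout, else 0.0."""
    return 1.0 if FORMAT_RE.match(output.strip()) else 0.0


def extract_answer(output: str) -> str:
    """take whatever follows the last closing think tag as the answer."""
    return output.split("</think>")[-1].strip()


def accuracy_reward(output: str, expected: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if extract_answer(output) == expected.strip() else 0.0


def total_reward(output: str, expected: str, format_weight: float = 0.5) -> float:
    """correctness plus a smaller bonus for format compliance."""
    return accuracy_reward(output, expected) + format_weight * format_reward(output)


good = "<think>3x = 15, so x = 5</think>\nx = 5"
wrong = "<think>just guessing</think>\nx = 4"
for candidate in (good, wrong):
    print(repr(candidate), "->", total_reward(candidate, "x = 5"))
```

note that the well-formatted wrong answer still collects the format bonus. if that bonus were weighted too heavily, a policy could learn to produce tidy but incorrect outputs, which is reward hacking in miniature.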

policy

the policy (denoted as pi-theta in the literature) is the model's decision-making strategy for selecting output tokens: at each step it assigns a probability to every possible next token. training updates the policy to favor token sequences that lead to higher rewards. for r1, the policy is initialized from the sft checkpoint, giving it a strong starting point; r1-zero starts from the base model instead.
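concretely, the policy's probability for a whole response is the product of its per-token probabilities, usually handled as a sum of log-probabilities. the sketch below uses a toy 3-token vocabulary and made-up logits; in a real model the logits would come from the network's forward pass.

```python
import math
from typing import Sequence


def softmax(logits: Sequence[float]) -> list[float]:
    """turn raw scores into a probability distribution over tokens."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sequence_log_prob(
    step_logits: Sequence[Sequence[float]],
    token_ids: Sequence[int],
) -> float:
    """log pi_theta(response): sum of log-probs of each chosen token."""
    return sum(
        math.log(softmax(logits)[token])
        for logits, token in zip(step_logits, token_ids)
    )


# toy example: 3-token vocabulary, 2-token response (values are made up)
step_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
log_p = sequence_log_prob(step_logits, token_ids=[0, 1])
print(f"log pi_theta(response) = {log_p:.3f}")
```

rl training nudges the logits so that this quantity rises for high-reward responses and falls for low-reward ones.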

grpo

deepseek uses group relative policy optimization (grpo) rather than standard ppo. for each training query, the model generates a group of candidate responses. each response is scored, and its advantage is computed relative to the rest of the group: its reward minus the group's mean reward, divided by the group's standard deviation. the model is updated to increase the probability of above-average responses and decrease the probability of below-average ones. because the group itself serves as the baseline, there is no need for a separate value model, which reduces compute and memory requirements significantly.
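one common way to turn a group's scores into a training signal is to normalize each reward against the group's own statistics. a minimal sketch, assuming the population standard deviation and a small epsilon to avoid division by zero (both are choices made for this example):

```python
from statistics import mean, pstdev
from typing import Sequence


def group_relative_advantages(
    rewards: Sequence[float],
    eps: float = 1e-8,  # hypothetical guard against a zero std
) -> list[float]:
    """advantage of each response = (reward - group mean) / group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# four sampled responses to one query, scored 1.0 if correct else 0.0
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
for reward, adv in zip(rewards, advantages):
    print(f"reward={reward:.1f}  advantage={adv:+.3f}")
```

when every response in a group gets the same reward, all advantages are zero and that query contributes no gradient, so queries the model always or never solves teach it nothing.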

clipping and kl divergence

two mechanisms prevent training instability. clipping bounds how much the policy's token probabilities can change in a single update step, preventing catastrophic jumps. the kl divergence term measures how far the current policy has drifted from a reference model (typically the sft checkpoint) and penalizes the model in proportion to that drift. together, these keep the model from becoming overconfident in narrow patterns or exploiting weaknesses in the reward function.
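both mechanisms can be written down per token. the sketch below uses the ppo-style clipped ratio and a non-negative kl estimator of the form r - log r - 1; the clip_eps and beta values are placeholders, not deepseek's settings.

```python
import math


def clipped_term(
    logp_new: float,
    logp_old: float,
    advantage: float,
    clip_eps: float = 0.2,  # hypothetical clip range
) -> float:
    """ppo-style clipped surrogate for one token."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    # taking the min removes any gain from moving past the clip range
    return min(ratio * advantage, clipped * advantage)


def kl_estimate(logp_new: float, logp_ref: float) -> float:
    """r - log r - 1 with r = pi_ref / pi_theta; zero when they agree."""
    log_r = logp_ref - logp_new
    return math.exp(log_r) - log_r - 1.0


def token_objective(
    logp_new: float,
    logp_old: float,
    logp_ref: float,
    advantage: float,
    clip_eps: float = 0.2,
    beta: float = 0.04,  # hypothetical kl penalty weight
) -> float:
    """quantity to maximize: clipped gain minus the drift penalty."""
    gain = clipped_term(logp_new, logp_old, advantage, clip_eps)
    return gain - beta * kl_estimate(logp_new, logp_ref)


# a token whose probability rose 50% since the last policy snapshot
value = token_objective(
    logp_new=math.log(0.3),
    logp_old=math.log(0.2),
    logp_ref=math.log(0.2),
    advantage=1.0,
)
print(f"objective = {value:.4f}")
```

in the example the ratio is 1.5, but the gain is capped as if it were 1.2, so the update has no incentive to push this token's probability further in a single step.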

knowledge distillation

deepseek's full r1 model has 671 billion parameters -- far too large for most deployment scenarios. knowledge distillation transfers the reasoning capability from this teacher model into smaller student models ranging from 1.5b to 70b parameters, built on existing qwen and llama base models. the teacher generates reasoning traces, and the student is fine-tuned to reproduce them. the distilled models retain a surprising amount of the teacher's reasoning ability at a fraction of the compute cost.
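the data side of this kind of distillation can be sketched as turning teacher outputs into ordinary supervised examples. everything named here (TeacherSample, build_sft_examples, the <think> layout, and the choice to drop traces that end in a wrong answer) is an assumption for illustration, not a description of deepseek's pipeline.

```python
from dataclasses import dataclass


@dataclass
class TeacherSample:
    prompt: str    # question given to the teacher
    trace: str     # teacher's reasoning steps
    answer: str    # teacher's final answer
    expected: str  # reference answer used for filtering


def build_sft_examples(samples: list[TeacherSample]) -> list[dict[str, str]]:
    """keep correct teacher outputs and format them as prompt/completion pairs."""
    examples = []
    for s in samples:
        # assumption: discard traces that end in a wrong answer
        if s.answer.strip() != s.expected.strip():
            continue
        completion = f"<think>\n{s.trace}\n</think>\n{s.answer}"
        examples.append({"prompt": s.prompt, "completion": completion})
    return examples


# made-up teacher outputs: one correct, one incorrect
samples = [
    TeacherSample("what is 2 + 2?", "2 plus 2 is 4.", "4", "4"),
    TeacherSample("what is 3 * 3?", "3 times 3 is 6.", "6", "9"),
]
examples = build_sft_examples(samples)
print(len(examples), "example(s) kept")
print(examples[0]["completion"])
```

the student is then fine-tuned on these pairs with a standard next-token loss, so it learns to emit the reasoning trace as well as the answer.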

running it yourself

the distilled models are practical to run locally. hugging face hosts the weights, and ollama provides one-command deployment. consumer gpus with 24 gb of memory (rtx 3090, rtx 4090) can hold the 8b variant at 16-bit precision and the 14b variant once it is quantized, since 14b parameters at 2 bytes each already exceed 24 gb. for higher throughput, vllm serves the models with continuous batching; expect on the order of 20-30 tokens per second on appropriate hardware, though the exact figure depends on the gpu, model size, and batch size. the barrier to entry for using reasoning models is now very low -- the real barrier is in training new ones, and deepseek's paper provides a clear enough recipe that even that is increasingly accessible.
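a quick way to check whether a variant fits on a given card is to estimate the memory its weights need. this back-of-the-envelope sketch counts weights only; the 20% overhead factor for activations and the kv cache is a rough assumption, and real usage varies with context length and runtime.

```python
def weight_memory_gb(params_billion: float, bits_per_param: int) -> float:
    """memory for the weights alone, in gb (1 gb = 10^9 bytes)."""
    return params_billion * bits_per_param / 8


def fits_on_gpu(
    params_billion: float,
    bits_per_param: int,
    vram_gb: float,
    overhead: float = 1.2,  # hypothetical allowance for kv cache etc.
) -> bool:
    """rough check: weights plus overhead must fit in gpu memory."""
    return weight_memory_gb(params_billion, bits_per_param) * overhead <= vram_gb


# 24 gb card (rtx 3090 / rtx 4090 class)
for size, bits in [(8, 16), (14, 16), (14, 4)]:
    need = weight_memory_gb(size, bits)
    verdict = "fits" if fits_on_gpu(size, bits, vram_gb=24) else "does not fit"
    print(f"{size}b at {bits}-bit: {need:.1f} gb of weights, {verdict}")
```

under these assumptions the 8b variant fits at 16-bit precision, while the 14b variant fits on a 24 gb card only when quantized.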