rolling your own reasoner: grpo + sft from scratch

what are reasoning models

reasoning models are language models that engage in step-by-step deliberation before generating their final output. instead of committing to an answer right away, they first generate intermediate chain-of-thought tokens -- breaking a problem into sub-problems, evaluating alternatives, catching their own errors, and arriving at a conclusion through a process that resembles structured human thought.

the practical difference is significant. standard language models tend to answer directly, which works for familiar patterns but breaks down when a problem needs several dependent steps. reasoning models spend tokens working through those steps before answering. this makes them markedly better at tasks that require multi-step logic: mathematical proofs, code debugging, legal analysis, scientific reasoning. the cost is latency and tokens -- thinking out loud is expensive -- but for tasks where correctness matters more than speed, the tradeoff is usually worth it.

supervised fine-tuning: laying the foundation

the first stage of building a custom reasoning model is supervised fine-tuning (sft). this is conceptually straightforward: you compile a dataset of queries paired with chain-of-thought reasoning steps and final responses, then train the model to reproduce that reasoning pattern.

the dataset is the hard part. each example needs a query, a detailed reasoning trace showing how to arrive at the answer, and the final response. for domain-specific reasoning (legal, medical, financial), you typically need domain experts to validate the reasoning chains, or you generate them using a stronger model and curate aggressively.
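
a minimal sketch of what one such example might look like in code, assuming a plain-text format with <think> delimiters around the trace. the class name, the delimiters, and the length threshold are illustrative assumptions, not a required format; use whatever your base model's chat template expects.

```python
from dataclasses import dataclass


@dataclass
class ReasoningExample:
    query: str      # the question posed to the model
    reasoning: str  # the step-by-step trace leading to the answer
    response: str   # the final answer shown to the user


def format_example(ex: ReasoningExample) -> str:
    """render one example as a single training string.

    the <think>...</think> delimiters are an assumed convention.
    """
    return (
        f"user: {ex.query.strip()}\n"
        f"assistant: <think>\n{ex.reasoning.strip()}\n</think>\n"
        f"{ex.response.strip()}"
    )


def is_usable(ex: ReasoningExample, min_reasoning_chars: int = 20) -> bool:
    """cheap curation filter: reject empty fields and trivially short traces."""
    return (
        bool(ex.query.strip())
        and bool(ex.response.strip())
        and len(ex.reasoning.strip()) >= min_reasoning_chars
    )


example = ReasoningExample(
    query="what is 17 * 6?",
    reasoning="17 * 6 = (10 * 6) + (7 * 6) = 60 + 42 = 102.",
    response="102",
)
```

a filter like is_usable only catches structural problems. whether the trace is actually correct still needs an expert or a stronger model to check.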

dataset size depends on the task. as a rough rule of thumb, a minimum viable dataset is around 1,000 examples, while production-quality reasoning models benefit from tens or hundreds of thousands. quality matters more than quantity -- a small dataset of carefully constructed reasoning traces will usually outperform a large dataset of shallow or incorrect ones.

training itself uses standard gradient descent, often with lora (low-rank adaptation) for parameter-efficient fine-tuning. lora freezes the base model's weights and trains only small low-rank adapter matrices added alongside selected layers, which cuts the number of trainable parameters and the optimizer memory dramatically. combined with quantization, this can make fine-tuning large models feasible on consumer hardware. the goal of sft is not to produce a finished reasoning model, but to give the model a strong foundation for the reinforcement learning stage that follows.
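
to make the lora idea concrete, here is a small pure-python sketch of the low-rank update and the parameter savings. the function names and the example shapes (a 4096 x 4096 layer, rank 16) are assumptions for illustration; real training would use a library implementation rather than nested lists.

```python
def matvec(matrix, vector):
    """multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha, rank):
    """compute W x + (alpha / rank) * B (A x).

    W is the frozen base weight (d_out x d_in).
    A (rank x d_in) and B (d_out x rank) are the trainable adapters.
    """
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_out, d_in, rank):
    """return (parameters in the full weight, parameters in the adapters)."""
    full = d_out * d_in
    adapter = rank * (d_in + d_out)
    return full, adapter


full, adapter = lora_param_counts(4096, 4096, 16)
print(f"full: {full:,}  adapter: {adapter:,}  ratio: {adapter / full:.2%}")
```

with B initialised to zeros, the adapted layer starts out identical to the base layer, so training begins from the base model's behavior.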

grpo: group relative policy optimization

grpo is where the model learns to reason well, not just reproduce reasoning patterns. group relative policy optimization works by generating multiple candidate responses for each query, scoring them with a reward function, and updating the model to favor higher-scoring outputs.

the mechanism is specific: for each training query, the model generates a group of candidate responses (typically 4-16). a reward function scores each response. instead of using absolute reward values, grpo normalizes rewards within each group: each response's advantage is its reward minus the group mean, divided by the group standard deviation. responses that beat the group average get a positive advantage, and those below it get a negative one. the model is then updated to increase the probability of positive-advantage responses and decrease the probability of negative-advantage ones.
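
the normalization step is small enough to sketch directly. this is a minimal version, assuming population standard deviation and a small epsilon to avoid dividing by zero; both are choices made here for illustration, and implementations differ on them.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """normalize rewards within one group of responses to the same query.

    advantage_i = (reward_i - group mean) / (group std + eps)
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# four sampled responses to one query, scored by some reward function
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

one consequence worth noticing: if every response in a group gets the same reward, all advantages are zero and that query contributes no learning signal for the step.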

the critical difference between grpo and standard ppo (proximal policy optimization) is efficiency. ppo requires a separate value model -- an entire second neural network trained to estimate expected rewards. grpo eliminates this by grouping outputs and computing relative advantages directly from the group's rewards. it also adds a kl divergence penalty between the current policy and a reference model directly to the loss, which keeps the model from drifting too far from its sft foundation. this limits reward hacking, where the model finds degenerate outputs that score well on the reward function but are actually useless, though it does not rule it out.
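
a rough per-token sketch of how the advantage and the kl penalty combine into a loss, assuming the clipped surrogate that grpo carries over from ppo and a non-negative kl estimator of the form exp(d) - d - 1. the function name and the clip_eps and beta values are illustrative assumptions, not recommended settings.

```python
import math


def grpo_token_loss(logp_new, logp_old, logp_ref, advantage,
                    clip_eps=0.2, beta=0.04):
    """loss for one token, given its log-probability under three policies.

    logp_new: current policy being trained
    logp_old: policy that sampled the response
    logp_ref: frozen reference model (e.g. the sft checkpoint)
    """
    # ppo-style clipped surrogate, using the group-relative advantage
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    surrogate = min(ratio * advantage, clipped * advantage)

    # kl penalty toward the reference model; always >= 0
    d = logp_ref - logp_new
    kl = math.exp(d) - d - 1.0

    # we minimize the loss, so negate the objective
    return -(surrogate - beta * kl)
```

note that there is no value network anywhere in this function: the advantage is computed from the group's rewards and passed in.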

practical considerations

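hardware requirements are non-trivial. as a rough guide, fine-tuning an 8b parameter model with grpo needs on the order of 48gb of vram -- a single 48gb card such as an a6000, or two 24gb consumer gpus with model parallelism. the exact figure depends heavily on whether you use lora and quantization, on sequence length, and on group size. weight memory scales roughly linearly with parameter count, but activation and generation memory also grow with sequence length and the number of responses sampled per query.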
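
a back-of-envelope sketch for the weight portion of that budget only. it deliberately ignores activations, optimizer state, the generation cache, and any separate reference model copy, so treat the result as a lower bound; the function name and dtype table are assumptions made for illustration.

```python
# bytes needed to store one parameter at each precision
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params, dtype):
    """memory for model weights alone, in gb (1 gb = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


for dtype in ("fp32", "bf16", "int4"):
    print(dtype, weight_memory_gb(8e9, dtype), "gb")
```

an 8b model in bf16 already takes 16gb for weights alone, which is why the remaining budget disappears quickly once sampling and gradients are added.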

reward function design is the most important and most difficult part of the pipeline. for mathematical reasoning, you can verify answers programmatically. for open-ended domains, you need custom reward functions that evaluate reasoning quality, not just final answer correctness. a model that gets the right answer through flawed reasoning is fragile -- it will fail on slightly different problems.
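
for the programmatically verifiable case, a reward function can be very small. this sketch assumes the model is prompted to end with a line like "answer: 102" and that answers are plain numbers; the regex, the function names, and the binary 0/1 scoring are all illustrative assumptions. it checks only the final answer, not the quality of the reasoning.

```python
import re

# matches e.g. "answer: 102" or "Answer: -3.5"
ANSWER_RE = re.compile(r"answer:\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(completion):
    """return the last stated numeric answer, or None if there is none."""
    matches = ANSWER_RE.findall(completion)
    return float(matches[-1]) if matches else None


def math_reward(completion, expected, tol=1e-6):
    """1.0 if the final answer matches the expected value, else 0.0."""
    predicted = extract_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if abs(predicted - expected) <= tol else 0.0


completion = "17 * 6 = 60 + 42 = 102.\nanswer: 102"
reward = math_reward(completion, expected=102)
```

taking the last match rather than the first lets the model revise an earlier guess inside its reasoning without being penalized for it.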

data quality remains critical through both stages. garbage reasoning traces in sft produce garbage reasoning foundations. noisy reward signals in grpo produce unstable training. deepseek's r1 work illustrates this: r1-zero was trained with rl alone (no sft stage) and reached strong accuracy, but its reasoning was often hard to read and mixed languages. r1 added a small cold-start sft stage before rl, producing reasoning that was both correct and readable. a later sft round in the r1 pipeline used roughly 600,000 reasoning examples and 200,000 non-reasoning examples.

the pipeline

the practical pipeline for building a custom reasoning model is:

1. curate a domain-specific dataset of queries + reasoning traces + answers
2. run sft to give the model a reasoning foundation
3. design reward functions that evaluate reasoning quality and answer correctness
4. run grpo to refine the model's reasoning through reinforcement learning
5. evaluate on held-out test sets, iterate on reward functions and data

the tooling exists. trl (transformer reinforcement learning) from hugging face ships a grpo trainer. unsloth provides memory-efficient training for consumer hardware. the bottleneck is usually not compute or code -- it is data quality and reward function design. get those right, and even small models can develop surprisingly strong reasoning capabilities in specialized domains.