grpo: reinforcement learning without a reward model

group relative policy optimization (grpo) is a reinforcement learning method that can be run without a separately trained reward model. the algorithm itself drops the learned value function (critic) that ppo relies on; paired with programmatic or llm-based reward functions, it also skips the reward-model stage of the standard rlhf pipeline. instead of training a reward model and then optimizing against it, grpo generates multiple completions per prompt, scores them with deterministic or llm-based reward functions, and optimizes the policy using group-relative advantage estimation.

the core idea: for each prompt, sample a group of completions from the current policy. score each completion. compute advantages relative to the group mean, scaled by the group's standard deviation. update the policy to increase the probability of above-average completions and decrease that of below-average ones. there is no separate reward model to train or maintain, and no learned reward model for the policy to exploit.

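# grpo advantage computation
import numpy as np

def compute_group_advantages(rewards):
    # rewards: scores for all completions sampled from one prompt
    rewards = np.asarray(rewards, dtype=float)
    mean_reward = rewards.mean()
    # epsilon avoids division by zero when every completion gets the same reward
    std_reward = rewards.std() + 1e-8
    advantages = (rewards - mean_reward) / std_reward
    return advantages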

this is particularly attractive for small models where training a reliable reward model would require more data and compute than the policy training itself.
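to make the update step concrete, here is a minimal numpy sketch of how group-relative advantages could enter a ppo-style clipped objective. it is a simplification, not the training code used for these models: it works on sequence-level log-probabilities, omits the kl penalty that grpo implementations usually add, and the names grpo_surrogate_loss and clip_eps are illustrative.

```python
import numpy as np

def grpo_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """simplified clipped surrogate loss for one group of completions (illustrative)."""
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    advantages = np.asarray(advantages, dtype=float)

    # probability ratio between the updated policy and the policy that sampled the group
    ratio = np.exp(logp_new - logp_old)
    clipped_ratio = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # take the more pessimistic of the unclipped and clipped terms, as in ppo
    objective = np.minimum(ratio * advantages, clipped_ratio * advantages)

    # negate because optimizers minimize, while the objective should be maximized
    return -objective.mean()
```

the clipping caps how much a single step can raise the probability of a high-advantage completion, which keeps the update close to the policy that generated the samples.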

lora: making 3b trainable

llama 3.2-3b has roughly 3 billion parameters. full fine-tuning updates every weight and must also hold gradients and optimizer state for all of them, which often exceeds the memory of a single gpu. lora (low-rank adaptation) reduces the trainable parameter count by injecting small rank-decomposition matrices into each attention layer while freezing the original weights.

with rank-64 lora matrices applied to the query, key, value, and output projections, the trainable parameter count drops from ~3b to ~37m -- approximately 1.2% of the original model. this makes single-gpu training feasible and keeps the base model's general capabilities intact while learning new behaviors.

lora_config = {
    "r": 64,
    "lora_alpha": 128,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "lora_dropout": 0.05,
    "task_type": "CAUSAL_LM",
}
# trainable params: ~37M / 3.2B total (1.2%)
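the ~37m figure can be reproduced with a quick count. a lora adapter on a linear layer with d_in inputs and d_out outputs adds r * (d_in + d_out) parameters. the dimensions below (28 layers, hidden size 3072, 24 query heads and 8 key/value heads of size 128) are assumed from the published llama 3.2-3b config and are not stated in this post, so verify them against the model's config.json.

```python
# assumed llama 3.2-3b dimensions (verify against config.json)
NUM_LAYERS = 28
HIDDEN = 3072
HEAD_DIM = 128
NUM_Q_HEADS = 24
NUM_KV_HEADS = 8
RANK = 64

# (d_in, d_out) for each targeted projection
projections = {
    "q_proj": (HIDDEN, NUM_Q_HEADS * HEAD_DIM),
    "k_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "v_proj": (HIDDEN, NUM_KV_HEADS * HEAD_DIM),
    "o_proj": (NUM_Q_HEADS * HEAD_DIM, HIDDEN),
}

def lora_params(d_in, d_out, rank):
    # lora adds two matrices: A (rank x d_in) and B (d_out x rank)
    return rank * (d_in + d_out)

per_layer = sum(lora_params(d_in, d_out, RANK) for d_in, d_out in projections.values())
total = per_layer * NUM_LAYERS
print(f"{total:,} trainable lora parameters")  # 36,700,160 under the assumed dimensions
```

under these assumptions the count comes to about 36.7m, which matches the ~37m quoted above.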

model 1: llama-3.2-3b-ggk (math reasoning)

the first model is trained on gsm8k, a dataset of grade-school math word problems. the reward functions are entirely deterministic -- no llm judge needed. three reward signals are combined:

  • format compliance: does the output follow the expected structure with a reasoning chain ending in a boxed answer? binary reward.
  • numerical accuracy: does the final numerical answer match the ground truth? binary reward, strict equality.
  • reasoning structure: does the completion show step-by-step work rather than jumping to the answer? scored on the presence of intermediate calculations.

# combined math reward (weights: 0.2 format, 0.6 accuracy, 0.2 structure)
def math_reward(completion, ground_truth):
    score = 0.0
    # format compliance
    if has_valid_format(completion):
        score += 0.2
    # numerical accuracy
    if extract_answer(completion) == ground_truth:
        score += 0.6
    # reasoning structure
    if count_reasoning_steps(completion) >= 2:
        score += 0.2
    return score
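the helpers has_valid_format, extract_answer, and count_reasoning_steps are not shown in this post. the sketch below is one hypothetical way to write them with regular expressions, assuming the answer is wrapped in \boxed{...} and that ground_truth is passed as a float. the patterns and the definition of a "step" are illustrative guesses, not the implementation used for training.

```python
import re

# matches \boxed{...} with no nested braces
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")
# a line containing an arithmetic expression followed by "=", e.g. "16 - 3 = 13"
STEP_RE = re.compile(r"\d\s*[-+*/x]\s*\d.*=")

def extract_answer(completion):
    """return the last boxed value as a float, or None if absent or non-numeric."""
    matches = BOXED_RE.findall(completion)
    if not matches:
        return None
    # drop thousands separators and currency signs before parsing
    raw = matches[-1].replace(",", "").replace("$", "").strip()
    try:
        return float(raw)
    except ValueError:
        return None

def has_valid_format(completion):
    """true if reasoning text precedes exactly one boxed answer that ends the output."""
    matches = list(BOXED_RE.finditer(completion))
    if len(matches) != 1:
        return False
    match = matches[0]
    before = completion[:match.start()].strip()
    after = completion[match.end():].strip()
    return bool(before) and after in ("", ".")

def count_reasoning_steps(completion):
    """count lines that show an intermediate calculation."""
    return sum(1 for line in completion.splitlines() if STEP_RE.search(line))
```

parsing both the prediction and the ground truth to floats keeps the strict equality check from failing on formatting differences such as "1,250" versus "1250".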

the deterministic rewards are key -- they provide a clean, unambiguous signal. the same completion always receives the same score, so the model learns quickly without having to average out noise in the reward function. after training, the model shows improved step-by-step reasoning on math problems, with fewer arithmetic errors and more structured outputs.

model 2: llama-3.2-3b-glr (legal qa)

the second model targets legal question answering, where correctness cannot be verified with a simple equality check. here the reward function uses claude and gemini as evaluators, scoring completions on four dimensions:

  • accuracy: is the legal information factually correct?
  • completeness: does the answer address all aspects of the question?
  • legal soundness: are the legal principles and reasoning valid?
  • clarity: is the answer well-structured and understandable?

using two different evaluator models provides a cross-check. if both claude and gemini agree that a completion is high quality, the signal is more reliable than a score from a single evaluator. the scores are averaged across dimensions and evaluators to produce the final reward.

this is more expensive per training step than deterministic rewards, but it enables training on domains where ground truth cannot be checked mechanically: legal reasoning, medical advice, policy analysis, or any domain where "correct" requires judgment rather than lookup.
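the averaging step for the legal reward can be sketched as follows. the function name combine_judge_scores, the dictionary layout, and the 0-1 score scale are assumptions for illustration; the calls to the evaluator models are left out entirely, and the example scores are made up.

```python
from statistics import mean

DIMENSIONS = ("accuracy", "completeness", "legal_soundness", "clarity")

def combine_judge_scores(scores_by_evaluator):
    """average per-dimension scores within each evaluator, then across evaluators."""
    per_evaluator = []
    for evaluator, scores in scores_by_evaluator.items():
        # fail loudly if a judge response did not cover every dimension
        missing = [d for d in DIMENSIONS if d not in scores]
        if missing:
            raise ValueError(f"{evaluator} is missing scores for: {missing}")
        per_evaluator.append(mean(scores[d] for d in DIMENSIONS))
    return mean(per_evaluator)

# hypothetical scores on a 0-1 scale
example = {
    "claude": {"accuracy": 0.8, "completeness": 0.6, "legal_soundness": 0.7, "clarity": 0.9},
    "gemini": {"accuracy": 0.6, "completeness": 0.6, "legal_soundness": 0.6, "clarity": 0.6},
}
reward = combine_judge_scores(example)
```

because every evaluator scores the same four dimensions, averaging within each evaluator first gives the same result as a flat average over all eight scores.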

training configuration

training_args = {
    "learning_rate": 5e-6,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 4,
    "num_generations": 4,  # completions per prompt
    "max_completion_length": 1024,
    "num_train_epochs": 3,
    "warmup_ratio": 0.1,
    "bf16": True,
}

a batch size of 1 with 4 gradient accumulation steps gives an effective batch size of 4 prompts. with 4 generations per prompt, each optimization step considers 16 completions in total. the low learning rate (5e-6) helps guard against catastrophic forgetting of the base model's capabilities while still allowing the lora weights to learn the new reward-aligned behavior.
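the arithmetic behind those numbers, written out. this assumes a single gpu and that the batch size counts prompts, as the paragraph above does; some grpo trainers count completions instead, in which case the totals differ.

```python
# values copied from training_args above
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
num_generations = 4
num_devices = 1  # assumption: single gpu

# prompts contributing to one optimizer step
prompts_per_step = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# each prompt is expanded into a group of sampled completions
completions_per_step = prompts_per_step * num_generations

print(prompts_per_step, completions_per_step)  # 4 16
```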

results and availability

both models show measurable improvement over the base llama 3.2-3b on their respective tasks. the math model (ggk) produces more structured reasoning chains and fewer numerical errors. the legal model (glr) generates more comprehensive and legally sound answers as rated by independent evaluators.

the key takeaway is that grpo + lora makes rl training accessible at small model scales. you do not need a 70b model or a cluster of gpus. a single gpu, a good reward function, and a few hours of training can meaningfully improve a 3b model's reasoning on targeted tasks.

both models are available on huggingface: axondendriteplus/llama-3.2-3B-GGK and axondendriteplus/llama-3.2-3B-GLR.