Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

Best AI papers explained - A podcast by Enoch H. Kang

The academic paper critically examines whether Reinforcement Learning with Verifiable Rewards (RLVR) genuinely enhances the reasoning capabilities of large language models (LLMs) beyond their base models, particularly on tasks like mathematics and coding. Surprisingly, the authors find that while RLVR improves sampling efficiency for correct responses, yielding better performance when only a few samples are drawn (pass@k at small k), it does not generate fundamentally new reasoning patterns or expand the overall range of problems the LLM can potentially solve. In fact, comprehensive analysis using the pass@k metric at large k values reveals that base models often retain a broader scope of solvable problems than their RLVR-trained counterparts. This suggests that the reasoning capacity of current RLVR models is bounded by that of the pre-trained base model, and that their success stems primarily from optimizing existing reasoning paths rather than discovering novel strategies. In contrast, the study notes that distillation from a stronger model can introduce new reasoning patterns and genuinely expand the model's capabilities.
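Because the argument hinges on how pass@k behaves at small versus large k, the minimal sketch below shows the standard unbiased pass@k estimator commonly used for this kind of evaluation (as popularized by Chen et al., 2021). The function name and the sample counts in the usage example are illustrative assumptions, not figures from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k completions,
    drawn without replacement from n sampled completions of which c are
    correct, solves the problem."""
    if n - c < k:
        # Every size-k draw must contain at least one correct completion.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts out of n = 256 samples on one problem:
# a model that concentrates probability on known-good paths scores well
# at k = 1, while a model that solves the problem only rarely still
# approaches pass@k = 1 as k grows large.
print(pass_at_k(n=256, c=64, k=1))    # ~0.25
print(pass_at_k(n=256, c=4,  k=1))    # ~0.016
print(pass_at_k(n=256, c=4,  k=256))  # 1.0
```

This is why the two regimes can rank models differently: small-k pass@k rewards concentrating probability mass on already-solvable problems (what RLVR optimizes), while large-k pass@k measures the total set of problems a model can solve at all, which the paper argues stays bounded by the base model.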