Stable and Efficient Single-Rollout RL for Multimodal Reasoning

1 Tencent AI Lab, Bellevue
2 University of Maryland, College Park
3 University of Virginia
4 University of Notre Dame

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has become a key paradigm for improving the reasoning capabilities of Multimodal Large Language Models (MLLMs). However, prevalent group-based algorithms such as GRPO require multi-rollout sampling for each prompt. While more efficient single-rollout variants have recently been explored in text-only settings, we find that they suffer from severe instability in multimodal contexts, often leading to training collapse. To address this trade-off between training efficiency and stability, we introduce MSSR (Multimodal Stabilized Single-Rollout), a group-free RLVR framework that achieves both stable optimization and effective multimodal reasoning performance. MSSR does so via an entropy-based advantage-shaping mechanism that adaptively regularizes advantage magnitudes, preventing collapse and maintaining training stability. While such mechanisms have been used in group-based RLVR, we show that in the multimodal single-rollout setting they are not merely beneficial but essential for stability. In in-distribution evaluations, MSSR demonstrates superior training compute efficiency, reaching validation accuracy similar to the group-based baseline in half the training steps. When trained for the same number of steps, MSSR surpasses the group-based baseline and shows consistent generalization improvements across five diverse reasoning-intensive benchmarks. Together, these results demonstrate that MSSR enables stable, compute-efficient, and effective RLVR for complex multimodal reasoning tasks.

Approach

Overview of the proposed MSSR approach. Given a multimodal input, i.e., an image and the corresponding question, we generate a single rollout through the policy model. We then use a Beta distribution to estimate the baseline value \( v \), compute the advantage \( A \), and normalize it across the batch. Finally, we propose entropy-based advantage shaping to preserve entropy and stabilize training.
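To make this pipeline concrete, here is a minimal Python sketch of the single-rollout advantage computation described above. The Beta-posterior baseline update, the shaping function shape_advantage, and the hyperparameters alpha0, beta0, and kappa are illustrative assumptions, not the paper's exact formulation (which may, e.g., maintain the Beta estimate per prompt or with discounting, and use a different shaping rule):

```python
import numpy as np

class BetaBaseline:
    """Running Beta-posterior estimate of the success rate, used as the
    baseline value v for single-rollout advantages. The prior (alpha0,
    beta0) and the global (rather than per-prompt) update are assumptions."""
    def __init__(self, alpha0: float = 1.0, beta0: float = 1.0):
        self.alpha, self.beta = alpha0, beta0

    def value(self) -> float:
        return self.alpha / (self.alpha + self.beta)  # posterior mean

    def update(self, reward: float):
        # Binary verifiable reward: 1 if the rollout is correct, else 0.
        self.alpha += reward
        self.beta += 1.0 - reward

def shape_advantage(adv: np.ndarray, entropy: np.ndarray,
                    kappa: float = 1.0) -> np.ndarray:
    """Entropy-based advantage shaping (assumed form): damp advantage
    magnitudes on low-entropy tokens so the policy keeps exploring."""
    return adv * np.tanh(kappa * entropy)

# One training step over a batch of (prompt, single rollout) pairs.
baseline = BetaBaseline()
rewards = np.array([1.0, 0.0, 1.0, 1.0])   # verifiable 0/1 rewards
token_entropy = np.random.rand(4, 16)      # per-token policy entropy

v = baseline.value()
adv = rewards - v                                   # advantage A = r - v
adv = (adv - adv.mean()) / (adv.std() + 1e-8)       # normalize across batch
shaped = shape_advantage(adv[:, None], token_entropy)  # broadcast to tokens

for r in rewards:
    baseline.update(r)
```

Intuitively, the tanh factor shrinks updates on tokens the policy is already confident about, which is one way an advantage-shaping term can keep entropy from collapsing; the paper's actual shaping rule may differ.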

MSSR Performance Overview

[Figure: (a) Training accuracy, (b) Validation accuracy, (c) Generalization performance]

Performance overview of MSSR. (a–b) Training and validation accuracy of MVSR (Multimodal Vanilla Single-Rollout), GRPO, and our MSSR, trained on the Vision-R1-RL training set and validated on its corresponding validation set. MSSR remains stable and improves steadily, whereas MVSR is unstable and collapses. Notably, MSSR reaches a final validation accuracy similar to GRPO's with half the training steps, highlighting its superior training compute efficiency. (c) MSSR achieves higher generalization performance across diverse multimodal reasoning benchmarks, including MathVerse, MathVista, MMK12, R1-Onevision-Bench, and HallusionBench, than baselines including GRPO, RLOO, and REINFORCE++. For a fair comparison, all methods use the same total number of rollouts per step.
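To make the rollout-budget matching concrete (our illustration; \( B \) and \( G \) denote a hypothetical prompt batch size and GRPO group size, not values reported here):

\[
\underbrace{B \times G}_{\text{GRPO: } B \text{ prompts, } G \text{ rollouts each}}
\;=\;
\underbrace{(B \times G) \times 1}_{\text{single-rollout: } BG \text{ prompts, one rollout each}}
\]

For example, with \( B = 128 \) and \( G = 8 \), both settings consume 1024 rollouts per optimization step.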

Main Results
Table 1: Model generalization performance on diverse multimodal reasoning benchmarks. We compare MSSR with GRPO, RLOO, and REINFORCE++ baselines on Qwen2.5-VL 3B and 7B models. MSSR outperforms other baselines, with Qwen2.5-VL-7B + MSSR achieving the strongest average performance across benchmarks.
Model              | MathVerse | MathVista | MMK12 | R1-Onevision-Bench | HallusionBench | Avg.
-------------------|-----------|-----------|-------|--------------------|----------------|-----
SFT + RL           |           |           |       |                    |                |
R1-Onevision-7B    | 46.0      | 62.9      | 43.5  | 35.2               | 67.2           | 51.0
OpenVLThinker-7B   | 45.8      | 70.0      | 53.5  | 34.7               | 60.0           | 52.8
VLAA-Thinker-7B    | 48.2      | 68.0      | 51.7  | 38.4               | 70.0           | 55.3
Zero RL            |           |           |       |                    |                |
MM-Eureka-Qwen-7B  | 50.3      | 71.2      | 61.7  | 39.1               | 66.4           | 57.7
ThinkLite-VL-7B    | 47.3      | 71.9      | 57.6  | 35.7               | 70.9           | 56.7
Qwen2.5-VL-3B      | 33.3      | 59.5      | 42.5  | 27.6               | 59.9           | 44.6
+ GRPO             | 36.8      | 61.7      | 46.1  | 30.2               | 62.3           | 47.4
+ RLOO             | 35.7      | 59.7      | 45.5  | 28.8               | 61.6           | 46.3
+ REINFORCE++      | 35.3      | 47.7      | 46.0  | 21.7               | 63.2           | 42.8
+ MSSR             | 39.6      | 63.0      | 49.2  | 29.0               | 66.6           | 49.5
Qwen2.5-VL-7B      | 45.8      | 67.2      | 48.1  | 34.6               | 68.4           | 52.8
+ GRPO             | 48.5      | 70.0      | 55.8  | 37.7               | 69.7           | 56.3
+ RLOO             | 47.8      | 69.2      | 56.0  | 38.5               | 68.5           | 56.0
+ REINFORCE++      | 42.7      | 68.5      | 51.3  | 34.0               | 69.2           | 53.1
+ MSSR             | 49.8      | 71.1      | 62.5  | 39.2               | 70.6           | 58.6
Ablation Studies

[Figure: (a) Training accuracy, (b) Validation accuracy, (c) Model entropy]

Ablation studies on the effectiveness of techniques for preventing entropy collapse and stabilizing multimodal single-rollout training. Cross-modal regularization: this technique provides partial stabilization, increasing training accuracy but still yielding degraded validation accuracy, and both metrics remain below those achieved by MSSR. Entropy loss: adding an entropy loss term partially preserves entropy and improves training accuracy toward the end of training, but validation performance still degrades and entropy is not maintained as effectively as in MSSR.
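For reference, the "entropy loss" baseline in this ablation corresponds to the standard entropy-bonus regularizer added to a policy-gradient objective. A minimal PyTorch sketch under that reading, where coef is an assumed coefficient (the exact loss and setting used in the ablation are not specified here):

```python
import torch
import torch.nn.functional as F

def policy_loss_with_entropy_bonus(logits: torch.Tensor,      # [B, T, V]
                                   actions: torch.Tensor,     # [B, T] token ids
                                   advantages: torch.Tensor,  # [B, T] per-token advantages
                                   coef: float = 0.01) -> torch.Tensor:
    """REINFORCE-style loss plus an entropy bonus (the ablation's
    'entropy loss' baseline; coef is an assumed hyperparameter)."""
    log_probs = F.log_softmax(logits, dim=-1)
    chosen = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)  # log pi(a|s)
    pg_loss = -(advantages * chosen).mean()
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()
    # Subtracting the entropy term rewards high entropy, delaying collapse.
    return pg_loss - coef * entropy
```

Unlike MSSR's advantage shaping, this adds a separate loss term rather than rescaling the advantages themselves, which is consistent with the weaker entropy preservation observed in panel (c).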

Reasoning Output Comparison Examples

Comparison of reasoning outputs from GRPO and MSSR. While GRPO produces an incorrect answer, MSSR solves the problem successfully, demonstrating its superior reasoning capability. We highlight the critical reasoning steps that lead to GRPO's incorrect answer in red, and the key steps enabling MSSR's correct prediction in green.

BibTeX


@article{liu2025stable,
  title={Stable and Efficient Single-Rollout RL for Multimodal Reasoning},
  author={Liu, Rui and Yu, Dian and Ke, Lei and Liu, Haolin and Zhou, Yujun and Liang, Zhenwen and Mi, Haitao and Tokekar, Pratap and Yu, Dong},
  journal={arXiv preprint arXiv:2512.18215},
  year={2025}
}