We evaluate RL policy checkpoints at fixed training intervals using 30 trials per checkpoint.
For the Gear Insertion and Light Bulb Insertion tasks, we keep the target
object poses fixed and randomize the robot end-effector’s initial pose within a small
range (approximately ±2 cm in position and ±10° about each of the three
rotation axes). For Belt Assembly, in addition to randomizing the robot’s
initial pose, we also randomize the assembly board position within approximately
±10 cm and its planar orientation within approximately ±15°.
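The initialization randomization above can be sketched as follows. This is a minimal illustration assuming uniform sampling within the stated ranges; the function names and the uniform distribution are assumptions, not the exact evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ee_perturbation():
    """End-effector initial-pose perturbation: approximately +/- 2 cm in
    position and +/- 10 deg about each of the three rotation axes."""
    dpos = rng.uniform(-0.02, 0.02, size=3)              # meters
    drot = np.deg2rad(rng.uniform(-10.0, 10.0, size=3))  # radians, per axis
    return dpos, drot

def sample_board_perturbation():
    """Belt Assembly only: approximately +/- 10 cm planar position and
    +/- 15 deg planar orientation for the assembly board."""
    dxy = rng.uniform(-0.10, 0.10, size=2)  # meters
    dyaw = np.deg2rad(rng.uniform(-15.0, 15.0))
    return dxy, dyaw
```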
On the Belt Assembly task, DICE-RL improves the pretrained policy from 56.67% to 93.33% success rate over 30 runs. See our uncut evaluation video (10x).
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
On the Light Bulb Insertion task, DICE-RL improves the pretrained policy from 56.67% to 90% success rate.
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
On the Gear Insertion task, DICE-RL improves the pretrained policy from 46.67% to 90% success rate.
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
RL Training Timelapse (10x)
We compare DICE-RL against prior methods, focusing on approaches that build on pretrained diffusion-based policies. We benchmark on the Can, Square, Transport, and Tool Hang tasks from the Robomimic benchmark, and report results for both state-based and pixel-based observations. For Can, the BC policies are trained on 20 demonstrations, while the other tasks use 50 demonstrations from the Proficient-Human (PH) dataset.
DICE-RL attains the highest final performance while also being more stable and sample-efficient across all tasks, and it succeeds across all difficulty levels with a single training recipe.
Our finetuning objective combines critic value maximization with a BC-style residual penalty. As training progresses, this shifts probability mass away from low-value action samples and toward consistently high-value regions, yielding a sharper action distribution at visited states.
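A minimal sketch of such a combined objective is shown below. The critic `q`, the weight `beta`, and the batch conventions are illustrative assumptions, not the exact DICE-RL loss.

```python
import numpy as np

def finetune_objective(q, state, policy_action, bc_action, beta=0.1):
    """Sketch: maximize the critic's value of the policy's action while
    penalizing its residual from the pretrained BC policy's action.
    `q(state, action)` returns one scalar value per batch element."""
    value_term = -np.mean(q(state, policy_action))  # minimize -Q = maximize Q
    residual_term = np.mean(np.sum((policy_action - bc_action) ** 2, axis=-1))
    return value_term + beta * residual_term
```

The residual penalty keeps the finetuned policy close to the BC policy early in training, while the value term gradually shifts probability mass toward high-value actions.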
Using states from the offline demonstrations as anchors, we sample actions from the finetuned policy and compute (i) value gain relative to the pretrained BC policy and (ii) the empirical entropy drop of the sampled actions. We find a clear coupling: states with larger value improvements exhibit larger entropy drops, suggesting that successful finetuning coincides with stronger distributional concentration.
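These two per-state statistics can be sketched as follows. This sketch assumes a Gaussian plug-in estimator for the empirical entropy and a hypothetical critic `q(s, a)`; both are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def gaussian_entropy(actions):
    """Plug-in differential entropy of action samples under a Gaussian fit
    (one simple empirical entropy estimator). `actions` is (N, d)."""
    d = actions.shape[1]
    cov = np.cov(actions, rowvar=False) + 1e-8 * np.eye(d)  # regularized
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def value_gain_and_entropy_drop(q, state, a_rl, a_bc):
    """At one anchor state: (i) mean value gain of finetuned-policy samples
    over pretrained BC samples, and (ii) the empirical entropy drop."""
    dv = np.mean(q(state, a_rl)) - np.mean(q(state, a_bc))
    dh = gaussian_entropy(a_bc) - gaussian_entropy(a_rl)
    return dv, dh
```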
We show a representative rollout trajectory of the RL policy on Tool Hang, together with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)). We zoom in on frames where the value improvement spikes and the action entropy drops; these states are often critical for task success (e.g., pre-insertion and insertion). In contrast, during free-space motions that are less consequential for success, we observe little reduction in action entropy.
Distribution sharpening is a state-local effect. Contraction, in contrast, is a trajectory-level property of the closed-loop dynamics induced by a policy: over a task-relevant region and in a chosen metric, trajectories initialized from nearby states move closer over time, indicating reduced sensitivity to initial conditions.
To probe contraction empirically, we sample many pairs of nearby anchor states \((s_0,s'_0)\) from the offline demonstrations \(D_{\text{demo}}\). From each pair, we roll out (i) the finetuned RL policy for \(T\) steps to obtain \(\{s_t^{\text{RL}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{RL}}\}_{t=0}^T\), (ii) the pretrained BC policy for \(T\) steps to obtain \(\{s_t^{\text{Pre}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{Pre}}\}_{t=0}^T\), and (iii) the corresponding expert trajectories of length \(T\) starting from the same anchors, denoted \(\{s_t^{\text{E}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{E}}\}_{t=0}^T\). We then measure the normalized pairwise divergence for each rollout type \(x\in\{\text{RL},\text{Pre},\text{E}\}\): \[ c^{x}(t)\;=\;\frac{\big\|s_t^{x}-s_t^{\prime\,x}\big\|_2^2}{\big\|s_0-s'_0\big\|_2^2} \] Rollouts under our RL policy exhibit a more stable (and typically smaller) evolution of \(c(t)\) than both the pretrained BC policy and the expert demonstration rollouts, indicating stronger contraction of the closed-loop behavior.
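The divergence measure above can be sketched as a short function, assuming each rollout is stored as a (T+1) x d state array; the function name is illustrative.

```python
import numpy as np

def normalized_divergence(traj_a, traj_b):
    """c(t) = ||s_t - s'_t||_2^2 / ||s_0 - s'_0||_2^2 for a pair of rollouts
    started from nearby anchor states. Both inputs are (T+1, d) arrays."""
    sq_dist = np.linalg.norm(traj_a - traj_b, axis=1) ** 2
    return sq_dist / sq_dist[0]
```

Under contracting closed-loop dynamics, this curve decays toward zero; values staying near or above 1 indicate sensitivity to the initial perturbation.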
We overlay the rollouts from the finetuned RL policy with the demonstration trajectories. The RL policy contracts around critical, contact-rich states.