From Prior to Pro:
Efficient Skill Mastery via Distribution Contractive RL Finetuning

teaser figure main results bar plot

TL;DR

We introduce Distribution Contractive Reinforcement Learning (DICE-RL), a framework that uses reinforcement learning (RL) as a "distribution contraction" operator to refine pretrained generative robot policies. DICE-RL turns a pretrained behavior prior into a high-performing "pro" policy by amplifying high-success behaviors from online feedback. We pretrain a diffusion-based policy for broad behavioral coverage, then finetune it with a stable, sample-efficient residual off-policy RL framework that combines selective behavior regularization with value-guided action selection. Extensive experiments and analyses show that DICE-RL reliably improves performance with strong stability and sample efficiency, enabling mastery of complex long-horizon manipulation skills both in simulation and on a real robot.
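To make the value-guided action selection component concrete, here is a minimal sketch of the underlying idea: draw several candidate actions from the generative policy and execute the one the critic scores highest. The names below (`value_guided_select`, `sample_action`, `q_value`) are illustrative stand-ins, not the actual DICE-RL implementation.

```python
import numpy as np

def value_guided_select(sample_action, q_value, state, num_samples=8):
    """Draw candidate actions from the generative policy and return the
    one with the highest critic value (value-guided action selection)."""
    candidates = [sample_action(state) for _ in range(num_samples)]
    scores = [q_value(state, a) for a in candidates]
    return candidates[int(np.argmax(scores))]

# Deterministic toy example: a stand-in "policy" that cycles through fixed
# 1-D candidates, and a stand-in critic that prefers actions near 0.5.
cands = iter([0.1, 0.9, 0.52, 0.7, 0.2, 0.6, 0.3, 0.05])
policy = lambda s: next(cands)
critic = lambda s, a: -(a - 0.5) ** 2
best = value_guided_select(policy, critic, state=None)  # -> 0.52
```

In practice the candidates would be action chunks sampled from the diffusion policy and the scores would come from the learned Q-function; the selection logic is the same.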

Real Robot Results

We evaluate RL policy checkpoints at fixed training intervals using 30 trials per checkpoint. For the Gear Insertion and Light Bulb Insertion tasks, we keep the target object poses fixed and randomize the robot end-effector’s initial pose within a small range (approximately ±2 cm in position and ±10° about each of the three rotation axes). For Belt Assembly, in addition to randomizing the robot’s initial pose, we also randomize the assembly board position within approximately ±10 cm and its planar orientation within approximately ±15°.
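For illustration, the initial-pose randomization described above can be sketched as a uniform perturbation of a nominal pose; the helper name `sample_initial_pose` and its signature are hypothetical, not from our evaluation code.

```python
import numpy as np

def sample_initial_pose(nominal_pos, nominal_rpy, rng,
                        pos_range=0.02, rot_range_deg=10.0):
    """Uniformly perturb a nominal end-effector pose: +/-2 cm per position
    axis and +/-10 deg per rotation axis, matching the ranges above."""
    pos = nominal_pos + rng.uniform(-pos_range, pos_range, size=3)
    rpy = nominal_rpy + np.deg2rad(rng.uniform(-rot_range_deg, rot_range_deg, size=3))
    return pos, rpy

rng = np.random.default_rng(0)
pos, rpy = sample_initial_pose(np.zeros(3), np.zeros(3), rng)
```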

Belt Assembly

real robot results

On the Belt Assembly task, DICE-RL improves the pretrained policy from 56.67% to 93.33% success rate over 30 runs. See our uncut evaluation video (10x).

Pretrained BC Policy (10x)

Finetuned RL Policy (10x)

Light Bulb Insertion

real robot results

On the Light Bulb Insertion task, DICE-RL improves the pretrained policy from 56.67% to 90% success rate.

Pretrained BC Policy (10x)

Finetuned RL Policy (10x)

Gear Insertion

real robot results

On the Gear Insertion task, DICE-RL improves the pretrained policy from 46.67% to 90% success rate.

Pretrained BC Policy (10x)

Finetuned RL Policy (10x)

RL Training Timelapse (10x)

Simulation Results & Analyses

Main Results

We compare DICE-RL against prior methods, focusing on approaches that build on pretrained diffusion-based policies. We benchmark on the Can, Square, Transport, and Tool Hang tasks from the Robomimic benchmark, and report results for both state-based and pixel-based observations. For Can, the BC policies are trained on 20 demonstrations; for the other tasks, we use 50 demonstrations from the Proficient-Human (PH) dataset.

main results 2x4 main results legend

DICE-RL attains the highest final performance while also being more stable and sample efficient across all tasks, and it succeeds across all difficulty levels with a single training recipe.

Understanding DICE-RL

Distribution Sharpening

Our finetuning objective combines critic value maximization with a BC-style residual penalty. As training progresses, this shifts probability mass away from low-value action samples and toward consistently high-value regions, yielding a sharper action distribution at visited states.
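Schematically, this objective can be sketched as a weighted sum of a value term and a residual penalty. The function `finetune_loss` and its argument names below are hypothetical, and the exact penalty form and weighting in DICE-RL may differ.

```python
import numpy as np

def finetune_loss(q_values, policy_actions, bc_actions, bc_weight=1.0):
    """Critic value maximization plus a BC-style residual penalty.
    q_values: critic scores Q(s, a) for the finetuned policy's actions.
    policy_actions / bc_actions: actions from the finetuned policy and the
    frozen pretrained prior at the same states. bc_weight trades off the two."""
    value_term = -np.mean(q_values)  # minimizing this maximizes Q
    residual_term = np.mean(np.sum((policy_actions - bc_actions) ** 2, axis=-1))
    return value_term + bc_weight * residual_term

# Toy check: two states with 2-D actions.
loss = finetune_loss(np.array([1.0, 3.0]),
                     np.array([[1.0, 0.0], [0.0, 1.0]]),
                     np.zeros((2, 2)))  # -(1+3)/2 + 1.0 = -1.0
```

Minimizing the first term pushes probability mass toward high-value actions; the second term keeps the finetuned actions close to the prior, which is what makes the sharpening selective rather than a wholesale distribution shift.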

value entropy correlation scatter

Using states from the offline demonstrations as anchors, we sample actions from the finetuned policy and compute (i) value gain relative to the pretrained BC policy and (ii) the empirical entropy drop of the sampled actions. We find a clear coupling: states with larger value improvements exhibit larger entropy drops, suggesting that successful finetuning coincides with stronger distributional concentration.
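A minimal sketch of this analysis follows, assuming a Gaussian fit for the empirical entropy estimate (the estimator actually used may differ); all function names here are illustrative.

```python
import numpy as np

def gaussian_entropy(actions):
    """Differential entropy of sampled actions under a Gaussian fit
    (one common empirical entropy estimator)."""
    cov = np.cov(actions, rowvar=False) + 1e-8 * np.eye(actions.shape[1])
    d = actions.shape[1]
    return 0.5 * (d * np.log(2 * np.pi * np.e) + np.linalg.slogdet(cov)[1])

def value_gain_and_entropy_drop(state, sample_rl, sample_bc, q_value, n=256):
    """At an anchor state: (i) value gain of the RL policy over the BC prior,
    (ii) entropy drop of the RL action samples relative to the BC samples."""
    a_rl = np.stack([sample_rl(state) for _ in range(n)])
    a_bc = np.stack([sample_bc(state) for _ in range(n)])
    dv = (np.mean([q_value(state, a) for a in a_rl])
          - np.mean([q_value(state, a) for a in a_bc]))
    dh = gaussian_entropy(a_bc) - gaussian_entropy(a_rl)  # > 0: RL is sharper
    return dv, dh

# Toy check: a narrow "RL" policy near the optimum vs. a broad "BC" prior.
rng = np.random.default_rng(0)
dv, dh = value_gain_and_entropy_drop(
    state=None,
    sample_rl=lambda s: rng.normal(0.0, 0.1, size=2),
    sample_bc=lambda s: rng.normal(0.0, 1.0, size=2),
    q_value=lambda s, a: -np.sum(a ** 2))
```

In the toy check the sharper policy both gains value and loses entropy, which is exactly the coupling the scatter plot measures across demonstration states.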

value improve entropy reduction

We show a representative rollout trajectory of the RL policy on Tool Hang, together with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)). We zoom in on frames where the value improvement spikes and the action entropy drops; these states are often critical for task success (e.g., pre-insertion and insertion). In contrast, during free-space motions that are less consequential for success, we observe less reduction in action entropy.

Contraction and Robustness

Distribution sharpening is a state-local effect. Contraction, in contrast, is a trajectory-level property of the closed-loop dynamics induced by a policy: over a task-relevant region and in a chosen metric, trajectories initialized from nearby states move closer over time, indicating reduced sensitivity to initial conditions.

To probe contraction empirically, we sample many pairs of nearby anchor states \((s_0,s'_0)\) from the offline demonstrations \(D_{\text{demo}}\). From each pair, we roll out (i) the finetuned RL policy for \(T\) steps to obtain \(\{s_t^{\text{RL}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{RL}}\}_{t=0}^T\), (ii) the pretrained BC policy for \(T\) steps to obtain \(\{s_t^{\text{Pre}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{Pre}}\}_{t=0}^T\), and (iii) the corresponding expert trajectories of length \(T\) starting from the same anchors, denoted \(\{s_t^{\text{E}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{E}}\}_{t=0}^T\). We then measure the normalized pairwise divergence for each rollout type \(x\in\{\text{RL},\text{Pre},\text{E}\}\): \[ c^{x}(t)\;=\;\frac{\big\|s_t^{x}-s_t^{\prime\,x}\big\|_2^2}{\big\|s_0-s'_0\big\|_2^2} \] Rollouts under our RL policy exhibit a more stable (and typically smaller) evolution of \(c(t)\) than both the pretrained BC policy and the expert demonstration rollouts, indicating stronger contraction of the closed-loop behavior.
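The divergence metric \(c^{x}(t)\) can be computed directly from a pair of rollouts; here is a minimal sketch with toy contracting dynamics standing in for the closed-loop policy (all names are illustrative).

```python
import numpy as np

def pairwise_divergence(traj_a, traj_b):
    """Normalized pairwise divergence c(t) = ||s_t - s'_t||^2 / ||s_0 - s'_0||^2
    for two rollouts started from nearby anchor states."""
    traj_a, traj_b = np.asarray(traj_a), np.asarray(traj_b)
    d = np.sum((traj_a - traj_b) ** 2, axis=-1)
    return d / d[0]

# Toy contracting closed-loop dynamics: s_{t+1} = 0.5 * s_t.
def rollout(s0, T=5):
    traj = [s0]
    for _ in range(T):
        traj.append(0.5 * traj[-1])
    return np.array(traj)

c = pairwise_divergence(rollout(np.array([1.0, 0.0])),
                        rollout(np.array([1.1, 0.0])))  # c(t) = 0.25**t
```

A contracting closed loop yields a monotonically decaying \(c(t)\), as in the toy dynamics; in our experiments the same statistic is computed over many anchor pairs and averaged per rollout type.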

Rollout Distribution Visualization (Tool Hang)

We overlay rollouts from the finetuned RL policy on the demonstration trajectories. The RL policy contracts around critical, contact-rich states.

tool hang visualization
Rollout Distribution Visualization (Transport)
transport visualization