We evaluate RL policy checkpoints at fixed training intervals using 30 trials per checkpoint.
For the Gear Insertion and Light Bulb Insertion tasks, we keep the target
object poses fixed and randomize the robot end-effector’s initial pose within a small
range (approximately ±2 cm in position and ±10° about each of the three
rotation axes). For Belt Assembly, in addition to randomizing the robot’s
initial pose, we also randomize the assembly board position within approximately
±10 cm and its planar orientation within approximately ±15°.
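The initialization randomization above can be sketched as follows. This is a minimal illustration assuming uniform sampling within the stated ranges; the function names and the uniform distribution are assumptions, not the exact evaluation code.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_ee_perturbation():
    """End-effector initial-pose perturbation: approximately +/- 2 cm in
    position and +/- 10 deg about each of the three rotation axes."""
    dpos = rng.uniform(-0.02, 0.02, size=3)              # meters
    drot = np.deg2rad(rng.uniform(-10.0, 10.0, size=3))  # radians, per axis
    return dpos, drot

def sample_board_perturbation():
    """Belt Assembly only: approximately +/- 10 cm planar position and
    +/- 15 deg planar orientation for the assembly board."""
    dxy = rng.uniform(-0.10, 0.10, size=2)  # meters
    dyaw = np.deg2rad(rng.uniform(-15.0, 15.0))
    return dxy, dyaw
```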
On the Belt Assembly task, DICE-RL improves the pretrained policy from 56.67% to 93.33% success rate over 30 runs. See our uncut evaluation video (10x).
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
On the Light Bulb Insertion task, DICE-RL improves the pretrained policy from 56.67% to 90% success rate.
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
On the Gear Insertion task, DICE-RL improves the pretrained policy from 46.67% to 90% success rate.
Pretrained BC Policy (10x)
Finetuned RL Policy (10x)
RL Training Timelapse (10x)
We compare DICE-RL against prior methods, focusing on approaches that build on pretrained diffusion-based policies. We benchmark on the Can, Square, Transport, and Tool Hang tasks from the Robomimic benchmark, and report results for both state-based and pixel-based observations. For Can, the BC policies are trained on 20 demonstrations, while the other tasks use 50 demonstrations from the Proficient-Human (PH) dataset.
DICE-RL attains the highest final performance while also being more stable and sample-efficient across all tasks, and it succeeds across all difficulty levels with a single training recipe.
Our finetuning objective combines critic value maximization with a BC-style residual penalty. As training progresses, this shifts probability mass away from low-value action samples and toward consistently high-value regions, yielding a sharper action distribution at visited states.
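A minimal sketch of such a combined objective is shown below. The critic `q`, the weight `beta`, and the batch conventions are illustrative assumptions, not the exact DICE-RL loss.

```python
import numpy as np

def finetune_objective(q, state, policy_action, bc_action, beta=0.1):
    """Sketch: maximize the critic's value of the policy's action while
    penalizing its residual from the pretrained BC policy's action.
    `q(state, action)` returns one scalar value per batch element."""
    value_term = -np.mean(q(state, policy_action))  # minimize -Q = maximize Q
    residual_term = np.mean(np.sum((policy_action - bc_action) ** 2, axis=-1))
    return value_term + beta * residual_term
```

The residual penalty keeps the finetuned policy close to the BC policy early in training, while the value term gradually shifts probability mass toward high-value actions.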
Using states from the offline demonstrations as anchors, we sample actions from the finetuned policy and compute (i) value gain relative to the pretrained BC policy and (ii) the empirical entropy drop of the sampled actions. We find a clear coupling: states with larger value improvements exhibit larger entropy drops, suggesting that successful finetuning coincides with stronger distributional concentration.
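These two per-state statistics can be sketched as follows. This sketch assumes a Gaussian plug-in estimator for the empirical entropy and a hypothetical critic `q(s, a)`; both are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def gaussian_entropy(actions):
    """Plug-in differential entropy of action samples under a Gaussian fit
    (one simple empirical entropy estimator). `actions` is (N, d)."""
    d = actions.shape[1]
    cov = np.cov(actions, rowvar=False) + 1e-8 * np.eye(d)  # regularized
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (d * np.log(2.0 * np.pi * np.e) + logdet)

def value_gain_and_entropy_drop(q, state, a_rl, a_bc):
    """At one anchor state: (i) mean value gain of finetuned-policy samples
    over pretrained BC samples, and (ii) the empirical entropy drop."""
    dv = np.mean(q(state, a_rl)) - np.mean(q(state, a_bc))
    dh = gaussian_entropy(a_bc) - gaussian_entropy(a_rl)
    return dv, dh
```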
We show a representative rollout trajectory of the RL policy on Tool Hang, together with the running change in action entropy (\(\Delta H\)) and value improvement (\(\Delta V\)). We zoom in on frames where the value improvement spikes and the action entropy drops; these states are often critical for task success (e.g., pre-insertion and insertion). In contrast, during free-space motions that are less consequential for success, we observe little reduction in action entropy.
Distribution sharpening is a state-local effect. Contraction, in contrast, is a trajectory-level property of the closed-loop dynamics induced by a policy: over a task-relevant region and in a chosen metric, trajectories initialized from nearby states move closer over time, indicating reduced sensitivity to initial conditions.
To probe contraction empirically, we sample many pairs of nearby anchor states \((s_0,s'_0)\) from the offline demonstrations \(D_{\text{demo}}\). From each pair, we roll out (i) the finetuned RL policy for \(T\) steps to obtain \(\{s_t^{\text{RL}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{RL}}\}_{t=0}^T\), (ii) the pretrained BC policy for \(T\) steps to obtain \(\{s_t^{\text{Pre}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{Pre}}\}_{t=0}^T\), and (iii) the corresponding expert trajectories of length \(T\) starting from the same anchors, denoted \(\{s_t^{\text{E}}\}_{t=0}^T\) and \(\{s_t^{\prime\,\text{E}}\}_{t=0}^T\). We then measure the normalized pairwise divergence for each rollout type \(x\in\{\text{RL},\text{Pre},\text{E}\}\): \[ c^{x}(t)\;=\;\frac{\big\|s_t^{x}-s_t^{\prime\,x}\big\|_2^2}{\big\|s_0-s'_0\big\|_2^2} \] Rollouts under our RL policy exhibit a more stable (and typically smaller) evolution of \(c(t)\) than both the pretrained BC policy and the expert demonstration rollouts, indicating stronger contraction of the closed-loop behavior.
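The divergence measure above can be sketched as a short function, assuming each rollout is stored as a (T+1) x d state array; the function name is illustrative.

```python
import numpy as np

def normalized_divergence(traj_a, traj_b):
    """c(t) = ||s_t - s'_t||_2^2 / ||s_0 - s'_0||_2^2 for a pair of rollouts
    started from nearby anchor states. Both inputs are (T+1, d) arrays."""
    sq_dist = np.linalg.norm(traj_a - traj_b, axis=1) ** 2
    return sq_dist / sq_dist[0]
```

Under contracting closed-loop dynamics, this curve decays toward zero; values staying near or above 1 indicate sensitivity to the initial perturbation.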
We overlay the rollouts from the finetuned RL policy with the demonstration trajectories. The RL policy contracts around critical, contact-rich states.