Reducing Oracle Feedback
with Vision-Language Embeddings
for Preference-Based RL

¹University of California, Riverside   ²AWS AI Labs

ICRA 2026

TL;DR: What is ROVED?

ROVED (Reducing Oracle Feedback with Vision-Language Embeddings) is a hybrid framework for preference-based reinforcement learning that combines scalable vision-language embedding (VLE) models with targeted oracle feedback. It works in two key stages:

1. VLE-based Preference Generation: The model uses lightweight VLE models to generate segment-level preferences for trajectory comparisons, providing a scalable alternative to expensive oracle annotations.

2. Selective Oracle Querying: Using a filtering mechanism with uncertainty thresholds (τ_upper and τ_lower), ROVED identifies noisy or uncertain samples and defers only these to the oracle for annotation, dramatically reducing annotation costs.
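The routing rule in stage 2 can be sketched as follows. The threshold values, the score convention (the VLE's probability that the first segment is preferred), and all function and variable names are illustrative assumptions, not the authors' implementation:

```python
def route_preferences(vle_scores, tau_lower=0.2, tau_upper=0.8):
    """Split VLE preference scores into clean and noisy (oracle-bound) sets.

    vle_scores: floats in [0, 1], each the VLE's probability that
    segment A is preferred over segment B in a trajectory pair.
    The threshold values here are illustrative placeholders.
    """
    clean, noisy = [], []
    for i, p in enumerate(vle_scores):
        # Confident preferences (near 0 or 1) are kept as VLE labels;
        # ambiguous ones (between the thresholds) are deferred to the oracle.
        if p <= tau_lower or p >= tau_upper:
            clean.append((i, int(p >= tau_upper)))  # hard label from the VLE
        else:
            noisy.append(i)  # candidate for budgeted oracle annotation
    return clean, noisy
```

Only the `noisy` indices consume oracle budget; the `clean` pairs are labeled for free by the VLE.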

The framework also includes parameter-efficient fine-tuning that adapts the VLE using oracle feedback, creating a synergistic loop where the VLE improves over time. Across multiple robotic manipulation tasks, ROVED matches or surpasses prior methods while reducing oracle queries by up to 80% and achieving cumulative annotation savings of up to 90% through cross-task generalization.
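One common way to make such fine-tuning parameter-efficient is a LoRA-style low-rank update; whether ROVED uses LoRA specifically is not stated here, so the plain-Python sketch below is an illustrative assumption, with toy matrix shapes:

```python
def matmul(X, Y):
    """Plain-Python matrix multiply for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

def add(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def lora_weight(W, A, B, alpha=1.0):
    """LoRA-style parameter-efficient update: the frozen weight W is
    augmented with a trainable low-rank product B @ A (rank r), so only
    A and B receive gradients during VLE adaptation.
    Shapes (illustrative): W (d_out, d_in), B (d_out, r), A (r, d_in).
    """
    BA = matmul(B, A)
    return add(W, [[alpha * v for v in row] for row in BA])
```

Because only A and B are trained, the adapted weights are cheap to store per task, which is consistent with reusing a source-task VLE for cross-task transfer.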

ROVED Framework


Overview: Given a task description, ROVED iteratively updates the policy π_φ via reinforcement learning using the reward model r_θ. Trajectory segments from the replay buffer are sampled and labeled with VLE-generated preferences. These samples are then classified as clean or noisy using thresholds τ_upper and τ_lower. A budgeted subset of noisy samples is sent for oracle annotation. The reward model is trained on both VLE and oracle-labeled preferences, while the VLE is fine-tuned using oracle annotations and replay buffer samples.
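The reward-model update in this loop is typically the Bradley-Terry preference loss standard in preference-based RL (e.g. PEBBLE). The scalar sketch below assumes that form; `returns_a` and `returns_b` stand for the predicted cumulative rewards of the two segments under r_θ, and the label may come from either the VLE or the oracle:

```python
import math

def preference_loss(returns_a, returns_b, label):
    """Bradley-Terry style preference loss (a minimal scalar sketch,
    assuming the standard PbRL objective, not the authors' code).

    returns_a / returns_b: predicted cumulative reward of each segment
    under the current reward model.
    label: 1 if segment A is preferred, 0 if segment B is preferred.
    """
    # P(A preferred over B) under the Bradley-Terry model of the returns.
    p_a = 1.0 / (1.0 + math.exp(returns_b - returns_a))
    p_a = min(max(p_a, 1e-12), 1.0 - 1e-12)  # numerical safety clamp
    # Binary cross-entropy against the (VLE- or oracle-provided) label.
    return -(label * math.log(p_a) + (1 - label) * math.log(1.0 - p_a))
```

The loss shrinks as the reward model assigns a higher return to the preferred segment, which is what drives r_θ toward the mixed VLE/oracle preference data.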

Results

Main Learning Curves

Improved feedback efficiency. ROVED consistently outperforms all baselines with minimal oracle feedback, matching or exceeding PEBBLE's performance while requiring 50%-80% fewer annotations. At equal preference counts, ROVED also outperforms efficient preference-based methods such as MRN and SURF. Variables (x, y, z) denote the number of oracle preferences used.

Adaptation Figure

Knowledge transfer across tasks. A key objective of ROVED is to refine the VLE with oracle feedback to reduce inherent noise. We test whether an adapted VLE can generalize to related tasks with minimal additional supervision. Two types of transfer are considered: (1) same task, different object: "door-open" → "drawer-open"; (2) same object, different task: "window-close" → "window-open". In both cases, the VLE for the target task is initialized with weights from the source task, while the rest of the algorithm remains unchanged. With knowledge transfer, ROVED matches or surpasses PEBBLE while reducing annotation requirements by 75–90%. This demonstrates effective transfer in both same task, different object (left) and same object, different task (right) settings. Variables (w, x, y, z) denote the number of preferences used.

Qualitative Comparison: ROVED vs PEBBLE

(Each cell links to a policy rollout video on the project page; the preference budgets per method and task are listed below.)

              Button-press   Door-close   Window-close   Drawer-open
ROVED(x)      x = 1500       x = 1000     x = 1000       x = 5000
PEBBLE(y)     y = 2000       y = 1000     y = 1000       y = 15000
PEBBLE(z)     z = 4000       z = 2000     z = 2000       z = 25000

Qualitative comparison of learned policies, where x < y < z denote the number of oracle preferences provided. ROVED(x) matches the performance of PEBBLE(z) while using at least 50% fewer oracle queries (x < z). PEBBLE(y) policies are suboptimal despite achieving task success, either executing the task inefficiently (door-close) or requiring multiple attempts (button-press, window-close).

BibTeX

@misc{ghosh2025preferencevlmleveragingvlms,
  title={Preference VLM: Leveraging VLMs for Scalable Preference-Based Reinforcement Learning},
  author={Udita Ghosh and Dripta S. Raychaudhuri and Jiachen Li and Konstantinos Karydis and Amit Roy-Chowdhury},
  year={2025},
  eprint={2502.01616},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01616},
}