Neural Stochastic Differential Equations for Uncertainty-aware, Offline RL

The University of Texas at Austin1, Rensselaer Polytechnic Institute2
ICLR 2025

*Indicates Equal Contribution

TL;DR

We develop an uncertainty-aware, offline model-based reinforcement learning approach based on neural stochastic differential equations that outperforms the state of the art on continuous control benchmarks, particularly on low-quality datasets.

Neural Stochastic Differential Equations for Uncertainty-aware, Offline RL (NUNO) learns a dynamics model as neural stochastic differential equations, where its drift term can leverage prior physics knowledge as inductive bias, and its diffusion term provides distance-aware estimates of uncertainty. NUNO addresses model exploitation inherent in offline model-based RL by penalizing and adaptively truncating neural SDE’s rollouts according to uncertainty estimates.
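To make the model concrete, the sketch below simulates one rollout of an SDE of the form ds = f(s, a) dt + g(s, a) dW with the Euler–Maruyama scheme. This is an illustrative toy, not the paper's implementation: the pendulum-like drift `f_drift` and the constant-scale diffusion `g_diffusion` are hypothetical stand-ins for the learned (or physics-informed) networks.

```python
import numpy as np

def f_drift(s, a):
    # Hypothetical drift: damped pendulum-like dynamics plus a control input.
    # In NUNO, this term is a neural network that can embed prior physics.
    theta, omega = s
    return np.array([omega, -np.sin(theta) - 0.1 * omega + a])

def g_diffusion(s, a):
    # Hypothetical diffusion: small constant noise scale per state dimension.
    # In NUNO, this term provides the distance-aware uncertainty estimate.
    return 0.05 * np.ones_like(s)

def euler_maruyama_step(s, a, dt, rng):
    # One step of ds = f(s, a) dt + g(s, a) dW.
    dW = rng.normal(0.0, np.sqrt(dt), size=s.shape)  # Brownian increment
    return s + f_drift(s, a) * dt + g_diffusion(s, a) * dW

rng = np.random.default_rng(0)
s = np.array([0.1, 0.0])
for _ in range(10):
    s = euler_maruyama_step(s, a=0.0, dt=0.01, rng=rng)
```

Because the solver only needs pointwise evaluations of the drift and diffusion, any smooth parameterization of the two terms plugs in directly.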

Abstract

Offline model-based reinforcement learning (RL) offers a principled approach to using a learned dynamics model as a simulator to optimize a control policy. Despite the near-optimal performance of existing approaches on benchmarks with high-quality datasets, most struggle on datasets with low state-action space coverage or suboptimal demonstrations. We develop a novel offline model-based RL approach that particularly shines in low-quality data regimes while maintaining competitive performance on high-quality datasets. Neural Stochastic Differential Equations for Uncertainty-aware, Offline RL (NUNO) learns a dynamics model as neural stochastic differential equations (SDE), where its drift term can leverage prior physics knowledge as inductive bias. In parallel, its diffusion term provides distance-aware estimates of model uncertainty by matching the dynamics' underlying stochasticity near the training data regime while providing high but bounded estimates beyond it. To address the so-called model exploitation problem in offline model-based RL, NUNO builds on existing studies by penalizing and adaptively truncating neural SDE's rollouts according to uncertainty estimates. Our empirical results in D4RL and NeoRL MuJoCo benchmarks evidence that NUNO outperforms state-of-the-art methods in low-quality datasets by up to 93% while matching or surpassing their performance by up to 55% in some high-quality counterparts.

Distance-aware Uncertainty Estimator

We propose a parametric distance-aware uncertainty estimator that captures the distance to the k-th closest neighbor in the dataset without requiring a KNN search. Besides bypassing an intractable KNN search at rollout time, our parametric estimator can be trained alongside the neural SDE model so that the model captures both aleatoric and epistemic uncertainty in the dynamics. The estimator is smooth and differentiable and thus blends well with the requirements for numerical integration of the neural SDE.
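For intuition, the quantity such an estimator tracks is the distance from a query state-action point to its k-th nearest neighbor in the dataset. A brute-force version of that target is sketched below (an assumption for illustration, not the paper's training objective); a parametric network regressing this target avoids the neighbor search at rollout time.

```python
import numpy as np

def kth_neighbor_distance(query, data, k=5):
    # Euclidean distances from the query to every dataset point.
    dists = np.linalg.norm(data - query, axis=1)
    # Distance to the k-th closest point: small inside the data support,
    # growing as the query moves out of distribution.
    return np.sort(dists)[k - 1]

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                      # in-distribution samples
near = kth_neighbor_distance(np.zeros(2), data)       # query near the data
far = kth_neighbor_distance(np.full(2, 10.0), data)   # query far from it
assert near < far  # uncertainty should grow with distance from the data
```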

Visualization of the distance-aware uncertainty estimate on three generated datasets. The red points represent the state-action samples in the dataset. Yellow indicates high uncertainty, while dark blue indicates low uncertainty. The x- and y-axes denote the two states of the system.

Experimental Results

We empirically evaluate NUNO against state-of-the-art (SOTA) offline model-based and model-free approaches on continuous control benchmarks, namely the MuJoCo datasets in D4RL and NeoRL.

How does NUNO perform in terms of human-normalized score?

NUNO outperforms SOTA methods on low-quality datasets by up to 93% while matching or surpassing their performance by up to 55% on some high-quality counterparts.

D4RL

NeoRL

Average human-normalized scores of NUNO and other model-based and model-free offline RL approaches on D4RL (left) and NeoRL (right) MuJoCo datasets. Due to limited space, we use abbreviations of dataset names: For D4RL, hc = halfcheetah, hp = hopper, wk = walker2d; r = random, m = medium, mr = medium-replay, me = medium-expert. For NeoRL, L = low, M = medium, H = high. For NUNO, we provide the mean and standard deviation (following ±) of best scores among independent runs. Bold scores indicate the best for each task.

How does NUNO address the model exploitation phenomenon?

We assess how NUNO addresses the model exploitation phenomenon based on two aspects: (1) conservativeness of the reward function of pessimistic learned MDPs, and (2) prediction accuracy of learned dynamics models.
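The penalize-and-truncate idea behind aspect (1) can be sketched as follows: during a model rollout, subtract an uncertainty penalty from the reward and stop the rollout once the uncertainty estimate exceeds a threshold. The penalty weight `lam`, threshold `tau`, and precomputed uncertainty sequence are illustrative stand-ins, not the paper's exact quantities.

```python
def rollout_with_truncation(rewards, uncertainties, lam=1.0, tau=0.5):
    # Pessimistic rollout: penalize rewards by uncertainty and truncate
    # adaptively once the model is no longer trusted.
    penalized = []
    for r, u in zip(rewards, uncertainties):
        if u > tau:                     # adaptive truncation point
            break
        penalized.append(r - lam * u)   # uncertainty-penalized reward
    return penalized

rews = [1.0, 1.0, 1.0, 1.0]
uncs = [0.1, 0.2, 0.6, 0.9]  # uncertainty grows as the rollout drifts OOD
out = rollout_with_truncation(rews, uncs)  # truncated after two steps
```

The gap between the ground-truth score and the score under such penalized rewards is what the figures below use to gauge conservativeness.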

Model Exploitation: D4RL random

Model Exploitation: D4RL medium-replay

Based on the gap between the ground-truth score (without penalization) and the pessimistic score, we observe that NUNO constructs pessimistic learned MDPs that are less conservative than their counterparts in MOPO and TATU+MOPO, which use Gaussian ensembles. The only exception is hopper-medium-replay, which may explain why TATU+MOPO and MOPO perform slightly better there.

Model Analysis: D4RL in-distribution

Model Analysis: D4RL out-of-distribution

We illustrate the evolution of model prediction error across datasets. (a) In-distribution: evaluation on the datasets on which the models were trained. (b) Out-of-distribution: evaluation of models trained on the random datasets, using trajectories from the other datasets. The figures show that neural SDEs are significantly more accurate than a Gaussian ensemble over longer horizons.

Poster

Presentation

BibTeX

@inproceedings{
        koprulu2025neural,
        title={Neural Stochastic Differential Equations for Uncertainty-Aware Offline {RL}},
        author={Cevahir Koprulu and Franck Djeumou and Ufuk Topcu},
        booktitle={The Thirteenth International Conference on Learning Representations},
        year={2025},
        url={https://openreview.net/forum?id=hxUMQ4fic3}
}