Safety-Prioritizing Curricula for Constrained Reinforcement Learning

Cevahir Koprulu, Thiago D. Simão, Nils Jansen, Ufuk Topcu
The University of Texas at Austin, Eindhoven University of Technology, Ruhr University Bochum
ICLR 2025

TLDR

We propose a safe curriculum generation method (SCG) that reduces safety constraint violations during training while boosting the learning speed of constrained reinforcement learning agents.


Given a distribution over target tasks, SCG generates a sequence of task distributions that initially prioritizes safety over performance, then shifts its focus to performance, and finally approaches the target distribution, treating safety and performance equally until training ends.
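
To make this progression concrete, here is a minimal Python sketch of one safety-prioritizing selection step. It illustrates the idea rather than the authors' implementation, and it assumes the training loop supplies candidate contexts, estimators of expected cost and return, and a sampler for the target distribution (all hypothetical names).

import numpy as np

def next_task_batch(candidates, est_cost, est_return, sample_target,
                    progress, cost_threshold=0.0, batch_size=32):
    # candidates:    (N, d) array of candidate contexts (tasks)
    # est_cost, est_return: callables mapping contexts -> estimated expected cost / return
    # sample_target: callable returning n contexts drawn from the target distribution
    # progress:      training progress in [0, 1]
    costs = est_cost(candidates)
    returns = est_return(candidates)

    # Phase 1: prioritize safety -- keep contexts whose estimated cost meets
    # the threshold; if none do, fall back to the least costly quartile.
    safe = costs <= cost_threshold
    if not safe.any():
        safe = costs <= np.quantile(costs, 0.25)

    # Phase 2: among the (estimated) safe contexts, prefer high-return ones.
    order = np.argsort(-returns[safe])
    batch = candidates[safe][order[:batch_size]]

    # Phase 3: mix in more and more target-distribution contexts so the
    # curriculum approaches the target distribution by the end of training.
    n_target = min(int(progress * batch_size), len(batch))
    if n_target > 0:
        batch[:n_target] = sample_target(n_target)
    return batch

The actual SCG update operates on task distributions rather than discrete batches; the sketch only conveys the ordering of priorities.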

Abstract

Curriculum learning aims to accelerate reinforcement learning (RL) by generating curricula, i.e., sequences of tasks of increasing difficulty. Although existing curriculum generation approaches provide benefits in sample efficiency, they overlook safety-critical settings where an RL agent must adhere to safety constraints. Thus, these approaches may generate tasks that cause RL agents to violate safety constraints during training and behave suboptimally afterward. We develop a safe curriculum generation approach (SCG) that aligns the objectives of constrained RL and curriculum learning: improving safety during training and boosting sample efficiency. SCG generates sequences of tasks in which the RL agent can be safe and performant by initially favoring tasks with minimal safety violations over high-reward ones. We empirically show that, compared to state-of-the-art curriculum learning approaches and their naively modified safe versions, SCG achieves optimal performance and the fewest constraint violations during training.

Failure of Curricula to Ensure Safety

State-of-the-art curriculum learning methods focus on the standard multi-task RL problem, i.e., maximizing expected return on tasks drawn from a target distribution. They overlook constrained RL and hence cannot distinguish unsafe behaviors; we call this the misalignment between the objectives of curriculum learning and constrained RL. As a result, these methods propose tasks that, while yielding high rewards, also incur high costs, leading to constraint violations.
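
To make the misalignment concrete, in common CMDP notation (ours, not quoted from the paper), standard curricula optimize only the left-hand objective, while constrained RL additionally requires the expected cost to stay below a threshold $d$:

\[
\max_{\pi}\; \mathbb{E}_{c \sim \mu}\big[J_R(\pi, c)\big]
\qquad \text{vs.} \qquad
\max_{\pi}\; \mathbb{E}_{c \sim \mu}\big[J_R(\pi, c)\big]
\;\; \text{s.t.} \;\; \mathbb{E}_{c \sim \mu}\big[J_C(\pi, c)\big] \le d,
\]

where $J_R(\pi, c)$ and $J_C(\pi, c)$ denote the expected return and expected cost of policy $\pi$ in context (task) $c$, and $\mu$ is the target context distribution. A curriculum that only tracks $J_R$ can therefore steer the agent toward high-return but high-cost tasks.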


Safety-maze as an example task: Starting from the bottom-left corner (green), the agent must reach a goal while avoiding the hazards (red), where it incurs cost. The task, i.e., the context, specifies the goal position and tolerance, i.e., the Euclidean distance to the goal within which the task counts as a success. The agent can move freely over the white areas but cannot access the walled sections in black. The marked points visualize contexts drawn from the context distributions at various epochs, and their color indicates the goal tolerance. (Left) CURROT's curricula: CURROT, a state-of-the-art curriculum learning method (Klink et al., 2022), moves contexts from the bottom row toward the target context distribution, which is uniform on the top row. Because CURROT ignores the cost, it places goals mostly in the red region early on, causing suboptimal behaviors. (Right) SCG's curricula: SCG reduces constraint violations by placing goals in the right column, which incurs no cost, or goals with high tolerance (brighter colors) on hazards.
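
For intuition, a context in this maze is just a goal position plus a tolerance; the sketch below (hypothetical code, not the actual environment implementation) shows how success and cost could be computed from it.

import numpy as np

# Hypothetical success/cost signals for a context-parameterized maze;
# the real environment has its own layout, dynamics, and reward structure.
def evaluate_state(agent_pos, context, hazards):
    # context: (goal_x, goal_y, tolerance); hazards: list of ((x, y), radius) circles
    goal, tolerance = np.asarray(context[:2]), context[2]
    pos = np.asarray(agent_pos)
    # Success: the agent is within the goal tolerance (Euclidean distance).
    success = np.linalg.norm(pos - goal) <= tolerance
    # Cost: 1 if the agent is inside any hazard region, else 0.
    cost = float(any(np.linalg.norm(pos - np.asarray(center)) <= radius
                     for center, radius in hazards))
    return success, cost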

Experimental Results

We study three constrained RL domains that showcase the misalignment phenomenon: safety-maze, safety-goal, and safety-push. Safety-goal and safety-push involve navigation tasks with realistic sensory observations. TLDR: SCG consistently yields optimal policies, achieving zero cost and the highest success rate in all environments. Furthermore, SCG incurs the fewest constraint violations among the methods that learn optimal policies.

Learning Optimal Policies for Constrained RL

Safety-Maze


Safety-Goal


Safety-Push


Final success rates (top row) and expected costs (bottom row) of all evaluated methods in safety-maze (left), safety-goal (center), and safety-push (right) domains.

Reducing Constraint Violations

Safety-Maze


Safety-Goal


Safety-Push


Constraint-violation (CV) regret, i.e., the cost accumulated in excess of the cost threshold during training, in the safety-maze (left), safety-goal (center), and safety-push (right) domains.
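
One common way to write this regret (our notation, assuming the expected cost $J_C(\pi_t)$ of the policy at training iteration $t$ and cost threshold $d$):

\[
\mathrm{CV\text{-}regret}(T) \;=\; \sum_{t=1}^{T} \max\!\big(0,\; J_C(\pi_t) - d\big).
\]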

Poster

Presentation

BibTeX

@inproceedings{
  koprulu2025safetyprioritizing,
  title={Safety-Prioritizing Curricula for Constrained Reinforcement Learning},
  author={Cevahir Koprulu and Thiago D. Sim{\~a}o and Nils Jansen and Ufuk Topcu},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=f3QR9TEERH}
}