Invariance-aware diverse reward model ensembles

Abstract:

We want powerful AI models to internalise human goals. Modern approaches involve (either explicitly or implicitly) learning a reward function from human data that reflects these goals, and then extrapolating that reward function. However, there are often multiple plausible but conflicting ways to extrapolate the goals from the given data. If we were able to detect this kind of multiplicity at the reward modelling stage, we could avoid catastrophic misgeneralisation. A standard method for detecting and addressing ambiguity in supervised learning is to train a diverse ensemble of models (rather than a single model). This project seeks to extend such methods to the domain of reward learning, accounting for the unique invariance structures of RL.

Description:

The fundamental challenge of AI alignment is to get a powerful AI model to internalise human goals. Current approaches to alignment generally involve either explicitly or implicitly modelling human goals.

  1. In RLHF, we align a base model by explicitly learning a reward model from human preference data and then querying the reward model to generate rewards used as a training signal for tuning the base model. The reward model is an explicit model of the human goals.

  2. In other tuning methods (DPO, RLAIF), we don’t explicitly learn a separate reward model; instead, we train the base model directly on a human-derived training signal. Even so, the base model must still form its own (at least implicit) model of the task it is being trained on, so that it can do the right thing in novel situations.

When there are multiple plausible but conflicting ways to extrapolate goals from the given data, we face the risk of learning the wrong reward model, a phenomenon known as goal misgeneralisation [4]. Goal misgeneralisation is dangerous because it results in a capable model having internalised the wrong goal, which it would then pursue at the expense of our intended goals.

If we were able to detect the presence of plausible alternative goal extrapolations, we would get an advance warning of the possibility of goal misgeneralisation, and we could put mitigations in place (such as getting more data to clarify the correct extrapolation).

In supervised learning, there are standard methods for detecting and addressing the presence of plausible alternative goal extrapolations. One method is to train a diverse ensemble of models, rather than a single model, using an objective that encourages the individual models in the ensemble to make similar predictions on labelled training data while making different predictions on unlabelled test data (see [1]).
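
To make the ensemble-training idea concrete, here is a minimal PyTorch sketch of a diversity objective in the spirit of [1]: every model in the ensemble is fit to the labelled data, while a penalty pushes their predictions apart on unlabelled data. The two-head setup, the toy data, and the simple pairwise-similarity penalty (used here in place of the mutual-information term from [1]) are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ensemble_loss(heads, x_labelled, y_labelled, x_unlabelled, diversity_weight=1.0):
    """Fit every head to the labels; push the heads apart on unlabelled inputs."""
    # 1) Fit term: standard cross-entropy for each head on the labelled data.
    fit_loss = sum(F.cross_entropy(head(x_labelled), y_labelled) for head in heads) / len(heads)

    # 2) Diversity term: penalise pairwise similarity of the heads' predictive
    #    distributions on unlabelled inputs (a simple stand-in for the
    #    mutual-information term used in [1]).
    probs = [F.softmax(head(x_unlabelled), dim=-1) for head in heads]
    similarity, n_pairs = 0.0, 0
    for i in range(len(probs)):
        for j in range(i + 1, len(probs)):
            similarity = similarity + (probs[i] * probs[j]).sum(dim=-1).mean()
            n_pairs += 1

    return fit_loss + diversity_weight * similarity / max(n_pairs, 1)

# Toy usage: two linear heads, 8-dimensional inputs, 3 classes (all hypothetical).
heads = [torch.nn.Linear(8, 3) for _ in range(2)]
x_lab, y_lab = torch.randn(16, 8), torch.randint(0, 3, (16,))
x_unlab = torch.randn(32, 8)
ensemble_loss(heads, x_lab, y_lab, x_unlab).backward()
```

The diversity_weight hyperparameter controls how strongly disagreement on unlabelled data is rewarded relative to fitting the labels.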

However, this approach from supervised learning is not suitable for learning reward models, because what matters about a reward model is not the predictions it gives for individual inputs, but the policies that it incentivises. There are typically many different reward functions that all incentivise the same optimal policies [3]. In order to incentivise a reward model ensemble to learn meaningfully diverse reward models, we need to account for these invariances.
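
As a concrete example of such an invariance, potential shaping leaves the optimal policy unchanged: adding gamma * Phi(s') - Phi(s) to the reward, for any function Phi over states, does not affect which policies are optimal. The sketch below (not taken from [3]; the tiny random MDP and potential function are arbitrary illustrative choices) checks this with value iteration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # original reward R(s, a)
phi = rng.normal(size=n_states)                                   # arbitrary potential Phi(s)

# Shaped reward: R'(s, a) = R(s, a) + gamma * E[Phi(s') | s, a] - Phi(s).
R_shaped = R + gamma * (P @ phi) - phi[:, None]

def optimal_policy(reward):
    """Value iteration followed by the greedy (optimal) policy."""
    V = np.zeros(n_states)
    for _ in range(1000):
        V = (reward + gamma * (P @ V)).max(axis=1)
    return (reward + gamma * (P @ V)).argmax(axis=1)

# The two reward functions differ, yet the optimal policies are identical.
print(optimal_policy(R), optimal_policy(R_shaped))
```

A diversity term that only compared raw reward predictions could be satisfied by ensemble members that differ by such a shaping term while incentivising exactly the same behaviour.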

Fortunately, it is possible to detect when two reward functions induce similar policies using a STARC metric [2]. The aim of this project would be to generalise the diverse-ensemble method from [1] and combine it with a STARC metric from [2], producing a new method for learning a meaningfully diverse ensemble of reward models.
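
As a rough indication of what the combined objective might look like, the sketch below trains a two-member ensemble of reward models on a toy preference using a Bradley–Terry fit term, while a STARC-style distance (canonicalise to remove potential shaping, normalise, compare, following the general recipe of [2]) serves as a diversity bonus. Everything here is a simplifying assumption: the tabular reward models, the known transition matrix, the uniform reference policy for canonicalisation, the choice of norm, and the way the two terms are weighted. Working out the right versions of these choices is exactly what the project would investigate.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Known (hypothetical) dynamics P[s, a, s'], row-stochastic over s'.
P = torch.softmax(torch.randn(n_states, n_actions, n_states), dim=-1)

def value_uniform(R):
    """V^pi of the uniform-random policy under tabular reward R(s, a)."""
    r_pi = R.mean(dim=1)   # E_a[R(s, a)]
    P_pi = P.mean(dim=1)   # E_a[P(s' | s, a)]
    return torch.linalg.solve(torch.eye(n_states) - gamma * P_pi, r_pi)

def starc_like_distance(R1, R2):
    """Canonicalise (remove potential shaping), normalise, then compare."""
    def canon(R):
        V = value_uniform(R)
        c = R + gamma * (P @ V) - V[:, None]   # R(s,a) + gamma*E[V(s')] - V(s)
        return c / (c.norm() + 1e-8)
    return (canon(R1) - canon(R2)).norm()

def preference_loss(R, better_traj, worse_traj):
    """Bradley-Terry loss: the preferred trajectory should score a higher return."""
    ret = lambda traj: sum(R[s, a] for s, a in traj)
    return -F.logsigmoid(ret(better_traj) - ret(worse_traj))

# A two-member ensemble of learnable tabular reward models.
ensemble = [torch.nn.Parameter(torch.randn(n_states, n_actions)) for _ in range(2)]

# One toy preference: trajectory A is preferred over trajectory B ((s, a) pairs).
traj_a = [(0, 1), (2, 0), (4, 2)]
traj_b = [(1, 0), (3, 1), (0, 2)]

diversity_weight = 0.1
fit = sum(preference_loss(R, traj_a, traj_b) for R in ensemble) / len(ensemble)
diversity = starc_like_distance(ensemble[0], ensemble[1])
loss = fit - diversity_weight * diversity   # fit the preferences, but stay far apart
loss.backward()
```

One attraction of using a STARC-style distance for the diversity term is that it is invariant to positive rescaling and potential shaping and is bounded, so the ensemble cannot inflate the diversity bonus without representing genuinely different incentives.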

References:

  1. Lee, Yao, and Finn (2023) “Diversify and disambiguate: Out-of-distribution robustness via disagreement.” https://arxiv.org/abs/2202.03418.

  2. Skalse et al. (2024) “STARC: A general framework for quantifying differences between reward functions.” https://arxiv.org/abs/2309.15257.

  3. Skalse, Farrugia-Roberts, et al. (2023) “Invariance in policy optimisation and partial identifiability in reward learning.” https://arxiv.org/abs/2203.07475.

  4. Langosco, Koch, Sharkey, et al. (2022) “Goal misgeneralization in deep reinforcement learning.” https://arxiv.org/abs/2105.14111.

Project outline:

This project would involve the following steps:

  1. Conducting a brief literature survey to identify related work on ambiguity in reward modelling and learning diverse reward ensembles.

  2. Understanding references [1] and [2] in detail and combining their methods to formulate a new objective for invariance-aware diverse reward model ensemble training.

  3. Designing a small-scale proof-of-concept reward modelling experiment (for example, training a CNN to learn the reward function of a grid-world environment; a minimal sketch of such a model appears after this outline).

  4. Implementing the objective and conducting the designed experiments.

  5. Writing up the idea and the results of the proof-of-concept experiments for a final report.
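
To indicate the intended scale of step 3, here is a minimal sketch of the kind of model involved: a tiny CNN that maps a one-hot grid-world observation to a scalar reward, trained here by simple regression onto ground-truth rewards. The grid size, channel layout, and regression setup are illustrative assumptions; a real experiment might instead train on preference labels, as in the description above.

```python
import torch
import torch.nn as nn

class GridRewardModel(nn.Module):
    """Tiny CNN mapping a one-hot grid observation to a scalar reward (hypothetical sizes)."""
    def __init__(self, n_channels=4, grid_size=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * grid_size * grid_size, 1),
        )

    def forward(self, obs):                # obs: (batch, channels, height, width)
        return self.net(obs).squeeze(-1)   # one scalar reward per observation

# Toy training step: regress predicted rewards onto ground-truth rewards.
model = GridRewardModel()
obs = torch.randint(0, 2, (32, 4, 7, 7)).float()
true_reward = torch.randn(32)
loss = nn.functional.mse_loss(model(obs), true_reward)
loss.backward()
```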

Ideal candidate:

This is a clearly scoped, hands-on research project; no prior research experience is required.

This project involves a mix of theoretical and practical components. It would be suitable for a team of 2–3 students, all with a strong technical background covering the following prerequisites:

  1. Machine learning foundations (understand concepts like ‘ensembles’ and ‘objectives’). (Covered by ARENA Chapter 0).

  2. Basic practical deep learning experience (familiarity with PyTorch or JAX and with CNNs; able to run supervised learning experiments with small models). (Covered by ARENA Chapter 0).

  3. Reinforcement learning foundations (understand the framework of RL, including concepts like ‘reward function’ and ‘optimal policy’). (Covered by the first day of ARENA Chapter 2).

It’s acceptable if not everyone in the team meets all three prerequisites, but every prerequisite should be covered within the team, and each team member should meet most of them.

Skills students could gain:

This is a good opportunity to learn and practise the basic skills involved in running an AI safety research project, including:

  1. Evaluating a technical research project’s theory of change / path to impact.

  2. Finding academic papers on a specific topic.

  3. Reading a few technical research papers in detail.

  4. Designing novel deep learning experiments and interpreting the results.

  5. Writing about technical research and AI safety motivations.

  6. Working efficiently and effectively as a team.

If successful, this project could form the basis of a high-quality blog post and, with additional time, could be extended into a submission to an AI safety workshop. With further development, the findings could potentially grow into a conference submission.

Mentor:

Matthew is a DPhil student working on foundational research for technical AI safety, including agent foundations and singular learning theory. Matthew previously worked on understanding goal misgeneralisation at Krueger AI Safety Lab, applying singular learning theory with Timaeus, and the foundations of reward learning at CHAI. Matthew is excited to help aspiring researchers develop their research skills and contribute to technical AI safety.