Technical AI Safety Projects

LLMs as architects for multi-agent systems, and exploring information flows in LLMs

I'm a DPhil student in Machine Learning in the FLAIR lab at the University of Oxford, supervised by Jakob Foerster and funded by the EPSRC through the AIMS CDT. My research focuses on scalable methods for reducing risks from advanced AI systems, exploring AI Safety, Interpretability, and Multi-Agent Systems, all through the unifying lens of scale.

Previously, I was a Research Scientist Intern at Spotify and worked with the UK AI Security Institute (AISI) on their Bounty Programme, investigating the automated design of agentic systems. I was also the founding Research Scientist at Convergence (acquired by Salesforce), where I contributed to Proxy, a state-of-the-art multimodal web agent with 100k+ users, and I held senior engineering roles at Pynea and Artera, leading teams and shipping ML innovations.

LLMs as architects for multi-agent systems

This project explores the use of large language models (LLMs) as automated architects for complex multi-agent systems, with cybersecurity as a test domain. Using the CyberAgentBreeder framework, which designs Python-based scaffolds for Capture-the-Flag (CTF) challenges, the project evaluates the quality, novelty, and sophistication of the agent architectures generated by different models. Building on early evidence that state-of-the-art LLMs can propose fundamentally new designs, the project aims to determine how model scale and reasoning ability shape the complexity of the resulting systems, including features such as hierarchical team structures, specialized roles, and advanced communication protocols.
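
To make the idea concrete, the sketch below shows a generic LLM-as-architect loop in which the model proposes scaffold code and a simple evolutionary search keeps the strongest designs. All names and signatures here are illustrative assumptions; this is not the CyberAgentBreeder interface, and the evaluate callable stands in for sandboxed CTF scoring.

# Hypothetical sketch of an LLM-as-architect loop (illustrative assumptions only;
# not the CyberAgentBreeder API). The architect LLM proposes Python scaffold code
# for multi-agent CTF solvers; an evolutionary loop keeps the best-scoring designs.
from typing import Callable

def propose_scaffold(llm: Callable[[str], str], parents: list[str]) -> str:
    """Ask the architect LLM for a new scaffold, conditioned on prior designs."""
    prompt = (
        "Design a multi-agent system for Capture-the-Flag challenges.\n"
        "Existing scaffolds:\n" + "\n---\n".join(parents) + "\n"
        "Return new Python scaffold code with a novel architecture, e.g. "
        "hierarchical teams, specialised roles, or richer communication protocols."
    )
    return llm(prompt)  # the LLM returns Python source code as a string

def architect_search(
    llm: Callable[[str], str],
    evaluate: Callable[[str], float],   # e.g. sandboxed CTF success rate of a scaffold
    seeds: list[str],
    generations: int = 10,
    population: int = 8,
) -> list[tuple[float, str]]:
    """Evolutionary search over scaffolds, with the LLM acting as the mutation operator."""
    pool = [(evaluate(s), s) for s in seeds]
    for _ in range(generations):
        # Condition new proposals on the two best designs found so far.
        parents = [s for _, s in sorted(pool, key=lambda x: x[0], reverse=True)[:2]]
        children = [propose_scaffold(llm, parents) for _ in range(population)]
        pool += [(evaluate(c), c) for c in children]
        pool = sorted(pool, key=lambda x: x[0], reverse=True)[:population]
    return pool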

Exploring information flows in LLMs

As transformer-based large language models (LLMs) scale to million-token contexts, understanding how information deeply embedded in the context flows through the neural network to the output becomes critical. This project investigates the interplay of three emerging failure modes in long-context LLMs: over-squashing, where crucial signals are compressed into indistinguishable representations; over-smoothing, where representations across tokens become homogenized; and under-reaching, where information fails to propagate far enough.
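
As a rough illustration of how two of these failure modes can be quantified (my own assumptions, not necessarily the diagnostics used in this project): over-smoothing is commonly tracked via the average pairwise similarity of token representations at a layer, and under-reaching via how far back a layer's attention actually reads. Over-squashing is typically probed via the sensitivity of outputs to distant inputs, which is more expensive and omitted here. A minimal PyTorch sketch:

import torch

def over_smoothing_score(hidden: torch.Tensor) -> torch.Tensor:
    # hidden: (T, d) token representations at one layer for one sequence.
    # Mean pairwise cosine similarity; values near 1 mean representations have homogenized.
    h = torch.nn.functional.normalize(hidden, dim=-1)
    sim = h @ h.T                               # (T, T) cosine similarities
    T = sim.shape[0]
    return (sim.sum() - sim.diagonal().sum()) / (T * (T - 1))

def mean_attention_reach(attn: torch.Tensor) -> torch.Tensor:
    # attn: (T, T) causal attention weights for one head; row i attends to positions <= i.
    # Expected query-key distance; a value small relative to T suggests under-reaching,
    # i.e. distant context is rarely read at this layer.
    T = attn.shape[0]
    idx = torch.arange(T, device=attn.device)
    dist = (idx[:, None] - idx[None, :]).clamp(min=0).float()
    return (attn * dist).sum(dim=-1).mean()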

Using a novel interpretability framework based on sparse attention (work under submission), we can prune the attention pattern to identify key information flow routes, without incurring the prohibitive $O(T^2)$ memory overhead of dense attention interpretability.
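
A hedged sketch of the general idea, under my own assumptions rather than the framework under submission: keep only the top-k attention edges per query at each layer, so storage scales as $O(Tk)$ per layer instead of $O(T^2)$, then trace routes backwards through the pruned graph from an output token to the context positions that can still reach it.

import torch

def topk_attention_edges(attn: torch.Tensor, k: int = 8):
    # attn: (T, T) attention weights for one head at one layer.
    # Keep only the k strongest keys per query, returning flat (query, key, weight) lists;
    # this stores O(T*k) values per layer instead of the full O(T^2) map.
    weights, keys = attn.topk(k, dim=-1)                        # (T, k) each
    queries = torch.arange(attn.shape[0]).unsqueeze(-1).expand_as(keys)
    return queries.reshape(-1), keys.reshape(-1), weights.reshape(-1)

def trace_routes(per_layer_edges, target_token: int, threshold: float = 0.05):
    # per_layer_edges: list over layers of (queries, keys, weights) from topk_attention_edges.
    # Walk the pruned graph backwards from an output token to find the context positions
    # that can still influence it through the retained edges.
    frontier = {target_token}
    for queries, keys, weights in reversed(per_layer_edges):
        reached = set()
        for q, key, w in zip(queries.tolist(), keys.tolist(), weights.tolist()):
            if q in frontier and w >= threshold:
                reached.add(key)
        frontier = frontier | reached        # the residual stream keeps each token itself
    return sorted(frontier)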

Invariance-aware diverse reward model ensembles

Matthew is a DPhil student working on foundational research for technical AI safety, including agent foundations and singular learning theory. Matthew previously worked on understanding goal misgeneralisation at Krueger AI Safety Lab, applying singular learning theory with Timaeus, and the foundations of reward learning at CHAI. Matthew is excited to help aspiring researchers develop their research skills and contribute to technical AI safety.

We want powerful AI models to internalise human goals. Modern approaches involve, either explicitly or implicitly, learning a reward function from human data generated according to these goals, and then extrapolating that reward function. However, there are often multiple plausible but conflicting ways to extrapolate the goals from the given data. If we could detect this kind of multiplicity at the reward modelling stage, we could avoid catastrophic misgeneralisation. A standard method for detecting and addressing ambiguity in supervised learning is to train a diverse ensemble of models rather than a single model. This project seeks to extend such methods to the domain of reward learning, accounting for the invariance structures unique to reward functions in RL.
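
As a rough sketch of what an invariance-aware diversity objective could look like (my own framing under simplifying assumptions, not the project's method): each ensemble member is fit to preference pairs with a Bradley-Terry loss, and a diversity term decorrelates members' standardised reward differences on unlabelled pairs, a signature that is unchanged when a member's rewards are shifted or rescaled, so that mere reparameterisations do not count as genuine disagreement.

# Minimal sketch of an invariance-aware diverse reward ensemble (illustrative assumptions;
# architectures, data shapes, and the specific diversity penalty are all hypothetical).
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    # Small MLP reward model over observation features.
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)

def ensemble_loss(models, preferred, rejected, unlabeled_a, unlabeled_b, beta: float = 0.1):
    # preferred / rejected: (B, obs_dim) labelled preference pairs.
    # unlabeled_a / unlabeled_b: (B, obs_dim) pairs used only for the diversity term.
    fit_terms, signatures = [], []
    for m in models:
        # Bradley-Terry fit: -log sigmoid(r(preferred) - r(rejected)).
        logits = m(preferred) - m(rejected)
        fit_terms.append(nn.functional.softplus(-logits).mean())
        # Invariance-aware signature: standardised reward differences on unlabelled pairs,
        # unchanged under r -> a*r + b with a > 0.
        diff = m(unlabeled_a) - m(unlabeled_b)
        signatures.append((diff - diff.mean()) / (diff.std() + 1e-6))
    fit = torch.stack(fit_terms).mean()
    # Penalise squared correlation between members' signatures, so disagreement reflects
    # genuinely different extrapolations rather than reparameterised copies.
    div, n = 0.0, len(models)
    for i in range(n):
        for j in range(i + 1, n):
            div = div + (signatures[i] * signatures[j]).mean() ** 2
    div = div / max(n * (n - 1) // 2, 1)
    return fit + beta * div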