Using semantic entropy to detect unfaithful reasoning
Mentor:
Avi is a DPhil student, funded to work on technical AI governance by the AI Governance Initiative at the Oxford Martin School. His previous research, carried out during the Pivotal fellowship, focussed on the effect of stochasticity in training algorithms and the impact of adaptive optimisation on training dynamics in deep neural networks.
Abstract:
Current frontier AI models often produce natural language “chain-of-thought” reasoning before acting. This provides an affordance to monitor their intermediate computations and to detect undesired motivations and planning. However, there are many examples where this chain-of-thought is unfaithful. If we could detect when reasoning steps are post-hoc rationalisations, this could strengthen such monitoring. This project would investigate whether semantic entropy (or the more efficient semantic entropy probes), which has been shown to be effective at detecting hallucinations, can also be used to detect unfaithful reasoning steps.
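The quantity itself is straightforward to prototype. The sketch below is a minimal illustration rather than the full method of Farquhar et al. (2024): it clusters sampled answers by bidirectional entailment and takes the entropy of the cluster frequencies. The `entails` callable is an assumed placeholder for whatever equivalence judge is used in practice (e.g. an NLI model or an LLM judge).

```python
# Minimal sketch of discrete semantic entropy: cluster sampled answers by
# bidirectional entailment, then take the entropy of the cluster frequencies.
# `entails` is an assumed placeholder for an NLI model or LLM judge.
import math
from collections import Counter
from typing import Callable, List


def cluster_by_meaning(answers: List[str], entails: Callable[[str, str], bool]) -> List[int]:
    """Assign each answer to a cluster of semantically equivalent answers."""
    cluster_ids: List[int] = []
    representatives: List[str] = []  # one representative answer per cluster
    for ans in answers:
        for cid, rep in enumerate(representatives):
            # Two answers share a meaning if each entails the other.
            if entails(ans, rep) and entails(rep, ans):
                cluster_ids.append(cid)
                break
        else:
            representatives.append(ans)
            cluster_ids.append(len(representatives) - 1)
    return cluster_ids


def semantic_entropy(answers: List[str], entails: Callable[[str, str], bool]) -> float:
    """Discrete semantic entropy: entropy of the empirical cluster distribution."""
    counts = Counter(cluster_by_meaning(answers, entails))
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

In the hallucination setting, the answers would be several completions sampled at non-zero temperature for the same question; high entropy over meaning clusters is the signal associated with confabulation, and the project would test whether an analogous signal tracks unfaithful reasoning.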
Project description:
The steps involved in the project would likely be the following:
Read relevant papers and identify related work
Write code to run models on prompts that elicit potentially unfaithful reasoning
Test whether semantic entropy is effective at identifying when unfaithful reasoning occurs
Either on the level of prompts: detecting whether the entire chain-of-thought given a prompt will be faithful
Or on the level of reasoning steps: detecting whether a subsequent step follows from the context or is a post-hoc rationalisation (see the sketch after this list)
(Optional) If semantic entropy is effective, try semantic entropy probes
Compare to a baseline (e.g. a linear probe)
Write up results
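As a rough guide to the step-level variant and the linear probe baseline mentioned above, the sketch below scores each chain-of-thought prefix by the semantic entropy of sampled candidate next steps and fits a logistic-regression probe on hidden-state features. It is a sketch under stated assumptions: `sample_next_steps` is a hypothetical sampler (e.g. wrapping a Hugging Face `generate` call with `num_return_sequences=k`), `score` would be the semantic-entropy function from the earlier sketch, and the probe features are assumed to be per-step hidden-state vectors extracted separately.

```python
# Step-level scoring and a linear probe baseline (sketch, not a full pipeline).
from typing import Callable, List

import numpy as np
from sklearn.linear_model import LogisticRegression


def per_step_semantic_entropy(
    prompt: str,
    steps: List[str],
    sample_next_steps: Callable[[str, int], List[str]],  # hypothetical model-specific sampler
    score: Callable[[List[str]], float],                  # e.g. semantic_entropy from the sketch above
    k: int = 10,
) -> List[float]:
    """Semantic entropy of the model's next-step distribution at each CoT position."""
    scores = []
    for i in range(len(steps)):
        context = prompt + "".join(steps[:i])        # prompt plus the chain-of-thought prefix
        candidates = sample_next_steps(context, k)   # k sampled candidate next steps
        scores.append(score(candidates))
    return scores


def fit_linear_probe(features: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Baseline: a linear probe on per-step hidden states, trained to predict
    faithful (0) vs unfaithful (1) steps, to compare against the entropy scores."""
    probe = LogisticRegression(max_iter=1000)
    probe.fit(features, labels)
    return probe
```

Both the per-step entropy scores and the probe's predicted probabilities could then be evaluated against faithfulness labels (for example by AUROC) to compare the two detectors.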
References:
Farquhar, S., Kossen, J., Kuhn, L., et al. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.
Arcuschin, I., Janiak, J., Krzyzanowski, R., Rajamanoharan, S., Nanda, N., & Conmy, A. (2025). Chain-of-thought reasoning in the wild is not always faithful. arXiv preprint arXiv:2503.08679.
Ideal candidate:
Is familiar with relevant machine learning concepts (logits, language model APIs, etc.)
Has basic programming skills for machine learning (running experiments, plotting results, managing codebases and datasets)
Is excited about applying interpretability to downstream safety-relevant problems