Sotabase
Career
Senior Research Fellow, University of Oxford · 2025–
Publications (60)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
arXiv.org · 2024 · 285 citations

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
arXiv.org · 2024 · 84 citations

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Annual Meeting of the Association for Computational Linguistics · 2023 · 70 citations

Towards Interpreting Visual Information Processing in Vision-Language Models
International Conference on Learning Representations · 2024 · 55 citations

Open Problems in Machine Unlearning for AI Safety
arXiv.org · 2025 · 41 citations

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python
Annual Meeting of the Association for Computational Linguistics · 2023 · 39 citations

Best-of-N Jailbreaking
arXiv.org · 2024 · 31 citations

Understanding Addition in Transformers
International Conference on Learning Representations · 2023 · 30 citations

Large Language Models Relearn Removed Concepts
Annual Meeting of the Association for Computational Linguistics · 2024 · 28 citations

Neuron to Graph: Interpreting Language Model Neurons at Scale
arXiv.org · 2023 · 27 citations

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
arXiv.org · 2025 · 27 citations

Risks and Opportunities of Open-Source Generative AI
arXiv.org · 2024 · 26 citations

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
arXiv.org · 2024 · 25 citations

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
arXiv.org · 2024 · 19 citations

Near to Mid-term Risks and Opportunities of Open Source Generative AI
arXiv.org · 2024 · 19 citations

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
arXiv.org · 2025 · 15 citations

Establishing Best Practices for Building Rigorous Agentic Benchmarks
arXiv.org · 2025 · 14 citations

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
2024 · 12 citations

The Singapore Consensus on Global AI Safety Research Priorities
Robotics · 2025 · 10 citations

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Conference on Empirical Methods in Natural Language Processing · 2023 · 9 citations
(20 of 60 papers shown)
Fazl Barez | Researcher Profile | Sotabase