Sotabase
Career
Senior Research Fellow, University of Oxford · 2025–
Publications (60)
Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training
arXiv.org · 2024 · 285 citations

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
arXiv.org · 2024 · 84 citations

Detecting Edit Failures In Large Language Models: An Improved Specificity Benchmark
Annual Meeting of the Association for Computational Linguistics · 2023 · 70 citations

Towards Interpreting Visual Information Processing in Vision-Language Models
International Conference on Learning Representations · 2024 · 55 citations

Open Problems in Machine Unlearning for AI Safety
arXiv.org · 2025 · 41 citations

The Larger They Are, the Harder They Fail: Language Models do not Recognize Identifier Swaps in Python
Annual Meeting of the Association for Computational Linguistics · 2023 · 39 citations

Best-of-N Jailbreaking
arXiv.org · 2024 · 31 citations

Understanding Addition in Transformers
International Conference on Learning Representations · 2023 · 30 citations

Large Language Models Relearn Removed Concepts
Annual Meeting of the Association for Computational Linguistics · 2024 · 28 citations

Neuron to Graph: Interpreting Language Model Neurons at Scale
arXiv.org · 2023 · 27 citations

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
arXiv.org · 2025 · 27 citations

Risks and Opportunities of Open-Source Generative AI
arXiv.org · 2024 · 26 citations

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
arXiv.org · 2024 · 25 citations

Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
arXiv.org · 2024 · 19 citations

Near to Mid-term Risks and Opportunities of Open Source Generative AI
arXiv.org · 2024 · 19 citations

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
arXiv.org · 2025 · 15 citations

Establishing Best Practices for Building Rigorous Agentic Benchmarks
arXiv.org · 2025 · 14 citations

Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
2024 · 12 citations

The Singapore Consensus on Global AI Safety Research Priorities
Robotics · 2025 · 10 citations

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
Conference on Empirical Methods in Natural Language Processing · 2023 · 9 citations
(20 of 60 papers shown)
Fazl Barez | Researcher Profile | Sotabase