Sotabase
Career
· Member of Technical Staff, xAI, 2025–
· PhD in Computer Science, UC Berkeley, 2022–2025
Publications (9)
Representation Engineering: A Top-Down Approach to AI Transparency
arXiv.org · 2023
739 citations
The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning
International Conference on Machine Learning · 2024
327 citations
The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
International Conference on Learning Representations · 2022
255 citations
Foundational Challenges in Assuring Alignment and Safety of Large Language Models
arXiv.org · 2024
201 citations
Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark
International Conference on Machine Learning · 2023
168 citations
Feedback Loops With Language Models Drive In-Context Reward Hacking
International Conference on Machine Learning · 2024
59 citations
Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?
Neural Information Processing Systems · 2024
48 citations
Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training
arXiv.org · 2021
20 citations
Alexander Pan | Researcher Profile | Sotabase