Alexander Pan | Researcher Profile | Sotabase

Career

· Member of Technical Staff, xAI, 2025–

· PhD in Computer Science, UC Berkeley, 2022–2025

Publications (9)

Representation Engineering: A Top-Down Approach to AI Transparency

arXiv.org · 2023

739 citations

The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning

International Conference on Machine Learning · 2024

327 citations

The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

International Conference on Learning Representations · 2022

255 citations

Foundational Challenges in Assuring Alignment and Safety of Large Language Models

arXiv.org · 2024

201 citations

Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark

International Conference on Machine Learning · 2023

168 citations

Feedback Loops With Language Models Drive In-Context Reward Hacking

International Conference on Machine Learning · 2024

59 citations

Safetywashing: Do AI Safety Benchmarks Actually Measure Safety Progress?

Neural Information Processing Systems · 2024

48 citations

Improving Robustness of Reinforcement Learning for Power System Control with Adversarial Training

arXiv.org · 2021

20 citations
