Sotabase
Coleman Hooper
PhD Student, UC Berkeley, 2021–present
Publications (28)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Neural Information Processing Systems · 2024 · 391 citations
AI and Memory Wall
IEEE Micro · 2024 · 270 citations
SqueezeLLM: Dense-and-Sparse Quantization
International Conference on Machine Learning · 2023 · 268 citations
Full Stack Optimization of Transformer Inference: a Survey
arXiv.org · 2023 · 152 citations
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Micro · 2020 · 149 citations
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
arXiv.org · 2023 · 146 citations
22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management
IEEE International Solid-State Circuits Conference · 2023 · 49 citations
SPEED: Speculative Pipelined Execution for Efficient Decoding
arXiv.org · 2023 · 49 citations
TinyAgent: Function Calling at the Edge
Conference on Empirical Methods in Natural Language Processing · 2024 · 39 citations
Squeezed Attention: Accelerating Long Context Length LLM Inference
Annual Meeting of the Association for Computational Linguistics · 2024 · 35 citations
A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs
IEEE Journal of Solid-State Circuits · 2023 · 29 citations
SLoRA: Scalable Serving of Thousands of LoRA Adapters
Conference on Machine Learning and Systems · 2024 · 27 citations
9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET
IEEE International Solid-State Circuits Conference · 2021 · 25 citations
EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP
arXiv.org · 2020 · 12 citations
ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
arXiv.org · 2025 · 12 citations
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023) · 2023 · 11 citations
ETS: Efficient Tree Search for Inference-Time Scaling
arXiv.org · 2025 · 10 citations
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
International Conference on Machine Learning · 2025 · 9 citations
Learned Best-Effort LLM Serving
arXiv.org · 2024 · 5 citations
FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
arXiv.org · 2025 · 4 citations
Coleman Hooper · Researcher Profile · Sotabase