Sotabase
Coleman Hooper
PhD Student, UC Berkeley, 2021–present
Publications (28)
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization
Neural Information Processing Systems · 2024 · 391 citations
AI and Memory Wall
IEEE Micro · 2024 · 270 citations
SqueezeLLM: Dense-and-Sparse Quantization
International Conference on Machine Learning · 2023 · 268 citations
Full Stack Optimization of Transformer Inference: a Survey
arXiv.org · 2023 · 152 citations
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Micro · 2020 · 149 citations
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
arXiv.org · 2023 · 146 citations
22.9 A 12nm 18.1TFLOPs/W Sparse Transformer Processor with Entropy-Based Early Exit, Mixed-Precision Predication and Fine-Grained Power Management
IEEE International Solid-State Circuits Conference · 2023 · 49 citations
SPEED: Speculative Pipelined Execution for Efficient Decoding
arXiv.org · 2023 · 49 citations
TinyAgent: Function Calling at the Edge
Conference on Empirical Methods in Natural Language Processing · 2024 · 39 citations
Squeezed Attention: Accelerating Long Context Length LLM Inference
Annual Meeting of the Association for Computational Linguistics · 2024 · 35 citations
A 16-nm SoC for Noise-Robust Speech and NLP Edge AI Inference With Bayesian Sound Source Separation and Attention-Based DNNs
IEEE Journal of Solid-State Circuits · 2023 · 29 citations
SLoRA: Scalable Serving of Thousands of LoRA Adapters
Conference on Machine Learning and Systems · 2024 · 27 citations
9.8 A 25mm² SoC for IoT Devices with 18ms Noise-Robust Speech-to-Text Latency via Bayesian Speech Denoising and Attention-Based Sequence-to-Sequence DNN Speech Recognition in 16nm FinFET
IEEE International Solid-State Circuits Conference · 2021 · 25 citations
EdgeBERT: Optimizing On-Chip Inference for Multi-Task NLP
arXiv.org · 2020 · 12 citations
ParallelBench: Understanding the Trade-offs of Parallel Decoding in Diffusion LLMs
arXiv.org · 2025 · 12 citations
Property-Aware Multi-Speaker Data Simulation: A Probabilistic Modelling Technique for Synthetic Data Generation
7th International Workshop on Speech Processing in Everyday Environments (CHiME 2023) · 2023 · 11 citations
ETS: Efficient Tree Search for Inference-Time Scaling
arXiv.org · 2025 · 10 citations
QuantSpec: Self-Speculative Decoding with Hierarchical Quantized KV Cache
International Conference on Machine Learning · 2025 · 9 citations
Learned Best-Effort LLM Serving
arXiv.org · 2024 · 5 citations
FGMP: Fine-Grained Mixed-Precision Weight and Activation Quantization for Hardware-Accelerated LLM Inference
arXiv.org · 2025 · 4 citations
Coleman Hooper · Researcher Profile · Sotabase