BASS: Batched Attention-optimized Speculative Sampling

Speculative decoding has emerged as a powerful method to improve latency and throughput in hosting large language models. However, most existing implementations focus on generating a single sequence. Real-world generative AI applications often require multiple responses, and performing speculative decoding in a batched setting while preserving its latency benefits poses non-trivial challenges. This paper describes a system of batched speculative decoding that sets a new state of the art in multi-sequence generation latency and that demonstrates superior GPU utilization as well as quality of generations within a time budget. For example, for a 7.8B-size model on a single A100 GPU and with a batch size of 8, each sequence is generated at an average speed of 5.8 ms per token, with an overall throughput of 1.1K tokens per second. These results represent state-of-the-art latency and a 2.15X speed-up over optimized regular decoding. Within a time budget in which regular decoding does not finish, our system is able to generate sequences with a HumanEval Pass@First of 43%.
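To make the batched draft-then-verify idea concrete, below is a minimal, illustrative sketch of greedy batched speculative decoding in PyTorch. It is not the paper's BASS system (which relies on custom attention kernels and ragged-batch handling to let each sequence accept a different number of drafted tokens); the names `target`, `draft`, `draft_len`, and the HuggingFace-style `.logits` output are assumptions for the example, and acceptance is by exact greedy match rather than the paper's sampling-based scheme.

```python
import torch

def batched_speculative_decode(target, draft, input_ids, draft_len=4, max_new=128):
    """Illustrative batched draft-then-verify loop (greedy acceptance).

    Assumes `target` and `draft` are causal LMs returning `.logits` of shape
    (batch, seq_len, vocab). No KV caching, for clarity; a real system would
    cache both models' attention states.
    """
    seqs = input_ids  # (batch, prompt_len)
    generated = 0
    while generated < max_new:
        # 1) Draft model proposes `draft_len` tokens per sequence, greedily.
        draft_tokens, draft_in = [], seqs
        for _ in range(draft_len):
            logits = draft(draft_in).logits[:, -1, :]
            nxt = logits.argmax(dim=-1, keepdim=True)       # (batch, 1)
            draft_tokens.append(nxt)
            draft_in = torch.cat([draft_in, nxt], dim=-1)
        proposal = torch.cat(draft_tokens, dim=-1)          # (batch, draft_len)

        # 2) Target model scores all drafted positions in one forward pass.
        full = torch.cat([seqs, proposal], dim=-1)
        tgt_logits = target(full).logits
        # Target's prediction at each drafted position:
        tgt_pred = tgt_logits[:, -draft_len - 1:-1, :].argmax(dim=-1)

        # 3) Accept the longest matching prefix per sequence. Here the whole
        #    batch is conservatively truncated to the minimum accepted length
        #    to keep tensors rectangular; BASS instead handles per-sequence
        #    acceptance lengths (ragged batches) in its attention kernels.
        matches = (tgt_pred == proposal).long().cumprod(dim=-1)
        n_accept = int(matches.sum(dim=-1).min().item())

        # Always gain at least one token: the target's own next prediction.
        bonus = tgt_logits[:, seqs.size(1) + n_accept - 1, :].argmax(
            dim=-1, keepdim=True)
        seqs = torch.cat([seqs, proposal[:, :n_accept], bonus], dim=-1)
        generated += n_accept + 1
    return seqs
```

The batch-minimum acceptance rule above is exactly the inefficiency the paper targets: with rectangular tensors, one slow-to-accept sequence drags down the whole batch, which is why per-sequence (ragged) acceptance matters for multi-sequence latency.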
