Flash Attention 1 & 2
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Paper (arXiv, 2022): https://arxiv.org/abs/2205.14135
Background: compute-bound and memory-bound
- The throughput of transformer layers can be either compute-bound or memory-bound. The higher an operation's arithmetic intensity (FLOPs performed per byte of memory traffic), the more likely it is to be compute-bound; operations with low arithmetic intensity tend to be memory-bound.
- Compute-bound operators, such as large matrix multiplications, spend more time on computation than on data movement. Memory-bound operators behave the opposite way; they include element-wise operators and reductions, e.g., sum, softmax, batch norm, etc. A rough arithmetic-intensity estimate for both cases is sketched below.
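As a rough illustration (not from the paper), the sketch below estimates arithmetic intensity for a matmul versus an element-wise operation, assuming fp16 tensors (2 bytes per element), ideal memory traffic (each operand read or written exactly once), and the usual 2·m·n·k FLOP count for matmul; the specific sizes and helper names are illustrative only.

```python
# Minimal sketch: arithmetic intensity (FLOPs per byte moved) under idealized
# assumptions -- fp16 operands, each tensor touched exactly once in DRAM.

def matmul_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """C[m,n] = A[m,k] @ B[k,n]: ~2*m*n*k FLOPs; read A and B, write C."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def elementwise_intensity(n_elems: int, flops_per_elem: int = 1,
                          bytes_per_elem: int = 2) -> float:
    """y = f(x) applied per element: one read and one write per element."""
    flops = flops_per_elem * n_elems
    bytes_moved = 2 * bytes_per_elem * n_elems
    return flops / bytes_moved

if __name__ == "__main__":
    # Large matmul: intensity in the hundreds of FLOPs/byte -> compute-bound.
    print(f"matmul 4096^3:  {matmul_intensity(4096, 4096, 4096):.1f} FLOPs/byte")
    # Element-wise op of the same output size: well under 1 FLOP/byte -> memory-bound.
    print(f"element-wise:   {elementwise_intensity(4096 * 4096):.2f} FLOPs/byte")
```

Comparing these numbers against a GPU's ratio of peak FLOPs to memory bandwidth (the roofline "ridge point") indicates which regime an operator falls into.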