T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives
Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices, which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus hide, this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serialize communication with model execution. One approach to hide this serialized communication is to interleave it with the producer operation (of the communicated data) in a fine-grained manner. However, this fine-grained interleaving of communication and computation in software can be difficult. Furthermore, as with any concurrent execution, it requires compute and memory resources to be shared between computation and communication, causing resource contention that reduces overlapping efficacy.

To overcome these challenges, we propose T3, which applies hardware-software co-design to transparently overlap serialized communication while minimizing resource contention with compute. T3 transparently fuses producer operations with the subsequent communication via a simple configuration of the producer's output address space, requiring only minor software changes. At the hardware level, T3 adds a lightweight track-and-trigger mechanism to orchestrate the producer's compute and communication. It further uses compute-enhanced memories for communication's attendant compute. As a result, T3 reduces resource contention and efficiently overlaps serialized communication with computation. For important Transformer models like T-NLG, T3 speeds up communication-heavy sublayers by 30%.
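To make the baseline idea concrete, below is a minimal sketch of the software-level fine-grained interleaving the abstract refers to: a producer matrix multiply whose output all-reduce can begin chunk by chunk instead of waiting for the whole result. This is only an illustration under assumed names (`all_reduce`, `n_chunks`, the peer inputs) and does not represent T3's actual hardware track-and-trigger mechanism, which performs this orchestration transparently without such software changes.

```python
import numpy as np

def all_reduce(chunk, peer_chunks):
    """Stand-in for a collective: sum the local chunk with peers' chunks."""
    return chunk + sum(peer_chunks)

def gemm_then_allreduce_coarse(A, B, peers):
    """Baseline: the full GEMM finishes before any communication starts,
    so the all-reduce is fully serialized behind the producer."""
    C = A @ B
    return all_reduce(C, [p @ B for p in peers])

def gemm_then_allreduce_fine(A, B, peers, n_chunks=4):
    """Fine-grained interleaving: as each row-chunk of the GEMM output is
    produced, its all-reduce can start, so communication for earlier chunks
    can overlap compute of later ones (shown sequentially here for clarity)."""
    out_chunks = []
    for rows in np.array_split(np.arange(A.shape[0]), n_chunks):
        c = A[rows] @ B                            # producer computes one chunk
        peer_c = [p[rows] @ B for p in peers]
        out_chunks.append(all_reduce(c, peer_c))   # communication for that chunk
    return np.vstack(out_chunks)

# Both schedules produce the same result; only the overlap opportunity differs.
A = np.random.rand(8, 4)
B = np.random.rand(4, 4)
peers = [np.random.rand(8, 4) for _ in range(3)]
assert np.allclose(gemm_then_allreduce_coarse(A, B, peers),
                   gemm_then_allreduce_fine(A, B, peers))
```

Doing this interleaving explicitly in software is what the abstract calls difficult, and it still shares compute and memory resources between the GEMM and the collective; T3's contribution is to trigger the per-chunk communication in hardware and offload its attendant compute to compute-enhanced memories.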
Further reading
- Access Paper in arXiv.org