Transformers, parallel computation, and logarithmic depth

We show that a constant number of self-attention layers can efficiently simulate, and be simulated by, a constant number of communication rounds of Massively Parallel Computation. As a consequence, we show that logarithmic depth is sufficient for transformers to solve basic computational tasks that cannot be efficiently solved by several other neural sequence models and sub-quadratic transformer approximations. We thus establish parallelism as a key distinguishing property of transformers.
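
To build intuition for why a logarithmic round or depth budget goes a long way in a parallel model, the sketch below (not the paper's construction; the function name parallel_rounds_reduce is hypothetical) shows a standard tree-style reduction: combining n values with an associative operation takes only ceil(log2 n) rounds, because each round halves the number of active values and all pairwise combinations within a round are independent.

```python
# Intuition-only sketch: an associative reduction over n values finishes in
# O(log n) parallel rounds. This is the flavor of round/depth budget that the
# MPC-simulation viewpoint lets a logarithmic-depth transformer exploit.
import math
import operator

def parallel_rounds_reduce(values, op):
    """Combine `values` with associative `op` using a balanced tree of rounds.

    Each round halves the number of active values, so an input of length n
    needs only ceil(log2(n)) rounds; the pairwise ops within a round are
    independent and could run in parallel (run sequentially here for clarity).
    """
    vals = list(values)
    rounds = 0
    while len(vals) > 1:
        # Pair up neighbors; an odd leftover value passes through unchanged.
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:
            nxt.append(vals[-1])
        vals = nxt
        rounds += 1
    return vals[0], rounds

if __name__ == "__main__":
    n = 1024
    total, rounds = parallel_rounds_reduce(range(n), operator.add)
    assert total == sum(range(n))
    assert rounds == math.ceil(math.log2(n))  # 10 rounds for n = 1024
    print(f"sum of {n} values computed in {rounds} parallel rounds")
```

A strictly sequential model would need on the order of n steps for the same reduction, which is the contrast the abstract draws between transformers and other neural sequence models.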

Further reading