Via Uber Engineering Blog.
Yet, this is another tool for Deep Learning, but I think that those guys hit the nail exposing and fixing one of the major concerns about Tensor Flow that is distributed training.
When Uber needed to use Deep Learning they found some endeavors to use the conventional Data Parallelism architecture. Using Data Parallelism arch they can distribute the training using several instances in parallel and when the gradients for every batch are calculated in each instance (node/worker) these gradients are propagated for all nodes and averaged to control the convergence (update) of the model in the training phase. The following image explains better than words.
But using this architecture Uber faced two problems that were a) the right ratio of worker to parameter servers (to avoid/deal with network and processing bottleneck) and b) the complexity of TensorFlow code (more details here).
To avoid these problems they used an idea of a 2009 paper “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations” called Ring all-reduce.
They explain the workflow of this approach:
In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
The implementations details can be found here.