PlaidML: An open source portable deep learning engine


We’re pleased to announce the next step towards deep learning for every device and platform. Today Vertex.AI is releasing PlaidML, our open source portable deep learning engine. Our mission is to make deep learning accessible to every person on every device, and we’re building PlaidML to help make that a reality. We’re starting by supporting the most popular hardware and software already in the hands of developers, researchers, and students. The initial version of PlaidML runs on most existing PC hardware with OpenCL-capable GPUs from NVIDIA, AMD, or Intel. Additionally, we’re including support for running the widely popular Keras framework on top of Plaid to allow existing code and tutorials to run unchanged.

PlaidML: An open source portable deep learning engine

Baidu are bringing HPC Techniques to Deep Learning

Via Baidu Research Blog.

The Ring all-reduce approach came to save a lot of work when training deep neural networks. The approach to propagate and update the gradients (and control the convergence of the model) are well explained below:

The Ring Allreduce

The main issue with the simplistic communication strategy described above was that the communication cost grew linearly with the number of GPUs in the system. In contrast, a ring allreduce is an algorithm for which the communication cost is constant and independent of the number of GPUs in the system, and is determined solely by the slowest connection between GPUs in the system; in fact, if you only consider bandwidth as a factor in your communication cost (and ignore latency), the ring allreduce is an optimal communication algorithm [4]. (This is a good estimate for communication cost when your model is large, and you need to send large amounts of data a small number of times.)

The GPUs in a ring allreduce are arranged in a logical ring. Each GPU should have a left neighbor and a right neighbor; it will only ever send data to its right neighbor, and receive data from its left neighbor.

The algorithm proceeds in two steps: first, a scatter-reduce, and then, an allgather. In the scatter-reduce step, the GPUs will exchange data such that every GPU ends up with a chunk of the final result. In the allgather step, the GPUs will exchange those chunks such that all GPUs end up with the complete final result.

More can be found here. To implement there’s a Github project called Tensor All-reduce that can be used for distributed deep learning.

Baidu are bringing HPC Techniques to Deep Learning

Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow

Via Uber Engineering Blog.

Yet, this is another tool for Deep Learning, but I think that those guys hit the nail exposing and fixing one of the major concerns about Tensor Flow that is distributed training.

When Uber needed to use Deep Learning they found some endeavors to use the conventional Data Parallelism architecture.  Using Data Parallelism arch they can distribute the training using several instances in parallel and when the gradients for every batch are calculated in each instance (node/worker) these gradients are propagated for all nodes and averaged to control the convergence (update) of the model in the training phase. The following image explains better than words.

But using this architecture Uber faced two problems that were a) the right ratio of worker to parameter servers (to avoid/deal with network and processing bottleneck) and b) the complexity of TensorFlow code (more details here).

To avoid these problems they used an idea of a 2009 paper  “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations” called Ring all-reduce.

They explain the workflow of this approach:

In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.

The implementations details can be found here.



Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow

Tensorflow sucks (?)

This post of Nico’s blog has good points about why Pytorch even without all Google support and money is taking out users from Tensor Flow.

[…]The most interesting question to me is why Google chose a purely declarative paradigm for Tensorflow in spite of the obvious downsides of this approach. Did they feel that encapsulating all the computation in a single computation graph would simplify executing models on their TPU’s so they can cut Nvidia out of the millions of dollars to be made from cloud hosting of deep learning powered applications? It’s difficult to say. Overall, Tensorflow does not feel like a pure open source project for the common good. Which I would have no problem with, had their design been sound. In comparison with beautiful Google open source projects out there such as Protobuf, Golang, and Kubernetes, Tensorflow falls dramatically short.

While declarative paradigms are great for UI programming, there are many reasons why it is a problematic choice for deep learning.

Take the React Javascript library as an example, the standard choice today for interactive web applications. In React, the complexity of how data flows through the application makes sense to be hidden from the developer, since Javascript execution is generally orders of magnitudes faster than updates to the DOM. React developers don’t want to worry about the mechanics of how state is propagated, so long as the end user experience is “good enough”.

On the other hand, in deep learning, a single layer can literally execute billions of FLOP’s! And deep learning researchers care very much about the mechanics of how computation is done and want fine control because they are constantly pushing the edge of what’s possible (e.g. dynamic networks) and want easy access to intermediate results.[…]

Tensorflow sucks (?)