Why is that a big deal? The biggest impediment in applying deep learning (or for that matter any S/E process) in product development is turnaround time. If I spend 1 week training my model and _then_ find it is a pile of shit, because I did not initialize something well or the architecture was missing something, that’s not good. For this reason, everyone I know wants to get the best GPUs or work on the biggest clusters — not just it lets them build more expressive networks but simply they’re super fast. So, any technique that improves experiment turnaround time is welcome!
The idea is ridiculously simple (perhaps why it is effective?): randomly skip layers while training. As a result you have a network that has expected depth really small, while the maximum depth can be in the order of 1000s. In effect, like dropout training, this creates an ensemble model from the 2^L2Lpossible networks for an LL-layer deep network.