Um ótimo artigo de como a prática continua sendo uma ótima professora em relação ao tratamento de métodos metaheurísticos.
Why is that a big deal? The biggest impediment in applying deep learning (or for that matter any S/E process) in product development is turnaround time. If I spend 1 week training my model and _then_ find it is a pile of shit, because I did not initialize something well or the architecture was missing something, that’s not good. For this reason, everyone I know wants to get the best GPUs or work on the biggest clusters — not just it lets them build more expressive networks but simply they’re super fast. So, any technique that improves experiment turnaround time is welcome!
The idea is ridiculously simple (perhaps why it is effective?): randomly skip layers while training. As a result you have a network that has expected depth really small, while the maximum depth can be in the order of 1000s. In effect, like dropout training, this creates an ensemble model from the 2^L2Lpossible networks for an LL-layer deep network.
Como o assunto está muito quente foi feito até um livro online (com códigos e tudo mais):
Neural Networks and Deep Learning is a free online book. The book will teach you about:
Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data
Deep learning, a powerful set of techniques for learning in neural networks
Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. This book will teach you the core concepts behind neural networks and deep learning.
The book is currently an incomplete beta draft. More chapters will be added over the coming months. For now, you can:
Read Chapter 1, which explains how neural networks can learn to recognize handwriting
Read Chapter 2, which explains backpropagation, the most important algorithm used to learn in neural networks.
Read Chapter 3, which explains many techniques which can be used to improve the performance of backpropagation.
Read Chapter 4, which explains why neural networks can compute any function.
Learn more about the approach taken in this book
Uma coisa interessante nesse artigo, foi que é um dos poucos que tem uma estratégia para Deep Learning não baseada em algoritmos, mas sim em indexação semântica.
Big Data Analytics and Deep Learning are two high-focus of data science. Big Data has become important as many organizations both public and private have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and un-categorized. In the present study, we explore how Deep Learning can be utilized for addressing some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks. We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future works by posing some questions, including defining data sampling criteria, domain adaptation modeling, defining criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.
Nessa entrevista do Arno Candel da H2o para o Kdnuggets, ele resume o que é o Deep Learning:
[…]Deep Learning methods use a composition of multiple non-linear transformations to model high-level abstractions in data. Multi-layer feed-forward artificial neural networks are some of the oldest and yet most useful such techniques. We are now reaping the benefits of over 60 years of evolution in Deep Learning that began in the late 1950s when the term Machine Learning was coined. Large parts of the growing success of Deep Learning in the past decade can be attributed to Moore’s law and the exponential speedup of computers, but there were also many algorithmic breakthroughs that enabled robust training of deep learners.
Compared to more interpretable Machine Learning techniques such as tree-based methods, conventional Deep Learning (using stochastic gradient descentand back-propagation) is a rather “brute-force” method that optimizes lots of coefficients (it is a parametric method) starting from random noise by continuously looking at examples from the training data. It follows the basic idea of “(good) practice makes perfect” (similar to a real brain) without any strong guarantees on the quality of the model. […]
Neste trecho ele fala de algumas aplicações de Deep Learning:
[…]Deep Learning is really effective at learning non-linear derived featuresfrom the raw input features, unlike standard Machine Learning methods such as linear or tree-based methods. For example, if age and income are the two features used to predict spending, then a linear model would greatly benefit from manually splitting age and income ranges into distinct groups; while a tree-based model would learn to automatically dissect the two-dimensional space.
A Deep Learning model builds hierarchies of (hidden) derived non-linear features that get composed to approximate arbitrary functions such as sqrt((age-40)^2+0.3*log(income+1)-4) with much less effort than with other methods. Traditionally, data scientists perform many of these transformations explicitly based on domain knowledge and experience, but Deep Learning has been shown to be extremely effective at coming up with those transformations, often outperforming standard Machine Learning models by a substantial margin.
Deep Learning is also very good at predicting high-cardinality class memberships, such as in image or voice recognition problems, or in predicting the best item to recommend to a user. Another strength of Deep Learning is that it can also be used for unsupervised learning where it just learns the intrinsic structure of the data without making predictions (remember the Google cat?). This is useful in cases where there are no training labels, or for various other use cases such as anomaly detection. […]
No Random Ponderings tem um overview (que já é quase uma introdução completa) sobre Redes Neurais Artificiais em Profundidade.
Esse tema não é relativamente novo, mas tem tomado um grande espaço na literatura atual devido ao fato que essa técnica vem solucionando importantes problemas ligados à classificação, principalmente em relação à aproximação por imagens.
Essas redes nada mais são do que redes em que os parâmetros de entrada na rede são codificados em que não é somente o atributo é modelado, mas sub-conjuntos de atributo-valor são introduzidos na rede; e posteriormente entre 20-60 camadas escondidas (camadas intermediárias) são introduzidas para treinamento da rede e ajuste do calculo dos pesos.
Abaixo algumas heurísticas de como treinar as RNA’s em Profundidade:
- Get the data: Make sure that you have a high-quality dataset of input-output examples that is large, representative, and has relatively clean labels. Learning is completely impossible without such a dataset.
- Preprocessing: it is essential to center the data so that its mean is zero and so that the variance of each of its dimensions is one. Sometimes, when the input dimension varies by orders of magnitude, it is better to take the log(1 + x) of that dimension. Basically, it’s important to find a faithful encoding of the input with zero mean and sensibly bounded dimensions. Doing so makes learning work much better. This is the case because the weights are updated by the formula: change in wij \propto xidL/dyj (w denotes the weights from layer x to layer y, and L is the loss function). If the average value of the x’s is large (say, 100), then the weight updates will be very large and correlated, which makes learning bad and slow. Keeping things zero-mean and with small variance simply makes everything work much better.
- Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed the massive computational gains provided by minibatches. But don’t use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smaller minibatch that runs efficiently on your machine.
- Gradient normalization: Divide the gradient by minibatch size. This is a good idea because of the following pleasant property: you won’t need to change the learning rate (not too much, anyway), if you double the minibatch size (or halve it).
- Learning rate schedule: Start with a normal-sized learning rate (LR) and reduce it towards the end.
- A typical value of the LR is 0.1. Amazingly, 0.1 is a good value of the learning rate for a large number of neural networks problems. Learning rates frequently tend to be smaller but rarely much larger.
- Use a validation set —- a subset of the training set on which we don’t train — to decide when to lower the learning rate and when to stop training (e.g., when error on the validation set starts to increase).
- A practical suggestion for a learning rate schedule: if you see that you stopped making progress on the validation set, divide the LR by 2 (or by 5), and keep going. Eventually, the LR will become very small, at which point you will stop your training. Doing so helps ensure that you won’t be (over-)fitting the training data at the detriment of validation performance, which happens easily and often. Also, lowering the LR is important, and the above recipe provides a useful approach to controlling via the validation set.
- But most importantly, worry about the Learning Rate. One useful idea used by some researchers (e.g., Alex Krizhevsky) is to monitor the ratio between the update norm and the weight norm. This ratio should be at around 10-3. If it is much smaller then learning will probably be too slow, and if it is much larger then learning will be unstable and will probably fail.
- Weight initialization. Worry about the random initialization of the weights at the start of learning.
- If you are lazy, it is usually enough to do something like 0.02 * randn(num_params). A value at this scale tends to work surprisingly well over many different problems. Of course, smaller (or larger) values are also worth trying.
- If it doesn’t work well (say your neural network architecture is unusual and/or very deep), then you should initialize each weight matrix with the init_scale / sqrt(layer_width) * randn. In this case init_scale should be set to 0.1 or 1, or something like that.
- Random initialization is super important for deep and recurrent nets. If you don’t get it right, then it’ll look like the network doesn’t learn anything at all. But we know that neural networks learn once the conditions are set.
- Fun story: researchers believed, for many years, that SGD cannot train deep neural networks from random initializations. Every time they would try it, it wouldn’t work. Embarrassingly, they did not succeed because they used the “small random weights” for the initialization, which works great for shallow nets but simply doesn’t work for deep nets at all. When the nets are deep, the many weight matrices all multiply each other, so the effect of a suboptimal scale is amplified.
- But if your net is shallow, you can afford to be less careful with the random initialization, since SGD will just find a way to fix it.
You’re now informed. Worry and care about your initialization. Try many different kinds of initialization. This effort will pay off. If the net doesn’t work at all (i.e., never “gets off the ground”), keep applying pressure to the random initialization. It’s the right thing to do.
- If you are training RNNs or LSTMs, use a hard constraint over the norm of the gradient (remember that the gradient has been divided by batch size). Something like 15 or 5 works well in practice in my own experiments. Take your gradient, divide it by the size of the minibatch, and check if its norm exceeds 15 (or 5). If it does, then shrink it until it is 15 (or 5). This one little trick plays a huge difference in the training of RNNs and LSTMs, where otherwise the exploding gradient can cause learning to fail and force you to use a puny learning rate like 1e-6 which is too small to be useful.
- Numerical gradient checking: If you are not using Theano or Torch, you’ll be probably implementing your own gradients. It is easy to make a mistake when we implement a gradient, so it is absolutely critical to use numerical gradient checking. Doing so will give you a complete peace of mind and confidence in your code. You will know that you can invest effort in tuning the hyperparameters (such as the learning rate and the initialization) and be sure that your efforts are channeled in the right direction.
- If you are using LSTMs and you want to train them on problems with very long range dependencies, you should initialize the biases of the forget gates of the LSTMs to large values. By default, the forget gates are the sigmoids of their total input, and when the weights are small, the forget gate is set to 0.5, which is adequate for some but not all problems. This is the one non-obvious caveat about the initialization of the LSTM.
- Data augmentation: be creative, and find ways to algorithmically increase the number of training cases that are in your disposal. If you have images, then you should translate and rotate them; if you have speech, you should combine clean speech with all types of random noise; etc. Data augmentation is an art (unless you’re dealing with images). Use common sense.
- Dropout. Dropout provides an easy way to improve performance. It’s trivial to implement and there’s little reason to not do it. Remember to tune the dropout probability, and to not forget to turn off Dropout and to multiply the weights by (namely by 1-dropout probability) at test time. Also, be sure to train the network for longer. Unlike normal training, where the validation error often starts increasing after prolonged training, dropout nets keep getting better and better the longer you train them. So be patient.
- Ensembling. Train 10 neural networks and average their predictions. It’s a fairly trivial technique that results in easy, sizeable performance improvements. One may be mystified as to why averaging helps so much, but there is a simple reason for the effectiveness of averaging. Suppose that two classifiers have an error rate of 70%. Then, when they agree they are right. But when they disagree, one of them is often right, so now the average prediction will place much more weight on the correct answer. The effect will be especially strong whenever the network is confident when it’s right and unconfident when it’s wrong.