Optimization for Deep Learning Algorithms: A Review

ABSTRACT: In the past few years, deep learning has received considerable attention in the field of artificial intelligence. This paper reviews three focus areas of learning methods in deep learning, namely supervised, unsupervised and reinforcement learning. These learning methods are used to implement deep and convolutional neural networks. They offer a unified computational approach, flexibility and scalability. The computational model implemented by deep learning is used to learn data representations with multiple levels of abstraction. Furthermore, deep learning has advanced the state of the art in domains such as genomics, where it can be applied in pathway analysis for modelling biological networks. Thus, the extraction of biochemical production can be improved by using deep learning. This review also covers the implementation of optimization in terms of meta-heuristic methods, which are used in machine learning as part of the modelling methods.
This review discussed deep learning techniques that implement multiple levels of abstraction in feature representation. Deep learning can be characterized as a rebranding of artificial neural networks. These learning methods have attracted great interest among researchers because they give better representations and make tasks easier to learn. Even so, deep learning has some issues: it easily gets stuck at local optima and it is computationally expensive. The DeepBind algorithm shows that deep learning can contribute to genomics studies, where the aim is a high level of performance in predicting protein binding affinity. The optimization methods discussed consist of several meta-heuristic methods that can be categorized as evolutionary algorithms. The application of these techniques, including CRO, shows the diversity of optimization algorithms available to improve the analysis of modelling techniques. Furthermore, these methods are able to solve problems that arise in conventional neural networks, as they find high-quality solutions in a given search space. The application of optimization methods enables the extraction of biochemical production from metabolic pathways. Deep learning gives a good advantage in biochemical production, as it allows high-level abstraction in cellular biological networks. Thus, the use of CRO can mitigate the problems that arise in deep learning, namely getting stuck at local optima and being computationally expensive, because CRO uses global search over the search space to identify the global minimum. It will therefore improve the training process by refining the network weights to minimize the error.

Gradient descent revisited via an adaptive online learning rate

The learning rate is one of the most misunderstood concepts in machine learning, and a lot of money is spent on Machine Learning as a Service (MLaaS) because this parameter, which controls convergence, is poorly optimized.

Abstract: Any gradient descent optimization requires choosing a learning rate. With deeper and deeper models, tuning that learning rate can easily become tedious and does not necessarily lead to an ideal convergence. We propose a variation of the gradient descent algorithm in which the learning rate η is not fixed. Instead, we learn η itself, either by another gradient descent (first-order method), or by Newton’s method (second-order). This way, gradient descent for any machine learning algorithm can be optimized.
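
The abstract does not spell out the update rules, so the following is only one plausible way to write them down, not necessarily the authors' exact formulation. It assumes a loss L, parameters θ_t, a one-dimensional function φ of the learning rate, and a hypothetical meta step size α, none of which are named in the original text.

```latex
% Loss after one gradient step, seen as a function of the learning rate only
\phi_t(\eta) = L\bigl(\theta_t - \eta \,\nabla L(\theta_t)\bigr)

% Its derivative, by the chain rule
\phi_t'(\eta) = -\,\nabla L(\theta_t)^{\top}\,
                \nabla L\bigl(\theta_t - \eta \,\nabla L(\theta_t)\bigr)

% First-order variant: gradient descent on eta (alpha is an assumed meta step size)
\eta_{t+1} = \eta_t - \alpha \,\phi_t'(\eta_t)

% Second-order variant: Newton's method on eta
\eta_{t+1} = \eta_t - \frac{\phi_t'(\eta_t)}{\phi_t''(\eta_t)}

% Parameter update with the freshly learned rate
\theta_{t+1} = \theta_t - \eta_{t+1}\,\nabla L(\theta_t)
```

Under this reading, the first-order variant still needs its own step size α, while the Newton variant has no such extra parameter but requires φ'' to be positive.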

Conclusion: In this paper, we have built a new way to learn the learning rate at each step using finite differences on the loss. We have tested it on a variety of convex and non-convex optimization tasks. Based on our results, we believe that our method would be able to adapt to a good learning rate at every iteration on convex problems. In the case of non-convex problems, we repeatedly observed faster training in the first few epochs. However, our adaptive model seems more inclined to overfit the training data, even though its test accuracy is always comparable to standard SGD performance, if not slightly better. Hence we believe that in neural network architectures, our model can be used for pretraining during the first few epochs, after which training can continue with any other standard optimization technique; this may lead to faster convergence, be computationally more efficient, and perhaps reach a new highest accuracy on the given problem. Moreover, the learning rate that our algorithm converges to suggests an ideal learning rate for the given training task. One could use our method to tune the learning rate of a standard neural network (using Adam for instance), giving a more precise value than with line search or random search.
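
As a rough illustration of learning the learning rate "using finite differences on the loss", here is a minimal sketch on a toy convex problem. It is not the authors' code: the quadratic loss, the function names, the finite-difference width `eps` and the starting values are all assumptions made for this example. At each step it treats the loss after the update as a function of η alone, estimates its first and second derivatives by central finite differences, and takes one Newton step on η.

```python
def loss(theta):
    # Toy convex quadratic with different curvature per coordinate (assumed).
    return 0.5 * (theta[0] ** 2 + 10.0 * theta[1] ** 2)


def grad(theta):
    # Exact gradient of the toy loss above.
    return [theta[0], 10.0 * theta[1]]


def take_step(theta, g, eta):
    # Plain gradient descent update: theta - eta * g.
    return [t - eta * gi for t, gi in zip(theta, g)]


def train(theta, eta=0.01, eps=1e-4, iters=50, adapt=True):
    """Gradient descent; if adapt is True, eta is re-learned at every step."""
    history = [loss(theta)]
    for _ in range(iters):
        g = grad(theta)

        def phi(e):
            # Loss reached after a step of size e along -g.
            return loss(take_step(theta, g, e))

        if adapt:
            # Central finite differences for phi'(eta) and phi''(eta).
            d1 = (phi(eta + eps) - phi(eta - eps)) / (2.0 * eps)
            d2 = (phi(eta + eps) - 2.0 * phi(eta) + phi(eta - eps)) / eps ** 2
            if d2 > 1e-12:  # only trust Newton where phi is locally convex
                eta = eta - d1 / d2
        theta = take_step(theta, g, eta)
        history.append(loss(theta))
    return theta, eta, history


if __name__ == "__main__":
    _, eta_learned, adaptive = train([10.0, 1.0], adapt=True)
    _, _, fixed = train([10.0, 1.0], adapt=False)
    print("learned eta:", eta_learned)
    print("final loss, adaptive:", adaptive[-1])
    print("final loss, fixed eta=0.01:", fixed[-1])
```

On a quadratic, the Newton step on η amounts to an exact line search, which is why the adaptive run does well here. On non-convex losses the second derivative can be negative or noisy, so the guard on `d2` is a simplification rather than a robust safeguard.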

What is the difference between Gradient Descent and Stochastic Gradient Descent?

Here on Quora, the simplest answer ever written in the history of the world:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

While in GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE training sample from your training set to do the update for a parameter in a particular iteration.

Thus, if the number of training samples is large, in fact very large, then gradient descent may take too long, because every iteration that updates the parameter values runs through the complete training set. SGD, on the other hand, will be faster because it uses only one training sample per update and starts improving right away from the first sample.

SGD often converges much faster than GD, but the error function is not as well minimized as with GD. In most cases, the close approximation of the parameter values that SGD gives is enough, because they reach the neighborhood of the optimal values and keep oscillating there.
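
To make the contrast concrete, here is a small sketch that fits a one-parameter linear model `y ≈ w * x` both ways. The synthetic data, the learning rate and all function names are assumptions chosen for illustration; they do not come from the Quora answer.

```python
import random


def make_data(n=200, seed=0):
    # Synthetic data: y = 3x plus a little Gaussian noise (assumed setup).
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    ys = [3.0 * x + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def mse(w, xs, ys):
    # Mean squared error of the model y = w * x.
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gd_epoch(w, xs, ys, lr):
    # Batch GD: average the gradient over ALL samples, then update once.
    g = sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * g


def sgd_epoch(w, xs, ys, lr, rng):
    # SGD: update after EVERY single sample, visited in random order.
    order = list(range(len(xs)))
    rng.shuffle(order)
    for i in order:
        w -= lr * 2.0 * (w * xs[i] - ys[i]) * xs[i]
    return w


if __name__ == "__main__":
    xs, ys = make_data()
    w_gd = gd_epoch(0.0, xs, ys, lr=0.1)
    w_sgd = sgd_epoch(0.0, xs, ys, lr=0.1, rng=random.Random(1))
    print("after one pass  GD: w =", w_gd, " MSE =", mse(w_gd, xs, ys))
    print("after one pass SGD: w =", w_sgd, " MSE =", mse(w_sgd, xs, ys))
```

One pass over the 200 samples gives GD a single update and SGD 200 updates, which is the speed advantage described above. If both are run for many more passes, GD settles on the least-squares value of `w`, while SGD with a constant learning rate keeps jittering around it.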
