Multi-objective Architecture Search for CNNs

Good ideas to perform an architecture search in CNN/DL.



Abstract: Architecture search aims at automatically finding neural architectures that are competitive with architectures designed by human experts. While recent approaches have come close to matching the predictive performance of manually designed architectures for image recognition, these approaches are problematic under constrained resources for two reasons: first, the architecture search itself requires vast computational resources for most proposed methods; second, the found neural architectures are solely optimized for high predictive performance without penalizing excessive resource consumption. We address the first shortcoming by proposing NASH, an architecture search which considerably reduces the computational resources required for training novel architectures by applying network morphisms and aggressive learning rate schedules. On CIFAR-10, NASH finds architectures with errors below 4% in only 3 days. We address the second shortcoming by proposing Pareto-NASH, a method for multi-objective architecture search that allows approximating the Pareto front of architectures under multiple objectives, such as predictive performance and number of parameters, in a single run of the method. Within 56 GPU days of architecture search, Pareto-NASH finds a model with 4M parameters and test error of 3.5%, as well as a model with fewer than 1M parameters and test error of 4.6%.

Conclusion: We proposed NASH, a simple and fast method for automated architecture search based on a hill climbing strategy, network morphisms, and training via SGDR. Experiments on CIFAR-10 showed that our method yields competitive results while requiring considerably fewer computational resources for architecture search than most alternative approaches. However, in most practical applications, not only predictive performance but also resource consumption plays an important role. To address this, we proposed Pareto-NASH, a multi-objective architecture search method that employs additional operators for shrinking models and extends NASH’s hill climbing strategy to an evolutionary algorithm. Pareto-NASH is designed to exploit the fact that evaluating the performance of a neural network is orders of magnitude more expensive than evaluating, e.g., the model’s size. Experiments on CIFAR-10 showed that Pareto-NASH is able to find competitive models in terms of both predictive performance and resource efficiency.
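
To make the notion of a Pareto front concrete, here is a minimal Python sketch (not the authors' code) that filters a list of candidate models down to the non-dominated ones when both test error and parameter count should be small. The first two entries echo the figures quoted in the abstract; the value 0.9 standing in for "fewer than 1M parameters" and the third, dominated entry are made up for illustration, as are all function and variable names.

```python
from typing import List, Tuple

# (name, test error in %, number of parameters in millions); lower is better for both.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """Return True if a is no worse than b on both objectives and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]


models = [
    ("large", 3.5, 4.0),      # 4M parameters, 3.5% error (from the abstract)
    ("small", 4.6, 0.9),      # "fewer than 1M" parameters; 0.9 is an assumed value
    ("dominated", 5.0, 4.5),  # hypothetical model that is worse on both objectives
]
front = pareto_front(models)
print([name for name, _, _ in front])
```

Neither of the first two models dominates the other, which is why a multi-objective search reports both instead of a single winner.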


Driver behavior profiling: An investigation with different smartphone sensors and machine learning


Abstract: Driver behavior impacts traffic safety, fuel/energy consumption and gas emissions. Driver behavior profiling tries to understand and positively impact driver behavior. Usually driver behavior profiling tasks involve automated collection of driving data and application of computer models to generate a classification that characterizes the driver aggressiveness profile. Different sensors and classification methods have been employed in this task; however, low-cost solutions and high performance are still research targets. This paper presents an investigation with different Android smartphone sensors and classification algorithms in order to assess which sensor/method assembly enables classification with higher performance. The results show that specific combinations of sensors and intelligent methods allow classification performance improvement.
Results: We executed all combinations of the 4 MLAs and their configurations described in Table 1 over the 15 data sets described in Section 4.3 using 5 different nf values. We trained, tested, and assessed every evaluation assembly with 15 different random seeds. Finally, we calculated the mean AUC for these executions, grouped them by driving event type, and ranked the 5 best performing assemblies in the boxplot displayed in Fig 6. This figure shows the driving events on the left-hand side and the 5 best evaluation assemblies for each event on the right-hand side, with the best ones at the bottom. The assembly text identification in Fig 6 encodes, in this order: (i) the nf value; (ii) the sensor and its axis (if there is no axis indication, then all sensor axes are used); and (iii) the MLA and its configuration identifier.
Conclusions and future work: In this work we presented a quantitative evaluation of the performances of 4 MLAs (BN, MLP, RF, and SVM) with different configurations applied in the detection of 7 driving event types using data collected from 4 Android smartphone sensors (accelerometer, linear acceleration, magnetometer, and gyroscope). We collected 69 samples of these event types in a real-world experiment with 2 drivers. The start and end times of these events were recorded to serve as the experiment ground truth. We also compared the performances when applying different sliding time window sizes.
We performed 15 executions with different random seeds of 3865 evaluation assemblies of the form EA = {1:sensor, 2:sensor axis(es), 3:MLA, 4:MLA configuration, 5:number of frames in sliding window}. As a result, we found the top 5 performing assemblies for each driving event type. In the context of our experiment, these results show that (i) bigger window sizes perform better; (ii) the gyroscope and the accelerometer are the best sensors to detect our driving events; (iii) as a general rule, using all sensor axes performs better than using a single one, except for aggressive left turn events; (iv) RF is by far the best performing MLA, followed by MLP; and (v) the performance of the top 35 combinations is both satisfactory and equivalent, varying from 0.980 to 0.999 mean AUC values.
As future work, we expect to collect a greater number of driving event samples using different vehicles, Android smartphone models, road conditions, weather, and temperature. We also expect to add more MLAs to our evaluation, including those based on fuzzy logic and DTW. Finally, we intend to use the best evaluation assemblies observed in this work to develop an Android smartphone application which can detect driving events in real time and calculate the driver behavior profile.
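
The role of the sliding-window size nf can be illustrated with a small Python sketch. It is only an assumption about how such windows might be summarized: the text above does not say which statistics the paper computes per window, so the mean and standard deviation used here, the function name, and the sample readings are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence, Tuple


def window_features(samples: Sequence[float], nf: int) -> List[Tuple[float, float]]:
    """Summarize each run of nf consecutive samples by its mean and standard deviation.

    The choice of summary statistics is an assumption made for illustration.
    """
    if nf < 1:
        raise ValueError("nf must be at least 1")
    features = []
    for end in range(nf, len(samples) + 1):
        window = samples[end - nf:end]
        features.append((mean(window), pstdev(window)))
    return features


# Hypothetical readings from one gyroscope axis.
gyro_z = [0.0, 0.1, 0.4, 0.9, 0.3, 0.0]
feats = window_features(gyro_z, nf=4)
print(len(feats), feats[0])
```

A larger nf gives each feature vector more context about the maneuver, which is one plausible reading of the finding that bigger window sizes perform better.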

Study of Engineered Features and Learning Features in Machine Learning – A Case Study in Document Classification


Abstract: Document classification is challenging because it must handle voluminous and highly non-linear data, generated exponentially in the era of digitization. Proper representation of documents increases the efficiency and performance of classification, the ultimate goal being the retrieval of information from a large corpus. Deep neural network models learn features for document classification, unlike engineered-feature-based approaches, where features are extracted or selected from the data. In this paper we investigate the performance of different classifiers based on the features obtained using the two approaches. We apply a deep autoencoder for learning features, while engineered features are extracted by exploiting semantic association within the terms of the documents. Experimentally, it has been observed that learned-feature-based classification always performs better than the proposed engineered-feature-based classifiers.

Conclusion and Future Work: In this paper we emphasize the importance of feature representation for classification. The potential of deep learning in the feature extraction process for efficient compression and representation of raw features is explored. By conducting multiple experiments we deduce that a DBN – Deep AE feature extractor and a DNNC outperform most other techniques, providing a trade-off between accuracy and execution time. In this paper we have dealt with the most significant feature extraction and classification techniques for text documents where each text document belongs to a single class label. With the explosion of digital information, a large number of documents may belong to multiple class labels, the handling of which is a new challenge and within the scope of future work. Word2vec models [18] in association with Recurrent Neural Networks (RNN) [4,14] have recently started gaining popularity in the feature representation domain. We would like to compare their performance with our deep learning method in the future. Similar feature extraction techniques can also be applied to image data to generate compressed features which can facilitate efficient classification. We would also like to explore such possibilities in our future work.


Machine Learning Methods to Predict Diabetes Complications


Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical

Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.


Reliability Estimation of Individual Multi-target Regression Predictions


Abstract: To estimate the quality of the induced predictive model we generally use measures of averaged prediction accuracy, such as the relative mean squared error on test data. Such evaluation fails to provide local information about reliability of individual predictions, which can be important in risk-sensitive fields (medicine, finance, industry etc.). Related work presented several ways for computing individual prediction reliability estimates for single-target regression models, but has not considered their use with multi-target regression models that predict a vector of independent target variables. In this paper we adapt the existing single-target reliability estimates to multi-target models. In this way we try to design reliability estimates for multi-target regression algorithms as well, which can estimate the prediction errors without knowing the true prediction errors. We approach this in two ways: by aggregating reliability estimates for individual target components, and by generalizing the existing reliability estimates to a higher number of dimensions. The results revealed favorable performance of the reliability estimates that are based on bagging variance and local cross-validation approaches. The results are consistent with the related work in single-target reliability estimates and provide support for multi-target decision making.

Conclusion: In the paper we proposed several approaches for estimating the reliabilities of individual multi-target regression predictions. The aggregated variants (AM, l2 and +) produce a single-valued estimate which is preferable for interpretation and comparison. The last variant (+) is a direct generalization of the single-target estimators from the related work. Our evaluation showed that the best results were achieved using the BAGV and the LCV reliability estimates regardless of the estimate variant. This complies with the related work on single-target predictions, where these two estimates also performed well. Although all of the proposed variants achieve comparable results, our proposed generalization of existing methods (+) is still the preferred variant due to its lower computational complexity (as estimates are only calculated once for all of the target attributes) and its solid theoretical background. In our further work we intend to additionally evaluate other reliability estimates in combination with several other regression models. We also plan to test the adaptation of the proposed methods to multi-target classification. Reliability estimation of individual predictions offers many advantages, especially when making decisions in highly sensitive environments. Our work provides effective support for model-independent multi-target regression.
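
As a rough illustration of the aggregation idea, the Python sketch below computes a bagging-variance estimate per target for one test example and then collapses the per-target values into a single number. It rests on assumptions the text above does not confirm: BAGV is taken to be the variance of the bagged models' predictions, AM is read as the arithmetic mean across targets, and l2 as the Euclidean norm. The (+) variant is not shown, and the prediction values are invented.

```python
from math import sqrt
from statistics import mean, pvariance
from typing import List, Sequence


def bagv_per_target(bagged_predictions: Sequence[Sequence[float]]) -> List[float]:
    """Variance of the bagged models' predictions, computed separately for each target.

    bagged_predictions[i][t] is the prediction of bagged model i for target t
    of a single test example.
    """
    n_targets = len(bagged_predictions[0])
    return [pvariance([pred[t] for pred in bagged_predictions]) for t in range(n_targets)]


def aggregate_am(estimates: Sequence[float]) -> float:
    """Arithmetic mean of the per-target estimates (assumed meaning of AM)."""
    return mean(estimates)


def aggregate_l2(estimates: Sequence[float]) -> float:
    """Euclidean norm of the per-target estimates (assumed meaning of l2)."""
    return sqrt(sum(e * e for e in estimates))


# Three hypothetical bagged models, two targets, one test example.
predictions = [[1.0, 10.0], [1.2, 10.0], [0.8, 10.0]]
per_target = bagv_per_target(predictions)
print(per_target, aggregate_am(per_target), aggregate_l2(per_target))
```

The models disagree only on the first target, so only that component contributes to the aggregated estimate; a larger value signals a less reliable prediction.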


The Impact of Random Models on Clustering Similarity

Abstract: Clustering is a central approach for unsupervised learning. After clustering is applied, the most fundamental analysis is to quantitatively compare clusterings. Such comparisons are crucial for the evaluation of clustering methods as well as other tasks such as consensus clustering. It is often argued that, in order to establish a baseline, clustering similarity should be assessed in the context of a random ensemble of clusterings. The prevailing assumption for the random clustering ensemble is the permutation model in which the number and sizes of clusters are fixed. However, this assumption does not necessarily hold in practice; for example, multiple runs of K-means clustering return clusterings with a fixed number of clusters, while the cluster size distribution varies greatly. Here, we derive corrected variants of two clustering similarity measures (the Rand index and Mutual Information) in the context of two random clustering ensembles in which the number and sizes of clusters vary. In addition, we study the impact of one-sided comparisons in the scenario with a reference clustering. The consequences of different random models are illustrated using synthetic examples, handwriting recognition, and gene expression data. We demonstrate that the choice of random model can have a drastic impact on the ranking of similar clustering pairs, and the evaluation of a clustering method with respect to a random baseline; thus, the choice of random clustering model should be carefully justified.
Discussion: Given the prevalence of clustering methods for analyzing data, clustering comparison is a fundamental problem that is pertinent to numerous areas of science. In particular, the correction of clustering similarity for chance serves to establish a baseline that facilitates comparisons between different clustering solutions. Expanding previous studies on the selection of an appropriate model for random clusterings (Meila, 2005; Vinh et al., 2009; Romano et al., 2016), our work provides an extensive summary of random models and clearly demonstrates the strong impact of the random model on the interpretation of clustering results.
Our results underpin the importance of selecting the appropriate random model for a given context. To that end, we offer the following guidelines: 1. Consider what is fixed by the clustering method: do all clusterings have a user-specified number of clusters (use Mnum), or is the cluster size sequence fixed (use Mperm)? 2. Is the comparison against a reference clustering (use a one-sided comparison), or are you comparing two derived clusterings (then use a two-sided comparison)? The specific comparisons studied here are not meant to establish the superiority of a particular clustering identification technique or a specific random clustering model; rather, they illustrate the importance of the choice of the random model. Crucially, conclusions based on corrected similarity measures can change depending on the random model for clusterings. Therefore, previous studies which did promote methods based on evidence from corrected similarity measures should be re-evaluated in the context of the appropriate random model for clusterings (Yeung et al., 2001; de Souto et al., 2008; Yeung and Ruzzo, 2001; Thalamuthu et al., 2006; McNicholas and Murphy, 2010).
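
For readers unfamiliar with correction for chance, the following Python sketch computes the plain Rand index and its chance-corrected form under the permutation model only (cluster sizes fixed), using the standard adjusted Rand index formula. It does not implement the corrections for the other random models that the paper derives, and the function name and toy labelings are illustrative.

```python
from collections import Counter
from math import comb
from typing import Hashable, Sequence, Tuple


def rand_scores(a: Sequence[Hashable], b: Sequence[Hashable]) -> Tuple[float, float]:
    """Return (Rand index, Rand index adjusted for chance under the permutation model)."""
    if len(a) != len(b):
        raise ValueError("both labelings must cover the same elements")
    if len(a) < 2:
        raise ValueError("at least two elements are needed to form a pair")
    total_pairs = comb(len(a), 2)
    # Pairs placed in the same cluster by both labelings, by a alone, and by b alone.
    same_both = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    same_a = sum(comb(c, 2) for c in Counter(a).values())
    same_b = sum(comb(c, 2) for c in Counter(b).values())
    rand = (total_pairs + 2 * same_both - same_a - same_b) / total_pairs
    expected = same_a * same_b / total_pairs  # expectation with cluster sizes held fixed
    maximum = (same_a + same_b) / 2
    adjusted = 1.0 if maximum == expected else (same_both - expected) / (maximum - expected)
    return rand, adjusted


print(rand_scores([0, 0, 1, 1], [0, 0, 1, 1]))
print(rand_scores([0, 0, 1, 1], [0, 1, 0, 1]))
```

In the second call the uncorrected index is positive while the corrected value is negative, meaning the agreement is below what random clusterings with the same cluster sizes would produce on average. Changing the random model changes the `expected` term, which is how the choice of model can alter conclusions.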

Learning Scalable Deep Kernels with Recurrent Structure

Abstract: Many applications in speech, robotics, finance, and biology deal with sequential data, where ordering matters and recurrent structures are common. However, this structure cannot be easily captured by standard kernel functions. To model such structure, we propose expressive closed-form kernel functions for Gaussian processes. The resulting model, GP-LSTM, fully encapsulates the inductive biases of long short-term memory (LSTM) recurrent networks, while retaining the non-parametric probabilistic advantages of Gaussian processes. We learn the properties of the proposed kernels by optimizing the Gaussian process marginal likelihood using a new provably convergent semi-stochastic gradient procedure, and exploit the structure of these kernels for scalable training and prediction. This approach provides a practical representation for Bayesian LSTMs. We demonstrate state-of-the-art performance on several benchmarks, and thoroughly investigate a consequential autonomous driving application, where the predictive uncertainties provided by GP-LSTM are uniquely valuable.
Discussion: We proposed a method for learning kernels with recurrent long short-term memory structure on sequences. Gaussian processes with such kernels, termed the GP-LSTM, have the structure and learning biases of LSTMs, while retaining a probabilistic Bayesian nonparametric representation. The GP-LSTM outperforms a range of alternatives on several sequence-to-reals regression tasks. The GP-LSTM also works on data with low and high signal-to-noise ratios, and can be scaled to very large datasets, all with a straightforward, practical, and generally applicable model specification. Moreover, the semi-stochastic scheme proposed in our paper is provably convergent and efficient in practical settings, in conjunction with structure-exploiting algebra. In short, the GP-LSTM provides a natural mechanism for Bayesian LSTMs, quantifying predictive uncertainty while harmonizing with the standard deep learning toolbox. Predictive uncertainty is of high value in robotics applications, such as autonomous driving, and could also be applied to other areas such as financial modeling and computational biology.

Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

Abstract: There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting’s interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which creates what we denote as a spiked-smooth classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping.
Concluding Remarks: AdaBoost is an undeniably successful algorithm and random forests is at least as good, if not better. But AdaBoost is as puzzling as it is successful; it broke the basic rules of statistics by iteratively fitting even noisy data sets until every training set data point was fit without error. Even more puzzling, to statisticians at least, it will continue to iterate an already perfectly fit algorithm, which lowers generalization error. The statistical view of boosting understands AdaBoost to be a stagewise optimization of an exponential loss, which suggests (demands!) regularization of tree size and control on the number of iterations.
In contrast, a random forest is not an optimization; it appears to work best with large trees and as many iterations as possible. It is widely believed that AdaBoost is effective because it is an optimization, while random forests works, well, because it works. Breiman conjectured that “it is my belief that in its later stages AdaBoost is emulating a random forest” (Breiman, 2001). This paper sheds some light on this conjecture by providing a novel intuition supported by examples to show how AdaBoost and random forests are successful for the same reason.
A random forests model is a weighted ensemble of interpolating classifiers by construction. Although it is much less evident, we have shown that AdaBoost is also a weighted ensemble of interpolating classifiers. Viewed in this way, AdaBoost is actually a “random” forest of forests. The trees in random forests and the forests in AdaBoost each interpolate the data without error. As the number of iterations increases, the averaged decision surface becomes smoother but nevertheless still interpolates. This is accomplished by whittling down the decision boundary around error points. We hope to have cast doubt on the commonly held belief that the later iterations of AdaBoost only serve to overfit the data. Instead, we argue that these later iterations lead to an “averaging effect”, which causes AdaBoost to behave like a random forest.
A central part of our discussion also focused on the merits of interpolation of the training data, when coupled with averaging. Again, we hope to dispel the commonly held belief that interpolation always leads to overfitting. We have argued instead that fitting the training data in extremely local neighborhoods actually serves to prevent overfitting in the presence of averaging. The local fits serve to prevent noise points from having undue influence over the fit in other areas. Random forests and AdaBoost both achieve this desirable level of local interpolation by fitting deep trees. It is our hope that our emphasis on the “self-averaging” and interpolating aspects of AdaBoost will lead to a broader discussion of this classifier’s success that extends beyond the more traditional emphasis on margins and exponential loss minimization.

Time Series Prediction with the Self-Organizing Map: A Review

Summary. We provide a comprehensive and updated survey on applications of Kohonen’s self-organizing map (SOM) to time series prediction (TSP). The main goal of the paper is to show that, despite being originally designed as an unsupervised learning algorithm, the SOM is flexible enough to give rise to a number of efficient supervised neural architectures devoted to TSP tasks. For each SOM-based architecture to be presented, we report its algorithm implementation in detail. Similarities and differences of such SOM-based TSP models with respect to standard linear and nonlinear TSP techniques are also highlighted. We conclude the paper with indications of possible directions for further research in this field.
Conclusion: In this paper we reviewed several applications of Kohonen’s SOM-based models to time series prediction. Our main goal was to show that the SOM can perform efficiently in this task and can compete equally with well-known neural architectures, such as MLP and RBF networks, which are more commonly used. In this sense, the main advantages of SOM-based models over MLP- or RBF-based models are the inherent local modeling property, which favors the interpretability of the results, and the facility in developing growing architectures, which alleviates the burden of specifying an adequate number of neurons (prototype vectors).

Applying deep learning to classify pornographic images and videos

Abstract. It is no secret that pornographic material is now one click away from everyone, including children and minors. General social media networks are striving to isolate adult images and videos from normal ones. Intelligent image analysis methods can help to automatically detect and isolate questionable images in media. Unfortunately, these methods require vast experience to design the classifier, including one or more of the popular computer vision feature descriptors. We propose to build a classifier based on one of the recently flourishing deep learning techniques. Convolutional neural networks contain many layers for both automatic feature extraction and classification. The benefit is an easier system to build (no need for hand-crafting features and classifiers). Additionally, our experiments show that it is even more accurate than the state-of-the-art methods on the most recent benchmark dataset.
Conclusions: We proposed applying convolutional neural networks to automatically classify pornographic images and videos. We showed that our proposed fully automated solution outperformed hand-crafted feature descriptor solutions in accuracy. We are continuing our research to find an even better network architecture for this problem. Nevertheless, all the successful applications so far rely on supervised training methods. We expect that a new wave of deep learning networks will emerge by combining supervised and unsupervised methods, where a network can learn from its mistakes while in actual deployment. We believe further research can also be directed toward allowing machines to consider the context and overall rhetorical meaning of a video clip while relating them to the images