Abstract: Document classification is challenging due to handling of voluminous and highly non-linear data, generated exponentially in the era of digitization. Proper representation of documents increases efficiency and performance of classification, the ultimate goal being retrieval of information from a large corpus. Deep neural network models learn features for document classification, unlike the engineered feature based approaches where features are extracted or selected from the data. In the paper we investigate performance of different classifiers based on the features obtained using two approaches. We apply a deep autoencoder for learning features, while engineered features are extracted by exploiting semantic association within the terms of the documents. Experimentally it has been observed that learned feature based classification always performs better than the proposed engineered feature based classifiers.
Conclusion and Future Work: In the paper we emphasize the importance of feature representation for classification. The potential of deep learning in feature extraction process for efficient compression and representation of raw features is explored. By conducting multiple experiments we deduce that a DBN – Deep AE feature extractor and a DNNC outperforms most other techniques providing a trade-off between accuracy and execution time. In this paper we have dealt with the most significant feature extraction and classification techniques for text documents where each text document belongs to a single class label. With the explosion of digital information a large number of documents may belong to multiple class labels, handling of which is a new challenge and scope of future work. Word2vec models in association with Recurrent Neural Networks (RNN) [4,14] have recently started gaining popularity in feature representation domain. We would like to compare their performance with our deep learning method in future. Similar feature extraction techniques can also be applied to image data to generate compressed feature which can facilitate efficient classification. We would also like to explore such possibilities in our future work.
Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered
variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical
Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.
Abstract: Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
Abstract: The standard architecture of synthetic aperture radar (SAR) automatic target recognition (ATR) consists of three stages: detection, discrimination, and classification. In recent years, convolutional neural networks (CNNs) for SAR ATR have been proposed, but most of them classify target classes from a target chip extracted from SAR imagery, as a classification for the third stage of SAR ATR. In this report, we propose a novel CNN for end-to-end ATR from SAR imagery. The CNN named verification support network (VersNet) performs all three stages of SAR ATR end-to-end. VersNet inputs a SAR image of arbitrary sizes with multiple classes and multiple targets, and outputs a SAR ATR image representing the position, class, and pose of each detected target. This report describes the evaluation results of VersNet which trained to output scores of all 12 classes: 10 target classes, a target front class, and a background class, for each pixel using the moving and stationary target acquisition and recognition (MSTAR) public dataset.
Conclusion: By applying CNN to the third stage classification in the standard architecture of SAR ATR, the performance has been improved. In order to improve the overall performance of SAR ATR, it is important not only to improve the performance of the third stage classification but also to improve the performance of the first stage detection and the second stage discrimination. In this report, we proposed a CNN based on a new architecture of SAR ATR that consists of a single stage, i.e. end-to-end, not the standard architecture of SAR ATR. Unlike conventional CNNs for target classification, the CNN named VersNet inputs a SAR image of arbitrary sizes with multiple classes and multiple targets, and outputs a SAR ATR image representing the position, class, and pose of each detected target. We trained the VersNet to output scores including ten target classes on MSTAR dataset and evaluated its performance. The average IoU for all the pixels of testing (2420 target chips) is over 0.9. Also, the classification accuracy is about 99.5%, if we select the majority class of maximum probability for each pixel as the predicted class.
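The two figures quoted in this conclusion (per-pixel IoU and chip-level accuracy by majority vote) can be illustrated with a minimal sketch. The 8-pixel chip, the label strings and the helper names `iou` and `majority_class` are made up for illustration; they are not taken from the VersNet report.

```python
from collections import Counter

def iou(pred, truth, cls):
    """Intersection over union for one class over flat lists of pixel labels."""
    inter = sum(1 for p, t in zip(pred, truth) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, truth) if p == cls or t == cls)
    return inter / union if union else float("nan")

def majority_class(pixel_labels, ignore=("background",)):
    """Chip-level prediction: the most frequent per-pixel label, ignoring background."""
    counts = Counter(label for label in pixel_labels if label not in ignore)
    return counts.most_common(1)[0][0] if counts else None

# Hypothetical chip: each entry is the highest-scoring class of one pixel.
truth = ["background", "background", "T72", "T72", "T72", "T72", "background", "background"]
pred = ["background", "T72", "T72", "T72", "T72", "BMP2", "background", "background"]

t72_iou = iou(pred, truth, "T72")   # 3 shared pixels out of 5 in the union
chip_class = majority_class(pred)   # "T72" wins the vote 4 to 1
```

The point of the majority vote is that a few misclassified pixels (the single "BMP2" pixel here) do not change the chip-level decision, which is one reason chip accuracy can be higher than pixel-level IoU.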
ABSTRACT— Anomaly detection in database management systems (DBMSs) is difficult because of increasing number of statistics (stat) and event metrics in big data system. In this paper, I propose an automatic DBMS diagnosis system that detects anomaly periods with abnormal DB stat metrics and finds causal events in the periods. Reconstruction error from deep autoencoder and statistical process control approach are applied to detect time period with anomalies. Related events are found using time series similarity measures between events and abnormal stat metrics. After training deep autoencoder with DBMS metric data, efficacy of anomaly detection is investigated from other DBMSs containing anomalies. Experiment results show effectiveness of proposed model, especially, batch temporal normalization layer. Proposed model is used for publishing automatic DBMS diagnosis reports in order to determine DBMS configuration and SQL tuning.
CONCLUSION AND FUTURE WORK I proposed a machine learning model for automatic DBMS diagnosis. The proposed model detects anomaly periods from reconstruct error with deep autoencoder. I also verified empirically that temporal normalization is essential when input data is non-stationary multivariate time series. With SPC approach, time period is considered anomaly period when reconstruction error is outside of control limit. According types or users of DBMSs, decision rules that are used in SPC can be added. For example, warning line with 2 sigma can be utilized to decide whether it is anomaly or not [12, 13]. In this paper, anomaly detection test is proceeded in other DBMSs whose data is not used in training, because performance of basic pre-trained model is important in service providers’ perspective. Efficacy of detection performance is validated with blind test and DBAs’ opinions. The result of automatic anomaly diagnosis would help DB consultants save time for anomaly periods and main wait events. Thus, they can concentrate on only making solution when DB disorders occur. For better performance of anomaly detection, additional training can be proceeded after pre-trained model is adopted. In addition, recurrent and convolutional neural network can be used in reconstruction part to capture hidden representation of sequential and local relationship. If anomaly labeled data is generated, detection result can be analyzed with numerical performance measures. However, in practice, it is hard to secure labeled anomaly dataset according to each DBMS. Proposed model is meaningful in unsupervised anomaly detection model that doesn’t need labeled data and can be generalized to other DBMSs with pre-trained model
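The SPC rule described above (a period is anomalous when its reconstruction error falls outside the control limit) can be sketched as follows. This is only an illustration: the error values are invented, the 3-sigma upper limit is the usual Shewhart convention and an assumption here, and the function names are hypothetical. The autoencoder itself is not shown; the lists stand in for its reconstruction errors.

```python
from statistics import mean, pstdev

def upper_limit(baseline_errors, k):
    """Upper limit = mean + k * standard deviation of errors from normal operation."""
    return mean(baseline_errors) + k * pstdev(baseline_errors)

def flag_periods(errors, limit):
    """Indices of time periods whose reconstruction error exceeds the limit."""
    return [i for i, e in enumerate(errors) if e > limit]

# Hypothetical reconstruction errors from a period known to be healthy.
baseline = [0.10, 0.12, 0.11, 0.09, 0.10, 0.13, 0.11, 0.10]
# Hypothetical errors from a DBMS under test.
observed = [0.11, 0.10, 0.45, 0.12]

control = upper_limit(baseline, k=3)   # control limit (assumed 3 sigma)
warning = upper_limit(baseline, k=2)   # 2-sigma warning line mentioned in the text
anomalies = flag_periods(observed, control)
warnings = flag_periods(observed, warning)
```

Lowering `k` trades fewer missed anomalies for more false alarms, which is why the text suggests adapting the decision rules to the type or users of the DBMS.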
Abstract We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).
Conclusion We develop a model which exceeds the cardiologist performance in detecting a wide range of heart arrhythmias from single-lead ECG records. Key to the performance of the model is a large annotated dataset and a very deep convolutional network which can map a sequence of ECG samples to a sequence of arrhythmia annotations. On the clinical side, future work should investigate extending the set of arrhythmias and other forms of heart disease which can be automatically detected with high-accuracy from single or multiple lead ECG records. For example we do not detect Ventricular Flutter or Fibrillation. We also do not detect Left or Right Ventricular Hypertrophy, Myocardial Infarction or a number of other heart diseases which do not necessarily exhibit as arrhythmias. Some of these may be difficult or even impossible to detect on a single-lead ECG but can often be seen on a multiple-lead ECG. Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.
Abstract Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.
Abstract: Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.
Discussion: Normalization, either Batch Normalization, Layer Normalization, or Weight Normalization, makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first order gradient method that can fully eliminate this effect. However, a direct solution of forcing ‖w‖ = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure. As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate might not necessarily be bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of effective learning rate can be hard to control, and can depend a lot on initial steps of training, which makes it harder to reproduce results. With batch normalization we have added two additional parameters, γ and β, and it of course makes sense to also regularize these. In our experiments we did not use regularization for these parameters, though preliminary experiments show that regularization here does not affect the results. This is not very surprising, since with rectified linear activation functions, scaling of γ also has no effect on the function value in subsequent layers. So the only parameters that are actually regularized are the γ’s for the last layer of the network.
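The scale invariance discussed above is easy to check numerically. The sketch below is an illustration under stated assumptions, not the authors' experiment: it uses a single batch-normalized linear unit without the γ and β parameters, a made-up batch and targets, and finite-difference gradients. Multiplying w by α leaves the output unchanged while the gradient shrinks by a factor of α, which is why the weight norm acts on the effective learning rate.

```python
import math

def batchnorm_unit(w, batch):
    """Pre-activations z = w . x for each sample, normalized over the batch."""
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in batch]
    m = sum(z) / len(z)
    s = math.sqrt(sum((zi - m) ** 2 for zi in z) / len(z))
    return [(zi - m) / s for zi in z]

def loss(w, batch, targets):
    """Mean squared error between normalized outputs and targets."""
    out = batchnorm_unit(w, batch)
    return sum((o - t) ** 2 for o, t in zip(out, targets)) / len(out)

def numerical_grad(w, batch, targets, h=1e-6):
    """Central finite-difference gradient of the loss with respect to w."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += h
        down[i] -= h
        grad.append((loss(up, batch, targets) - loss(down, batch, targets)) / (2 * h))
    return grad

def norm(v):
    return math.sqrt(sum(x * x for x in v))

# Hypothetical data, chosen only so that the numbers are easy to follow.
batch = [[1.0, 2.0], [2.0, 0.5], [0.0, 1.5], [3.0, 1.0]]
targets = [1.0, -1.0, 0.5, -0.5]
w = [0.5, -1.0]
alpha = 10.0
w_scaled = [alpha * wi for wi in w]

out_w = batchnorm_unit(w, batch)
out_scaled = batchnorm_unit(w_scaled, batch)      # same outputs as out_w
grad_ratio = norm(numerical_grad(w, batch, targets)) / norm(numerical_grad(w_scaled, batch, targets))

# The direct fix mentioned in the text: keep w on the unit sphere.
w_unit = [wi / norm(w) for wi in w]
```

Here `grad_ratio` comes out close to `alpha`: with a fixed step size, the larger the weights, the less each update rotates the direction of w, which is the only thing the normalized unit depends on.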
Abstract: Deep learning is the state-of-the-art in fields such as visual object recognition and speech recognition. This learning uses a large number of layers, huge number of units, and connections. Therefore, overfitting is a serious problem. To avoid this problem, dropout learning is proposed. Dropout learning neglects some inputs and hidden units in the learning process with a probability, p, and then, the neglected inputs and hidden units are combined with the learned network to express the final output. We find that the process of combining the neglected hidden units with the learned network can be regarded as ensemble learning, so we analyze dropout learning from this point of view.
Results: After the learning, the ensemble output is calculated by using the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning except for using a different set of hidden units in every learning iteration. Using a different set of hidden units outperforms ensemble learning. We also showed that dropout learning achieves the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with ReLU activation function.
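The ensemble reading of dropout can be made concrete for the simplest possible case. The sketch below assumes a single linear unit (no activation function), made-up weights and inputs, and takes p as the probability of dropping an input; it enumerates every sub-network and checks that the probability-weighted average of their outputs equals the full network with its output scaled by (1 - p). With nonlinear units this equality only holds approximately.

```python
from itertools import product

def subnetwork_output(w, x, mask):
    """Output of the linear unit when only the inputs with mask == 1 are kept."""
    return sum(m * wi * xi for m, wi, xi in zip(mask, w, x))

def ensemble_average(w, x, p_drop):
    """Average over all 2^n sub-networks, each weighted by its sampling probability."""
    total = 0.0
    for mask in product((0, 1), repeat=len(w)):
        kept = sum(mask)
        prob = (1 - p_drop) ** kept * p_drop ** (len(w) - kept)
        total += prob * subnetwork_output(w, x, mask)
    return total

# Hypothetical weights and inputs.
w = [0.4, -0.3, 0.8]
x = [1.0, 2.0, 0.5]
p = 0.5

averaged = ensemble_average(w, x, p)
scaled_full = (1 - p) * sum(wi * xi for wi, xi in zip(w, x))
```

Enumerating all masks is only feasible for tiny networks; in practice the scaled full network is used precisely because it stands in for this average.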
Motivation: Tumor classification using Imaging Mass Spectrometry (IMS) data has a high potential for future applications in pathology. Due to the complexity and size of the data, automated feature extraction and classification steps are required to fully process the data. Deep learning offers an approach to learn feature extraction and classification combined in a single model. Commonly these steps are handled separately in IMS data analysis, hence deep learning offers an alternative strategy worthwhile to explore.
Results: Methodologically, we propose an adapted architecture based on deep convolutional networks to handle the characteristics of mass spectrometry data, as well as a strategy to interpret the learned model in the spectral domain based on a sensitivity analysis. The proposed methods are evaluated on two challenging tumor classification tasks and compared to a baseline approach. Competitiveness of the proposed methods are shown on both tasks by studying the performance via cross-validation. Moreover, the learned models are analyzed by the proposed sensitivity analysis revealing biologically plausible effects as well as confounding factors of the considered task. Thus, this study may serve as a starting point for further development of deep learning approaches in IMS classification tasks.
A great paper with a solid theoretical foundation on generating top-N recommendations in very sparse scenarios (e.g. a 0-5 rating system in which few people actually record a rating).
Recently, this problem of recommending from a very sparse matrix was the reason Netflix changed its rating system from a 1 to 5 scale to thumbs up or thumbs down.
In any case, it is worth reading to see how the authors approach this kind of challenge.
Abstract: This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
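To make the idea tangible, here is a minimal sketch of learning W column by column and then ranking unseen items. Several things are assumptions rather than content of the abstract: the nonnegativity and zero-diagonal constraints follow the formulation in the SLIM paper, the coordinate-descent solver is just one simple way to solve the ℓ1/ℓ2-regularized problem, and the tiny purchase matrix, the parameter values and the function names are invented for illustration.

```python
def slim_column(A, j, l1=0.1, l2=0.1, iters=50):
    """Solve min 1/2*||a_j - A w||^2 + l2/2*||w||^2 + l1*||w||_1 with w >= 0, w_j = 0."""
    n_users, n_items = len(A), len(A[0])
    cols = [[A[u][k] for u in range(n_users)] for k in range(n_items)]
    w = [0.0] * n_items
    resid = list(cols[j])                      # residual a_j - A w, with w = 0 at the start
    for _ in range(iters):
        for k in range(n_items):
            if k == j:
                continue                       # an item may not predict itself
            col = cols[k]
            # Correlation of item k with the residual, leaving out k's own contribution.
            rho = sum(c * (r + c * w[k]) for c, r in zip(col, resid))
            new = max(0.0, rho - l1) / (sum(c * c for c in col) + l2)
            if new != w[k]:
                resid = [r - c * (new - w[k]) for c, r in zip(col, resid)]
                w[k] = new
    return w

def fit_slim(A, l1=0.1, l2=0.1):
    """Return W with W[k][j] = weight of item k when scoring item j."""
    n_items = len(A[0])
    columns = [slim_column(A, j, l1, l2) for j in range(n_items)]
    return [[columns[j][k] for j in range(n_items)] for k in range(n_items)]

def top_n(A, W, user, n):
    """Rank the items the user has not interacted with by the score a_u . W[:, i]."""
    row = A[user]
    scores = {i: sum(row[k] * W[k][i] for k in range(len(row)))
              for i in range(len(row)) if row[i] == 0}
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical purchase matrix: rows are users, columns are items.
A = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 0],   # user 4 bought only item 0
]
W = fit_slim(A)
recommendation = top_n(A, W, user=4, n=1)
```

In this toy matrix items 0 and 1 are bought together, so the only nonzero weight leaving item 0 points at item 1 and user 4 is recommended item 1. The ℓ1 term is what drives the unrelated weights to exactly zero, and that sparsity is what makes scoring fast.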
This is one of the theoretical secrets behind Netflix: why, computationally, treat every customer as different if some of them have similar preferences?
Abstract: Item-based approaches based on SLIM (Sparse LInear Methods) have demonstrated very good performance for top-N recommendation; however they only estimate a single model for all the users. This work is based on the intuition that not all users behave in the same way — instead there exist subsets of like-minded users. By using different item-item models for these user subsets, we can capture differences in their preferences and this can lead to improved performance for top-N recommendations. In this work, we extend SLIM by combining global and local SLIM models. We present a method that computes the prediction scores as a user-specific combination of the predictions derived by a global and local item-item models. We present an approach in which the global model, the local models, their user-specific combination, and the assignment of users to the local models are jointly optimized to improve the top-N recommendation performance. Our experiments show that the proposed method improves upon the standard SLIM model and outperforms competing top-N recommendation approaches.
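The user-specific combination of a global and a local item-item model can be sketched in a few lines. Everything below is an assumption made for illustration: the weight `g_u` (1 means global model only, 0 means local model only), the small similarity matrices, and the function name. The joint optimization and the assignment of users to subsets described in the abstract are not shown.

```python
def combined_score(user_items, item, S_global, S_local, g_u):
    """Score = sum over the user's items of g_u * global + (1 - g_u) * local similarity."""
    return sum(g_u * S_global[l][item] + (1 - g_u) * S_local[l][item]
               for l in user_items)

# Hypothetical item-item weights: one global model, one local model for the user's subset.
S_global = [
    [0.0, 0.2, 0.1],
    [0.2, 0.0, 0.3],
    [0.1, 0.3, 0.0],
]
S_local = [
    [0.0, 0.6, 0.0],
    [0.6, 0.0, 0.1],
    [0.0, 0.1, 0.0],
]

user_items = [0]                      # the user interacted with item 0 only
balanced = combined_score(user_items, 1, S_global, S_local, g_u=0.5)
global_only = combined_score(user_items, 1, S_global, S_local, g_u=1.0)
```

A user whose taste matches their subset better than the crowd would get a small `g_u`, so the local model dominates their ranking.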
One to study with a pencil in hand and coffee in the mug.
1. Understanding / Generalization / Transfer
Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]
2. Optimization / Training Techniques
Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy [pdf]
3. Unsupervised / Generative Models
Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]
4. Convolutional Neural Network Models
Deep residual learning for image recognition (2016), K. He et al. [pdf]
5. Image: Segmentation / Object Detection
Fast R-CNN (2015), R. Girshick [pdf]
6. Image / Video / Etc.
Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]
7. Natural Language Processing / RNNs
Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]
8. Speech / Other Domain
Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]
9. Reinforcement Learning / Robotics
Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf]
10. More Papers from 2016
Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf]
A good (and long) answer can be found in this thesis by Didrik Nielsen.
Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling.
It has shown remarkable results for a vast array of problems.
For many years, MART has been the tree boosting method of choice.
More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.
In this thesis, we will investigate how XGBoost differs from the more traditional MART.
We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.
Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.
In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.
To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.
The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.
Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth is combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. 
Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.
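The difference between the two boosting algorithms shows up already in how a single leaf gets its value. The sketch below assumes logistic loss with labels in {0, 1}, a made-up leaf of four samples, and hypothetical function names. The Newton step uses XGBoost's leaf weight, -G / (H + λ), where G and H are the sums of first and second derivatives and λ is the L2 penalty on leaf weights; the first-order value is simply the mean negative gradient. MART's per-leaf line search is left out to keep the example short.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_hess(y, f):
    """First and second derivative of the logistic loss with respect to the raw score f."""
    p = sigmoid(f)
    return p - y, p * (1.0 - p)

def newton_leaf_weight(ys, fs, lam=1.0):
    """Second-order (Newton boosting) leaf value: -G / (H + lambda)."""
    pairs = [grad_hess(y, f) for y, f in zip(ys, fs)]
    G = sum(g for g, _ in pairs)
    H = sum(h for _, h in pairs)
    return -G / (H + lam)

def gradient_leaf_value(ys, fs):
    """First-order leaf value: mean of the negative gradients that fall in the leaf."""
    grads = [grad_hess(y, f)[0] for y, f in zip(ys, fs)]
    return -sum(grads) / len(grads)

# Hypothetical leaf: four samples, current raw score 0 for all of them.
ys = [1, 1, 0, 1]
fs = [0.0, 0.0, 0.0, 0.0]

w_newton = newton_leaf_weight(ys, fs, lam=1.0)         # G = -1, H = 1
w_newton_unpenalized = newton_leaf_weight(ys, fs, lam=0.0)
w_gradient = gradient_leaf_value(ys, fs)
```

The penalty λ shrinks the leaf weight toward zero (0.5 instead of 1.0 here), which is one of the ways the individual trees are regularized.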
Abstract: In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient (R2) between model and predicted values.
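The two metrics used to compare the models are simple to compute. The sketch below uses invented sales figures and hypothetical function names; it returns the Pearson correlation r, which can be squared if R2 is wanted.

```python
import math

def mape(actual, predicted):
    """Mean absolute percentage error, in percent. Actual values must be nonzero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

def pearson_r(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical monthly sales and model forecasts.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]

error_pct = mape(actual, forecast)       # errors of 10%, 10% and 0%
correlation = pearson_r(actual, forecast)
```

MAPE rewards forecasts that are close in relative terms, while the correlation only rewards moving in the same direction as the actual series, so a model can do well on one and poorly on the other.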
The biggest challenge industry currently faces with Deep Learning is, without a doubt, computational: the whole market is both adopting cloud services to run increasingly complex computations and investing in GPU computing capacity.
However, even though hardware is already a commodity these days, academia is working on a problem that could revolutionize how Deep Learning is done: the architectural/parameterization side.
This comment from the thread says a lot about the problem; the user writes:
“The main problem I see with Deep Learning: too many parameters.
When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.”
In other words, even with all the hardware available, the question of what the best architectural arrangement for a deep neural network is remains unanswered.
This paper by Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah, titled “Failures of Deep Learning”, lays out the problem in considerable depth, including experiments (this is the Git repository).
The authors identify the following failure points of Deep Learning networks: a) shortcomings of gradient-based methods for parameter optimization, b) structural problems in how Deep Learning algorithms decompose problems, c) architecture, and d) saturation of the activation functions.
In other words, in a large share of Deep Learning applications, convergence time could be much shorter still if these issues were already solved.
With that solved, much of what we know today as the hardware industry for Deep Learning networks would either be extremely underused (i.e. given the improvement in architectural/algorithmic optimization) or could be put to work on more complex tasks (e.g. image recognition with a low number of pixels).
So even with a hardware-driven approach such as the one industry has been taking, there is still plenty of room to optimize Deep Learning networks from the architectural and algorithmic standpoint.
Below is a list of references straight from Stack Exchange for anyone who wants to dig deeper into the subject:
- Zaremba, Wojciech, Ilya Sutskever, and Rafal Jozefowicz. “An empirical exploration of recurrent network architectures.” (2015): used evolutionary computation to find optimal RNN structures.
- Franck Dernoncourt. “The medial Reticular Formation: a neural substrate for action selection? An evaluation via evolutionary computation.” Master’s Thesis. École Normale Supérieure Ulm. 2011.
- Bayer, Justin, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber. “Evolving memory cell structures for sequence learning.” In International Conference on Artificial Neural Networks, pp. 755-764. Springer Berlin Heidelberg, 2009.: used evolutionary computation to find optimal RNN structures.
Reinforcement Learning:
- Jose M Alvarez, Mathieu Salzmann. Learning the Number of Neurons in Deep Networks. NIPS 2016. https://arxiv.org/abs/1611.06321
- Bowen Baker, Otkrist Gupta, Nikhil Naik, Ramesh Raskar. Designing Neural Network Architectures using Reinforcement Learning. https://arxiv.org/abs/1611.02167
- Barret Zoph, Quoc V. Le. Neural Architecture Search with Reinforcement Learning. https://arxiv.org/abs/1611.01578
- Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas. Learning to learn by gradient descent by gradient descent. https://arxiv.org/abs/1606.04474
- Franck Dernoncourt, Ji Young Lee Optimizing Neural Network Hyperparameters with Gaussian Processes for Dialog Act Classification, IEEE SLT 2016.
- Cortes, Corinna, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. “AdaNet: Adaptive Structural Learning of Artificial Neural Networks.” arXiv preprint arXiv:1607.01097 (2016). https://arxiv.org/abs/1607.01097 : Approach that learns both the structure of the network as well as its weights.
PS: WordPress removed the option to justify text, so apologies in advance for the blog's amateur look over the next few days.