Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

ABSTRACT— Anomaly detection in database management systems (DBMSs) is difficult because of increasing number of statistics (stat) and event metrics in big data system. In this paper, I propose an automatic DBMS diagnosis system that detects anomaly periods with abnormal DB stat metrics and finds causal events in the periods. Reconstruction error from deep autoencoder and statistical process control approach are applied to detect time period with anomalies. Related events are found using time series similarity measures between events and abnormal stat metrics. After training deep autoencoder with DBMS metric data, efficacy of anomaly detection is investigated from other DBMSs containing anomalies. Experiment results show effectiveness of proposed model, especially, batch temporal normalization layer. Proposed model is used for publishing automatic DBMS diagnosis reports in order to determine DBMS configuration and SQL tuning.

CONCLUSION AND FUTURE WORK I proposed a machine learning model for automatic DBMS diagnosis. The proposed model detects anomaly periods from reconstruct error with deep autoencoder. I also verified empirically that temporal normalization is essential when input data is non-stationary multivariate time series. With SPC approach, time period is considered anomaly period when reconstruction error is outside of control limit. According types or users of DBMSs, decision rules that are used in SPC can be added. For example, warning line with 2 sigma can be utilized to decide whether it is anomaly or not [12, 13]. In this paper, anomaly detection test is proceeded in other DBMSs whose data is not used in training, because performance of basic pre-trained model is important in service providers’ perspective. Efficacy of detection performance is validated with blind test and DBAs’ opinions. The result of automatic anomaly diagnosis would help DB consultants save time for anomaly periods and main wait events. Thus, they can concentrate on only making solution when DB disorders occur. For better performance of anomaly detection, additional training can be proceeded after pre-trained model is adopted. In addition, recurrent and convolutional neural network can be used in reconstruction part to capture hidden representation of sequential and local relationship. If anomaly labeled data is generated, detection result can be analyzed with numerical performance measures. However, in practice, it is hard to secure labeled anomaly dataset according to each DBMS. Proposed model is meaningful in unsupervised anomaly detection model that doesn’t need labeled data and can be generalized to other DBMSs with pre-trained model

Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Abstract We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of boardcertified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).

Conclusion We develop a model which exceeds the cardiologist performance in detecting a wide range of heart arrhythmias from single-lead ECG records. Key to the performance of the model is a large annotated dataset and a very deep convolutional network which can map a sequence of ECG samples to a sequence of arrhythmia annotations. On the clinical side, future work should investigate extending the set of arrhythmias and other forms of heart disease which can be automatically detected with high-accuracy from single or multiple lead ECG records. For example we do not detect Ventricular Flutter or Fibrillation. We also do not detect Left or Right Ventricular Hypertrophy, Myocardial Infarction or a number of other heart diseases which do not necessarily exhibit as arrhythmias. Some of these may be difficult or even impossible to detect on a single-lead ECG but can often be seen on a multiple-lead ECG. Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Learning to Optimize Neural Nets

Abstract Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR- 100


Learning to Optimize Neural Nets

L2 Regularization versus Batch and Weight Normalization

Abstract: Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the in- fluence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.

Discussion: Normalization, either Batch Normalization, Layer Normalization, or Weight Normalization makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first order gradient method that can fully eliminate this effect. However, a direct solution of forcing kwk = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure. As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate might not necessarily be bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of effective learning rate can be hard to control, and can depend a lot on initial steps of training, which makes it harder to reproduce results. With batch normalization we have added two additional parameters, γ and β, and it of course makes sense to also regularize these. In our experiments we did not use regularization for these parameters, though preliminary experiments show that regularization here does not affect the results. This is not very surprising, since with rectified linear activation functions, scaling of γ also has no effect on the function value in subsequent layers. So the only parameters that are actually regularized are the γ’s for the last layer of the network.

L2 Regularization versus Batch and Weight Normalization

Analysis of dropout learning regarded as ensemble learning

Abstract: Deep learning is the state-of-the-art in fields such as visual object recognition and speech recognition. This learning uses a large number of layers, huge number of units, and connections. Therefore, overfitting is a serious problem. To avoid this problem, dropout learning is proposed. Dropout learning neglects some inputs and hidden units in the learning process with a probability, p, and then, the neglected inputs and hidden units are combined with the learned network to express the final output. We find that the process of combining the neglected hidden units with the learned network can be regarded as ensemble learning, so we analyze dropout learning from this point of view.

Results: After the learning, the ensemble output is calculated by using the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning except for using a different set of hidden units in every learning iteration. Using a different set of hidden unit outperforms ensemble learning. We also showed that dropout learning achieves the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with ReLU activation function.

Analysis of dropout learning regarded as ensemble learning

Deep Learning for Tumor Classification in Imaging Mass Spectrometry

Motivation: Tumor classification using Imaging Mass Spectrometry (IMS) data has a high potential for future applications in pathology. Due to the complexity and size of the data, automated feature extraction and classification steps are required to fully process the data. Deep learning offers an approach to learn feature extraction and classification combined in a single model. Commonly these steps are handled separately in IMS data analysis, hence deep learning offers an alternative strategy worthwhile to explore.

Results: Methodologically, we propose an adapted architecture based on deep convolutional networks to handle the characteristics of mass spectrometry data, as well as a strategy to interpret the learned model in the spectral domain based on a sensitivity analysis. The proposed methods are evaluated on two challenging tumor classification tasks and compared to a baseline approach. Competitiveness of the proposed methods are shown on both tasks by studying the performance via cross-validation. Moreover, the learned models are analyzed by the proposed sensitivity analysis revealing biologically plausible effects as well as confounding factors of the considered task. Thus, this study may serve as a starting point for further development of deep learning approaches in IMS classification tasks.

Source Code: Learning


Deep Learning for Tumor Classification in Imaging Mass Spectrometry

SLIM: Sparse Linear Methods for Top-N Recommender Systems

Um ótimo artigo de base teórica, relativo a geração de Top-N recomendações em cenários bem esparsos (e.g. sistema de rating 0-5 em que poucas pessoas fazem a anotação do rating, etc).

Recentemente, esse problema de recomendar dentro de uma matriz muito esparsa foi o motivo pelo qual o Netflix mudou o seu sistema de Rating que era de 1 a 5 para jóia ou ruim.


Em todo o caso vale a pena a leitura para ver a forma na qual os autores estão trabalhando nesse tipo de desafio.

Abstract: This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an `1-norm and `2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods. 

SLIM: Sparse Linear Methods for Top-N Recommender Systems

Local Item-Item Models For Top-N Recommendation

Esse é um dos segredos teóricos por trás do Netflix: Porque computacionalmente tratar todos os clientes como diferentes, se alguns deles têm preferências semelhantes.

Abstract: Item-based approaches based on SLIM (Sparse LInear Methods) have demonstrated very good performance for top-N recommendation; however they only estimate a single model for all the users. This work is based on the intuition that not all users behave in the same way — instead there exist subsets of like-minded users. By using different item-item models for these user subsets, we can capture differences in their preferences and this can lead to improved performance for top-N recommendations. In this work, we extend SLIM by combining global and local SLIM models. We present a method that computes the prediction scores as a user-specific combination of the predictions derived by a global and local item-item models. We present an approach in which the global model, the local models, their user-specific combination, and the assignment of users to the local models are jointly optimized to improve the top-N recommendation performance. Our experiments show that the proposed method improves upon the standard SLIM model and outperforms competing top-N recommendation approaches.

Local Item-Item Models For Top-N Recommendation

Melhores papers de Deep Learning de 2012 até 2016

Para estudar com lápis na mão, e café na caneca.

Via Kdnuggets

1. Understanding / Generalization / Transfer

Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]

2. Optimization / Training Techniques

Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Loffe and C. Szegedy [pdf]

3. Unsupervised / Generative Models

Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]

4. Convolutional Neural Network Models

Deep residual learning for image recognition (2016), K. He et al. [pdf]

5. Image: Segmentation / Object Detection

Fast R-CNN (2015), R. Girshick [pdf]

6. Image / Video / Etc.

Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]

7. Natural Language Processing / RNNs

Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]

8. Speech / Other Domain

Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]

9. Reinforcement Learning / Robotics

Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf]

10. More Papers from 2016

Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf]

Melhores papers de Deep Learning de 2012 até 2016

Porque o xGBoost ganha todas as competições de Machine Learning

Uma (longa e) boa resposta está nesta tese de Didrik Nielsen.


Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling.
It has shown remarkable results for a vast array of problems.
For many years, MART has been the tree boosting method of choice.
More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.
In this thesis, we will investigate how XGBoost differs from the more traditional MART.
We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.
Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.
In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.
To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.
The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.

Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth is combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.

Porque o xGBoost ganha todas as competições de Machine Learning

Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Abstract:In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient ( R2 ) between model and predicted values.

Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Falhas na abordagem de Deep Learning: Arquiteturas e Meta-parametrização

O maior desafio corrente enfrentado pela indústria no que diz respeito à Deep Learning está sem sombra de dúvidas na parte computacional em que todo o mercado está absorvendo tanto os serviços de nuvem para realizar cálculos cada vez mais complexos como também bem como investindo em capacidade de computação das GPU.

Entretanto, mesmo com o hardware nos dias de hoje já ser um commodity, a academia está resolvendo um problema que pode revolucionar a forma na qual se faz Deep Learning que é no aspecto arquitetural/parametrização.

Esse comentário da thread diz muito a respeito desse problema em que o usuário diz:

The main problem I see with Deep Learning: too many parameters.

When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.

Ou seja, mesmo com toda a disponibilidade do hardware a questão de saber qual é o melhor arranjo arquitetural de uma rede neural profunda? ainda não está resolvido.

Este paper do Shai Shalev-Shwartz , Ohad Shamir, e Shaked Shammah chamado “Failures of Deep Learning” expõe esse problema de forma bastante rica inclusive com experimentos (este é o repositório no Git).

Os autores colocam que os pontos de falha das redes Deep Learning que são a) falta de métodos baseados em gradiente para otimização de parâmetros, b) problemas estruturais nos algoritmos de Deep Learning na decomposição dos problemas, c) arquitetura e d) saturação das funções de ativação.

Em outras palavras, o que pode estar acontecendo em grande parte das aplicações de Deep Learning é que o tempo de convergência poderia ser muito menor ainda, se estes aspectos já estivessem resolvidos.

Com isso resolvido, grande parte do que conhecemos hoje como indústria de hardware para as redes Deep Learning seria ou sub-utilizada ao extremo (i.e. dado que haverá uma melhora do ponto de vista de otimização arquitetural/algorítmica) ou poderia ser aproveitada para tarefas mais complexas (e.g. como reconhecimento de imagens com baixo número de pixels).

Desta forma mesmo adotando uma metodologia baseada em hardware como a indústria vem fazendo, há ainda muito espaço de otimização em relação às redes Deep Learning do ponto de vista arquitetural e algorítmico.

Abaixo uma lista de referências direto do Stack Exchange para quem quiser se aprofundar mais no assunto:

Algoritmos Neuro-Evolutivos

Aprendizado por Reforço:


PS: O WordPress retirou a opção de justificar texto, logo desculpem de antemão a aparência amadora do blog nos próximos dias.


Falhas na abordagem de Deep Learning: Arquiteturas e Meta-parametrização

Além do aprendizado ativo em Sistemas de Recomendação de domínio cruzado

Um dos problemas mais comuns em Sistemas de Recomendação é o famoso Cold Start (i.e. quando não há conhecimento prévio sobre os gostos de alguém que acaba de entrar na plataforma).

Esse paper trás uma perspectiva interessante sobre o assunto.

Toward Active Learning in Cross-domain Recommender Systems – Roberto Pagano, Massimo Quadrana, Mehdi Elahi, Paolo Cremonesi

Abstract: One of the main challenges in Recommender Systems (RSs) is the New User problem which happens when the system has to generate personalised recommendations for a new user whom the system has no information about. Active Learning tries to solve this problem by acquiring user preference data with the maximum quality, and with the minimum acquisition cost. Although there are variety of works in active learning for RSs research area, almost all of them have focused only on the single-domain recommendation scenario. However, several real-world RSs operate in the cross-domain scenario, where the system generates recommendations in the target domain by exploiting user preferences in both the target and auxiliary domains. In such a scenario, the performance of active learning strategies can be significantly influenced and typical active learning strategies may fail to perform properly. In this paper, we address this limitation, by evaluating active learning strategies in a novel evaluation framework, explicitly suited for the cross-domain recommendation scenario. We show that having access to the preferences of the users in the auxiliary domain may have a huge impact on the performance of active learning strategies w.r.t. the classical, single-domain scenario.

Conclusions: In this paper, we have evaluated several widely used active learning strategies adopted to tackle the cold-start problem in a novel usage scenario, i.e., Cross-domain recommendation scenario. In such a case, the user preferences are available not only in the target domain, but also in additional auxiliary domain. Hence, the active learner can exploit such knowledge to better estimate which preferences are more valuable for the system to acquire. Our results have shown that the performance of the considered active learning strategies significantly change in the cross-domain recommendation scenario in comparison to the single-domain recommendation. Hence, the presence of the auxiliary domain may strongly influence the performance of the active learning strategies. Indeed, while a certain active learning strategy performs the best for MAE reduction in the single scenario (i.e., highest-predicted strategy), it actually performs poor in the cross-domain scenario. On the other hand, the strategy with the worst MAE in single-domain scenario (i.e., lowest-predicted strategy) can perform excellent in the cross-domain scenario. This is an interesting observation which indicates the importance of further analysis of these two scenarios in order to better design and develop active learning strategies for them. Our future work includes the further analysis of the AL strategies in other domains such as book, electronic products, tourism, etc. Moreover, we plan to investigate the potential impact of considering different rating prediction models (e.g., context-aware models) on the performance of different active learning strategies.

Além do aprendizado ativo em Sistemas de Recomendação de domínio cruzado

MEBoost – Novo método para seleção de variáveis

Um dos campos bem pouco explorados em termos acadêmicos é sem sombra de dúvidas a parte de seleção de variáveis. Esse paper trás um pouco de luz sobre esse assunto tão importante e que drena parte do tempo produtivo de Data Scientists.

MEBoost: Variable Selection in the Presence of Measurement Error – Benjamin Brown, Timothy Weaver, Julian Wolfson

Abstract:  We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, MEBoost, follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluated our method and compare its performance to the recently-proposed Convex Conditioned Lasso (CoCoLasso) and to the “naive” Lasso which does not correct for measurement error. Increasing the degree of measurement error increased prediction error and decreased the probability of accurate covariate selection, but this loss of accuracy was least pronounced when using MEBoost. We illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition where several variables are based on self-report and hence measured with error.

Conclusions: We examined the variable selection problem in regression when the number of potential covariates is large compared to the sample size and when these potential covariates are measured with measurement error. We proposed MEBoost, a computationally simple descent-based approach which follows a path determined by measurement error-corrected estimating equations. We compared MEBoost, via simulation and in a real data example, with the recently-proposed Convex Conditioned Lasso (CoCoLasso) as well as the naive Lasso which assumes that covariates are measured without error. In almost all simulation scenarios, MEBoost performed best in terms of prediction error and coefficient bias. The CoCoLasso is more conservative with the highest specificity in each case, but sensitivity and prediction are better with MEBoost. In the comparison of selection paths, we saw that MEBoost was more aggressive in identifying variables to be included in the model more quickly than the CoCoLasso. These differences were most apparent when the measurement error had a larger variance and a more complex correlation structure. In addition, MEBoost was 7 times faster than the CoCoLasso. One application of MEBoost took 0.04 seconds versus 0.28 seconds for the CoCoLasso. MEBoost, while a promising approach, has some limitations. One limitation–which is shared with many methods that correct for measurement error–is that we assume that the covariance matrix of the measurement error process is known, an assumption which in many settings may be unrealistic. In some cases, it may be possible to estimate these structures using external data sources, but absent such data one could perform a sensitivity analysis with different measurement error variances and correlation structures, as we demonstrate in the real data application. Another challenging aspect of model selection with error-prone covariates is that, even if the set of candidate models is generated via a technique which accounts for measurement error, the process of selecting a final model (e.g., via cross-validation) still uses covariates that are measured with error. However, we showed in our simulation study that MEBoost performs well in selecting a model which recovers the relationship between the true (error-free) covariates and the outcome, even when using error-prone covariates to select the final model. This finding suggests that the procedure for generating a “path” of candidate models has a greater influence on prediction error and variable selection accuracy than the procedure picking a final model from among those candidates. To conclude, we note that while we only considered linear and Poisson regression in this paper, MEBoost can easily be applied to other regression models by, e.g., using the estimating equations presented by Nakamura (1990) or others which correct for measurement error. In contrast, the approaches of Sørensen et al. (2012) and Datta and Zou (2017) exploit the structure of the linear regression model and it is not obvious how they could be extended to the broader family of generalized linear models. The robustness and simplicity of MEBoost, along with its strong performance against other methods in the linear model case suggests that this novel method is a reliable way to deal with variable selection in the presence of measurement error.

MEBoost – Novo método para seleção de variáveis

Regressão com instâncias corrompidas: Uma abordagem robusta e suas aplicações

Trabalho interessante.

Multivariate Regression with Grossly Corrupted Observations: A Robust Approach and its Applications – Xiaowei Zhang, Chi Xu, Yu Zhang, Tingshao Zhu, Li Cheng

Abstract: This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or is missing, and the magnitudes and locations of such occurrences are unknown in priori. To deal with this problem, we propose a new approach by explicitly consider the error source as well as its sparseness nature. An interesting property of our approach lies in its ability of allowing individual regression output elements or tasks to possess their unique noise levels. Moreover, despite working with a non-smooth optimization problem, our approach still guarantees to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, empirically our approach has been validated with very promising results on two exemplar real-world applications: The first concerns the prediction of \textit{Big-Five} personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and comparison methods as well as the involved datasets are made publicly available in support of the open-source and reproducible research initiatives.

Conclusions: We consider a new approach dedicating to the multivariate regression problem where some output labels are either corrupted or missing. The gross error is explicitly addressed in our model, while it allows the adaptation of distinct regression elements or tasks according to their own noise levels. We further propose and analyze the convergence and runtime properties of the proposed proximal ADMM algorithm which is globally convergent and efficient. The model combined with the specifically designed solver enable our approach to tackle a diverse range of applications. This is practically demonstrated on two distinct applications, that is, to predict personalities based on behaviors at SNSs, as well as to estimation 3D hand pose from single depth images. Empirical experiments on synthetic and real datasets have showcased the applicability of our approach in the presence of label noises. For future work, we plan to integrate with more advanced deep learning techniques to better address more practical problems, including 3D hand pose estimation and beyond.

Regressão com instâncias corrompidas: Uma abordagem robusta e suas aplicações

Feature Screening in Large Scale Cluster Analysis

Mais trabalhos sobre clustering.

Feature Screening in Large Scale Cluster Analysis – Trambak Banerjee, Gourab Mukherjee, Peter Radchenko

Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the “noise” features. These bounds imply perfect screening of non-informative features with high probability and are derived via careful analysis of the empirical processes corresponding to the clustering trees that are constructed for each of the features by the associated clustering procedure. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches, popularly used in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.

Conclusions: We propose COSCI, a novel feature screening method for large scale cluster analysis problems that are characterized by both large sample sizes and high dimensionality of the observations. COSCI efficiently ranks the candidate features in a non-parametric fashion and, under mild regularity conditions, is robust to the distributional form of the true noise coordinates. We establish theoretical results supporting ideal feature screening properties of our proposed procedure and provide a data driven approach for selecting the screening threshold parameter. Extensive simulation experiments and real data studies demonstrate encouraging performance of our proposed approach. An interesting topic for future research is extending our marginal screening method by means of utilizing multivariate objective criteria, which are more potent in detecting multivariate cluster information among marginally unimodal features. Preliminary analysis of the corresponding `2 fusion penalty based criterion, which, unlike the `1 based approach used in this paper, is non-separable across dimensions, suggests that this criterion can provide a way to move beyond marginal screening.

Feature Screening in Large Scale Cluster Analysis

Deterministic quantum annealing expectation-maximization (DQAEM)

Apesar do nome bem complicado o paper fala de uma modificação do mecanismo do algoritmo de cluster Expectation-Maximization (EM) em que o mesmo tem o incremento de uma meta-heurísica similar ao Simulated Annealing (arrefecimento simulado) para eliminar duas deficiências do EM que é de depender muito dos dados de início (atribuições iniciais) e o fato de que as vezes há problemas de mínimos locais.

Relaxation of the EM Algorithm via Quantum Annealing for Gaussian Mixture Models

Abstract: We propose a modified expectation-maximization algorithm by introducing the concept of quantum annealing, which we call the deterministic quantum annealing expectation-maximization (DQAEM) algorithm. The expectation-maximization (EM) algorithm is an established algorithm to compute maximum likelihood estimates and applied to many practical applications. However, it is known that EM heavily depends on initial values and its estimates are sometimes trapped by local optima. To solve such a problem, quantum annealing (QA) was proposed as a novel optimization approach motivated by quantum mechanics. By employing QA, we then formulate DQAEM and present a theorem that supports its stability. Finally, we demonstrate numerical simulations to confirm its efficiency.

Conclusion: In this paper, we have proposed the deterministic quantum annealing expectation-maximization (DQAEM) algorithm for Gaussian mixture models (GMMs) to relax the problem of local optima of the expectation-maximization (EM) algorithm by introducing the mechanism of quantum fluctuations into EM. Although we have limited our attention to GMMs in this paper to simplify the discussion, the derivation presented in this paper can be straightforwardly applied to any models which have discrete latent variables. After formulating DQAEM, we have presented the theorem that guarantees its convergence. We then have given numerical simulations to show its efficiency compared to EM and DSAEM. It is expect that the combination of DQAEM and DSAEM gives better performance than DQAEM. Finally, one of our future works is a Bayesian extension of this work. In other words, we are going to propose a deterministic quantum annealing variational Bayes inference.

Deterministic quantum annealing expectation-maximization (DQAEM)

Modularização do Morfismo de Redes Neurais

Quem foi que disse que não podem ocorrer alterações morfológicas nas arquiteturas/topologias de Redes Neurais?

Modularized Morphing of Neural Networks – Tao Wei, Changhu Wang, Chang Wen Chen

Abstract: In this work we study the problem of network morphism, an effective learning scheme to morph a well-trained neural network to a new one with the network function completely preserved. Different from existing work where basic morphing types on the layer level were addressed, we target at the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network. To simplify the representation of a network, we abstract a module as a graph with blobs as vertices and convolutional layers as edges, based on which the morphing process is able to be formulated as a graph transformation problem. Two atomic morphing operations are introduced to compose the graphs, based on which modules are classified into two families, i.e., simple morphable modules and complex modules. We present practical morphing solutions for both of these two families, and prove that any reasonable module can be morphed from a single convolutional layer. Extensive experiments have been conducted based on the state-of-the-art ResNet on benchmark datasets, and the effectiveness of the proposed solution has been verified.

Conclusions: This paper presented a systematic study on the problem of network morphism at a higher level, and tried to answer the central question of such learning scheme, i.e., whether and how a convolutional layer can be morphed into an arbitrary module. To facilitate the study, we abstracted a modular network as a graph, and formulated the process of network morphism as a graph transformation process. Based on this formulation, both simple morphable modules and complex modules have been defined and corresponding morphing algorithms have been proposed. We have shown that a convolutional layer can be morphed into any module of a network. We have also carried out experiments to illustrate how to achieve a better performing model based on the state-of-the-art ResNet with minimal extra computational cost on benchmark datasets.

Modularização do Morfismo de Redes Neurais

Akid: Uma biblioteca de Redes Neurais para pesquisa e produção

Finalmente começaram a pensar em eliminar esse vale entre ciência/academia e indústria.

Akid: A Library for Neural Network Research and Production from a Dataism Approach – Shuai Li
Abstract: Neural networks are a revolutionary but immature technique that is fast evolving and heavily relies on data. To benefit from the newest development and newly available data, we want the gap between research and production as small as possibly. On the other hand, differing from traditional machine learning models, neural network is not just yet another statistic model, but a model for the natural processing engine — the brain. In this work, we describe a neural network library named {\texttt akid}. It provides higher level of abstraction for entities (abstracted as blocks) in nature upon the abstraction done on signals (abstracted as tensors) by Tensorflow, characterizing the dataism observation that all entities in nature processes input and emit out in some ways. It includes a full stack of software that provides abstraction to let researchers focus on research instead of implementation, while at the same time the developed program can also be put into production seamlessly in a distributed environment, and be production ready. At the top application stack, it provides out-of-box tools for neural network applications. Lower down, akid provides a programming paradigm that lets user easily build customized models. The distributed computing stack handles the concurrency and communication, thus letting models be trained or deployed to a single GPU, multiple GPUs, or a distributed environment without affecting how a model is specified in the programming paradigm stack. Lastly, the distributed deployment stack handles how the distributed computing is deployed, thus decoupling the research prototype environment with the actual production environment, and is able to dynamically allocate computing resources, so development (Devs) and operations (Ops) could be separated. 

Akid: Uma biblioteca de Redes Neurais para pesquisa e produção

Para quem quiser saber um pouco mais das evoluções em relação a aplicação de aprendizado por reforço  e Deep Learning em sistemas autônomos, esse paper é uma boa pedida.

Learning to Drive using Inverse Reinforcement Learning and Deep Q-Networks

Abstract: We propose an inverse reinforcement learning (IRL) approach using Deep QNetworks to extract the rewards in problems with large state spaces. We evaluate the performance of this approach in a simulation-based autonomous driving scenario. Our results resemble the intuitive relation between the reward function and readings of distance sensors mounted at different poses on the car. We also show that, after a few learning rounds, our simulated agent generates collision-free motions and performs human-like lane change behaviour.

Conclusions: In this paper we proposed using Deep Q-Networks as the refinement step in Inverse Reinforcement Learning approaches. This enabled us to extract the rewards in scenarios with large state spaces such as driving, given expert demonstrations. The aim of this work was to extend the general approach to IRL. Exploring more advanced methods like Maximum Entropy IRL and the support for nonlinear reward functions is currently under investigation.