Data Science: How regulators, professors, and practitioners are getting it wrong

This DataRobot post is one of those that shows how the evolution of Big Data platforms, combined with a larger computational and predictive arsenal, is sweeping away any bullshit dressed up in technicalities when it comes to Data Science.

I'm reproducing it in full because it is worth having on hand whenever you need to explain to some number-crunching bureaucrat (I won't name names, given the butthurt that would cause) why nobody cares about p-values, hypothesis tests, and the like in an era of abundant data, and above all why statistical significance is dying.

“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”  ASA Statement on Statistical Significance and p-Values

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach to building predictive models.  Unfortunately, I can’t claim innocence in this matter.  I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics.  I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model?  How about an F-test?

Do my two samples have different means?  Student’s t-test!

Does my model fit my data?  Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated?  How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…

These tests are all based on various theoretical assumptions.  If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

Here’s why statistical significance is a waste of time


If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation.  The procedure for doing this is pretty simple.  First, you make some assumptions about independence and data distributions, and variance, and so on.  Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression.  Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality.  If all these assumptions aren’t met, then none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What are you to do if your assumptions are invalid?  In practice, the usual approach is to wave your hands about “robustness” or some such thing and then continue along the same path.

If your data is big enough, EVERYTHING is significant

“The primary product of a research inquiry is one or more measures of effect size, not P values.” Jacob Cohen

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant.  On one hand, this makes intuitive sense.  For example, the larger a dataset is, the more likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected.  On the other hand, for many assumption validity tests (e.g., tests for constant variance), statistical significance indicates invalid assumptions.  So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.
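The effect of sample size on significance can be illustrated with a small simulation. This is a minimal sketch, not part of the original post: the effect size of 0.05 standard deviations, the sample sizes, and the helper names `two_sample_z_pvalue` and `simulate` are all assumptions chosen for illustration, and the test uses a normal approximation rather than any specific test the author had in mind.

```python
import math
import random


def two_sample_z_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    if std_err == 0.0:
        return 1.0  # no variation at all: nothing to distinguish
    z = (mean_a - mean_b) / std_err
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


def simulate(n, effect=0.05, seed=0):
    """Draw two groups whose true means differ by a tiny `effect`."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n)]
    treated = [rng.gauss(effect, 1.0) for _ in range(n)]
    return two_sample_z_pvalue(treated, control)


# The true effect never changes; only the sample size does.
for n in (100, 10_000, 200_000):
    print(n, simulate(n))
```

With a fixed, practically negligible effect, the p-value is driven toward zero simply by collecting more rows, which is the point the paragraph above makes.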

Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work).  No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions.  To make matters worse, it’s a never ending task.  Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again.  And that’s assuming that the dozens of validity tests don’t give you inconsistent results.  It’s a gigantic waste of resources because there is a better way.

You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results.  Andrew Gelman observed that most modelers have a (perverse) incentive to produce statistically significant results, even at the expense of reality.  It’s hardly surprising that these techniques exist, given the pressure to produce valuable data driven solutions.  This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.
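One way to see how easily "significant" results appear is to test many pure-noise features and count how many cross the 0.05 threshold. The sketch below is an illustration only: the number of features, the group sizes, and the function names are assumptions, and the z-test is a normal approximation.

```python
import math
import random


def z_test_pvalue(a, b):
    """Two-sided p-value for a difference in means (normal approximation)."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    z = (mean_a - mean_b) / math.sqrt(var_a / len(a) + var_b / len(b))
    return math.erfc(abs(z) / math.sqrt(2))


def count_false_positives(n_features=200, n_per_group=50, alpha=0.05, seed=0):
    """Test many noise-only features; every 'hit' is a false positive."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_features):
        # Both groups come from the same distribution: there is no real effect.
        group_a = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        group_b = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        if z_test_pvalue(group_a, group_b) < alpha:
            hits += 1
    return hits


print(count_false_positives(), "of 200 noise features look 'significant'")
```

Roughly one feature in twenty is expected to pass by chance alone, so a modeler who reports only the hits can present noise as a finding.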

If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America.  Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on.  To the extent that any of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated.  In other words, statistical significance will give the impression that a gender gap exists, when it may not — simply due to model misspecification.
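The omitted-variable effect described above can be reproduced on synthetic data. This is a hedged sketch with made-up numbers: `group`, `hours`, and `pay` are hypothetical variables, pay depends only on hours by construction, and the adjusted coefficient is obtained by residualizing on the omitted variable (the Frisch-Waugh-Lovell approach) rather than by any method named in the post.

```python
import random


def slope(x, y):
    """OLS slope of y on x (model with an intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def residuals(y, x):
    """Residuals of y after regressing it on x."""
    b = slope(x, y)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return [(yi - my) - b * (xi - mx) for xi, yi in zip(x, y)]


def simulate(n=20_000, seed=1):
    rng = random.Random(seed)
    group = [rng.randint(0, 1) for _ in range(n)]
    # Hours worked are correlated with group membership...
    hours = [40 + 5 * g + rng.gauss(0, 3) for g in group]
    # ...but pay depends on hours only: there is no direct group effect.
    pay = [2 * h + rng.gauss(0, 5) for h in hours]
    naive = slope(group, pay)  # hours omitted from the model
    adjusted = slope(residuals(group, hours), residuals(pay, hours))
    return naive, adjusted


naive, adjusted = simulate()
print("group coefficient, hours omitted: ", round(naive, 2))
print("group coefficient, hours included:", round(adjusted, 2))
```

When the correlated variable is left out, the group coefficient absorbs its effect; once it is included, the apparent gap in this synthetic setup largely disappears.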

Only out-of-sample accuracy matters

Whether or not results are statistically significant is the wrong question.  The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data.  Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place.  Fraud models that do a good job predicting fraud actually prevent losses.  Underwriting models that accurately segment credit risk really do increase profits.  Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework.  Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox.  They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.
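As a concrete illustration of out-of-sample evaluation, here is a minimal k-fold cross-validation sketch in plain Python. The synthetic data, the two candidate models (`fit_mean` and `fit_line`), and the helper `k_fold_mse` are assumptions made for this example, not code from the original post.

```python
import random


def fit_mean(xs, ys):
    """Baseline model: always predict the training mean."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """Simple linear regression fitted by least squares."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x


def k_fold_mse(xs, ys, fit, k=5, seed=0):
    """Mean squared error measured only on rows the model never saw."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors.extend((ys[i] - model(xs[i])) ** 2 for i in held_out)
    return sum(errors) / len(errors)


rng = random.Random(42)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1) for x in xs]
print("mean-only model:", k_fold_mse(xs, ys, fit_mean))
print("linear model:   ", k_fold_mse(xs, ys, fit_line))
```

The comparison is made purely on held-out error, with no p-value involved: the model that predicts unseen rows better wins.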

A survey of these methods is out of scope, but let me close with a final point.  Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable.  Custom coded solutions of any kind are inherently error prone, even for the most experienced data scientist.

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.’s are too slow and expensive to develop and maintain.  Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and much cheaper than other approaches.

By Greg Michaelson, Director – DataRobot Labs


Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Abstract: In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient ($R^2$) between model and predicted values.
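For readers unfamiliar with the two metrics the abstract reports, the sketch below shows one common way to compute them. The function names and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be nonzero)."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical monthly sales and forecasts, for illustration only.
sales = [100.0, 200.0, 150.0, 300.0]
forecast = [110.0, 180.0, 160.0, 290.0]
print("MAPE (%):", mape(sales, forecast))
print("Pearson r:", pearson_r(sales, forecast))
```

A lower MAPE and a higher correlation both indicate forecasts that track the observed series more closely.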


Multiple Correspondence Analysis in R for the churn problem

Via Data Science Plus

Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended analytics approach is to first address the redundancy; which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with other variable groups in the same data set. On the other hand, relevancy is about potential predictor variables and involves understanding the relationship between the target variable and input variables.
Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for finding and constructing a low-dimensional visual representation of variable associations among groups of categorical variables. Variable clustering as a tool for identifying redundancy is often applied to get a first impression of variable associations and multivariate data structure.
The motivations of this post are to illustrate the applications of: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for business insights of customer churn data, and 3) variable clustering of categorical variables for the identification of redundant variables.


Interpreting the odds ratio

Matt Bogard of Econometric Sense gives a tip on how to interpret this number:

From the basic probabilities above, we know that the probability of event Y is greater for males than females. The odds of event Y are also greater for males than females. These relationships are also reflected in the odds ratios. The odds of event Y for males are 3 times the odds for females. The odds of event Y for females are only .33 times the odds for males. In other words, the odds of event Y for males are greater and the odds of event Y for females are less.

This can also be seen from the formula for odds ratios. If the OR M vs F  = odds(M)/odds(F), we can see that if the odds (M) > odds(F), the odds ratio will be greater than 1. Alternatively, for OR  F vs M = odds(F)/odds(M), we can see that if the odds(F) < odds(M) then the ratio will be less than 1.  If the odds for both groups are equal, the odds ratio will be 1 exactly.
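The formula above can be checked with a short worked example. The counts below are hypothetical (the original probabilities are not reproduced in this excerpt); they were chosen so that the male-versus-female odds ratio comes out to 3, matching the quoted interpretation.

```python
def odds(events, non_events):
    """Odds of an event from counts: events divided by non-events."""
    return events / non_events


def odds_ratio(odds_a, odds_b):
    """Odds ratio of group A versus group B."""
    return odds_a / odds_b


# Hypothetical counts: 60 of 100 males and 25 of 75 females experience event Y.
odds_m = odds(60, 40)   # 1.5
odds_f = odds(25, 50)   # 0.5

print("OR, M vs F:", odds_ratio(odds_m, odds_f))
print("OR, F vs M:", odds_ratio(odds_f, odds_m))
print("OR, equal odds:", odds_ratio(odds_m, odds_m))
```

The two ratios are reciprocals of each other, and equal odds give a ratio of exactly 1, as the text states.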

RELATION TO LOGISTIC REGRESSION

 Odds ratios can be obtained from logistic regression by exponentiating the coefficient or beta for a given explanatory variable.  For categorical variables, the odds ratios are interpreted as above. For continuous variables, odds ratios are in terms of changes in odds as a result of a one-unit change in the variable.
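The link between a logistic regression coefficient and the odds ratio can be verified numerically. In this sketch the intercept and coefficient are hypothetical values, not estimates from any data; the point is only that a one-unit increase in x multiplies the odds by exp(beta).

```python
import math


def logistic(z):
    """Inverse logit: maps a linear predictor to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def model_odds(intercept, beta, x):
    """Odds implied by a logistic model with one explanatory variable."""
    p = logistic(intercept + beta * x)
    return p / (1.0 - p)


intercept = -1.0        # hypothetical
beta = math.log(3.0)    # hypothetical coefficient, chosen so that exp(beta) = 3

ratio = model_odds(intercept, beta, 5.0) / model_odds(intercept, beta, 4.0)
print("odds(x=5) / odds(x=4):", ratio)
print("exp(beta):            ", math.exp(beta))
```

The ratio is the same for any pair of x values one unit apart, which is why exp(beta) can be reported as a single odds ratio for a continuous variable.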


Failing to prepare is preparing to fail…

An old topic, but one worth recalling whenever possible:

Given this context, it is curious to note that so much of what is published (again, especially on-line; think of titles such as: “The 10 Learning Algorithms Every Data Scientist Must Know”) and so many job listings emphasize, almost to the point of exclusivity, learning algorithms, as opposed to practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of the technical one) or ability to deploy the final product.

 


Deep Learning AMI Amazon Web Services

For those who want to scale Machine Learning processing and don't have the money to buy GPUs, Amazon's Deep Learning AMI is a great alternative in terms of cost.

The Deep Learning AMI is an Amazon Linux image supported and maintained by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2). It is designed to provide a stable, secure, and high performance execution environment for deep learning applications running on Amazon EC2. It includes popular deep learning frameworks, including MXNet, Caffe, Tensorflow, Theano, CNTK and Torch as well as packages that enable easy integration with AWS, including launch configuration tools and many popular AWS libraries and tools. It also includes the Anaconda Data Science Platform for Python2 and Python3. Amazon Web Services provides ongoing security and maintenance updates to all instances running the Amazon Linux AMI. The Deep Learning AMI is provided at no additional charge to Amazon EC2 users.

The AMI Ids for the Deep Learning Amazon Linux AMI are the following:
us-east-1 : ami-e7c96af1
us-west-2: ami-dfb13ebf
eu-west-1: ami-6e5d6808

Release tags/Branches used for the DL Frameworks:
MXNet : v0.9.3 tag
Tensorflow : v1.0.0 tag
Theano : rel-0.8.2 tag
Caffe : rc5 tag
CNTK : v2.0beta12.0 tag
Torch : master branch
Keras : 1.2.2 tag


Machine Learning tool – MLJAR

For those looking for a paid option for running Machine Learning outside their own infrastructure, MLJAR may be the answer.

WHAT IS MLJAR?

MLJAR is a human-first platform for machine learning.
It provides a service for prototyping, development and deploying pattern recognition algorithms.
It makes algorithm search and tuning painless!

HOW IT WORKS?

You pay for computational time used for model training, predictions and data analysis. 1 credit is 1 computation hour on a machine with 8 CPU and 15GB RAM. Computational time is aggregated on a per-second basis.
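The billing rule quoted above reduces to simple arithmetic. The sketch below assumes nothing beyond what the description states (1 credit per computation hour, metered per second); the function name and the example durations are hypothetical.

```python
SECONDS_PER_CREDIT = 3600  # 1 credit = 1 computation hour, per the description above


def credits_used(seconds):
    """Credits consumed by a job that ran for `seconds` of compute time."""
    return seconds / SECONDS_PER_CREDIT


# Hypothetical jobs: a 90-minute training run and a 45-second prediction call.
print(credits_used(90 * 60))
print(credits_used(45))
```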


Failures of the Deep Learning approach: architectures and meta-parameterization

The biggest challenge the industry currently faces with Deep Learning is without a doubt computational: the whole market is both adopting cloud services to run ever more complex calculations and investing in GPU computing capacity.

However, even with hardware now a commodity, academia is working on a problem that could revolutionize how Deep Learning is done: architecture and parameterization.

This comment from the thread says a lot about the problem:

The main problem I see with Deep Learning: too many parameters.

When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.

In other words, even with all the hardware available, the question of what the best architectural arrangement for a deep neural network is remains unsolved.
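The "too many parameters" complaint can be made concrete by counting how many configurations even a modest grid search has to visit. The search space below is entirely hypothetical; the names and values are assumptions for illustration and do not come from the thread or the paper.

```python
from itertools import product
from math import prod

# Hypothetical search space for a deep network (illustrative values only).
search_space = {
    "learning_rate": [1e-1, 1e-2, 1e-3, 1e-4],
    "num_layers": [2, 4, 8, 16],
    "units_per_layer": [64, 128, 256, 512],
    "activation": ["relu", "tanh", "sigmoid"],
    "dropout": [0.0, 0.25, 0.5],
    "batch_size": [32, 64, 128],
}


def grid_size(space):
    """Number of configurations in an exhaustive grid over `space`."""
    return prod(len(values) for values in space.values())


def iter_grid(space):
    """Yield every configuration as a dict, one per combination of values."""
    keys = list(space)
    for combo in product(*(space[k] for k in keys)):
        yield dict(zip(keys, combo))


print("configurations to train:", grid_size(search_space))
print("first configuration:", next(iter_grid(search_space)))
```

Each configuration means a full training run, and adding one more hyperparameter multiplies the total again, which is why the search itself becomes an optimization problem.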

The paper “Failures of Deep Learning” by Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah lays out this problem in rich detail, including experiments (this is the repository on Git).

The authors identify the following failure points of Deep Learning networks: a) limitations of gradient-based methods for parameter optimization, b) structural problems in how Deep Learning algorithms decompose problems, c) architecture, and d) saturation of activation functions.
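Point d) can be illustrated with the sigmoid function, whose gradient shrinks rapidly as its input moves away from zero. This is a generic textbook illustration, not one of the experiments from the paper; the input values are arbitrary.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_grad(x):
    """Derivative of the sigmoid: s(x) * (1 - s(x)), at most 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x = {x:5.1f}   gradient = {sigmoid_grad(x):.6f}")

# Even in the best case (0.25 per layer), gradients multiplied through
# ten stacked sigmoid layers shrink by about six orders of magnitude.
print("best-case factor over 10 layers:", 0.25 ** 10)
```

Once units sit in the flat region, gradient-based updates carry almost no signal, which slows or stalls training regardless of how much hardware is available.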

In other words, in a large share of Deep Learning applications, convergence time could be much shorter still if these issues were already solved.

With these solved, much of what we know today as the hardware industry for Deep Learning networks would either be heavily underused (i.e., given the improvement in architectural and algorithmic optimization) or could be put to work on more complex tasks (e.g., recognizing images with a low pixel count).

So even with a hardware-driven methodology, as the industry has been adopting, there is still plenty of room to optimize Deep Learning networks from an architectural and algorithmic standpoint.

Below is a list of references straight from Stack Exchange for those who want to dig deeper into the subject:

Neuroevolutionary algorithms:

Reinforcement learning:

Miscellaneous:

PS: WordPress removed the option to justify text, so apologies in advance for the blog's amateur look over the next few days.

 


Beyond active learning in cross-domain Recommender Systems

One of the most common problems in Recommender Systems is the well-known Cold Start (i.e., when there is no prior knowledge about the tastes of someone who has just joined the platform).

This paper brings an interesting perspective on the subject.

Toward Active Learning in Cross-domain Recommender Systems – Roberto Pagano, Massimo Quadrana, Mehdi Elahi, Paolo Cremonesi

Abstract: One of the main challenges in Recommender Systems (RSs) is the New User problem which happens when the system has to generate personalised recommendations for a new user whom the system has no information about. Active Learning tries to solve this problem by acquiring user preference data with the maximum quality, and with the minimum acquisition cost. Although there are variety of works in active learning for RSs research area, almost all of them have focused only on the single-domain recommendation scenario. However, several real-world RSs operate in the cross-domain scenario, where the system generates recommendations in the target domain by exploiting user preferences in both the target and auxiliary domains. In such a scenario, the performance of active learning strategies can be significantly influenced and typical active learning strategies may fail to perform properly. In this paper, we address this limitation, by evaluating active learning strategies in a novel evaluation framework, explicitly suited for the cross-domain recommendation scenario. We show that having access to the preferences of the users in the auxiliary domain may have a huge impact on the performance of active learning strategies w.r.t. the classical, single-domain scenario.

Conclusions: In this paper, we have evaluated several widely used active learning strategies adopted to tackle the cold-start problem in a novel usage scenario, i.e., Cross-domain recommendation scenario. In such a case, the user preferences are available not only in the target domain, but also in additional auxiliary domain. Hence, the active learner can exploit such knowledge to better estimate which preferences are more valuable for the system to acquire. Our results have shown that the performance of the considered active learning strategies significantly change in the cross-domain recommendation scenario in comparison to the single-domain recommendation. Hence, the presence of the auxiliary domain may strongly influence the performance of the active learning strategies. Indeed, while a certain active learning strategy performs the best for MAE reduction in the single scenario (i.e., highest-predicted strategy), it actually performs poor in the cross-domain scenario. On the other hand, the strategy with the worst MAE in single-domain scenario (i.e., lowest-predicted strategy) can perform excellent in the cross-domain scenario. This is an interesting observation which indicates the importance of further analysis of these two scenarios in order to better design and develop active learning strategies for them. Our future work includes the further analysis of the AL strategies in other domains such as book, electronic products, tourism, etc. Moreover, we plan to investigate the potential impact of considering different rating prediction models (e.g., context-aware models) on the performance of different active learning strategies.


Stoic Ethics for artificial agents

Who would have thought that Stoic philosophy would be adapted to Artificial Intelligence?

Stoic Ethics for Artificial Agents – Gabriel Murray

Abstract: We present a position paper advocating the notion that Stoic philosophy and ethics can inform the development of ethical A.I. systems. This is in sharp contrast to most work on building ethical A.I., which has focused on Utilitarian or Deontological ethical theories. We relate ethical A.I. to several core Stoic notions, including the dichotomy of control, the four cardinal virtues, the ideal Sage, Stoic practices, and Stoic perspectives on emotion or affect. More generally, we put forward an ethical view of A.I. that focuses more on internal states of the artificial agent rather than on external actions of the agent. We provide examples relating to near-term A.I. systems as well as hypothetical superintelligent agents.

Conclusions: In this position paper, we have attempted to show how Stoic ethics could be applied to the development of ethical A.I. systems. We argued that internal states matter for ethical A.I. agents, and that internal states can be analyzed by describing the four cardinal Stoic virtues in terms of characteristics of an intelligent system. We also briefly described other Stoic practices and how they could be realized by an A.I. agent. We gave a brief sketch of how to start developing Stoic A.I. systems by creating approval-directed agents with Stoic overseers, and/or by employing a syncretic paramedic ethics algorithm with a step featuring Stoic constraints. While it can be beneficial to analyze the ethics of an A.I. agent from several different perspectives, including consequentialist perspectives, we have argued for the importance of also conducting a Stoic ethical analysis of A.I. agents, where the agent’s internal states are analyzed, and moral judgments are not based on consequences outside of the agent’s control.


MEBoost – A new method for variable selection

One of the fields that is little explored academically is without a doubt variable selection. This paper sheds some light on this important subject, which drains part of Data Scientists' productive time.

MEBoost: Variable Selection in the Presence of Measurement Error – Benjamin Brown, Timothy Weaver, Julian Wolfson

Abstract:  We present a novel method for variable selection in regression models when covariates are measured with error. The iterative algorithm we propose, MEBoost, follows a path defined by estimating equations that correct for covariate measurement error. Via simulation, we evaluated our method and compare its performance to the recently-proposed Convex Conditioned Lasso (CoCoLasso) and to the “naive” Lasso which does not correct for measurement error. Increasing the degree of measurement error increased prediction error and decreased the probability of accurate covariate selection, but this loss of accuracy was least pronounced when using MEBoost. We illustrate the use of MEBoost in practice by analyzing data from the Box Lunch Study, a clinical trial in nutrition where several variables are based on self-report and hence measured with error.

Conclusions: We examined the variable selection problem in regression when the number of potential covariates is large compared to the sample size and when these potential covariates are measured with measurement error. We proposed MEBoost, a computationally simple descent-based approach which follows a path determined by measurement error-corrected estimating equations. We compared MEBoost, via simulation and in a real data example, with the recently-proposed Convex Conditioned Lasso (CoCoLasso) as well as the naive Lasso which assumes that covariates are measured without error. In almost all simulation scenarios, MEBoost performed best in terms of prediction error and coefficient bias. The CoCoLasso is more conservative with the highest specificity in each case, but sensitivity and prediction are better with MEBoost. In the comparison of selection paths, we saw that MEBoost was more aggressive in identifying variables to be included in the model more quickly than the CoCoLasso. These differences were most apparent when the measurement error had a larger variance and a more complex correlation structure. In addition, MEBoost was 7 times faster than the CoCoLasso. One application of MEBoost took 0.04 seconds versus 0.28 seconds for the CoCoLasso. MEBoost, while a promising approach, has some limitations. One limitation–which is shared with many methods that correct for measurement error–is that we assume that the covariance matrix of the measurement error process is known, an assumption which in many settings may be unrealistic. In some cases, it may be possible to estimate these structures using external data sources, but absent such data one could perform a sensitivity analysis with different measurement error variances and correlation structures, as we demonstrate in the real data application. 
Another challenging aspect of model selection with error-prone covariates is that, even if the set of candidate models is generated via a technique which accounts for measurement error, the process of selecting a final model (e.g., via cross-validation) still uses covariates that are measured with error. However, we showed in our simulation study that MEBoost performs well in selecting a model which recovers the relationship between the true (error-free) covariates and the outcome, even when using error-prone covariates to select the final model. This finding suggests that the procedure for generating a “path” of candidate models has a greater influence on prediction error and variable selection accuracy than the procedure picking a final model from among those candidates. To conclude, we note that while we only considered linear and Poisson regression in this paper, MEBoost can easily be applied to other regression models by, e.g., using the estimating equations presented by Nakamura (1990) or others which correct for measurement error. In contrast, the approaches of Sørensen et al. (2012) and Datta and Zou (2017) exploit the structure of the linear regression model and it is not obvious how they could be extended to the broader family of generalized linear models. The robustness and simplicity of MEBoost, along with its strong performance against other methods in the linear model case suggests that this novel method is a reliable way to deal with variable selection in the presence of measurement error.
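To see why covariate measurement error matters, the sketch below simulates a single error-prone covariate and compares a naive regression slope with a classical method-of-moments correction. This is not MEBoost: it is a much simpler textbook correction, shown only to illustrate the bias. Like the paper, it assumes the measurement error variance is known; all numbers and names are hypothetical.

```python
import random


def cov(x, y):
    """Sample covariance (use cov(x, x) for the variance)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def simulate(n=50_000, beta=2.0, sigma_u=1.0, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]        # true covariate (unobserved)
    w = [xi + rng.gauss(0.0, sigma_u) for xi in x]     # what we actually measure
    y = [beta * xi + rng.gauss(0.0, 1.0) for xi in x]  # outcome depends on the truth
    naive = cov(w, y) / cov(w, w)
    # Subtract the (assumed known) error variance from the observed variance.
    corrected = cov(w, y) / (cov(w, w) - sigma_u ** 2)
    return naive, corrected


naive, corrected = simulate()
print("true slope: 2.0")
print("naive slope:    ", round(naive, 3))
print("corrected slope:", round(corrected, 3))
```

The naive slope is pulled toward zero because part of the observed variation in the covariate is noise; correcting for the known error variance recovers an estimate close to the true coefficient.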


Regression with corrupted instances: a robust approach and its applications

Interesting work.

Multivariate Regression with Grossly Corrupted Observations: A Robust Approach and its Applications – Xiaowei Zhang, Chi Xu, Yu Zhang, Tingshao Zhu, Li Cheng

Abstract: This paper studies the problem of multivariate linear regression where a portion of the observations is grossly corrupted or is missing, and the magnitudes and locations of such occurrences are unknown a priori. To deal with this problem, we propose a new approach by explicitly considering the error source as well as its sparseness nature. An interesting property of our approach lies in its ability of allowing individual regression output elements or tasks to possess their unique noise levels. Moreover, despite working with a non-smooth optimization problem, our approach still guarantees to converge to its optimal solution. Experiments on synthetic data demonstrate the competitiveness of our approach compared with existing multivariate regression models. In addition, empirically our approach has been validated with very promising results on two exemplar real-world applications: The first concerns the prediction of Big-Five personality based on user behaviors at social network sites (SNSs), while the second is 3D human hand pose estimation from depth images. The implementation of our approach and comparison methods as well as the involved datasets are made publicly available in support of the open-source and reproducible research initiatives.

Conclusions: We consider a new approach dedicating to the multivariate regression problem where some output labels are either corrupted or missing. The gross error is explicitly addressed in our model, while it allows the adaptation of distinct regression elements or tasks according to their own noise levels. We further propose and analyze the convergence and runtime properties of the proposed proximal ADMM algorithm which is globally convergent and efficient. The model combined with the specifically designed solver enable our approach to tackle a diverse range of applications. This is practically demonstrated on two distinct applications, that is, to predict personalities based on behaviors at SNSs, as well as to estimation 3D hand pose from single depth images. Empirical experiments on synthetic and real datasets have showcased the applicability of our approach in the presence of label noises. For future work, we plan to integrate with more advanced deep learning techniques to better address more practical problems, including 3D hand pose estimation and beyond.


Feature Screening in Large Scale Cluster Analysis

More work on clustering.

Feature Screening in Large Scale Cluster Analysis – Trambak Banerjee, Gourab Mukherjee, Peter Radchenko

Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the “noise” features. These bounds imply perfect screening of non-informative features with high probability and are derived via careful analysis of the empirical processes corresponding to the clustering trees that are constructed for each of the features by the associated clustering procedure. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches, popularly used in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.

Conclusions: We propose COSCI, a novel feature screening method for large scale cluster analysis problems that are characterized by both large sample sizes and high dimensionality of the observations. COSCI efficiently ranks the candidate features in a non-parametric fashion and, under mild regularity conditions, is robust to the distributional form of the true noise coordinates. We establish theoretical results supporting ideal feature screening properties of our proposed procedure and provide a data driven approach for selecting the screening threshold parameter. Extensive simulation experiments and real data studies demonstrate encouraging performance of our proposed approach. An interesting topic for future research is extending our marginal screening method by means of utilizing multivariate objective criteria, which are more potent in detecting multivariate cluster information among marginally unimodal features. Preliminary analysis of the corresponding $\ell_2$ fusion penalty based criterion, which, unlike the $\ell_1$ based approach used in this paper, is non-separable across dimensions, suggests that this criterion can provide a way to move beyond marginal screening.
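The screening recipe described above (score every feature on its own, then threshold the scores) can be sketched in a few lines. This is not the COSCI score: the paper derives its score from a convex clustering tree, while the sketch below swaps in a simple stand-in, the share of a feature's variance explained by its best two-group split. The names `split_score` and `screen_features` and the threshold value are assumptions made for illustration.

```python
def split_score(values):
    """Stand-in clustering score (NOT the COSCI score): fraction of the
    variance explained by the best split of the sorted values in two groups."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    mean = total / n
    total_ss = sum((x - mean) ** 2 for x in xs)
    if total_ss == 0.0:
        return 0.0  # constant feature carries no cluster information
    best = 0.0
    left_sum = 0.0
    for i in range(1, n):
        # Left group holds the i smallest values, right group the rest.
        left_sum += xs[i - 1]
        left_mean = left_sum / i
        right_mean = (total - left_sum) / (n - i)
        between = i * (left_mean - mean) ** 2 + (n - i) * (right_mean - mean) ** 2
        best = max(best, between / total_ss)
    return best


def screen_features(rows, threshold):
    """Score each column independently and keep those above the threshold."""
    n_features = len(rows[0])
    scores = [split_score([row[j] for row in rows]) for j in range(n_features)]
    kept = [j for j, score in enumerate(scores) if score > threshold]
    return kept, scores
```

A clearly bimodal column scores close to 1, while an evenly spread column stays well below it, so a threshold in between discards the second one. Because each feature is scored independently, the work is trivially parallel across features, which is what makes marginal screening attractive at scale.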


Deterministic quantum annealing expectation-maximization (DQAEM)

Despite the rather complicated name, the paper describes a modification of the Expectation-Maximization (EM) clustering algorithm: it adds a metaheuristic similar to Simulated Annealing to address two weaknesses of EM, namely its strong dependence on the starting values (the initial assignments) and its tendency to get trapped in local optima.

Relaxation of the EM Algorithm via Quantum Annealing for Gaussian Mixture Models

Abstract: We propose a modified expectation-maximization algorithm by introducing the concept of quantum annealing, which we call the deterministic quantum annealing expectation-maximization (DQAEM) algorithm. The expectation-maximization (EM) algorithm is an established algorithm to compute maximum likelihood estimates and applied to many practical applications. However, it is known that EM heavily depends on initial values and its estimates are sometimes trapped by local optima. To solve such a problem, quantum annealing (QA) was proposed as a novel optimization approach motivated by quantum mechanics. By employing QA, we then formulate DQAEM and present a theorem that supports its stability. Finally, we demonstrate numerical simulations to confirm its efficiency.

Conclusion: In this paper, we have proposed the deterministic quantum annealing expectation-maximization (DQAEM) algorithm for Gaussian mixture models (GMMs) to relax the problem of local optima of the expectation-maximization (EM) algorithm by introducing the mechanism of quantum fluctuations into EM. Although we have limited our attention to GMMs in this paper to simplify the discussion, the derivation presented in this paper can be straightforwardly applied to any models which have discrete latent variables. After formulating DQAEM, we have presented the theorem that guarantees its convergence. We then have given numerical simulations to show its efficiency compared to EM and DSAEM. It is expected that the combination of DQAEM and DSAEM gives better performance than DQAEM. Finally, one of our future works is a Bayesian extension of this work. In other words, we are going to propose a deterministic quantum annealing variational Bayes inference.
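To make the idea of annealing inside EM concrete, here is a minimal sketch of a classical, temperature-based annealed EM for a one-dimensional Gaussian mixture. It is not the quantum formulation (DQAEM) of the paper: the E-step simply raises the posterior terms to a power `beta` that grows towards 1, which flattens the responsibilities early on and makes the result less sensitive to the starting means. The fixed shared `sigma`, the `betas` schedule, and the function name `annealed_em_1d` are all assumptions made for illustration.

```python
import math


def annealed_em_1d(xs, means, sigma=1.0, betas=(0.1, 0.3, 1.0), iters=50):
    """Annealed EM for a 1-D Gaussian mixture with a fixed, shared sigma.

    Only means and mixing weights are estimated (a simplifying assumption).
    """
    k = len(means)
    n = len(xs)
    means = list(means)
    weights = [1.0 / k] * k
    for beta in betas:  # beta = 1 recovers the ordinary EM update
        for _ in range(iters):
            # E-step: responsibilities tempered by beta.
            resp = []
            for x in xs:
                logs = [
                    beta * (math.log(weights[j]) - (x - means[j]) ** 2 / (2 * sigma ** 2))
                    for j in range(k)
                ]
                top = max(logs)  # subtract the max for numerical stability
                unnorm = [math.exp(v - top) for v in logs]
                z = sum(unnorm)
                resp.append([u / z for u in unnorm])
            # M-step: standard weighted updates.
            for j in range(k):
                nj = max(sum(r[j] for r in resp), 1e-12)
                means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
                weights[j] = max(nj / n, 1e-12)
    return means, weights
```

Starting both means on the same side of the data, for example `annealed_em_1d([-0.5, 0.0, 0.5, 9.5, 10.0, 10.5], [2.0, 3.0])`, still ends with one mean near 0 and the other near 10.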


Distributed K-Means over compressed binary data

And who said K-Means was dead, huh?

Distributed K-means over Compressed Binary Data

Abstract: We consider a network of binary-valued sensors with a fusion center. The fusion center has to perform K-means clustering on the binary data transmitted by the sensors. In order to reduce the amount of data transmitted within the network, the sensors compress their data with a source coding scheme based on LDPC codes. We propose to apply the K-means algorithm directly over the compressed data without reconstructing the original sensors measurements, in order to avoid potentially complex decoding operations. We provide approximated expressions of the error probabilities of the K-means steps in the compressed domain. From these expressions, we show that applying the K-means algorithm in the compressed domain makes it possible to recover the clusters of the original domain. Monte Carlo simulations illustrate the accuracy of the obtained approximated error probabilities, and show that the coding rate needed to perform K-means clustering in the compressed domain is lower than the rate needed to reconstruct all the measurements.

Conclusion: In this paper, we considered a network of sensors which transmit their compressed binary measurements to a fusion center. We proposed to apply the K-means algorithm directly over the compressed data, without reconstructing the sensor measurements. From a theoretical analysis and Monte Carlo simulations, we showed the efficiency of applying K-means in the compressed domain. We also showed that the rate needed to perform K-means on the compressed vectors is lower than the rate needed to reconstruct all the measurements.
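A toy version of "cluster the compressed vectors directly" might look like the sketch below. Each sensor vector is compressed by XOR-ing small groups of bits (a sparse parity-check style map, in the spirit of the LDPC-based coding mentioned in the abstract), and K-means then runs with Hamming distance and majority-vote centroids. The hand-written `checks`, the farthest-first initialization, and the function names are assumptions for illustration and are not taken from the paper.

```python
def compress(bits, checks):
    """Each output bit is the XOR (sum mod 2) of the input bits in one check."""
    return [sum(bits[i] for i in check) % 2 for check in checks]


def hamming(a, b):
    """Number of positions where two binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def kmeans_binary(points, k, iters=10):
    """K-means for binary vectors: Hamming distance, majority-vote centroids."""
    # Deterministic farthest-first initialization (an assumption of this sketch).
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(hamming(p, c) for c in centroids))
        centroids.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid in Hamming distance.
        labels = [min(range(k), key=lambda j: hamming(p, centroids[j])) for p in points]
        # Update step: per-bit majority vote inside each cluster.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [1 if 2 * sum(col) > len(members) else 0
                                for col in zip(*members)]
    return labels, centroids
```

With 8-bit readings scattered around `00000000` and `11111111` and the checks `[(0, 1, 2), (2, 3, 4), (4, 5, 6), (6, 7, 0)]`, every reading shrinks to 4 bits and `kmeans_binary` returns the same labels on the compressed vectors as on the originals. Checks of odd size matter in this toy: XOR-ing an even number of bits would map both all-zeros and all-ones to zero and erase the difference between the two clusters.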


Modularized Morphing of Neural Networks

Who said the architectures/topologies of neural networks cannot undergo morphological changes?

Modularized Morphing of Neural Networks – Tao Wei, Changhu Wang, Chang Wen Chen

Abstract: In this work we study the problem of network morphism, an effective learning scheme to morph a well-trained neural network to a new one with the network function completely preserved. Different from existing work where basic morphing types on the layer level were addressed, we target at the central problem of network morphism at a higher level, i.e., how a convolutional layer can be morphed into an arbitrary module of a neural network. To simplify the representation of a network, we abstract a module as a graph with blobs as vertices and convolutional layers as edges, based on which the morphing process is able to be formulated as a graph transformation problem. Two atomic morphing operations are introduced to compose the graphs, based on which modules are classified into two families, i.e., simple morphable modules and complex modules. We present practical morphing solutions for both of these two families, and prove that any reasonable module can be morphed from a single convolutional layer. Extensive experiments have been conducted based on the state-of-the-art ResNet on benchmark datasets, and the effectiveness of the proposed solution has been verified.

Conclusions: This paper presented a systematic study on the problem of network morphism at a higher level, and tried to answer the central question of such learning scheme, i.e., whether and how a convolutional layer can be morphed into an arbitrary module. To facilitate the study, we abstracted a modular network as a graph, and formulated the process of network morphism as a graph transformation process. Based on this formulation, both simple morphable modules and complex modules have been defined and corresponding morphing algorithms have been proposed. We have shown that a convolutional layer can be morphed into any module of a network. We have also carried out experiments to illustrate how to achieve a better performing model based on the state-of-the-art ResNet with minimal extra computational cost on benchmark datasets.


Applying Deep Learning to relate Pins

It looks like Pinterest is becoming the new powerhouse of Deep Learning applied to images.

Using deep learning to generate Related Pins

We built Pin2Vec to embed all the Pins in a 128-dimension space. First, we label a Pin with all the other Pins someone has saved in his/her activity session, each as a Pin tuple. Pin tuples are used in supervised training to train the embedding matrix for each of the tens of millions of Pins of the vocabulary. We use TensorFlow as the trainer. At serving time, a set of nearest neighbors are found as Related Pins in the space for each of the Pins.

Training data is collected from recent engagement, such as saving or clicking, and a sliding window is applied. Low quality Pins and those not engaged with are removed from training. Then, each Pin is assigned with a unique Pin ID. Within the sliding window, training pairs are extracted such that the first Pin is the example and each of the following Pins is its label. Figure 3 illustrates an example session and training pairs. In our case, you can imagine each user session is a sentence with Pins as words.
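The pair extraction described above can be sketched as follows. The function name `training_pairs`, the `window` parameter (counting the example Pin plus the Pins that follow it), and the choice to let windows shrink at the end of a session are assumptions; the post does not spell out those details.

```python
def training_pairs(session, window):
    """Slide a window over one session; the first Pin in each window is the
    example and every following Pin inside the window is one of its labels."""
    pairs = []
    for start in range(len(session)):
        example = session[start]
        # Pins after the example that still fall inside the window.
        for label in session[start + 1 : start + window]:
            pairs.append((example, label))
    return pairs


# Hypothetical session of four Pin IDs with a window of three Pins.
print(training_pairs(["a", "b", "c", "d"], window=3))
# [('a', 'b'), ('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]
```

This mirrors the word2vec analogy in the text: the session plays the role of a sentence and each pair plays the role of a (word, context word) training example.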

We used a feedforward neural network with a hidden layer of 128 dimensions. Figure 4 shows the architecture. The network is inspired by word2vec. The input vector is a one-hot vector with a size of vocabulary and, in our case, is tens of millions of Pins. The vector is reduced to the 128-dimension vector by multiplying with the hidden layer weight matrix. An eLu activation function is applied after hidden layer. At last the hidden layer output is multiplied with the softmax matrix and a cross-entropy is used to calculate the loss. We sampled 64 negative Pins in loss optimization in lieu of iterating on tens of millions of Pins. We trained the Pin2Vec embedding on machines with 32 cores and 244GB memory.
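As a rough illustration of the forward pass just described, the sketch below computes the loss for a single (example, label) pair in plain Python. It is a simplified stand-in rather than Pinterest's TensorFlow code: multiplying a one-hot vector by the weight matrix is written as a row lookup, negatives are drawn uniformly, and no correction for the sampling distribution is applied. The names `embed`, `out`, and `sampled_softmax_loss` are hypothetical.

```python
import math
import random


def elu(x, alpha=1.0):
    """ELU activation: identity for positive inputs, smooth saturation below."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def sampled_softmax_loss(embed, out, example_id, label_id, num_neg, rng):
    """Cross-entropy over the true label plus `num_neg` sampled negative Pins."""
    # One-hot input times the hidden weight matrix is just a row lookup.
    hidden = [elu(v) for v in embed[example_id]]
    # Sample negatives uniformly from all Pins except the true label.
    candidates = [pin for pin in range(len(out)) if pin != label_id]
    negatives = rng.sample(candidates, num_neg)
    ids = [label_id] + negatives
    logits = [sum(h * w for h, w in zip(hidden, out[pin])) for pin in ids]
    # Numerically stable log-sum-exp; the true label sits at index 0.
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return log_z - logits[0]
```

Scoring only the label and a handful of negatives keeps the cost per training pair proportional to the number of samples (64 in the post) instead of the tens of millions of Pins in the vocabulary.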


Akid: A neural network library for research and production

At last, people are starting to think about closing the gap between science/academia and industry.

Akid: A Library for Neural Network Research and Production from a Dataism Approach – Shuai Li

Abstract: Neural networks are a revolutionary but immature technique that is fast evolving and heavily relies on data. To benefit from the newest development and newly available data, we want the gap between research and production as small as possible. On the other hand, differing from traditional machine learning models, neural network is not just yet another statistic model, but a model for the natural processing engine, the brain. In this work, we describe a neural network library named akid. It provides higher level of abstraction for entities (abstracted as blocks) in nature upon the abstraction done on signals (abstracted as tensors) by Tensorflow, characterizing the dataism observation that all entities in nature process input and emit output in some way. It includes a full stack of software that provides abstraction to let researchers focus on research instead of implementation, while at the same time the developed program can also be put into production seamlessly in a distributed environment, and be production ready. At the top application stack, it provides out-of-box tools for neural network applications. Lower down, akid provides a programming paradigm that lets users easily build customized models. The distributed computing stack handles the concurrency and communication, thus letting models be trained or deployed to a single GPU, multiple GPUs, or a distributed environment without affecting how a model is specified in the programming paradigm stack. Lastly, the distributed deployment stack handles how the distributed computing is deployed, thus decoupling the research prototype environment with the actual production environment, and is able to dynamically allocate computing resources, so development (Devs) and operations (Ops) could be separated.


Hyper-parameter tuning for Support Vector Machines by Estimation of Distribution Algorithms

In the age of Deep Learning, it is always good to see a paper on good old Support Vector Machines. We will have a post about this technique here on the blog soon.

Hyper-Parameter Tuning for Support Vector Machines by Estimation of Distribution Algorithms

Abstract: Hyper-parameter tuning for support vector machines has been widely studied in the past decade. A variety of metaheuristics, such as Genetic Algorithms and Particle Swarm Optimization have been considered to accomplish this task. Notably, exhaustive strategies such as Grid Search or Random Search continue to be implemented for hyper-parameter tuning and have recently shown results comparable to sophisticated metaheuristics. The main reason for the success of exhaustive techniques is due to the fact that only two or three parameters need to be adjusted when working with support vector machines. In this chapter, we analyze two Estimation Distribution Algorithms, the Univariate Marginal Distribution Algorithm and the Boltzmann Univariate Marginal Distribution Algorithm, to verify if these algorithms preserve the effectiveness of Random Search and at the same time make more efficient the process of finding the optimal hyper-parameters without increasing the complexity of Random Search.
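For readers who have not met Estimation of Distribution Algorithms, the sketch below shows the basic loop of a univariate marginal variant with Gaussian marginals: sample candidates, keep the best ones, and refit one independent mean and standard deviation per hyper-parameter. It is an illustration under assumptions, not the chapter's implementation. In particular, `toy_error` is a made-up stand-in for the cross-validated error of an SVM as a function of `log10(C)` and `log10(gamma)`, and the population sizes are arbitrary.

```python
import math
import random


def umda_continuous(objective, dim, pop_size=60, elite=15, generations=40, seed=0):
    """Minimize `objective` with a Gaussian univariate marginal EDA."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [2.0] * dim
    best_x, best_f = None, float("inf")
    for _ in range(generations):
        # Sample every coordinate independently from its own Gaussian.
        pop = [[rng.gauss(mean[d], std[d]) for d in range(dim)]
               for _ in range(pop_size)]
        pop.sort(key=objective)
        score = objective(pop[0])
        if score < best_f:
            best_x, best_f = pop[0], score
        # Refit each marginal on the selected (elite) candidates only.
        top = pop[:elite]
        for d in range(dim):
            column = [x[d] for x in top]
            mean[d] = sum(column) / elite
            variance = sum((v - mean[d]) ** 2 for v in column) / elite
            std[d] = max(math.sqrt(variance), 1e-3)  # avoid total collapse
    return best_x, best_f


def toy_error(params):
    """Hypothetical validation error with its minimum at log10(C)=1, log10(gamma)=-2."""
    log_c, log_gamma = params
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2
```

Calling `umda_continuous(toy_error, dim=2)` returns the best candidate seen and its error. With a real SVM, `objective` would train and validate a model for each candidate, so the number of evaluations (`pop_size * generations`) is the cost to compare against Random Search.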
