For anyone who wants a definitive way to do this correctly, this webinar is a must.

# Why XGBoost wins every Machine Learning competition

A (long and) good answer can be found in this thesis by Didrik Nielsen.

**Abstract:** Tree boosting has empirically proven to be a highly effective approach to predictive modeling.

*It has shown remarkable results for a vast array of problems.*

*For many years, MART has been the tree boosting method of choice.*

*More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.*

*In this thesis, we will investigate how XGBoost differs from the more traditional MART.*

*We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.*

*Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.*

*In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.*

*To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.*

*The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.*

**Conclusion**: *After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth are combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting, however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen to actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.*
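To make the "higher-order approximation" point concrete, here is a minimal numpy sketch (not the actual XGBoost implementation) contrasting a first-order gradient step with the Newton step XGBoost uses for its leaf weights — w = −G/(H + λ), as in the XGBoost paper — for the logistic loss, on a handful of toy examples that fall into a single tree leaf:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy setup: binary labels and the current ensemble margins for the
# examples that fall into one tree leaf.
y = np.array([1, 1, 0, 1, 0, 0, 1, 1], dtype=float)
f = np.array([0.2, -0.1, 0.4, 0.0, -0.3, 0.1, 0.5, -0.2])

p = sigmoid(f)
g = p - y              # first-order gradient of the logistic loss
h = p * (1.0 - p)      # second-order gradient (Hessian diagonal)

# Gradient boosting: step against the average gradient (first-order only)
gradient_step = -g.mean()

# Newton boosting (XGBoost leaf weight): -G / (H + lambda), which rescales
# the step by the local curvature and adds L2 penalization of leaf weights
lam = 1.0
newton_step = -g.sum() / (h.sum() + lam)

print(gradient_step, newton_step)
```

The Newton step adapts to the curvature of the loss in that leaf, while the λ term shrinks leaf weights — the "clever penalization" the thesis refers to.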

# Extracting vocals from songs using a Convolutional Neural Network

This work by Ollin Boer Bohan is simply phenomenal. And on top of that, it comes with a repository on GitHub.

# Softmax GAN

# Data Science: How regulators, professors, and practitioners are getting it wrong

This DataRobot post is one of those posts that shows how the evolution of Big Data platforms, combined with a larger computational and predictive arsenal, is sweeping away any *bullshit* disguised as technicalities when it comes to Data Science.

I will reproduce it in full, because this post is worth using whenever you have to justify to some *numbers bureaucrat* (I won't name names, given the butthurt it could cause) why nobody cares about p-values, hypothesis tests, and so on anymore in an era when we have an abundance of data; and, above all, why statistical significance is dying.

*“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”* **ASA Statement on Statistical Significance and p-Values**

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach to building predictive models. Unfortunately, I can’t claim innocence in this matter. I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics. I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model? How about an F-test?

Do my two samples have different means? Student’s t-test!

Does my model fit my data? Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated? How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…

These tests are all based on various theoretical assumptions. If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

*“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”*

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

## Here’s why statistical significance is a waste of time

### If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation. The procedure for doing this is pretty simple. First, you make some assumptions about independence and data distributions, and variance, and so on. Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression. Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality. Unless all of these assumptions are met, none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What are you to do if your assumptions are invalid? The usual practice is to wave your hands about “robustness” or some such thing and then continue along the same path.

### If your data is big enough, EVERYTHING is significant

*“The primary product of a research inquiry is one or more measures of effect size, not P values.”* **Jacob Cohen**

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant. On one hand, this makes intuitive sense. For example, the larger a dataset is, the more likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected. On the other hand, for many assumption validity tests — e.g., tests for constant variance — statistical significance indicates *invalid* assumptions. So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.
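This effect is easy to demonstrate. The sketch below (assuming scipy is available) runs a two-sample t-test on a tiny, practically irrelevant mean difference of 0.05 standard deviations; the effect size never changes, but the p-value collapses as n grows:

```python
import numpy as np
from scipy import stats

# Deterministic toy data: two groups whose means differ by a tiny,
# practically irrelevant amount, with identical spread.
def p_value(n):
    base = np.linspace(-1.0, 1.0, n)   # same "noise" in both groups
    group_a = base
    group_b = base + 0.05              # tiny shift
    return stats.ttest_ind(group_a, group_b).pvalue

p_small = p_value(100)        # small sample: nowhere near "significant"
p_huge = p_value(1_000_000)   # huge sample: the SAME effect is "significant"
print(p_small, p_huge)
```

The same 0.05 shift that no test would flag at n = 100 becomes overwhelmingly "significant" at n = 1,000,000 — which says something about the sample size, not about the effect.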

### Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work). No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions. To make matters worse, it’s a never-ending task. Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again. And that’s assuming that the dozens of validity tests don’t give you inconsistent results. It’s a gigantic waste of resources because there is a better way.

### You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results. Andrew Gelman observed that most modelers have a (perverse) incentive to produce *statistically significant* results — even at the expense of reality. It’s hardly surprising that these techniques exist, given the pressure to produce valuable data-driven solutions. This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.

### If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America. Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on. To the extent that *any* of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated. In other words, *statistical significance* will give the impression that a gender gap exists, when it may not — simply due to model misspecification.
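A hedged numpy illustration of this misspecification effect, with entirely made-up variables (`gender`, `hours`, `wage`) in a simulated world where the direct effect of gender on wage is exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

gender = rng.integers(0, 2, n).astype(float)      # 0/1 indicator
# Hypothetical confounder correlated with gender (e.g. hours worked)
hours = 35 + 5 * gender + rng.normal(0, 2, n)
# True wage model: gender itself has ZERO direct effect
wage = 10 + 1.0 * hours + rng.normal(0, 1, n)

def ols_coef(X, y):
    """Least-squares coefficients with an intercept column."""
    X = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Misspecified model (hours omitted): gender looks like a large effect
beta_misspecified = ols_coef(gender.reshape(-1, 1), wage)[1]
# Correctly specified model: the gender coefficient collapses toward 0
beta_full = ols_coef(np.column_stack([gender, hours]), wage)[1]

print(beta_misspecified, beta_full)
```

The omitted-variable model attributes the entire hours-driven gap to gender; adding the confounder recovers the true (null) direct effect.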

## Only out-of-sample accuracy matters

Whether or not results are statistically significant is the *wrong question*. The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data. Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place. Fraud models that do a good job predicting fraud actually prevent losses. Underwriting models that accurately segment credit risk really do increase profits. Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework. Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox. They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.
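As a minimal sketch of the first of these tools, here is k-fold cross-validation written out by hand in numpy (a real project would typically use a library such as scikit-learn; this toy version just shows the mechanics):

```python
import numpy as np

def k_fold_mse(X, y, k=5):
    """Out-of-sample MSE of a least-squares regression via k-fold CV."""
    n = len(y)
    idx = np.arange(n)
    folds = np.array_split(idx, k)
    errors = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)            # held-in rows
        Xtr = np.column_stack([np.ones(len(train)), X[train]])
        Xte = np.column_stack([np.ones(len(fold)), X[fold]])
        beta, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        errors.append(np.mean((Xte @ beta - y[fold]) ** 2))  # held-out error
    return float(np.mean(errors))

# Toy data with a known linear signal plus small noise
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=200)
cv_mse = k_fold_mse(X, y)
print(cv_mse)
```

Every error is measured on rows the model never saw during fitting, which is exactly the "out-of-sample accuracy" the post argues for.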

A survey of these methods is out of scope, but let me close with a final point. Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable. Custom-coded solutions of any kind are inherently error prone, even for the most experienced data scientist.

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.s, are too slow and expensive to develop and maintain. Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and cheaply than other approaches.

By Greg Michaelson, Director – DataRobot Labs

# Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises

**Abstract:** In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) the least-squares support vector machine (LS-SVM), and iii) the autoregressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and the largest Pearson correlation coefficient (R²) between model and predicted values.
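For readers unfamiliar with the metrics used in the comparison, here is a small hedged sketch (with made-up sales numbers, not the paper's data) of how MAPE and the Pearson correlation between actual and predicted values are computed:

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percent error, in percent (lower is better)."""
    actual = np.asarray(actual, float)
    predicted = np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)

def pearson_r(actual, predicted):
    """Pearson correlation between actual and predicted values."""
    return float(np.corrcoef(actual, predicted)[0, 1])

# Hypothetical monthly sales vs. two competing forecasts
actual = np.array([120, 135, 150, 160, 155, 170.0])
model_a = np.array([118, 138, 148, 163, 150, 172.0])
model_b = np.array([110, 150, 140, 175, 140, 185.0])

print(mape(actual, model_a), mape(actual, model_b))
print(pearson_r(actual, model_a))
```

By these criteria, the paper's winner is the model with the lowest MAPE and the highest correlation with the held-out observations.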

# Multiple Correspondence Analysis in R for the churn problem

*Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended analytics approach is to first address the redundancy; which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with other variable groups in the same data set. On the other hand, relevancy is about potential predictor variables and involves understanding the relationship between the target variable and input variables.*

*Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for finding and constructing a low-dimensional visual representation of variable associations among groups of categorical variables. Variable clustering as a tool for identifying redundancy is often applied to get a first impression of variable associations and multivariate data structure.*

*The motivations of this post are to illustrate the applications of: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for business insights of customer churn data, and 3) variable clustering of categorical variables for the identification of redundant variables.*
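As a rough illustration of the mechanics behind MCA (the post itself works in R; a real analysis would use a dedicated package such as FactoMineR), here is a toy correspondence analysis of the one-hot indicator matrix built from hypothetical categorical churn data:

```python
import numpy as np

def mca_coordinates(codes, n_components=2):
    """Toy MCA: correspondence analysis of the indicator (disjunctive)
    matrix of categorical codes. `codes` is a 2-D integer array with one
    column per categorical variable. Returns row principal coordinates.
    """
    # Build the one-hot indicator matrix, one block per variable
    blocks = []
    for j in range(codes.shape[1]):
        levels = np.unique(codes[:, j])
        blocks.append((codes[:, j][:, None] == levels[None, :]).astype(float))
    Z = np.hstack(blocks)

    P = Z / Z.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    # Standardized residuals, then SVD for the principal axes
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Row principal coordinates on the first axes
    return (U[:, :n_components] * s[:n_components]) / np.sqrt(r)[:, None]

# Hypothetical churn data: columns = plan type, payment method, churned
codes = np.array([
    [0, 0, 0], [0, 0, 0], [0, 1, 0], [1, 1, 1],
    [1, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1],
])
coords = mca_coordinates(codes)
print(coords.shape)
```

Customers with identical category profiles land on identical coordinates, and associated categories cluster along the same axes — which is what makes MCA useful for spotting redundant variables.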

# Signal and noise

# Interpreting the odds ratio

Here, Matt Bogard of Econometric Sense gives a tip on how to interpret this number:

*From the basic probabilities above, we know that the probability of event Y is greater for males than females. The odds of event Y are also greater for males than females. These relationships are also reflected in the odds ratios. The odds of event Y for males are 3 times the odds for females. The odds of event Y for females are only .33 times the odds for males. In other words, the odds of event Y for males are greater and the odds of event Y for females are less.*

*This can also be seen from the formula for odds ratios. If the OR M vs F = odds(M)/odds(F), we can see that if the odds (M) > odds(F), the odds ratio will be greater than 1. Alternatively, for OR F vs M = odds(F)/odds(M), we can see that if the odds(F) < odds(M) then the ratio will be less than 1. If the odds for both groups are equal, the odds ratio will be 1 exactly.*

**RELATION TO LOGISTIC REGRESSION**

Odds ratios can be obtained from logistic regression by exponentiating the coefficient or beta for a given explanatory variable. For categorical variables, the odds ratios are interpreted as above. For continuous variables, odds ratios are in terms of changes in odds as a result of a one-unit change in the variable.
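A tiny worked example of that relationship, using a hypothetical 2×2 table: with a single binary predictor the logistic model is saturated, so the exponentiated coefficient reproduces the table's odds ratio exactly:

```python
import math

# Hypothetical 2x2 table: event Y by gender
#            event   no event
# males        60        40
# females      20        80
odds_m = 60 / 40               # 1.5
odds_f = 20 / 80               # 0.25
odds_ratio = odds_m / odds_f   # 6.0

# In a logistic regression of Y on a male indicator, the saturated fit
# gives beta = log(odds_m) - log(odds_f), so exp(beta) is the odds ratio.
beta = math.log(odds_m) - math.log(odds_f)
print(odds_ratio, math.exp(beta))   # both approx 6.0
```

For a continuous predictor the same exp(beta) reading applies, but per one-unit change in the variable rather than per group.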

# Failing to prepare is preparing to fail…

An old topic, but one worth remembering whenever possible:

*Given this context, it is curious to note that so much of what is published (again, especially on-line; think of titles such as: “The 10 Learning Algorithms Every Data Scientist Must Know”) and so many job listings emphasize- almost to the point of exclusivity- learning algorithms, as opposed to practical questions of data sampling, data preparation and enhancement, variable reduction, solving the business problem (instead of the technical one) or ability to deploy the final product.*

# Deep Learning AMI on Amazon Web Services

*The Deep Learning AMI is an Amazon Linux image supported and maintained by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2). It is designed to provide a stable, secure, and high performance execution environment for deep learning applications running on Amazon EC2. It includes popular deep learning frameworks, including MXNet, Caffe, Tensorflow, Theano, CNTK and Torch as well as packages that enable easy integration with AWS, including launch configuration tools and many popular AWS libraries and tools. It also includes the Anaconda Data Science Platform for Python2 and Python3. Amazon Web Services provides ongoing security and maintenance updates to all instances running the Amazon Linux AMI. The Deep Learning AMI is provided at no additional charge to Amazon EC2 users.*

*The AMI Ids for the Deep Learning Amazon Linux AMI are the following:*

*us-east-1 : ami-e7c96af1*

*us-west-2: ami-dfb13ebf*

*eu-west-1: ami-6e5d6808*

*Release tags/branches used for the DL frameworks:*

*MXNet : v0.9.3 tag*

*Tensorflow : v1.0.0 tag*

*Theano : rel-0.8.2 tag*

*Caffe : rc5 tag*

*CNTK : v2.0beta12.0 tag*

*Torch : master branch*

*Keras : 1.2.2 tag*

# A tool for Machine Learning – MLJAR

**WHAT IS MLJAR?**

*MLJAR is a human-first platform for machine learning.*

*It provides a service for prototyping, development and deploying pattern recognition algorithms.*

*It makes algorithm search and tuning painless!*

**HOW DOES IT WORK?**

*You pay for computational time used for models training, predictions and data analysis. 1 credit is 1 computation hour on machine with 8 CPU and 15GB RAM. Computational time is aggregated per second basis.*

# Failures of the Deep Learning approach: architectures and meta-parameterization

The biggest challenge the industry currently faces with respect to Deep Learning is, without a doubt, the computational side: the entire market is absorbing cloud services to perform ever more complex calculations, as well as investing in GPU computing capacity.

However, even though hardware these days is already a *commodity*, academia is tackling a problem that could revolutionize the way Deep Learning is done: the **architectural/parameterization aspect**.

This comment from the thread says a lot about the problem, where the user states:

“*The main problem I see with Deep Learning: too many parameters.*

*When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.*”

In other words, even with all the available hardware, the question of *what is the best architectural arrangement for a deep neural network?* remains unsolved.

This paper by Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah, called “*Failures of Deep Learning*”, exposes this problem in rich detail, including experiments (this is the GitHub repository).

The authors identify the failure points of Deep Learning networks as: a) *failures of gradient-based methods for parameter optimization*, b) *structural problems in Deep Learning algorithms when decomposing problems*, c) *architecture*, and d) *saturation of the activation functions*.

In other words, what may be happening in many Deep Learning applications is that convergence times could be far shorter still, if these issues were already solved.

With this solved, much of what we know today as the hardware industry for Deep Learning networks would either be extremely under-utilized (*i.e.* given the improvements in architectural/algorithmic optimization) or could be repurposed for more complex tasks (*e.g.* image recognition with a low number of pixels).

Thus, even adopting a hardware-based approach as the industry has been doing, there is still plenty of room to optimize Deep Learning networks from an architectural and algorithmic standpoint.

Below is a list of references straight from Stack Exchange for anyone who wants to dig deeper into the subject:

Neuro-evolutionary algorithms:

- Zaremba, Wojciech, Ilya Sutskever, and Rafal Jozefowicz. “An empirical exploration of recurrent network architectures.” (2015): used evolutionary computation to find optimal RNN structures.
- Franck Dernoncourt. “The medial Reticular Formation: a neural substrate for action selection? An evaluation via evolutionary computation.” Master’s Thesis. École Normale Supérieure Ulm. 2011.
- Bayer, Justin, Daan Wierstra, Julian Togelius, and Jürgen Schmidhuber. “Evolving memory cell structures for sequence learning.” In International Conference on Artificial Neural Networks, pp. 755-764. Springer Berlin Heidelberg, 2009: used evolutionary computation to find optimal RNN structures.

Reinforcement Learning:

- Jose M Alvarez, Mathieu Salzmann. Learning the Number of Neurons in Deep Networks. NIPS 2016. https://arxiv.org/abs/1611.06321
- Bowen Baker, Otkrist Gupta, Nikhil Naik, Ramesh Raskar. Designing Neural Network Architectures using Reinforcement Learning. https://arxiv.org/abs/1611.02167
- Barret Zoph, Quoc V. Le. Neural Architecture Search with Reinforcement Learning. https://arxiv.org/abs/1611.01578

Miscellaneous:

- Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W. Hoffman, David Pfau, Tom Schaul, Brendan Shillingford, Nando de Freitas. Learning to learn by gradient descent by gradient descent. https://arxiv.org/abs/1606.04474
- Franck Dernoncourt, Ji Young Lee Optimizing Neural Network Hyperparameters with Gaussian Processes for Dialog Act Classification, IEEE SLT 2016.
- Cortes, Corinna, Xavi Gonzalvo, Vitaly Kuznetsov, Mehryar Mohri, and Scott Yang. “AdaNet: Adaptive Structural Learning of Artificial Neural Networks.” arXiv preprint arXiv:1607.01097 (2016). https://arxiv.org/abs/1607.01097 : Approach that learns both the structure of the network as well as its weights.

PS: WordPress removed the text-justify option, so apologies in advance for the blog's amateur appearance over the next few days.

# Beyond active learning in cross-domain Recommender Systems

One of the most common problems in Recommender Systems is the famous *Cold Start* (*i.e.* when there is no prior knowledge about the tastes of someone who has just joined the platform).

This paper brings an interesting perspective on the subject.

**Conclusions**: *In this paper, we have evaluated several widely used active learning strategies adopted to tackle the cold-start problem in a novel usage scenario, i.e., the cross-domain recommendation scenario. In such a case, the user preferences are available not only in the target domain, but also in an additional auxiliary domain. Hence, the active learner can exploit such knowledge to better estimate which preferences are more valuable for the system to acquire. Our results have shown that the performance of the considered active learning strategies changes significantly in the cross-domain recommendation scenario in comparison to the single-domain recommendation. Hence, the presence of the auxiliary domain may strongly influence the performance of the active learning strategies. Indeed, while a certain active learning strategy performs the best for MAE reduction in the single-domain scenario (i.e., the highest-predicted strategy), it actually performs poorly in the cross-domain scenario. On the other hand, the strategy with the worst MAE in the single-domain scenario (i.e., the lowest-predicted strategy) can perform excellently in the cross-domain scenario. This is an interesting observation which indicates the importance of further analysis of these two scenarios in order to better design and develop active learning strategies for them. Our future work includes the further analysis of the AL strategies in other domains such as books, electronic products, tourism, etc. Moreover, we plan to investigate the potential impact of considering different rating prediction models (e.g., context-aware models) on the performance of different active learning strategies.*

# Stoic Ethics for artificial agents

Would you believe they adapted Stoic philosophy for Artificial Intelligence?

# MEBoost – A new method for variable selection

One of the least explored fields academically is, without a doubt, variable selection. This paper sheds some light on this important subject, one that drains a good part of Data Scientists' productive time.

# Regression with corrupted instances: a robust approach and its applications

Interesting work.

**Conclusions**: *We consider a new approach dedicated to the multivariate regression problem where some output labels are either corrupted or missing. The gross error is explicitly addressed in our model, while it allows the adaptation of distinct regression elements or tasks according to their own noise levels. We further propose and analyze the convergence and runtime properties of the proposed proximal ADMM algorithm, which is globally convergent and efficient. The model, combined with the specifically designed solver, enables our approach to tackle a diverse range of applications. This is practically demonstrated on two distinct applications, that is, to predict personalities based on behaviors at SNSs, as well as to estimate 3D hand pose from single depth images. Empirical experiments on synthetic and real datasets have showcased the applicability of our approach in the presence of label noise. For future work, we plan to integrate more advanced deep learning techniques to better address more practical problems, including 3D hand pose estimation and beyond.*

# Feature Screening in Large Scale Cluster Analysis

More work on clustering.

# Deterministic quantum annealing expectation-maximization (DQAEM)

Despite the rather complicated name, the paper describes a modification of the Expectation-Maximization (EM) clustering algorithm in which a meta-heuristic similar to Simulated Annealing is added, in order to eliminate two of EM's weaknesses: its heavy dependence on the initial assignments and its occasional problems with local optima.

**Relaxation of the EM Algorithm via Quantum Annealing for Gaussian Mixture Models**
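The quantum mechanism itself is beyond a blog sketch, but the classical deterministic-annealing idea it builds on can be illustrated: flatten the E-step responsibilities at high temperature, then cool toward plain EM. Below is a hedged toy version for a two-component 1-D Gaussian mixture with known, equal component variance (a simplification; not the paper's algorithm):

```python
import numpy as np

def annealed_em_means(x, sigma=0.6, temps=(4.0, 2.0, 1.0), iters=25):
    """Deterministic-annealing EM for a two-component 1-D Gaussian
    mixture with fixed component variance.

    At temperature T > 1 the responsibilities are flattened
    (posterior^(1/T), renormalized), which smooths the objective and
    reduces sensitivity to initialization; T = 1 is plain EM.
    """
    mu = np.array([-0.5, 0.5])           # deliberately poor initialization
    pi = np.array([0.5, 0.5])
    for T in temps:                       # cooling schedule
        for _ in range(iters):
            # Annealed E-step: responsibilities proportional to dens^(1/T)
            log_dens = np.log(pi) - 0.5 * ((x[:, None] - mu) / sigma) ** 2
            r = np.exp(log_dens / T)
            r /= r.sum(axis=1, keepdims=True)
            # M-step: update mixing weights and means
            nk = r.sum(axis=0)
            pi = nk / len(x)
            mu = (r * x[:, None]).sum(axis=0) / nk
    return mu, pi

# Deterministic toy data: two well-separated groups around -2 and +2
x = np.concatenate([np.linspace(-3, -1, 50), np.linspace(1, 3, 50)])
mu, pi = annealed_em_means(x)
print(sorted(mu))
```

Even starting from a poor initialization, the cooled responsibilities pull the two means apart toward the true cluster centers, which is the intuition behind annealing away EM's initialization sensitivity.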

# Distributed K-Means over compressed binary data

And who said K-Means was dead, huh?