RLScore: Regularized Least-Squares Learners

A good alternative to ensembles when dataset dimensionality is high, or when Elastic Net, Lasso, and Ridge do not deliver the desired convergence.

RLScore: Regularized Least-Squares Learners

RLScore is a Python open source module for kernel based machine learning. The library provides implementations of several regularized least-squares (RLS) type of learners. RLS methods for regression and classification, ranking, greedy feature selection, multi-task and zero-shot learning, and unsupervised classification are included. Matrix algebra based computational short-cuts are used to ensure efficiency of both training and cross-validation. A simple API and extensive tutorials allow for easy use of RLScore.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution.

The second reason that RLS is used occurs when the number of variables does not exceed the number of observations, but the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be “sparse” in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.
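In equation form, the linear RLS (ridge-style) problem and its closed-form solution can be sketched as follows (a reference formulation for the linear case; the lambda here corresponds to the regparam argument used in the RLScore API below):

\min_{w} \; \|Xw - y\|_2^2 + \lambda \|w\|_2^2,
\qquad
\hat{w} = (X^{\top}X + \lambda I)^{-1} X^{\top} y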

To see in depth

Installation
1) $ pip install rlscore
2) $ export CFLAGS="-I /usr/local/lib/python2.7/site-packages/numpy/core/include $CFLAGS"

Original post

In [1]:
# Import libraries
import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror
from rlscore.learner import LeaveOneOutRLS
In [2]:
# Function to load dataset and split in train and test sets
def load_housing():
    np.random.seed(1)
    D = np.loadtxt("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/housing_data.txt")
    np.random.shuffle(D)
    X = D[:,:-1] # Independent variables
    Y = D[:,-1]  # Dependent variable
    X_train = X[:250]
    Y_train = Y[:250]
    X_test = X[250:]
    Y_test = Y[250:]
    return X_train, Y_train, X_test, Y_test
In [3]:
def print_stats():
    X_train, Y_train, X_test, Y_test = load_housing()
    print("Housing data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__ == "__main__":
    print_stats()
Housing data set characteristics
Training set: 250 instances, 13 features
Test set: 256 instances, 13 features

Linear regression with default parameters

In [4]:
# Function to train RLS method
def train_rls():
    #Trains RLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train)
    
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    
    #Test set predictions
    P_test = learner.predict(X_test)
    
    # Stats
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()
leave-one-out error 25.959399
test error 25.497222
mean predictor 81.458770

Choosing regularization parameter with leave-one-out

The regularization parameter is selected by grid search over an exponential grid, keeping the value with the lowest LOO-CV error.

In [5]:
def train_rls():
    #Select regparam with leave-one-out cross-validation
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train)
    best_regparam = None
    best_error = float("inf")
   
    #exponential grid of possible regparam values
    log_regparams = range(-15, 16)
    for log_regparam in log_regparams:
        regparam = 2.**log_regparam
        
        #RLS is re-trained with the new regparam, this
        #is very fast due to computational short-cut
        learner.solve(regparam)
        
        #Leave-one-out cross-validation predictions, this is fast due to
        #computational short-cut
        P_loo = learner.leave_one_out()
        e = sqerror(Y_train, P_loo)
        print("regparam 2**%d, loo-error %f" %(log_regparam, e))
        if e < best_error:
            best_error = e
            best_regparam = regparam
    learner.solve(best_regparam)
    P_test = learner.predict(X_test)
    print("best regparam %f with loo-error %f" %(best_regparam, best_error)) 
    print("test error %f" %sqerror(Y_test, P_test))

if __name__=="__main__":
    train_rls()
regparam 2**-15, loo-error 24.745479
regparam 2**-14, loo-error 24.745463
regparam 2**-13, loo-error 24.745431
regparam 2**-12, loo-error 24.745369
regparam 2**-11, loo-error 24.745246
regparam 2**-10, loo-error 24.745010
regparam 2**-9, loo-error 24.744576
regparam 2**-8, loo-error 24.743856
regparam 2**-7, loo-error 24.742982
regparam 2**-6, loo-error 24.743309
regparam 2**-5, loo-error 24.750966
regparam 2**-4, loo-error 24.786243
regparam 2**-3, loo-error 24.896991
regparam 2**-2, loo-error 25.146493
regparam 2**-1, loo-error 25.537315
regparam 2**0, loo-error 25.959399
regparam 2**1, loo-error 26.285436
regparam 2**2, loo-error 26.479254
regparam 2**3, loo-error 26.603001
regparam 2**4, loo-error 26.801196
regparam 2**5, loo-error 27.352322
regparam 2**6, loo-error 28.837002
regparam 2**7, loo-error 32.113350
regparam 2**8, loo-error 37.480625
regparam 2**9, loo-error 43.843555
regparam 2**10, loo-error 49.748687
regparam 2**11, loo-error 54.912297
regparam 2**12, loo-error 59.936226
regparam 2**13, loo-error 65.137825
regparam 2**14, loo-error 70.126118
regparam 2**15, loo-error 74.336978
best regparam 0.007812 with loo-error 24.742982
test error 24.509981

Training with RLS and simultaneously selecting the regularization parameter with leave-one-out using LeaveOneOutRLS

In [6]:
def train_rls():
    #Trains RLS with automatically selected regularization parameter
    X_train, Y_train, X_test, Y_test = load_housing()
    
    # Grid search
    regparams = [2.**i for i in range(-15, 16)]
    learner = LeaveOneOutRLS(X_train, Y_train, regparams = regparams)
    loo_errors = learner.cv_performances
    P_test = learner.predict(X_test)
    print("leave-one-out errors " +str(loo_errors))
    print("chosen regparam %f" %learner.regparam)
    print("test error %f" %sqerror(Y_test, P_test))

if __name__=="__main__":
    train_rls()
leave-one-out errors [ 24.74547881  24.74546295  24.74543138  24.74536884  24.74524616
  24.74501033  24.7445764   24.74385625  24.74298177  24.74330936
  24.75096639  24.78624255  24.89699067  25.14649266  25.53731465
  25.95939943  26.28543584  26.47925431  26.6030015   26.80119588
  27.35232186  28.83700156  32.11334986  37.48062503  43.84355496
  49.7486873   54.91229746  59.93622566  65.1378248   70.12611801
  74.33697809]
chosen regparam 0.007812
test error 24.509981

Learning nonlinear predictors using kernels

RLS using a non-linear kernel function.

In [7]:
def train_rls():
    #Selects both the gamma parameter for Gaussian kernel, and regparam with loocv
    X_train, Y_train, X_test, Y_test = load_housing()
    
    regparams = [2.**i for i in range(-15, 16)]
    gammas = regparams
    best_regparam = None
    best_gamma = None
    best_error = float("inf")
    
    for gamma in gammas:
        #New RLS is initialized for each kernel parameter
        learner = RLS(X_train, Y_train, kernel="GaussianKernel", gamma=gamma)
        for regparam in regparams:
            #RLS is re-trained with the new regparam, this
            #is very fast due to computational short-cut
            learner.solve(regparam)
            
            #Leave-one-out cross-validation predictions, this is fast due to
            #computational short-cut
            P_loo = learner.leave_one_out()
            e = sqerror(Y_train, P_loo)
            
            #print "regparam", regparam, "gamma", gamma, "loo-error", e
            if e < best_error:
                best_error = e
                best_regparam = regparam
                best_gamma = gamma
    learner = RLS(X_train, Y_train, regparam = best_regparam, kernel="GaussianKernel", gamma=best_gamma)
    P_test = learner.predict(X_test)
    print("best parameters gamma %f regparam %f" %(best_gamma, best_regparam))
    print("best leave-one-out error %f" %best_error)
    print("test error %f" %sqerror(Y_test, P_test))
    
    
if __name__=="__main__":
    train_rls()
best parameters gamma 0.000031 regparam 0.000244
best leave-one-out error 21.910837
test error 16.340877

Binary classification and Area under ROC curve

In [8]:
from rlscore.utilities.reader import read_svmlight

# Load dataset and stats
def print_stats():
    X_train, Y_train, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a.t")
    X_test, Y_test, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a")
    print("Adult data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__=="__main__":
    print_stats()
Adult data set characteristics
Training set: 30956 instances, 123 features
Test set: 1605 instances, 119 features
In [ ]:
from rlscore.learner import RLS
from rlscore.measure import accuracy
from rlscore.utilities.reader import read_svmlight


def train_rls():
    # Train and test datasets
    X_train, Y_train, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a.t")
    X_test, Y_test, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a", X_train.shape[1])
    learner = RLS(X_train, Y_train)
    best_regparam = None
    best_accuracy = 0.
    
    #exponential grid of possible regparam values
    log_regparams = range(-15, 16)
    for log_regparam in log_regparams:
        regparam = 2.**log_regparam
        #RLS is re-trained with the new regparam, this
        #is very fast due to computational short-cut
        learner.solve(regparam)
        
        #Leave-one-out cross-validation predictions, this is fast due to
        #computational short-cut
        P_loo = learner.leave_one_out()
        acc = accuracy(Y_train, P_loo)
        
        print("regparam 2**%d, loo-accuracy %f" %(log_regparam, acc))
        if acc > best_accuracy:
            best_accuracy = acc
            best_regparam = regparam
    learner.solve(best_regparam)
    P_test = learner.predict(X_test)
    
    print("best regparam %f with loo-accuracy %f" %(best_regparam, best_accuracy)) 
    print("test set accuracy %f" %accuracy(Y_test, P_test))

if __name__=="__main__":
    train_rls()

Cleverhans – a Python lib for preventing noise attacks on models

Attacks on sensors, and consequently on Machine Learning models, are quite a recent topic (it will be covered here at some point in the future, but this article shows their damaging potential well).

Cleverhans is a Python lib that artificially injects a bit of noise/perturbation into the network as a way of training against this type of attack.

This repository contains the source code for cleverhans, a Python library to benchmark machine learning systems’ vulnerability to adversarial examples. You can learn more about such vulnerabilities on the accompanying blog.
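For reference, the canonical perturbation that adversarial-example libraries such as cleverhans benchmark against is the fast gradient sign method (FGSM); this is the generic formula from the adversarial-examples literature, not something specific to the cleverhans API:

x_{\text{adv}} = x + \epsilon \cdot \operatorname{sign}\!\left(\nabla_{x} J(\theta, x, y)\right)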

The cleverhans library is under continual development, always welcoming contributions of the latest attacks and defenses. In particular, we always welcome help towards resolving the issues currently open.

About the name

The name cleverhans is a reference to a presentation by Bob Sturm titled “Clever Hans, Clever Algorithms: Are Your Machine Learnings Learning What You Think?” and the corresponding publication, “A Simple Method to Determine if a Music Information Retrieval System is a ‘Horse’.” Clever Hans was a horse that appeared to have learned to answer arithmetic questions, but had in fact only learned to read social cues that enabled him to give the correct answer. In controlled settings where he could not see people’s faces or receive other feedback, he was unable to answer the same questions. The story of Clever Hans is a metaphor for machine learning systems that may achieve very high accuracy on a test set drawn from the same distribution as the training data, but that do not actually understand the underlying task and perform poorly on other inputs.


Interpretability versus Performance: Skepticism and the AI Winter

In this post, Michael Elad, editor-in-chief of SIAM's Journal on Imaging Sciences, makes a series of well-balanced reflections on how Deep Learning methods are solving real problems and achieving a high degree of visibility, even though the methods are not so elegant from a mathematical perspective.

His main point is that, as far as image processing is concerned, academia has always favored an approach in which interpretability and understanding of the models took precedence over the results achieved.

This becomes clear in the paragraph below:

A series of papers during the early 2000s suggested the successful application of this architecture, leading to state-of-the-art results in practically any assigned task. Key aspects in these contributions included the following: the use of many network layers, which explains the term “deep learning;” a huge amount of data on which to train; massive computations typically run on computer clusters or graphic processing units; and wise optimization algorithms that employ effective initializations and gradual stochastic gradient learning. Unfortunately, all of these great empirical achievements were obtained with hardly any theoretical understanding of the underlying paradigm. Moreover, the optimization employed in the learning process is highly non-convex and intractable from a theoretical viewpoint.

At the end, he offers a view on pragmatism and the academic agenda:

Should we be happy about this trend? Well, if we are in the business of solving practical problems such as noise removal, the answer must be positive. Right? Therefore, a company seeking such a solution should be satisfied. But what about us scientists? What is the true objective behind the vast effort that we invested in the image denoising problem? Yes, we do aim for effective noise-removal algorithms, but this constitutes a small fraction of our motivation, as we have a much wider and deeper agenda. Researchers in our field aim to understand the data on which we operate. This is done by modeling information in order to decipher its true dimensionality and manifested phenomena. Such models serve denoising and other problems in image processing, but far more than that, they allow identifying new ways to extract knowledge from the data and enable new horizons.

This reminds me of my time at RCB Investimentos, when I worked with the great Renato Toledo in the NPL market. He taught me that good models have a high degree of interpretability and simplicity, and that this factor should be the barometer of decision-making, since a model whose uncertainty (or error) is known is better than a model in which nobody knows what is going on. (Personal note: those who know me know I have a saying about this: if you don't understand the dynamics of a model when it works, you will never know what went wrong when it fails.)

Nonetheless, it is undeniable that, in my view, Deep Learning networks are meeting a pent-up demand of problems that already existed and that computational methods could not solve easily, such as facial recognition, image classification, translation, and structured problems such as fraud (Fast.AI is doing a great job of clarifying this).

Even granting that DL researchers have had nearly unlimited hardware at modest prices, the brutal fact is that for roughly 30 years this research field swallowed a very bitter pill of skepticism coming from academia itself: whether by placing the method under such heavy skepticism that it was nearly driven to extinction, or by some journals implicitly refusing to accept DL papers; all the while, mathematicians were winning prizes and enjoying a high level of visibility because of the accuracy of their methods, rather than because of some supposed notion that the world loved the interpretability of those methods.

Two big questions are now on the table: 1) Can the mathematicians and communities shocked by this phenomenon endure what the neural networks community endured for more than 30 years? and 2) In the event of a Math Winter, can the mathematical community withstand a potential marginalization of its research?

We will have to wait and see.

 


Self-Driving Cars in GTA 5 using Deep Learning

Anyone who follows Python Programming knows that whenever they post something, good things are on the way; and this time was no different.

Harrison is writing a series of posts on how to play GTA V using Deep Learning, with TensorFlow and a CNN (convolutional neural network).

This is the first video of the series, in which he sets up the solution:

 

And this is the latest trained version:

For those interested, Harrison has posted a playlist with all the stages of training, plus a bot running on its own on a livestream (it is worth watching just to see how entertaining the bot is when it tries to drive).

And the code is available on GitHub.


Natural Language Processing with FAIRSeq – Facebook AI Research Sequence-to-Sequence Toolkit

A recent post on Facebook Code introduced FAIRSeq, an acronym for Facebook AI Research Sequence-to-Sequence Toolkit, in which the researchers obtained good results by combining a CNN (convolutional neural network) approach with sequence-to-sequence learning; besides achieving higher accuracy than RNN (recurrent neural network) approaches, it also offers far higher processing throughput.

Beyond the results and the approach itself, the most interesting thing is to see how basic aspects of observational science have a strong influence on innovation; in other words, how simple observation can lead to excellent results.

To understand this better, see the inspiration behind the main mechanism of the architecture that performs the translation:

“A distinguishing component of our architecture is multi-hop attention. An attention mechanism is similar to the way a person would break down a sentence when translating it: Instead of looking at the sentence only once and then writing down the full translation without looking back, the network takes repeated “glimpses” at the sentence to choose which words it will translate next, much like a human occasionally looks back at specific keywords when writing down a translation. Multi-hop attention is an enhanced version of this mechanism, which allows the network to make multiple such glimpses to produce better translations. These glimpses also depend on each other. For example, the first glimpse could focus on a verb and the second glimpse on the associated auxiliary verb.”
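As a rough sketch (generic single-glimpse dot-product attention in NumPy, not the FAIRSeq implementation or its multi-hop variant), the "glimpse" described above amounts to weighting the encoded source states by their similarity to the current decoder state:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention(query, keys, values):
    # query: (d,) current decoder state; keys/values: (src_len, d) encoder states
    scores = keys @ query / np.sqrt(query.shape[0])  # similarity to each source position
    weights = softmax(scores)                        # where the network "glimpses"
    context = weights @ values                       # weighted summary of the source
    return context, weights

# Toy usage: 5 source positions, 8-dimensional states
rng = np.random.default_rng(0)
context, weights = attention(rng.normal(size=8),
                             rng.normal(size=(5, 8)),
                             rng.normal(size=(5, 8)))
print(weights.round(3), context.shape)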

For those interested, a version of the code is available on GitHub, and the original paper with the results is here.


The best Deep Learning papers from 2012 to 2016

To study with a pencil in hand and coffee in the mug.

Via Kdnuggets

1. Understanding / Generalization / Transfer

Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]

2. Optimization / Training Techniques

Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Ioffe and C. Szegedy [pdf]

3. Unsupervised / Generative Models

Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]

4. Convolutional Neural Network Models

Deep residual learning for image recognition (2016), K. He et al. [pdf]

5. Image: Segmentation / Object Detection

Fast R-CNN (2015), R. Girshick [pdf]

6. Image / Video / Etc.

Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]

7. Natural Language Processing / RNNs

Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]

8. Speech / Other Domain

Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]

9. Reinforcement Learning / Robotics

Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf]

10. More Papers from 2016

Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf]


Generalized Additive Models for Time Series

On AlgoBeans you will probably find the best explanation of Generalized Additive Models on the internet. In a simple and didactic way, the post explains everything about this technique.

Therefore, google search trends for persimmons could well be modeled by adding a seasonal trend to an increasing growth trend, in what’s called a generalized additive model (GAM).

The principle behind GAMs is similar to that of regression, except that instead of summing effects of individual predictors, GAMs are a sum of smooth functions. Functions allow us to model more complex patterns, and they can be averaged to obtain smoothed curves that are more generalizable.

Because GAMs are based on functions rather than variables, they are not restricted by the linearity assumption in regression that requires predictor and outcome variables to move in a straight line. Furthermore, unlike in neural networks, we can isolate and study effects of individual functions in a GAM on resulting predictions.
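In formula terms (a generic sketch of the model class, not taken from the post), a GAM replaces the linear terms of a regression with a sum of smooth functions; for the search-trend example this could be a long-run trend plus a seasonal term:

g\!\left(\mathbb{E}[y]\right) = \beta_0 + \sum_{j=1}^{p} f_j(x_j),
\qquad
y_t = \beta_0 + f_{\text{trend}}(t) + f_{\text{season}}(\text{month}_t) + \epsilon_t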


Accelerating the XGBoost algorithm using GPU computing

The final frontier in GPU usage for one of the most powerful algorithms of all time is here.

Abstract: We present a CUDA based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelized, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort based approach for larger depths. We show speedups of between 3-6x using a Titan X compared to a 4 core i7 CPU, and 1.2x using a Titan X compared to 2x Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks. 
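As a hedged illustration (synthetic placeholder data, not the paper's benchmarks; in current XGBoost releases the GPU tree construction described above is exposed through the tree_method parameter), enabling it looks roughly like this:

import numpy as np
import xgboost as xgb

# Synthetic placeholder data (the paper benchmarks e.g. the Higgs dataset: 10M instances, 28 features)
X = np.random.rand(10000, 28)
y = (X[:, 0] + np.random.rand(10000) > 1.0).astype(int)

dtrain = xgb.DMatrix(X, label=y)
params = {
    "objective": "binary:logistic",
    "max_depth": 6,
    "tree_method": "gpu_hist",  # GPU-based tree construction (requires a CUDA-enabled build)
}
bst = xgb.train(params, dtrain, num_boost_round=100)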


Applying Natural Language Processing with Python to food reviews

A great notebook by Patrick Harrison of S&P Global Market Intelligence.

For anyone who wants to work with NLP, this is by far one of the best tutorials on the internet, especially for the richness of its treatment of text, in particular topic modeling using LDA and semantic analysis using pyLDAvis.

For anyone who wants to work seriously with NLP, this post is mandatory.


Why XGBoost wins every Machine Learning competition

A (long and) good answer is in this thesis by Didrik Nielsen.

16128_FULLTEXT

Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling. It has shown remarkable results for a vast array of problems. For many years, MART has been the tree boosting method of choice. More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions. In this thesis, we will investigate how XGBoost differs from the more traditional MART. We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs. Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models. In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions. To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling. The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.
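For reference, the second-order ("Newton") step that distinguishes XGBoost shows up in its optimal leaf weights and split score, which use both the gradients g_i and Hessians h_i of the loss (standard XGBoost notation, not quoted from the thesis):

w_j^{*} = -\frac{\sum_{i \in I_j} g_i}{\sum_{i \in I_j} h_i + \lambda},
\qquad
\widetilde{\mathcal{L}}^{*} = -\frac{1}{2} \sum_{j=1}^{T} \frac{\left(\sum_{i \in I_j} g_i\right)^{2}}{\sum_{i \in I_j} h_i + \lambda} + \gamma T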

Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth are combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting, however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen to actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.


Extracting vocals from songs using a Convolutional Neural Network

This work by Ollin Boer Bohan is simply phenomenal. And on top of that, the repository is on GitHub.


Softmax GAN

Abstract: Softmax GAN is a novel variant of Generative Adversarial Network (GAN). The key idea of Softmax GAN is to replace the classification loss in the original GAN with a softmax cross-entropy loss in the sample space of one single batch. In the adversarial learning of N real training samples and M generated samples, the target of discriminator training is to distribute all the probability mass to the real samples, each with probability 1/M, and distribute zero probability to generated data. In the generator training phase, the target is to assign equal probability to all data points in the batch, each with probability 1/(M+N). While the original GAN is closely related to Noise Contrastive Estimation (NCE), we show that Softmax GAN is the Importance Sampling version of GAN. We further demonstrate with experiments that this simple change stabilizes GAN training.


Data Science: How regulators, professors, and practitioners are getting it wrong

This post from DataRobot is one of those posts that shows how the evolution of Big Data platforms, combined with a larger computational and predictive arsenal, is sweeping away any bullshit dressed up in technicalities when it comes to Data Science.

I will reproduce it in full, because it is worth using this post whenever you have to justify to some numbers bureaucrat (I will not name names, given the butthurt it could cause) why nobody cares any more about p-values, hypothesis tests, etc. in an era when we have an abundance of data; and, above all, why statistical significance is dying.

“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”  ASA Statement on Statistical Significance and p-Values

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach to building predictive models.  Unfortunately, I can’t claim innocence in this matter.  I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics.  I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model?  How about an F-test?

Do my two samples have different means?  Student’s t-test!

Does my model fit my data?  Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated?  How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…

These tests are all based on various theoretical assumptions.  If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

Here’s why statistical significance is a waste of time


If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation.  The procedure for doing this is pretty simple.  First, you make some assumptions about independence, data distributions, variance, and so on.  Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression.  Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality.  If all these assumptions aren’t met, then none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What are you to do if your assumptions are invalid?  In practice, the general approach is to wave your hands about “robustness” or some such thing and then continue along the same path.

If your data is big enough, EVERYTHING is significant

“The primary product of a research inquiry is one or more measures of effect size, not P values.” Jacob Cohen

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant.  On one hand, this makes intuitive sense.  For example, the larger a dataset is, the more likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected.  On the other hand, for many assumption validity tests — e.g., tests for constant variance — statistical significance indicates invalid assumptions.  So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.
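A minimal illustration of that point (toy simulated data, assuming NumPy and SciPy are available; not from the original post): a negligible 0.01 standard-deviation difference in means becomes "statistically significant" once n is large enough.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
for n in (100, 10_000, 1_000_000):
    a = rng.normal(loc=0.00, scale=1.0, size=n)
    b = rng.normal(loc=0.01, scale=1.0, size=n)  # tiny, practically meaningless effect
    t_stat, p_value = stats.ttest_ind(a, b)
    print(f"n={n:>9}  p-value={p_value:.4f}")
# As n grows, the p-value shrinks toward zero even though the effect size stays negligible.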

Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work).  No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions.  To make matters worse, it’s a never ending task.  Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again.  And that’s assuming that the dozens of validity tests don’t give you inconsistent results.  It’s a gigantic waste of resources because there is a better way.

You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results.  Andrew Gelman observed that most modelers have a (perverse) incentive to produce statistically significant results — even at the expense of reality.  It’s hardly surprising that these techniques exist, given the pressure to produce valuable data-driven solutions.  This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.

If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America.  Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on.  To the extent that any of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated.  In other words, statistical significance will give the impression that a gender gap exists, when it may not — simply due to model misspecification.

Only out-of-sample accuracy matters

Whether or not results are statistically significant is the wrong question.  The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data.  Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place.  Fraud models that do a good job predicting fraud actually prevent losses.  Underwriting models that accurately segment credit risk really do increase profits.  Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework.  Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox.  They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.
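As a minimal scikit-learn sketch of that workflow (a generic example, not from the original post): judge the model by cross-validated out-of-sample accuracy rather than by in-sample significance tests.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic classification data standing in for a business problem
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
model = GradientBoostingClassifier(random_state=0)

# 5-fold cross-validation: an estimate of accuracy on new, unseen data
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("out-of-sample accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))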

A survey of these methods is out of scope, but let me close with a final point.  Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable.  Custom-coded solutions of any kind are inherently error-prone, even for the most experienced data scientist.

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.’s are too slow and expensive to develop and maintain.  Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and much cheaper than other approaches.

By Greg Michaelson, Director – DataRobot Labs


Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Abstract: In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient (R²) between model and predicted values.
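For reference, the two evaluation metrics named in the abstract are, in their standard form (generic definitions, not quoted from the paper):

\text{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,
\qquad
R = \frac{\sum_{t}(y_t - \bar{y})(\hat{y}_t - \bar{\hat{y}})}{\sqrt{\sum_{t}(y_t - \bar{y})^2}\,\sqrt{\sum_{t}(\hat{y}_t - \bar{\hat{y}})^2}}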


Multiple Correspondence Analysis in R for the Churn problem

Via Data Science Plus

Analytical challenges in multivariate data analysis and predictive modeling include identifying redundant and irrelevant variables. A recommended analytics approach is to first address the redundancy; which can be achieved by identifying groups of variables that are as correlated as possible among themselves and as uncorrelated as possible with other variable groups in the same data set. On the other hand, relevancy is about potential predictor variables and involves understanding the relationship between the target variable and input variables.
Multiple correspondence analysis (MCA) is a multivariate data analysis and data mining tool for finding and constructing a low-dimensional visual representation of variable associations among groups of categorical variables. Variable clustering as a tool for identifying redundancy is often applied to get a first impression of variable associations and multivariate data structure.
The motivations of this post are to illustrate the applications of: 1) preparing input variables for analysis and predictive modeling, 2) MCA as a multivariate exploratory data analysis and categorical data mining tool for business insights of customer churn data, and 3) variable clustering of categorical variables for the identification of redundant variables.
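As a rough sketch of the idea in Python (MCA is essentially correspondence analysis applied to the one-hot indicator matrix of the categorical variables; this is a generic illustration, not the R workflow from the post):

import numpy as np
import pandas as pd

def mca_row_coordinates(df, n_components=2):
    # Complete disjunctive (indicator) table: one dummy column per category level
    Z = pd.get_dummies(df.astype(str)).to_numpy(dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # row masses
    c = P.sum(axis=0)                    # column masses
    # Standardized residuals, then SVD (the core of correspondence analysis)
    S = np.diag(1 / np.sqrt(r)) @ (P - np.outer(r, c)) @ np.diag(1 / np.sqrt(c))
    U, s, _ = np.linalg.svd(S, full_matrices=False)
    # Principal row coordinates: one low-dimensional point per observation
    return np.diag(1 / np.sqrt(r)) @ U[:, :n_components] * s[:n_components]

# Toy churn-style categorical attributes
df = pd.DataFrame({"plan": ["a", "b", "a", "c"], "region": ["n", "s", "s", "n"]})
print(mca_row_coordinates(df))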
