Productionizing Machine Learning Models and Taking Care of the Neighbors

At Movile we have a Machine Learning Squad composed of the following members:

  • 1 Tech Lead (Mixed engineering and computational)
  • 2 Core ML engineers (production side)
  • 1 Data Scientist (with statistical background) – (data analysis and prototyping side)
  • 1 Data Scientist (with computational background) – (data analysis and prototyping side)

As you can see, the team has a mix of backgrounds. To keep the whole workflow as productive and smooth as possible, we need good fences (a.k.a. a crystal-clear view of each role), which also keeps everyone motivated.

This article by Jhonatan Morra offers a good perspective on this, and it reflects how we deal with it at Movile.

Here are some quotes:

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.
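A minimal sketch of the separation of concerns described in the quote, assuming a hypothetical churn model. Every name, feature, and weight below is made up for illustration; the point is only that the scoring side sees one function from user data to a probability, while featurization and the model itself stay hidden behind it.

```python
import math
from typing import Callable, Dict, List

# Hypothetical illustration: all names, features, and weights are assumptions.
User = Dict[str, float]


def featurize(user: User) -> List[float]:
    """Training-side detail: turn raw user data into a feature vector."""
    return [
        user.get("days_since_last_login", 0.0) / 30.0,
        user.get("failed_payments", 0.0),
    ]


def build_churn_scorer(weights: List[float], bias: float) -> Callable[[User], float]:
    """Package featurization and the model behind one function: user -> probability."""

    def score(user: User) -> float:
        z = bias + sum(w * x for w, x in zip(weights, featurize(user)))
        # A logistic link stands in for whatever model the data team trains.
        return 1.0 / (1.0 + math.exp(-z))

    return score


# The scoring team only depends on this callable. Swapping the logistic
# model for a random forest would not change the signature they call.
churn_probability = build_churn_scorer(weights=[1.2, 0.8], bias=-2.0)
p = churn_probability({"days_since_last_login": 45, "failed_payments": 1})
print(f"predicted churn probability: {p:.3f}")
```

The design choice is the closure: the data team can retrain or replace everything inside `build_churn_scorer` without the production side changing a line.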

Best price to bill approach

For anyone looking for a first approach to finding the best time to bill customers for subscription services, this paper can be a good start.

In my current company, this is a very challenging problem.

Machine-Learning System For Recurring Subscription Billing

Jack Greenberg, Thomas Price

Abstract: A system and method for recurring billing of periodic subscriptions are disclosed. The system attempts to maximize a metric like long term customer retention while tailoring the subscription billing to the customer, using machine learning. The system is initially trained with a set of training data — a large corpus of records of subscription billings — including successes, billing failures, and customer cancellations. Any available metadata about the users or the type of subscription is also attached and may be used as features for the machine learning model. Such metadata may include, for example, customers’ age, gender, demographics, interests, and online behavioral profile/history, as well as metadata to identify the type of service being billed, such as music subscriptions, delivery subscriptions or other types of subscriptions, or the payment instrument. The system is used to predict the subscription model for a given user with relevant user-related constraints, while optimizing acceptability to that user.

Algorithm over Regulations (?)

This scene is the best thing I can relate to this particular topic.

“But, the bells have already been rung and they’ve heard it. Out in the dark. Among the stars. Ding dong, the God is dead. The bells, cannot be unrung! He’s hungry. He’s found us. And He’s coming!

Ding, ding, ding, ding, ding…”

(Hint, fellas: this is a great time to not be evil and to check your models, so as to avoid any kind of discrimination against your current or potential customers.)

European Union regulations on algorithmic decision-making and a “right to explanation” – By Bryce Goodman, Seth Flaxman

Abstract: We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which “significantly affect” users. The law will also effectively create a “right to explanation,” whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.

Conclusion: While the GDPR presents a number of problems for current applications in machine learning they are, we believe, good problems to have. The challenges described in this paper emphasize the importance of work that ensures that algorithms are not merely efficient, but transparent and fair. Research is underway in pursuit of rendering algorithms more amenable to ex post and ex ante inspection [11, 31, 20]. Furthermore, a number of recent studies have attempted to tackle the issue of discrimination within algorithms by introducing tools to both identify [5, 29] and rectify [9, 16, 32, 6, 12, 14] cases of unwanted bias. It remains to be seen whether these techniques are adopted in practice. One silver lining of this research is to show that, for certain types of algorithmic profiling, it is possible to both identify and implement interventions to correct for discrimination. This is in contrast to cases where discrimination arises from human judgment. The role of extraneous and ethically inappropriate factors in human decision making is well documented (e.g., [30, 10, 1]), and discriminatory decision making is pervasive in many of the sectors where algorithmic profiling might be introduced (e.g. [19, 7]). We believe that, properly applied, algorithms can not only make more accurate predictions, but offer increased transparency and fairness over their human counterparts (cf. [23]). Above all else, the GDPR is a vital acknowledgement that, when algorithms are deployed in society, few if any decisions are purely “technical”. Rather, the ethical design of algorithms requires coordination between technical and philosophical resources of the highest caliber. A start has been made, but there is far to go. And, with less than two years until the GDPR takes effect, the clock is ticking.

Densely Connected Convolutional Networks – implementations

Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections – one between each layer and its subsequent layer – our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance.
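To make the L(L+1)/2 figure concrete, here is a small sketch that counts the direct connections by enumeration and checks them against the formula. The second function assumes that each layer emits a fixed number of new feature-maps (called `growth_rate` here) and that inputs are combined by concatenation; the abstract does not spell this out, so treat those names and the example numbers as assumptions.

```python
from typing import List


def direct_connections(num_layers: int) -> int:
    """Closed form from the abstract: L(L+1)/2 direct connections."""
    return num_layers * (num_layers + 1) // 2


def count_by_enumeration(num_layers: int) -> int:
    """Layer l (1-based) reads the block input plus the l-1 layers before it."""
    return sum(layer for layer in range(1, num_layers + 1))


def input_channels(initial_channels: int, growth_rate: int, num_layers: int) -> List[int]:
    """Channels seen by each layer when all preceding feature-maps are concatenated.

    Assumption: every layer adds `growth_rate` feature-maps.
    """
    return [initial_channels + i * growth_rate for i in range(num_layers)]


for L in (1, 4, 12):
    assert direct_connections(L) == count_by_enumeration(L)
    print(L, "layers ->", direct_connections(L), "direct connections")

# Hypothetical sizes: 16 input channels, 12 new feature-maps per layer.
print(input_channels(initial_channels=16, growth_rate=12, num_layers=4))
```

The channel list grows linearly while each layer only adds a few feature-maps of its own, which is one way to read the abstract's claim about feature reuse and fewer parameters.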

Matching and the use of regressions to analyze treatment effects

One of the thorniest subjects when we talk about statistics for estimating across populations with different characteristics is Matching.

For those who don't know, Matching is basically a technique for observational comparison between a control group and a treatment group, applied to each specific observation of the two groups (i.e. for each member of the treatment group, an estimate is made in parallel with a member of the control group and the differences between the estimates are observed). The main goal is to assess the effects of the treatment while taking the characteristics of the observed data into account, either isolating the differences between covariates or incorporating them into the analysis.

One example of its application is a study by IPEA that estimates poor and indigent populations, mapping similar socioeconomic characteristics across the set of participating families.

In this post, Matt Bogard makes some observations about regression as a variance-based weighted average of the treatment effect.

Hence, regression gives us a variance based weighted average treatment effect, whereas matching provides a distribution weighted average treatment effect.

So what does this mean in practical terms? Angrist and Pischke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is the greatest, or where there are an equal number of treated and control units. They state that differences matter little when the variation of δx is minimal across covariate combinations.

In his post The cardinal sin of matching, Chris Blattman puts it this way:

“For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all….Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate”

We can see that those in the treatment group tend to have higher outcome values so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:

$$E[Y_i \mid d_i = 1] - E[Y_i \mid d_i = 0] = E[Y_{1i} - Y_{0i}] + \{E[Y_{0i} \mid d_i = 1] - E[Y_{0i} \mid d_i = 0]\}$$

However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward then both matching and regression have significantly reduced it, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course if we run the regression including only matched variables, we get exactly the same results. (see R code below). This is not so different than the method of trimming based on propensity scores suggested in Angrist and Pischke.
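The difference between the two weighting schemes can be sketched on a toy dataset. The data below is hypothetical and hand-built (it does not reproduce the numbers quoted above): one binary covariate, a treatment effect of 1 in one cell and 2 in the other, and treated units concentrated in the high-outcome cell. The regression coefficient is computed through the Frisch-Waugh-Lovell shortcut (demean the treatment within each cell, then regress the outcome on that residual), which is equivalent to OLS of y on d plus cell dummies.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy data: (treated d, covariate cell x, outcome y).
# Effect is 1 in cell x=0 and 2 in cell x=1; most treated units sit in x=1.
rows = (
    [(1, 0, 1.0)] + [(0, 0, 0.0)] * 3
    + [(1, 1, 6.0)] * 3 + [(0, 1, 4.0)]
)


def cell_outcomes(data):
    """Group outcomes by covariate cell and treatment status."""
    cells = defaultdict(lambda: {0: [], 1: []})
    for d, x, y in data:
        cells[x][d].append(y)
    return cells


def naive_difference(data):
    """Raw treated-minus-control difference, contaminated by selection bias."""
    treated = [y for d, _, y in data if d == 1]
    control = [y for d, _, y in data if d == 0]
    return mean(treated) - mean(control)


def exact_matching_att(data):
    """Within-cell differences weighted by the distribution of treated units."""
    cells = cell_outcomes(data)
    n_treated = sum(len(c[1]) for c in cells.values())
    return sum(
        len(c[1]) / n_treated * (mean(c[1]) - mean(c[0])) for c in cells.values()
    )


def regression_coefficient(data):
    """OLS coefficient on d in y ~ d + cell dummies (Frisch-Waugh-Lovell)."""
    cells = cell_outcomes(data)
    share = {x: len(c[1]) / (len(c[0]) + len(c[1])) for x, c in cells.items()}
    numerator = sum((d - share[x]) * y for d, x, y in data)
    denominator = sum((d - share[x]) ** 2 for d, x, _ in data)
    return numerator / denominator


print("naive difference:", naive_difference(rows))          # 3.75
print("exact matching  :", exact_matching_att(rows))        # 1.75
print("regression      :", regression_coefficient(rows))    # 1.5
```

Both cells have the same conditional variance of treatment status here, so regression weights them equally (1.5), while matching follows the treated units, three quarters of which sit in the cell with the larger effect (1.75).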

A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction

A good paper on applying cross-validation to time series.

Abstract: One of the most widely used standard procedures for model evaluation in classification and regression is K-fold cross-validation (CV). However, when it comes to time series forecasting, because of the inherent serial correlation and potential non-stationarity of the data, its application is not straightforward and often omitted by practitioners in favour of an out-of-sample (OOS) evaluation. In this paper, we show that in the case of a purely autoregressive model, the use of standard K-fold CV is possible as long as the models considered have uncorrelated errors. Such a setup occurs, for example, when the models nest a more appropriate model. This is very common when Machine Learning methods are used for prediction, where CV in particular is suitable to control for overfitting the data. We present theoretical insights supporting our arguments. Furthermore, we present a simulation study and a real-world example where we show empirically that K-fold CV performs favourably compared to both OOS evaluation and other time-series-specific techniques such as non-dependent cross-validation.

Conclusions: In this work we have investigated the use of cross-validation procedures for time series prediction evaluation when purely autoregressive models are used, which is a very common situation; e.g., when using Machine Learning procedures for time series forecasting. In a theoretical proof, we have shown that a normal K-fold cross-validation procedure can be used if the residuals of our model are uncorrelated, which is especially the case if the model nests an appropriate model. In the Monte Carlo experiments, we have shown empirically that even if the lag structure is not correct, as long as the data are fitted well by the model, cross-validation without any modification is a better choice than OOS evaluation. We have then in a real-world data example shown how these findings can be used in a practical situation. Cross-validation can adequately control overfitting in this application, and only if the models underfit the data and lead to heavily correlated errors, are the cross-validation procedures to be avoided as in such a case they may yield a systematic underestimation of the error. However, this case can be easily detected by checking the residuals for serial correlation, e.g., using the Ljung-Box test.
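A rough sketch of the setup the paper discusses, under simplifying assumptions of my own: a simulated AR(1) series, an AR(1) model without intercept fitted by least squares, and plain Python instead of a forecasting library. The series is embedded into (lagged value, target) rows, standard K-fold CV shuffles those rows, and the out-of-sample estimate uses only the last block. The Ljung-Box Q statistic is computed on the residuals; turning it into a p-value needs a chi-square table or library, which is left out here.

```python
import random
from statistics import mean


def simulate_ar1(n, phi, seed=0):
    """Simulated series y[t] = phi * y[t-1] + standard normal noise."""
    rng = random.Random(seed)
    series = [0.0]
    for _ in range(n - 1):
        series.append(phi * series[-1] + rng.gauss(0.0, 1.0))
    return series


def embed(series):
    """Turn the series into (lagged value, target) rows: (y[t-1], y[t])."""
    return list(zip(series[:-1], series[1:]))


def fit_ar1(rows):
    """Least-squares AR(1) coefficient, no intercept."""
    return sum(x * y for x, y in rows) / sum(x * x for x, _ in rows)


def mse(rows, phi):
    return mean((y - phi * x) ** 2 for x, y in rows)


def kfold_cv_mse(rows, k=5, seed=0):
    """Standard K-fold CV: rows are shuffled, ignoring their time order."""
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    errors = []
    for fold in (idx[i::k] for i in range(k)):
        held_out = set(fold)
        train = [rows[i] for i in idx if i not in held_out]
        test = [rows[i] for i in fold]
        errors.append(mse(test, fit_ar1(train)))
    return mean(errors)


def oos_mse(rows, test_fraction=0.2):
    """Out-of-sample evaluation: fit on the first block, test on the last."""
    split = int(len(rows) * (1 - test_fraction))
    return mse(rows[split:], fit_ar1(rows[:split]))


def ljung_box_q(residuals, lags=5):
    """Ljung-Box Q = n(n+2) * sum_k r_k^2 / (n-k); large values flag correlation."""
    n = len(residuals)
    m = mean(residuals)
    c0 = sum((r - m) ** 2 for r in residuals)
    total = 0.0
    for k in range(1, lags + 1):
        ck = sum((residuals[t] - m) * (residuals[t - k] - m) for t in range(k, n))
        total += (ck / c0) ** 2 / (n - k)
    return n * (n + 2) * total


rows = embed(simulate_ar1(n=500, phi=0.6))
phi_hat = fit_ar1(rows)
residuals = [y - phi_hat * x for x, y in rows]
print("estimated phi :", round(phi_hat, 3))
print("5-fold CV MSE :", round(kfold_cv_mse(rows), 3))
print("OOS MSE       :", round(oos_mse(rows), 3))
print("Ljung-Box Q   :", round(ljung_box_q(residuals), 3))
```

If the fitted model were too small for the data (for example an AR(1) on an AR(3) process), the residuals would be serially correlated and Q would grow, which is the case where the paper advises against plain K-fold CV.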

cv-wp

Loan products and Credit Scoring Methods by Commercial Banks

Abstract – This study describes the loan products offered by the commercial banks and credit scoring techniques used for classifying risks and granting credit to the applicants in India. The loan products offered by commercial banks are: Housing loans, Personal loans, Business loan, Education loans, Vehicle loans etc. All the loan products are categorized as secured and unsecured loans. Credit scoring techniques used for both secured as well as unsecured loans are broadly divided into two categories as Advanced Statistical Methods and Traditional Statistical Methods.

Conclusion: In a new or emerging market, the operational, technical, business and cultural issues should be considered with the implementation of the credit scoring models for retail loan products. The operational issues relate to the use of the model and it is imperative that the staff and the management of the bank understand the purpose of the model. Application scoring models should be used for making credit decisions on new applications and behavioral models for retail loan products to supervise existing borrowers for limit expansion or for marketing of new products. The technical issues relate to the development of proper infrastructure, maintenance of historical data and software needed to build a credit scoring model for retail loan products within the bank. The business issues relate to whether the soundness and safety of the banks could be achieved through the adoption of the quantitative credit decision models, which would send a positive impact in the banking sector. The cultural issues relate to making credit irrespective of race, colour, sex, religion, marital status, age or ethnic origin. Further, the models have to be validated so as to ensure that the model performance is compatible in meeting the business as well as the regulatory requirements. Thus, the above issues have to be considered while developing and implementing credit scoring models for retail loan products within a new or emerging markets.

Forecasting extreme events at Uber with LSTM

We will soon have a few posts on this subject here on the blog; for now, this is an ML-plus-engineering case from people who are doing great work with quite advanced methods on scalable architectures.

ENGINEERING EXTREME EVENT FORECASTING AT UBER WITH RECURRENT NEURAL NETWORKS – BY NIKOLAY LAPTEV, SLAWEK SMYL, & SANTHOSH SHANMUGAM

We ultimately settled on conducting time series modeling based on the Long Short Term Memory (LSTM) architecture, a technique that features end-to-end modeling, ease of incorporating external variables, and automatic feature extraction abilities. By providing a large amount of data across numerous dimensions, an LSTM approach can model complex nonlinear feature interactions.

We decided to build a neural network architecture that provides single-model, heterogeneous forecasting through an automatic feature extraction module. As Figure 4 demonstrates, the model first primes the network by automatic, ensemble-based feature extraction. After feature vectors are extracted, they are averaged using a standard ensemble technique. The final vector is then concatenated with the input to produce the final forecast.

During testing, we were able to achieve a 14.09 percent symmetric mean absolute percentage error (SMAPE) improvement over the base LSTM architecture and over 25 percent improvement over the classical time series model used in Argos, Uber’s real-time monitoring and root cause-exploration tool.
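For readers who have not met SMAPE before, here is a small sketch of one common definition. The quoted post does not say which variant was used, so the formula below (absolute error divided by the mean of the absolute actual and forecast values, averaged and expressed in percent) is an assumption, and the numbers are made up.

```python
from typing import Sequence


def smape(actual: Sequence[float], forecast: Sequence[float]) -> float:
    """SMAPE in percent: mean of |F - A| / ((|A| + |F|) / 2).

    Assumption: this is one common variant; others drop the division by 2.
    """
    terms = []
    for a, f in zip(actual, forecast):
        denominator = (abs(a) + abs(f)) / 2.0
        # Convention used here: a 0/0 term counts as zero error.
        terms.append(0.0 if denominator == 0 else abs(f - a) / denominator)
    return 100.0 * sum(terms) / len(terms)


# Hypothetical values, only to show the call.
actual = [100.0, 200.0, 150.0]
forecast = [110.0, 180.0, 150.0]
print(f"SMAPE: {smape(actual, forecast):.2f}%")
```

Because the error is scaled by the size of both the actual and the forecast value, the metric is bounded (0 to 200 percent in this variant), which makes it convenient for comparing series of very different magnitudes.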

Improving the Forecasts of European Regional Banks’ Profitability with Machine Learning Algorithms

Abstract: Regional banks as savings and cooperative banks are widespread in continental Europe. In the aftermath of the financial crisis, however, they had problems keeping their profitability which is an important quantitative indicator for the health of a bank and the banking sector overall. We use a large data set of bank-level balance sheet items and regional economic variables to forecast profitability for about 2,000 regional banks. Machine learning algorithms are able to beat traditional estimators as ordinary least squares as well as autoregressive models in forecasting performance.

Conclusion: In the aftermath of the financial crisis regional banks had problems keeping up their profitability. Banks’ profitability is an important indicator for the stability of the banking sector. We use a data set of bank-level balance sheet items and regional economic variables to forecast profitability. For the 2,000 savings and cooperative banks from eight European countries and the 2000-2015 time period, we found that machine learning algorithms are able to beat traditional estimators as ordinary least squares as well as autoregressive models in forecasting performance. Therefore, our paper is in line with the literature on machine learning models and their superior forecasting performance (Khandani et al., 2010; Butaru et al., 2016; Fitzpatrick & Mues, 2016). The performance of the machine learning algorithms was particularly well during the European debt crisis which points out the importance of our forecasting exercise as during this time policy makers’ interest in banks’ profitability was enhanced as further potential rescue packages for banks could deteriorate fiscal stability. Policy makers and, especially, regulators should therefore use these algorithms instead of traditional estimators in combination with their even larger regulatory data sets in regard to size and frequency to forecast banks’ profitability or other balance sheet items of interest.

Paid for multiple GPUs on Azure and can't use them for Deep Learning? This post is for you.

You went to the Azure portal and grabbed an NC24R with a wonderful 224 GB of memory, 24 cores, more than 1 TB of disk and, best of all, 4 K80 cards for your complete delight when training Deep Learning models.

Everything perfect, right? Almost.

Right at the start I tried running a training script and, with a simple htop to monitor it, saw that TensorFlow was dumping all of the training onto the CPUs.

Even with those 24 wonderful processors hitting 100% utilization, that doesn't come close to what our mammoth GPUs can deliver. (Note: you wouldn't trade 4 2017 Ferraris for 24 1985 Fiat 147s, right?)

Logging into our wonderful machine to see what had happened, I first checked whether the GPUs were present on the machine, which they were.

azure_teste@deep-learning:~$ nvidia-smi
Tue Jun 27 18:21:05 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.66                 Driver Version: 375.66                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | B2A3:00:00.0     Off |                    0 |
| N/A   47C    P0    71W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           Off  | C4D8:00:00.0     Off |                    0 |
| N/A   57C    P0    61W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla K80           Off  | D908:00:00.0     Off |                    0 |
| N/A   52C    P0    56W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla K80           Off  | EEAF:00:00.0     Off |                    0 |
| N/A   42C    P0    69W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
azure_teste@deep-learning:~/deep-learning-moderator-msft$ lspci | grep -i NVIDIA
b2a3:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
c4d8:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
d908:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
eeaf:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
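The same check can be scripted. This is a minimal sketch, assuming `nvidia-smi` is on the PATH and using its `--query-gpu` and `--format=csv,noheader` options; the sample line in the usage note is an assumed example of that output, not something captured from this machine.

```python
import shutil
import subprocess
from typing import List, Tuple


def parse_gpu_csv(output: str) -> List[Tuple[str, str]]:
    """Parse `name, memory.total` CSV lines into (name, memory) pairs."""
    gpus = []
    for line in output.strip().splitlines():
        name, memory = [field.strip() for field in line.split(",", 1)]
        gpus.append((name, memory))
    return gpus


def list_gpus() -> List[Tuple[str, str]]:
    """GPUs reported by the driver, or an empty list if the query is not possible."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        return []
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    for index, (name, memory) in enumerate(list_gpus()):
        print(index, name, memory)
```

For an assumed output line such as `Tesla K80, 11439 MiB`, `parse_gpu_csv` returns `[("Tesla K80", "11439 MiB")]`. Note that this only confirms the driver sees the cards; it says nothing about whether TensorFlow can use them, which is the actual problem in this post.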

OK, the Tesla K80s are ready for battle. However, Azure itself acknowledges that there are problems with the setup process as a whole, and according to the documentation fixing it takes just a few simple procedures in two steps.

STEP 1

1) Clone the Git repository (yes, I forked it, because these things routinely disappear for reasons we don't always know).

$ git clone https://github.com/leestott/Azure-GPU-Setup.git

2) Enter the cloned folder

$ cd Azure-GPU-Setup

3) Run the script that installs some NVIDIA libs and then reboots the server.

$ bash gpu-setup-part1.sh

STEP 2

1) Go to the Git repository folder

$ cd Azure-GPU-Setup

2) Run the second script, which installs TensorFlow, the CUDA Toolkit, and cuDNN, and also sets a number of environment variables.

$ bash gpu-setup-part2.sh

3) Then test the installation

$ python gpu-test.py

After that, just enjoy your GPUs at full load and use them to train your models.

How to create a Python virtualenv with no bullshit

Via Eiti Kimura.

Straight to the point:

1) Install virtualenv with pip

$ pip install virtualenv

2) Create your directory

$ mkdir deep-learning-virtual-env

3) Then enter the directory

$ cd deep-learning-virtual-env

4) Initialize your virtualenv

$ virtualenv .

5) Then activate your virtualenv

$ source bin/activate

6) To make your life easier, we even created a requirements file with Theano, Keras, Jupyter Notebook, and Scikit-Learn. To install them, just run the following command:

$ pip install -r requirements.key

requirements.key
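If you would rather script the environment creation, here is a sketch that uses the standard library's `venv` module instead of the `virtualenv` package from the steps above. That is a swapped-in tool, not what the post uses, and it needs Python 3. The environment is created in a temporary directory without pip just to keep the example quick; for real use you would pick your own path and pass `with_pip=True`.

```python
import tempfile
import venv
from pathlib import Path


def create_env(directory: Path) -> Path:
    """Create a virtual environment in `directory` and return its pyvenv.cfg path."""
    # with_pip=False keeps the sketch fast; use with_pip=True for a usable env.
    venv.create(directory, with_pip=False)
    return directory / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    cfg = create_env(Path(tmp) / "deep-learning-virtual-env")
    created = cfg.exists()
    print("environment created:", created)
```

Activation works the same way as in step 5: source `bin/activate` inside the environment directory.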

Stanford CoreNLP – Core natural language software

Stanford CoreNLP provides a set of natural language analysis tools. It can give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, indicate which noun phrases refer to the same entities, indicate sentiment, extract particular or open-class relations between entity mentions, get quotes people said, etc.

Choose Stanford CoreNLP if you need:

  • An integrated toolkit with a good range of grammatical analysis tools
  • Fast, reliable analysis of arbitrary texts
  • The overall highest quality text analytics
  • Support for a number of major (human) languages
  • Available interfaces for most major modern programming languages
  • Ability to run as a simple web service

Stanford CoreNLP’s goal is to make it very easy to apply a bunch of linguistic analysis tools to a piece of text. A tool pipeline can be run on a piece of plain text with just two lines of code. CoreNLP is designed to be highly flexible and extensible. With a single option you can change which tools should be enabled and which should be disabled. Stanford CoreNLP integrates many of Stanford’s NLP tools, including the part-of-speech (POS) tagger, the named entity recognizer (NER), the parser, the coreference resolution system, sentiment analysis, bootstrapped pattern learning, and the open information extraction tools. Moreover, an annotator pipeline can include additional custom or third-party annotators. CoreNLP’s analyses provide the foundational building blocks for higher-level and domain-specific text understanding applications.

Professor Jürgen Schmidhuber's page

For those who don't know, Professor Jürgen Schmidhuber is one of the pioneers of applied Deep Learning research.

In one of his talks he states that Deep Learning networks date back to 1971, when Aleksei Ivakhnenko and Valentin Lapa published the first work with an 8-layer network.

For those who want to know more about his ideas on the outlook for Deep Learning, this Ask Me Anything (AMA) on Reddit is quite good.

And his systematic review of Deep Learning is available for download here in the post.

cse597g-deep_learning

SLIM: Sparse Linear Methods for Top-N Recommender Systems

A great article with a theoretical foundation on generating top-N recommendations in very sparse scenarios (e.g. a 0-5 rating system in which few people actually leave a rating).

Recently, this problem of recommending within a very sparse matrix was the reason Netflix changed its rating system from a 1-to-5 scale to thumbs up or thumbs down.

In any case, it is worth reading to see how the authors approach this kind of challenge.

Abstract: This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
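The "aggregating from user purchase/rating profiles" step can be sketched as follows. The matrix `W` below is hand-written for illustration, not learned; in SLIM it would come out of the regularized optimization problem, which this sketch does not attempt. The only structure assumed is an item-by-item matrix with non-negative entries and a zero diagonal, so an item cannot recommend itself.

```python
from typing import List

Matrix = List[List[float]]


def slim_scores(user_row: List[float], W: Matrix) -> List[float]:
    """Score every item j as sum_i a_ui * W[i][j] (the user's row times W)."""
    n_items = len(W)
    return [sum(user_row[i] * W[i][j] for i in range(n_items)) for j in range(n_items)]


def top_n(user_row: List[float], W: Matrix, n: int) -> List[int]:
    """Indices of the n best-scoring items the user has not interacted with."""
    scores = slim_scores(user_row, W)
    candidates = [j for j, seen in enumerate(user_row) if seen == 0]
    return sorted(candidates, key=lambda j: scores[j], reverse=True)[:n]


# Hypothetical sparse coefficient matrix for 4 items (zero diagonal).
W = [
    [0.0, 0.5, 0.0, 0.1],
    [0.4, 0.0, 0.3, 0.0],
    [0.0, 0.2, 0.0, 0.6],
    [0.0, 0.0, 0.7, 0.0],
]
user = [1, 1, 0, 0]  # the user interacted with items 0 and 1

print(slim_scores(user, W))  # [0.4, 0.5, 0.3, 0.1]
print(top_n(user, W, n=1))   # [2]
```

Because most entries of `W` are zero, a real implementation would store it as a sparse matrix and skip the zero terms, which is where the speed mentioned in the abstract comes from.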

Local Item-Item Models For Top-N Recommendation

This is one of the theoretical secrets behind Netflix: why treat every customer as computationally different if some of them have similar preferences?

Abstract: Item-based approaches based on SLIM (Sparse LInear Methods) have demonstrated very good performance for top-N recommendation; however they only estimate a single model for all the users. This work is based on the intuition that not all users behave in the same way — instead there exist subsets of like-minded users. By using different item-item models for these user subsets, we can capture differences in their preferences and this can lead to improved performance for top-N recommendations. In this work, we extend SLIM by combining global and local SLIM models. We present a method that computes the prediction scores as a user-specific combination of the predictions derived by a global and local item-item models. We present an approach in which the global model, the local models, their user-specific combination, and the assignment of users to the local models are jointly optimized to improve the top-N recommendation performance. Our experiments show that the proposed method improves upon the standard SLIM model and outperforms competing top-N recommendation approaches.

RLScore: Regularized Least-Squares Learners

A good alternative for ensembles when the dimensionality of the datasets is high, or when alternatives such as Elastic Net, Lasso, and Ridge do not give the desired convergence.

RLScore: Regularized Least-Squares Learners

RLScore is a Python open source module for kernel based machine learning. The library provides implementations of several regularized least-squares (RLS) type of learners. RLS methods for regression and classification, ranking, greedy feature selection, multi-task and zero-shot learning, and unsupervised classification are included. Matrix algebra based computational short-cuts are used to ensure efficiency of both training and cross-validation. A simple API and extensive tutorials allow for easy use of RLScore.

Regularized least squares (RLS) is a family of methods for solving the least-squares problem while using regularization to further constrain the resulting solution.

RLS is used for two main reasons. The first comes up when the number of variables in the linear system exceeds the number of observations. In such settings, the ordinary least-squares problem is ill-posed and is therefore impossible to fit because the associated optimization problem has infinitely many solutions. RLS allows the introduction of further constraints that uniquely determine the solution.

The second reason that RLS is used occurs when the number of variables does not exceed the number of observations, but the learned model suffers from poor generalization. RLS can be used in such cases to improve the generalizability of the model by constraining it at training time. This constraint can either force the solution to be “sparse” in some way or to reflect other prior knowledge about the problem such as information about correlations between features. A Bayesian understanding of this can be reached by showing that RLS methods are often equivalent to priors on the solution to the least-squares problem.
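The first reason can be shown in a few lines. This is a plain ridge (ℓ2-regularized) least-squares sketch in NumPy, not RLScore's implementation, and it assumes Python 3 for the `@` operator. With one observation and two variables the ordinary least-squares problem has infinitely many solutions, while adding the regularization term makes the system solvable and the solution unique.

```python
import numpy as np


def ridge_fit(X: np.ndarray, y: np.ndarray, regparam: float = 1.0) -> np.ndarray:
    """Closed-form ridge solution: w = (X^T X + regparam * I)^-1 X^T y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + regparam * np.eye(n_features), X.T @ y)


# One observation, two variables: any w with w1 + w2 = 2 fits perfectly,
# so ordinary least squares cannot pick a single answer.
X = np.array([[1.0, 1.0]])
y = np.array([2.0])

rank = np.linalg.matrix_rank(X.T @ X)
print("rank of X^T X:", rank)  # 1, fewer than the 2 variables, so it is singular

w = ridge_fit(X, y, regparam=1.0)
print("ridge solution:", w)    # both weights equal 2/3
```

Larger values of `regparam` shrink the weights further toward zero, which is the constraint-at-training-time idea behind the second reason as well.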

To see in depth

Installation
1) $ pip install rlscore
2) $ export CFLAGS="-I /usr/local/lib/python2.7/site-packages/numpy/core/include $CFLAGS"

Original post

In [1]:
# Import libraries
import numpy as np
from rlscore.learner import RLS
from rlscore.measure import sqerror
from rlscore.learner import LeaveOneOutRLS
In [2]:
# Function to load dataset and split in train and test sets
def load_housing():
    np.random.seed(1)
    D = np.loadtxt("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/housing_data.txt")
    np.random.shuffle(D)
    X = D[:,:-1] # Independent variables
    Y = D[:,-1]  # Dependent variable
    X_train = X[:250]
    Y_train = Y[:250]
    X_test = X[250:]
    Y_test = Y[250:]
    return X_train, Y_train, X_test, Y_test
In [3]:
def print_stats():
    X_train, Y_train, X_test, Y_test = load_housing()
    print("Housing data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__ == "__main__":
    print_stats()
Housing data set characteristics
Training set: 250 instances, 13 features
Test set: 256 instances, 13 features

Linear regression with default parameters

In [4]:
# Function to train RLS method
def train_rls():
    #Trains RLS with default parameters (regparam=1.0, kernel='LinearKernel')
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train)
    
    #Leave-one-out cross-validation predictions, this is fast due to
    #computational short-cut
    P_loo = learner.leave_one_out()
    
    #Test set predictions
    P_test = learner.predict(X_test)
    
    # Stats
    print("leave-one-out error %f" %sqerror(Y_train, P_loo))
    print("test error %f" %sqerror(Y_test, P_test))
    
    #Sanity check, can we do better than predicting mean of training labels?
    print("mean predictor %f" %sqerror(Y_test, np.ones(Y_test.shape)*np.mean(Y_train)))

if __name__=="__main__":
    train_rls()
leave-one-out error 25.959399
test error 25.497222
mean predictor 81.458770
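
The comments in the cell above mention a computational short-cut for leave-one-out. For linear ridge regression without a bias term, one well-known identity gives all leave-one-out predictions from a single fit through the diagonal of the hat matrix. The sketch below checks that identity against a naive refit loop; both helper functions and the toy data are hypothetical, and this is not a claim about how RLScore implements `leave_one_out()` internally.

```python
import numpy as np

def ridge_loo_predictions(X, y, regparam=1.0):
    """Leave-one-out predictions for ridge regression from one fit."""
    n_features = X.shape[1]
    # H maps labels to in-sample predictions: p = H @ y
    H = X @ np.linalg.solve(X.T @ X + regparam * np.eye(n_features), X.T)
    h = np.diag(H)
    p = H @ y
    # Identity: p_loo[i] = (p[i] - h[i] * y[i]) / (1 - h[i])
    return (p - h * y) / (1.0 - h)

def ridge_loo_naive(X, y, regparam=1.0):
    """Reference version: refit n times, holding out one sample each time."""
    n, d = X.shape
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        Xi, yi = X[keep], y[keep]
        w = np.linalg.solve(Xi.T @ Xi + regparam * np.eye(d), Xi.T @ yi)
        out[i] = X[i] @ w
    return out

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)
print(np.allclose(ridge_loo_predictions(X, y), ridge_loo_naive(X, y)))
```

The short-cut needs one linear solve instead of one per training example, which is why trying many regularization values stays cheap.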

Choosing regularization parameter with leave-one-out

The regularization parameter is chosen by grid search over an exponential grid (2**-15 to 2**15), keeping the value with the lowest leave-one-out cross-validation (LOO-CV) error.

In [5]:
def train_rls():
    #Select regparam with leave-one-out cross-validation
    X_train, Y_train, X_test, Y_test = load_housing()
    learner = RLS(X_train, Y_train)
    best_regparam = None
    best_error = float("inf")
   
    #exponential grid of possible regparam values
    log_regparams = range(-15, 16)
    for log_regparam in log_regparams:
        regparam = 2.**log_regparam
        
        #RLS is re-trained with the new regparam, this
        #is very fast due to computational short-cut
        learner.solve(regparam)
        
        #Leave-one-out cross-validation predictions, this is fast due to
        #computational short-cut
        P_loo = learner.leave_one_out()
        e = sqerror(Y_train, P_loo)
        print("regparam 2**%d, loo-error %f" %(log_regparam, e))
        if e < best_error:
            best_error = e
            best_regparam = regparam
    learner.solve(best_regparam)
    P_test = learner.predict(X_test)
    print("best regparam %f with loo-error %f" %(best_regparam, best_error)) 
    print("test error %f" %sqerror(Y_test, P_test))

if __name__=="__main__":
    train_rls()
regparam 2**-15, loo-error 24.745479
regparam 2**-14, loo-error 24.745463
regparam 2**-13, loo-error 24.745431
regparam 2**-12, loo-error 24.745369
regparam 2**-11, loo-error 24.745246
regparam 2**-10, loo-error 24.745010
regparam 2**-9, loo-error 24.744576
regparam 2**-8, loo-error 24.743856
regparam 2**-7, loo-error 24.742982
regparam 2**-6, loo-error 24.743309
regparam 2**-5, loo-error 24.750966
regparam 2**-4, loo-error 24.786243
regparam 2**-3, loo-error 24.896991
regparam 2**-2, loo-error 25.146493
regparam 2**-1, loo-error 25.537315
regparam 2**0, loo-error 25.959399
regparam 2**1, loo-error 26.285436
regparam 2**2, loo-error 26.479254
regparam 2**3, loo-error 26.603001
regparam 2**4, loo-error 26.801196
regparam 2**5, loo-error 27.352322
regparam 2**6, loo-error 28.837002
regparam 2**7, loo-error 32.113350
regparam 2**8, loo-error 37.480625
regparam 2**9, loo-error 43.843555
regparam 2**10, loo-error 49.748687
regparam 2**11, loo-error 54.912297
regparam 2**12, loo-error 59.936226
regparam 2**13, loo-error 65.137825
regparam 2**14, loo-error 70.126118
regparam 2**15, loo-error 74.336978
best regparam 0.007812 with loo-error 24.742982
test error 24.509981

Training with RLS and simultaneously selecting the regularization parameter with leave-one-out using LeaveOneOutRLS

In [6]:
def train_rls():
    #Trains RLS with automatically selected regularization parameter
    X_train, Y_train, X_test, Y_test = load_housing()
    
    # Grid search
    regparams = [2.**i for i in range(-15, 16)]
    learner = LeaveOneOutRLS(X_train, Y_train, regparams = regparams)
    loo_errors = learner.cv_performances
    P_test = learner.predict(X_test)
    print("leave-one-out errors " +str(loo_errors))
    print("chosen regparam %f" %learner.regparam)
    print("test error %f" %sqerror(Y_test, P_test))

if __name__=="__main__":
    train_rls()
leave-one-out errors [ 24.74547881  24.74546295  24.74543138  24.74536884  24.74524616
  24.74501033  24.7445764   24.74385625  24.74298177  24.74330936
  24.75096639  24.78624255  24.89699067  25.14649266  25.53731465
  25.95939943  26.28543584  26.47925431  26.6030015   26.80119588
  27.35232186  28.83700156  32.11334986  37.48062503  43.84355496
  49.7486873   54.91229746  59.93622566  65.1378248   70.12611801
  74.33697809]
chosen regparam 0.007812
test error 24.509981

Learning nonlinear predictors using kernels

RLS using a non-linear kernel function.

In [7]:
def train_rls():
    #Selects both the gamma parameter for Gaussian kernel, and regparam with loocv
    X_train, Y_train, X_test, Y_test = load_housing()
    
    regparams = [2.**i for i in range(-15, 16)]
    gammas = regparams
    best_regparam = None
    best_gamma = None
    best_error = float("inf")
    
    for gamma in gammas:
        #New RLS is initialized for each kernel parameter
        learner = RLS(X_train, Y_train, kernel="GaussianKernel", gamma=gamma)
        for regparam in regparams:
            #RLS is re-trained with the new regparam, this
            #is very fast due to computational short-cut
            learner.solve(regparam)
            
            #Leave-one-out cross-validation predictions, this is fast due to
            #computational short-cut
            P_loo = learner.leave_one_out()
            e = sqerror(Y_train, P_loo)
            
            #print("regparam", regparam, "gamma", gamma, "loo-error", e)
            if e < best_error:
                best_error = e
                best_regparam = regparam
                best_gamma = gamma
    learner = RLS(X_train, Y_train, regparam = best_regparam, kernel="GaussianKernel", gamma=best_gamma)
    P_test = learner.predict(X_test)
    print("best parameters gamma %f regparam %f" %(best_gamma, best_regparam))
    print("best leave-one-out error %f" %best_error)
    print("test error %f" %sqerror(Y_test, P_test))
    
    
if __name__=="__main__":
    train_rls()
best parameters gamma 0.000031 regparam 0.000244
best leave-one-out error 21.910837
test error 16.340877
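
To make the roles of `gamma` and `regparam` more concrete, here is a minimal NumPy sketch of regularized least squares with a Gaussian kernel. It assumes the kernel has the form exp(-gamma * ||a - b||^2) and that the dual coefficients solve (K + regparam * I) alpha = y; the function names and the sine toy data are hypothetical, and the code does not reproduce RLScore's internals.

```python
import numpy as np

def gaussian_kernel(A, B, gamma):
    """k(a, b) = exp(-gamma * ||a - b||^2) for every pair of rows in A and B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def kernel_rls_fit(X, y, regparam, gamma):
    """Dual coefficients alpha solving (K + regparam * I) alpha = y."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + regparam * np.eye(len(X)), y)

def kernel_rls_predict(X_train, alpha, X_new, gamma):
    """Prediction is a kernel-weighted sum over the training points."""
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Nonlinear toy target that a linear model cannot fit (illustrative data)
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(80, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=80)

alpha = kernel_rls_fit(X, y, regparam=0.1, gamma=1.0)
X_new = np.linspace(-3, 3, 5)[:, None]
print("predicted:", np.round(kernel_rls_predict(X, alpha, X_new, gamma=1.0), 2))
print("sin(x):   ", np.round(np.sin(X_new[:, 0]), 2))
```

A larger `gamma` makes the kernel narrower, so each training point influences only its close neighbors, while `regparam` controls how smooth the fitted function is.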

Binary classification and Area under ROC curve

In [8]:
from rlscore.utilities.reader import read_svmlight

# Load dataset and stats
def print_stats():
    X_train, Y_train, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a.t")
    X_test, Y_test, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a")
    print("Adult data set characteristics")
    print("Training set: %d instances, %d features" %X_train.shape)
    print("Test set: %d instances, %d features" %X_test.shape)

if __name__=="__main__":
    print_stats()
Adult data set characteristics
Training set: 30956 instances, 123 features
Test set: 1605 instances, 119 features
In [ ]:
from rlscore.learner import RLS
from rlscore.measure import accuracy
from rlscore.utilities.reader import read_svmlight


def train_rls():
    # Train and test datasets
    X_train, Y_train, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a.t")
    X_test, Y_test, foo = read_svmlight("/Volumes/PANZER/Github/learning-space/Datasets/02 - Classification/a1a", X_train.shape[1])
    learner = RLS(X_train, Y_train)
    best_regparam = None
    best_accuracy = 0.
    
    #exponential grid of possible regparam values
    log_regparams = range(-15, 16)
    for log_regparam in log_regparams:
        regparam = 2.**log_regparam
        #RLS is re-trained with the new regparam, this
        #is very fast due to computational short-cut
        learner.solve(regparam)
        
        #Leave-one-out cross-validation predictions, this is fast due to
        #computational short-cut
        P_loo = learner.leave_one_out()
        acc = accuracy(Y_train, P_loo)
        
        print("regparam 2**%d, loo-accuracy %f" %(log_regparam, acc))
        if acc > best_accuracy:
            best_accuracy = acc
            best_regparam = regparam
    learner.solve(best_regparam)
    P_test = learner.predict(X_test)
    
    print("best regparam %f with loo-accuracy %f" %(best_regparam, best_accuracy)) 
    print("test set accuracy %f" %accuracy(Y_test, P_test))

if __name__=="__main__":
    train_rls()
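
The section title mentions the area under the ROC curve, while the cell above measures accuracy. As a rough sketch of what AUC measures, the hypothetical helper below counts the fraction of (positive, negative) pairs that the scores rank correctly. It is written in plain NumPy for illustration, assumes labels are +1 and -1 as in the a1a files, and is not an RLScore function.

```python
import numpy as np

def auc_score(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true > 0]
    neg = scores[y_true <= 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("AUC needs at least one positive and one negative label")
    # Pairwise score differences: rows are positives, columns are negatives
    diff = pos[:, None] - neg[None, :]
    return ((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size

y = np.array([1, -1, 1, -1, -1])
p = np.array([0.9, 0.1, 0.4, 0.5, -0.3])
print(auc_score(y, p))
```

Unlike accuracy, this measure depends only on the ordering of the predictions, so it is not affected by the choice of decision threshold. The pairwise matrix grows with the product of the class sizes, so a rank-based formulation would be preferable for large datasets.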

Cleverhans – Python library for preventing noise attacks on models

Attacks on sensors, and consequently on Machine Learning models, are a very recent topic (this will be covered here at some point in the future, but this article shows well how damaging they can be).

Cleverhans is a Python library that artificially injects a bit of noise/perturbation into the network as a way of training against this kind of attack.

This repository contains the source code for cleverhans, a Python library to benchmark machine learning systems’ vulnerability to adversarial examples. You can learn more about such vulnerabilities on the accompanying blog.

The cleverhans library is under continual development, always welcoming contributions of the latest attacks and defenses. In particular, we always welcome help towards resolving the issues currently open.
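
To give an idea of what an adversarial example is, the sketch below applies a fast-gradient-sign perturbation to a tiny logistic model in plain NumPy. It deliberately does not use the cleverhans API; the weights, the input, and the helper `fgsm_perturb` are made up for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm_perturb(x, y, w, b, eps):
    """One fast-gradient-sign step against a logistic model; y is 0 or 1."""
    # Gradient of the log-loss with respect to the input x
    grad_x = (sigmoid(x @ w + b) - y) * w
    # Move every feature by eps in the direction that increases the loss
    return x + eps * np.sign(grad_x)

# Hypothetical model parameters and one input with true label 1
w = np.array([2.0, -1.0, 0.5])
b = 0.0
x = np.array([0.3, -0.2, 0.1])
y = 1

x_adv = fgsm_perturb(x, y, w, b, eps=0.3)
print("clean score:      ", sigmoid(x @ w + b))
print("adversarial score:", sigmoid(x_adv @ w + b))
```

No feature moves by more than `eps`, yet the predicted probability for the true class drops below 0.5. Perturbed inputs of this kind can also be added to the training set, which is the idea of training against the attack mentioned above.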

About the name

The name cleverhans is a reference to a presentation by Bob Sturm titled “Clever Hans, Clever Algorithms: Are Your Machine Learnings Learning What You Think?” and the corresponding publication, “A Simple Method to Determine if a Music Information Retrieval System is a ‘Horse’.” Clever Hans was a horse that appeared to have learned to answer arithmetic questions, but had in fact only learned to read social cues that enabled him to give the correct answer. In controlled settings where he could not see people’s faces or receive other feedback, he was unable to answer the same questions. The story of Clever Hans is a metaphor for machine learning systems that may achieve very high accuracy on a test set drawn from the same distribution as the training data, but that do not actually understand the underlying task and perform poorly on other inputs.
