Productionizing Machine Learning Models and taking care of the neighbors

At Movile, we have a Machine Learning Squad composed of the following members:

  • 1 Tech Lead (mixed engineering and computational background)
  • 2 Core ML engineers (production side)
  • 1 Data Scientist (with statistical background) – (data analysis and prototyping side)
  • 1 Data Scientist (with computational background) – (data analysis and prototyping side)

As you can see, the team has a mix of backgrounds, and to make the entire workflow as productive and smooth as possible we need good fences (a.k.a. a crystal-clear definition of roles) to keep everyone motivated and productive.

This article by Jhonatan Morra gives a good perspective on this, and on how we deal with it at Movile.

Here are some quotes:

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.
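
To make that separation of concerns concrete, here is a minimal Python sketch of the contract: the scoring team only ever sees an object with a single prediction method, while featurization and the random forest stay behind it. Every name here (ChurnModel, the feature fields, the toy training data) is an illustrative assumption, not Movile's actual code.

```python
# Hypothetical example: the data science team ships this object; the scoring
# team only calls predict_churn_probability() on raw user records.
from dataclasses import dataclass

import numpy as np
from sklearn.ensemble import RandomForestClassifier


@dataclass
class ChurnModel:
    """Hides featurization and the fitted model behind one stable interface."""
    model: RandomForestClassifier

    @staticmethod
    def _featurize(user: dict) -> np.ndarray:
        # Turn a raw user record into the feature vector the model expects.
        return np.array([
            user["days_since_last_login"],
            user["monthly_spend"],
            int(user["is_on_promo"]),
        ]).reshape(1, -1)

    def predict_churn_probability(self, user: dict) -> float:
        # Real-time scoring entry point: one user record in, one probability out.
        return float(self.model.predict_proba(self._featurize(user))[0, 1])


if __name__ == "__main__":
    # Toy training data standing in for the offline training pipeline.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

    churn_model = ChurnModel(rf)
    user = {"days_since_last_login": 12, "monthly_spend": 4.9, "is_on_promo": True}
    print(churn_model.predict_churn_probability(user))
```

The point of the wrapper is that the training team can retrain or even swap the model behind predict_churn_probability without the scoring side changing a single line.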

Productionizing Machine Learning Models and taking care of the neighbors

Best price to bill approach

For those looking for an initial approach to determine the best time to bill customers of subscription services, this paper can be a good start.

In my current company, this is a very challenging problem.

Machine-Learning System For Recurring Subscription Billing

Jack Greenberg, Thomas Price

Abstract: A system and method for recurring billing of periodic subscriptions are disclosed. The system attempts to maximize a metric like long term customer retention while tailoring the subscription billing to the customer, using machine learning. The system is initially trained with a set of training data — a large corpus of records of subscription billings — including successes, billing failures, and customer cancellations. Any available metadata about the users or the type of subscription is also attached and may be used as features for the machine learning model. Such metadata may include, for example, customers’ age, gender, demographics, interests, and online behavioral profile/history, as well as metadata to identify the type of service being billed, such as music subscriptions, delivery subscriptions or other types of subscriptions, or the payment instrument. The system is used to predict the subscription model for a given user with relevant user-related constraints, while optimizing acceptability to that user.
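
As a rough illustration of the idea in the abstract (not the patented system itself), the sketch below trains a classifier on synthetic billing-attempt records, with user metadata as features, and then scores every candidate billing hour for one user to pick the most promising one. The column names, the model choice, and the data are all assumptions of mine.

```python
# Hypothetical sketch: pick the billing hour with the highest predicted success.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
n = 2000

# Synthetic history of billing attempts: hour of day, user's age, subscription type.
hour = rng.integers(0, 24, size=n)
age = rng.integers(18, 70, size=n)
sub_type = rng.integers(0, 3, size=n)          # e.g. 0=music, 1=delivery, 2=other
# In this toy data, billing succeeds more often in the evening.
success = (rng.random(n) < 0.3 + 0.3 * (hour >= 18)).astype(int)

X = np.column_stack([hour, age, sub_type])
clf = GradientBoostingClassifier(random_state=0).fit(X, success)

# For a given user, score every candidate billing hour and keep the best one.
user_age, user_sub = 35, 0
candidates = np.column_stack([np.arange(24),
                              np.full(24, user_age),
                              np.full(24, user_sub)])
probs = clf.predict_proba(candidates)[:, 1]
print("best hour to bill:", int(np.argmax(probs)),
      "p(success):", round(float(probs.max()), 3))
```

The real system described in the patent optimizes longer-term metrics such as retention rather than single-attempt success, but the training-on-billing-records structure is the same.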

Best price to bill approach

Algorithm over Regulations (?)

This scene is the best thing I can relate to this particular topic.

“But, the bells have already been rung and they’ve heard it. Out in the dark. Among the stars. Ding dong, the God is dead. The bells, cannot be unrung! He’s hungry. He’s found us. And He’s coming!

Ding, ding, ding, ding, ding…”

(Hint, fellas: this is a great time not to be evil and to check your models to avoid any kind of discrimination against your current or potential customers.)

European Union regulations on algorithmic decision-making and a “right to explanation” – by Bryce Goodman, Seth Flaxman

Abstract: We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on user-level predictors) which “significantly affect” users. The law will also effectively create a “right to explanation,” whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.

Conclusion: While the GDPR presents a number of problems for current applications in machine learning they are, we believe, good problems to have. The challenges described in this paper emphasize the importance of work that ensures that algorithms are not merely efficient, but transparent and fair. Research is underway in pursuit of rendering algorithms more amenable to ex post and ex ante inspection [11, 31, 20]. Furthermore, a number of recent studies have attempted to tackle the issue of discrimination within algorithms by introducing tools to both identify [5, 29] and rectify [9, 16, 32, 6, 12, 14] cases of unwanted bias. It remains to be seen whether these techniques are adopted in practice. One silver lining of this research is to show that, for certain types of algorithmic profiling, it is possible to both identify and implement interventions to correct for discrimination. This is in contrast to cases where discrimination arises from human judgment. The role of extraneous and ethically inappropriate factors in human decision making is well documented (e.g., [30, 10, 1]), and discriminatory decision making is pervasive in many of the sectors where algorithmic profiling might be introduced (e.g. [19, 7]). We believe that, properly applied, algorithms can not only make more accurate predictions, but offer increased transparency and fairness over their human counterparts (cf. [23]). Above all else, the GDPR is a vital acknowledgement that, when algorithms are deployed in society, few if any decisions are purely “technical”. Rather, the ethical design of algorithms requires coordination between technical and philosophical resources of the highest caliber. A start has been made, but there is far to go. And, with less than two years until the GDPR takes effect, the clock is ticking.
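
As a small, hedged illustration of what a “right to explanation” can look like in the simplest case, the sketch below decomposes an individual decision of a logistic regression into per-feature contributions (coefficient times standardized feature value). The feature names and data are made up; real GDPR-grade explanations and fairness checks require far more than this.

```python
# Toy example: explain one decision of a linear model by its per-feature terms.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
feature_names = ["monthly_spend", "days_active", "support_tickets"]  # hypothetical
X = rng.normal(size=(500, 3))
y = (0.8 * X[:, 0] - 0.5 * X[:, 2] + rng.normal(size=500) > 0).astype(int)

scaler = StandardScaler().fit(X)
clf = LogisticRegression().fit(scaler.transform(X), y)

# Explain one individual decision by decomposing the linear score.
x = scaler.transform(X[:1])[0]
contributions = clf.coef_[0] * x
print("decision:", int(clf.predict([x])[0]))
for name, c in sorted(zip(feature_names, contributions), key=lambda t: -abs(t[1])):
    print(f"  {name:>15s}: {c:+.3f}")
```

For non-linear models the same question is much harder, which is exactly the research direction on ex post and ex ante inspection that the conclusion above points to.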

European Union regulations on algorithmic decision-making and a “right to explanation”

Algorithm over Regulations (?)

Densely Connected Convolutional Networks – implementations

Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections – one between each layer and its subsequent layer – our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance.
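
The wiring described in the abstract is easy to see in a toy sketch: the input to each layer is the concatenation of the feature maps of all preceding layers, which is where the L(L+1)/2 direct connections come from. The NumPy code below only illustrates that connectivity; a real DenseNet uses convolutions, batch normalization, and ReLU inside each layer, and the sizes here are arbitrary.

```python
# Toy sketch of DenseNet-style connectivity: each layer sees all previous outputs.
import numpy as np

rng = np.random.default_rng(0)
growth_rate = 4                 # feature maps added by each layer
num_layers = 3
x0 = rng.normal(size=(1, 8))    # initial feature map (batch of 1, 8 channels)

features = [x0]
for _ in range(num_layers):
    concat = np.concatenate(features, axis=1)   # all preceding feature maps
    W = rng.normal(size=(concat.shape[1], growth_rate))
    new_features = np.maximum(concat @ W, 0)    # stand-in "layer": linear map + ReLU
    features.append(new_features)

print("channels seen by each layer:", [f.shape[1] for f in features])
L = num_layers
print("direct connections:", L * (L + 1) // 2)  # L(L+1)/2, as in the paper
```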

Densely Connected Convolutional Networks – implementations

Matching and the use of regressions to analyze treatment effects

One of the thorniest subjects in statistics, when we need estimates over populations with different characteristics, is matching.

For those who do not know it, matching is essentially a technique for the observational comparison of a control group and a treatment group at the level of individual observations (i.e., for each member of the treatment group an estimate is made in parallel with a member of the control group and the differences between the estimates are observed), where the main goal is to assess the treatment effect given the characteristics of the observed data, isolating the analysis from, or conditioning it on, the differences between the covariates.

One example of its application is an IPEA study that estimates poor and indigent populations, in which families with similar socioeconomic characteristics are mapped across the participating groups.

In this post, Matt Bogard offers some considerations on regression as producing a variance-based weighted average of the treatment effect, as opposed to the distribution-based weighting provided by matching.

Hence, regression gives us a variance based weighted average treatment effect, whereas matching provides a distribution weighted average treatment effect.

So what does this mean in practical terms? Angrist and Piscke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is the greatest, or where there are an equal number of treated and control units. They state that differences matter little when the variation of δx is minimal across covariate combinations.

In his post The cardinal sin of matching, Chris Blattman puts it this way:

“For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all….Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate”

We can see that those in the treatment group tend to have higher outcome values so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:

E[Yi | di = 1] − E[Yi | di = 0] = E[Y1i − Y0i] + {E[Y0i | di = 1] − E[Y0i | di = 0]}

However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward then both matching and regression have significantly reduced it, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course if we run the regression including only matched variables, we get exactly the same results. (see R code below). This is not so different than the method of trimming based on propensity scores suggested in Angrist and Pischke.
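
For readers who want to play with the idea, here is a minimal Python sketch (not Bogard's original R code) comparing the naive treated-vs-control difference, an exact-matching estimate within covariate cells, and an OLS regression adjusting for the same covariate. The data is synthetic, so the numbers will not match the 3.78 / .67 / .75 figures quoted above.

```python
# Synthetic example: naive difference vs. exact matching vs. regression adjustment.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = rng.integers(0, 4, size=n)                 # discrete covariate (matching cell)
p_treat = 0.2 + 0.15 * x                       # selection into treatment depends on x
d = (rng.random(n) < p_treat).astype(int)
y = 2.0 * d + 1.5 * x + rng.normal(size=n)     # true treatment effect = 2.0

df = pd.DataFrame({"y": y, "d": d, "x": x})

# Naive difference: biased upward, because treated units tend to have higher x.
naive = df.loc[df.d == 1, "y"].mean() - df.loc[df.d == 0, "y"].mean()

# Exact matching: effect within each covariate cell, averaged with weights given
# by the number of treated units in the cell.
cells = df.groupby("x").apply(
    lambda g: pd.Series({
        "effect": g.loc[g.d == 1, "y"].mean() - g.loc[g.d == 0, "y"].mean(),
        "n_treated": (g.d == 1).sum(),
    })
)
matching = np.average(cells["effect"], weights=cells["n_treated"])

# Regression adjusting for x.
ols = sm.OLS(df["y"], sm.add_constant(df[["d", "x"]])).fit()

print(f"naive: {naive:.2f}  matching: {matching:.2f}  regression: {ols.params['d']:.2f}")
```

With this kind of data, both the matched and the regression-adjusted estimates land near the true effect, while the naive difference is inflated by selection bias, which is exactly the pattern described in the quoted passage.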

Matching and the use of regressions to analyze treatment effects

A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction

A good paper on the application of cross-validation to time series.

Abstract: One of the most widely used standard procedures for model evaluation in classification and regression is K-fold cross-validation (CV). However, when it comes to time series forecasting, because of the inherent serial correlation and potential non-stationarity of the data, its application is not straightforward and often omitted by practitioners in favour of an out-of-sample (OOS) evaluation. In this paper, we show that in the case of a purely autoregressive model, the use of standard K-fold CV is possible as long as the models considered have uncorrelated errors. Such a setup occurs, for example, when the models nest a more appropriate model. This is very common when Machine Learning methods are used for prediction, where CV in particular is suitable to control for overfitting the data. We present theoretical insights supporting our arguments. Furthermore, we present a simulation study and a real-world example where we show empirically that K-fold CV performs favourably compared to both OOS evaluation and other time-series-specific techniques such as non-dependent cross-validation.

Conclusions: In this work we have investigated the use of cross-validation procedures for time series prediction evaluation when purely autoregressive models are used, which is a very common situation; e.g., when using Machine Learning procedures for time series forecasting. In a theoretical proof, we have shown that a normal K-fold cross-validation procedure can be used if the residuals of our model are uncorrelated, which is especially the case if the model nests an appropriate model. In the Monte Carlo experiments, we have shown empirically that even if the lag structure is not correct, as long as the data are fitted well by the model, cross-validation without any modification is a better choice than OOS evaluation. We have then in a real-world data example shown how these findings can be used in a practical situation. Cross-validation can adequately control overfitting in this application, and only if the models underfit the data and lead to heavily correlated errors, are the cross-validation procedures to be avoided as in such a case they may yield a systematic underestimation of the error. However, this case can be easily detected by checking the residuals for serial correlation, e.g., using the Ljung-Box test.
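
Below is a minimal sketch of the setup the paper studies, under assumptions of my own (an AR(2) series and a known lag order): the series is embedded into a lagged-feature matrix, and a purely autoregressive linear model is then evaluated both with standard K-fold cross-validation and with a single out-of-sample split.

```python
# Toy example: K-fold CV vs. out-of-sample evaluation for an autoregressive model.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)

# Simulate an AR(2) process: y_t = 0.6*y_{t-1} - 0.2*y_{t-2} + e_t
n = 500
y = np.zeros(n)
for t in range(2, n):
    y[t] = 0.6 * y[t - 1] - 0.2 * y[t - 2] + rng.normal()

# Embed the series: each row holds the lagged values used to predict y_t.
p = 2
X = np.column_stack([y[p - k - 1:n - k - 1] for k in range(p)])  # [y_{t-1}, y_{t-2}]
target = y[p:]

model = LinearRegression()

# Standard K-fold CV (valid here as long as the model's errors are uncorrelated).
scores = cross_val_score(model, X, target,
                         cv=KFold(n_splits=5, shuffle=True, random_state=0),
                         scoring="neg_mean_squared_error")
cv_mse = -scores.mean()

# Out-of-sample evaluation: fit on the first 80%, test on the last 20%.
split = int(0.8 * len(target))
preds = model.fit(X[:split], target[:split]).predict(X[split:])
oos_mse = np.mean((preds - target[split:]) ** 2)

print(f"5-fold CV MSE: {cv_mse:.3f}   OOS MSE: {oos_mse:.3f}")
```

As the conclusions note, the safeguard in practice is to check the fitted model's residuals for serial correlation (e.g., with a Ljung-Box test) before trusting the K-fold estimate.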


A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction