PlaidML: An open source portable deep learning engine

Via Vertex.ai

We’re pleased to announce the next step towards deep learning for every device and platform. Today Vertex.AI is releasing PlaidML, our open source portable deep learning engine. Our mission is to make deep learning accessible to every person on every device, and we’re building PlaidML to help make that a reality. We’re starting by supporting the most popular hardware and software already in the hands of developers, researchers, and students. The initial version of PlaidML runs on most existing PC hardware with OpenCL-capable GPUs from NVIDIA, AMD, or Intel. Additionally, we’re including support for running the widely popular Keras framework on top of Plaid to allow existing code and tutorials to run unchanged.
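
For those who want to try it, the snippet below is a minimal sketch of what running Keras on top of PlaidML looks like. It assumes the plaidml.keras backend shim shipped with the package (the exact call may differ between releases); everything after the backend swap is plain, unchanged Keras code.

# Minimal sketch, assuming the plaidml.keras shim bundled with PlaidML;
# the backend swap happens before Keras is imported, everything after is vanilla Keras.
import plaidml.keras
plaidml.keras.install_backend()  # route Keras ops to PlaidML (OpenCL) instead of TensorFlow

from keras.models import Sequential
from keras.layers import Dense

model = Sequential([
    Dense(64, activation='relu', input_shape=(784,)),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5)  # the usual Keras training call, unchanged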

PlaidML: An open source portable deep learning engine

Baidu is bringing HPC Techniques to Deep Learning

Via Baidu Research Blog.

The ring all-reduce approach saves a lot of communication work when training deep neural networks across many GPUs. How the gradients are propagated and updated (and how the convergence of the model is controlled) is well explained below:

The Ring Allreduce

The main issue with the simplistic communication strategy described above was that the communication cost grew linearly with the number of GPUs in the system. In contrast, a ring allreduce is an algorithm for which the communication cost is constant and independent of the number of GPUs in the system, and is determined solely by the slowest connection between GPUs in the system; in fact, if you only consider bandwidth as a factor in your communication cost (and ignore latency), the ring allreduce is an optimal communication algorithm [4]. (This is a good estimate for communication cost when your model is large, and you need to send large amounts of data a small number of times.)

The GPUs in a ring allreduce are arranged in a logical ring. Each GPU should have a left neighbor and a right neighbor; it will only ever send data to its right neighbor, and receive data from its left neighbor.

The algorithm proceeds in two steps: first, a scatter-reduce, and then, an allgather. In the scatter-reduce step, the GPUs will exchange data such that every GPU ends up with a chunk of the final result. In the allgather step, the GPUs will exchange those chunks such that all GPUs end up with the complete final result.

More can be found here. For implementation, there is a GitHub project called Tensor All-reduce that can be used for distributed deep learning.
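
To make the two phases concrete, here is a small single-process simulation of the algorithm in NumPy. This is an illustrative sketch of the chunk bookkeeping only, not a distributed implementation; the function name and the sum reduction are my own choices.

# Single-process simulation of ring all-reduce (scatter-reduce + allgather),
# using NumPy arrays to stand in for the per-GPU gradient buffers.
import numpy as np

def ring_allreduce(buffers):
    """Sum-reduce a list of equally shaped 1-D buffers the way a logical ring would."""
    n = len(buffers)
    chunks = [np.array_split(b.astype(float), n) for b in buffers]

    # Scatter-reduce: after n-1 steps, node r holds the fully reduced chunk (r + 1) % n.
    for step in range(n - 1):
        sends = [chunks[r][(r - step) % n].copy() for r in range(n)]
        for r in range(n):
            left = (r - 1) % n
            chunks[r][(r - step - 1) % n] += sends[left]

    # Allgather: circulate the reduced chunks until every node has all of them.
    for step in range(n - 1):
        sends = [chunks[r][(r + 1 - step) % n].copy() for r in range(n)]
        for r in range(n):
            left = (r - 1) % n
            chunks[r][(r - step) % n] = sends[left]

    return [np.concatenate(c) for c in chunks]

# Example: 4 "GPUs", each with its own gradient buffer; every node ends up with the sum.
grads = [np.arange(8) * (i + 1) for i in range(4)]
result = ring_allreduce(grads)
assert all(np.allclose(r, sum(grads)) for r in result)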

Baidu is bringing HPC Techniques to Deep Learning

Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow

Via Uber Engineering Blog.

Yes, this is yet another tool for deep learning, but I think these folks hit the nail on the head by exposing and fixing one of the major concerns about TensorFlow: distributed training.

When Uber needed to use deep learning, they ran into some hurdles with the conventional data parallelism architecture. With data parallelism, training is distributed across several instances in parallel; the gradients for every batch are calculated in each instance (node/worker), then propagated to all nodes and averaged to control the convergence (the update) of the model during the training phase. The image in the original post explains this better than words.

But using this architecture, Uber faced two problems: a) finding the right ratio of workers to parameter servers (to avoid/deal with network and processing bottlenecks), and b) the complexity of the TensorFlow code (more details here).

To avoid these problems, they used an idea from a 2009 paper, “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations”, called ring all-reduce.

They explain the workflow of this approach:

In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.

The implementation details can be found here.
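
From the user's side, all the ring-allreduce machinery is hidden behind a couple of calls. The sketch below follows the usage pattern shown in Uber's announcement for the Keras API; it is a rough sketch, and details such as the optimizer choice and learning-rate scaling are illustrative and may vary by Horovod version.

# Minimal Horovod + Keras sketch: gradient averaging is done with ring-allreduce
# across the workers, one process per GPU.
import horovod.keras as hvd
import keras

hvd.init()  # one process per GPU, arranged in the logical ring

model = keras.models.Sequential([
    keras.layers.Dense(64, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Wrap the optimizer so gradients are averaged across workers with ring-allreduce.
opt = hvd.DistributedOptimizer(keras.optimizers.Adadelta(1.0 * hvd.size()))
model.compile(optimizer=opt, loss='categorical_crossentropy', metrics=['accuracy'])

callbacks = [
    # Keep all workers consistent by broadcasting rank 0's initial weights.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# model.fit(x_train, y_train, callbacks=callbacks, epochs=5)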

 

 

Horovod: Uber’s Open Source Distributed Deep Learning Framework for TensorFlow

Tensorflow sucks (?)

This post on Nico’s blog makes good points about why PyTorch, even without all of Google’s support and money, is taking users away from TensorFlow.

[…]The most interesting question to me is why Google chose a purely declarative paradigm for Tensorflow in spite of the obvious downsides of this approach. Did they feel that encapsulating all the computation in a single computation graph would simplify executing models on their TPU’s so they can cut Nvidia out of the millions of dollars to be made from cloud hosting of deep learning powered applications? It’s difficult to say. Overall, Tensorflow does not feel like a pure open source project for the common good. Which I would have no problem with, had their design been sound. In comparison with beautiful Google open source projects out there such as Protobuf, Golang, and Kubernetes, Tensorflow falls dramatically short.

While declarative paradigms are great for UI programming, there are many reasons why it is a problematic choice for deep learning.

Take the React Javascript library as an example, the standard choice today for interactive web applications. In React, the complexity of how data flows through the application makes sense to be hidden from the developer, since Javascript execution is generally orders of magnitudes faster than updates to the DOM. React developers don’t want to worry about the mechanics of how state is propagated, so long as the end user experience is “good enough”.

On the other hand, in deep learning, a single layer can literally execute billions of FLOP’s! And deep learning researchers care very much about the mechanics of how computation is done and want fine control because they are constantly pushing the edge of what’s possible (e.g. dynamic networks) and want easy access to intermediate results.[…]

Tensorflow sucks (?)

How to pass environment variables in Jupyter Notebook

(Sharing some personal suffering)

One thing that drives me mad is having to create several .csv/.txt files on my computer to perform some analysis. I personally prefer to connect directly to an RDBMS (Redshift), get the data in a straightforward way, and store the query inside the Jupyter Notebook.

The main problem with this approach is that a large number of people put their passwords inside notebooks/scripts, and this is very unsafe. (You don’t need to believe me; check it for yourself.)

I was trying to pass the environment variables the traditional way, using export VARIABLE_NAME=xptoSomeValue, but after starting the Jupyter Notebook I got the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-2288aa3f6b7a> in <module>()
      2 import os
      3 
----> 4 HOST = os.environ['REDSHIFT_HOST']
      5 PORT = os.environ['REDSHIFT_PORT']
      6 USER = os.environ['REDSHIFT_USER']

/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.pyc in __getitem__(self, key)
     38         if hasattr(self.__class__, "__missing__"):
     39             return self.__class__.__missing__(self, key)
---> 40         raise KeyError(key)
     41     def __setitem__(self, key, item): self.data[key] = item
     42     def __delitem__(self, key): del self.data[key]

KeyError: 'REDSHIFT_HOST'

For some reason, this approach didn’t work. I made a small workaround to set some environment variables when calling the jupyter notebook command, like this:

env REDSHIFT_HOST='myRedshiftHost' REDSHIFT_USER='flavio.clesio' REDSHIFT_PORT='5439' REDSHIFT_DATA='myDatabase' REDSHIFT_PASS='myVeryHardPass' jupyter notebook
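
With the variables exported that way, the notebook code from the traceback above works and no credential is hard-coded. The connection part below is just a sketch assuming psycopg2 as the Redshift client; any other Postgres-compatible driver would do.

# Inside the notebook: read the credentials from the environment, never from the code.
import os
import psycopg2  # assumption: psycopg2 (or any Postgres/Redshift driver) is installed

HOST = os.environ['REDSHIFT_HOST']
PORT = os.environ['REDSHIFT_PORT']
USER = os.environ['REDSHIFT_USER']
PASS = os.environ['REDSHIFT_PASS']
DATA = os.environ['REDSHIFT_DATA']

conn = psycopg2.connect(host=HOST, port=PORT, user=USER, password=PASS, dbname=DATA)
# ...then run the query and load the result, e.g. with pandas.read_sql(query, conn)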

I hope it helps!

How to pass environment variables in Jupyter Notebook

Six reasons your boss must send you to Spark Summit Europe 2017

It’s almost redundant to say that Apache Spark has become the most prominent open-source big data cluster-computing framework of the last two years: this technology has not only shattered old paradigms of general-purpose distributed data processing, but also built a very vibrant, innovation-driven, and receptive community.

This is my first time at Spark Summit, and for me personally, as a machine learning professional, it’s a great time to be part of an event that has grown dramatically in just the last two years.

Here in Brazil we do not have a strong tradition of investing in conferences (there are some cultural reasons involved that deserve to be broken down in another blog post), but here are the six reasons your boss must send you to Spark Summit Europe 2017:

  1. Accomplish more than the rest: While some of your company’s competitors are busy reworking old frameworks, your company can stay focused on solving the real problems that let your business scale, using bleeding-edge technologies.
  2. Stay ahead of the game: You can choose one of these two sentences to put in your resumé: 1) “Worked with Apache Spark, the most prominent open-source cluster-computing framework for Big Data projects“; or 2) “Worked with <<some obsolete framework that needs a couple of million USD to be deployed, has 70% fewer features than Apache Spark, had its last stable version written 9 years ago, and that the whole market is migrating away from>>”. It’s up to you.
  3. Connect with Apache Spark experts: At Spark Summit you’ll meet the real dealers of Apache Spark, not someone with a marketing pitch (no offense) selling difficulties (e.g. closed-source, buggy platforms) to sell facilities (e.g. the never-ending-consulting-until-it-drains-your-entire-budget style, (buggy) plugins, add-ons, etc.). Some of the Spark experts attending are Tim Hunter, Tathagata Das, Sue Ann Hong, and Holden Karau, to name a few.
  4. Networking that matters: I mean people with a shared interest and enthusiasm for Apache Spark and open-source technology, and headhunters from good companies that understand that data plays a strong role in business; not some B.S. artist or pseudo-tech-cloaked seller.
  5. Applied knowledge produces innovation, and innovation produces results: Some cases of using Apache Spark to innovate and help the business: saving more than US$ 3 million using Apache Spark and Machine Learning, managing a 300 TB data workload using Apache Spark, real-time anomaly detection in production systems, changing the game of digital marketing using Apache Spark, and predicting traffic using weather data.
  6. Opting out will destroy your business and your career: Refusing to acquire and apply new knowledge is the fastest way to destroy your career, through stagnation in old methods/processes/platforms that become obsolete in a few months. For your company, opting out of innovation or of learning new methods and technologies that can help scale the business or enhance productivity is a good way to be out of business in a few years.

To register and learn more about the event, please visit Spark Summit 2017 and follow spark_summit on Twitter.

Six reasons your boss must send you to Spark Summit Europe 2017

See you at Spark Summit Europe 2017

On October 26, my friend Eiti Kimura and I will give a talk called Preventing Leakage and Monitoring Distributed Systems with Machine Learning at Spark Summit Europe 2017, where we’ll show our solution for monitoring a highly complex distributed system using Apache Spark as a tool for machine learning.

We’re very excited to share our experience in this journey, and how we solved a complex problem using a simple solution that saved more than US$ 3 million in the last 19 months.

See you at Spark Summit in Dublin.

See you at Spark Summit Europe 2017

Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis


ABSTRACT— Anomaly detection in database management systems (DBMSs) is difficult because of increasing number of statistics (stat) and event metrics in big data system. In this paper, I propose an automatic DBMS diagnosis system that detects anomaly periods with abnormal DB stat metrics and finds causal events in the periods. Reconstruction error from deep autoencoder and statistical process control approach are applied to detect time period with anomalies. Related events are found using time series similarity measures between events and abnormal stat metrics. After training deep autoencoder with DBMS metric data, efficacy of anomaly detection is investigated from other DBMSs containing anomalies. Experiment results show effectiveness of proposed model, especially, batch temporal normalization layer. Proposed model is used for publishing automatic DBMS diagnosis reports in order to determine DBMS configuration and SQL tuning.

CONCLUSION AND FUTURE WORK I proposed a machine learning model for automatic DBMS diagnosis. The proposed model detects anomaly periods from reconstruct error with deep autoencoder. I also verified empirically that temporal normalization is essential when input data is non-stationary multivariate time series. With SPC approach, time period is considered anomaly period when reconstruction error is outside of control limit. According types or users of DBMSs, decision rules that are used in SPC can be added. For example, warning line with 2 sigma can be utilized to decide whether it is anomaly or not [12, 13]. In this paper, anomaly detection test is proceeded in other DBMSs whose data is not used in training, because performance of basic pre-trained model is important in service providers’ perspective. Efficacy of detection performance is validated with blind test and DBAs’ opinions. The result of automatic anomaly diagnosis would help DB consultants save time for anomaly periods and main wait events. Thus, they can concentrate on only making solution when DB disorders occur. For better performance of anomaly detection, additional training can be proceeded after pre-trained model is adopted. In addition, recurrent and convolutional neural network can be used in reconstruction part to capture hidden representation of sequential and local relationship. If anomaly labeled data is generated, detection result can be analyzed with numerical performance measures. However, in practice, it is hard to secure labeled anomaly dataset according to each DBMS. Proposed model is meaningful in unsupervised anomaly detection model that doesn’t need labeled data and can be generalized to other DBMSs with pre-trained model
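
The core idea here (reconstruction error from an autoencoder checked against an SPC control limit) is easy to sketch. The toy example below is my own illustration, not the paper’s architecture, and it omits the batch temporal normalization layer the authors propose.

# Sketch: train an autoencoder on "normal" DB metric windows, then flag time periods
# whose reconstruction error exceeds an SPC-style control limit.
import numpy as np
from keras.models import Model
from keras.layers import Input, Dense

n_metrics = 50  # hypothetical number of DB stat metrics per time step

inp = Input(shape=(n_metrics,))
code = Dense(8, activation='relu')(inp)
out = Dense(n_metrics, activation='linear')(code)
autoencoder = Model(inp, out)
autoencoder.compile(optimizer='adam', loss='mse')
# autoencoder.fit(X_normal, X_normal, epochs=50, batch_size=128)  # normal periods only

def control_limit(model, X_normal, sigma=3.0):
    # Upper control limit computed from the reconstruction error on normal data.
    err = np.mean((X_normal - model.predict(X_normal)) ** 2, axis=1)
    return err.mean() + sigma * err.std()

def anomaly_periods(model, X, limit):
    # Indices of time steps whose reconstruction error falls outside the control limit.
    err = np.mean((X - model.predict(X)) ** 2, axis=1)
    return np.where(err > limit)[0]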


Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

Very Deep Convolutional Networks for Large-Scale Image Recognition

ABSTRACT In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.

CONCLUSION In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
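
For reference, the central design choice (stacks of 3 × 3 convolutions between poolings, pushed to 16–19 weight layers) looks roughly like the abbreviated Keras sketch below. This is an illustration of the pattern, not a faithful reproduction of the full VGG-16/19 configurations.

# Abbreviated VGG-style stack: repeated 3x3 convolutions followed by 2x2 max pooling.
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(64, (3, 3), activation='relu', padding='same', input_shape=(224, 224, 3)))
model.add(Conv2D(64, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(Conv2D(128, (3, 3), activation='relu', padding='same'))
model.add(MaxPooling2D((2, 2), strides=(2, 2)))
# ...the full 16/19-layer configurations continue with 3x3 blocks of 256 and 512 filters...
model.add(Flatten())
model.add(Dense(4096, activation='relu'))
model.add(Dense(1000, activation='softmax'))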

Very Deep Convolutional Networks for Large-Scale Image Recognition

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Abstract We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).

Conclusion We develop a model which exceeds the cardiologist performance in detecting a wide range of heart arrhythmias from single-lead ECG records. Key to the performance of the model is a large annotated dataset and a very deep convolutional network which can map a sequence of ECG samples to a sequence of arrhythmia annotations. On the clinical side, future work should investigate extending the set of arrhythmias and other forms of heart disease which can be automatically detected with high-accuracy from single or multiple lead ECG records. For example we do not detect Ventricular Flutter or Fibrillation. We also do not detect Left or Right Ventricular Hypertrophy, Myocardial Infarction or a number of other heart diseases which do not necessarily exhibit as arrhythmias. Some of these may be difficult or even impossible to detect on a single-lead ECG but can often be seen on a multiple-lead ECG. Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Learning to Optimize Neural Nets

Abstract Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR-100.

Learning to Optimize Neural Nets

L2 Regularization versus Batch and Weight Normalization

Abstract: Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the influence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.

Discussion: Normalization, either Batch Normalization, Layer Normalization, or Weight Normalization makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first order gradient method that can fully eliminate this effect. However, a direct solution of forcing ||w|| = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure. As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate might not necessarily be bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of effective learning rate can be hard to control, and can depend a lot on initial steps of training, which makes it harder to reproduce results. With batch normalization we have added two additional parameters, γ and β, and it of course makes sense to also regularize these. In our experiments we did not use regularization for these parameters, though preliminary experiments show that regularization here does not affect the results. This is not very surprising, since with rectified linear activation functions, scaling of γ also has no effect on the function value in subsequent layers. So the only parameters that are actually regularized are the γ’s for the last layer of the network.
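
The scale-invariance claim is easy to check numerically: with batch normalization, rescaling the weights leaves the normalized pre-activations (almost) unchanged, so an L2 penalty on w cannot change the represented function, only the effective learning rate. A tiny NumPy check:

# Numeric check of scale invariance under batch normalization.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(256, 32)          # a batch of inputs
w = rng.randn(32, 16)           # a weight matrix

def batch_norm(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

a = batch_norm(x @ w)
b = batch_norm(x @ (10.0 * w))  # same weights, scaled by 10
print(np.abs(a - b).max())      # ~0, up to the epsilon in the denominator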

L2 Regularization versus Batch and Weight Normalization

Analysis of dropout learning regarded as ensemble learning

Abstract: Deep learning is the state-of-the-art in fields such as visual object recognition and speech recognition. This learning uses a large number of layers, huge number of units, and connections. Therefore, overfitting is a serious problem. To avoid this problem, dropout learning is proposed. Dropout learning neglects some inputs and hidden units in the learning process with a probability, p, and then, the neglected inputs and hidden units are combined with the learned network to express the final output. We find that the process of combining the neglected hidden units with the learned network can be regarded as ensemble learning, so we analyze dropout learning from this point of view.

Results: After the learning, the ensemble output is calculated by using the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning except for using a different set of hidden units in every learning iteration. Using a different set of hidden unit outperforms ensemble learning. We also showed that dropout learning achieves the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with ReLU activation function.
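
For a single linear output unit, the ensemble-averaging view is easy to verify numerically: averaging the outputs of many randomly dropped sub-networks approaches the usual test-time output scaled by the keep probability. A small sketch:

# Averaging many dropout sub-networks vs. the standard test-time rescaling.
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(8)              # activations of one hidden layer
w = rng.randn(8)              # weights into a single linear output unit
p = 0.5                       # dropout probability

# Ensemble view: average the outputs of many sub-networks with units dropped at random.
masks = rng.binomial(1, 1 - p, size=(100000, 8))
ensemble_out = np.mean(masks @ (w * x))

# Standard dropout inference: keep all units and scale by the keep probability.
scaled_out = (1 - p) * np.dot(w, x)
print(ensemble_out, scaled_out)   # the two agree closely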

Analysis of dropout learning regarded as ensemble learning

Deep Learning for Tumor Classification in Imaging Mass Spectrometry

Motivation: Tumor classification using Imaging Mass Spectrometry (IMS) data has a high potential for future applications in pathology. Due to the complexity and size of the data, automated feature extraction and classification steps are required to fully process the data. Deep learning offers an approach to learn feature extraction and classification combined in a single model. Commonly these steps are handled separately in IMS data analysis, hence deep learning offers an alternative strategy worthwhile to explore.

Results: Methodologically, we propose an adapted architecture based on deep convolutional networks to handle the characteristics of mass spectrometry data, as well as a strategy to interpret the learned model in the spectral domain based on a sensitivity analysis. The proposed methods are evaluated on two challenging tumor classification tasks and compared to a baseline approach. Competitiveness of the proposed methods are shown on both tasks by studying the performance via cross-validation. Moreover, the learned models are analyzed by the proposed sensitivity analysis revealing biologically plausible effects as well as confounding factors of the considered task. Thus, this study may serve as a starting point for further development of deep learning approaches in IMS classification tasks.

Source Code: https://gitlab.informatik.uni-bremen.de/digipath/Deep Learning

Data: https://seafile.zfn.uni-bremen.de/d/85c915784e/

Deep Learning for Tumor Classification in Imaging Mass Spectrometry

Do you have a co-worker who wants to leave your company because he’s not working with bleeding-edge Deep Learning tools/algos?

Please show them this post from Ben Lorica’s podcast:

Adoption of machine learning and deep learning in large companies

Everything in the enterprise space is ROI driven. They don’t know that the newest deep learning paper just came out from Google. They’re not going to clone some random GitHub repository and try it out, and just try to put it in production. They don’t do that. They want to understand ROI. They work a job, they have a goal, and they have a budget. They need to figure out what to do with that budget as it relates to their job at their company. Their company is usually a for-profit corporation trying to make money, or trying to increase margins for shareholders.

… Frankly, they don’t care if it’s linear regression, or random forest, either. … Machine learning has barely penetrated the Fortune 2000. Despite all these tools existing, most of them don’t have it in production because they don’t see a point in adopting it. I think Intel said this right: as far as enterprise adoption is concerned, it’s still fairly early for machine learning.

Do you have a co-worker who wants to leave your company because he’s not working with bleeding-edge Deep Learning tools/algos?

Lack of transparency is the bottleneck in academia

One of my biggest mistakes was to do my whole master’s degree dissertation using private data (provided by my former employer) and closed tools (e.g. Viscovery Mine).

This was a huge blocker for me in sharing my research with everyone in the community and getting a second opinion about my work with regard to reproducibility. I am working on opening my data and making a new version, or a book, about this kind of analysis using non-performing loans data.

In his blog, Denny talks about how engineering is the bottleneck in deep learning research, where he makes the following statements:

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what other’s have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering.” 1. Waste of research time and 2. Lack of rigor and reproducibility. Let’s look at each of them.

And the final musing:

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Personally, I think this approach of discarding the results and focusing on the novelty of the methods is better than trying to understand whatever aspect of the results the researcher wants to cover up with academic BS complexity.

Lack of transparency is the bottleneck in academia

Why you need to enforce reproducibility as a habit

In my first few months after arriving at Movile, I noticed a strange pattern around several pieces of “data analysis”. The pattern was: once an analysis was delivered by someone without any data science background, that kind of insight suddenly became a kind of dogma.

In other words, no one would check the information, and in most cases there was no code, no commit on GitHub, no .sql file, and no .R/.py file with the scripts used.

The practical problem is: what if this information was dead wrong? And worse: how would we discover whether this information had been harmful to the business?

Seeing this, my first mission statement as Data Intelligence Tech Lead at that time was to require of every BI Developer, Revenue Assurance Data Analyst, and Data Scientist that all code must be reproducible, no matter the conditions. Every insight must be delivered with some code on GitHub.

Someone could say: “Wow… We have a little dictator here!”

With this simple rule, we have seen this non-exhaustive list of positive effects:

  • We keep collecting, to this day, a huge dividend from reproducible science: every opinion has code behind it, and this code can be tested by anyone with access to GitHub. This prevents the “Excel kid” from driving any decision-making without a hand on their shoulder, BEFORE the decision is made;
  • We unmasked several “BS artists” who exploit the lack of data literacy of our internal clients (e.g. analysts, managers, et cetera) by showing unnecessary complexity or delusional estimates without any kind of method behind them; and
  • We developed a culture of being very skeptical about our estimates, especially regarding what we do not know about the data (a.k.a. exogenous factors such as the market, the Brazilian economy, and so on). In other words: we stopped guessing about what we didn’t know at the time and MADE THAT CLEAR to our internal clients.

To learn a little more about how we operate: this article was the key reference for us in building our culture of compliance and deployment.

No matter what time crunch you are facing, it’s not worth putting a flaky implementation of an analysis into production. As data scientists, we are working to create a culture of data-driven decision-making. If your application breaks without an explanation (likely because you are unable to reproduce the results), people will lose confidence in your application and stop making decisions based on the results of your application. Even if you eventually fix it, that confidence is very, very hard to win back.

Data science teams should require reproducibility in the same way they require unit testing, linting, code versioning, and review. Without consistently producing results as good or better than known results for known data, analyses should never be passed on to deployment. This performance can be measured via techniques similar to integration testing. Further, if possible, models can be run in parallel on current data running through your systems for a side-by-side comparison with current production models.

Don’t get me wrong: without any kind of compliance around your analyses, your organization will be a house of BS artists, and any benefit from extracting insights from the data will be contaminated with hidden BS bias that can lead to several decision-making disasters, as we have already experienced.

Why you need to enforce reproducibility as a habit

Productionizing Machine Learning Models and taking care of the neighbors

At Movile we have a Machine Learning squad composed of the following members:

  • 1 Tech Lead (Mixed engineering and computational)
  • 2 Core ML engineers (production side)
  • 1 Data Scientist (with statistical background) – (data analysis and prototyping side)
  • 1 Data Scientist (with computational background) – (data analysis and prototyping side)

As we can see, there are different backgrounds on the team, and to make the entire workflow as productive and smooth as possible, we need good fences (a.k.a. a crystal-clear vision of the roles) to keep everyone motivated and productive.

This article, written by Jhonatan Morra, brings a good perspective on this and on how we deal with it at Movile.

Here are some quotes:

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.
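
A minimal sketch of that separation of concerns: the data science side ships one object with a stable predict() contract, so featurization and the model behind it can change without the scoring team noticing. All names below are illustrative, not Movile’s actual code.

# The scoring/serving team only depends on predict(); everything else is internal.
class ChurnModel(object):
    """Owned by the data science / prototyping side; deployed and swapped as a unit."""

    def __init__(self, model, feature_names):
        self.model = model                  # e.g. a fitted scikit-learn classifier
        self.feature_names = feature_names  # feature order the model expects

    def _featurize(self, user):
        # Internal concern: turn a raw user record into the model's feature vector.
        return [float(user.get(name, 0.0)) for name in self.feature_names]

    def predict(self, user):
        # The only contract the scoring team depends on: raw user in, churn probability out.
        return float(self.model.predict_proba([self._featurize(user)])[0][1])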

 

 

 

 

Productionizing Machine Learning Models and taking care of the neighbors

Best price to bill approach

For those who are looking for an initial approach to finding the best time to bill your customers for subscription services, this paper can be a good start.

In my current company, this is a very challenging problem.

Machine-Learning System For Recurring Subscription Billing

Jack Greenberg, Thomas Price

Abstract: A system and method for recurring billing of periodic subscriptions are disclosed. The system attempts to maximize a metric like long term customer retention while tailoring the subscription billing to the customer, using machine learning. The system is initially trained with a set of training data — a large corpus of records of subscription billings — including successes, billing failures, and customer cancellations. Any available metadata about the users or the type of subscription is also attached and may be used as features for the machine learning model. Such metadata may include, for example, customers’ age, gender, demographics, interests, and online behavioral profile/history, as well as metadata to identify the type of service being billed, such as music subscriptions, delivery subscriptions or other types of subscriptions, or the payment instrument. The system is used to predict the subscription model for a given user with relevant user-related constraints, while optimizing acceptability to that user.

Best price to bill approach

Algorithm over Regulations (?)

This scene is the best thing I can relate to this particular topic.

“But, the bells have already been rung and they’ve heard it. Out in the dark. Among the stars. Ding dong, the God is dead. The bells, cannot be unrung! He’s hungry. He’s found us. And He’s coming!

Ding, ding, ding, ding, ding…”

(Hint, fellas: this is a great time to not be evil and to check your models to avoid any kind of discrimination against your current or potential customers.)

European Union regulations on algorithmic decision-making and a “right to explanation” – by Bryce Goodman, Seth Flaxman

Abstract: We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on userlevel predictors) which “significantly affect” users. The law will also effectively create a “right to explanation,” whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.

Conclusion: While the GDPR presents a number of problems for current applications in machine learning they are, we believe, good problems to have. The challenges described in this paper emphasize the importance of work that ensures that algorithms are not merely efficient, but transparent and fair. Research is underway in pursuit of rendering algorithms more amenable to ex post and ex ante inspection [11, 31, 20]. Furthermore, a number of recent studies have attempted to tackle the issue of discrimination within algorithms by introducing tools to both identify [5, 29] and rectify [9, 16, 32, 6, 12, 14] cases of unwanted bias. It remains to be seen whether these techniques are adopted in practice. One silver lining of this research is to show that, for certain types of algorithmic profiling, it is possible to both identify and implement interventions to correct for discrimination. This is in contrast to cases where discrimination arises from human judgment. The role of extraneous and ethically inappropriate factors in human decision making is well documented (e.g., [30, 10, 1]), and discriminatory decision making is pervasive in many of the sectors where algorithmic profiling might be introduced (e.g. [19, 7]). We believe that, properly applied, algorithms can not only make more accurate predictions, but offer increased transparency and fairness over their human counterparts (cf. [23]). Above all else, the GDPR is a vital acknowledgement that, when algorithms are deployed in society, few if any decisions are purely “technical”. Rather, the ethical design of algorithms requires coordination between technical and philosophical resources of the highest caliber. A start has been made, but there is far to go. And, with less than two years until the GDPR takes effect, the clock is ticking.

European Union regulations on algorithmic decision-making and a “right to explanation”

 

Algorithm over Regulations (?)