Moving Machine Learning models to production

This will eventually become a longer post here on the blog, but for now let's go with this fine post by Ram Balakrishnan.

After building, training and deploying your models to production, the task is still not complete unless you have monitoring systems in place. A crucial component to ensuring the success of your models is being able to measure and quantify their performance. A number of questions are worth answering in this area.
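As a tiny illustration of what "measure and quantify their performance" can mean in practice, here is a minimal sketch of mine (not from Balakrishnan's post): weekly accuracy of a deployed classifier compared against a pre-deployment baseline. The data frame, baseline and alert threshold are entirely hypothetical.

# Hypothetical sketch: tracking a deployed classifier's weekly accuracy
# against a pre-deployment baseline. Data and threshold are made up.
set.seed(3)
production_log <- data.frame(
  week       = rep(1:8, each = 250),
  prediction = rbinom(2000, 1, 0.5),
  actual     = rbinom(2000, 1, 0.5)
)
baseline_accuracy <- 0.50    # hypothetical accuracy measured before deployment

# Weekly accuracy of the model in production
weekly <- aggregate(I(prediction == actual) ~ week, data = production_log, FUN = mean)
names(weekly)[2] <- "accuracy"

# Flag weeks where performance drops well below the baseline
weekly$alert <- weekly$accuracy < baseline_accuracy - 0.05
weekly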

Moving Machine Learning models to production

When will the next Artificial Intelligence winter arrive?

Straight from Inference.

You don't need to search much to see that Data Mining, Machine Learning and Artificial Intelligence are riding the crest of the wave, whether in terms of investment or of adoption by industry.

However, for those who don't know, there is a term called AI Winter, used for periods of disillusionment with everything that gets promised during a wave of AI hype.

This article argues that even with all the advances in deep/machine learning, there are still tasks that will likely remain with human beings.

Are we approaching a machine/deep learning winter?

At the end of this machine/deep learning hype cycle, either of two scenarios could occur:

  1. winter scenario: we have exploited current state of AI/ML to its limits, and discovered the boundaries of tasks we can easily and feasibly solve with this new technology, and we agree that human-level general intelligence is not within those boundaries. As a result, we have characterised what makes humans intelligent a bit better, developed very powerful and valuable technologies along the way, Nature has published a couple more of DeepMind’s papers, but research-wise an AI winter is likely to set in. AI will no doubt continue to be useful for industry, but some of the research community will scale back and search for the next breakthrough idea or component.
  2. holy shit scenario: (also known as Eureka moment) We really do solve AI in a way that is clearly and generally recognised as artificial intelligence, and some form of singularity happens. Intelligence is a case of ‘I recognise it when I see it’, but it’s hard to predict what shape it will take.

 

When will the next Artificial Intelligence winter arrive?

How not to run an A/B test

This article by Evan Miller is almost a classic on the subject.

When an A/B testing dashboard says there is a “95% chance of beating original” or “90% probability of statistical significance,” it’s asking the following question: Assuming there is no underlying difference between A and B, how often will we see a difference like we do in the data just by chance? The answer to that question is called the significance level, and “statistically significant results” mean that the significance level is low, e.g. 5% or 1%. Dashboards usually take the complement of this (e.g. 95% or 99%) and report it as a “chance of beating the original” or something like that.

However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless. This result is completely counterintuitive and all the A/B testing packages out there ignore it, but I’ll try to explain the source of the problem with a simple example.
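The effect Miller describes is easy to reproduce with a quick simulation. The sketch below is mine, not from the article, and the parameter values are illustrative: both variants share the same true conversion rate, yet peeking at a significance test after every batch and stopping at the first "significant" result produces false positives far more often than the nominal 5%.

# Simulation (illustrative values, not from Miller's article): an A/A test
# where we peek at the p-value after every batch and stop at the first "win".
set.seed(42)
n_experiments <- 1000   # simulated A/A experiments
n_checks      <- 20     # number of peeks per experiment
batch_size    <- 100    # observations per variant added between peeks
p_true        <- 0.10   # identical conversion rate for A and B

false_positive <- replicate(n_experiments, {
  a <- integer(0); b <- integer(0)
  hit <- FALSE
  for (i in seq_len(n_checks)) {
    a <- c(a, rbinom(batch_size, 1, p_true))
    b <- c(b, rbinom(batch_size, 1, p_true))
    p <- suppressWarnings(
      prop.test(c(sum(a), sum(b)), c(length(a), length(b)))$p.value
    )
    if (!is.na(p) && p < 0.05) { hit <- TRUE; break }  # stop as soon as it looks significant
  }
  hit
})

mean(false_positive)   # typically far above the nominal 0.05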

How not to run an A/B test

What is the difference between Gradient Descent and Stochastic Gradient Descent?

Here, from Quora, the simplest answer ever written in the history of the world:

In both gradient descent (GD) and stochastic gradient descent (SGD), you update a set of parameters in an iterative manner to minimize an error function.

While in GD, you have to run through ALL the samples in your training set to do a single update for a parameter in a particular iteration, in SGD, on the other hand, you use ONLY ONE training sample from your training set to do the update for a parameter in a particular iteration.

Thus, if the number of training samples is large, in fact very large, then using gradient descent may take too long, because in every iteration, when you are updating the values of the parameters, you are running through the complete training set. Using SGD, on the other hand, will be faster because it uses only one training sample and starts improving right away from the first sample.

SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD. In most cases, though, the close approximation that you get with SGD for the parameter values is enough, because the parameters reach values near the optimum and keep oscillating around it.
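The contrast is easy to see in code. The sketch below is my own illustration, not from the Quora answer: both methods fit the same simple linear regression, one updating with all samples per step, the other with a single sample per step.

# Illustrative sketch: batch gradient descent vs. stochastic gradient descent
# for a simple linear regression with true intercept 2 and slope 3.
set.seed(1)
n <- 1000
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)

lr <- 0.01                     # learning rate
theta_gd  <- c(0, 0)           # (intercept, slope) for batch GD
theta_sgd <- c(0, 0)           # (intercept, slope) for SGD

# Batch gradient descent: each update runs through ALL n samples
for (epoch in 1:500) {
  pred <- theta_gd[1] + theta_gd[2] * x
  grad <- c(mean(pred - y), mean((pred - y) * x))
  theta_gd <- theta_gd - lr * grad
}

# Stochastic gradient descent: each update uses ONE sample
for (epoch in 1:5) {
  for (i in sample(n)) {
    err <- theta_sgd[1] + theta_sgd[2] * x[i] - y[i]
    theta_sgd <- theta_sgd - lr * c(err, err * x[i])
  }
}

theta_gd    # close to c(2, 3), but only after 500 full passes over the data
theta_sgd   # reaches similar values with just 5 passes, oscillating near the optimum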

What is the difference between Gradient Descent and Stochastic Gradient Descent?

Practical advice for data analysis

Straight from the unofficial Google blog.

In this post, Patrick Riley gives a great description of a heuristic for carrying out data analysis in a somewhat more systematic way.

Technical: Ideas and techniques for how to manipulate and examine your data.
Process: Recommendations on how you approach your data, what questions to ask, and what things to check.
Social: How to work with others and communicate about your data and insights.

Breaking this heuristic down, he presents the following points (a small R illustration of two of the technical points appears after the lists):

Technical
– Look at your distributions
– Consider the outliers
– Report noise/confidence
– Look at examples
– Slice your data
– Consider practical significance
– Check for consistency over time

Process
– Separate Validation, Description, and Evaluation
– Confirm expt/data collection setup
– Check vital signs
– Standard first, custom second
– Measure twice, or more
– Check for reproducibility
– Check for consistency with past measurements
– Make hypotheses and look for evidence
– Exploratory analysis benefits from end to end iteration

Social
– Data analysis starts with questions, not data or a technique
– Acknowledge and count your filtering
– Ratios should have clear numerator and denominators
– Educate your consumers
– Be both skeptic and champion
– Share with peers first, external consumers second
– Expect and accept ignorance and mistakes
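Two of the technical recommendations ("Look at your distributions" and "Slice your data") are easy to illustrate. The snippet below is a minimal sketch of mine using a built-in R data set, not an example from Riley's post.

# Minimal sketch (not from Riley's post) of two technical recommendations:
# "Look at your distributions" and "Slice your data", using built-in data.
data(mtcars)

# Look at your distributions: a single mean of ~20 mpg hides the shape of the data
summary(mtcars$mpg)
hist(mtcars$mpg, breaks = 10, main = "Distribution of mpg", xlab = "miles per gallon")

# Slice your data: the overall average hides very different sub-groups
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)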

Practical advice for data analysis

Encyclopedia of Distances (Michel Deza & Elena Deza)

For anyone interested in learning more about mathematical distances (e.g. Euclidean, Mahalanobis, or Minkowski), this book is essential.

It is a compendium of countless mathematical distances, and it also contains plenty of guidance on which distance to use in which context.
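As a quick taste, the snippet below (my own illustration, not from the book) computes three of those distances in base R.

# Illustrative sketch: three of the distances catalogued in the book, in base R.
set.seed(7)
X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 observations, 3 variables
a <- X[1, ]; b <- X[2, ]

# Euclidean and Minkowski (p = 3) distances between two points
dist(rbind(a, b), method = "euclidean")
dist(rbind(a, b), method = "minkowski", p = 3)

# Squared Mahalanobis distance of each observation to the sample mean,
# taking the covariance structure of the data into account
mahalanobis(X, center = colMeans(X), cov = cov(X))[1:5]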

Encyclopedia of Distances (Michel Deza & Elena Deza)

An introduction to Cluster Analysis

Even though exploratory data analysis takes up a very large share of data science work, unsupervised learning methods still have their value, even if the scientific and professional communities rarely discuss them with the same frequency as predictive methods.

One of the most underrated techniques in machine learning is clustering (or cluster analysis).

This post by Kunal Jain offers one of the best overviews of cluster analysis and its particularities; a minimal R sketch of the first two families described below follows the list.

Connectivity models: As the name suggests, these models are based on the notion that data points closer in data space exhibit more similarity to each other than data points lying farther away. These models can follow two approaches. In the first approach, they start by classifying all data points into separate clusters and then aggregating them as the distance decreases. In the second approach, all data points are classified as a single cluster and then partitioned as the distance increases. Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which the notion of similarity is derived from the closeness of a data point to the centroid of the clusters. The K-Means clustering algorithm is a popular algorithm that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset. These models run iteratively to find a local optimum.
Distribution models: These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (for example, Gaussian). These models often suffer from overfitting. A popular example is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
Density models: These models search the data space for areas of varied density of data points. They isolate the different density regions and assign the data points within these regions to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
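As a minimal sketch of the first two families (my own illustration, not from Jain's post), the snippet below fits a connectivity model (hierarchical clustering) and a centroid model (k-means) to the built-in iris data.

# Connectivity vs. centroid models on the iris data (illustrative sketch).
data(iris)
X <- scale(iris[, 1:4])              # standardise the four numeric variables

# Connectivity model: agglomerative hierarchical clustering
hc <- hclust(dist(X), method = "ward.D2")
clusters_hc <- cutree(hc, k = 3)     # cut the dendrogram into 3 groups

# Centroid model: k-means, where the number of clusters is fixed beforehand
set.seed(123)
km <- kmeans(X, centers = 3, nstart = 25)

# Compare both partitions against the known species labels
table(clusters_hc, iris$Species)
table(km$cluster, iris$Species)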

An introduction to Cluster Analysis

The competitive landscape in Machine Learning

For anyone who wants to know the current state of the Machine Learning industry and its sub-segments, this Harvard Business Review article presents a great diagram that can be downloaded at the following link: the_state_of_machine_intelligence.

The competitive landscape in Machine Learning

Time Series Forecasting with XGBoost – the forecastxgb package

Anyone who has had the chance to work on predicting categorical variables in Machine Learning knows that XGBoost is one of the best packages around, being widely used in countless Kaggle competitions.

Peter Ellis's big contribution was to make some adaptations so that independent variables can be incorporated into the time-series predictive model through the xreg parameter (a hedged sketch of this appears after the basic example below).

For anyone working with time-series analysis this is an important contribution, not least because forecastxgb offers a way out of the Moving-Average/ARIMA (ARMA)/(S)ARIMA triad in which statisticians/Data Miners/Data Scientists so often get stuck, whether out of convenience or lack of alternatives.

An example of how to use the package is shown below:

# Install devtools to install packages that aren't in CRAN
install.packages("devtools")

# Installing package from github 
devtools::install_github("ellisp/forecastxgb-r-package/pkg") 

# Load the library
library(forecastxgb)

# Example time series (monthly Australian gas production, from the forecast package)
gas

# Model
model <- xgbts(gas)

# Summary of the model
summary(model)

# Forecasting 12 periods 
fc <- forecast(model, h = 12)

# Plot
plot(fc)
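Building on the example above, here is a sketch of how external regressors might be passed through xreg. The simulated data are mine and the exact argument names are an assumption (modelled on the conventions of the forecast package), so check the package documentation before relying on them.

# Hedged sketch (not from the original post): supplying an external regressor
# via the xreg parameter mentioned above, using simulated data.
set.seed(10)
y    <- ts(10 + cumsum(rnorm(120)), frequency = 12)        # simulated monthly series
xreg <- matrix(rnorm(120), ncol = 1,
               dimnames = list(NULL, "predictor"))          # one external regressor

model_x <- xgbts(y, xreg = xreg)

# Future values of the regressor are needed to forecast ahead
future_xreg <- matrix(rnorm(12), ncol = 1,
                      dimnames = list(NULL, "predictor"))
fc_x <- forecast(model_x, xreg = future_xreg)
plot(fc_x)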
Time Series Forecasting with XGBoost – the forecastxgb package