Data Science: How regulators, professors, and practitioners are getting it wrong

This DataRobot post is one of those pieces that shows how the evolution of Big Data platforms, combined with a growing computational and predictive arsenal, is sweeping aside any bullshit dressed up in technicalities when it comes to Data Science.

I will reproduce it in full, because it is worth using this post whenever you have to explain to some numbers bureaucrat (I won't name names, given the butthurt it could cause) why nobody cares about p-values, hypothesis tests, and the like anymore in an era of data abundance; and, above all, why statistical significance is dying.

“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”  ASA Statement on Statistical Significance and p-Values

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach to building predictive models.  Unfortunately, I can’t claim innocence in this matter.  I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics.  I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model?  How about an F-test?

Do my two samples have different means?  Student’s t-test!

Does my model fit my data?  Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated?  How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…
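Each of those rituals is a one-liner in a modern stats stack. A minimal sketch with `scipy.stats` (the samples and effect sizes below are invented purely for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# "Do my two samples have different means?"  Student's t-test.
a = rng.normal(loc=0.0, scale=1.0, size=200)
b = rng.normal(loc=0.3, scale=1.0, size=200)
t_stat, t_p = stats.ttest_ind(a, b)

# "Are my variables correlated?"  Pearson correlation coefficient.
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=1.0, size=200)
r, r_p = stats.pearsonr(x, y)

print(t_p, r_p)  # two p-values, ready to be (mis)interpreted
```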

These tests are all based on various theoretical assumptions.  If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

Here’s why statistical significance is a waste of time


If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation.  The procedure for doing this is pretty simple.  First, you make some assumptions about independence and data distributions, and variance, and so on.  Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression.  Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality.  If all these assumptions aren’t met, then none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What are you to do if your assumptions are invalid?  In practice, the standard move is to wave your hands about “robustness” or some such thing and then continue along the same path.

If your data is big enough, EVERYTHING is significant

“The primary product of a research inquiry is one or more measures of effect size, not P values.” Jacob Cohen

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant.  On one hand, this makes intuitive sense.  For example, the larger a dataset is, the more likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected.  On the other hand, for many assumption validity tests — e.g., tests for constant variance — statistical significance indicates invalid assumptions.  So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.
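This is easy to demonstrate. In the sketch below (effect size and sample sizes invented), a mean difference nobody would ever act on becomes overwhelmingly "significant" once the sample is large enough:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
tiny_effect = 0.01  # a difference in means nobody would act on

p_values = []
for n in (1_000, 100_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(tiny_effect, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    p_values.append(p)
    print(n, p)  # the p-value collapses as the sample grows
```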

Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work).  No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions.  To make matters worse, it’s a never ending task.  Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again.  And that’s assuming that the dozens of validity tests don’t give you inconsistent results.  It’s a gigantic waste of resources because there is a better way.

You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results.  Andrew Gelman observed that most modelers have a (perverse) incentive to produce statistically significant results — even at the expense of reality.  It’s hardly surprising that these techniques exist, given the pressure to produce valuable data-driven solutions.  This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.

If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America.  Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on.  To the extent that any of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated.  In other words, statistical significance will give the impression that a gender gap exists, when it may not — simply due to model misspecification.
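A toy simulation makes the point (all numbers below are invented). The outcome depends only on hours worked, yet a misspecified test that omits hours declares the group difference highly "significant":

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
n = 5_000

group = rng.integers(0, 2, size=n)             # a binary attribute
hours = 40 + 5 * group + rng.normal(0, 5, n)   # correlated with the group
wage = 2.0 * hours + rng.normal(0, 10, n)      # depends ONLY on hours

# Misspecified model: omit hours and test wage ~ group.
_, p_naive = stats.ttest_ind(wage[group == 1], wage[group == 0])

# Control for the true driver and the "group effect" disappears.
resid = wage - 2.0 * hours
_, p_adjusted = stats.ttest_ind(resid[group == 1], resid[group == 0])

print(p_naive, p_adjusted)
```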

Only out-of-sample accuracy matters

Whether or not results are statistically significant is the wrong question.  The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data.  Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place.  Fraud models that do a good job predicting fraud actually prevent losses.  Underwriting models that accurately segment credit risk really do increase profits.  Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework.  Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox.  They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.
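For instance, cross-validation gives an honest estimate of out-of-sample accuracy in a few lines. A minimal scikit-learn sketch (synthetic data standing in for a real business dataset such as fraud labels):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for, say, a fraud dataset.
X, y = make_classification(n_samples=2_000, n_features=20,
                           n_informative=5, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)

# Out-of-sample accuracy via 5-fold cross-validation: no p-values required.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```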

A survey of these methods is out of scope, but let me close with a final point.  Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable.  Custom-coded solutions of any kind are inherently error prone, even for the most experienced data scientist.

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.s, are too slow and expensive to develop and maintain.  Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and cheaply than other approaches.

By Greg Michaelson, Director – DataRobot Labs


Development and validation of a Deep Learning algorithm for detecting diabetic retinopathy in retinal fundus photographs

Back in 2008, when I began most of my studies and interest in Data Mining (a sister discipline of what we now call Data Science), I already held a strong conviction that data would be the engine of what we are living through today: the fourth industrial revolution, a revolution that has data as its main fuel across the most diverse areas of human knowledge.

That said, Google, together with researchers from the University of Texas, EyePACS LLC, University of California, Aravind Medical Research Foundation, Shri Bhagwan Mahavir Vitreoretinal Services, Verily Life Sciences, and Harvard Medical School, achieved an advance in the application of Deep Learning that confirms the thesis above.

Diabetic retinopathy is a disease caused by the progression of diabetes, in which the blood vessels at the back of the retina dilate or rupture; left untreated, it can lead to blindness. The disease accounts for more than 150,000 cases in Brazil and has no cure. However, if it is diagnosed early, treatment can help minimize the loss of sight.

And these researchers applied Deep Learning precisely to assist physicians in diagnosing this disease.


Putting it briefly: the team trained a Deep Learning model on a base of retinal fundus photographs and, in detecting diabetic retinopathy, obtained sensitivities of 90.3% and 87.0% (i.e., the algorithm's ability to correctly identify the disease among the cases where it is actually present) and specificities of 98.1% and 98.5% (i.e., the algorithm's precision in correctly identifying the people who do not have the disease among the cases where it is actually absent) on the two validation sets.

That high specificity of 98.1% (i.e., the ability to correctly identify, within the group that really does not have the disease, those who do not have it; in other words, to correctly identify the negative class) directs treatment very assertively, since there is greater certainty in the negative-class results (and thus efforts are not directed at those who do not actually have the disease).
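In confusion-matrix terms, the two definitions reduce to two ratios. The counts below are invented for illustration and are not the study's:

```python
def sensitivity(tp, fn):
    """Of everyone who HAS the disease, the fraction the model flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Of everyone who does NOT have it, the fraction the model clears."""
    return tn / (tn + fp)

# Invented counts for a screening batch of 1,000 images.
tp, fn = 90, 10    # diseased images: caught vs. missed
tn, fp = 885, 15   # healthy images: correctly cleared vs. false alarms

print(sensitivity(tp, fn))   # 0.90
print(specificity(tn, fp))   # ~0.983
```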

One idea for a preventive health strategy, or even for one-off diagnostics, is that after the imaging exam the retinal fundus photograph would pass through the already-trained Deep Learning algorithm; if the result came back positive, the physician would start treatment right away while the hospital and pharmacy network was notified of one more case, automatically provisioning medication and allocating medical resources for follow-up and treatment.

This undoubtedly shifts the whole axis of how we see medicine today, since it would bring an even greater efficiency gain to preventive medicine and avoid the inefficient (and far more expensive) curative medicine.

The study's main results are below.

Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs

Varun Gulshan, PhD; Lily Peng, MD, PhD; Marc Coram, PhD; Martin C. Stumpe, PhD; Derek Wu, BS; Arunachalam Narayanaswamy, PhD; Subhashini Venugopalan, MS; Kasumi Widner, MS; Tom Madams, MEng; Jorge Cuadros, OD, PhD; Ramasamy Kim, OD, DNB; Rajiv Raman, MS, DNB; Philip C. Nelson, BS; Jessica L. Mega, MD, MPH; Dale R. Webster, PhD

Key Points
Question How does the performance of an automated deep learning algorithm compare with manual grading by ophthalmologists for identifying diabetic retinopathy in retinal fundus photographs?

Finding In 2 validation sets of 9963 images and 1748 images, at the operating point selected for high specificity, the algorithm had 90.3% and 87.0% sensitivity and 98.1% and 98.5% specificity for detecting referable diabetic retinopathy, defined as moderate or worse diabetic retinopathy or referable macular edema by the majority decision of a panel of at least 7 US board-certified ophthalmologists. At the operating point selected for high sensitivity, the algorithm had 97.5% and 96.1% sensitivity and 93.4% and 93.9% specificity in the 2 validation sets.

Meaning Deep learning algorithms had high sensitivity and specificity for detecting diabetic retinopathy and macular edema in retinal fundus photographs.

Importance Deep learning is a family of computational methods that allow an algorithm to program itself by learning from a large set of examples that demonstrate the desired behavior, removing the need to specify rules explicitly. Application of these methods to medical imaging requires further assessment and validation.

Objective To apply deep learning to create an algorithm for automated detection of diabetic retinopathy and diabetic macular edema in retinal fundus photographs.

Design and Setting A specific type of neural network optimized for image classification called a deep convolutional neural network was trained using a retrospective development data set of 128 175 retinal images, which were graded 3 to 7 times for diabetic retinopathy, diabetic macular edema, and image gradability by a panel of 54 US licensed ophthalmologists and ophthalmology senior residents between May and December 2015. The resultant algorithm was validated in January and February 2016 using 2 separate data sets, both graded by at least 7 US board-certified ophthalmologists with high intragrader consistency.

Exposure Deep learning–trained algorithm.

Main Outcomes and Measures The sensitivity and specificity of the algorithm for detecting referable diabetic retinopathy (RDR), defined as moderate and worse diabetic retinopathy, referable diabetic macular edema, or both, were generated based on the reference standard of the majority decision of the ophthalmologist panel. The algorithm was evaluated at 2 operating points selected from the development set, one selected for high specificity and another for high sensitivity.

Results The EyePACS-1 data set consisted of 9963 images from 4997 patients (mean age, 54.4 years; 62.2% women; prevalence of RDR, 683/8878 fully gradable images [7.8%]); the Messidor-2 data set had 1748 images from 874 patients (mean age, 57.6 years; 42.6% women; prevalence of RDR, 254/1745 fully gradable images [14.6%]). For detecting RDR, the algorithm had an area under the receiver operating curve of 0.991 (95% CI, 0.988-0.993) for EyePACS-1 and 0.990 (95% CI, 0.986-0.995) for Messidor-2. Using the first operating cut point with high specificity, for EyePACS-1, the sensitivity was 90.3% (95% CI, 87.5%-92.7%) and the specificity was 98.1% (95% CI, 97.8%-98.5%). For Messidor-2, the sensitivity was 87.0% (95% CI, 81.1%-91.0%) and the specificity was 98.5% (95% CI, 97.7%-99.1%). Using a second operating point with high sensitivity in the development set, for EyePACS-1 the sensitivity was 97.5% and specificity was 93.4% and for Messidor-2 the sensitivity was 96.1% and specificity was 93.9%.

Conclusions and Relevance In this evaluation of retinal fundus photographs from adults with diabetes, an algorithm based on deep machine learning had high sensitivity and specificity for detecting referable diabetic retinopathy. Further research is necessary to determine the feasibility of applying this algorithm in the clinical setting and to determine whether use of the algorithm could lead to improved care and outcomes compared with current ophthalmologic assessment.


Strata Singapore 2016 – Machine learning in practice with Spark MLlib: An intelligent data analyzer


On the 8th, my friend Eiti Kimura and I will give a talk at Strata + Hadoop World in Singapore about a monitoring solution built with Machine Learning that we developed at Movile.

We have already had the chance to talk about it (quite briefly, it's true) at TDC 2016 here in São Paulo, but now we will be speaking at what can be considered the most important Big Data, Analytics, and Machine Learning event in the world.

We will talk a bit about our case, which is basically a time-series forecasting problem: we had LOTS of alarms that did not work properly and, worse, since our billing platform runs 24x7, every minute it is down costs us a lot of money.

Our solution, dubbed Watcher-AI, basically uses Spark MLlib and is coupled to our billing platform; at any sign of instability it notifies the whole team so the problem can be solved as quickly as possible.
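The real pipeline runs on Spark MLlib, but the core idea (learn normal behaviour, alert on deviation) can be sketched in a few lines of plain Python; the metric, values, and threshold below are invented, not Watcher-AI's actual logic:

```python
import statistics

def is_unstable(history, latest, z_threshold=3.0):
    """Flag the latest reading if it strays too far from recent history."""
    mean = statistics.fmean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return latest != mean
    return abs(latest - mean) / std > z_threshold

# Invented per-minute billing volumes for a healthy platform.
billing_per_minute = [100, 103, 98, 101, 99, 102, 100, 97]

print(is_unstable(billing_per_minute, 101))  # normal traffic -> False
print(is_unstable(billing_per_minute, 10))   # likely an outage -> True
```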

Over the coming days we will write a bit about the conference, about what's new in Machine Learning and Big Data, and share some reflections on Data Science and Machine Learning.

Don't miss the coming days.






Rexer Analytics Data Science Survey

Some interesting highlights.

• SURVEY & PARTICIPANTS: 59-item online survey conducted in 2015. Participants: 1,220 analytic professionals from 72 countries. This is the 7th survey in this ongoing research program.

• CORE ALGORITHM TRIAD: Regression, Decision Trees, and Cluster analysis remain the most commonly used algorithms in the field.

• THE ASCENDANCE OF R: 76% of respondents report using R. This is up dramatically from just 23% in 2007. More than a third of respondents (36%) identify R as their primary tool.

• JOB SATISFACTION: Job satisfaction in the field remains high, but has slipped since the 2013 survey. A number of factors predict Data Scientist job satisfaction levels.

• DEPLOYMENT: Deployment continues to be a challenge for organizations, with less than two thirds of respondents indicating that their models are deployed most or all of the time. Getting organizational buy-in is the largest barrier to deployment, with real-time scoring and other technology issues also causing significant deployment problems.

• TERMINOLOGY: The term “Data Scientist” has surged in popularity with over 30% of us describing ourselves as data scientists now compared to only 17% in 2013.



What is hardcore data science—in practice?

Via O’Reilly Ideas

A good article about putting Data Science models into production, or what is better known as Core Machine Learning.

So far, we have focused on how systems typically look in production. There are variations in how far you want to go to make the production system really robust and efficient. Sometimes, it may suffice to directly deploy a model in Python, but the separation between the exploratory part and production part is usually there.

One of the big challenges you will face is how to organize the collaboration between data scientists and developers. “Data scientist” is still a somewhat new role, but the work they have to do differs enough from those of typical developers that you should expect some misunderstandings and difficulties in communication.

The work of data scientists is usually highly exploratory. Data science projects often start with a vague goal and some ideas of what kind of data is available and methods that could be used, but very often, you have to try out ideas and get insights into your data. Data scientists write a lot of code, but much of this code is there to test out ideas and is expected to not be part of the final solution.



Lessons from Kaggle competitions

It is almost unnecessary to point out how much Kaggle has been contributing to the Data Science community, and these lessons from Harasymiv show that those contributions go well beyond the basics.

See below:

  • XGBoost is the engine of choice for structured problems (where feature manufacturing is the key), now available as a Python package. Behind XGBoost come the usual suspects: Random Forest and Gradient Boosted Trees. However, hyperparameter tuning buys only a few percentage points of accuracy on top; the major breakthroughs in predictive power come from feature manufacturing;
  • Feature manufacturing for structured problems is the key process (essentially permuting features to find the most predictive combination), either by iteratively trying various approaches (as thousands of individual competitors do) or in an automated fashion (as done by DataRobot; BTW, DataRobot is based partly in Boston and partly in Ukraine). Some Amazon engineers attending from Seattle commented that they are building a platform that iteratively permutes features to randomly (in "genetic algorithm" fashion) find the best features for structured problems, too;
  • For unstructured problems (visuals, text, sound), Neural Networks run the show (including deep learning, i.e., automatic feature extraction, and its variants). A great example was the application of NNs to the Diabetic Retinopathy problem, which surpassed commercially available products in accuracy;
  • Kaggle is really suitable for two types of problems:
      • A problem already solved today for which a more accurate solution is highly desirable: any fraction of a percentage point of accuracy turns into millions of dollars (e.g., loan default rate prediction); or
      • Problems never before tackled by machine learning, to see whether ML can help solve them (e.g., EEG readings to predict epilepsy);
  • Don't expect data scientists to perform best in the office! Anthony mentioned his first successful 24h data science hackathon, during which his senior would guide him for 5 minutes, code for 15 minutes, and then play basketball for 40 minutes of every hour. Personally, I find walking, gardening, and running are great creativity boosters. How will you work tomorrow? 🙂


Data Scientists don't scale!

This HBR article argues that natural language is the last frontier for truly scaling what they call data science, and it also shows that 'manual' data scientists work in an arrangement that simply does not scale.

To throw a bit more fuel on the fire of the hate (here, here, here, and here) that has been taking over the analytics/data mining/data science community toward GUI-based analysis tools and the new analytics workhorses such as Amazon, Google, and Microsoft.

Much of what the article says relates closely to the old but excellent Continental Airlines article that, extending Richard Hackathorn's work, places the types of latency in the decision-action context:


It all starts with the business event, which can be a sale or any transaction with monetary value for the company in question. From the business event onward, data latency begins: the time required to capture, transform, and cleanse the data from some transactional system and store it in the DW. That yields the second point on the action timeline: the stored data.

With the data stored, analysis latency begins: the time spent analyzing and disseminating the results of the analysis to the appropriate people; at the end of this process we have what is called delivered information. Once the information reaches the right people, decision latency begins: the time for the decision maker to understand the context and the situation, draw up an action plan, and kick off the set of tasks in the plan.

In the current scenario, where the data-storage problem has been all but solved by new technologies, data latency can be considered definitively solved (and scalable with money), which leaves analysis latency and decision latency.
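Summed up, the timeline from business event to action is just the sum of the three latencies. A small sketch (the hour figures below are invented):

```python
from dataclasses import dataclass

@dataclass
class ActionTimeline:
    # All values in hours; the numbers used below are invented.
    data_latency: float      # business event -> data stored in the DW
    analysis_latency: float  # data stored -> information delivered
    decision_latency: float  # information delivered -> action initiated

    @property
    def time_to_action(self) -> float:
        return self.data_latency + self.analysis_latency + self.decision_latency

# Cheap storage and fast ETL have shrunk data latency;
# the human stages now dominate the decision-action axis.
timeline = ActionTimeline(data_latency=0.5,
                          analysis_latency=24.0,
                          decision_latency=48.0)
print(timeline.time_to_action)  # 72.5
```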

Much of what is presented as Data Science is not directly tied to business questions in which, most of the time, time is the most decisive variable. That is, the X axis of the chart is extremely compressed.

As a result, much of what gets built is an optimal solution to a problem that should have been solved long ago, or worse: the solution took so long that the organization missed the timing to solve the problem at all. That can mean anything from a lost opportunity (e.g., opportunity cost) to millions of reais/dollars (e.g., revenue that could have been secured using the data-intelligence asset).

And this is the point: most corporations do not need the perfect solution, but rather a solution that answers a business question within a pre-established time limit; and it is in this context that Data Mining suites and GUI tools come in to solve, or help solve, the problem.

Moreover, as Julia Evans put it, understanding the problem is often as important as, or more important than, the solution itself.

So, within this scenario, the HBR piece is right: data scientists do not scale, for two reasons: (i) although intelligence is scalable, the human agent (the cognitive piece of the process) does not (not in the industrial terms the article describes), and (ii) the solutions are constrained to a finite, short time window.




Fake Reviews and Data Science – The Amazon Case

Vincent Granville once again with a great post on the subject.

A must-read for anyone working at companies exposed on social media, especially on opinion and review sites.


Fake Data Science

Analytic Bridge has a spot-on definition of the current moment of this discipline:

Books, certificates and graduate degrees in data science are spreading like mushrooms after the rain.

Unfortunately, many are just a mirage: some old guys taking advantage of the new paradigm to quickly re-package some very old material (statistics, R programming) with the new label: data science.

To add to the confusion, executives, decision makers building a new team of data scientists sometimes don’t know exactly what they are looking for, ending up hiring pure tech geeks, computer scientists, or people lacking proper experience. The problem is compounded by HR who do not know better, producing job ads which always contain the same keywords: Java, Python, Map Reduce, R, NoSQL. As if a data scientist was a mix of these skills.


Statistics vs. Data Science vs. Business Intelligence

In this R Bloggers post, David Smith draws an interesting parallel between these three disciplines. It shows that more and more data analysts will be needed to make sense of a business environment whose complexity is growing fast.
