Matching and the use of regression to analyze treatment effects

One of the thorniest subjects in statistics, when it comes to producing estimates for populations with different characteristics, is matching.

For those who don't know, matching is basically a technique for observational comparison between a control group and a treatment group, observation by observation (i.e. each member of the treatment group is paired with a comparable member of the control group and the differences in their estimates are examined). The main goal is to assess the effects of the treatment while taking into account the characteristics of the observed data, isolating, or at least explicitly accounting for, the differences between the covariates.

One example of its application is an IPEA study that estimates poor and indigent populations by mapping participating families with similar socioeconomic characteristics.

In this post, Matt Bogard discusses how regression yields a variance-based weighted average of the treatment effect, in contrast to the weighting implied by matching.

Hence, regression gives us a variance based weighted average treatment effect, whereas matching provides a distribution weighted average treatment effect.

So what does this mean in practical terms? Angrist and Pischke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is the greatest, or where there are an equal number of treated and control units. They state that differences matter little when the variation of δx is minimal across covariate combinations.

In his post The cardinal sin of matching, Chris Blattman puts it this way:

“For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all….Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate”


We can see that those in the treatment group tend to have higher outcome values so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:

E[Y_i | d_i = 1] - E[Y_i | d_i = 0] = E[Y_{1i} - Y_{0i}] + { E[Y_{0i} | d_i = 1] - E[Y_{0i} | d_i = 0] }

However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward then both matching and regression have significantly reduced it, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course if we run the regression including only matched variables, we get exactly the same results. (see R code below). This is not so different than the method of trimming based on propensity scores suggested in Angrist and Pischke.
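
The R code referenced above lives in Bogard's original post; below is only a rough Python sketch of the same idea (simulated data and made-up numbers, assuming numpy, pandas, and statsmodels are available), so the estimates will not match the .67 and .75 quoted above.

```python
# Rough sketch (not the post's original R code): naive vs. exact-matching vs.
# regression estimates of a treatment effect with one discrete covariate.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 5000

x = rng.integers(0, 3, size=n)                      # discrete covariate
d = rng.binomial(1, np.array([0.2, 0.5, 0.8])[x])   # treatment depends on x -> selection bias
y = 1.0 * d + 2.0 * x + rng.normal(size=n)          # true treatment effect = 1.0
df = pd.DataFrame({"y": y, "d": d, "x": x})

# Naive difference in means: biased upward, because treated units tend to have higher x.
naive = df.loc[df.d == 1, "y"].mean() - df.loc[df.d == 0, "y"].mean()

# Exact matching on x: within-cell differences, weighted by the number of treated units.
diffs, weights = [], []
for _, g in df.groupby("x"):
    diffs.append(g.loc[g.d == 1, "y"].mean() - g.loc[g.d == 0, "y"].mean())
    weights.append((g.d == 1).sum())
matched = np.average(diffs, weights=weights)

# Regression controlling for x: a variance-weighted average of the same cell effects.
reg = smf.ols("y ~ d + C(x)", data=df).fit().params["d"]

print(f"naive: {naive:.2f}   matched: {matched:.2f}   regression: {reg:.2f}")
```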


Data Science: How regulators, professors, and practitioners are getting it wrong

This post from DataRobot is one of those posts that shows how the evolution of Big Data platforms, combined with a larger computational and predictive arsenal, is sweeping away any bullshit dressed up in technicalities when it comes to Data Science.

I'm reproducing it in full, because it is worth keeping at hand whenever you have to explain to some numbers bureaucrat (I won't name names, given the butthurt it could cause) why nobody cares anymore about p-values, hypothesis tests, and the like in an era of abundant data; and, above all, why statistical significance is dying.

“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”  ASA Statement on Statistical Significance and p-Values

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach to building predictive models.  Unfortunately, I can’t claim innocence in this matter.  I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics.  I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model?  How about an F-test?

Do my two samples have different means?  Student’s t-test!

Does my model fit my data?  Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated?  How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…

These tests are all based on various theoretical assumptions.  If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

Here’s why statistical significance is a waste of time


If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation.  The procedure for doing this is pretty simple.  First, you make some assumptions about independence and data distributions, and variance, and so on.  Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression.  Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality.  If all these assumptions aren’t met, then none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What are you to do if your assumptions are invalid?  In practice, the usual approach is to wave your hands about “robustness” or some such thing and then continue along the same path.

If your data is big enough, EVERYTHING is significant

“The primary product of a research inquiry is one or more measures of effect size, not P values.” Jacob Cohen

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant.  On one hand, this makes intuitive sense.  For example, the larger a dataset is, the more likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected.  On the other hand, for many assumption validity tests — e.g., tests for constant variance — statistical significance indicates invalid assumptions.  So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.
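
As a quick illustration of this point (my addition, not part of the original post, assuming scipy is available): a shift of 0.01 standard deviations is practically meaningless, yet it becomes "statistically significant" once the sample is large enough.

```python
# A tiny, practically irrelevant effect becomes "significant" as n grows.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_shift = 0.01  # negligible effect size

for n in (100, 10_000, 1_000_000):
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(true_shift, 1.0, size=n)
    _, p = stats.ttest_ind(a, b)
    print(f"n={n:>9,}   p-value={p:.4f}")
# At n = 1,000,000 per group the p-value is (almost always) far below 0.05,
# even though a 0.01-sd shift is of no practical consequence.
```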

Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work).  No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions.  To make matters worse, it’s a never ending task.  Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again.  And that’s assuming that the dozens of validity tests don’t give you inconsistent results.  It’s a gigantic waste of resources because there is a better way.

You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results.  Andrew Gelman observed that most modelers have a (perverse) incentive to produce statistically significant results — even at the expense of reality.  It’s hardly surprising that these techniques exist, given the pressure to produce valuable data driven solutions.  This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.

If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America.  Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on.  To the extent that any of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated.  In other words, statistical significance will give the impression that a gender gap exists, when it may not — simply due to model misspecification.
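
To make the misspecification point concrete (again my addition, assuming statsmodels; all variable names and numbers are made up): in the simulation below the group indicator has no direct effect on the outcome at all, yet it looks large and highly significant once a correlated covariate is omitted.

```python
# Omitted-variable illustration: the group indicator has NO direct effect on
# the outcome, but it absorbs the effect of a correlated covariate when that
# covariate is left out of the model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 20_000

g = rng.binomial(1, 0.5, size=n)                # group indicator (e.g., gender)
hours = 40 + 5 * g + rng.normal(0, 5, size=n)   # covariate correlated with g
wage = 2.0 * hours + rng.normal(0, 10, size=n)  # outcome depends on hours, not on g

full = sm.OLS(wage, sm.add_constant(np.column_stack([g, hours]))).fit()
omitted = sm.OLS(wage, sm.add_constant(g)).fit()

print("full model,   coef on g:", round(full.params[1], 2), "  p =", round(full.pvalues[1], 4))
print("omitted var., coef on g:", round(omitted.params[1], 2), "  p =", round(omitted.pvalues[1], 4))
```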

Only out-of-sample accuracy matters

Whether or not results are statistically significant is the wrong question.  The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data.  Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place.  Fraud models that do a good job predicting fraud actually prevent losses.  Underwriting models that accurately segment credit risk really do increase profits.  Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework.  Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox.  They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.
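
A minimal sketch of that toolbox (my addition, assuming scikit-learn): score a boosted model by cross-validated AUC and read feature importances, with no hypothesis test in sight.

```python
# Judge a model by out-of-sample accuracy (cross-validated AUC), not p-values,
# and use feature importances instead of significance tests.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = GradientBoostingClassifier(random_state=0)

scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("CV AUC per fold:", scores.round(3))
print("mean CV AUC    :", scores.mean().round(3))

model.fit(X, y)
top5 = model.feature_importances_.argsort()[::-1][:5]
print("top 5 feature indices by importance:", top5)
```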

A survey of these methods is out of scope, but let me close with a final point.  Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable.  Custom coded solutions of any kind are inherently error prone, even for the most experienced data scientist.

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.’s, are too slow and expensive to develop and maintain.  Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and much cheaper than other approaches.

By Greg Michaelson, Director – DataRobot Labs


Probabilistic time series forecasting

Abstract

A large body of the forecasting literature so far has been focused on forecasting the conditional mean of future observations. However, there is an increasing need for generating the entire conditional distribution of future observations in order to effectively quantify the uncertainty in time series data. We present two different methods for probabilistic time series forecasting that allow the inclusion of a possibly large set of exogenous variables. One method is based on forecasting both the conditional mean and variance of the future distribution using a traditional regression approach. The other directly computes multiple quantiles of the future distribution using quantile regression. We propose an implementation for the two methods based on boosted additive models, which enjoy many useful properties including accuracy, flexibility, interpretability and automatic variable selection. We conduct extensive experiments using electricity smart meter data, on both aggregated and disaggregated scales, to compare the two forecasting methods for the challenging problem of forecasting the distribution of future electricity consumption. The empirical results demonstrate that the mean and variance forecasting provides better forecasts for aggregated demand, while the flexibility of the quantile regression approach is more suitable for disaggregated demand. These results are particularly useful since more energy data will become available at the disaggregated level in the future.

Probabilistic time series forecasting with boosted additive models
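
The paper itself uses boosted additive models; purely as a rough, hypothetical analogue (assuming scikit-learn and simulated heteroscedastic data), gradient boosting with a quantile loss fits several quantiles of the conditional distribution directly:

```python
# Rough analogue of the quantile-regression approach: fit the 10th, 50th and
# 90th conditional quantiles with a quantile loss (simulated data only).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(2000, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1 + 0.05 * X[:, 0])  # noise grows with X

models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, random_state=0).fit(X, y)
    for q in (0.1, 0.5, 0.9)
}

X_new = np.array([[2.0], [8.0]])
for q, m in models.items():
    print(f"q={q:.1f}  predictions at X=2 and X=8: {m.predict(X_new).round(2)}")
# The 0.1-0.9 spread is wider at X=8 than at X=2, tracking the growing noise:
# exactly the kind of predictive uncertainty the abstract is talking about.
```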


Cross-Validation, and the estimate of the estimate

An excellent post on Andrew Gelman's blog, titled "Cross-Validation != Magic", makes a very important observation about cross-validation:

“2. Cross-validation is a funny thing. When people tune their models using cross-validation they sometimes think that because it’s an optimum that it’s the best. Two things I like to say, in an attempt to shake people out of this attitude:

(a) The cross-validation estimate is itself a statistic, i.e. it is a function of data, it has a standard error etc.

(b) We have a sample and we’re interested in a population. Cross-validation tells us what performs best on the sample, or maybe on the hold-out sample, but our goal is to use what works best on the population. A cross-validation estimate might have good statistical properties for the goal of prediction for the population, or maybe it won’t.

Just cos it’s “cross-validation,” that doesn’t necessarily make it a good estimate. An estimate is an estimate, and it can and should be evaluated based on its statistical properties. We can accept cross-validation as a useful heuristic for estimation (just as Bayes is another useful heuristic) without buying into it as necessarily best.”

Cross-validation is a rather thorny subject when it comes to sampling and/or model estimation, since there are plenty of opinions both for and against it.

Knowing the mathematical/statistical properties of each sampling method, its advantages and disadvantages, and above all its limitations is rule number one for any data miner.

Personally, I see cross-validation as an excellent method when the dataset is small (few training instances), or even as a check alongside the usual 80-10-10 sampling split; but that is more a working heuristic than a proper rule.
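
Gelman's point (a), that the cross-validation estimate is itself a statistic with its own standard error, is easy to see empirically: repeat the CV split with different seeds and look at the spread. A hypothetical scikit-learn sketch:

```python
# The CV estimate is a statistic: resplitting the data changes it, so it has
# a spread of its own (here, 30 different 10-fold splits of the same data).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

estimates = []
for seed in range(30):
    cv = KFold(n_splits=10, shuffle=True, random_state=seed)
    estimates.append(cross_val_score(model, X, y, cv=cv).mean())

estimates = np.array(estimates)
print(f"CV accuracy: {estimates.mean():.3f} +/- {estimates.std():.3f} across 30 resplits")
```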


MVN – A web tool to check whether data follow a normal distribution

Granted, those who buy into the big data bubble may not even know what this is; but for anyone who uses statistics as a tool, this portal can help a lot.
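
For a quick check in code rather than in the browser, something along these lines works as a rough per-variable counterpart (my sketch, assuming scipy; a proper multivariate normality test, which is what MVN is about, needs dedicated methods):

```python
# Quick per-column normality check (D'Agostino-Pearson test); this is only a
# univariate stand-in for the multivariate tests offered by tools like MVN.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
data = np.column_stack([
    rng.normal(size=1000),       # roughly normal column
    rng.exponential(size=1000),  # clearly non-normal column
])

for j in range(data.shape[1]):
    _, p = stats.normaltest(data[:, j])
    verdict = "looks normal" if p > 0.05 else "not normal"
    print(f"column {j}: p = {p:.4f}  ->  {verdict}")
```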


10 things statistics can teach us about Big Data analysis

Even though the noise around Big Data is louder than the signal, posts like this show there is light at the end of the tunnel.

  1. If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without giving up bias. One of the earliest descriptions of this idea was of a much simplified version based on bootstrapping samples and building multiple prediction functions – a process called bagging (short for bootstrap aggregating). Random forests, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.
  2. Know what your real sample size is.  It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence vector graphics). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size, it is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size might be much less. In general the bigger the sample size the better and sample size and data size aren’t always tightly correlated.
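
Item 1 above (averaging many prediction models) in a hypothetical scikit-learn sketch: an averaged ensemble of trees typically beats a single tree out of sample.

```python
# Averaging many models (a random forest of 200 trees) vs. a single tree,
# compared by cross-validated R^2 on simulated data.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)

single_tree = DecisionTreeRegressor(random_state=0)
forest = RandomForestRegressor(n_estimators=200, random_state=0)

print("single tree   R^2:", cross_val_score(single_tree, X, y, cv=5).mean().round(3))
print("random forest R^2:", cross_val_score(forest, X, y, cv=5).mean().round(3))
```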


Communicating risk and uncertainty

A short guide on how to communicate issues such as opportunity vs. risk, relative vs. absolute risk, and conditional probability.

This table for translating probabilities into words shows how to transcribe the numbers clearly:

[Table: likelihood scale]

And finally, this is the table for expressing confidence:

[Table: confidence scale]


Long tails, kurtosis, and risk

Matt Bogard writes a short post on these questions and points to some references.

Understanding what kurtosis is (a concept from statistics) is of fundamental importance in studies that measure the range of values of a given variable, especially when those studies involve probability.

They recommend that kurtosis be defined as “the location- and scale-free movement of probability mass from the shoulders of a distribution into its center and tails. In particular, this definition implies that peakedness and tail weight are best viewed as components [emphasis mine] of kurtosis…. This definition is necessarily vague because the movement can be formalized in many ways” (p. 116). In other words, the peaks and tails of a distribution contribute to the value of the kurtosis, but so do other features.

The tail of the distribution is the most important contributor. Although Balanda and MacGillivray do not mention it, the kurtosis is a non-robust statistic that can be severely influenced by the value of a single outlier. For example, if you choose 999 observations from a normal distribution, the sample kurtosis will be close to 0. However, if you add a single observation that has the value 100, the sample kurtosis jumps to more than 800!
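
The quoted example is easy to reproduce (a short sketch assuming numpy and scipy; scipy reports excess kurtosis, so a normal sample sits near 0):

```python
# Sample kurtosis is extremely sensitive to a single outlier, as in the quote:
# 999 normal draws give excess kurtosis near 0; one extra value of 100 sends
# it into the hundreds.
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)
sample = rng.normal(size=999)
print("excess kurtosis, 999 N(0,1) draws:", round(kurtosis(sample), 2))

with_outlier = np.append(sample, 100.0)
print("excess kurtosis after adding 100 :", round(kurtosis(with_outlier), 2))
```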


Opinion manipulation on Facebook… Manipulation?

First, a bit of context.

Around September/October of last year, researchers linked to Facebook ran a study on emotional contagion through social networks using data from Facebook itself.

Here is the paper's abstract:

We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. We provide experimental evidence that emotional contagion occurs without direct interaction between people (exposure to a friend expressing an emotion is sufficient), and in the complete absence of nonverbal cues.

Emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. Emotional contagion is well established in laboratory experiments, with people transferring positive and negative emotions to others. Data from a large real-world social network, collected over a 20-y period suggests that longer-lasting moods (e.g., depression, happiness) can be transferred through networks [Fowler JH, Christakis NA (2008) BMJ 337:a2338], although the results are controversial. In an experiment with people who use Facebook, we test whether emotional contagion occurs outside of in-person interaction between individuals by reducing the amount of emotional content in the News Feed. When positive expressions were reduced, people produced fewer positive posts and more negative posts; when negative expressions were reduced, the opposite pattern occurred. These results indicate that emotions expressed by others on Facebook influence our own emotions, constituting experimental evidence for massive-scale contagion via social networks. This work also suggests that, in contrast to prevailing assumptions, in-person interaction and nonverbal cues are not strictly necessary for emotional contagion, and that the observation of others’ positive experiences constitutes a positive experience for people.

In short: Facebook deliberately tested, on nearly 700 thousand users, the effect of emotional contagion by 'suppressing or adding information' in those users' feeds.

There was a big controversy around the subject; the editors even issued a note clarifying some aspects of the study, and the same old complaints followed.

Against that backdrop, Andrew Gelman's blog ran an interesting post on whether these complaints are justified or not, and the answer is categorical:

[…] It seems a bit ridiculous to say that a researcher needs special permission to do some small alteration of an internet feed, when advertisers and TV networks can broadcast all sorts of emotionally affecting images whenever they want. The other thing that’s bugging me is the whole IRB thing, the whole ridiculous idea that if you’re doing research you need to do permission for noninvasive things like asking someone a survey question.[…]

[…]So, do I consider this Facebook experiment unethical? No, but I could see how it could be considered thus, in which case you’d also have to consider all sorts of non-research experiments (the famous A/B testing that’s so popular now in industry) to be unethical as well. In all these cases, you have researchers, of one sort or another, experimenting on people to see their reactions. And I don’t see the goal of getting published in PNAS to be so much worse than the goal of making money by selling more ads.[…]

[…]Again, I can respect if you take a Stallman-like position here (or, at least, what I imagine rms would say) and argue that all of these manipulations are unethical, that the code should be open and we should all be able to know, at least in principle, how our messages are being filtered. So I agree that there is an ethical issue here and I respect those who have a different take on it than I do—but I don’t see the advantage of involving institutional review boards here. All sorts of things are unethical but still legal, and I don’t see why doing something and publishing it in a scientific journal should be considered more unethical or held to a more stringent standard than doing the same thing and publishing it in an internal business report.[…]

In other words: there is no point in criticizing what Facebook did when advertising/publicity/marketing has been doing the same thing for years, one way or another. Publishing in an academic journal does not make someone less "ethical" (that is up to each person's value judgment) than doing the same thing internally through reports.

Personal note: as an 'insider' in the world of credit, non-standard banking products, and geolocation, I can say that paranoia does not help in these cases. Today, with a CEP (postal code) filled in on some form to get a discount on something, plus a CPF (taxpayer ID), anyone can be located in Brazil; and the credit card companies know a great deal about all of us.

Privacy today only exists in two places: unstructured media (e.g. notebooks, post-its, stray notes, etc.), or for terrorists and other members of criminal organizations who leave no digital trace and only carry out off-market transactions (e.g. smuggling, drug trafficking, arms flows to terrorists, etc.).


CART and Cross-Validation

This Salford Systems video contains a short excerpt from a talk given in 2004.

Cross-validation is a somewhat controversial topic, but like it or not, for anyone running experiments on datasets with fewer than 50,000 records (a cabalistic number, but it still serves) it can be the way to good results.

A great (in fact, the best) reference on this subject and its pros and cons is HASTIE, Trevor et al., The Elements of Statistical Learning, which can be downloaded here.
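
As a hypothetical sketch of how CART and cross-validation pair up in practice (assuming scikit-learn): use CV to choose the cost-complexity pruning level (ccp_alpha) of a classification tree.

```python
# Use cross-validation to pick the cost-complexity pruning level of a tree.
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# Candidate pruning levels come from the tree's own pruning path
# (clipped at zero to guard against tiny negative values from rounding).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = [max(a, 0.0) for a in path.ccp_alphas[:-1]]  # drop the alpha that prunes to the root

search = GridSearchCV(DecisionTreeClassifier(random_state=0), {"ccp_alpha": alphas}, cv=10)
search.fit(X, y)
print("best ccp_alpha     :", round(search.best_params_["ccp_alpha"], 4))
print("10-fold CV accuracy:", round(search.best_score_, 3))
```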


The prediction I don't want…

This site deals with questions directly related to data mining and its many variants in data analysis, machine learning, metaheuristics, mathematics, and statistics.

However, an article by John Katz of The New York Times about prediction models for this year's Senate elections is a clear (bad) example of how data analysis can never be an end in itself.

In short, the article discusses the problems with the prediction models and shows that, once again, the models failed to detect a Republican wave.

So far, nothing unusual: predictive models failing.

However, after Nate Silver's excellent book, one harmful side effect of the popularization of data analysis and data mining is that many newspapers, magazines, and websites started producing what I call sterile analyses: analyses that do not look at the consequences of a decision, but only at the numbers, as if predictive analysis were one big game show.

Not that elections of this kind contribute much in practical terms to taxpayers; but rather than knowing which predictive indicator is performing best, taxpayers would be better served by understanding how the composition of the Senate would influence budgetary and fiscal matters, and above all the big questions that matter to everyone.

The lesson is that data analysis and data mining are always in service of decision support, never just analysis per se.

PS: The analyses and the source code are at this link.


10 things statistics can teach us about Big Data

From time to time we see software vendors trying to push 'novelties' such as Big Data, MapReduce, distributed processing, and so on. That is all very well for marketing and advertising, but on the technical side everyone who works with data analysis should at least know the basics, and the basics are called statistics.

Understand one thing: Big Data today is nothing more than marketing jargon used by every player in the market to cause a stir among IT managers, directors, coordinators, and the like.

Data analysis has been around since Edgar Frank Codd laid out his postulates on database modeling based on the relational algebra paradigm.

What changed is that Moore's Law, which applied to processing power (transistors on chips) and which many believed also applied to storage, simply turned out not to hold for storage. In other words, we discovered that we can store far more information, at an extremely low cost, than we could 40 years ago.

The chart below shows what Jeff Leek himself considers the 'big data revolution'.

[Chart: the 'big data revolution', from Jeff Leek's post]

If on one hand this increased the availability of data for analysis, on the other hand, largely because computer science (in my personal view, at least for now) cheapened statistics with the advent of algorithms, many computer scientists, information systems graduates, and others who ended up doing data analysis thought they could look down on the statistics that has long been helping scientists around the world.

A little aphorism I keep on this issue: “you can't think about Big Data when we still haven't learned the postulates about sampling that statistics offers us.”** Simple as that.

With that said, here are the 10 ways statistics can help Big Data, as listed by Jeff Leek:

1) If the goal is prediction accuracy, average many prediction models together
2) When testing many hypotheses, correct for multiple testing
3) When you have data measured over space, distance, or time, you should smooth
4) Before you analyze your data with computers, be sure to plot it
5) Interactive analysis is the best way to really figure out what is going on in a data set
6) Know what your real sample size is
7) Unless you ran a randomized trial, potential confounders should keep you up at night
8) Define a metric for success up front
9) Make your code and data available and have smart people check it
10) Problem first not solution backward

**As soon as I finish some important reading on the subject, I'll say more about this big data nonsense being sold, and about some alternatives to it.
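
Item 2 of Leek's list (correct for multiple testing) deserves a concrete illustration; a hypothetical sketch with a Benjamini-Hochberg correction via statsmodels, on data where no real effects exist:

```python
# With 1000 hypotheses and no real effects, ~50 raw p-values still fall below
# 0.05 by chance; a Benjamini-Hochberg (FDR) correction removes most of them.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_tests, n_obs = 1000, 50

p_values = np.array([
    stats.ttest_ind(rng.normal(size=n_obs), rng.normal(size=n_obs)).pvalue
    for _ in range(n_tests)
])

print("raw p < 0.05       :", int((p_values < 0.05).sum()), "false discoveries")
rejected = multipletests(p_values, alpha=0.05, method="fdr_bh")[0]
print("after BH correction:", int(rejected.sum()), "discoveries")
```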


Why do we visualize quantitative data?

Stephen Few gives a masterful explanation:

But why is it that we must sometimes use graphical displays to perform these tasks rather than other forms of representation? Why not always express values as numbers in tables? Why express them visually rather than audibly? Essentially, there is only one good reason to express quantitative data visually: some features of quantitative data can be best perceived and understood, and some quantitative tasks can be best performed, when values are displayed graphically. This is so because of the ways our brains work. Vision is by far our dominant sense. We have evolved to perform many data sensing and processing tasks visually. This has been so since the days of our earliest ancestors who survived and learned to thrive on the African savannah. What visual perception evolved to do especially well, it can do faster and better than the conscious thinking parts of our brains. Data exploration, sensemaking, and communication should always involve an intimate collaboration between seeing and thinking (i.e., visual thinking).

Below he presents the table of data visualization tasks and goals.

[Table: data visualization tasks and goals]


Why is the Big Data phenomenon in trouble? They forgot applied statistics

In this post, Jeff Leek makes a very relevant point about data analysis.

At a time when vendors of Business Intelligence software, or even of database management systems, try to convince managers, directors, and decision makers that we need more data, this post simply says: "No, learn statistics first!"

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu trends. Google Flu trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google Search Terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process have led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive and tried to understand what the likely reason that Google Flu trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model were questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statisticians would have performed.

At the end, the author asks a question I find extremely relevant: "When thinking about the big data era, what are some statistical ideas we've already figured out?"

I have a few:

1) Sample size determination for building models, with a known or unknown population size (see the sketch after this list);

2) Design of experiments;

3) Exploratory data analysis.
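
For item 1, here is a sketch of the textbook sample-size formula for estimating a proportion, with the finite-population correction when the population size is known (the function name and defaults are my own):

```python
# Classical sample size for estimating a proportion:
#   n0 = z^2 * p * (1 - p) / e^2        (unknown / very large population)
#   n  = n0 / (1 + (n0 - 1) / N)        (finite-population correction)
from math import ceil
from scipy.stats import norm

def sample_size(margin_error=0.05, confidence=0.95, p=0.5, population=None):
    z = norm.ppf(1 - (1 - confidence) / 2)           # e.g., 1.96 for 95%
    n0 = (z ** 2) * p * (1 - p) / margin_error ** 2
    if population is None:
        return ceil(n0)
    return ceil(n0 / (1 + (n0 - 1) / population))

print(sample_size())                   # ~385 for an effectively infinite population
print(sample_size(population=10_000))  # ~370 when N = 10,000 is known
```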


Data vs. Theory

This Noahpinion post once again takes up the Data vs. Theory debate. The author even quotes Paul Krugman:

But you can’t be an effective fox just by letting the data speak for itself — because it never does. You use data to inform your analysis, you let it tell you that your pet hypothesis is wrong, but data are never a substitute for hard thinking. If you think the data are speaking for themselves, what you’re really doing is implicit theorizing, which is a really bad idea (because you can’t test your assumptions if you don’t even know what you’re assuming.)

At the end, the author slips up badly with this sentence:

In the past, data-laziness was probably more of a threat to humanity. Since systematic data was scarce, people had a tendency to sit around and daydream about how stuff might work. But now that Big Data is getting bigger and computing power is cheap, theory-laziness seems to be becoming more of a menace. The lure of Big Data is that we can get all our ideas from mining for patterns, but A) we get a lot of false patterns that way, and B) the patterns insidiously and subtly suggest interpretations for themselves, and those interpretations are often wrong.

Three quick notes on this article:

1 – Nate Silver's success, through his site and his book, simply demolished the political pundits in the US and led public opinion to question the 'experts' and the biases in their opinions. Even Paul Krugman is bothered by that;

2 – Today we have far more advanced statistical machinery for analyzing data than in the past. The fact that the data did not exist back then does not mean the theories were valid just because they were untestable. Quantitative analysis today means that theory can be tested and subjected to constant falsification, which is a basic requirement of scientific analysis; and

3 – The golden age in which economists, sociologists, statisticians, journalists, and everyone else simply leaned on the formal and structural aspects of theory, using sampling (without disclosing, of course, biases and methodology), is coming to an end. And that is great.

For those who want to know a bit more about the anger at Nate Silver and about his approach, it is here.


Definitions of Data Mining, Statistics, and Machine Learning

This Geomblog post, in a very simple way (bordering on genius), neatly defines these disciplines as follows:

  • Data mining is the art of finding patterns in data;
  • Statistics is the mathematical science concerned with drawing inferences from noisy data; and
  • Machine learning is [the branch of computer science] that develops technology for automated inference (its original characterization was as a branch of engineering).