# Data Science: Como agentes reguladores, professores e praticantes estão fazendo isso errado

Esse post da Data Robot é um daqueles tipos de post que mostra muito como a evolução das plataformas de Big Data, aliado com um maior arsenal computacional e preditivo estão varrendo para baixo do tapete qualquer bullshit disfarçado com tecnicalidades em relação à Data Science.

Vou reproduzir na íntegra, pois vale a pena usar esse post quando você tiver que justificar a qualquer burocrata de números (não vou dar nome aos bois dado o butthurt que isso poderia causar) porque ninguém mais dá a mínima para P-Valor, testes de hipóteses, etc na era em que temos uma abundância de dados; e principalmente está havendo a morte da significância estatística.

“Underpinning many published scientific conclusions is the concept of ‘statistical significance,’ typically assessed with an index called the p-value. While the p-value can be a useful statistical measure, it is commonly misused and misinterpreted.”  ASA Statement on Statistical Significance and p-Values

If you’ve ever heard the words “statistically significant” or “fail to reject,” then you are among the countless thousands who have been traumatized by an academic approach building predictive models.  Unfortunately, I can’t claim innocence in this matter.  I taught statistics when I was in grad school, and I do have a Ph.D. in applied statistics.  I was born into the world that uses formal hypothesis testing to justify every decision made in the model building process:

Should I include this variable in my model?  How about an F-test?

Do my two samples have different means?  Student’s t-test!

Does my model fit my data?  Why not try the Hosmer–Lemeshow test or maybe use the Cramér–von Mises criterion?

Are my variables correlated?  How about a test using a Pearson Correlation Coefficient?

And on, and on, and on, and on…

These tests are all based on various theoretical assumptions.  If the assumptions are valid, then they allegedly tell you whether or not your results are “statistically significant.”

Over the last century, as businesses and governments have begun to incorporate data science into their business processes, these “statistical tests” have also leaked into commercial and regulatory practices.

For instance, federal regulators in the banking industry issued this tortured guidance in 2011:

“… statistical tests depend on specific distributional assumptions and the purpose of the model… Any single test is rarely sufficient, so banks should apply a variety of tests to develop a sound model.”

In other words, statistical tests have lots of assumptions that are often (always) untrue, so use lots of them. (?!)

## Here’s why statistical significance is a waste of time

### If assumptions are invalid, the tests are invalid — even if your model is good

I developed a statistical test of my very own for my dissertation.  The procedure for doing this is pretty simple.  First, you make some assumptions about independence and data distributions, and variance, and so on.  Then, you do some math that relies (heavily) on these assumptions in order to come up with a p-value. The p-value tells you what decision to make.

As an example, let’s take linear regression.  Every business stats student memorizes the three assumptions associated with the p-values in this approach: independence (for which no real test exists), constant variance, and normality.  If all these assumptions aren’t met, then none of the statistical tests that you might do are valid; yet regulators, professors, scientists, and statisticians all expect you to rely (heavily) on these tests.

What’s are you to do if your assumptions are invalid?  In practice, the general practice is to wave your hands about “robustness” or some such thing and then continue along the same path.

### If your data is big enough, EVERYTHING is significant

“The primary product of a research inquiry is one or more measures of effect size, not P values.” Jacob Cohen

As your data gets bigger and bigger (as data tends to do these days), everything becomes statistically significant.  On one hand, this makes intuitive sense.  For example, the larger a dataset is, the most likely an F-test is to tell you that your GLM coefficients are nonzero; i.e., larger datasets can support more complex models, as expected.  On the other hand, for many assumption validity tests — e.g., tests for constant variance — statistical significance indicates invalid assumptions.  So, for big datasets, you end up with tests telling you every feature is significant, but assumption tests telling you to throw out all of your results.

### Validating assumptions is expensive and doesn’t add value

Nobody ever generated a single dollar of revenue by validating model assumptions (except of course the big consulting firms that are doing the work).  No prospect was converted; no fraud was detected; no marketing message was honed by the drudgery of validating model assumptions.  To make matters worse, it’s a never ending task.  Every time a model is backtested, refreshed, or evaluated, the same assumption-validation-song-and-dance has to happen again.  And that’s assuming that the dozens of validity tests don’t give you inconsistent results.  It’s a gigantic waste of resources because there is a better way.

### You can cheat, and nobody will ever know

Known as data dredging, data snooping, or p-hacking, it is very easy and relatively undetectable to manufacture statistically significant results.  Andrew Gelman observed that most modelers have a (perverse) incentive to produce statistically significantresults — even at the expense of reality.  It’s hardly surprising that these techniques exist, given the pressure to produce valuable data driven solutions.  This risk, on its own, should be sufficient reason to abandon p-values entirely in some settings, like financial services, where cheating could result in serious consequences for the economy.

### If the model is misspecified, then your p-values are likely to be misleading

Suppose you’re investigating whether or not a gender gap exists in America.  Lots of things are correlated with gender; e.g., career choice, hours worked per week, percentage of vacation taken, participation in a STEM career, and so on.  To the extent that any of these variables are excluded from your investigation — whether you know about them or not — the significance of gender will be overstated.  In other words, statistical significance will give the impression that a gender gap exists, when it may not — simply due to model misspecification.

## Only out-of-sample accuracy matters

Whether or not results are statistically significant is the wrong question.  The only metric that actually matters when building models is whether or not your models can make accurate predictions on new data.  Not only is this metric difficult to fake, but it also perfectly aligns with the business motivation for building the model in the first place.  Fraud models that do a good job predicting fraud actually prevent losses.  Underwriting models that accurately segment credit risk really do increase profits.  Optimizing model accuracy instead of identifying statistical significance makes good business sense.

Over the course of the last few decades lots and lots of tools have been developed outside of the hypothesis testing framework.  Cross-validation, partial dependence, feature importance, and boosting/bagging methods are just some of the tools in the machine learning toolbox.  They provide a means not only for ensuring out-of-sample accuracy, but also understanding which features are important and how complex models work.

A survey of these methods is out of scope, but let me close with a final point.  Unlike traditional statistical methods, tasks like cross-validation, model tuning, feature selection, and model selection are highly automatable.  Custom coded solutions of any kind are inherently error prone, even for the most experienced data scientist

Many of the world’s biggest companies are recognizing that bespoke models, hand-built by Ph.D.’s are too slow and expensive to develop and maintain.  Solutions like DataRobot provide a way for business experts to build predictive models in a safe, repeatable, systematic way that yields business value much more quickly and much cheaper than other approaches.

By Greg Michaelson, Director – DataRobot Labs

# Strata Singapore 2016 – Machine learning in practice with Spark MLlib: An intelligent data analyzer

Pessoal,

Já tivemos a oportunidade de falar um pouco (de maneira bem breve é verdade) no TDC 2016 aqui em São Paulo, mas agora iremos falar em um evento que pode ser considerado o mais importante de Big Data, Analytics e Machine Learning do mundo.

Falaremos um pouco do nosso case que basicamente é um problema de previsão de série temporal, em que tínhamos MUITOS alarmes que não funcionavam da maneira adequada e mais: como a nossa plataforma de tarifação funciona 24 x 7, a cada minuto que ela fica parada estamos perdendo muito dinheiro.

E a nossa solução que foi batizada de Watcher-AI usa basicamente o Spark MLLib e é acoplada na nossa plataforma de billing; e em qualquer sinal de instabilidade faz notificação para todo o time para solucionar o problema o mais rápido possível.

Ao longo dos dias vamos falar um pouco da conferência, sobre as novidades em relação à Machine Learning, Big Data, e algumas reflexões sobre Data Science e Machine Learning.

Não percam os próximos dias.

# Engenheiros não devem fazer ETL

But the role sounds really nice, and it’s easy to recruit for. Thus was born the traditional, modern day data science department: data scientists (Report developers aka “thinkers”), data engineers (ETL engineers aka “doers”), and infrastructure engineers (DBAs aka “plumbers”).

Whoops. It would seem that the business intelligence department never really changed, we just added a Hadoop cluster and started calling it by a new name.

# O ocaso das ferramentas proprietárias de Machine Learning e Data Mining

Vendo esse post do KDNuggets que pergunta “se as ferramentas proprietárias ainda são relevantes?” é resposta é um sim, porém com uma relevância menor e estado avançado de atrofia em comparação com as ferramentas open source.

Desses quase 8 anos de Data Mining e Machine Learning é bem fácil identificar as causas desse declínio, e o porque isso foi ótimo para toda a indústria de machine learning como um todo:

1. Ênfase nas grandes corporações em que as grandes soluções de analytics vieram chegar nas médias empresas somente quando as gigantes começaram a apertar o seu budget enquanto as opções open source já haviam dominado esse mercado;
2. Ciclos de desenvolvimento lentos em que para se colocar um algoritmo K-Means levava quase 6 meses, enquanto no Scikit-Learn tem sprint que não dura nem 3 meses;
3. Não incorporar os algoritmos modernos nas suas respectivas plataformas como Redes Neurais, LDA, SVM, entre outros;
4. Falta de integração com outras plataformas open como Linux ou Debian, ou linguagens como Java, Python, etc;
5. Tentativa de vendor lock-in em um cenário que a competitividade está aumentando muito e todos estão com orçamentos restritos;
6. Preço: Eles REALMENTE acham que vão vender suites desktop por R\$ 5.000.
7. Mais investimentos em Marketing do que em pesquisa: O quadrante da Gartner agradece (Veja essa pedrada antológica para entender como esse business funciona).
8. Perda da guerra das universidades: Todos sabem que a próxima geração de profissionais de Analytics, Machine Learning estão nesse momento nas universidades aprendendo R, usando Weka e demais ferramentas open, mesmo com grandes ferramentas point-and-click. Enquanto isso a Matlab está tentando usar uma tática de desconstrução desnecessária.
9. E o mais importante: O que eles vendem, empresas muito maiores estão dando ou patrocinando de graça.

Com todo esse cenário ótimo para os entusiastas e profissionais de Machine Learning, as empresas de software proprietário vão ter que se reinventar caso queiram sobreviver em um futuro a médio prazo.

# 10 coisas que a estatística pode nos ensinar sobre a análise de Big Data

1. If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without giving up bias. One of the earliest descriptions of this idea was of a much simplified version based onbootstrapping samples and building multiple prediction functions – a process called bagging (short for bootstrap aggregating). Random forests, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.
2. Know what your real sample size is.  It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence vector graphics). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size, it is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size might be much less. In general the bigger the sample size the better and sample size and data size aren’t always tightly correlated.

# Um post demolidor do Stephen Few sobre o Big Data

Contrariando os departamentos de marketing dos grandes vendedores de software, o Stephen Few vem travando uma guerra quase que pessoal contra a indústria do Big Data.

Como esse termo que é mais comentado nas redes sociais e no marketing do que é praticado em campo (como eu chamo esses verdadeiros soldados da ciência de dados como o Luti, Erickson Ricci, Big Leka, Fabiano Amorim, Fabrício Lima, Marcos Freccia, entre outros) há uma entropia de opiniões e conceitos. Com essa entropia quem perde são somente os desinformados que não conseguem separar o sinal do ruído que acabam virando presas fáceis de produtos com qualidade duvidosa.

A vítima da vez foi o livro Dataclysm do Christian Rudder.

Em um dado momento do livro, o autor realiza um tipo de criticismo ao processo científico em que alguns pesquisadores das ciências do comportamento aplicadas utilizam seus alunos como amostra, e o autor de forma quase que pedante chama essas pesquisas de WEIRD (White, Educated, Industrialized, Rich and Democratic). Em tradução livre uma brincadeira com o acrônimo da palavra “Esquisita” em inglês como uma espécie de conotação pejorativa.

I understand how it happens: in person, getting a real representative data set is often more difficult than the actual experiment you’d like to perform. You’re a professor or postdoc who wants to push forward, so you take what’s called a “convenience sample”—and that means the students at your university. But it’s a big problem, especially when you’re researching belief and behavior. It even has a name: It’s called WEIRD research: white, educated, industrialized, rich, and democratic. And most published social research papers are WEIRD.

O que poderia ser um criticismo de um autor que tem como background os méritos em ser um dos co-fundadores do OKCupid, vira em uma leitura mais cuidadosa da exposição de uma lacuna em relação à análise de dados e pior: expõe um erro de entendimento em relação à teoria da amostragem (nada que uma leitura atenciosa do livro dos professores Bolfarine e Bussab não solucionasse).

E a resposta do Stephen Few é demolidora:

Rudder is a co-founder of the online dating service OKCupid. As such, he has access to an enormous amount of data that is generated by the choices that customers make while seeking romantic connections. Add to this the additional data that he’s collected from other social media sites, such as Facebook and Twitter, and he has a huge data set. Even though the people who use these social media sites are more demographically diverse than WEIRD college students, they don’t represent society as a whole. Derek Ruths of McGill University and Jürgen Pfeffer of Carnegie Mellon University recently expressed this concern in an article titled “Social Medial for Large Studies of Behavior,” published in the November 28, 2014 issue of Science. Also, the conditions under which the data was collected exercise a great deal of influence, but Rudder has “stripped away” most of this context.

Lição #1: Demografia não é sinal de diversidade em análise de dados.

Após esse trecho vem uma fala do Stephen Few que mostra de maneira bem sutil o arsenal retórico dos departamentos de marketing para convencer pessoas inteligentes em investir em algo que elas não entendem que é a poesia do entendimento; e uma outra situação mais grave: acreditar que os dados online em que somos perfis falam de maneira exata quem somos.

Contrary to his disclaimers about Big Data hype, Rudder expresses some hype of his own. Social media Big Data opens the door to a “poetry…of understanding. We are at the cusp of momentous change in the study of human communication.” He believes that the words people write on these sites provide the best source of information to date about the state and nature of human communication. I believe, however, that this data source reveals less than Rudder’s optimistic assessment. I suspect that it mostly reveals what people tend to say and how they tend to communicate on these particular social media sites, which support specific purposes and tend to be influenced by technological limitations—some imposed (e.g., Twitter’s 140 character limit) and others a by-product of the input device (e.g., the tiny keyboard of a smartphone). We can certainly study the effects that these technological limitations have on language, or the way in which anonymity invites offensive behavior, but are we really on the “cusp of momentous change in the study of human communication”? To derive useful insights from social media data, we’ll need to apply the rigor of science to our analyses just as we do with other data sources.

Lição #2: Entender o viés amostral, sempre irá reduzir a chance de más generalizações.

Lição #3: Contextos específicos não são generalizáveis (i.e. indução não é a mesma coisa que dedução).

E por último o autor fala uma pérola que merece estar em um panteão de bullshits (como esse da Bastter.com que é o maior combatente do bullshit midiático e de marketing do Brasil). É necessário que os leitores mais sensíveis a ausência de raciocínio lógico-cientifico segurem-se com o que vem aí. Segurem-se porque essa afirmação é forte:

“With Big Data we no longer need to adhere to the basic principles of science.”

“Com Big Data não precisaremos aderir os princípios básicos da ciência”

A resposta, mais uma demolição:

Sourcing data from the wild rather than from controlled experiments in the lab has always been an important avenue of scientific study. These studies are observational rather than experimental. When we do this, we must carefully consider the many conditions that might affect the behavior that we’re observing. From these observations, we carefully form hypotheses, and then we test them, if possible, in controlled experiments. Large social media data sets don’t alleviate the need for this careful approach. I’m not saying that large stores of social media data are useless. Rather, I’m saying that if we’re going to call what we do with it data science, let’s make sure that we adhere to the principles and practices of science. How many of the people who call themselves “data scientists” on resumes today have actually been trained in science? I don’t know the answer, but I suspect that it’s relatively few, just as most of those who call themselves “data analysts” of some type or other have not been trained in data analysis. No matter how large the data source, scientific study requires rigor. This need is not diminished in the least by data volume. Social media data may be able to reveal aspects of human behavior that would be difficult to observe in any other way. We should take advantage of this. However, we mustn’t treat social media data as magical, nor analyze it with less rigor than other sources of data. It is just data. It is abundantly available, but it’s still just data.

Utilizando a mesma lógica contida na argumentação, não precisamos de ensaios randomizados para saber se um determinado remédio ou mesmo tipo de paradigma de alimentação está errado; podemos esquecer questões como determinação amostral, a questão das hipóteses, ou mesmo conceitos básicos de randomização amostral, ou mesmo verificar especificidades da população para generalizar conclusões, ou sequer considerar erros aleatórios ou flutuações estatísticas.

Apenas pegue dados de redes sociais e generalize.

Lição #4: Volume não significa nada sem significância amostral.

Lição #5: Independente da fonte dos dados, ainda continuam sendo dados. E sempre devem ser tratados com rigor.

Haverá alguns posts sobre essa questão amostral, mas o mais importante são as lições que podemos tirar desses que eu considero inocentes a serviço da desinformação.

# Manipulação de opiniões no Facebook… Manipulação?

Primeiro uma breve contextualização sobre o assunto.

Aqui está o abstract do artigo:

We show, via a massive (N = 689,003) experiment on Facebook, that emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. We provide experimental evidence that emotional contagion occurs without direct interaction between people (exposure to a friend expressing an emotion is sufficient), and in the complete absence of nonverbal cues.

Emotional states can be transferred to others via emotional contagion, leading people to experience the same emotions without their awareness. Emotional contagion is well established in laboratory experiments, with people transferring positive and negative emotions to others. Data from a large real-world social network, collected over a 20-y period suggests that longer-lasting moods (e.g., depression, happiness) can be transferred through networks [Fowler JH, Christakis NA (2008) BMJ 337:a2338], although the results are controversial. In an experiment with people who use Facebook, we test whether emotional contagion occurs outside of in-person interaction between individuals by reducing the amount of emotional content in the News Feed. When positive expressions were reduced, people produced fewer positive posts and more negative posts; when negative expressions were reduced, the opposite pattern occurred. These results indicate that emotions expressed by others on Facebook influence our own emotions, constituting experimental evidence for massive-scale contagion via social networks. This work also suggests that, in contrast to prevailing assumptions, in-person interaction and nonverbal cues are not strictly necessary for emotional contagion, and that the observation of others’ positive experiences constitutes a positive experience for people.

Em suma: O Facebook propositalmente testou em pouco mais de 700 mil usuários o efeito do contágio de sentimentos através da ‘supressão ou adição de informações’ na linha do tempo desses usuários.

Houve uma grande polêmica em torno do assunto, inclusive até os editores emitiram uma nota esclarecendo alguns aspectos do estudo, e houve a mesma reclamação de sempre.

Com esse plano de fundo, no blog do Andrew Gelman foi escrito um post interessante sobre a questão e se essas reclamações são justificáveis ou não, e a resposta é categórica:

[…] It seems a bit ridiculous to say that a researcher needs special permission to do some small alteration of an internet feed, when advertisers and TV networks can broadcast all sorts of emotionally affecting images whenever they want. The other thing that’s bugging me is the whole IRB thing, the whole ridiculous idea that if you’re doing research you need to do permission for noninvasive things like asking someone a survey question.[…]

[…]So, do I consider this Facebook experiment unethical? No, but I could see how it could be considered thus, in which case you’d also have to consider all sorts of non-research experiments (the famous A/B testing that’s so popular now in industry) to be unethical as well. In all these cases, you have researchers, of one sort or another, experimenting on people to see their reactions. And I don’t see the goal of getting published in PNAS to be so much worse than the goal of making money by selling more ads.[…]

[…]Again, I can respect if you take a Stallman-like position here (or, at least, what I imagine rms would say) and argue that all of these manipulations are unethical, that the code should be open and we should all be able to know, at least in principle, how our messages are being filtered. So I agree that there is an ethical issue here and I respect those who have a different take on it than I do—but I don’t see the advantage of involving institutional review boards here. All sorts of things are unethical but still legal, and I don’t see why doing something and publishing it in a scientific journal should be considered more unethical or held to a more stringent standard than doing the same thing and publishing it in an internal business report.[…]

Em outras palavras: Não adianta a critica ao que o Facebook fez se de uma maneira ou de outra a propaganda/publicidade/marketing vem fazendo isso a anos. Não é porque alguém publica em um periódico acadêmico que faz ele menos “ético”(cabe ao juízo de valor de cada um) de quem faz isso internamente através de relatórios.

Nota Pessoal: Como ‘insider’ do mundo do crédito, produtos bancários não padronizados, e localização eu recomendo que a paranóia nada ajuda nestes casos. Hoje com um CEP preenchido em algum formulário para se ganhar um desconto em alguma coisa e o CPF qualquer pessoa pode ser localizada no Brasil; e as empresas de cartão de crédito sabem muito sobre nós todos.

Privacidade hoje só existe em dois lugares: Mídias não estruturadas  (e.g. cadernos, post it, anotações espúrias, etc); ou para terroristas e demais membros de organizações criminosas que não possuem nenhum traço no meio digital e só realizam transações off-market (e.g. contrabando, tráfico de drogas, fluxo de armas para terroristas, etc.) .

# Michael Jordan (Não o do basquete) fala sobre alguns tópicos em Aprendizado de Máquina e sobre Big Data

Abaixo está o depoimento mais sensato sobre alguns assuntos relativos à análise de dados, Data Mining, e principalmente Big Data.

UPDATE: O próprio MJordan deu uma entrevista dizendo que em alguns pontos foi mal interpretado. No entanto, cabe ressaltar que muito do que é importante na fala ele não falou nada a respeito; então tirem as suas conclusões.

Para quem não sabe, o Michael Jordan (IEEE) é uma das maiores autoridades no que diz respeito em aprendizado de máquina no mundo acadêmico.

Sobre a parte de Big Data em especial, esses comentários convidam à uma reflexão, e acima de tudo colocam pontos que merecem ser discutidos sobre esse fenômeno.

Obviamente empresas do calibre da Google, Amazon, Yahoo, e alguns projetos como Genoma podem ter benefício de grandes volumes de dados. O problema principal é que todo essa hipsterização em torno do Big Data parece muito mais algo orientado ao marketing do que a resolução de questões de negócio pertinentes.

Seguem alguns trechos importantes:

Sobre Deep Learning, simplificações e afins…

IEEE Spectrum: I infer from your writing that you believe there’s a lot of misinformation out there about deep learning, big data, computer vision, and the like.

Michael Jordan: Well, on all academic topics there is a lot of misinformation. The media is trying to do its best to find topics that people are going to read about. Sometimes those go beyond where the achievements actually are. Specifically on the topic of deep learning, it’s largely a rebranding of neural networks, which go back to the 1980s. They actually go back to the 1960s; it seems like every 20 years there is a new wave that involves them. In the current wave, the main success story is the convolutional neural network, but that idea was already present in the previous wave. And one of the problems with both the previous wave, that has unfortunately persisted in the current wave, is that people continue to infer that something involving neuroscience is behind it, and that deep learning is taking advantage of an understanding of how the brain processes information, learns, makes decisions, or copes with large amounts of data. And that is just patently false.

Spectrum: It’s always been my impression that when people in computer science describe how the brain works, they are making horribly reductionist statements that you would never hear from neuroscientists. You called these “cartoon models” of the brain.

Michael Jordan: I wouldn’t want to put labels on people and say that all computer scientists work one way, or all neuroscientists work another way. But it’s true that with neuroscience, it’s going to require decades or even hundreds of years to understand the deep principles. There is progress at the very lowest levels of neuroscience. But for issues of higher cognition—how we perceive, how we remember, how we act—we have no idea how neurons are storing information, how they are computing, what the rules are, what the algorithms are, what the representations are, and the like. So we are not yet in an era in which we can be using an understanding of the brain to guide us in the construction of intelligent systems.

Sobre Big Data

Spectrum: If we could turn now to the subject of big data, a theme that runs through your remarks is that there is a certain fool’s gold element to our current obsession with it. For example, you’ve predicted that society is about to experience an epidemic of false positives coming out of big-data projects.

Michael Jordan: When you have large amounts of data, your appetite for hypotheses tends to get even larger. And if it’s growing faster than the statistical strength of the data, then many of your inferences are likely to be false. They are likely to be white noise.

Spectrum: How so?

Michael Jordan: In a classical database, you have maybe a few thousand people in them. You can think of those as the rows of the database. And the columns would be the features of those people: their age, height, weight, income, et cetera.

Now, the number of combinations of these columns grows exponentially with the number of columns. So if you have many, many columns—and we do in modern databases—you’ll get up into millions and millions of attributes for each person.

Now, if I start allowing myself to look at all of the combinations of these features—if you live in Beijing, and you ride bike to work, and you work in a certain job, and are a certain age—what’s the probability you will have a certain disease or you will like my advertisement? Now I’m getting combinations of millions of attributes, and the number of such combinations is exponential; it gets to be the size of the number of atoms in the universe.

Those are the hypotheses that I’m willing to consider. And for any particular database, I will find some combination of columns that will predict perfectly any outcome, just by chance alone. If I just look at all the people who have a heart attack and compare them to all the people that don’t have a heart attack, and I’m looking for combinations of the columns that predict heart attacks, I will find all kinds of spurious combinations of columns, because there are huge numbers of them.

So it’s like having billions of monkeys typing. One of them will write Shakespeare.

Spectrum:Do you think this aspect of big data is currently underappreciated?

Michael Jordan: Definitely.

Spectrum: What are some of the things that people are promising for big data that you don’t think they will be able to deliver?

Michael Jordan: I think data analysis can deliver inferences at certain levels of quality. But we have to be clear about what levels of quality. We have to have error bars around all our predictions. That is something that’s missing in much of the current machine learning literature.

Spectrum: What will happen if people working with data don’t heed your advice?

Michael Jordan: I like to use the analogy of building bridges. If I have no principles, and I build thousands of bridges without any actual science, lots of them will fall down, and great disasters will occur.

Similarly here, if people use data and inferences they can make with the data without any concern about error bars, about heterogeneity, about noisy data, about the sampling pattern, about all the kinds of things that you have to be serious about if you’re an engineer and a statistician—then you will make lots of predictions, and there’s a good chance that you will occasionally solve some real interesting problems. But you will occasionally have some disastrously bad decisions. And you won’t know the difference a priori. You will just produce these outputs and hope for the best.

And so that’s where we are currently. A lot of people are building things hoping that they work, and sometimes they will. And in some sense, there’s nothing wrong with that; it’s exploratory. But society as a whole can’t tolerate that; we can’t just hope that these things work. Eventually, we have to give real guarantees. Civil engineers eventually learned to build bridges that were guaranteed to stand up. So with big data, it will take decades, I suspect, to get a real engineering approach, so that you can say with some assurance that you are giving out reasonable answers and are quantifying the likelihood of errors.

Spectrum: Do we currently have the tools to provide those error bars?

Michael Jordan: We are just getting this engineering science assembled. We have many ideas that come from hundreds of years of statistics and computer science. And we’re working on putting them together, making them scalable. A lot of the ideas for controlling what are called familywise errors, where I have many hypotheses and want to know my error rate, have emerged over the last 30 years. But many of them haven’t been studied computationally. It’s hard mathematics and engineering to work all this out, and it will take time.

It’s not a year or two. It will take decades to get right. We are still learning how to do big data well.

Spectrum: When you read about big data and health care, every third story seems to be about all the amazing clinical insights we’ll get almost automatically, merely by collecting data from everyone, especially in the cloud.

Michael Jordan: You can’t be completely a skeptic or completely an optimist about this. It is somewhere in the middle. But if you list all the hypotheses that come out of some analysis of data, some fraction of them will be useful. You just won’t know which fraction. So if you just grab a few of them—say, if you eat oat bran you won’t have stomach cancer or something, because the data seem to suggest that—there’s some chance you will get lucky. The data will provide some support.

But unless you’re actually doing the full-scale engineering statistical analysis to provide some error bars and quantify the errors, it’s gambling. It’s better than just gambling without data. That’s pure roulette. This is kind of partial roulette.

Spectrum: What adverse consequences might await the big-data field if we remain on the trajectory you’re describing?

Michael Jordan: The main one will be a “big-data winter.” After a bubble, when people invested and a lot of companies overpromised without providing serious analysis, it will bust. And soon, in a two- to five-year span, people will say, “The whole big-data thing came and went. It died. It was wrong.” I am predicting that. It’s what happens in these cycles when there is too much hype, i.e., assertions not based on an understanding of what the real problems are or on an understanding that solving the problems will take decades, that we will make steady progress but that we haven’t had a major leap in technical progress. And then there will be a period during which it will be very hard to get resources to do data analysis. The field will continue to go forward, because it’s real, and it’s needed. But the backlash will hurt a large number of important projects.

# Para os cientistas de Big-Data “Trabalho de Limpeza dos Dados” é principal obstáculo para Insights

Essa matéria do NYT fala a respeito do principal gargalo que os “analistas de Big Data” enfrentam que é a parte de limpeza dos dados.

Quem está acompanhando, estudando ou mesmo comentando a mais de 2 anos sobre as áreas de Mineração de Dados, Machine Learning, e KDD sabe que o trabalho de tratamento dos dados representa 80% de todo esforço em análise de dados.

Por tanto, falando em termos computacionais, aplicar 80% de esforço em uma tarefa não é um problema mas sim uma característica tratando-se de processos sérios de KDD.

Com o fenômeno do “Big Data” muitos dos “analistas de dados” esqueceram-se de que uma das partes mais significativas de todo trabalho de análise está por trás do fato que gerou aquela informação, e não a análise por sí só. Isto é, quem compreende a estrutura e a conceitualização na qual aquele aquela informação é criada e posteriormente persistida, tem por definição lógica mais conhecimento sobre o dado do que quem apenas está fazendo o quarteto Treino – Cross Validation – Teste – Validação.

Realizando um exercício de alegoria, se realizássemos uma transposição de Big Data para Big Food  com os mesmos 3V (Volume, Velocidade, e Variedade), seria algo como falássemos somente sobre as características nutricionais dos alimentos (quantidade proteínas, carboidratos, gorduras) com todo o academicismo para passarmos uma ilusão de erudição; mas esquecendo que essas concentrações estão estritamente relacionadas a forma de criação/plantio desses insumos (e.g. esteroides para bovinos e aves, modificações genéticas para as sementes, etc.) o que obviamente pode indicar que a métrica final de análise (no caso as informações nutricionais) não passam de uma ilusão.

Para saber mais sobre o porque o Big Data está criando analistas iludidos (como alguns do NYT) leiam essas referências aqui, aqui, aqui, aqui, aqui, aqui, e finalmente aqui.

# Big Data e a Copa do Mundo

Diretamente do KDNuggets

Técnico: “Lembrem-se, outro time está contando com insights provenientes de Big Data baseados nos jogos anteriores. Então, chutem a bola com o outro pé.”

# 10 coisas que a estatística pode nos ensinar sobre Big Data

De tempos em tempos vemos vendedores de software tentando empurrar ‘novidades’ como Big Data, Map Reduce, Processamento Distribuído, etc. Isso é muito bom no sentido de marketing e propaganda, mas dentro do aspecto técnico todos que trabalham com análise de dados devem no mínimo conhecer o básico, e este básico se chama estatística.

Entendam uma coisa, Big Data hoje nada mais é do que um jargão de marketing utilizado por todos os players do mercado para causar frisson em gerentes de tecnologia da informação, diretores, coordenadores, gerentes entre outros.

O que mudou foi que a Lei de Moore que se aplicava à capacidade de processamento (transistores nos chips) e que muitos acreditavam se também aplicava-se ao armazenamento simplesmente provou-se errada. Em outras palavras, descobrimos que podemos armazenar muito mais informação, a um custo extremamente baixo do que fazíamos a 40 anos atrás.

Veja no gráfico abaixo o que o mesmo Jeff Leek considera como a ‘revolução do big data’.

Se isso aumentou a disponibilidade dos dados para a análise, por outro lado muito por culpa da ciência da computação que (na minha visão pessoal de momento) prostituiu a estatística com o advento dos algoritmos muitos cientistas da computação, bacharéis em Sistemas de Informação, entre outros que por ventura passaram a realizar análise de dados acharam que poderiam subestimar a estatística que está a muito tempo ajudando cientistas do mundo inteiro.

Um pequeno aforismo que eu tenho sobre essa questão é “não dá para pensar em Big Data, quando ainda não aprendemos os postulados sobre amostragem que a estatística nos oferece”.** Simples assim.

Com isso, seguem as 10 coisas que a estatística pode ajudar o Big Data elencadas pelo Jeff Leek:

1) If the goal is prediction accuracy, average many prediction models together
2) When testing many hypotheses, correct for multiple testing
3) When you have data measured over space, distance, or time, you should smooth
4) Before you analyze your data with computers, be sure to plot it
5) Interactive analysis is the best way to really figure out what is going on in a data set
6) Know what your real sample size is
7) Unless you ran a randomized trial, potential confounders should keep you up at night
8) Define a metric for success up front
9) Make your code and data available and have smart people check it
10) Problem first not solution backward

**Assim que eu finalizar algumas leituras importantes sobre o assunto vou falar mais um pouco dessa besteira de big data que estão vendendo, e algumas alternativas a respeito disso.

# Porque o fenômeno do Big Data está envolvido em Problemas? Eles esqueceram estatística aplicada

O Jeff Leek neste post coloca um ponto de vista bem relevante no que tange a análise de dados.

Em tempos em que vendedores de software de Business Intelligence, ou mesmo vendedores deSistemas Gerenciadores de Banco de Dados tentam seduzir gerentes, diretores, e tomadores de decisão de que precisamos de mais dados; este post simplesmente diz: “Não, aprendam estatística antes!

One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.

The prime example in the press is Google Flu trends. Google Flu trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google Search Terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process have led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive and tried to understand what the likely reason that Google Flu trends was working.

As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.

Similarly, two economists Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model were questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statisticians would have performed.

No final o autor faz uma pergunta que eu acho extremamente relevante: ” When thinking about the big data era, what are some statistical ideas we’ve already figured out?”

Eu tenho algumas:

1) Determinação de tamanho de amostra para criação de modelos usando tamanho de população conhecida ou desconhecida;

2) Design de Experimentos

# Bocas grandes sobre Big Data

Neste post o Stephen Few (aos moldes do que vem fazendo o Nassim Taleb) vai desmascarando a grande falácia que é o Big Data nos dias atuais.

Esse trecho é simplesmente destruidor:

Dr. Hidalgo,

Your response regarding the definition of Big Data demonstrates the problem that I’m trying to expose: Big Data has not been defined in a manner that lends itself to intelligent discussion. Your definition does not at all represent a generally accepted definition of Big Data. It is possible that the naysayers with whom you disagree define Big Data differently than you do. I’ve observed a great many false promises and much wasted effort in the name of Big Data. Unless you’re involved with a broad audience of people who work with data in organizations of all sorts (not just academia), you might not be aware of some of the problems that exist with Big Data.

Your working definition of Big Data is somewhat similar to the popular definition involving the 3 Vs (volume, velocity, and variety) that is often cited. The problem with the 3 Vs and your “size, resolution, and scope” definition is that they define Big Data in a way that could be applied to the data that I worked with when I began my career 30 years ago. Back then I routinely worked with data that was big in size (a.k.a., volume), detailed in resolution, and useful for purposes other than that for which it was originally generated. By defining Big Data as you have, you are supporting the case that I’ve been making for years that Big Data has always existed and therefore doesn’t deserve a new name.

I don’t agree that the term Big Data emerged as a “way to refer to digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected.” What you’ve described has been going on for many years. In the past we called it data, with no need for the new term “Big Data.” What I’ve observed is that the term Big Data emerged as a marketing campaign by technology vendors and those who support them (e.g., large analyst firms such as Gartner) to promote sales. Every few years vendors come up with a new name for the same thing. Thirty years ago, we called it decision support. Not long after that we called it data warehousing. Later, the term business intelligence came into vogue. Since then we’ve been subjected to marketing campaigns associated with analytics and data science. These campaigns keep organizations chasing the latest technologies, believing that they’re new and necessary, which is rarely the case. All the while, they never slow down long enough to develop the basic skills of data sensemaking.

When you talk about data visualization, you’re venturing into territory that I know well. It is definitely not true that data visualization has “progressed enormously during recent years.” As a leading practitioner in the field, I am painfully aware that progress in data visualization has been slow and, in actual practice, is taking two steps backwards, repeating past mistakes, for every useful step forwards.

What various people and organizations value from data certainly differs, as you’ve said. The question that I asked, however, is whether or not the means of gleaning value from data, regardless of what we deem valuable, are significantly different from the past. I believe that the answer is “No.” While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.

# Porque você não precisa do Hadoop?

Aqui estão algumas das respostas.

# Towards OLAP in Graph Databases

Direto do Another Word for It:

Towards OLAP in Graph Databases (MSc. Thesis) by Michal Bachman.

Abstract:

Graph databases are becoming increasingly popular as an alternative to relational databases for managing complex, densely-connected, semi-structured data. Whilst primarily optimised for online transactional processing, graph databases would greatly benefit from online analytical processing capabilities. Since relational databases were introduced over four decades ago, they have acquired online analytical processing facilities; this is not the case with graph databases, which have only drawn mainstream attention in the past few years.

In this project, we study the problem of online analytical processing in graph databases that use the property graph data model, which is a graph with properties attached to both vertices and edges. We use vertex degree analysis as a simple example problem, create a formal definition of vertex degree in a property graph, and develop a theoretical vertex degree cache with constant space and read time complexity, enabled by a cache compaction operation and a property change frequency heuristic.

We then apply the theory to Neo4j, an open-source property graph database, by developing a Relationship Count Module, which implements the theoretical vertex degree caching. We also design and implement a framework, called GraphAware, which provides supporting functionality for the module and serves as a platform for additional development, particularly of modules that store and maintain graph metadata.

Finally, we show that for certain use cases, for example those in which vertices have relatively high degrees and edges are created in separate transactions, vertex degree analysis can be performed several orders of magnitude faster, whilst sacrificing less than 20% of the write throughput, when using GraphAware Framework with the Relationship Count Module.

By demonstrating the extent of possible performance improvements, exposing the true complexity of a seemingly simple problem, and providing a starting point for future analysis and module development, we take an important step towards online analytical processing in graph databases.

The MSc. thesis: GraphAware: Towards Online Analytical Processing in Graph Databases.

Framework at Github: GraphAware Neo4j Framework.

Michal laments:

It’s not an easy, cover-to-cover read, but there might be some interesting parts, even if you don’t go through all the (over 100) pages.

It’s one hundred and forty-nine pages according to my PDF viewer.

I don’t think Michal needs to worry. If anyone thinks it is too long to read, it’s their loss.

Definitely going on my short list of things to read in detail sooner rather than later.

# Palantir e a simbiose entre Governo e Empresas

Esse artigo do Andy Greenberg o autor faz um panorama muito útil sobre a história da Palantir e a simbiose  governo americano em relação aos casos de vazamento de informações sobre o maior programa de Data Gathering em curso da história da humanidade.

# A Era do Big Data – The Age Of Big Data

Um documentário sensacional sobre a era do Big Data que apresenta como em alguns casos está ajudando no suporte à decisão de corporações e até mesmo à polícia.