Nassim Taleb has summarized very well the error about the error, and why we should not take error measures too seriously, especially in models that mimic part of a limited reality.
An error rate can be measured. The measurement, in turn, will have an error rate. The measurement of the error rate will have an error rate. The measurement of the error rate will have an error rate. We can use the same argument by replacing “measurement” by “estimation” (say estimating the future value of an economic variable, the rainfall in Brazil, or the risk of a nuclear accident). What is called a regress argument by philosophers can be used to put some scrutiny on quantitative methods of risk and probability. The mere existence of such regress argument will lead to two different regimes, both leading to the necessity to raise the values of small probabilities, and one of them to the necessity to use power law distributions.
An image that is worth more than a thousand words:
After reading this article, it becomes clearer that data models should be tested, whenever possible, against samples kept separate from both the training and test datasets (holdout).
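The holdout idea is easy to sketch in a few lines. Below is a minimal, illustrative three-way split in plain Python; the 60/20/20 proportions and the `three_way_split` helper are my own assumptions, not something prescribed by the article:

```python
import random

def three_way_split(data, holdout_frac=0.2, test_frac=0.2, seed=42):
    """Shuffle and split data into train / test / holdout partitions.

    The fractions are illustrative; the only point the article insists on
    is that the holdout sample stay separate from training AND test sets.
    """
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_holdout = int(len(data) * holdout_frac)
    n_test = int(len(data) * test_frac)
    holdout = [data[i] for i in idx[:n_holdout]]
    test = [data[i] for i in idx[n_holdout:n_holdout + n_test]]
    train = [data[i] for i in idx[n_holdout + n_test:]]
    return train, test, holdout

train, test, holdout = three_way_split(list(range(50)))
print(len(train), len(test), len(holdout))  # 30 10 10
```

The holdout partition is touched only once, after model selection on train/test is finished, so it gives an honest estimate of generalization error.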
Straight from Dave Giles's blog:
“My name is Jan H. Höffler, I have been working on a replication project funded by the Institute for New Economic Thinking during the last two years and found your blog that I find very interesting. I like very much that you link to data and code related to what you write about. I thought you might be interested in the following:
We developed a wiki website that serves as a database of empirical studies, the availability of replication material for them and of replication studies: http://replication.uni-goettingen.de
It can help for research as well as for teaching replication to students. We taught seminars at several faculties internationally – also in Canada, at UofT – for which the information of this database was used. In the starting phase the focus was on some leading journals in economics, and we now cover more than 1800 empirical studies and 142 replications. Replication results can be published as replication working papers of the University of Göttingen’s Center for Statistics.
Teaching and providing access to information will raise awareness for the need for replications, provide a basis for research about the reasons why replications so often fail and how this can be changed, and educate future generations of economists about how to make research replicable.
I would be very grateful if you could take a look at our website, give us feedback, register and vote which studies should be replicated – votes are anonymous. If you could also help us to spread the message about this project, this would be most appreciated.”
Stephen Few gives a masterful explanation:
But why is it that we must sometimes use graphical displays to perform these tasks rather than other forms of representation? Why not always express values as numbers in tables? Why express them visually rather than audibly? Essentially, there is only one good reason to express quantitative data visually: some features of quantitative data can be best perceived and understood, and some quantitative tasks can be best performed, when values are displayed graphically. This is so because of the ways our brains work. Vision is by far our dominant sense. We have evolved to perform many data sensing and processing tasks visually. This has been so since the days of our earliest ancestors who survived and learned to thrive on the African savannah. What visual perception evolved to do especially well, it can do faster and better than the conscious thinking parts of our brains. Data exploration, sensemaking, and communication should always involve an intimate collaboration between seeing and thinking (i.e., visual thinking).
Below, he presents a table of data visualization tasks and goals.
In this post, Jeff Leek makes a very relevant point about data analysis.
At a time when vendors of Business Intelligence software, and even vendors of Database Management Systems, try to seduce managers, directors, and decision makers into believing that we need more data, this post simply says: “No, learn statistics first!”
One reason is that when you actually take the time to do an analysis right, with careful attention to all the sources of variation in the data, it is almost a law that you will have to make smaller claims than you could if you just shoved your data in a machine learning algorithm and reported whatever came out the other side.
The prime example in the press is Google Flu trends. Google Flu trends was originally developed as a machine learning algorithm for predicting the number of flu cases based on Google Search Terms. While the underlying data management and machine learning algorithms were correct, a misunderstanding about the uncertainties in the data collection and modeling process has led to highly inaccurate estimates over time. A statistician would have thought carefully about the sampling process, identified time series components to the spatial trend, investigated why the search terms were predictive and tried to understand the likely reason that Google Flu trends was working.
As we have seen, lack of expertise in statistics has led to fundamental errors in both genomic science and economics. In the first case a team of scientists led by Anil Potti created an algorithm for predicting the response to chemotherapy. This solution was widely praised in both the scientific and popular press. Unfortunately the researchers did not correctly account for all the sources of variation in the data set and had misapplied statistical methods and ignored major data integrity problems. The lead author and the editors who handled this paper didn’t have the necessary statistical expertise, which led to major consequences and cancelled clinical trials.
Similarly, two economists, Reinhart and Rogoff, published a paper claiming that GDP growth was slowed by high governmental debt. Later it was discovered that there was an error in an Excel spreadsheet they used to perform the analysis. But more importantly, the choice of weights they used in their regression model was questioned as being unrealistic and leading to dramatically different conclusions than the authors espoused publicly. The primary failing was a lack of sensitivity analysis to data analytic assumptions that any well-trained applied statistician would have performed.
At the end, the author asks a question that I find extremely relevant: “When thinking about the big data era, what are some statistical ideas we’ve already figured out?”
I have a few:
1) Sample size determination for building models, with known or unknown population size;
2) Design of Experiments
3) Exploratory Data Analysis
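Item 1 can be made concrete with Cochran's classical sample-size formula for estimating a proportion, \(n_0 = z^2 p (1-p) / e^2\), plus the finite-population correction when the population size is known. The helper below is an illustrative sketch (names and defaults are my own):

```python
import math

def cochran_sample_size(z=1.96, p=0.5, e=0.05, population=None):
    """Cochran's sample-size formula for estimating a proportion.

    z: z-score for the desired confidence level (1.96 ~ 95%)
    p: expected proportion (0.5 is the most conservative choice)
    e: desired margin of error
    population: if given, apply the finite-population correction;
                if None, assume an unknown (effectively infinite) population.
    """
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

print(cochran_sample_size())                 # unknown population: 385
print(cochran_sample_size(population=2000))  # known population of 2000: 323
```

Note how a known, finite population shrinks the required sample, which is exactly why the known/unknown distinction in item 1 matters.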
1. Study a Machine Learning Tool
2. Study a Machine Learning Dataset
3. Study a Machine Learning Algorithm
4. Implement a Machine Learning Algorithm
An excerpt: “[…]To generalize, a model that overfits its training set has low bias but high variance – it predicts the targets in the training set very accurately, but any slight changes to the predictors would result in vastly different predictions for the targets.
Overfitting differs from multicollinearity, which I will explain in later post. Overfitting has irrelevant predictors, whereas multicollinearity has redundant predictors.
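The low-bias/high-variance behavior in the excerpt can be reproduced in a few lines with NumPy: a degree-9 polynomial interpolates 10 noisy training points almost exactly, yet generalizes far worse than a straight line on fresh points from the same process. This is an illustrative sketch of the phenomenon, not code from the quoted post:

```python
import numpy as np

rng = np.random.default_rng(0)
# True relationship is linear (y = 2x) plus noise.
x_train = np.linspace(0, 1, 10)
y_train = 2 * x_train + rng.normal(scale=0.1, size=x_train.size)
# Fresh points from the same process, falling between the training points:
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2 * x_test + rng.normal(scale=0.1, size=x_test.size)

results = {}
for degree in (1, 9):
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree={degree}: train MSE={train_mse:.6f}, test MSE={test_mse:.6f}")
```

The degree-9 fit drives training error essentially to zero (it is chasing the irrelevant noise "predictors" the excerpt mentions), while its error on new data is dominated by the wild swings between training points.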
Description: Data mining techniques have been gaining notable ground in academia as well as in industry, thanks to their efficiency in discovering knowledge in large databases.
With the Big Data phenomenon, mastering data mining techniques and tools has become mandatory for every organization, not only to understand the knowledge of the past, but also to shape strategies and support decision making at the strategic/tactical level.
This course is aimed at information technology professionals (coordinators, data analysts, DBAs, directors, technicians), as well as teachers and students seeking both practical experience and the theoretical background needed to implement data mining projects and to devise strategies through data analysis using data mining techniques and algorithms.
Course load: 24 hours
Target audience: Information Technology professionals, Systems Analysts, Data Analysts, Managers, Coordinators, and Developers in the business, administration, and data warehousing / business intelligence areas;
Other professionals, from any field of knowledge, who want a hands-on approach to data mining.
2. Overview of KDD and Data Mining
• Knowledge Discovery in Databases (KDD)
• Machine Learning
• Exploratory Data Analysis
3. Introduction to the WEKA Tool
4. Introduction to the CRISP-DM Methodology
• Modeling alternatives: CRISP-DM vs. Agile vs. PMI and the like;
5. Data Preprocessing
6. Data Mining Techniques
• Association Rules
7. Model Evaluation and Validation
• Training and Testing
• Model Evaluation
8. Data Mining Use Cases
• Medical Data Mining (Classification)
• Predicting Stock Behavior on the Stock Exchange (Linear Regression)
• Credit Scoring (Association Rules, Clustering)
• Retail Cluster Analysis (Clustering)
• Formulating Marketing Strategies (Association Rules, Classification)
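Several of the syllabus topics are easy to preview in code. The sketch below mines one-to-one association rules (support and confidence) from toy market-basket transactions in plain Python; the data, thresholds, and output format are illustrative stand-ins for what a tool like WEKA's Apriori implementation would produce:

```python
from itertools import combinations
from collections import Counter

# Toy market-basket transactions (illustrative data, not from the course).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk", "coffee"},
    {"bread", "milk", "coffee"},
]

n = len(transactions)
item_count = Counter()
pair_count = Counter()
for t in transactions:
    item_count.update(t)                      # count each item once per basket
    pair_count.update(combinations(sorted(t), 2))  # count co-occurring pairs

# A rule A -> B has support = P(A and B) and confidence = P(B | A).
rules = []
for (a, b), joint in pair_count.items():
    for lhs, rhs in ((a, b), (b, a)):
        support = joint / n
        confidence = joint / item_count[lhs]
        if support >= 0.4 and confidence >= 0.7:
            rules.append((lhs, rhs, support, confidence))
            print(f"{lhs} -> {rhs} (support={support:.2f}, confidence={confidence:.2f})")
```

With these thresholds the toy data yields rules such as `butter -> bread` with confidence 1.0: every basket containing butter also contains bread, which is the kind of actionable pattern the marketing-strategy use case relies on.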
This site has one of the best guides to working with data from a programming/customization standpoint.
For those who know Python, it is a great choice.
In this presentation, Stanley Young talks a bit about issues related to scientific reproducibility.
In this post, Stephen Few (much as Nassim Taleb has been doing) dismantles the great fallacy that Big Data has become these days.
This passage is simply devastating:
Your response regarding the definition of Big Data demonstrates the problem that I’m trying to expose: Big Data has not been defined in a manner that lends itself to intelligent discussion. Your definition does not at all represent a generally accepted definition of Big Data. It is possible that the naysayers with whom you disagree define Big Data differently than you do. I’ve observed a great many false promises and much wasted effort in the name of Big Data. Unless you’re involved with a broad audience of people who work with data in organizations of all sorts (not just academia), you might not be aware of some of the problems that exist with Big Data.
Your working definition of Big Data is somewhat similar to the popular definition involving the 3 Vs (volume, velocity, and variety) that is often cited. The problem with the 3 Vs and your “size, resolution, and scope” definition is that they define Big Data in a way that could be applied to the data that I worked with when I began my career 30 years ago. Back then I routinely worked with data that was big in size (a.k.a., volume), detailed in resolution, and useful for purposes other than that for which it was originally generated. By defining Big Data as you have, you are supporting the case that I’ve been making for years that Big Data has always existed and therefore doesn’t deserve a new name.
I don’t agree that the term Big Data emerged as a “way to refer to digital traces of human activity that were collected for operational purposes by service providers serving large populations, and that could be used for purposes that were beyond those for which the data was originally collected.” What you’ve described has been going on for many years. In the past we called it data, with no need for the new term “Big Data.” What I’ve observed is that the term Big Data emerged as a marketing campaign by technology vendors and those who support them (e.g., large analyst firms such as Gartner) to promote sales. Every few years vendors come up with a new name for the same thing. Thirty years ago, we called it decision support. Not long after that we called it data warehousing. Later, the term business intelligence came into vogue. Since then we’ve been subjected to marketing campaigns associated with analytics and data science. These campaigns keep organizations chasing the latest technologies, believing that they’re new and necessary, which is rarely the case. All the while, they never slow down long enough to develop the basic skills of data sensemaking.
When you talk about data visualization, you’re venturing into territory that I know well. It is definitely not true that data visualization has “progressed enormously during recent years.” As a leading practitioner in the field, I am painfully aware that progress in data visualization has been slow and, in actual practice, is taking two steps backwards, repeating past mistakes, for every useful step forwards.
What various people and organizations value from data certainly differs, as you’ve said. The question that I asked, however, is whether or not the means of gleaning value from data, regardless of what we deem valuable, are significantly different from the past. I believe that the answer is “No.” While it is true that we are always making gradual progress in the development of analytical techniques and technologies, what we do today is largely the same as what we did when I first began my work in the field 30 years ago. Little has changed, and what has changed is an extension of the past, not a revolutionary or qualitative departure.