Lack of transparency is the bottleneck in academia

One of my biggest mistakes was building my entire master's dissertation on private data (provided by my former employer) using closed tools (e.g. Viscovery Mine).

This was a huge blocker: I could not share my research with everyone in the community or get a second opinion about my work with regard to reproducibility. I am working on opening my data and making a new version, or a book, about this kind of analysis using Non-Performing Loans data.
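As a minimal sketch of the kind of scaffolding that makes an analysis reproducible, the snippet below checksums the input data so readers can verify they are running on the same file, and pins the random seed so a resampling procedure gives identical numbers on every run. The file names and the toy bootstrap are hypothetical illustrations, not the actual dissertation code:

```python
# Minimal reproducibility scaffold: pin the inputs and the randomness
# so anyone re-running the analysis gets the same numbers.
import hashlib
import random


def sha256_of(path):
    """Checksum the raw data file so readers can verify they have the same input."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()


def run_analysis(values, seed=42):
    """A stand-in analysis: a bootstrap estimate of the mean with a fixed seed."""
    rng = random.Random(seed)  # fixed seed -> identical resamples every run
    resamples = [
        sum(rng.choices(values, k=len(values))) / len(values)
        for _ in range(1000)
    ]
    return sum(resamples) / len(resamples)


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0]  # placeholder for the real loan data
    print(sha256_of.__doc__)
    print(run_analysis(data))
```

Publishing the checksum and the seed alongside the data is a small habit, but it is exactly the guarantee Roger Peng talks about: the study may still be wrong, but anyone can re-run it and get the same result.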

On his blog, Denny talks about how engineering is the bottleneck in Deep Learning research, making the following statements:

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what others have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit of novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering.” 1. Waste of research time and 2. Lack of rigor and reproducibility. Let’s look at each of them.

And the final musing:

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Personally, I think this approach of discarding the results and focusing on the novelty of the methods is better than trying to make sense of results that the researcher may be covering up with academic BS complexity.





The real reason reproducibility matters

Explained here by Roger Peng.

Basically, he used the case of Piketty's book and made a statement that, I think, says everything about what reproducibility actually is:

Many people seem to conflate the ideas of reproducible and correctness, but they are not the same thing. One must always remember that a study can be reproducible and still be wrong. By “wrong”, I mean that the conclusion or claim can be wrong. If I claim that X causes Y (think “sugar causes cancer”), my data analysis might be reproducible, but my claim might ultimately be incorrect for a variety of reasons. If my claim has any value, then others will attempt to replicate it and the correctness of the claim will be determined by whether others come to similar conclusions.

Then why is reproducibility so important? Reproducibility is important because it is the only thing that an investigator can guarantee about a study.

My two cents on the Piketty book case: despite everything that has been said, the rebuttals (especially from the Financial Times and the Wall Street Journal), and the conflicts between schools of economics, the quality of the work in terms of reproducibility is enviable. It is undeniable that methodological errors can happen, and no researcher is immune to them (especially here in Brazil, where an institution like IPEA gets basic aspects of research wrong, such as data manipulation in Excel). Still, the study and the discussion around what the book put on the agenda serve as a compass for anyone who wants to navigate academia.


Reproducibility of Studies

An announcement about why we should strive to make papers and work routines reproducible.

Straight from Dave Giles's site:

“My name is Jan H. Höffler, I have been working on a replication project funded by the Institute for New Economic Thinking during the last two years and found your blog that I find very interesting. I like very much that you link to data and code related to what you write about. I thought you might be interested in the following:

We developed a wiki website that serves as a database of empirical studies, the availability of replication material for them and of replication studies:

It can help for research as well as for teaching replication to students. We taught seminars at several faculties internationally – also in Canada, at UofT – for which the information of this database was used. In the starting phase the focus was on some leading journals in economics, and we now cover more than 1800 empirical studies and 142 replications. Replication results can be published as replication working papers of the University of Göttingen’s Center for Statistics.

Teaching and providing access to information will raise awareness for the need for replications, provide a basis for research about the reasons why replications so often fail and how this can be changed, and educate future generations of economists about how to make research replicable.

I would be very grateful if you could take a look at our website, give us feedback, register and vote which studies should be replicated – votes are anonymous. If you could also help us to spread the message about this project, this would be most appreciated.”


Is Data Mining forbidden to fail?

Well, it seems so. At least according to Nature.

For those who don't know what happened: some researchers analyzed Google Flu Trends and found problems with the model.

The results are in the articles below:

Nature News – When Google got flu wrong

The Parable of Google Flu: Traps in Big Data Analysis 
In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

The Mystery of the Exploding Tongue

Why Google Flu Trends Will Not Replace the CDC Anytime Soon

Toward a more useful definition of Big Data


If anyone wants to know how Nature's ('brilliant') peer-review system works (like that of many journals), Sydney Brenner talks a bit about the subject.


More on Reproducibility: Some sites

Here on the blog I have been covering some topics related to reproducibility because, in my view, a large share of the scientific results in applied fields sound more like exercises in fiction than actual science.

Below are some resources for those who want to learn a bit more about reproducibility initiatives in Computer Science.

Shriram Krishnamurthi's page (winner of the 2012 Robin Milner Young Researcher Award)

Post on Embedded in Academia

Page at Brown University

Page at the University of Arizona
