Some useful tips on how to do a systematic review

This is a fairly old article that I wrote in 2013, but in light of recent events in my academic career I am posting it publicly to help anyone who sets out to do this task.

There are countless manuals on how to do a good Systematic Review, so here I will put together a collection of ideas that I copied rigorously from the authors in the references, along with what I did to keep my sanity during the process.

One of the main facts of our time is that we live in the information age, in which the volume of data and information generated grows almost exponentially every year.

This fact has a huge impact on academic research, specifically for researchers who want to understand what is being written amid the myriad of information generated every day.

This post is dedicated especially to:

  1. people who are defining their research project for a PhD or a master's degree
  2. people who are writing a scientific paper and would like to know what is being discussed in the literature
  3. people who are doing corporate research to determine practical courses of action in an R&D department

In my opinion, one of the most underrated academic techniques for solving this in research work is the Systematic Review.

Personally, I cannot imagine serious scientific research starting without this tool, and by the end of this article that point will be clearer.

But what is a systematic review? Here I borrow the quote from Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997):

Systematic reviews are scientific investigations in themselves, with pre-planned methods and an assembly of original studies as their “subjects.” They synthesize the results of multiple primary investigations by using strategies that limit bias and random error (9, 10). These strategies include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles for review. Primary research designs and study characteristics are appraised, data are synthesized, and results are interpreted.

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

My perception is that countless projects start out full of expectations and promises of originality, but much of the time they only reinvent the wheel on top of the work of other people, who did not get the credit. With a more careful systematic review, these projects would either receive fewer resources or not exist at all, and the resources could be allocated elsewhere with greater potential for relevance/return.

But if I had to summarize the importance of the systematic review in a few basic points, I would consider the following:

  1. The Systematic Review helps to understand the past of a topic within a field of science and its evolution over time;
  2. It presents the current state of the art to present-day researchers;
  3. It is an important tool for discovering gaps and limitations (e.g. methodological) in the current literature;
  4. It makes your research converse with the current literature from the day of publication;
  5. It shows ways of monitoring the literature by bringing in relevant sources; and, last but not least,
  6. It prevents researchers from reinventing the wheel, so that resources go to work with a greater degree of originality.

Here I agree with the statement by Webster and Watson (2002) that the Systematic Review ties concepts together over time and occasionally prepares for the future, in that the review understands the theory, the practice, and the ontological relationship of the field of study.

As Webster and Watson (2002) state, systematic or literature reviews cannot be a compilation of citations like a phone book, but rather an active exercise in analyzing the studies under review.

Systematic reviews can help practitioners keep abreast of the medical literature by summarizing large bodies of evidence and helping to explain differences among studies on the same question. A systematic review involves the application of scientific strategies, in ways that limit bias, to the assembly, critical appraisal, and synthesis of all relevant studies that address a specific clinical question. 

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

One point I want to stress at some point in the future is how systematized research should be a goal for any company that wants to incorporate data into decision-making processes and into the architecture of new corporate solutions. My argument is the same as that of Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997), which I quote below:

Review articles are one type of integrative publication; practice guidelines, economic evaluations, and clinical decision analyses are others. These other types of integrative articles often incorporate the results of systematic reviews. For example, practice guidelines are systematically developed statements intended to assist practitioners and patients with decisions about appropriate health care for specific clinical circumstances (11). Evidence-based practice guidelines are based on systematic reviews of the literature, appropriately adapted to local circumstances and values. Economic evaluations compare both the costs and the consequences of different courses of action; the knowledge of consequences that are considered in these evaluations is often generated by systematic reviews of primary studies. Decision analyses quantify both the likelihood and the valuation of the expected outcomes associated with competing alternatives. 

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

One fact about systematic reviews is that many journals, for reasons of space, usually limit the number of pages of a paper, and unfortunately the first section to be sacrificed is the literature review, because it does not get as much focus as the methodology, the results, or the conclusions.

Here I will gather some tips from 3 papers that I consider good references on systematic reviews. These are some notes from those papers, together with comments on what I do when I have to carry out a systematic review, for instance to see the state of the art of a research topic.

The foundation of this article lies in the ideas of Cook, Mulrow, & Haynes (1997); Webster and Watson (2002); and Brereton et al. (2007). This is only a non-exhaustive list of topics, and reading the originals is essential.

If I had to choose a framework for adopting systematic reviews, it would be this one from Brereton et al. (2007):

[Figure: systematic review framework from Brereton et al. (2007); image not available]

Prospective authors and topics

The aim is to take a position on progress and learning and to embark on new projects for developing new theoretical models (Webster and Watson, 2002). These reviews can cover a) a mature topic with a vast body of knowledge or b) an emerging topic that is developing at a faster pace.

Reviewing topics gives direction on the concepts and their evolution, and reviewing authors helps the work communicate with the major labs or with researchers who will contribute to the debate in the scientific field.

Writing a systematic review article

What is being searched for? What are the keywords? What is the expected contribution?

One of the most important points is to disclose the limitations of the review, such as:

  • Scope of the search (e.g. keywords used, article databases)
  • Time limit of the articles
  • Summary of past research, highlighting the gaps, proposals for how to narrow those gaps, and implications of the theory for practice

This disclosure signals that your work is taking a snapshot of the literature at that moment, and that it may not be perfect because of methodological biases or even factors exogenous to your research. For example, a database may change its article indexer, so that a query with the same parameters returns different results.
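One way to keep that snapshot honest is to log every search at the moment it is run. The sketch below is only an illustration of the idea, not something taken from the referenced papers: the class name, the fields, the database names, and the hit counts are all hypothetical.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class SearchRecord:
    """One executed search, stored so the snapshot can be disclosed later."""
    database: str   # hypothetical database name
    query: str      # exact query string as typed into the search tool
    year_from: int  # time limit of the articles (start)
    year_to: int    # time limit of the articles (end)
    run_on: str     # ISO date on which the query was executed
    hits: int       # number of results returned on that date


def disclosure(records):
    """Serialize the search log as JSON for the limitations section."""
    return json.dumps([asdict(r) for r in records], indent=2, sort_keys=True)


# Hypothetical entries; real ones would be filled in while searching.
records = [
    SearchRecord("Database A", '"systematic review" AND "software"',
                 2000, 2013, date(2013, 5, 1).isoformat(), 412),
    SearchRecord("Database B", '"systematic review" AND "software"',
                 2000, 2013, date(2013, 5, 2).isoformat(), 97),
]

print(disclosure(records))
```

If the same query later returns a different number of hits, the stored date and count make the change visible instead of leaving it as a silent discrepancy.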

Identifying the relevant literature

Here the systematic review focuses on the concept, no matter where those concepts are found.

This means that the focus is not only on:

  • The best journals
  • A few of the most productive authors
  • A few areas of knowledge
  • Matters of geographic coverage within a country

Choosing databases

Here the review becomes more of an art than a science, and here comes a very personal view: I particularly like to work with more than 5 research databases. I arrived at this number through some experimentation; it gives me a certain breadth relative to the articles indexed in the best journals, and it helps catch some good articles, and especially doctoral theses, that may be hidden on page 8 of some obscure keyword.

Another point about a database is understanding its selectivity. By selectivity I mean how well the search tool can bring me a sufficiently large number of relevant articles with the least noise relative to signal.
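A rough way to put a number on selectivity is the share of returned articles that survive screening, which is the usual definition of precision. The sketch below assumes you have those two counts per database; the database names and the counts are made up for illustration.

```python
def precision(relevant: int, returned: int) -> float:
    """Share of returned articles that were judged relevant after screening."""
    if returned == 0:
        return 0.0
    return relevant / returned


# Hypothetical screening counts: (articles returned, articles kept as relevant).
screening = {
    "Database A": (200, 30),
    "Database B": (50, 20),
    "Database C": (1000, 40),
}

# Rank databases from most to least selective.
ranked = sorted(
    screening,
    key=lambda db: precision(screening[db][1], screening[db][0]),
    reverse=True,
)

for db in ranked:
    returned, relevant = screening[db]
    print(f"{db}: {precision(relevant, returned):.2f}")
```

A database with low precision can still be worth keeping if it is the only one that surfaces certain theses, so this number is a guide and not a cutoff.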

And, as I could not fail to mention, it is always tempting to go only where we are most familiar, such as Google Scholar and Microsoft Research; the tip here is to look for databases from other areas of knowledge.

Structure of the review

It should focus mainly on concepts and not on authors.

One thing that helps a lot is categorizing the articles qualitatively, so that aspects such as gaps, type of methodology, and nature of the work can be compiled later.
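The categorization can be as simple as a small table of hand-made tags that is later pivoted by concept. The sketch below is a minimal illustration in the spirit of a concept-centric review; the article references, concepts, and methods are hypothetical placeholders.

```python
from collections import defaultdict

# Hypothetical notes taken while reading; each article is tagged by hand.
articles = [
    {"ref": "Author A (2005)", "method": "survey", "concepts": ["adoption", "cost"]},
    {"ref": "Author B (2009)", "method": "case study", "concepts": ["adoption"]},
    {"ref": "Author C (2011)", "method": "experiment", "concepts": ["cost", "quality"]},
]


def concept_matrix(items):
    """Map each concept to the list of articles that discuss it."""
    matrix = defaultdict(list)
    for item in items:
        for concept in item["concepts"]:
            matrix[concept].append(item["ref"])
    return dict(matrix)


def count_by(items, key):
    """Count articles per value of a qualitative tag such as 'method'."""
    counts = defaultdict(int)
    for item in items:
        counts[item[key]] += 1
    return dict(counts)


matrix = concept_matrix(articles)
for concept in sorted(matrix):
    print(concept, "->", ", ".join(matrix[concept]))
print(count_by(articles, "method"))
```

Concepts with a single article behind them are natural candidates for the gaps section.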

Theoretical development

The point here is: based on the past, the current state of things, and the present limitations and gaps, how can this be used for the future?

Here I recommend a modest expansion of ideas, something that is plausible and is not a research agenda for 10 years into the future.

Reasoning for the propositions

  • Theoretical explanations (the why): This is the glue that holds all the practice together in a systematized, reproducible, observable, transferable, and replicable way;
  • Past empirical findings: The support from what has been observed over time, the quality of the evidence presented, and how those observations were made; and
  • Practice and experience: Mechanisms for validating the theory and for later refinement of what is being theorized and done.

Conclusion

I think that by now I have made my case for the importance of the systematic review as a tool for understanding the past and the present, and also as a method that helps plan the future of research.

Personally, whenever some kind of practice is about to be adopted, I recommend using this tool before any academic project.

References

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997). Systematic reviews: synthesis of best evidence for clinical decisions. Annals of Internal Medicine, 126(5), 376-380. Link: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.733.1479&rep=rep1&type=pdf

Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571-583. Link: https://www.sciencedirect.com/science/article/pii/S016412120600197X

Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, xiii-xxiii. Link: https://www.researchgate.net/profile/Harald_Kindermann/post/How_to_write_the_academic_review_article_in_the_field_of_management/attachment/5abe1af54cde260d15d5d477/AS%3A609838266593280%401522408181169/download/2002_Webster_Writing+a+Literature+Review.pdf

Deep Learning, Nature and Data Leakage, Reproducibility and Academic Engineering

This piece by Rajiv Shah called “Stand up for Best Practices”, which involves the well-known scientific journal Nature, shows how academic rigor failed at several layers and why reproducibility matters.

The letter Deep learning of aftershock patterns following large earthquakes by DeVries et al., published in Nature, shows, according to Shah, a basic Data Leakage problem that could invalidate all the experiments. Shah tried to replicate the results and found the Data Leakage scenario; when he tried to communicate the error to the authors and to Nature, he got some harsh responses (some are at the end of this post).

Paper abstract – Source: https://www.nature.com/articles/s41586-018-0438-y


The repository with all the analysis is here.

Of course, a letter is a short piece that briefly communicates a larger body of research, and sometimes the authors need to suppress some information for the sake of clarity or because of journal limitations. On this point I can understand the authors, and since they kindly provided the source code (here), more skeptical minds can check the ground truth.

As I said before in my 2019 mission statement: “In god we trust, others must bring the raw data with the source code of the extraction in the GitHub”.

The main question here is not whether the authors made a mistake (they did, because they used part of an earthquake to train the model, and this by itself can explain the AUC being higher on the test set than on the training set) but how this academic engineering is killing the Machine Learning field and inflating a bubble of expectations.
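To make the leakage concrete: the fix Shah describes in his letter (reproduced at the end of this post) is group partitioning, where all rows from one earthquake go either to train or to test, never both. The sketch below is my own minimal illustration in plain Python, not Shah's or the authors' code; the function name and the earthquake labels are hypothetical.

```python
import random


def group_split(groups, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) so that no group appears on both sides."""
    unique = sorted(set(groups))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_groups = set(unique[:n_test])
    train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
    test_idx = [i for i, g in enumerate(groups) if g in test_groups]
    return train_idx, test_idx


# Hypothetical rows: each grid cell is labelled with the earthquake it came from.
rows = (["quake_a"] * 4 + ["quake_b"] * 3 + ["quake_c"] * 5
        + ["quake_d"] * 2 + ["quake_e"] * 6)

train_idx, test_idx = group_split(rows, test_fraction=0.2, seed=42)
train_groups = {rows[i] for i in train_idx}
test_groups = {rows[i] for i in test_idx}

# The property that a plain row-level shuffle does not guarantee.
assert train_groups.isdisjoint(test_groups)
print(sorted(train_groups), sorted(test_groups))
```

In practice, scikit-learn ships splitters such as GroupKFold and GroupShuffleSplit for the same purpose; the hand-rolled version is here only to show how little is needed to enforce the constraint.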

But first I’ll borrow the definition of Academic Engineering given by Filip Piekniewski in his classic Autopsy Of A Deep Learning Paper:

I read a lot of deep learning papers, typically a few/week. I’ve read probably several thousands of papers. My general problem with papers in machine learning or deep learning is that often they sit in some strange no man’s land between science and engineering, I call it “academic engineering”. Let me describe what I mean:

1) A scientific paper IMHO, should convey an idea that has the ability to explain something. For example a paper that proves a mathematical theorem, a paper that presents a model of some physical phenomenon. Alternatively a scientific paper could be experimental, where the result of an experiment tells us something fundamental about the reality. Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality.

2) An engineering paper shows a method of solving a particular problem. Problems may vary and depend on an application, sometimes they could be really uninteresting and specific but nevertheless useful for somebody somewhere. For an engineering paper, things that matter are different than for a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, could be practically implemented e.g. given available components, is cheaper or more energy efficient than other solutions and so on. The central point of an engineering paper is an application, and the rest is just a collection of ideas that allow to solve the application.

Machine learning sits somewhere in between. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

One thing I noticed about this Academic Engineering phenomenon is that a lot of (well-intentioned) people are running many experiments, using nice tools, and making their code available, which is very cool. However, I also noticed that some of these Academic Engineering papers bring tons of methodological problems in the Machine Learning part.

I tackled one example of this some months ago, related to a systematic review by Christodoulou et al. called “A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models”, in which the authors wanted to start a confirmatory study without a clear understanding of the methodology behind Machine Learning and Deep Learning papers (you can read the full post here).

The Nature letter by DeVries et al. is no different. Let’s check, for example, HOW they ended up with the right architecture. The paper makes only the following remark about the architecture:

The neural networks used here are fully connected and have six hidden layers with 50 neurons each and hyperbolic tangent activation functions (13,451 weights and biases in total). The first layer corresponds to the inputs to the neural network; in this case, these inputs are the magnitudes of the six independent components of the co-seismically generated static elastic stress-change tensor calculated at the centroid of a grid cell and their negative values. 

DeVries, P. M. R., Viégas, F., Wattenberg, M., & Meade, B. J. (2018)
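As a quick sanity check, the parameter count in the quote can be reproduced by hand. The sketch below assumes 12 inputs (the six stress components plus their negative values) and a single output unit; the output size is my assumption, chosen because it is consistent with the reported total.

```python
def dense_param_count(layer_sizes):
    """Weights plus biases of a fully connected network, given its layer widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Assumed layout: 12 inputs, six hidden layers of 50 neurons, 1 output.
layers = [12] + [50] * 6 + [1]

total = dense_param_count(layers)
print(total)  # 13451, matching the "13,451 weights and biases" in the quote
```

The arithmetic is 12*50+50 = 650 for the first hidden layer, 5*(50*50+50) = 12,750 for the remaining hidden layers, and 50*1+1 = 51 for the output, which adds up to 13,451.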

The code available on GitHub shows the architecture:

The choice of architecture alone raises tons of questions about methodological rigor, such as:

  • Why 6 layers and not 10 or 20? How did they arrive at this number of layers?
  • What were the criteria for choosing 50 as the number of neurons? What was the process for identifying that number?
  • All layers use lecun_uniform as the kernel initializer. Why is this initializer the most suitable for this problem/data? Were other options tested? If so, what were the results? And why was the seed for lecun_uniform not set?

I raised these questions in only 8 minutes (and believe me, even junior reviewers from B-class journals would ask them), and from the bottom of my heart, I would like to believe that Nature is doing the same.

After that, a question arises: if even a very well-known scientific journal is rewarding this kind of academic engineering, even with all the code available, and is not even considering reviewing the letter, what could be happening right now in the many papers that lack this kind of verification mechanism, where the research itself is a complete black box?

Final thoughts

There’s an eagerness to believe almost any journal with a huge impact and to spread the word about its good results, but if you cannot explain HOW a result was obtained with methodological rigor, IMHO the result is meaningless.

Stay sane amid the hype, stay skeptical.

Below are all the letters exchanged:

FIRST LETTER FROM MR. SHAH

Dear Editors:

A recent paper you published by DeVries, et al., Deep learning of aftershock patterns following large Earthquakes, contains significant methodological errors that undermine its conclusion. These errors should be highlighted, as data science is still an emerging field that hasn’t yet matured to the rigor of other fields. Additionally, not correcting the published results will stymie research in the area, as it will not be possible for others to match or improve upon the results. We have contacted the author and shared with them the problems around data leakage, learning curves, and model choice. They have not yet responded back.

First, the results published in the paper, AUC of 0.849, are inflated because of target leakage. The approach in the paper used part of an earthquake to train the model, which then was used again to test the model. This form of target leakage can lead to inflated results in machine learning. To prevent against this, a technique called group partitioning is used. This requires ensuring an earthquake appears either in the train portion of the data or the test portion. This is not an unusual methodological mistake, for example a recent paper by Rajpurkar et al. on chest x-rays made the same mistake, where x-rays for an individual patient could be found in both the train and test set. These authors later revised their paper to correct this mistake.

In this paper, several earthquakes, including 1985NAHANN01HART, 1996HYUGAx01YAGI, 1997COLFIO01HERN, 1997KAGOSH01HORI, 2010NORTHE01HAYE were represented in both the train and test part of the dataset. For example, in 1985 two large magnitude earthquakes occurred near the North Nahanni River in the northeast Cordillera, Northwest Territories, Canada, on 5 October (MS 6.6) and 23 December (MS 6.9). In this dataset, one of the earthquakes is in the train set and the other in the test set. To ensure the network wasn’t learning the specifics about the regions, we used group partitioning, this ensures an earthquake’s data only was in test or in train and not in both. If the model was truly learning to predict aftershocks, such a partitioning should not affect the results.

We applied group partitioning of earthquakes randomly across 10 different runs with different random seeds for the partitioning. I am happy to share/post the group partitioning along with the revised datasets. We found the following results as averaged across the 10 runs (~20% validation):

Method                           Mean AUC
Coulomb failure stress-change    0.60
Maximum change in shear stress   0.77
von Mises yield criterion        0.77
Random Forest                    0.76
Neural Network                   0.77

In terms of predictive performance, the machine learning methods are not an improvement over traditional techniques of the maximum change in shear stress or the von Mises yield criterion. To assess the value of the deep learning approach, we also compared the performance to a baseline Random Forest algorithm (basic default parameters – 100 trees) and found only a slight improvement.

It is crucial that the results in the paper will be corrected. The published results provide an inaccurate portrayal of the results of machine learning / deep learning to predict aftershocks. Moreover, other researchers will have trouble sharing or publishing results because they cannot meet these published benchmarks. It is in the interest of progress and transparency that the AUC performance in the paper will be corrected.

The second problem we noted is not using learning curves. Andrew Ng has popularized the notion of learning curves as a fundamental tool in error analysis for models. Using learning curves, one can find that training a model on just a small sample of the dataset is enough to get very good performance. In this case, when I run the neural network with a batch size of 2,000 and 8 steps for one epoch, I find that 16,000 samples are enough to get a good performance of 0.77 AUC. This suggests that there is a relatively small signal in the dataset that can be found very quickly by the neural network. This is an important insight and should be noted. While we have 6 million rows, you can get the insights from just a small portion of that data.

The third issue is jumping straight to a deep learning model without considering baselines. Most mainstream machine learning papers will use benchmark algorithms, say logistic regression or random forest when discussing new algorithms or approaches. This paper did not have that. However, we found that a simple random forest model was able to achieve similar performance to neural network. This is an important point when using deep learning approaches. In this case, really any simple model (e.g. SVM, GAM) will provide comparable results. The paper gives the misleading impression that only deep learning is capable of learning the aftershocks.

As practicing data scientists, we see these sorts of problems on a regular basis. As a field, data science is still immature and there isn’t the methodological rigor of other fields. Addressing these errors will provide the research community with a good learning example of common issues practitioners can run into when using machine learning. The only reason we can learn from this is that the authors were kind enough to share their code and data. This sort of sharing benefits everyone in the long run.

At this point, I have not publicly shared or posted any of these concerns. I have shared them with the author and she did not reply back after two weeks. I thought it would be best to privately share them with you first. Please let me know what you think. If we do not hear back from you by November 20th, we will make our results public.

Thank you

Rajiv Shah

University of Illinois at Chicago

Lukas Innig

DataRobot

NATURE COMMENTS

Referee’s Comments:

In this proposed Matters Arising contribution, Shah and Innig provide critical commentary on the paper “Deep learning aftershock patterns following large earthquakes”, authored by Devries et al. and published in Nature in 2018. While I think that Shah and Innig make several valid and interesting points, I do not endorse publication of the comment-and-reply in Matters Arising. I will explain my reasoning for this decision in more detail below, but the upshot of my thinking is that (1) I do not feel that the central results of the study are compromised in any way, and (2) I am not convinced that the commentary is of interest to audience of non-specialists (that is, non machine learning practitioners).

Shah and Innig’s comment (and Devries and Meade’s response) centers on three main points of contention: (1) the notion of data leakage, (2) learning curve usage, and (3) the choice of deep learning approach in lieu of a simpler machine learning method. Point (1) is related to the partitioning of earthquakes into training and testing datasets. In the ideal world, these datasets should be completely independent, such that the latter constitutes a truly fair test of the trained model’s performance on data that it has never seen before. Shah and Innig note that some of the ruptures in the training dataset are nearly collocated in space and time with ruptures in the testing dataset, and thus a subset of aftershocks are shared mutually. This certainly sets up the potential for information to transfer from the training to testing datasets (violating the desired independence described above), and it would be better if the authors had implemented grouping or pooling to safeguard against this risk. However, I find Devries and Meade’s rebuttal to the point to be compelling, and would further posit that the potential data leakage between nearby ruptures is a somewhat rare occurrence that should not modify the main results significantly.

Shah and Innig’s points (2) and (3) are both related, and while they are interesting to me, they are not salient to the central focus of the paper. It is neat (and perhaps worth noting in a supplement), that the trainable parameters in the neural network, the network biases and weights, can be adequately trained using a small batch of the full dataset. Unfortunately, this insight from the proposed learning curve scheme would likely shoot over the heads of the 95% of the general Nature audience that are unfamiliar with the mechanics of neural networks and how they are trained. Likewise, most readers wouldn’t have the foggiest notion of what a Random Forest is, nor how it differs from a deep neural network, nor why it is considered simpler and more transparent. The purpose of the paper (to my understanding) was not to provide a benchmark machine learning algorithm so that future groups could apply more advanced techniques (GANs, Variational Autoencoders, etc.) to boost AUC performance by 5%. Instead, the paper showed that a relatively simple, but purely data-driven approach could predict aftershock locations better than Coulomb stress (the metric used in most studies to date) and also identify stress-based proxies (max shear stress, von Mises stress) that have physical significance and are better predictors than the classical Coulomb stress. In this way, the deep learning algorithm was used as a tool to remove our human bias toward the Coulomb stress criterion, which has been ingrained in our psyche by more than 20 years of published literature.

To summarize: regarding point (1), I wish the Devries et al. study had controlled for potential data leakage, but do not feel that the main results of the paper are compromised by doing so. As for point (2), I think it is interesting (though not surprising) that the neural network only needs a small batch of data to be adequately trained, but this is certainly a minor point of contention, relative to the key takeaways of the paper, which Shah and Innig may have missed. Point (3) follows more or less directly from (2), and it is intuitive that a simpler and more transparent machine learning algorithm (like a Random Forest) would give comparable performance to a deep neural network. Again, it would have been nice to have noted in the manuscript that the main insights could have been derived from a different machine learning approach, but this detail is of more interest to a data science or machine learning specialist than to a general Nature audience. I think the disconnect between the Shah and Innig and Devries et al. is a matter of perspective. Shah and Innig are concerned primarily with machine learning best practices methodology, and with formulating the problem as “Kaggle”-like machine learning challenge with proper benchmarking. Devries et al. are concerned primarily with using machine learning as tool to extract insight into the natural world, and not with details of the algorithm design.

AUTHORS RESPONSE

Deep Learning and Radiology, False Dichotomy, Tools and a Paradigm Shift

From the MIT Tech Review article called “Google shows how AI might detect lung cancer faster and more reliably” we have the following information:

Early warning: Danial Tse, a researcher at Google, developed an algorithm that beat a number of trained radiologists in testing. Tse and colleagues trained a deep-learning algorithm to detect malignant lung nodules in more than 42,000 CT scans. The resulting algorithms turned up 11% fewer false positives and 5% fewer false negatives than their human counterparts. The work is described in a paper published in the journal Nature today.

That reminds me of a lot of hate, defensiveness, confirmation bias, and especially a lack of understanding of technology and its potential to help people worldwide. I won’t cite most of it here, but you can check my Twitter @flavioclesio.

Some people from academic circles, especially from Statistics and Epidemiology, started bashing the automation of statistical methods (Machine Learning) in several different ways, using a lot of questionable methods to assess ML, and even using one of the worst systematic reviews in history to create a false dichotomy between Stats and ML researchers.

Most of the time, this kind of criticism, lacking a consistent argument around the central point, sounds more like pedantry, where these people tell us in a subliminal way: “Hey, look at those nerds, they do not know what they are doing. Trust us, <<Classical Methods Professors>>. We have <<Number of Papers>> in that field, and those folks are only coders who don’t have all the training that we have.”

This situation is so common that in April I had to join a thread with Frank Harrell to argue that an awful/pointless Systematic Review should not be used to create that kind of pointless dichotomy:

My point is: Statistics, Machine Learning, Artificial Intelligence, Python, R, and so on are tools and should be treated as such.

Closing thoughts

I invite all my 5 readers to exercise the following paradigm shift: instead of asking

Will this AI in Health take Doctors out of their jobs?

let’s change the question to

Hey, are you telling me that with this very easy to implement free software and commodity CPU power we can, together with the Doctors, democratize health exams for less privileged people?

Two gentle ways to fix Peer Review

In this beautiful blog post, Jacob Buckman proposes two gentle ways to improve peer review.

On the relative certification that conferences provide through the peer review process:

So my first suggestion is this: change from a relative metric to a standalone evaluation. Conferences should accept or reject each paper by some fixed criteria, regardless of how many papers get submitted that year. If there end up being too many papers to physically fit in the venue, select a subset of accepted papers, at random, to invite. This mitigates one major source of randomness from the certification process: the quality of the other papers in any given submission pool.

The most important piece is the creation of a rejection board to disincentivize low-quality submissions:

This means that if you submit to NeurIPS and they give you an F (rejection), it’s a matter of public record. The paper won’t be released, and you can resubmit that work elsewhere, but the failure will always live on. (Ideally we’ll develop community norms around academic integrity that mandate including a section on your CV to report your failures. But if not, we can at least make it easy for potential employers to find that information.)
Why would this be beneficial? Well, it should be immediately obvious that this will directly disincentivize people from submitting half-done work. Each submission will have to be hyper-polished to the best it can possibly be before being submitted. It seems impossible that the number of papers polished to this level will be anywhere close to the number of submissions that we see at major conferences today. Those who choose to repeatedly submit poor-quality work anyways will have their CVs marred with a string of Fs, cancelling out any certification benefits they had hoped to achieve.

I would personally bet €100 that if any conference adopted this mechanism, at least 98% of these flag-planting papers would vanish forever.

The sunset of statistical significance

Brian Resnick hit the nail on the head in his latest Vox column, “800 scientists say it’s time to abandon ‘statistical significance’”, where he raises an important discussion about how the p-value misleads science, especially for studies that clearly measure some particular effect but are thrown away for lack of statistical significance.

In the column, Mr. Resnick lists alternatives for getting (…) better, more nuanced approaches to evaluating science (…):

– Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful?)

– Confidence intervals (what’s the range of doubt built into any given answer?)

– Whether a result is a novel study or a replication (put some more weight into a theory many labs have looked into)

– Whether a study’s design was preregistered (so that authors can’t manipulate their results post-test), and that the underlying data is freely accessible (so anyone can check the math)

– There are also alternative statistical techniques — like Bayesian analysis — that in some ways more directly evaluate a study’s results. (P-values ask the question “how rare are my results?” Bayes factors ask the question “what is the probability my hypothesis is the best explanation for the results we found?” Both approaches have trade-offs. )
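The first two items on the list can be illustrated with a minimal sketch, using only the Python standard library. The measurements are invented for illustration, the function names are hypothetical, and the interval uses a normal approximation (z = 1.96); with samples this small a t-based interval would be more appropriate, so treat it as a sketch rather than a recipe.

```python
import math
import statistics


def cohens_d(treated, control):
    """Standardized mean difference using the pooled sample standard deviation."""
    n1, n2 = len(treated), len(control)
    pooled_var = (
        (n1 - 1) * statistics.variance(treated)
        + (n2 - 1) * statistics.variance(control)
    ) / (n1 + n2 - 2)
    return (statistics.mean(treated) - statistics.mean(control)) / math.sqrt(pooled_var)


def mean_diff_ci(treated, control, z=1.96):
    """Approximate 95% interval for the difference in means (normal approximation)."""
    diff = statistics.mean(treated) - statistics.mean(control)
    se = math.sqrt(
        statistics.variance(treated) / len(treated)
        + statistics.variance(control) / len(control)
    )
    return diff - z * se, diff + z * se


# Hypothetical measurements, invented for illustration only.
treated = [5.1, 5.8, 6.2, 5.5, 6.0, 5.9]
control = [4.9, 5.2, 5.0, 5.4, 4.8, 5.1]

d = cohens_d(treated, control)
low, high = mean_diff_ci(treated, control)
print(f"Cohen's d = {d:.2f}; difference in means, approx. 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting the effect size together with its interval says how large the difference is and how uncertain it is, which a bare "significant / not significant" verdict does not.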

PS: Frank Harrell (Founding Chair of Biostatistics, Vanderbilt U.; Expert Statistical Advisor, Office of Biostatistics) gave us this delightful tweet:

Source Twitter

Some quick comments about Genevera Allen’s statements regarding Machine Learning

Opening note: Favio Vazquez did a great job in his article on this topic, with plenty of charts, showing that with the modern Machine Learning approach and the tools we currently have, the problems of replication and methodology are being tackled.

It is becoming a trend: some researcher has a criticism of Machine Learning and starts cherry-picking (the fallacy of incomplete evidence) potential issues, opening with statements like “We have a problem in Machine Learning and the results are not reproducible”, “Machine Learning doesn’t work”, “Artificial intelligence faces reproducibility crisis”, “AI researchers allege that machine learning is alchemy”, and boom: we have clickbait, rants, bashing and a never-ending spiral of non-constructive criticism. Afterward this researcher gets the spotlight in the public debate about Machine Learning, goes on CNN to give interviews and becomes a “reference on issues in Machine Learning”.

Now it is Ms. Allen’s turn, with the question/statement “Can we trust scientific discoveries made using machine learning?”. She brings good arguments to the debate, but I think she misses the point because 1) she does not bring any solution or proposal, and 2) the statement itself is so broad and obvious that it could be applied to any scientific field.

My main intention here is just to make very short comments showing that these issues are well known to the Machine Learning community and that we have several tools and methods to tackle them.

The second intention is to demonstrate that this kind of very broad, obvious argument brings more friction than light to the debate. I’ll include each statement and a short response below:

“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?'” Allen said. “The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”

Comment: More data does not imply more insight, and harder than getting more data is finding the right combination of hyperparameters, feature engineering, and model ensembling/stacking. Every scientific statement must be checked (this is a basic assumption of the scientific method). But that assumption may not hold in modern research, where we celebrate oversold scientific statements while researchers intentionally hide their methods and findings. It is like Hans Bethe hiding his discoveries about stellar nucleosynthesis because at some point in the future someone could potentially use them to make atomic bombs.

“A lot of these techniques are designed to always make a prediction,” she said. “They never come back with ‘I don’t know,’ or ‘I didn’t discover anything,’ because they aren’t made to.”

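Comment: This is simply not true. A very quick check of Scikit-Learn, XGBoost and Keras (three of the most popular ML libraries) shatters this argument: all three can return class probabilities rather than only hard labels, so a practitioner can decline to act on low-confidence predictions.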
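A minimal sketch of what an “I don’t know” answer can look like in practice. It assumes the class probabilities come from any classifier that exposes them (for example, the `predict_proba` method that many Scikit-Learn classifiers provide); the function name, the labels and the 0.8 threshold are hypothetical choices for illustration.

```python
def predict_or_abstain(class_probs, labels, threshold=0.8):
    """Return the most probable label, or None when the model is not confident enough."""
    best = max(range(len(class_probs)), key=lambda i: class_probs[i])
    if class_probs[best] < threshold:
        return None  # the "I don't know" answer
    return labels[best]


# Hypothetical labels and probabilities, for illustration only.
labels = ["benign", "malignant"]
print(predict_or_abstain([0.95, 0.05], labels))  # confident: prints "benign"
print(predict_or_abstain([0.55, 0.45], labels))  # not confident: prints None
```

The threshold is a design decision that depends on the cost of a wrong answer versus the cost of deferring to a human; the tooling does not force a prediction on anyone.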

“In precision medicine, it’s important to find groups of patients that have genomically similar profiles so you can develop drug therapies that are targeted to the specific genome for their disease,” Allen said. “People have applied machine learning to genomic data from clinical cohorts to find groups, or clusters, of patients with similar genomic profiles. “But there are cases where discoveries aren’t reproducible; the clusters discovered in one study are completely different than the clusters found in another,”

Comment: Here we have a classic case of generalizing from a misleading experience, with clear confirmation bias, because of confusion between tools and methodology. The “logic” of this argument is: a person wants to cut some vegetables to make a salad. This person uses a kitchen knife (the tool), but instead of using it properly (in the kitchen, on a proper cutting board), cuts the vegetables at the top of a staircase after drinking two bottles of vodka (the wrong method) and ends up getting cut; this person then concludes that the knife is dangerous and doesn’t work.

A number of guidelines are being proposed, and there are several good resources: Machine Learning Mastery has already tackled this issue, this excellent post from Determined ML makes a good argument, and this repo has tons of reproducible papers, some even using Deep Learning. The main point is: any junior Machine Learning Engineer knows that hashing the dataset and fixing a seed at the beginning of the experiment can solve at least 90% of these problems.
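A minimal sketch of those two habits, using only the Python standard library. The helper names and the toy dataset are hypothetical; a real project would also need to seed NumPy and whichever deep learning framework it uses, which this sketch does not cover.

```python
import hashlib
import random


def dataset_fingerprint(rows):
    """SHA-256 over the rows in order, so any change to the data changes the hash."""
    digest = hashlib.sha256()
    for row in rows:
        digest.update(repr(row).encode("utf-8"))
    return digest.hexdigest()


def split_train_test(rows, test_fraction=0.25, seed=42):
    """Shuffle with a private, seeded generator so the split is repeatable."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy dataset, invented for illustration only.
rows = [(i, i % 3) for i in range(20)]
print("dataset sha256:", dataset_fingerprint(rows))
train, test = split_train_test(rows, seed=42)
print(len(train), "training rows,", len(test), "test rows")
```

Logging the fingerprint and the seed next to the results lets anyone confirm they are rerunning the same experiment on the same data.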

Conclusion

There are a lot of researchers and journalists who cannot (or do not want to) understand that the replication of studies is a huge problem not only in Machine Learning but in all of science (this is not the case for Ms. Allen, who has a very interesting publication track record in ML). In psychology, half of the studies cannot be replicated, and even some medical findings turn out to be false, which shows that there is a very long road ahead to minimize this kind of problem.

Why XGBoost wins every Machine Learning competition

A (long and) good answer is in this thesis by Didrik Nielsen.

16128_FULLTEXT

Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling.
It has shown remarkable results for a vast array of problems.
For many years, MART has been the tree boosting method of choice.
More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.
In this thesis, we will investigate how XGBoost differs from the more traditional MART.
We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.
Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.
In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.
To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.
The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.

Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth is combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. 
Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.
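To make the “higher-order approximation” concrete, here is a simplified sketch contrasting a first-order and a second-order leaf value for logistic loss. It is not the thesis’s code nor XGBoost’s implementation: it looks at a single leaf, ignores tree-structure search and the line search that MART implementations apply afterwards, and the function names are hypothetical. The Newton leaf value uses the regularized form -G / (H + lambda), where G and H are the sums of first and second derivatives in the leaf; `reg_lambda` is named after XGBoost’s L2 regularization parameter.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_hess(y, score):
    """First and second derivative of logistic loss with respect to the raw score."""
    p = sigmoid(score)
    return p - y, p * (1.0 - p)


def gradient_leaf_value(ys, scores):
    """First-order: least-squares fit to the negative gradients, i.e. their mean."""
    grads = [grad_hess(y, s)[0] for y, s in zip(ys, scores)]
    return -sum(grads) / len(grads)


def newton_leaf_value(ys, scores, reg_lambda=1.0):
    """Second-order: -G / (H + lambda), using gradients and Hessians in the leaf."""
    pairs = [grad_hess(y, s) for y, s in zip(ys, scores)]
    G = sum(g for g, _ in pairs)
    H = sum(h for _, h in pairs)
    return -G / (H + reg_lambda)


# Toy leaf: four observations, current raw score 0 for all of them.
ys = [1, 1, 0, 1]
scores = [0.0, 0.0, 0.0, 0.0]
print("gradient step:", gradient_leaf_value(ys, scores))
print("Newton step (lambda=1):", newton_leaf_value(ys, scores, reg_lambda=1.0))
```

The second-order version scales the step by the local curvature and shrinks it through the penalty term, which is one way to read the abstract’s remark about penalizing individual trees.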

Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Abstract: In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient (R²) between model and predicted values.
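For readers unfamiliar with the two metrics the abstract relies on, here is a minimal sketch of how they are usually computed. The sales figures are invented for illustration and have nothing to do with the paper’s data; note that MAPE is undefined when an actual value is zero.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical monthly sales and forecasts, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]
print(f"MAPE = {mape(actual, predicted):.2f}%")
print(f"Pearson r = {pearson_r(actual, predicted):.3f}")
```

A low MAPE and a high correlation measure different things: the first is about the size of the errors, the second about whether forecasts move together with the actual values.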

A (very) provocative essay about systematic reviews

Via O’Reilly Ideas

Systematic reviews are still the best method to validate (with some degree of certainty) any theory, but they are not a silver bullet.

Context: Currently, most systematic reviews and meta-analyses are done retrospectively with fragmented published information. This article aims to explore the growth of published systematic reviews and meta-analyses and to estimate how often they are redundant, misleading, or serving conflicted interests.

Methods: Data included information from PubMed surveys and from empirical evaluations of meta-analyses.

Findings: Publication of systematic reviews and meta-analyses has increased rapidly. In the period January 1, 1986, to December 4, 2015, PubMed tags 266,782 items as “systematic reviews” and 58,611 as “meta-analyses.” Annual publications between 1991 and 2014 increased 2,728% for systematic reviews and 2,635% for meta-analyses versus only 153% for all PubMed-indexed items. Currently, probably more systematic reviews of trials than new randomized trials are published annually. Most topics addressed by meta-analyses of randomized trials have overlapping, redundant meta-analyses; same topic meta-analyses may exceed 20 sometimes. Some fields produce massive numbers of meta-analyses; for example, 185 meta-analyses of antidepressants for depression were published between 2007 and 2014. These meta-analyses are often produced either by industry employees or by authors with industry ties and results are aligned with sponsor interests. China has rapidly become the most prolific producer of English-language, PubMed-indexed meta-analyses. The most massive presence of Chinese meta-analyses is on genetic associations (63% of global production in 2014), where almost all results are misleading since they combine fragmented information from mostly abandoned era of candidate genes. Furthermore, many contracting companies working on evidence synthesis receive industry contracts to produce meta-analyses, many of which probably remain unpublished. Many other meta-analyses have serious flaws. Of the remaining, most have weak or insufficient evidence to inform decision making. Few systematic reviews and meta-analyses are both non-misleading and useful.

Conclusions: The production of systematic reviews and meta-analyses has reached epidemic proportions. Possibly, the large majority of produced systematic reviews and meta-analyses are unnecessary, misleading, and/or conflicted.

 

The Facebook Experiment and Happy Cats

Straight from KDNuggets.

cartoon-facebook-data-science-experiment

Man: “I was about to write an angry post about Facebook’s emotional manipulation study, but then I got distracted by all the happy cat photos they showed me.”

For those who did not get it, this is the Forbes article that explains a bit about the sentiment manipulation study, and here is the original paper.

For anyone who wants to download the original paper, the link is below.

PNAS-2014-Kramer-8788-90

Reproducibility in Data Mining and Machine Learning

This post from Geomblog frames the subject in a very particular way. A short excerpt below:

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one’s nose and scratching one’s ear. 
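The kind of sensitivity check described in the excerpt can be sketched as a small parameter sweep that reports how much the score moves across settings. This is a toy example under stated assumptions: the data, the threshold “classifier” and the helper names are all hypothetical, standing in for a real experiment.

```python
import itertools
import statistics


def sensitivity_report(evaluate, grid):
    """Evaluate every combination in the grid and summarize how much the score moves."""
    names = sorted(grid)
    scores = {}
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores[values] = evaluate(**params)
    observed = list(scores.values())
    return {
        "min": min(observed),
        "max": max(observed),
        "spread": max(observed) - min(observed),
        "stdev": statistics.pstdev(observed),
        "scores": scores,
    }


# Toy data: (feature value, label). Invented for illustration only.
DATA = [(0.1, 0), (0.2, 0), (0.35, 0), (0.4, 1), (0.6, 1), (0.7, 1), (0.8, 1), (0.9, 1)]


def threshold_accuracy(threshold):
    """Accuracy of the rule 'predict 1 when the feature exceeds the threshold'."""
    correct = sum((x > threshold) == bool(label) for x, label in DATA)
    return correct / len(DATA)


report = sensitivity_report(threshold_accuracy, {"threshold": [0.3, 0.35, 0.4, 0.5]})
print("best:", report["max"], "worst:", report["min"], "spread:", report["spread"])
```

A paper that reports only the best setting hides the spread; reporting the whole sweep shows whether the result survives a small turn of the knob.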

 

We talked a bit about this here on the site in this post.

Reproducible Research with R and RStudio – A Book on Reproducible Research

Still on the subject of research reproducibility, a book on the topic is about to be released: Reproducible Research with R and RStudio, written by Christopher Gandrud.

In the book excerpt, the author offers 5 practical tips for creating/reproducing research:

  1. Document everything!,
  2. Everything is a (text) file,
  3. All files should be human readable,
  4. Explicitly tie your files together,
  5. Have a plan to organize, store, and make your files available.

Replication in Academic Research in Data Mining

Reading this post by John Taylor about the replication of economic research published even in high-impact journals, I was reminded of a very common practice in academic journals in Production Engineering and Data Mining: the irreproducibility of published articles.

This irreproducibility lies in the way the results are obtained, especially with techniques such as Clustering, Association Rules, and above all Neural Networks.

An academic/technical/experimental work that cannot be reproduced is a priori 1) methodologically weak and 2) terribly reviewed. Work with these characteristics provides as much support for knowledge as so-called anecdotal evidence.

After reading more than 150 papers in 2012 (and heading for 300 in 2013), the structure never changes:

  • Introduction;
  • Literature Review;
  • Application of the Technique;
  • Results; and
  • Discussion claiming a 90% gain with neural networks.

Here is a quite useful checklist for analyzing an academic article with a terrible DOE (design of experiments) and poor methodological grounding:

Clustering Articles

  • What was the sample size?
  • What is the minimum sample size within the estimated population?
  • Were statistical tests performed on the population, such as a Z-test or ANOVA?
  • What is the p-value?
  • What technique was used to determine the separation of the clusters?
  • Which parameters were used for the clustering?
  • Why was algorithm Z chosen?

Association Rules Articles

  • What was the minimum support?
  • What is the sample size, and how statistically representative is it of the population?
  • How much does the SUPPORT represent the POPULATION within the study?
  • How were the actionable rules pruned?
  • Is the sample generalizable? Why was the experiment not run on the ENTIRE population?
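For reference, the support that the questions above ask about, and the confidence used to prune rules, can be computed as in the minimal sketch below. The transactions and function names are hypothetical, invented for illustration; a paper should report the thresholds it applied to these quantities.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Support of the full rule divided by the support of its antecedent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

print("support(bread, milk) =", support(transactions, {"bread", "milk"}))
print("confidence(bread -> milk) =", confidence(transactions, {"bread"}, {"milk"}))
```

With only four baskets, a support of 0.5 means two transactions, which is exactly why the sample size and its representativeness have to be reported alongside the minimum support.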

Neural Networks

  • What is the network architecture?
  • Why was the hyperbolic tangent activation function used and not the logistic sigmoid (or vice versa)?
  • Is the activation function appropriate for the data being studied? How were the data pre-processed and discretized?
  • Why was that number of hidden layers chosen?
  • Is there a learning rate? What was it and why was that rate chosen?
  • Is there decay? Why?
  • And momentum? Was it used? With which parameters?
  • What cost structure is tied to the results? How many type I and type II errors did the network make?
  • And the number of epochs? How was it determined, and at what point did the network stop converging? Do you think it is a global or a local minimum error? How do you explain that in the article’s results?

At first this may look like academic deconstructionism dressed up as critical examination, but for anyone living in an environment where more-than-fraudulent studies are painted as revolutionary, it is a resource that works as a shield against nonsense (a Bullshit Shield).

In short, with answers to 50% of the questions above, the risk of it being a bad paper with “black-box” results already falls to 10%, and that is where the real work of analysis for reproducing the article begins.

Below is a quite interesting video about papers that are nothing more than anecdotal evidence.

Why is academic review an unnecessary filter for science?

This post from Normal Deviate presents a situation that is quite common in academic settings: the author works for months writing an original article, revises it, sends it to a journal, and then simply gets a rejection. Good articles are often discarded much more over matters of form than content, while articles that are nothing more than a pile of citations get published.

The criticism is very pertinent and offers an interesting point of view, advocating a parallel movement (one that can live in sync with the peer review method that has been in use for more than 350 years): a free publication site. As the author points out, it does not seem reasonable to do science in its most modern form using review and validation methods from 350 years ago without any criticism of them.

Marcelo Hermes França (one of the most respected scientists in Brazil and owner of the site Ciência Brasil) is a great defender of the current review system and one of the biggest critics of scientific journals and impact factor in Brazil; many of his posts expose the shady way these journals do science through paid articles, sloppy reviews, and above all the citation club, which is the most horrendous way of developing science and obtaining public funding.

It is a very interesting topic, and one that directly affects data mining because it is a new field, in which there is far more concern with the form of what we analyze and regard as knowledge than with practical application, which I think is being cut down by the journals in the area.

Academics should consider Kaggle challenges valid for research

Some time ago there was a post in this space about Kaggle, a site where companies outsource their data analysis through competitions that may or may not be paid.

This post makes a good provocation about why academics do not consider Kaggle challenges valid for research, especially the web’s famous Data Scientists.

In particular, and this is a mea culpa along with a collective criticism: it is great to see several data mining books in Portuguese, and even the popularization of teaching; however, it would be very worthwhile if the professors and other data mining academics who write so many articles and books (which only this site’s webmaster and two dozen more students bother to read) put their background to the test in this kind of contest, which would put not only theory but also practice in perspective.