Some useful tips on how to do a systematic review

This is a fairly old article that I wrote in 2013, but given recent events in my academic career I am posting it publicly to help anyone who sets out to do this task.

There are countless manuals on how to do a good systematic review, so here I will gather a set of ideas that I took directly from the authors in the references, together with what I did to keep my sanity during the process.

One of the defining facts of our time is that we live in the information age, in which the volume of data and information generated grows almost exponentially every year.

This has a huge impact on academic research, especially for researchers who want to understand what is being written amid the myriad of information generated with each passing day.

This post is dedicated especially to:

  1. people who are defining their research project for a master's or PhD
  2. people who are writing a scientific paper but would like to know what is being discussed in the literature
  3. people doing corporate research to determine practical courses of action in an R&D department

In my opinion, one of the most underrated academic techniques for dealing with this, as far as research activity is concerned, is the Systematic Review.

Personally, I cannot imagine serious scientific research starting without this tool, and by the end of this article this point will become clearer.

But what is a systematic review? Here I borrow the quote from Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997):

Systematic reviews are scientific investigations in themselves, with pre-planned methods and an assembly of original studies as their “subjects.” They synthesize the results of multiple primary investigations by using strategies that limit bias and random error (9, 10). These strategies include a comprehensive search of all potentially relevant articles and the use of explicit, reproducible criteria in the selection of articles for review. Primary research designs and study characteristics are appraised, data are synthesized, and results are interpreted.

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

My perception is that countless projects start out full of expectations and promises of novelty, but most of the time they merely reinvent the wheel on top of other people's work that never got the credit; with a more careful systematic review, those projects would either receive fewer resources or not exist at all, and the resources could be allocated elsewhere with greater potential for relevance and return.

But if I had to summarize the importance of the systematic review in a few basic points, I would consider the following:

  1. The systematic review helps to understand the past of a topic within a field of science and its evolution over time;
  2. It presents the current state of the art to today's researchers;
  3. It is an important tool for uncovering the gaps and limitations (e.g. methodological ones) in the current literature;
  4. It makes your research converse with the current literature from the day of publication;
  5. It shows ways of monitoring the literature by pointing to relevant sources; and last but not least,
  6. It prevents researchers from reinventing the wheel, allocating resources to work with a greater degree of novelty.

Here I agree with Webster and Watson (2002) that the systematic review ties concepts together over time and also prepares for the future, in the sense that the review grasps the theory, the practice, and the ontological relations of the field of study.

As Webster and Watson (2002) state, systematic or literature reviews cannot be a compilation of citations like a phone book, but rather an active exercise of analysis of the studies under review.

Systematic reviews can help practitioners keep abreast of the medical literature by summarizing large bodies of evidence and helping to explain differences among studies on the same question. A systematic review involves the application of scientific strategies, in ways that limit bias, to the assembly, critical appraisal, and synthesis of all relevant studies that address a specific clinical question. 

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

One point I want to address at some point in the future is how systematized research should be the goal of any company that wants to incorporate data into decision-making processes and into the architecture of new corporate solutions. But my argument is the same as that of Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997), whom I quote below:

Review articles are one type of integrative publication; practice guidelines, economic evaluations, and clinical decision analyses are others. These other types of integrative articles often incorporate the results of systematic reviews. For example, practice guidelines are systematically developed statements intended to assist practitioners and patients with decisions about appropriate health care for specific clinical circumstances (11). Evidence-based practice guidelines are based on systematic reviews of the literature, appropriately adapted to local circumstances and values. Economic evaluations compare both the costs and the consequences of different courses of action; the knowledge of consequences that are considered in these evaluations is often generated by systematic reviews of primary studies. Decision analyses quantify both the likelihood and the valuation of the expected outcomes associated with competing alternatives. 

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997)

A practical point about systematic reviews is that many journals, for reasons of space, limit the number of pages of a paper, and unfortunately the first section to be sacrificed is the literature review, because it does not get as much attention as the methodology, the results, or the conclusions.

Here I will gather some tips from 3 articles that I consider good references on systematic reviews. These are notes from those articles, together with some comments on what I do when I have to perform a systematic review, for instance to see the state of the art of a research topic.

The foundation of this article rests on the ideas of Cook, Mulrow, & Haynes (1997); Webster and Watson (2002); and Brereton et al. (2007). This is only a non-exhaustive list of topics, and reading the originals is essential.

If I had to choose a framework for adopting systematic reviews, it would be this one from Brereton et al. (2007):

[Figure: the systematic literature review process, from Brereton et al. (2007)]

Prospective authors and topics

A review positions the reader with respect to the progress and learning of a field and supports embarking on new projects for the development of new theoretical models (Webster and Watson, 2002); such reviews can address a) a mature topic with a vast body of knowledge, or b) an emerging topic that is developing at a faster pace.

Reviewing topics gives direction on the concepts, their evolution and where they are heading, while reviewing authors helps the work communicate with the major labs or with researchers who will contribute to the debate in that scientific field.

Writing a systematic review article

What is being searched for? What are the keywords? What is the expected contribution?

One of the most important points is to disclose the limitations of the review, such as:

  • Scope of the search (e.g. keywords used, article databases)
  • Time window of the articles
  • Summary of past research, highlighting the gaps, proposals on how to close those gaps, and implications of the theory for practice

This disclosure signals that your work is taking a snapshot of the literature at a given moment, and that it may not be perfect because of methodological flaws or even factors exogenous to your research; for example, if a database changes its article indexer, the same query with the same parameters may return different results.

Identifying the relevant literature

Here the systematic review focuses on the concept, no matter where those concepts appear.

This implies that the focus is not only on:

  • The top journals
  • A few of the most productive authors
  • A few areas of knowledge
  • The geographic scope of a given country

Choosing databases

Here the review becomes more art than science, and here comes a very personal view: I particularly like to work with more than 5 search databases. I arrived at this number through some experimentation, but it is the number that gives me a certain breadth with respect to the articles indexed in the top journals and helps me catch some good articles, and especially PhD theses, that happen to be hidden on page 8 of some obscure keyword.

Another point about a database is understanding its selectivity, and by selectivity I mean how well the search tool can bring me a sufficiently relevant number of articles with the least amount of noise relative to signal.

And, as I could not fail to mention, it is always tempting to go only where we are most familiar, such as Google Scholar and Microsoft Research; the tip here is to also look for databases from other areas of knowledge.

Structure of the review

It should focus mainly on the concepts, not on the authors.

One thing that helps a lot is categorizing the articles qualitatively, so that aspects such as gaps, type of methodology, and nature of the work can be compiled later.

Theoretical development

The point here is: based on the past, the current state of things, and the limitations and gaps present, how can this be used for the future?

Here I recommend a modest expansion of ideas, something plausible rather than a research agenda 10 years into the future.

Reasons for the propositions

  • Theoretical explanations (the why?): this will be the glue that holds the whole practice together in a systematized, reproducible, observable, transferable, and replicable way;
  • Past empirical findings: the support for what has been observed over time, the quality of the evidence presented, and how those observations were made; and
  • Practice and experience: mechanisms for validating the theory and later refinements of what is being theorized and done.

Conclusion

I think that at this point I have made my case regarding the importance of the systematic review as a tool for understanding the past and the present, and also as a method that helps to plan the future in terms of research.

I personally recommend using this tool before any academic project, whenever some kind of practice is to be adopted.

References

Cook, D. J., Mulrow, C. D., & Haynes, R. B. (1997). Systematic reviews: synthesis of best evidence for clinical decisions. Annals of Internal Medicine, 126(5), 376-380. Link: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.733.1479&rep=rep1&type=pdf

Brereton, P., Kitchenham, B. A., Budgen, D., Turner, M., & Khalil, M. (2007). Lessons from applying the systematic literature review process within the software engineering domain. Journal of Systems and Software, 80(4), 571-583. Link: https://www.sciencedirect.com/science/article/pii/S016412120600197X

Webster, J., & Watson, R. T. (2002). Analyzing the past to prepare for the future: Writing a literature review. MIS Quarterly, xiii-xxiii. Link: https://www.researchgate.net/profile/Harald_Kindermann/post/How_to_write_the_academic_review_article_in_the_field_of_management/attachment/5abe1af54cde260d15d5d477/AS%3A609838266593280%401522408181169/download/2002_Webster_Writing+a+Literature+Review.pdf


Deep Learning, Nature and Data Leakage, Reproducibility and Academic Engineering

This piece by Rajiv Shah, called "Stand up for Best Practices", which involves the well-known scientific journal Nature, shows how academic rigor failed at several layers down the line and why reproducibility matters.

The letter called "Deep learning of aftershock patterns following large earthquakes", by DeVries et al. and published in Nature, shows, according to Shah, a basic problem of data leakage, and this problem could invalidate all the experiments. Shah tried to replicate the results, found the data leakage, and after trying to communicate the error to the authors and to Nature he got some harsh responses (some are at the end of this post).

Paper abstract – Source: https://www.nature.com/articles/s41586-018-0438-y


The repository with all the analysis is here.

Of course, a letter is a small piece that briefly communicates a larger body of research, and sometimes the authors need to suppress some information for the sake of clarity or because of journal limitations. On this point I can understand the authors, and since they kindly provided the source code (here), more skeptical minds can check the ground truth.

As I said before in my 2019 mission statement: "In god we trust, others must bring the raw data with the source code of the extraction in the GitHub".

The main question here is not whether the authors made a mistake or not (they did, because they incorporated part of an earthquake into the training of the model, and this by itself can explain the AUC being bigger on the test set than on the training set), but how this academic engineering is hurting the Machine Learning field and inflating a bubble of expectations.

But first I’ll borrow the definition of Academic Engineering provided by Filip Piekniewski in his classic called Autopsy Of A Deep Learning Paper:

I read a lot of deep learning papers, typically a few/week. I’ve read probably several thousands of papers. My general problem with papers in machine learning or deep learning is that often they sit in some strange no man’s land between science and engineering, I call it “academic engineering”. Let me describe what I mean:

1) A scientific paper IMHO, should convey an idea that has the ability to explain something. For example a paper that proves a mathematical theorem, a paper that presents a model of some physical phenomenon. Alternatively a scientific paper could be experimental, where the result of an experiment tells us something fundamental about the reality. Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality.

2) An engineering paper shows a method of solving a particular problem. Problems may vary and depend on an application, sometimes they could be really uninteresting and specific but nevertheless useful for somebody somewhere. For an engineering paper, things that matter are different than for a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, could be practically implemented e.g. given available components, is cheaper or more energy efficient than other solutions and so on. The central point of an engineering paper is an application, and the rest is just a collection of ideas that allow to solve the application.

Machine learning sits somewhere in between. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

One thing that I noticed in this academic engineering phenomenon is that a lot of (well-intentioned) people are doing a lot of experiments, using nice tools and making their code available, and this is very cool. However, some of these academic engineering papers bring tons of methodological problems on the Machine Learning side.

I tackled one example of this some months ago, related to a systematic review from Christodoulou et al. called "A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models", in which the authors set out on a confirmatory study without a clear understanding of the methodology behind Machine Learning and Deep Learning papers (you can read the full post here).

In the Nature letter from DeVries et al. it is no different. Let's check, for example, HOW they ended up with the right architecture. The paper only makes the following consideration about it:

The neural networks used here are fully connected and have six hidden layers with 50 neurons each and hyperbolic tangent activation functions (13,451 weights and biases in total). The first layer corresponds to the inputs to the neural network; in this case, these inputs are the magnitudes of the six independent components of the co-seismically generated static elastic stress-change tensor calculated at the centroid of a grid cell and their negative values. 

DeVries, P. M. R., Viégas, F., Wattenberg, M., & Meade, B. J. (2018)

The code available on GitHub shows the architecture:
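Since the screenshot does not carry over here, below is a minimal Keras sketch consistent with the quoted description: 12 inputs (the six stress-change components and their negatives), six hidden layers of 50 tanh units with the lecun_uniform initializer, and a single sigmoid output, which matches the stated total of 13,451 weights and biases. The output layer, loss, and optimizer are my assumptions, not a copy of the authors' script.

```python
# Hedged reconstruction of the described network, not the authors' exact code.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
# 12 inputs: the six stress-change components and their negative values.
model.add(Dense(50, input_dim=12, activation="tanh",
                kernel_initializer="lecun_uniform"))
for _ in range(5):  # five more hidden layers of 50 tanh units each
    model.add(Dense(50, activation="tanh", kernel_initializer="lecun_uniform"))
model.add(Dense(1, activation="sigmoid", kernel_initializer="lecun_uniform"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()  # reports 13,451 trainable parameters
```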

The choice of architecture alone raises tons of questions regarding methodological rigor, such as:

  • Why 6 layers and not 10 or 20? How did they arrive at this number of layers?
  • What was the criterion for choosing 50 as the number of neurons? What was the process used to identify that number?
  • All layers use lecun_uniform as the kernel initializer. Why is this initializer the most suitable for this problem/data? Were other options tested? If yes, what were the results? And why was the seed for the lecun_uniform not set?

I raised these questions in only 8 minutes (and believe me, even junior reviewers from B-class journals would ask them), and from the bottom of my heart I would like to believe that Nature is doing the same.

After that, a question arises: if even a very well-known scientific journal is rewarding this kind of academic engineering, even with all the code available, and not even considering reviewing the letter, what could be happening right now in the many papers that do not have this kind of verification mechanism and whose research is a complete black box?

Final thoughts

There’s an eagerness to believe in almost every journal that has a huge impact and spread the word about the good results, but if you cannot explain HOW that result was made in a sense to have a methodological rigor, IMHO the result it’s meaningless.

Keep sane from hype, keep skeptic.

Below all the letters exchanged:

FIRST LETTER FROM MR. SHAH

Dear Editors:

A recent paper you published by DeVries, et al., Deep learning of aftershock patterns following large Earthquakes, contains significant methodological errors that undermine its conclusion. These errors should be highlighted, as data science is still an emerging field that hasn’t yet matured to the rigor of other fields. Additionally, not correcting the published results will stymie research in the area, as it will not be possible for others to match or improve upon the results. We have contacted the author and shared with them the problems around data leakage, learning curves, and model choice. They have not yet responded back.

​ First, the results published in the paper, AUC of 0.849, are inflated because of target leakage. The approach in the paper used part of an earthquake to train the model, which then was used again to test the model. This form of target leakage can lead to inflated results in machine learning. To prevent against this, a technique called group partitioning is used. This requires ensuring an earthquake appears either in the train portion of the data or the test portion. This is not an unusual methodological mistake, for example a recent paper by Rajpurkar et. al on chest x-rays made the same mistake, where x-rays for an individual patient could be found in both the train and test set. These authors later revised their paper to correct this mistake.

In this paper, several earthquakes, including 1985NAHANN01HART, 1996HYUGAx01YAGI, 1997COLFIO01HERN, 1997KAGOSH01HORI, 2010NORTHE01HAYE were represented in both the train and test part of the dataset. For example, in 1985 two large magnitude earthquakes occurred near the North Nahanni River in the northeast Cordillera, Northwest Territories, Canada, on 5 October (MS 6.6) and 23 December (MS 6.9). In this dataset, one of the earthquakes is in the train set and the other in the test set. To ensure the network wasn’t learning the specifics about the regions, we used group partitioning, this ensures an earthquake’s data only was in test or in train and not in both. If the model was truly learning to predict aftershocks, such a partitioning should not affect the results.

We applied group partitioning of earthquakes randomly across 10 different runs with different random seeds for the partitioning. I am happy to share/post the group partitioning along with the revised datasets. We found the following results as averaged across the 10 runs (~20% validation):

Method                            Mean AUC
Coulomb failure stress-change     0.60
Maximum change in shear stress    0.77
von Mises yield criterion         0.77
Random Forest                     0.76
Neural Network                    0.77

In terms of predictive performance, the machine learning methods are not an improvement over traditional techniques of the maximum change in shear stress or the von Mises yield criterion. To assess the value of the deep learning approach, we also compared the performance to a baseline Random Forest algorithm (basic default parameters – 100 trees) and found only a slight improvement.

It is crucial that the results in the paper will be corrected. The published results provide an inaccurate portrayal of the results of machine learning / deep learning to predict aftershocks. Moreover, other researchers will have trouble sharing or publishing results because they cannot meet these published benchmarks. It is in the interest of progress and transparency that the AUC performance in the paper will be corrected.

The second problem we noted is not using learning curves. Andrew Ng has popularized the notion of learning curves as a fundamental tool in error analysis for models. Using learning curves, one can find that training a model on just a small sample of the dataset is enough to get very good performance. In this case, when I run the neural network with a batch size of 2,000 and 8 steps for one epoch, I find that 16,000 samples are enough to get a good performance of 0.77 AUC. This suggests that there is a relatively small signal in the dataset that can be found very quickly by the neural network. This is an important insight and should be noted. While we have 6 million rows, you can get the insights from just a small portion of that data.

The third issue is jumping straight to a deep learning model without considering baselines. Most mainstream machine learning papers will use benchmark algorithms, say logistic regression or random forest when discussing new algorithms or approaches. This paper did not have that. However, we found that a simple random forest model was able to achieve similar performance to neural network. This is an important point when using deep learning approaches. In this case, really any simple model (e.g. SVM, GAM) will provide comparable results. The paper gives the misleading impression that only deep learning is capable of learning the aftershocks.

As practicing data scientists, we see these sorts of problems on a regular basis. As a field, data science is still immature and there isn’t the methodological rigor of other fields. Addressing these errors will provide the research community with a good learning example of common issues practitioners can run into when using machine learning. The only reason we can learn from this is that the authors were kind enough to share their code and data. This sort of sharing benefits everyone in the long run.

At this point, I have not publicly shared or posted any of these concerns. I have shared them with the author and she did not reply back after two weeks. I thought it would be best to privately share them with you first. Please let me know what you think. If we do not hear back from you by November 20th, we will make our results public.

Thank you

Rajiv Shah

University of Illinois at Chicago

Lukas Innig

DataRobot

NATURE COMMENTS

Referee’s Comments:

In this proposed Matters Arising contribution, Shah and Innig provide critical commentary on the paper “Deep learning aftershock patterns following large earthquakes”, authored by Devries et al. and published in Nature in 2018. While I think that Shah and Innig raise make several valid and interesting points, I do not endorse publication of the comment-and-reply in Matters Arising. I will explain my reasoning for this decision in more detail below, but the upshot of my thinking is that (1) I do not feel that the central results of the study are compromised in any way, and (2) I am not convinced that the commentary is of interest to audience of non-specialists (that is, non machine learning practicioners).

Shah and Innig’s comment (and Devries and Meade’s response) centers on three main points of contention: (1) the notion of data leakage, (2) learning curve usage, and (3) the choice of deep learning approach in lieu of a simpler machine learning method. Point (1) is related to the partitioning of earthquakes into training and testing datasets. In the ideal world, these datasets should be completely independent, such that the latter constitutes a truly fair test of the trained model’s performance on data that it has never seen before. Shah and Innig note that some of the ruptures in the training dataset are nearly collocated in space and time with ruptures in the testing dataset, and thus a subset of aftershocks are shared mutually. This certainly sets up the potential for information to transfer from the training to testing datasets (violating the desired independence described above), and it would be better if the authors had implemented grouping or pooling to safeguard against this risk. However, I find Devries and Meade’s rebuttal to the point to be compelling, and would further posit that the potential data leakage between nearby ruptures is a somewhat rare occurrence that should not modify the main results significantly.

Shah and Innig’s points (2) and (3) are both related, and while they are interesting to me, they are not salient to the central focus of the paper. It is neat (and perhaps worth noting in a supplement), that the trainable parameters in the neural network, the network biases and weights, can be adequately trained using a small batch of the full dataset. Unfortunately, this insight from the proposed learning curve scheme would likely shoot over the heads of the 95% of the general Nature audience that are unfamiliar with the mechanics of neural networks and how they are trained. Likewise, most readers wouldn’t have the foggiest notion of what a Random Forest is, nor how it differs from a deep neural network, nor why it is considered simpler and more transparent. The purpose of the paper (to my understanding) was not to provide a benchmark machine learning algorithm so that future groups could apply more advanced techniques (GANs, Variational Autoencoders, etc.) to boost AUC performance by 5%. Instead, the paper showed that a relatively simple, but purely data-driven approach could predict aftershock locations better than Coulomb stress (the metric used in most studies to date) and also identify stress-based proxies (max shear stress, von Mises stress) that have physical significance and are better predictors than the classical Coulomb stress. In this way, the deep learning algorithm was used as a tool to remove our human bias toward the Coulomb stress criterion, which has been ingrained in our psyche by more than 20 years of published literature.

To summarize: regarding point (1), I wish the Devries et al. study had controlled for potential data leakage, but do not feel that the main results of the paper are compromised by doing so. As for point (2), I think it is interesting (though not surprising) that the neural network only needs a small batch of data to be adequately trained, but this is certainly a minor point of contention, relative to the key takeaways of the paper, which Shah and Innig may have missed. Point (3) follows more or less directly from (2), and it is intuitive that a simpler and more transparent machine learning algorithm (like a Random Forest) would give comparable performance to a deep neural network. Again, it would have been nice to have noted in the manuscript that the main insights could have been derived from a different machine learning approach, but this detail is of more interest to a data science or machine learning specialist than to a general Nature audience. I think the disconnect between the Shah and Innig and Devries et al. is a matter of perspective. Shah and Innig are concerned primarily with machine learning best practices methodology, and with formulating the problem as “Kaggle”-like machine learning challenge with proper benchmarking. Devries et al. are concerned primarily with using machine learning as tool to extract insight into the natural world, and not with details of the algorithm design.

AUTHORS RESPONSE


My Personal Holy Trinity for Machine Learning Reproducibility

Short and direct:

ML Flow
Why do I use it? (a.k.a. what was my pain?)
One of the most painful situations that I faced was spending a huge amount of time coding hyperparameter searches and tracking the whole experimental setup. With MLflow, now the only thing that I need to do is invest time in pre-processing the data and choosing the algorithm to train; model serialization, data serialization, and packaging are all done by MLflow. A great advantage is that the best model can easily be deployed behind a REST API instead of using a customized Flask script.
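For readers who have not seen it, here is a minimal sketch of what this tracking looks like, assuming a scikit-learn model and illustrative parameter values (my example, not a production setup):

```python
# Minimal MLflow tracking sketch: parameters, metrics, and the serialized model
# are logged per run, and the logged model can later be served behind a REST API
# (e.g. with `mlflow models serve -m runs:/<run_id>/model`).
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

params = {"n_estimators": 100, "max_depth": 8}

with mlflow.start_run(run_name="rf-baseline"):
    model = RandomForestClassifier(**params, random_state=42).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    mlflow.log_params(params)                  # hyperparameters of this run
    mlflow.log_metric("auc", auc)              # evaluation metric of this run
    mlflow.sklearn.log_model(model, "model")   # serialized model artifact
```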


Caveats: I really love Databricks, but I think sometimes they are so fast in their development (sic.) that this can cause some problems, especially if you are relying on a very stable version and suddenly, with some migration, you lose a lot of work (e.g. RDD to DataFrame) because you have to rewrite things again.

Pachyderm
Why do I use it? (a.k.a. what was my pain?)
Data pre-processing can sometimes be very annoying, and there are a lot of new tools that overpromise to solve it but in reality are just over-engineered stuff with good marketing (see this classic talk by Daniel Molnar to understand what I'm talking about (minute 15:48)).

My main wish over the last 5 years has been to package all my dirty SQL scripts in a single place and execute them with decent version control using Kubernetes and Docker, and to throw all the ETLs made in Jenkins into the trash (a.k.a. embrace the dirty, cold, and complex reality of ETL). Nothing less, nothing more.

So, with Pachyderm I can do that.

Caveats: It’s necessary to say that you’ll need to know Docker and embrace all the problems related, and the bug list can be a little frightening.

DVC
Why do I use it? (a.k.a. what was my pain?)
MLflow can serialize data and models, but DVC takes reproducibility to another level. With fewer than 15 git-like bash commands you can easily version your data, code, and models. You can put the entire ML pipeline in a single place and roll back to any point in time. In terms of reproducibility, I think this is the best all-round tool.

Caveats: In comparison with MLflow, navigating the experiments here is a little bit tricky and takes some time to get used to.


Deep Learning and Radiology, False Dichotomy, Tools and a Paradigm Shift

From the MIT Tech Review article called "Google shows how AI might detect lung cancer faster and more reliably" we have the following information:

Early warning: Danial Tse, a researcher at Google, developed an algorithm that beat a number of trained radiologists in testing. Tse and colleagues trained a deep-learning algorithm to detect malignant lung nodules in more than 42,000 CT scans. The resulting algorithms turned up 11% fewer false positives and 5% fewer false negatives than their human counterparts. The work is described in a paper published in the journal Nature today.

That reminds me of a lot of hate, defensiveness, confirmation bias, and especially a lack of understanding of technology and its potential to help people worldwide. I will not cite most of it here, but you can check my Twitter @flavioclesio.

Some people from academic circles, especially from Statistics and Epidemiology, started bashing the automation of statistical methods (Machine Learning) in several different ways, using a lot of questionable methods to assess ML, even using one of the worst systematic reviews in history to create a false dichotomy between Stats and ML researchers.

Most of the time, that kind of criticism without consistent argumentation around the central point sounds more like pedantry, where these people tell us in a subliminal way: "Hey, look at those nerds, they do not know what they are doing. Trust us, <<Classical Methods Professors>>; we have <<Number of Papers>> in that field, and those folks are only coders who do not have all the training that we have."

This situation is so common that in April I had to join a thread with Frank Harrell to discuss that an awful/pointless systematic review should not be used to create that kind of pointless dichotomy:

My point is: Statistics, Machine Learning, Artificial Intelligence, Python, R, and so on are tools and should be treated as such.

Closing thoughts

I invite all my 5 readers to exercise the following paradigm shift: instead of thinking

"Will this AI in healthcare take doctors out of their jobs?"

let's change the question to

"Hey, you're telling me that with this easy-to-implement free software and commodity CPU power we can democratize health exams for less-favored people, together with the doctors?"


Two gentle ways to fix Peer Review

In this beautiful blog piece, Jacob Buckman gives two gentle ways to improve peer review.

On the relative certification provided by the conference peer review process:

So my first suggestion is this: change from a relative metric to a standalone evaluation. Conferences should accept or reject each paper by some fixed criteria, regardless of how many papers get submitted that year. If there end up being too many papers to physically fit in the venue, select a subset of accepted papers, at random, to invite. This mitigates one major source of randomness from the certification process: the quality of the other papers in any given submission pool.

And the most important piece is about creating a public record of rejections to disincentivize low-quality submissions:

This means that if you submit to NeurIPS and they give you an F (rejection), it’s a matter of public record. The paper won’t be released, and you can resubmit that work elsewhere, but the failure will always live on. (Ideally we’ll develop community norms around academic integrity that mandate including a section on your CV to report your failures. But if not, we can at least make it easy for potential employers to find that information.)
Why would this be beneficial? Well, it should be immediately obvious that this will directly disincentivize people from submitting half-done work. Each submission will have to be hyper-polished to the best it can possibly be before being submitted. It seems impossible that the number of papers polished to this level will be anywhere close to the number of submissions that we see at major conferences today. Those who choose to repeatedly submit poor-quality work anyways will have their CVs marred with a string of Fs, cancelling out any certification benefits they had hoped to achieve.

I personally bet €100 that if any conference adopts this mechanism, at least 98% of all these flag-planting papers will vanish forever.


The real payoff of Reproducible/Replicable science

This article from Jeffrey T. Leek and Roger D. Peng, called "Opinion: Reproducible research can still be wrong: Adopting a prevention approach", addresses an important question about reproducible/replicable research and provides good definitions of it, as we can see below:

We define reproducibility as the ability to recompute data analytic results given an observed dataset and knowledge of the data analysis pipeline. The replicability of a study is the chance that an independent experiment targeting the same scientific question will produce a consistent result (1). Concerns among scientists about both have gained significant traction recently due in part to a statistical argument that suggested most published scientific results may be false positives (2). At the same time, there have been some very public failings of reproducibility across a range of disciplines from cancer genomics (3) to economics (4), and the data for many publications have not been made publicly available, raising doubts about the quality of data analyses. Popular press articles have raised questions about the reproducibility of all scientific research (5), and the US Congress has convened hearings focused on the transparency of scientific research (6). The result is that much of the scientific enterprise has been called into question, putting funding and hard won scientific truths at risk.

So far so good. But the problem is with the following sentence:

Unfortunately, the mere reproducibility of computational results is insufficient to address the replication crisis because even a reproducible analysis can suffer from many problems—confounding from omitted variables, poor study design, missing data—that threaten the validity and useful interpretation of the results. 

If we think that enforcing replication/reproduction practices in any experiment will prevent or eliminate all methodological problems, that assumption is not only wrong but naive, for lack of a better word.

The point of replication/reproducibility is to set a higher standard in science, where we can ensure that: 1) the whole process follows a methodology that explains how a solution was transformed up to the final result; 2) with that, we have a better chance of removing biases (e.g. cognitive, publication, systematic, etc.); and 3) if the methodology is wrong, it can be verified, checked, and fixed for the entire scientific community.

When an important paper from important economists, whose voices are heard by world leaders (who can change economic policy using that kind of study), fails to be reproducible and someone can catch the methodological flaws and fix them, the point about the importance of replicable/reproducible research is already made.

This is the real payoff of reproducible/replicable science.


Tunability, Hyperparameters and a simple Initial Assessment Strategy

Most of the time we rely completely on the default parameters of a Machine Learning algorithm, and this fact can hide the possibility that we are making wrong statements about the 'efficiency' of some algorithm.

The paper called "Tunability: Importance of Hyperparameters of Machine Learning Algorithms", by Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl in the Journal of Machine Learning Research (JMLR), brings some light to this subject. This is the abstract:

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

Probst, Boulesteix, Bischl in Tunability: Importance of Hyperparameters of Machine Learning Algorithms

I recognize that the abstract does not sound so appealing, but the most important part of the text is surely one table and one graph about tunability, i.e. how tunable one parameter is given the default values of the others.

As we can observe in the columns Def.P (package defaults) and Def.O (optimal defaults), even for some vanilla algorithms there are big differences between them, especially for Part, XGBoost, and Ranger.

If we check the variance across these hyperparameters, the results indicate that the problem can be worse than we imagined:

As we can see at first sight, there is a huge variance in terms of AUC when we talk about the default parameters.

Checking these experiments, two big questions arise:

  1. How much inefficiency is introduced into the process of algorithm assessment and selection because the 'initial model' (which most of the time becomes the final model) relies on the default values? and
  2. Because of this misleading path of judging an algorithm purely on its defaults, how many ML implementations out there are underperforming and wasting research/corporate resources (e.g. people's time, computational time, money spent on cloud providers, etc.)?

Initial Assessment Strategy

A simple strategy that I use for this particular purpose is a two-phase hyperparameter search: in the first phase I run a small knockout round with all the algorithms using Random Search to grab the top 2 or 3 models, and in the second phase I use Grid Search, where most of the time I explore a large number of parameters.

Depending on the number of samples that I have in the test and validation sets, I usually let the search run for at least 24 hours on a local machine or on some cloud provider.

I do that because with this 'initial' assessment we can have a better idea of which algorithm will learn more from the data that I have, considering dimensionality, selectivity of the columns or complexity of the word embeddings in NLP tasks, data volume, and so on.
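A minimal sketch of this two-phase strategy with scikit-learn follows; the algorithms, search spaces, and dataset are illustrative assumptions, and in practice the phase-2 grid would be finer and much larger around the phase-1 winners.

```python
# Phase 1: Random Search knockout round over several candidate algorithms;
# Phase 2: Grid Search over the search spaces of the top 2 survivors.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logreg": (LogisticRegression(max_iter=5000),
               {"C": [0.01, 0.1, 1, 10, 100]}),
    "rf": (RandomForestClassifier(random_state=42),
           {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
    "gbm": (GradientBoostingClassifier(random_state=42),
            {"learning_rate": [0.01, 0.05, 0.1], "n_estimators": [100, 300]}),
}

# Phase 1: small random search per algorithm (the knockout round).
phase1 = {}
for name, (estimator, space) in candidates.items():
    search = RandomizedSearchCV(estimator, space, n_iter=5, scoring="roc_auc",
                                cv=5, random_state=42)
    phase1[name] = search.fit(X, y)

# Keep the top 2 algorithms by cross-validated AUC.
top2 = sorted(phase1.items(), key=lambda kv: kv[1].best_score_, reverse=True)[:2]

# Phase 2: exhaustive grid search on the survivors.
for name, result in top2:
    grid = GridSearchCV(candidates[name][0], candidates[name][1],
                        scoring="roc_auc", cv=5).fit(X, y)
    print(name, round(grid.best_score_, 4), grid.best_params_)
```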

Conclusion

The paper does a great job of exposing to practitioners the variance in terms of AUC when using default parameters, and it gives us a better heuristic path for knowing which parameters are the most tunable; with this information in hand we can perform better search strategies and arrive at better implementations of Machine Learning algorithms.


Some quick comments about Genevera Allen statements regarding Machine Learning

Starting note: Favio Vazquez did a great job in his article about this, with a lot of charts, showing that in the modern Machine Learning approach, with the tools that we currently have, the problems of replication and methodology are being tackled.

It's becoming quite a trend: some researcher has some criticism of Machine Learning, starts cherry-picking (fallacy of incomplete evidence) potential issues with statements like "We have a problem in Machine Learning and the results are not reproducible", "Machine Learning doesn't work", "Artificial intelligence faces a reproducibility crisis", "AI researchers allege that machine learning is alchemy", and boom: we have clickbait, rants, bashing, and a never-ending spiral of non-constructive criticism. Afterwards, this researcher gets some spotlight in the public debate about Machine Learning, goes to CNN to give some interviews, and becomes a "reference on issues in Machine Learning".

Right now it is Ms. Allen's turn, with the question/statement "Can we trust scientific discoveries made using machine learning?", where she brings good arguments to the debate, but I think she misses the point by 1) not bringing any solution/proposal and 2) making a statement that is so broad and obvious that it could be applied to any field of science.

My main intention here is just to make very short comments to show that these issues are well known by the Machine Learning community and that we have several tools and methods to tackle them.

The second intention is to demonstrate that this kind of broad, obvious argument brings more friction than light to the debate. I'll include each statement and a short response below:

“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?'” Allen said. “The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”

Comment: More data does not imply more insights, and harder than having more data is having the right combination of hyperparameters, feature engineering, and ensembling/stacking of models. And every scientific statement must be checked (this is a basic assumption of the scientific method). But maybe this is no longer true in modern research, as we are celebrating (overselling) scientific statements while researchers intentionally hide their methods and findings. It's like Hans Bethe hiding his discoveries about stellar nucleosynthesis because at some point in the future someone could potentially use them to make atomic bombs.

“A lot of these techniques are designed to always make a prediction,” she said. “They never come back with ‘I don’t know,’ or ‘I didn’t discover anything,’ because they aren’t made to.”

Comment: This is simply not true. A very quick check of Scikit-Learn, XGBoost, and Keras (3 of the most popular ML libraries) shatters this argument; see the sketch below.
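A small illustration of the point, assuming scikit-learn and an arbitrary confidence threshold of my choosing: standard classifiers expose class probabilities, so an "I don't know" zone is trivial to build on top of them.

```python
# Models report class probabilities, not only hard labels; abstaining when the
# probability is not decisive gives the "I don't know" answer. The threshold
# value is an illustrative choice.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LogisticRegression(max_iter=5000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]

threshold = 0.9
decisions = ["positive" if p >= threshold
             else "negative" if p <= 1 - threshold
             else "I don't know"
             for p in proba]
print(decisions[:10])
```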

“In precision medicine, it’s important to find groups of patients that have genomically similar profiles so you can develop drug therapies that are targeted to the specific genome for their disease,” Allen said. “People have applied machine learning to genomic data from clinical cohorts to find groups, or clusters, of patients with similar genomic profiles. “But there are cases where discoveries aren’t reproducible; the clusters discovered in one study are completely different than the clusters found in another,”

Comment: Here it’s the classic use of misleading experience with a clear use of confirmation bias because of a lack of understanding between tools with methodology . The ‘logic‘ of this argument is: A person wants to cut some vegetables to make a salad. This person uses a salad knife (the tool) but instead to use it accordingly (in the kitchen with a proper cutting board) this person cut the vegetables on the top of a stair after drink 2 bottles of vodka (the wrong method) and end up being cut; and after that this person get the conclusion that the knife is dangerous and doesn’t work.

There’s a bunch of guidelines being proposed and there’s several good resources like Machine Learning Mastery that already tackled this issue, this excellent post of Determined ML makes a good argument and this repo has tons of reproducible papers even using Deep Learning. The main point is: Any junior Machine Learning Engineer knows that hashing the dataset and fixing a seed at the beginning of the experiment can solve at least 90% of these problems.

Conclusion

There’s a lot of researches and journalists that cannot (or do not want to) understand that not only in Machine Learning but in all science there’s a huge problem of replication of the studies (this is not the case for Ms. Allen because she had a very interesting track record in ML in terms of publications). In psychology half of the studies cannot be replicated and even the medical findings in some instance are false that proves that is a very long road to minimize that kind of problem.


Practical advice about research modelling with Andrew

A post about ROC analysis becomes a small lecture about decision analysis:

It’s good for researchers to present their raw data, along with clean summary analyses. Report what your data show, and publish everything! But when it comes to decision making, including the decision of what lines of research to pursue further, I’d go Bayesian, incorporating prior information and making the sources and reasoning underlying that prior information clear, and laying out costs and benefits. Of course, that’s all a lot of work, and I don’t usually do it myself. Look at my applied papers and you’ll see tons of point estimates and uncertainty intervals, and only a few formal decision analyses. Still, I think it makes sense to think of Bayesian decision analysis as the ideal form and to interpret inferential summaries in light of these goals. Or, even more, short-term than that, if people are using statistical significance to make publication decisions, we can do our best to correct for the resulting biases, as in section 2.1 of this paper.


Progressive Neural Architecture Search

Abstract: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.

Conclusions: The main contribution of this work is to show how we can accelerate the search for good CNN structures by using progressive search through the space of increasingly complex graphs, combined with a learned prediction function to efficiently identify the most promising models to explore. The resulting models achieve the same level of performance as previous work but with a fraction of the computational cost. There are many possible directions for future work, including: the use of better surrogate predictors, such as Gaussian processes with string kernels; the use of model-based early stopping, such as [3], so we can stop the training of “unpromising” models before reaching E1 epochs; the use of “warm starting”, to initialize the training of a larger b+ 1-sized model from its smaller parent; the use of Bayesian optimization, in which we use an acquisition function, such as expected improvement or upper confidence bound, to rank the candidate models, rather than greedily picking the top K (see e.g., [31,30]); adaptively varying the number of models K evaluated at each step (e.g., reducing it over time); the automatic exploration of speed-accuracy tradeoffs (cf., [11]), etc.


Multi-objective Architecture Search for CNNs

Good ideas to perform an architecture search in CNN/DL.

 


Abstract: Architecture search aims at automatically finding neural architectures that are competitive with architectures designed by human experts. While recent approaches have come close to matching the predictive performance of manually designed architectures for image recognition, these approaches are problematic under constrained resources for two reasons: first, the architecture search itself requires vast computational resources for most proposed methods. Secondly, the found neural architectures are solely optimized for high predictive performance without penalizing excessive resource consumption. We address the first shortcoming by proposing NASH, an architecture search which considerable reduces the computational resources required for training novel architectures by applying network morphisms and aggressive learning rate schedules. On CIFAR10, NASH finds architectures with errors below 4% in only 3 days. We address the second shortcoming by proposing Pareto-NASH, a method for multi-objective architecture search that allows approximating the Pareto-front of architectures under multiple objective, such as predictive performance and number of parameters, in a single run of the method. Within 56 GPU days of architecture search, Pareto-NASH finds a model with 4M parameters and test error of 3.5%, as well as a model with less than 1M parameters and test error of 4.6%.

Conclusion: We proposed NASH, a simple and fast method for automated architecture search based on a hill climbing strategy, network morphisms, and training via SGDR. Experiments on CIFAR10 showed that our method yields competitive results while requiring considerably less computational resources for architecture search than most alternative approaches. However, in most practical application not only the predictive performance plays an important role but also resource consumption. To address this, we proposed Pareto-NASH, a multi-objective architecture search method that employs additional operators for shrinking models and extends NASH’s hill climbing strategy to an evolutionary algorithm. ParetoNASH is designed to exploit the fact that evaluating the performance of a neural network is orders of magnitude more expensive than evaluating, e.g., the model’s size. Experiments on CIFAR-10 showed that Pareto-NASH is able to find competitive models in terms of both predictive performance and resource efficiency.


Driver behavior profiling: An investigation with different smartphone sensors and machine learning


Abstract: Driver behavior impacts traffic safety, fuel/energy consumption and gas emissions. Driver behavior profiling tries to understand and positively impact driver behavior. Usually driver behavior profiling tasks involve automated collection of driving data and application of computer models to generate a classification that characterizes the driver aggressiveness profile. Different sensors and classification methods have been employed in this task, however, low-cost solutions and high performance are still research targets. This paper presents an investigation with different Android smartphone sensors, and classification algorithms in order to assess which sensor/method assembly enables classification with higher performance. The results show that specific combinations of sensors and intelligent methods allow classification performance improvement.
Results: We executed all combinations of the 4 MLAs and their configurations described on Table 1 over the 15 data sets described in Section 4.3 using 5 different nf values. We trained, tested, and assessed every evaluation assembly with 15 different random seeds. Finally, we calculated the mean AUC for these executions, grouped them by driving event type, and ranked the 5 best performing assemblies in the boxplot displayed in Fig 6. This figure shows the driving events on the left-hand side and the 5 best evaluation assemblies for each event on the right-hand side, with the best ones at the bottom. The assembly text identification in Fig 6 encodes, in this order: (i) the nf value; (ii) the sensor and its axis (if there is no axis indication, then all sensor axes are used); and (iii) the MLA and its configuration identifier.
Conclusions and future work: In this work we presented a quantitative evaluation of the performances of 4 MLAs (BN, MLP, RF, and SVM) with different configurations applied in the detection of 7 driving event types using data collected from 4 Android smartphone sensors (accelerometer, linear acceleration, magnetometer, and gyroscope). We collected 69 samples of these event types in a real-world experiment with 2 drivers. The start and end times of these events were recorded to serve as the experiment’s ground truth. We also compared the performances when applying different sliding time window sizes.
We performed 15 executions with different random seeds of 3865 evaluation assemblies of the form EA = {1:sensor, 2:sensor axis(es), 3:MLA, 4:MLA configuration, 5:number of frames in sliding window}. As a result, we found the top 5 performing assemblies for each driving event type. In the context of our experiment, these results show that (i) bigger window sizes perform better; (ii) the gyroscope and the accelerometer are the best sensors to detect our driving events; (iii) as a general rule, using all sensor axes performs better than using a single one, except for aggressive left turn events; (iv) RF is by far the best performing MLA, followed by MLP; and (v) the performance of the top 35 combinations is both satisfactory and equivalent, varying from 0.980 to 0.999 mean AUC values.
As future work, we expect to collect a greater number of driving event samples using different vehicles, Android smartphone models, road conditions, weather, and temperature. We also expect to add more MLAs to our evaluation, including those based on fuzzy logic and DTW. Finally, we intend to use the best evaluation assemblies observed in this work to develop an Android smartphone application which can detect driving events in real-time and calculate the driver behavior profile.
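The evaluation protocol (combinations of sensor, classifier and window size, ranked by mean AUC over several random seeds) can be sketched with scikit-learn. The data, sensor names and classifier settings below are hypothetical placeholders; they only illustrate the structure of the grid of "evaluation assemblies".

```python
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Hypothetical inputs: one (n_samples, n_features) matrix per sensor, built by
# flattening a sliding window of nf frames, plus binary labels for one event type.
rng = np.random.default_rng(0)
sensors = {"accelerometer": rng.normal(size=(300, 3 * 32)),
           "gyroscope":     rng.normal(size=(300, 3 * 32))}
y = rng.integers(0, 2, size=300)

classifiers = {"RF":  lambda: RandomForestClassifier(n_estimators=100),
               "MLP": lambda: MLPClassifier(max_iter=500)}

results = {}
for (sensor_name, X), (clf_name, make_clf) in product(sensors.items(), classifiers.items()):
    aucs = []
    for seed in range(15):                       # 15 random seeds, as in the paper
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed, stratify=y)
        clf = make_clf().fit(X_tr, y_tr)
        aucs.append(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]))
    results[(sensor_name, clf_name)] = np.mean(aucs)

# Rank the assemblies by mean AUC, best first.
for assembly, mean_auc in sorted(results.items(), key=lambda kv: -kv[1]):
    print(assembly, round(mean_auc, 3))
```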

Study of Engineered Features and Learning Features in Machine Learning – A Case Study in Document Classification

Abstract: Document classification is challenging due to the handling of voluminous and highly non-linear data, generated exponentially in the era of digitization. Proper representation of documents increases the efficiency and performance of classification, the ultimate goal being to retrieve information from a large corpus. Deep neural network models learn features for document classification, unlike engineered feature based approaches where features are extracted or selected from the data. In the paper we investigate the performance of different classifiers based on the features obtained using two approaches. We apply a deep autoencoder for learning features, while engineered features are extracted by exploiting semantic associations within the terms of the documents. Experimentally it has been observed that learning feature based classification always performs better than the proposed engineered feature based classifiers.

Conclusion and Future Work: In the paper we emphasize the importance of feature representation for classification. The potential of deep learning in the feature extraction process for efficient compression and representation of raw features is explored. By conducting multiple experiments we deduce that a DBN – Deep AE feature extractor combined with a DNNC outperforms most other techniques, providing a trade-off between accuracy and execution time. In this paper we have dealt with the most significant feature extraction and classification techniques for text documents where each text document belongs to a single class label. With the explosion of digital information, a large number of documents may belong to multiple class labels, the handling of which is a new challenge and a scope for future work. Word2vec models [18] in association with Recurrent Neural Networks (RNN) [4,14] have recently started gaining popularity in the feature representation domain. We would like to compare their performance with our deep learning method in the future. Similar feature extraction techniques can also be applied to image data to generate compressed features which can facilitate efficient classification. We would also like to explore such possibilities in our future work.
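A rough sketch of the "learned features" pipeline described in this paper: compress document vectors with a deep autoencoder and classify using the bottleneck representation. The layer sizes, the random TF-IDF-like input and the logistic-regression classifier are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np
from tensorflow import keras
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical input: TF-IDF (or bag-of-words) document vectors and class labels.
rng = np.random.default_rng(0)
X = rng.random((1000, 2000)).astype("float32")
y = rng.integers(0, 4, size=1000)

# Deep autoencoder: the 64-dimensional bottleneck is the learned feature vector.
inputs = keras.Input(shape=(2000,))
encoded = keras.layers.Dense(512, activation="relu")(inputs)
encoded = keras.layers.Dense(64, activation="relu")(encoded)
decoded = keras.layers.Dense(512, activation="relu")(encoded)
decoded = keras.layers.Dense(2000, activation="sigmoid")(decoded)

autoencoder = keras.Model(inputs, decoded)
encoder = keras.Model(inputs, encoded)
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(X, X, epochs=5, batch_size=64, verbose=0)   # learn to reconstruct the input

# Classify documents using the learned (compressed) features.
Z = encoder.predict(X, verbose=0)
Z_tr, Z_te, y_tr, y_te = train_test_split(Z, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(Z_tr, y_tr)
print("accuracy on learned features:", clf.score(Z_te, y_te))
```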

Machine Learning Methods to Predict Diabetes Complications

Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical practice.

Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.
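The modeling step described above (logistic regression with stepwise feature selection on a handful of clinical variables, plus some strategy for class imbalance) can be approximated as follows. The variable names mirror the abstract, but the data is synthetic, and scikit-learn's forward `SequentialFeatureSelector` together with `class_weight="balanced"` are stand-ins for the pipeline actually used in the MOSAIC project.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import cross_val_score

# Hypothetical patient table with the variables listed in the abstract.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "gender": rng.integers(0, 2, n),
    "age": rng.normal(62, 10, n),
    "time_from_diagnosis": rng.normal(6, 4, n),
    "bmi": rng.normal(29, 5, n),
    "hba1c": rng.normal(7.5, 1.2, n),
    "hypertension": rng.integers(0, 2, n),
    "smoking": rng.integers(0, 2, n),
})
y = rng.integers(0, 2, n)   # onset of a complication within the chosen time scenario

# class_weight="balanced" is one simple way to handle class imbalance;
# forward sequential selection approximates stepwise variable selection.
base = LogisticRegression(max_iter=1000, class_weight="balanced")
selector = SequentialFeatureSelector(base, n_features_to_select=4, direction="forward")
selector.fit(X, y)
selected = X.columns[selector.get_support()].tolist()

acc = cross_val_score(LogisticRegression(max_iter=1000, class_weight="balanced"),
                      X[selected], y, cv=5, scoring="accuracy").mean()
print("selected variables:", selected, "cv accuracy:", round(acc, 3))
```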

Exploration of machine learning techniques in predicting multiple sclerosis disease course

Abstract
Objective: To explore the value of machine learning methods for predicting multiple sclerosis disease course.
Methods: 1693 CLIMB study patients were classified as increased EDSS≥1.5 (worsening) or not (non-worsening) at up to five years after baseline visit. Support vector machines (SVM) were used to build the classifier, and compared to logistic regression (LR) using demographic, clinical and MRI data obtained at years one and two to predict EDSS at five years follow-up.
Results: Baseline data alone provided little predictive value. Clinical observation for one year improved overall SVM sensitivity to 62% and specificity to 65% in predicting worsening cases. The addition of one year MRI data improved sensitivity to 71% and specificity to 68%. Use of non-uniform misclassification costs in the SVM model, weighting towards increased sensitivity, improved predictions (up to 86%). Sensitivity, specificity, and overall accuracy improved minimally with additional follow-up data. Predictions improved within specific groups defined by baseline EDSS. LR performed more poorly than SVM in most cases. Race, family history of MS, and brain parenchymal fraction ranked highly as predictors of the non-worsening group. Brain T2 lesion volume ranked highly as predictive of the worsening group.
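The key trick reported here, non-uniform misclassification costs that push the SVM towards higher sensitivity for the worsening class, maps naturally onto per-class weights. The sketch below uses synthetic data and an arbitrary weight ratio; it is not the study's model, only an illustration of the idea.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Hypothetical feature matrix: demographic, clinical and MRI measures at years 1-2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1693, 20))
y = rng.integers(0, 2, size=1693)        # 1 = worsening (EDSS increase >= 1.5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# Non-uniform misclassification costs: penalize missing a "worsening" patient more,
# which pushes the classifier towards higher sensitivity for that class.
svm = SVC(kernel="rbf", class_weight={0: 1, 1: 3}).fit(X_tr, y_tr)
pred = svm.predict(X_te)

print("sensitivity (worsening):", round(recall_score(y_te, pred, pos_label=1), 3))
print("specificity (non-worsening):", round(recall_score(y_te, pred, pos_label=0), 3))
```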

When do traumatic experiences alter risk-taking behavior? A machine learning analysis of reports from refugees

Abstract: Exposure to traumatic stressors and subsequent trauma-related mental changes may alter a person’s risk-taking behavior. It is unclear whether this relationship depends on the specific types of traumatic experiences. Moreover, the association has never been tested in displaced individuals with substantial levels of traumatic experiences. The present study assessed risk-taking behavior in 56 displaced individuals by means of the balloon analogue risk task (BART). Exposure to traumatic events, symptoms of posttraumatic stress disorder and depression were assessed by means of semi-structured interviews. Using a novel statistical approach (stochastic gradient boosting machines), we analyzed predictors of risk-taking behavior. Exposure to organized violence was associated with less risk-taking, as indicated by fewer adjusted pumps in the BART, as was the reported experience of physical abuse and neglect, emotional abuse, and peer violence in childhood. However, civil traumatic stressors, as well as other events during childhood, were associated with lower risk-taking. This suggests that the association between global risk-taking behavior and exposure to traumatic stress depends on the particular type of the stressors that have been experienced.
Results: All participants had experienced a minimum of one traumatic event, and the overwhelming majority (93 percent) had been exposed to various forms and frequencies of organized violence. The mean exposure to types of torture and war events (vivo checklist) was 7.9 (SD = 6.5, median = 5); the mean exposure in the PSS-I event checklist was 3.3 (SD = 1.4, median = 3).
Childhood maltreatment measured by the KERF was generally high and had been experienced by 94% of participants, but the types presented a very heterogeneous pattern. Physical abuse was most common (85%; mean = 7.9, SD = 5.9, median = 6.6), followed by emotional abuse (65%; mean = 4.8, SD = 4.8, median = 3.3). Peer violence, emotional neglect, and physical neglect were experienced by half of the participants (54%, 52%, and 50%, respectively; mean peer violence = 3.8, SD = 3.9, median = 3.3; mean emotional neglect = 3.1, SD = 3.5, median = 3.3; mean physical neglect = 2.3, SD = 2.7, median = 1.7). The least frequent adverse experiences during childhood were witnessing an event (37%, mean = 2.5, SD = 3.7, median = 0) and sexual abuse (17%, mean = 0.4, SD = 1.4, median = 0).
Regarding PTSD diagnosis, 55% fulfilled criteria according to DSM-IV (PSS-I mean = 16.4, SD = 13.3, median = 18). The mean score in the PHQ-9 was 10.9 (SD = 7.8, median = 10), indicating a mild to intermediate severity of depression symptoms.
Risk behavior as measured by the BART had a large range between 3.1 and 77.5 adjusted pumps. The mean number of adjusted pumps was 33.0 (SD = 18.9, median = 28.2).
Conclusions: Altogether, the current study suggests that the experience of organized violence versus domestic violence differentially impacts subsequent performance on a laboratory test for risk-taking behavior such as the BART. Further research with larger sample sizes is needed in order to clarify the specific associations between types of exposure to traumatic events and risk-taking behavior.
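A stochastic gradient boosting model that predicts BART adjusted pumps from trauma-exposure features, followed by a ranking of predictors, might look like the sketch below. The feature names and synthetic data are assumptions; the study's exact boosting configuration and importance measure may differ.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

# Hypothetical predictors: counts of trauma types from the interviews,
# plus PTSD and depression scores; target = adjusted pumps in the BART.
rng = np.random.default_rng(0)
feature_names = ["organized_violence", "physical_abuse", "emotional_abuse",
                 "peer_violence", "neglect", "pss_i_score", "phq9_score"]
X = rng.poisson(3, size=(56, len(feature_names))).astype(float)
y = rng.normal(33, 19, size=56)          # adjusted pumps (mean ~33, SD ~19 in the paper)

# Stochastic gradient boosting: subsample < 1.0 makes each tree see a random
# subset of participants, which is the "stochastic" part of the method.
gbm = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05,
                                subsample=0.7, max_depth=3).fit(X, y)

# Rank predictors of risk-taking behavior.
imp = permutation_importance(gbm, X, y, n_repeats=20, random_state=0)
for name, score in sorted(zip(feature_names, imp.importances_mean), key=lambda t: -t[1]):
    print(f"{name:>20s}: {score:.3f}")
```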

Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

Abstract: Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
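In spirit, the models in this paper are standard image CNNs with one prediction head per risk factor. The sketch below wires an Inception-v3 style backbone to a single regression head (age, trained with mean absolute error); the data is random and `weights=None` keeps the example self-contained, whereas a real run would start from pretrained weights and a large fundus dataset.

```python
import numpy as np
from tensorflow import keras

# Hypothetical data: retinal fundus photographs resized to 299x299 and, as the
# target, one continuous risk factor (e.g. age); the paper trains a head per factor.
rng = np.random.default_rng(0)
images = rng.random((16, 299, 299, 3)).astype("float32")
age = rng.uniform(40, 80, size=16).astype("float32")

# Inception-v3 style backbone with a small regression head on top.
backbone = keras.applications.InceptionV3(include_top=False, weights=None,
                                          input_shape=(299, 299, 3), pooling="avg")
outputs = keras.layers.Dense(1)(backbone.output)
model = keras.Model(backbone.input, outputs)

model.compile(optimizer="adam", loss="mae")      # mean absolute error, as reported
model.fit(images, age, epochs=1, batch_size=8, verbose=0)
print("predicted ages:", model.predict(images[:4], verbose=0).ravel())
```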

Comparison of Machine Learning Approaches for Prediction of Advanced Liver Fibrosis in Chronic Hepatitis C Patients.

BACKGROUND/AIM:
Machine learning approaches have recently been used as non-invasive methods for staging chronic liver diseases, avoiding the drawbacks of biopsy. This study aims to evaluate different machine learning techniques in the prediction of advanced fibrosis by combining serum biomarkers and clinical information to develop classification models.

METHODS:
A prospective cohort of 39,567 patients with chronic hepatitis C was divided into two sets – one categorized as mild to moderate fibrosis (F0-F2), and the other categorized as advanced fibrosis (F3-F4) according to METAVIR score. Decision tree, genetic algorithm, particle swarm optimization, and multilinear regression models for advanced fibrosis risk prediction were developed. Receiver operating characteristic curve analysis was performed to evaluate the performance of the proposed models.

RESULTS:
Age, platelet count, AST, and albumin were found to be statistically significant predictors of advanced fibrosis. The machine learning algorithms under study were able to predict advanced fibrosis in patients with chronic hepatitis C with AUROC ranging between 0.73 and 0.76 and accuracy between 66.3% and 84.4%.

CONCLUSIONS:
Machine-learning approaches could be used as alternative methods in prediction of the risk of advanced liver fibrosis due to chronic hepatitis C.
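One of the compared models, a decision tree evaluated with ROC analysis on the four markers highlighted above, can be sketched as follows. The cohort here is synthetic and the tree depth is arbitrary; the point is only the shape of the pipeline (split into F0-F2 vs F3-F4, fit, report AUROC and accuracy).

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, accuracy_score

# Hypothetical cohort table with the markers highlighted in the results.
rng = np.random.default_rng(0)
n = 2000
X = pd.DataFrame({
    "age": rng.normal(45, 12, n),
    "platelet_count": rng.normal(220, 60, n),
    "ast": rng.normal(40, 20, n),
    "albumin": rng.normal(4.1, 0.5, n),
})
y = rng.integers(0, 2, n)            # 0 = F0-F2 (mild/moderate), 1 = F3-F4 (advanced)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

# A depth-limited decision tree, evaluated with ROC analysis as in the paper.
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)
proba = tree.predict_proba(X_te)[:, 1]

print("AUROC:", round(roc_auc_score(y_te, proba), 3))
print("accuracy:", round(accuracy_score(y_te, tree.predict(X_te)), 3))
```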

Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

ABSTRACT— Anomaly detection in database management systems (DBMSs) is difficult because of the increasing number of statistic (stat) and event metrics in big data systems. In this paper, I propose an automatic DBMS diagnosis system that detects anomaly periods with abnormal DB stat metrics and finds causal events in those periods. Reconstruction error from a deep autoencoder and a statistical process control (SPC) approach are applied to detect time periods with anomalies. Related events are found using time series similarity measures between events and abnormal stat metrics. After training the deep autoencoder with DBMS metric data, the efficacy of anomaly detection is investigated on other DBMSs containing anomalies. Experimental results show the effectiveness of the proposed model, especially the batch temporal normalization layer. The proposed model is used for publishing automatic DBMS diagnosis reports that support DBMS configuration and SQL tuning.

CONCLUSION AND FUTURE WORK: I proposed a machine learning model for automatic DBMS diagnosis. The proposed model detects anomaly periods from the reconstruction error of a deep autoencoder. I also verified empirically that temporal normalization is essential when the input data is a non-stationary multivariate time series. With the SPC approach, a time period is considered an anomaly period when the reconstruction error falls outside the control limit. According to the types or users of DBMSs, additional decision rules used in SPC can be added; for example, a warning line at 2 sigma can be used to decide whether a period is anomalous [12, 13]. In this paper, the anomaly detection test was performed on other DBMSs whose data were not used in training, because the performance of a basic pre-trained model is important from a service provider’s perspective. The efficacy of the detection performance was validated with a blind test and DBAs’ opinions. The results of automatic anomaly diagnosis help DB consultants save time locating anomaly periods and main wait events, so they can concentrate on devising solutions when DB disorders occur. For better anomaly detection performance, additional training can be carried out after the pre-trained model is adopted. In addition, recurrent and convolutional neural networks can be used in the reconstruction part to capture hidden representations of sequential and local relationships. If anomaly-labeled data were available, detection results could be analyzed with numerical performance measures; in practice, however, it is hard to secure a labeled anomaly dataset for each DBMS. The proposed model is therefore meaningful as an unsupervised anomaly detection model that does not need labeled data and can be generalized to other DBMSs as a pre-trained model.
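The core detection mechanism, reconstruction error from an autoencoder compared against an SPC-style control limit, fits in a few lines. The sketch below omits the paper's batch temporal normalization layer and the event-correlation step, and uses synthetic stat metrics with an injected anomaly; the 3-sigma upper control limit is one common SPC choice, not necessarily the paper's exact rule.

```python
import numpy as np
from tensorflow import keras

# Hypothetical DBMS stat metrics: one row per time step, one column per metric.
rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 40)).astype("float32")     # "healthy" periods
test = rng.normal(size=(500, 40)).astype("float32")
test[200:220] += 6.0                                       # injected anomaly period

# Small dense autoencoder; reconstruction error is the anomaly score.
inputs = keras.Input(shape=(40,))
h = keras.layers.Dense(16, activation="relu")(inputs)
outputs = keras.layers.Dense(40)(h)
ae = keras.Model(inputs, outputs)
ae.compile(optimizer="adam", loss="mse")
ae.fit(train, train, epochs=5, batch_size=64, verbose=0)

def reconstruction_error(model, x):
    return np.mean((model.predict(x, verbose=0) - x) ** 2, axis=1)

# SPC-style control limit: mean + 3 sigma of the training reconstruction error.
err_train = reconstruction_error(ae, train)
upper_control_limit = err_train.mean() + 3 * err_train.std()

err_test = reconstruction_error(ae, test)
anomaly_steps = np.where(err_test > upper_control_limit)[0]
print("anomalous time steps:", anomaly_steps)
```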

Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Abstract We develop an algorithm which exceeds the performance of board-certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than in previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of board-certified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).

Conclusion We develop a model which exceeds the cardiologist performance in detecting a wide range of heart arrhythmias from single-lead ECG records. Key to the performance of the model is a large annotated dataset and a very deep convolutional network which can map a sequence of ECG samples to a sequence of arrhythmia annotations. On the clinical side, future work should investigate extending the set of arrhythmias and other forms of heart disease which can be automatically detected with high accuracy from single or multiple lead ECG records. For example, we do not detect Ventricular Flutter or Fibrillation. We also do not detect Left or Right Ventricular Hypertrophy, Myocardial Infarction or a number of other heart diseases which do not necessarily exhibit as arrhythmias. Some of these may be difficult or even impossible to detect on a single-lead ECG but can often be seen on a multiple-lead ECG. Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.
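Architecturally, the paper's model is a deep 1-D convolutional network that maps a window of ECG samples to a sequence of rhythm labels. The toy network below has only a few convolutional blocks and random data, so it is nowhere near the 34-layer model; it only illustrates the sequence-to-sequence labeling setup, with assumed sampling rate, window length and class count.

```python
import numpy as np
from tensorflow import keras

# Hypothetical single-lead ECG: 200 Hz signal, 30 s windows (6000 samples),
# one rhythm label per coarse segment -> 20 output steps per window.
rng = np.random.default_rng(0)
num_classes, samples, out_steps = 12, 6000, 20
X = rng.normal(size=(64, samples, 1)).astype("float32")
y = rng.integers(0, num_classes, size=(64, out_steps))

inputs = keras.Input(shape=(samples, 1))
x = inputs
for filters in (32, 64, 128):                      # a few conv blocks (the paper uses 34 layers)
    x = keras.layers.Conv1D(filters, 16, padding="same", activation="relu")(x)
    x = keras.layers.MaxPooling1D(pool_size=2)(x)  # downsample in time
x = keras.layers.Conv1D(num_classes, 1)(x)         # per-time-step class logits
# Pool the remaining time axis down to the desired number of label steps.
x = keras.layers.AveragePooling1D(pool_size=x.shape[1] // out_steps)(x)
outputs = keras.layers.Softmax()(x)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, y, epochs=1, batch_size=8, verbose=0)
print("per-segment class predictions:", model.predict(X[:1], verbose=0).argmax(-1))
```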
