Machine Learning and the Swiss cheese model: active failures and latent conditions

TL;DR: Problems will always exist. A reflective, systematic posture backed by an action plan has always been, and will always be, the way to solve them.

I was writing a post about the importance of Post Mortems in machine learning and realized that this particular part was growing larger than the main point of that other post. So I decided to split it out into a dedicated post with a bit more focus and detail.

Machine Learning (ML) and Artificial Intelligence (AI) applications are advancing into increasingly critical domains such as medicine, aviation, banking, and investments, among others.

These applications make automated decisions every day and at high scale; they are not only shaping how industries operate, but also how people interact with the platforms that use these technologies.

That said, it is fundamentally important that the ML/AI engineering culture increasingly incorporates and adapts concepts such as reliability and robustness, which are taken for granted in other fields of engineering.

And one of the paths to that adaptation is understanding the causal aspects that can raise the risk of unavailability in these systems.

Before moving on, I recommend reading the post Accountability, Core Machine Learning e Machine Learning Operations, which talks a bit about ML applications in production and the importance of engineering in building these complex systems.

The idea here is to talk about active failures and latent conditions, using the Swiss Cheese Model in a simple way. The goal is to show how these two factors are linked in the chain of events behind outages and/or catastrophic events in ML systems.

But before that, let's understand why studying failures can be an alternative path toward better reliability, and also talk about the "success stories" we see on the internet every day.

Survivorship bias and learning from failure

The internet today offers a myriad of information about practically any technical field. With all the hype around ML and its growing adoption, this information materializes as tutorials, blog posts, discussion forums, MOOCs, Twitter threads, and other sources.

However, a more attentive reader may notice a pattern in part of these stories: most of the time they are cases in which something (a) went extremely well, (b) generated revenue for the company, (c) saved X% in terms of efficiency, and/or (d) became one of the greatest technical marvels ever built.

That earns claps on Medium, posts on Hacker News, articles on big technology portals, technical blog posts that become technical references, paper after paper on arXiv, conference talks, and so on.

Right up front, I want to say that I am a big enthusiast of the idea that "smart people learn from their own mistakes, and wise people learn from the mistakes of others". These resources, especially the technical blog posts and the conferences, gather an extremely high level of valuable information from people who are in the technical trenches.

This bazaar of ideas is extremely healthy for the community as a whole. Moreover, it is burying the old gatekeeping model that some conference consultancies surfed for years at the expense of misinformation, leading countless companies to waste rivers of money. It is also helping to end the harmful cult of tech personalities, since now anyone can have a voice.

However, what many of these posts, conference talks, papers, and other articles usually do not mention are the things that went (or go) very wrong during the development of those solutions; and this is essentially a problem, given that we only see the final result, not how that result was produced nor the failures/mistakes made along the way.

Doing a simple reflection exercise, it is even understandable that very few people share the mistakes they made and the lessons they learned; nowadays, especially with social media, the message gets far more amplified and distorted.

Admitting mistakes is not easy. Depending on the psychological maturity of the person who made the mistake, the error may come with a mountain of feelings such as embarrassment, inadequacy, anger, shame, denial, and so on. This can even lead to psychological issues that require follow-up by a mental health professional.

From the companies' point of view, the public relations image that may remain is one of corporate disorganization, bad engineering teams, technical leaders who do not know what they are doing, and so on. This can affect, for example, recruiting efforts.

Because of the points above, (1) a large share of these problems may be happening right now and simply being suppressed, and (2) there may be a strong survivorship bias in these posts/talks/papers.

There is nothing wrong with the way companies present their stories; however, a bit of skepticism and pragmatism is always healthy, because for every success case there will always be countless teams that failed miserably, companies that went bankrupt, people who were fired, and so on.

But after all, what does this have to do with the failures that happen, and why understand their contributing factors?

The answer is: because first your team/solution has to be able to survive catastrophic situations for the success case to exist. And having survival as a motivator for increasing the reliability of teams/systems turns the understanding of failures into an attractive form of learning.

And when there are scenarios of small violations, suppressed errors, missing procedures, lack of skill, recklessness, or negligence, things go spectacularly wrong, as in the examples below:

Of course, these poorly written lines will not be an ode to catastrophe or disaster porn.

However, I want to offer another point of view: there is always a lesson to be learned from what goes wrong, and companies/teams that keep an introspective attitude toward the problems that happen, or that analyze the factors that may contribute to an incident, reinforce not only a healthy learning culture but also an engineering culture more oriented toward reliability.

Moving to the practical point, I will comment a bit on a risk management tool (a mental model) called the Swiss Cheese Model, which helps in understanding the causal factors that contribute to disasters in complex systems.

The Swiss Cheese Model

If I had to give an example of an industry where reliability can be considered the benchmark, it would certainly be the aviation industry [N2].

For every catastrophic event that occurs, there is a thorough investigation to understand what happened and, afterwards, to address the contributing and determining factors so that a new catastrophic event never happens again.

In this way, aviation ensures that, by applying what was learned from the catastrophic event, the whole system becomes more reliable. It is no accident that, even with the growing number of flights (39 million flights last year, 2019), the number of fatalities keeps falling year after year.

One of the tools most used in aircraft accident investigation to analyze risks and causal aspects is the Swiss Cheese Model.

This model was created by James Reason in the article "The contribution of latent human failures to the breakdown of complex systems", where the framework was laid out (although without a direct reference to the term). It was only in the paper "Human error: models and management" that the model appeared more explicitly.

The author justifies the model, considering the scenario of a complex and dynamic system, as follows:

Defences, barriers, and safeguards occupy a key position in the system approach. High technology systems have many defensive layers: some are engineered (alarms, physical barriers, automatic shutdowns, etc.), others rely on people (surgeons, anaesthetists, pilots, control room operators, etc.), and yet others depend on procedures and administrative controls. Their function is to protect potential victims and assets from local hazards. Mostly they do this very effectively, but there are always weaknesses.

In an ideal world each defensive layer would be intact. In reality, however, they are more like slices of Swiss cheese, having many holes – though unlike in the cheese, these holes are continually opening, shutting, and shifting their location. The presence of holes in any one "slice" does not normally cause a bad outcome. Usually, this can happen only when the holes in many layers momentarily line up to permit a trajectory of accident opportunity – bringing hazards into damaging contact with victims.

Human error: models and management

One way to visualize this alignment can be seen in the figure below:

In other words, in this case each slice of the Swiss cheese would be a line of defense, with engineered layers (e.g., monitoring, alarms, locks on pushing code to production, etc.) and/or procedural layers that involve people (e.g., cultural aspects, training and qualification of committers in the repository, rollback mechanisms, unit and integration tests, etc.).

Still following the author, each hole in one of the cheese slices arises from two factors: active failures and latent conditions, where:

  • Latent conditions are situations intrinsically resident within the system; they are consequences of decisions made in design, in engineering, by whoever wrote the standards or procedures, and even at the highest hierarchical levels of an organization. These latent conditions can lead to two types of adverse effects: error-provoking situations and the creation of vulnerabilities. In other words, the solution has a design that raises the probability of high-negative-impact events, which can amount to a causal or contributing factor.
  • Active failures are unsafe acts or small transgressions committed by the people in direct contact with the system; these acts can be slips, lapses, distortions, omissions, mistakes, and procedural violations.

If latent conditions are tied to engineering and product aspects, active failures are much more related to human factors. A great framework for analyzing human factors is the Human Factors Analysis and Classification System (HFACS).

HFACS states that human failures in complex socio-technical systems happen at four different levels, as can be seen in the image below:

The idea of this post is not to discuss these concepts, but rather to draw a parallel with machine learning in which some of these aspects are addressed. For those who want to know more, I recommend the HFACS material for an in-depth look at the framework.

Now that we have a fairly clear notion of what active failures and latent conditions are, let's do a reflection exercise using some ML examples.

Managing active failures and latent conditions in Machine Learning

To transpose these factors to the ML arena in a more concrete way, I will use some examples of things I have seen happen, things that have happened to me, and a few points from the excellent article by Sculley, David, et al. called "Hidden technical debt in machine learning systems", purely for didactic purposes.

In general terms, these (non-exhaustive) sets of factors would be represented as follows:

Latent Conditions

  • Culture of improvised technical workarounds: in some situations, improvised workarounds are absolutely necessary. However, a culture oriented toward workarounds [N3] in a field with intrinsic complexities such as ML tends to introduce potential fragilities into ML systems and to make the process of identifying and fixing errors much slower.
  • Absence of monitoring and alerting: ML platforms have aspects that need specific monitoring, such as data drift (i.e., a change in the distribution of the data used as input for training), model drift (i.e., degradation of the model with respect to the data being predicted), and adversarial monitoring, which checks whether the model is being probed for information extraction or subjected to adversarial attacks (see the sketch after this list).
  • Resumé-Driven Development (RDD): when engineers or teams put a tool into production only to be able to claim on their CVs that they have worked with it, potentially courting a future employer. The main characteristic of RDD is creating an unnecessary difficulty in order to sell a facility that would not be needed had the right thing been done in the first place.
  • Democracy-style decisions with less-informed people instead of consensus among specialists and risk takers: the point here is simple. Key decisions can only be made by (a) those directly involved in building and operating the systems, (b) those funding and/or taking the risk, and (c) those with the technical skills to know the pros and cons of each aspect of the decision. The reason is that these people have their own skin in the game or at least know the strengths and weaknesses of what is being discussed. Fabio Akita has already made a very interesting argument along these lines, showing how bad it can be when uninformed people without skin in the game are making decisions. Democracy does not exist in professions of practice. This collectivist corporate neo-democracy has no face and therefore no accountability when something goes wrong. Democracy on technical matters, in the terms laid out above, is a latent condition. Something wrong will never become right just because a majority decided it.
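To make the monitoring point more concrete, below is a minimal sketch of a data drift check using a two-sample Kolmogorov–Smirnov test (scipy assumed). The feature, the shift, and the 0.05 threshold are illustrative; a real platform would run a check like this per feature on a schedule and wire the result into its alerting.

```python
# Minimal data-drift check sketch; feature values and threshold are illustrative.
import numpy as np
from scipy import stats

def has_drift(train_values: np.ndarray, live_values: np.ndarray, alpha: float = 0.05) -> bool:
    """Return True if the serving distribution differs significantly from training."""
    _, p_value = stats.ks_2samp(train_values, live_values)
    return p_value < alpha

rng = np.random.default_rng(42)
train_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)  # distribution seen in training
live_feature = rng.normal(loc=0.4, scale=1.0, size=1_000)    # shifted serving data

if has_drift(train_feature, live_feature):
    print("ALERT: possible data drift on this feature")       # hook into the alerting system
```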

Active Failures

  • Unreviewed code going to production: unlike good traditional software engineering, where a code review layer ensures that everything meets quality standards, in ML this is a topic that still has a lot of maturing to do, given that a large share of Data Scientists do not have a background in programming and source code versioning. Another complicating factor is that many of the tools in the data scientist workflow make code review nearly impossible (e.g., Knitr for R and Jupyter Notebooks for Python).
  • Glue code: in this category I place the code we write during prototyping and MVPs that goes to production exactly as it was created. Something I have seen happen a lot here was applications depending on countless packages and needing a great deal of glue code just to achieve a minimal "integration". The code became so fragile that a change in a dependency (e.g., a simple upstream update) would break practically the entire API in production (see the sketch below).
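As a small illustration of how to blunt the glue code fragility described above, here is a sketch of a fail-fast check that aborts startup when an installed dependency no longer matches its pinned version; the package names and versions are assumptions for the example.

```python
# Fail fast on dependency drift instead of breaking the serving API at request time.
# Package names and versions below are illustrative; keep them in sync with requirements.txt.
import importlib.metadata
import sys

EXPECTED_PINS = {"numpy": "1.24.4", "scipy": "1.10.1"}

def check_pins() -> None:
    for package, wanted in EXPECTED_PINS.items():
        installed = importlib.metadata.version(package)
        if installed != wanted:
            sys.exit(f"{package}=={installed} found, but {wanted} expected; aborting startup")

if __name__ == "__main__":
    check_pins()
    print("dependency pins OK")
```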

An outage scenario in an ML system

Let's imagine that a fictional financial company called "Leyman Brothers" had an outage in which its stock trading platform was unavailable for 6 hours, causing massive losses to some investors.

After writing a proper Post-Mortem, the team arrived at the following narrative about the determining and contributing factors of the outage:

The outage was caused by an out-of-memory error due to a bug in the ML library.

This error is known by the library's developers and there has been an open ticket about it since 2017, but so far it has not been fixed (Latent Condition).

Another finding was that the response and resolution time was far too long because there were no alerting, heartbeat, or monitoring mechanisms in the ML platform. Without diagnostic information, the problem took longer than necessary to fix (Latent Condition).

During debugging, it was found that the developer responsible for the piece of code where the error originated knew about the possible fixes, but did not apply them because doing so would require adopting another library in a programming language he has not mastered, even though that language is already used in other parts of the technology stack (Active Failure).

Finally, it was also found that the code went straight to production without any review. The project on GitHub has no "lock" to prevent unreviewed code from reaching production (Active Failure caused by a Latent Condition).

Transposing the narrative of the event to the Swiss Cheese Model, we would visually have the following image:

In our Swiss cheese, each slice is a layer or line of defense covering aspects such as the systems' architecture and engineering, the technology stack, the specific development procedures, the company's engineering culture and, finally, people as the last safeguard.

The holes, in turn, are the failing elements in each of these layers of defense, which can be active failures (e.g., committing directly to master because there is no code review) or latent conditions (e.g., the ML library bug, the lack of monitoring and alerting).

In an ideal situation, after an outage, all latent conditions and active failures would be addressed and there would be an action plan to fix the problems so that the same event never happens again.

Despite the high-level narrative, the main point is that outages in complex, dynamic systems never happen because of an isolated factor, but rather because of the conjunction and synchronization of latent conditions and active failures.

FINAL THOUGHTS

Of course, there is no panacea for risk management: some risks and problems can be tolerated, and very often there is not enough time or resources to apply the proper fixes.

However, when we talk about mission-critical systems that use ML, it becomes clear that there is a myriad of specific problems that can happen on top of the natural engineering problems.

The Swiss Cheese Model is a risk management model widely used in aviation, and it offers a simple way to lay out the latent conditions and active failures in events that can lead to catastrophic failures.

Understanding the contributing and determining factors in failure events can help eliminate or minimize potential risks and, consequently, reduce the impact of the chain of consequences of these events.

NOTES

[N1] – The sole purpose of this post is to speak to Machine Learning Engineering, Data Science, Data Product Management teams and other areas that truly have a culture of continuous improvement and feedback. If you and/or your company believe that concepts such as quality, robustness, reliability, and learning are important, this post is dedicated especially to you.

[N2] While this article was being reviewed, a story came out about the new Boeing 787: because the core system cannot flush stale data from some of the aircraft's critical systems, which affects airworthiness, every aircraft of this model must be powered down every 51 days. That's right: a Boeing needs the same kind of "have you tried turning it off and on again?" reboot so that a catastrophic event does not occur. But this shows that, even with a latent condition, it is possible to operate a complex system safely.

[N3] Workaround culture + eXtreme Go Horse (XGH) + Jenga-Oriented Architecture = outage factory

[N4] – Special thanks to Captain Ronald Van Der Put of the Teaching for Free channel for kindly sharing some materials on safety and accident prevention.

REFERENCES

Reason, James. “The contribution of latent human failures to the breakdown of complex systems.” Philosophical Transactions of the Royal Society of London. B, Biological Sciences 327.1241 (1990): 475-484.

Reason, J. “Human error: models and management.” BMJ (Clinical research ed.) vol. 320,7237 (2000): 768-70. doi:10.1136/bmj.320.7237.768

Morgenthaler, J. David, et al. “Searching for build debt: Experiences managing technical debt at Google.” 2012 Third International Workshop on Managing Technical Debt (MTD). IEEE, 2012.

Alahdab, Mohannad, and Gül Çalıklı. “Empirical Analysis of Hidden Technical Debt Patterns in Machine Learning Software.” International Conference on Product-Focused Software Process Improvement. Springer, Cham, 2019.

Perneger, Thomas V. “The Swiss cheese model of safety incidents: are there holes in the metaphor?.” BMC health services research vol. 5 71. 9 Nov. 2005, doi:10.1186/1472-6963-5-71

“Hot cheese: a processed Swiss cheese model.” JR Coll Physicians Edinb 44 (2014): 116-21.

Breck, Eric, et al. “What’s your ML Test Score? A rubric for ML production systems.” (2016).

SEC Charges Knight Capital With Violations of Market Access Rule

Blog da Qualidade – Modelo Queijo Suíço para analisar riscos e falhas.

Machine Learning Goes Production! Engineering, Maintenance Cost, Technical Debt, Applied Data Analysis Lab Seminar

Nassim Taleb – Lectures on Fat Tails, (Anti)Fragility, Precaution, and Asymmetric Exposures

Skybrary – Human Factors Analysis and Classification System (HFACS)

CEFA Aviation – Swiss Cheese Model

A List of Post-mortems

Richard Cook – How Complex Systems Fail

Airbus – Hull Losses

Number of flights performed by the global airline industry from 2004 to 2020


Facebook FastText – Automatic Hyperparameter optimization with Autotune

Disclaimer: some of the information in this blog post might be incorrect, and since FastText moves very fast to correct and adjust things, some parts of this post may become out of date very soon. If you have any corrections or feedback, feel free to comment.

I'm finishing some experiments with Autotune, the new FastText feature for hyperparameter optimization at training time.

What is Autotune?

From the press release, the description of Autotune is:

[…]This feature automatically determines the best hyperparameters for your data set in order to build an efficient text classifier[…].

[…]FastText then uses the allotted time to search for the hyperparameters that give the best performance on the validation set.[…].

[…]Our strategy to explore various hyperparameters is inspired by existing tools, such as Nevergrad, but tailored to fastText by leveraging the specific structure of models. Our autotune explores hyperparameters by sampling, initially in a large domain that shrinks around the best combinations found over time[…]

Autotune Strategy

Checking the code, we can see that the Autotune search strategy works as follows:

For every parameter, the Autotuner has an updater (the updateArgGauss() method) that draws a random coefficient (coeff) from a Gaussian distribution bounded by a standard deviation range (the startSigma and endSigma parameters), and based on this value the parameter receives an update.

Each parameter has a specific range for startSigma and endSigma, which is fixed in the updateArgGauss method.

The update of each parameter can be linear (i.e., updateCoeff + val) or power-based (i.e., pow(2.0, coeff); updateCoeff * val), depending on the first Gaussian random number drawn within the standard deviation.

After each validation run (each using a different combination of parameters), a score (f1-score only) is stored, and the best combination of parameters found is then used to train the full model.
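To make the idea concrete, here is a simplified Python sketch of this kind of shrinking Gaussian search. It only illustrates the strategy described above; the real implementation lives in fastText's C++ autotune code, and the sigma values and ranges here are made up.

```python
# Simplified sketch of a Gaussian hyperparameter update with a shrinking sigma;
# not the actual fastText implementation.
import random

def update_gauss(value: float, start_sigma: float, end_sigma: float,
                 t: float, linear: bool) -> float:
    """Perturb a hyperparameter with Gaussian noise whose spread shrinks over time.

    t in [0, 1] is the fraction of the time budget already spent, so sigma
    interpolates from start_sigma down to end_sigma as the search narrows.
    """
    sigma = start_sigma + (end_sigma - start_sigma) * t
    coeff = random.gauss(0.0, sigma)
    return value + coeff if linear else value * (2.0 ** coeff)

# Illustrative use: jitter the learning rate (power update) and the epochs (linear update).
lr = update_gauss(0.1, start_sigma=1.5, end_sigma=0.3, t=0.5, linear=False)
epochs = max(1, round(update_gauss(25, start_sigma=10, end_sigma=2, t=0.5, linear=True)))
print(lr, epochs)
```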

Argument Ranges

  • epoch: 1 to 100
  • learning rate: 0.01 to 5.00
  • dimensions: 1 to 1000
  • wordNgrams: 1 to 5
  • loss: only softmax
  • bucket size: 10,000 to 10,000,000
  • minn (min length of char ngram): 1 to 3
  • maxn (max length of char ngram): 1 to minn + 3
  • dsub (size of each sub-vector): 1 to 4

Clarification posted in the issues of the FastText project.

In terms of metrics for optimization, only the f1score and labelf1score metrics are available.

Advantages

  • In some domains where the FastText models are not so critical in terms of accuracy/recall/precision, the time-boxed optimization can be very useful.
  • Extreme simplicity of implementation: it is just a matter of passing a few more arguments to train_supervised() (see the usage sketch after this list).
  • The source code is transparent, so we can check some of the behaviors.
  • The search strategy is simple and has boundaries that cut off extreme training parameters (e.g., Learning Rate=10.0, Epoch=10000, WordNGrams=70, etc.).
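For reference, a minimal usage sketch with the Python bindings; the file paths and the 600-second budget are placeholders.

```python
# Minimal autotune usage sketch (fastText Python bindings); paths and budget are placeholders.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",                   # labeled training file (__label__ prefixes)
    autotuneValidationFile="valid.txt",  # validation file used to score each trial
    autotuneDuration=600,                # search budget, in seconds
)
model.save_model("autotuned_model.bin")
```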

Disadvantages

  • FastText still doesn't provide any log about convergence. A log for each tested model could be nice.
  • The search strategy could be clarified a bit more in terms of boundaries, parameter initialization, and so on.
  • The boundary parameters `startSigma` and `endSigma` drive a Gaussian distribution, and I think this could be explained in the docs.
  • The same goes for the hardcoded values that define the boundaries of each parameter. Something like: "Based on some empirical tests we got these values. However, you can test a certain number of combinations and open a PR if you find some good intervals."
  • Autotune may go through several combinations with not-so-good parameters before starting a good optimization sequence (i.e., in a search budget of 100 combinations, the first 70 may not be very useful). The main idea of Autotune is to be "automatic", but it could be useful to have some option/configuration for a broader or more focused search.

The Jupyter Notebook can be found in my Github.


Deep Learning, Nature and Data Leakage, Reproducibility and Academic Engineering

This piece by Rajiv Shah called "Stand up for Best Practices", which involves the well-known scientific journal Nature, shows how academic rigor failed several layers down and why reproducibility matters.

The letter called "Deep learning of aftershock patterns following large earthquakes" by DeVries et al., published in Nature, shows, according to Shah, a basic Data Leakage problem, and this problem could invalidate all the experiments. Shah tried to replicate the results, found the Data Leakage scenario and, after trying to communicate the error to the authors and to Nature, got some harsh responses (some are at the end of this post).

Paper abstract – Source: https://www.nature.com/articles/s41586-018-0438-y


The repository with all the analysis is here.

Of course, a letter is a small piece that briefly communicates a larger body of research, and sometimes the authors need to suppress some information for the sake of clarity or because of journal limitations. On this point I can understand the authors, and since they kindly provided the source code (here), more skeptical minds can check the ground truth.

As I said before in my 2019 mission statement: "In god we trust, others must bring the raw data with the source code of the extraction in the GitHub".

The main question here is not whether the authors made a mistake (they did, because they incorporated part of an earthquake into the training data, and this alone can explain the AUC being higher on the test set than on the training set), but how this academic engineering is killing the Machine Learning field and inflating a bubble of expectations.
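As an aside, the group partitioning that Shah advocates is straightforward to express with scikit-learn. A minimal sketch on synthetic data, where quake_id stands for an identifier of the originating earthquake:

```python
# Group-aware split sketch: all rows from the same earthquake land on one side only.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.random.rand(1000, 12)                     # feature matrix (synthetic)
y = np.random.randint(0, 2, size=1000)           # aftershock in grid cell: yes/no
quake_id = np.random.randint(0, 50, size=1000)   # one id per originating earthquake

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=quake_id))

# No earthquake appears on both sides of the split.
assert set(quake_id[train_idx]).isdisjoint(quake_id[test_idx])
```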

But first I’ll borrow the definition of Academic Engineering provided by Filip Piekniewski in his classic called Autopsy Of A Deep Learning Paper:

I read a lot of deep learning papers, typically a few/week. I’ve read probably several thousands of papers. My general problem with papers in machine learning or deep learning is that often they sit in some strange no man’s land between science and engineering, I call it “academic engineering”. Let me describe what I mean:

1) A scientific paper IMHO, should convey an idea that has the ability to explain something. For example a paper that proves a mathematical theorem, a paper that presents a model of some physical phenomenon. Alternatively a scientific paper could be experimental, where the result of an experiment tells us something fundamental about the reality. Nevertheless the central point of a scientific paper is a relatively concisely expressible idea of some nontrivial universality (and predictive power) or some nontrivial observation about the nature of reality.

2) An engineering paper shows a method of solving a particular problem. Problems may vary and depend on an application, sometimes they could be really uninteresting and specific but nevertheless useful for somebody somewhere. For an engineering paper, things that matter are different than for a scientific paper: the universality of the solution may not be of paramount importance. What matters is that the solution works, could be practically implemented e.g. given available components, is cheaper or more energy efficient than other solutions and so on. The central point of an engineering paper is an application, and the rest is just a collection of ideas that allow to solve the application.

Machine learning sits somewhere in between. There are examples of clear scientific papers (such as e.g. the paper that introduced the backprop itself) and there are examples of clearly engineering papers where a solution to a very particular practical problem is described. But the majority of them appear to be engineering, only they engineer for a synthetic measure on a more or less academic dataset. In order to show superiority some ad-hoc trick is being pulled out of nowhere (typically of extremely limited universality) and after some statistically non significant testing a victory is announced.

One thing I noticed in this Academic Engineering phenomenon is that a lot of (well-intentioned) people are doing a lot of experiments, using nice tools, and making their code available, and this is very cool. However, some of these Academic Engineering papers bring tons of methodological problems on the Machine Learning side.

I tackled one example of this a few months ago, related to a systematic review by Christodoulou et al. called "A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models", in which the authors set out on a confirmatory study without a clear understanding of the methodology behind Machine Learning and Deep Learning papers (you can read the full post here).

In Nature's letter from DeVries et al. it is no different. Let's check, for example, HOW they ended up with the right architecture. The paper makes only the following consideration about it:

The neural networks used here are fully connected and have six hidden layers with 50 neurons each and hyperbolic tangent activation functions (13,451 weights and biases in total). The first layer corresponds to the inputs to the neural network; in this case, these inputs are the magnitudes of the six independent components of the co-seismically generated static elastic stress-change tensor calculated at the centroid of a grid cell and their negative values. 

DeVries, P. M. R., Viégas, F., Wattenberg, M., & Meade, B. J. (2018)

The code available on GitHub shows the architecture:
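The original snippet is not reproduced here, so below is a hedged reconstruction in Keras of the architecture as described in the letter (12 inputs for the six stress components and their negatives, six hidden layers of 50 tanh units with lecun_uniform initializers, one sigmoid output). It is an illustration, not the authors' script.

```python
# Reconstruction for illustration only, based on the description quoted above.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential()
model.add(layers.Dense(50, activation="tanh", kernel_initializer="lecun_uniform",
                       input_shape=(12,)))   # six stress-change components and their negatives
for _ in range(5):
    model.add(layers.Dense(50, activation="tanh", kernel_initializer="lecun_uniform"))
model.add(layers.Dense(1, activation="sigmoid", kernel_initializer="lecun_uniform"))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[keras.metrics.AUC()])
model.summary()   # 13,451 trainable parameters, matching the count quoted in the paper
```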

The choice of architecture alone raises plenty of questions about methodological rigor, such as:

  • Why 6 layers and not 10 or 20? How did they arrive at this number of layers?
  • What was the criterion for choosing 50 as the number of neurons? What was the process to identify that number?
  • All layers use lecun_uniform as the kernel initializer. Why is this initializer the most suitable for this problem/data? Were other options tested? If so, what were the results? And why was the seed for lecun_uniform not set?

I raised these questions in only 8 minutes (and believe me, even junior reviewers of B-class journals would ask them), and from the bottom of my heart I would like to believe that Nature does the same.

After that, a question arises: if even a very well-known scientific journal rewards this kind of academic engineering – even with all the code available – and does not even consider reviewing the letter, what might be happening right now in the many papers that do not have this kind of verification mechanism, where the research itself is a complete black box?

Final thoughts

There's an eagerness to believe almost every high-impact journal and to spread the word about the good results, but if you cannot explain HOW a result was produced, with methodological rigor, then IMHO the result is meaningless.

Stay away from the hype, stay skeptical.

Below are all the letters exchanged:

FIRST LETTER FROM MR. SHAH

Dear Editors:

A recent paper you published by DeVries, et al., Deep learning of aftershock patterns following large Earthquakes, contains significant methodological errors that undermine its conclusion. These errors should be highlighted, as data science is still an emerging field that hasn’t yet matured to the rigor of other fields. Additionally, not correcting the published results will stymie research in the area, as it will not be possible for others to match or improve upon the results. We have contacted the author and shared with them the problems around data leakage, learning curves, and model choice. They have not yet responded back.

​ First, the results published in the paper, AUC of 0.849, are inflated because of target leakage. The approach in the paper used part of an earthquake to train the model, which then was used again to test the model. This form of target leakage can lead to inflated results in machine learning. To prevent against this, a technique called group partitioning is used. This requires ensuring an earthquake appears either in the train portion of the data or the test portion. This is not an unusual methodological mistake, for example a recent paper by Rajpurkar et. al on chest x-rays made the same mistake, where x-rays for an individual patient could be found in both the train and test set. These authors later revised their paper to correct this mistake.

In this paper, several earthquakes, including 1985NAHANN01HART, 1996HYUGAx01YAGI, 1997COLFIO01HERN, 1997KAGOSH01HORI, 2010NORTHE01HAYE were represented in both the train and test part of the dataset. For example, in 1985 two large magnitude earthquakes occurred near the North Nahanni River in the northeast Cordillera, Northwest Territories, Canada, on 5 October (MS 6.6) and 23 December (MS 6.9). In this dataset, one of the earthquakes is in the train set and the other in the test set. To ensure the network wasn’t learning the specifics about the regions, we used group partitioning, this ensures an earthquake’s data only was in test or in train and not in both. If the model was truly learning to predict aftershocks, such a partitioning should not affect the results.

We applied group partitioning of earthquakes randomly across 10 different runs with different random seeds for the partitioning. I am happy to share/post the group partitioning along with the revised datasets. We found the following results as averaged across the 10 runs (~20% validation):

Mean AUC by method:

  • Coulomb failure stress-change: 0.60
  • Maximum change in shear stress: 0.77
  • von Mises yield criterion: 0.77
  • Random Forest: 0.76
  • Neural Network: 0.77

In terms of predictive performance, the machine learning methods are not an improvement over traditional techniques of the maximum change in shear stress or the von Mises yield criterion. To assess the value of the deep learning approach, we also compared the performance to a baseline Random Forest algorithm (basic default parameters – 100 trees) and found only a slight improvement.

It is crucial that the results in the paper will be corrected. The published results provide an inaccurate portrayal of the results of machine learning / deep learning to predict aftershocks. Moreover, other researchers will have trouble sharing or publishing results because they cannot meet these published benchmarks. It is in the interest of progress and transparency that the AUC performance in the paper will be corrected.

The second problem we noted is not using learning curves. Andrew Ng has popularized the notion of learning curves as a fundamental tool in error analysis for models. Using learning curves, one can find that training a model on just a small sample of the dataset is enough to get very good performance. In this case, when I run the neural network with a batch size of 2,000 and 8 steps for one epoch, I find that 16,000 samples are enough to get a good performance of 0.77 AUC. This suggests that there is a relatively small signal in the dataset that can be found very quickly by the neural network. This is an important insight and should be noted. While we have 6 million rows, you can get the insights from just a small portion of that data.

The third issue is jumping straight to a deep learning model without considering baselines. Most mainstream machine learning papers will use benchmark algorithms, say logistic regression or random forest when discussing new algorithms or approaches. This paper did not have that. However, we found that a simple random forest model was able to achieve similar performance to neural network. This is an important point when using deep learning approaches. In this case, really any simple model (e.g. SVM, GAM) will provide comparable results. The paper gives the misleading impression that only deep learning is capable of learning the aftershocks.

As practicing data scientists, we see these sorts of problems on a regular basis. As a field, data science is still immature and there isn’t the methodological rigor of other fields. Addressing these errors will provide the research community with a good learning example of common issues practitioners can run into when using machine learning. The only reason we can learn from this is that the authors were kind enough to share their code and data. This sort of sharing benefits everyone in the long run.

At this point, I have not publicly shared or posted any of these concerns. I have shared them with the author and she did not reply back after two weeks. I thought it would be best to privately share them with you first. Please let me know what you think. If we do not hear back from you by November 20th, we will make our results public.

Thank you

Rajiv Shah

University of Illinois at Chicago

Lukas Innig

DataRobot

NATURE COMMENTS

Referee’s Comments:

In this proposed Matters Arising contribution, Shah and Innig provide critical commentary on the paper “Deep learning aftershock patterns following large earthquakes”, authored by Devries et al. and published in Nature in 2018. While I think that Shah and Innig raise make several valid and interesting points, I do not endorse publication of the comment-and-reply in Matters Arising. I will explain my reasoning for this decision in more detail below, but the upshot of my thinking is that (1) I do not feel that the central results of the study are compromised in any way, and (2) I am not convinced that the commentary is of interest to audience of non-specialists (that is, non machine learning practicioners).

Shah and Innig’s comment (and Devries and Meade’s response) centers on three main points of contention: (1) the notion of data leakage, (2) learning curve usage, and (3) the choice of deep learning approach in lieu of a simpler machine learning method. Point (1) is related to the partitioning of earthquakes into training and testing datasets. In the ideal world, these datasets should be completely independent, such that the latter constitutes a truly fair test of the trained model’s performance on data that it has never seen before. Shah and Innig note that some of the ruptures in the training dataset are nearly collocated in space and time with ruptures in the testing dataset, and thus a subset of aftershocks are shared mutually. This certainly sets up the potential for information to transfer from the training to testing datasets (violating the desired independence described above), and it would be better if the authors had implemented grouping or pooling to safeguard against this risk. However, I find Devries and Meade’s rebuttal to the point to be compelling, and would further posit that the potential data leakage between nearby ruptures is a somewhat rare occurrence that should not modify the main results significantly.

Shah and Innig’s points (2) and (3) are both related, and while they are interesting to me, they are not salient to the central focus of the paper. It is neat (and perhaps worth noting in a supplement), that the trainable parameters in the neural network, the network biases and weights, can be adequately trained using a small batch of the full dataset. Unfortunately, this insight from the proposed learning curve scheme would likely shoot over the heads of the 95% of the general Nature audience that are unfamiliar with the mechanics of neural networks and how they are trained. Likewise, most readers wouldn’t have the foggiest notion of what a Random Forest is, nor how it differs from a deep neural network, nor why it is considered simpler and more transparent. The purpose of the paper (to my understanding) was not to provide a benchmark machine learning algorithm so that future groups could apply more advanced techniques (GANs, Variational Autoencoders, etc.) to boost AUC performance by 5%. Instead, the paper showed that a relatively simple, but purely data-driven approach could predict aftershock locations better than Coulomb stress (the metric used in most studies to date) and also identify stress-based proxies (max shear stress, von Mises stress) that have physical significance and are better predictors than the classical Coulomb stress. In this way, the deep learning algorithm was used as a tool to remove our human bias toward the Coulomb stress criterion, which has been ingrained in our psyche by more than 20 years of published literature.

To summarize: regarding point (1), I wish the Devries et al. study had controlled for potential data leakage, but do not feel that the main results of the paper are compromised by doing so. As for point (2), I think it is interesting (though not surprising) that the neural network only needs a small batch of data to be adequately trained, but this is certainly a minor point of contention, relative to the key takeaways of the paper, which Shah and Innig may have missed. Point (3) follows more or less directly from (2), and it is intuitive that a simpler and more transparent machine learning algorithm (like a Random Forest) would give comparable performance to a deep neural network. Again, it would have been nice to have noted in the manuscript that the main insights could have been derived from a different machine learning approach, but this detail is of more interest to a data science or machine learning specialist than to a general Nature audience. I think the disconnect between the Shah and Innig and Devries et al. is a matter of perspective. Shah and Innig are concerned primarily with machine learning best practices methodology, and with formulating the problem as “Kaggle”-like machine learning challenge with proper benchmarking. Devries et al. are concerned primarily with using machine learning as tool to extract insight into the natural world, and not with details of the algorithm design.

AUTHORS RESPONSE


Choosing Learning Rate?

Ben Hammel shows how to do it.

Reduce learning rate on plateau

Every time the loss begins to plateau, the learning rate decreases by a set fraction. The belief is that the model has become caught in region similar to the “high learning rate” scenario shown at the start of this post (or visualized in the ‘chaotic’ landscape of the VGG-56 model above). Reducing the learning rate will allow the optimizer to more efficiently find the minimum in the loss surface. At this time, one might be concerned about converging to a local minimum. This is where building intuition from an illustrative representation can betray you, I encourage you to convince yourself of the discussion in the “Local minima in deep learning” section.
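As a concrete illustration of the plateau strategy (mine, not from Hammel's post), a minimal PyTorch sketch with a toy model and synthetic data:

```python
# Reduce-on-plateau sketch; the model and data are toy placeholders.
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)   # halve the LR after 3 stagnant epochs

criterion = nn.MSELoss()
X, y = torch.randn(256, 10), torch.randn(256, 1)

for epoch in range(30):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    scheduler.step(loss.item())   # the scheduler watches the (ideally validation) loss
```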

Use a learning-rate finder

The learning rate finder was first suggested by L. Smith and popularized by Jeremy Howard in Deep Learning : Lesson 2 2018. There are lots of great references on how this works but they usually stop short of hand-wavy justifications. If you want some convincing of this method, this is a simple implementation on a linear regression problem, followed by a theoretical justification.
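And a minimal sketch of the learning-rate range test itself, again with a toy model and synthetic data just to show the mechanics:

```python
# Learning-rate range test sketch; model, data, and bounds are toy placeholders.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-7)

X, y = torch.randn(2048, 10), torch.randn(2048, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32)

lr, max_lr, mult, history = 1e-7, 10.0, 1.3, []
for xb, yb in loader:
    for group in optimizer.param_groups:
        group["lr"] = lr                                 # set the LR for this mini-batch
    optimizer.zero_grad()
    loss = criterion(model(xb), yb)
    loss.backward()
    optimizer.step()
    history.append((lr, loss.item()))
    if loss.item() > 4 * min(l for _, l in history):     # stop once the loss blows up
        break
    lr *= mult                                           # grow the LR geometrically
    if lr > max_lr:
        break

# A common heuristic: pick a value roughly one order of magnitude below the loss minimum.
suggested = min(history, key=lambda t: t[1])[0] / 10
print(f"suggested learning rate ≈ {suggested:.2e}")
```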

Behavior of different learning rate values

Deep Learning and Radiology, False Dichotomy, Tools and a Paradigm Shift

From the MIT Tech Review article called "Google shows how AI might detect lung cancer faster and more reliably" we have the following information:

Early warning: Danial Tse, a researcher at Google, developed an algorithm that beat a number of trained radiologists in testing. Tse and colleagues trained a deep-learning algorithm to detect malignant lung nodules in more than 42,000 CT scans. The resulting algorithms turned up 11% fewer false positives and 5% fewer false negatives than their human counterparts. The work is described in a paper published in the journal Nature today.

That reminds me of a lot of haterism, defensiveness, confirmation bias and, especially, a lack of understanding of the technology and of its potential to help people worldwide. I will not cite most of it here, but you can check my Twitter @flavioclesio.

Some people from academic circles, especially from Statistics and Epidemiology, started bashing the automation of statistical methods (Machine Learning) in several different ways, using a lot of questionable methods to assess ML, and even using one of the worst systematic reviews in history to create a false dichotomy between Stats and ML researchers.

Most of the time, that kind of criticism without consistent argumentation around the central point sounds more like pedantry, where these people tell us in a subliminal way: "Hey, look at those nerds, they do not know what they are doing. Trust us, <<Classical Methods Professors>>; we have <<Number of Papers>> in this field, and those folks are only coders who do not have all the training that we have."

This situation is so common that in April I had to join a thread with Frank Harrell to argue that an awful/pointless systematic review should not be used to create that kind of pointless dichotomy:

My point is: Statistics, Machine Learning, Artificial Intelligence, Python, R, and so on are tools and should be treated as such.

Closing thoughts

I invite all my 5 readers to exercise the following paradigm shift: instead of thinking

Will this AI in Health take Doctors out of their jobs?

let's change the question to

Hey, you're telling me that by using this easy-to-implement free software with commodity CPU power we can democratize health exams for less favored people, together with the Doctors?


NLP is still an open problem, no matter how the tech giants sell it to you…

This post from Ana Marasović tells why. Here's a small sample:

We should use more inductive biases, but we have to work out what are the most suitable ways to integrate them into neural architectures such that they really lead to expected improvements.

We have to enhance pattern-matching state-of-the-art models with some notion of human-like common sense that will enable them to capture the higher-order relationships among facts, entities, events or activities. But mining common sense is challenging, so we are in need of new, creative ways of extracting common sense.

Finally, we should deal with unseen distributions and unseen tasks, otherwise "any expressive model with enough data will do the job." Obviously, training such models is harder and results will not immediately be impressive. As researchers we have to be bold with developing such models, and as reviewers we should not penalize work that tries to do so.

This discussion within the field of NLP reflects a larger trend within AI in general—reflection on the flaws and strengths of deep learning. Yuille and Liu wrote an opinion titled Deep Nets: What have they ever done for Vision? in the context of vision, and Gary Marcus has long championed using approaches beyond deep learning for AI in general. It is a healthy sign that AI researchers are very much clear eyed about the limitations of deep learning, and working to address them.

In practical terms, today we have at most a good way to perform word frequency counts and some structural language analysis with NLP. For the rest, we are far from human capacity.


Not Safe for Work Detector using Tensorflow JS

From the official repository:

A simple JavaScript library to help you quickly identify unseemly images; all in the client’s browser. NSFWJS isn’t perfect, but it’s pretty accurate (~90% from our test set of 15,000 test images)… and it’s getting more accurate all the time.

The library categorizes image probabilities into the following 5 classes:

  • Drawing – safe for work drawings (including anime)
  • Hentai – hentai and pornographic drawings
  • Neutral – safe for work neutral images
  • Porn – pornographic images, sexual acts
  • Sexy – sexually explicit images, not pornography

The demo is a continuous deployment source – Give it a go: http://nsfwjs.com/


Benchmark-ML: Cutting the Big Data Hype

This is the most important benchmark project ever done in Machine Learning. I'll leave the provided summary here:

When I started this benchmark in March 2015, the “big data” hype was all the rage, and the fanboys wanted to do machine learning on “big data” with distributed computing (Hadoop, Spark etc.), while for the datasets most people had single-machine tools were not only good enough, but also faster, with more features and less bugs. I gave quite a few talks at conferences and meetups about these benchmarks starting 2015 and while at the beginning I had several people asking angrily about my results on Spark, by 2017 most people realized single machine tools are much better for solving most of their ML problems. While Spark is a decent tool for ETL on raw data (which often is indeed “big”), its ML libraries are totally garbage and outperformed (in training time, memory footpring and even accuracy) by much better tools by orders of magnitude. Furthermore, the increase in available RAM over the last years in servers and also in the cloud, and the fact that for machine learning one typically refines the raw data into a much smaller sized data matrix is making the mostly single-machine highly-performing tools (such as xgboost, lightgbm, VW but also h2o) the best choice for most practical applications now. The big data hype is finally over.

Github Repo

Reproducibility in FastText

A few days ago I wrote about FastText, and one thing that is not clear in the docs is how to make the experiments reproducible in a deterministic way.

In the default settings of the train_supervised() method, I'm using the thread parameter with multiprocessing.cpu_count() - 1 as its value.

This means that we're using all available CPUs for training. As a result, this implies a shorter training time on multicore servers or machines.

However, this results in a totally non-deterministic outcome because of the optimization algorithm used by fastText (asynchronous stochastic gradient descent, or Hogwild; paper here): the obtained vectors will be different, even if initialized identically.

This very gentle guide to FastText with Gensim states that:

for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).

Radim Řehůřek in FastText Model

So, for that particular reason, the main assumption here is that, even playing in a very stochastic experimentation environment, we will consider only the impact of the data volume itself and abstract this issue away from the results, since the stochasticity plays the same role in both experiments.

To make the experiments reproducible, the only thing needed is to change the value of the thread parameter from multiprocessing.cpu_count() - 1 to 1.

So, for the sake of reproducibility, the training time will be longer (in my experiments I am seeing an increase of 8000% in training time).
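Putting it together, a minimal sketch of a deterministic run; file paths and hyperparameters are placeholders, and PYTHONHASHSEED has to be set before the interpreter starts if you also depend on reproducible Python-level hashing.

```python
# Deterministic fastText run sketch; paths and hyperparameters are placeholders.
import fasttext

model = fasttext.train_supervised(
    input="train.txt",
    thread=1,        # a single worker removes the Hogwild ordering jitter
    epoch=25,
    lr=0.5,
    wordNgrams=2,
)
model.save_model("reproducible_model.bin")
```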


FastText – A great tool for Text Classification

At some point I'll post a field report about FastText in a text classification project. My opinion as of this moment (16.03.19): for a fast alpha version of a text classifier, with robust use of bag-of-tricks and word n-grams, it is amazing in terms of practical results (especially recall) and speed of development.


Chinese robot is the first machine to pass a medical exam

An interesting piece from China Daily:

A robot has passed the written test of China’s national medical licensing examination, an essential entrance exam for doctors, making it the first robot in the world to pass such an exam.

Its developer iFlytek Co Ltd, a leading Chinese artificial intelligence company, said on Thursday that the robot scored 456 points, 96 points higher than the required marks.

The artificial-intelligence-enabled robot can automatically capture and analyze patient information and make initial diagnosis. It will be used to assist doctors to improve efficiency in future treatments, iFlytek said.

This is part of broader efforts by China to accelerate the application of AI in healthcare, consumer electronics, and other industries.

Liu Qingfeng, chairman of iFlytek, said, “We will officially launch the robot in March 2018. It is not meant to replace doctors. Instead, it is to promote better people-machine cooperation so as to boost efficiency.”

Ao menos essa noticia e interessante por dois motivos simples:

1) Estamos vivendo em um tempo em que os AI Deniers (eu denomino eles como Terraplanistas da Inteligência Artificial) onde eles têm no Gary Marcus a sua maior expressão e em ao mesmo tempo tem uma mistura de Um ótimo discurso de ceticismo em relação ao hype do Deep Learning e Artificial General Intelligence (AGI) com críticas sem sentido, como neste exemplo em que há negação contra resultados experimentais claros com disclaimer de limitações metodológicas; e

2) Em termos de alocação de recursos médicos e econômicos a automação desses sistemas de robôs médicos traria um grande avanço social no sentido de que a) haveria uma maior democratização do acesso à saúde preventiva por parte das pessoas menos beneficiadas, uma vez que os custos teriam uma redução drastica e b) potencialmente haveria uma melhor alocação do tempo dos profissionais da saúde (e.g. médicos e enfermeiros) em tarefas de maior valor para a prevenção ou recuperação/intervenção para os pacientes ao invés da execução de procedimentos repetitivos, como por exemplo um foco maior em diagnóstico e tratamento (e isso é realmente importante no Brasil dado que 40% dos médicos recém-formados são reprovados na prova do CREMESP, Sendo ainda que 70% dos médicos não sabiam medir pressão e 86% erraram abordagem a vítima de acidente de trânsito).

Conclusion

In a reality where at least 50% of all jobs may be eliminated from the labor market by automation and Artificial Intelligence, alongside growing demand for goods and services (with ever-increasing cost competitiveness), news like this is very welcome: it puts into perspective, for societies, that a correct understanding of the potential and the limitations of Artificial Intelligence is the path to social development and economic prosperity.

As an author said at a conference I attended in Asia: "Artificial Intelligence is not here to take people's jobs; it is here to end the jobs of those who do not use it."

Chinese robot is the first machine to pass a medical licensing exam

Siamese Survival Analysis with Competing Risks

A nice approach to Survival Analysis, especially when we deal with several different covariates that interact with each other and with very different hazards acting on the event over time.

Abstract. Survival analysis in the presence of multiple possible adverse events, i.e., competing risks, is a pervasive problem in many industries (healthcare, finance, etc.). Since only one event is typically observed, the incidence of an event of interest is often obscured by other related competing events. This nonidentifiability, or inability to estimate true cause-specific survival curves from empirical data, further complicates competing risk survival analysis. We introduce Siamese Survival Prognosis Network (SSPN), a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks. SSPN circumvents the nonidentifiability problem by avoiding the estimation of cause-specific survival curves and instead determines pairwise concordant time-dependent risks, where longer event times are assigned lower risks. Furthermore, SSPN is able to directly optimize an approximation to the C-discrimination index, rather than relying on well-known metrics which are unable to capture the unique requirements of survival analysis with competing risks.

Conclusion: Competing risks settings are pervasive in healthcare. They are encountered in cardiovascular diseases, in cancer, and in the geriatric population suffering from multiple diseases. To solve the challenging problem of learning the model parameters from time-to-event data while handling right censoring, we have developed a novel deep learning architecture for estimating personalized risk scores in the presence of competing risks based on the well-known Siamese network architecture. Our method is able to capture complex non-linear representations missed by classical machine learning and statistical models. Experimental results show that our method is able to outperform existing competing risk methods by successfully learning representations which flexibly describe non-proportional hazard rates with complex interactions between covariates and survival times that are common in many diseases with heterogeneous phenotypes.
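To make the pairwise idea more concrete, here is a minimal sketch in PyTorch of a Siamese-style, concordance-oriented ranking loss: a shared network scores risk for each patient, and comparable pairs (earlier event time implies higher risk) are pushed apart with a margin ranking loss. This is a conceptual illustration under my own assumptions, not the authors' SSPN implementation; names like RiskNet and the toy data are invented for the sketch.

```python
# Sketch of the pairwise, concordance-style ranking idea behind Siamese
# survival models -- not the SSPN code from the paper.
import torch
import torch.nn as nn

class RiskNet(nn.Module):
    """Shared network that maps covariates to a scalar risk score."""
    def __init__(self, n_features: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 32), nn.ReLU(),
            nn.Linear(32, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)  # one risk score per patient

def pairwise_ranking_loss(model, x_early, x_late, margin=0.1):
    # x_early: patients whose event of interest happened earlier (should get higher risk)
    # x_late:  comparable patients with later event times (should get lower risk)
    r_early, r_late = model(x_early), model(x_late)
    target = torch.ones_like(r_early)  # "first argument should rank higher"
    return nn.functional.margin_ranking_loss(r_early, r_late, target, margin=margin)

# Toy usage: 10 comparable pairs with 5 covariates each (random stand-in data).
model = RiskNet(n_features=5)
loss = pairwise_ranking_loss(model, torch.randn(10, 5), torch.randn(10, 5))
loss.backward()
```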

Siamese Survival Analysis with Competing Risks

Progressive Neural Architecture Search

Abstract: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.

Conclusions: The main contribution of this work is to show how we can accelerate the search for good CNN structures by using progressive search through the space of increasingly complex graphs, combined with a learned prediction function to efficiently identify the most promising models to explore. The resulting models achieve the same level of performance as previous work but with a fraction of the computational cost. There are many possible directions for future work, including: the use of better surrogate predictors, such as Gaussian processes with string kernels; the use of model-based early stopping, such as [3], so we can stop the training of “unpromising” models before reaching E1 epochs; the use of “warm starting”, to initialize the training of a larger b+ 1-sized model from its smaller parent; the use of Bayesian optimization, in which we use an acquisition function, such as expected improvement or upper confidence bound, to rank the candidate models, rather than greedily picking the top K (see e.g., [31,30]); adaptively varying the number of models K evaluated at each step (e.g., reducing it over time); the automatic exploration of speed-accuracy tradeoffs (cf., [11]), etc.
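As a toy illustration of the progressive, surrogate-guided search idea only (this is not the PNAS code; the block choices, the scoring functions, and the beam size are invented for the sketch): candidate "architectures" grow one block at a time, a cheap surrogate ranks the expansions, and only the predicted top-K are fully evaluated.

```python
# Toy sketch of SMBO-style progressive search: expand architectures one block
# at a time, rank expansions with a cheap surrogate, evaluate only the top-K.
import random

BLOCKS = ["conv3x3", "conv5x5", "maxpool", "identity"]
TOP_K = 3
MAX_BLOCKS = 3

def true_score(arch):
    # Stand-in for the expensive step (training + validation accuracy).
    random.seed(hash(arch) % (2 ** 32))
    return random.random()

def surrogate_score(arch, history):
    # Naive surrogate: average observed score of architectures sharing the same prefix.
    prefix = arch[: len(arch) - 1]
    matches = [s for a, s in history if a[: len(prefix)] == prefix]
    return sum(matches) / len(matches) if matches else 0.5

history = []
beam = [(b,) for b in BLOCKS]                    # all 1-block architectures
history += [(a, true_score(a)) for a in beam]    # evaluate all of them

for _ in range(MAX_BLOCKS - 1):
    candidates = [a + (b,) for a in beam for b in BLOCKS]   # expand by one block
    ranked = sorted(candidates, key=lambda a: surrogate_score(a, history), reverse=True)
    beam = ranked[:TOP_K]                                   # keep the predicted top-K
    history += [(a, true_score(a)) for a in beam]           # evaluate only those

best_arch, best_score = max(history, key=lambda t: t[1])
print("best architecture:", best_arch, "score:", round(best_score, 3))
```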

Progressive Neural Architecture Search

Classification using deep learning neural networks for brain tumors

Abstract: Deep Learning is a new machine learning field that gained a lot of interest over the past few years. It was widely applied to several applications and proven to be a powerful machine learning tool for many of the complex problems. In this paper we used Deep Neural Network classifier which is one of the DL architectures for classifying a dataset of 66 brain MRIs into 4 classes e.g. normal, glioblastoma, sarcoma and metastatic bronchogenic carcinoma tumors. The classifier was combined with the discrete wavelet transform (DWT) the powerful feature extraction tool and principal components analysis (PCA) and the evaluation of the performance was quite good over all the performance measures.

Conclusion and future work: In this paper we proposed an efficient methodology which combines the discrete wavelet transform (DWT) with the Deep Neural Network (DNN) to classify the brain MRIs into Normal and 3 types of malignant brain tumors: glioblastoma, sarcoma and metastatic bronchogenic carcinoma. The new methodology architecture resemble the convolutional neural networks (CNN) architecture but requires less hardware specifications and takes a convenient time of processing for large size images (256 × 256). In addition using the DNN classifier shows high accuracy compared to traditional classifiers. The good results achieved using the DWT could be employed with the CNN in the future and compare the results.
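For intuition only, here is a rough sketch of a DWT → PCA → neural-network pipeline of the kind described above, using PyWavelets and scikit-learn on synthetic stand-in data; the wavelet, decomposition level, number of components, and network sizes are my own assumptions, not the paper's.

```python
# Rough sketch of a DWT -> PCA -> neural network pipeline on synthetic data,
# not the authors' code or the 66-MRI dataset from the paper.
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def dwt_features(image, wavelet="haar", level=2):
    # Keep only the low-frequency approximation coefficients as features.
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    return coeffs[0].ravel()

# Synthetic stand-ins: 66 "images" of 256 x 256 pixels and 4 classes.
rng = np.random.default_rng(0)
images = rng.random((66, 256, 256))
labels = rng.integers(0, 4, size=66)

X = np.stack([dwt_features(img) for img in images])
clf = make_pipeline(
    PCA(n_components=20),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0),
)
clf.fit(X, labels)
print("training accuracy:", clf.score(X, labels))
```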

Classification using deep learning neural networks for brain tumors

Deep learning sharpens views of cells and genes

The research relied on a convolutional neural network, a type of deep-learning algorithm that is transforming how biologists analyse images. Scientists are using the approach to find mutations in genomes and predict variations in the layout of single cells. Google’s method, described in a preprint in August (R. Poplin et al. Preprint at https://arxiv.org/abs/1708.09843; 2017), is part of a wave of new deep-learning applications that are making image processing easier and more versatile — and could even identify overlooked biological phenomena.
Cell biologists at the Allen Institute for Cell Science in Seattle, Washington, are using convolutional neural networks to convert flat, grey images of cells captured with light microscopes into 3D images in which some of a cell’s organelles are labelled in colour. The approach eliminates the need to stain cells — a process that requires more time and a sophisticated lab, and can damage the cell. Last month, the group published details of an advanced technique that can predict the shape and location of even more cell parts using just a few pieces of data — such as the cell’s outline (G. R. Johnson et al. Preprint at bioRxiv http://doi.org/chwv; 2017).
Other machine-learning connoisseurs in biology have set their sights on new frontiers, now that convolutional neural networks are taking flight for image processing. “Imaging is important, but so is chemistry and molecular data,” says Alex Wolf, a computational biologist at the German Research Center for Environmental Health in Neuherberg. Wolf hopes to tweak neural networks so that they can analyse gene expression. “I think there will be a very big breakthrough in the next few years,” he says, “that allows biologists to apply neural networks much more broadly.”
Deep learning sharpens views of cells and genes

Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

Abstract: Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
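Just to ground the idea, here is a toy transfer-learning sketch in the spirit of regressing a continuous risk factor (e.g. age) from fundus photographs; this is not the paper's model, data, or training setup, and it assumes a recent torchvision plus random stand-in tensors instead of real images.

```python
# Toy sketch: regress a continuous target (e.g. age) from images with a CNN.
# Not the paper's architecture or data; everything here is a stand-in.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)           # pretrained weights optional
model.fc = nn.Linear(model.fc.in_features, 1)   # single regression output (e.g. age)

images = torch.randn(8, 3, 224, 224)            # stand-in "fundus photographs"
ages = torch.randn(8, 1) * 10 + 55              # stand-in age labels (years)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(images), ages)
loss.backward()
optimizer.step()
print("one training step done, MSE:", loss.item())
```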

Prediction of cardiovascular risk factors from retinal fundus photographs via deep learning

Deep Reinforcement Learning Doesn’t Work Yet

That's why I prefer "tear down" projects and papers (the kind that opens our eyes with the right criticism) over "look-at-my-shiny-non-reproducible-paper-project" ones.

By Sorta Insightful

Deep Reinforcement Learning Doesn’t Work Yet

Optimization for Deep Learning Algorithms: A Review

ABSTRACT: In past few years, deep learning has received attention in the field of artificial intelligence. This paper reviews three focus areas of learning methods in deep learning namely supervised, unsupervised and reinforcement learning. These learning methods are used in implementing deep and convolutional neural networks. They offered unified computational approach, flexibility and scalability capabilities. The computational model implemented by deep learning is used in understanding data representation with multiple levels of abstractions. Furthermore, deep learning enhanced the state-of-the-art methods in terms of domains like genomics. This can be applied in pathway analysis for modelling biological network. Thus, the extraction of biochemical production can be improved by using deep learning. On the other hand, this review covers the implementation of optimization in terms of meta-heuristics methods. This optimization is used in machine learning as a part of modelling methods.
CONCLUSION: In this review, discussed about deep learning techniques which implementing multiple level of abstraction in feature representation. Deep learning can be characterized as rebranding of artificial neural network. This learning methods gains a large interest among the researchers because of better representation and easier to learn tasks. Even though deep learning is implemented, however there are some issues has been arise. There are easily getting stuck at local optima and computationally expensive. DeepBind algorithm shows that deep learning can cooperate in genomics study. It is to ensure on achieving high level of prediction protein binding affinity. On the other hand, the optimization method which has been discusses consists of several meta-heuristics methods which can be categorized under evolutionary algorithms. The application of the techniques involved CRO shows the diversity of optimization algorithm to improve the analysis of modelling techniques. Furthermore, these methods are able to solve the problems arise in conventional neural network as it provides high quality in finding solution in a given search space. The application of optimization methods enable the extraction of biochemical production of metabolic pathway. Deep learning will gives a good advantage in the biochemical production as it allows high level abstraction in cellular biological network. Thus, the use of CRO will improve the problems arise in deep learning which are getting stuck at local optima and it is computationally expensive. As CRO use global search in the search space to identify global minimum point. Thus, it will improve the training process in the network on refining the weight in order to have minimum error.
Optimization for Deep Learning Algorithms: A Review