Abstract: Traditionally, medical discoveries are made by observing associations, making hypotheses from them and then designing and running experiments to test the hypotheses. However, with medical images, observing and quantifying associations can often be difficult because of the wide variety of features, patterns, colours, values and shapes that are present in real data. Here, we show that deep learning can extract new knowledge from retinal fundus images. Using deep-learning models trained on data from 284,335 patients and validated on two independent datasets of 12,026 and 999 patients, we predicted cardiovascular risk factors not previously thought to be present or quantifiable in retinal images, such as age (mean absolute error within 3.26 years), gender (area under the receiver operating characteristic curve (AUC) = 0.97), smoking status (AUC = 0.71), systolic blood pressure (mean absolute error within 11.23 mmHg) and major adverse cardiac events (AUC = 0.70). We also show that the trained deep-learning models used anatomical features, such as the optic disc or blood vessels, to generate each prediction.
That’s why I like more “tear down” projects and papers (that kind of paper helps to open our eyes in the right criticism) than “look-my-shiny-not-reproducible-paper-project” .
Fast and simple.
What is an Adversarial Attack?
Machine learning algorithms accept inputs as numeric vectors. Designing an input in a specific way to get the wrong result from the model is called an adversarial attack.
How is this possible? No machine learning algorithm is perfect and they make mistakes — albeit very rarely. However, machine learning models consist of a series of specific transformations, and most of these transformations turn out to be very sensitive to slight changes in input. Harnessing this sensitivity and exploiting it to modify an algorithm’s behavior is an important problem in AI security.
In this article we will show practical examples of the main types of attacks, explain why is it so easy to perform them, and discuss the security implications that stem from this technology.
Types of Adversarial Attacks
Here are the main types of hacks we will focus on:
- Non-targeted adversarial attack: the most general type of attack when all you want to do is to make the classifier give an incorrect result.
- Targeted adversarial attack: a slightly more difficult attack which aims to receive a particular class for your input.
Abstract: Cancer is the most dangerous and stubborn disease known to mankind. It accounts for the most deaths caused by any disease. However, if detected early this medical condition is not very difficult to defeconvoat. Tumors which are cancerous grow very rapidly and spread into different parts of the body and this process continues until that tumor spreads in the entire body and ultimately our organs stop functioning. If any tumor is developed in any part of our body it requires immediate medical attention to verify that the tumor is malignant(cancerous) or Benign(non-cancerous). Until now if any tumor has to be tested for malignancy a sample of tumor should be extracted out and then tested in the laboratory. But using the computational logic of Deep Neural Networks we can predict that the tumor is malignant or Benign by only a photograph of that tumor. If cancer is detected in early stage chances are very high that it can be cured completely. In this work, we detect Melanoma(Skin cancer) in tumors by processing images of those tumors.
Conclusion: We have trained our model using Vgg16, Inception and ResNet50 neural network architecture. In training, we have provided 2 categories of images one with Malignant (MelanomaSkin cancer) tumors and other with benign tumors. After training, we tested our model with random images of tumor and an accuracy of 83.86%-86.02% was recorded in classifying that it is malignant or benign. By using neural network our model can classify Malignant(cancerous) and benign(non-cancerous) tumors with an accuracy of 86.02%. Since cancer, if detected early can be cured completely. This technology can be used to detect cancer when a tumor is developed at early stage and precautions can be taken accordingly.
By Joel Ruben Antony Moniz, David Krueger
We propose Nested LSTMs (NLSTM), a novel RNN architecture with multiple levels of memory. Nested LSTMs add depth to LSTMs via nesting as opposed to stacking. The value of a memory cell in an NLSTM is computed by an LSTM cell, which has its own inner memory cell. Specifically, instead of computing the value of the (outer) memory cell as coutert=ft⊙ct−1+it⊙gt, NLSTM memory cells use the concatenation (ft⊙ct−1,it⊙gt) as input to an inner LSTM (or NLSTM) memory cell, and set coutert = hinnert. Nested LSTMs outperform both stacked and single-layer LSTMs with similar numbers of parameters in our experiments on various character-level language modeling tasks, and the inner memories of an LSTM learn longer term dependencies compared with the higher-level units of a stacked LSTM.
One of the most misunderstood concepts and the reason that a lot of cash is spent in Machine Learning as a Service (MLaaS) due a lack of optimization in this parameter that is responsible to control the convergence.
Abstract: Any gradient descent optimization requires to choose a learning rate. With deeper and deeper models, tuning that learning rate can easily become tedious and does not necessarily lead to an ideal convergence. We propose a variation of the gradient descent algorithm in the which the learning rate η is not fixed. Instead, we learn η itself, either by another gradient descent (first-order method), or by Newton’s method (second-order). This way, gradient descent for any machine learning algorithm can be optimized.
Conclusion: In this paper, we have built a new way to learn the learning rate at each step using finite differences on the loss. We have tested it on a variety of convex and non-convex optimization tasks. Based on our results, we believe that our method would be able to adapt a good learning rate at every iteration on convex problems. In the case of non-convex problems, we repeatedly observed faster training in the first few epochs. However, our adaptive model seems more inclined to overfit the training data, even though its test accuracy is always comparable to standard SGD performance, if not slightly better. Hence we believe that in neural network architectures, our model can be used initially for pretraining for a few epochs, and then continue with any other standard optimization technique to lead to faster convergence and be computationally more efficient, and perhaps reach a new highest accuracy on the given problem. Moreover, the learning rate that our algorithm converges to suggests an ideal learning rate for the given training task. One could use our method to tune the learning rate of a standard neural network (using Adam for instance), giving a more precise value than with line-search or random search.
We’re pleased to announce the next step towards deep learning for every device and platform. Today Vertex.AI is releasing PlaidML, our open source portable deep learning engine. Our mission is to make deep learning accessible to every person on every device, and we’re building PlaidML to help make that a reality. We’re starting by supporting the most popular hardware and software already in the hands of developers, researchers, and students. The initial version of PlaidML runs on most existing PC hardware with OpenCL-capable GPUs from NVIDIA, AMD, or Intel. Additionally, we’re including support for running the widely popular Keras framework on top of Plaid to allow existing code and tutorials to run unchanged.
Via Baidu Research Blog.
The Ring all-reduce approach came to save a lot of work when training deep neural networks. The approach to propagate and update the gradients (and control the convergence of the model) are well explained below:
The Ring Allreduce
The main issue with the simplistic communication strategy described above was that the communication cost grew linearly with the number of GPUs in the system. In contrast, a ring allreduce is an algorithm for which the communication cost is constant and independent of the number of GPUs in the system, and is determined solely by the slowest connection between GPUs in the system; in fact, if you only consider bandwidth as a factor in your communication cost (and ignore latency), the ring allreduce is an optimal communication algorithm . (This is a good estimate for communication cost when your model is large, and you need to send large amounts of data a small number of times.)
The GPUs in a ring allreduce are arranged in a logical ring. Each GPU should have a left neighbor and a right neighbor; it will only ever send data to its right neighbor, and receive data from its left neighbor.
The algorithm proceeds in two steps: first, a scatter-reduce, and then, an allgather. In the scatter-reduce step, the GPUs will exchange data such that every GPU ends up with a chunk of the final result. In the allgather step, the GPUs will exchange those chunks such that all GPUs end up with the complete final result.
Yet, this is another tool for Deep Learning, but I think that those guys hit the nail exposing and fixing one of the major concerns about Tensor Flow that is distributed training.
When Uber needed to use Deep Learning they found some endeavors to use the conventional Data Parallelism architecture. Using Data Parallelism arch they can distribute the training using several instances in parallel and when the gradients for every batch are calculated in each instance (node/worker) these gradients are propagated for all nodes and averaged to control the convergence (update) of the model in the training phase. The following image explains better than words.
But using this architecture Uber faced two problems that were a) the right ratio of worker to parameter servers (to avoid/deal with network and processing bottleneck) and b) the complexity of TensorFlow code (more details here).
To avoid these problems they used an idea of a 2009 paper “Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations” called Ring all-reduce.
They explain the workflow of this approach:
In the ring-allreduce algorithm, each of N nodes communicates with two of its peers 2*(N-1) times. During this communication, a node sends and receives chunks of the data buffer. In the first N-1 iterations, received values are added to the values in the node’s buffer. In the second N-1 iterations, received values replace the values held in the node’s buffer. Baidu’s paper suggests that this algorithm is bandwidth-optimal, meaning that if the buffer is large enough, it will optimally utilize the available network.
The implementations details can be found here.
[…]The most interesting question to me is why Google chose a purely declarative paradigm for Tensorflow in spite of the obvious downsides of this approach. Did they feel that encapsulating all the computation in a single computation graph would simplify executing models on their TPU’s so they can cut Nvidia out of the millions of dollars to be made from cloud hosting of deep learning powered applications? It’s difficult to say. Overall, Tensorflow does not feel like a pure open source project for the common good. Which I would have no problem with, had their design been sound. In comparison with beautiful Google open source projects out there such as Protobuf, Golang, and Kubernetes, Tensorflow falls dramatically short.
While declarative paradigms are great for UI programming, there are many reasons why it is a problematic choice for deep learning.
On the other hand, in deep learning, a single layer can literally execute billions of FLOP’s! And deep learning researchers care very much about the mechanics of how computation is done and want fine control because they are constantly pushing the edge of what’s possible (e.g. dynamic networks) and want easy access to intermediate results.[…]
One of my biggest mistakes was to make my whole master’s degrees dissertation using private data (provided by my former employer) using closed tools (e.g. Viscovery Mine).
This was for me a huge blocker to share my research with every single person in the community, and get a second opinion about my work in regard of reproducibility.
I working to open my data and making a new version, or book, about this kind of analysis using Non Performing Loans data.
Here in Denny’s blog, he talks about how engineering is the bottleneck in Deep Learning Research, where he made the following statements:
I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what other’s have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit novelty and show improvement over well-established baselines.
So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering.” 1. Waste of research time and 2. Lack of rigor and reproducibility. Let’s look at each of them.
And the final musing:
Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.
Personally, I think this approach of discarding the results and focus on the novelty of methods is better than to try to understand any
result aspect that the researcher wants to cover up through academic BS complexity.
Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections – one between each layer and its subsequent layer – our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance.
Em breve teremos alguns posts aqui no blog sobre o assunto, mas é um case de ML com engenharia de caras questão mandando bem com métodos bem avançados com arquiteturas escaláveis.
We ultimately settled on conducting time series modeling based on the Long Short Term Memory (LSTM) architecture, a technique that features end-to-end modeling, ease of incorporating external variables, and automatic feature extraction abilities.4 By providing a large amount of data across numerous dimensions, an LSTM approach can model complex nonlinear feature interactions.
We decided to build a neural network architecture that provides single-model, heterogeneous forecasting through an automatic feature extraction module.6 As Figure 4 demonstrates, the model first primes the network by automatic, ensemble-based feature extraction. After feature vectors are extracted, they are averaged using a standard ensemble technique. The final vector is then concatenated with the input to produce the final forecast.
During testing, we were able to achieve a 14.09 percent symmetric mean absolute percentage error (SMAPE) improvement over the base LSTM architecture and over 25 percent improvement over the classical time series model used in Argos, Uber’s real-time monitoring and root cause-exploration tool.
Você foi lá no portal da Azure pegou uma NC24R que tem maravilhosos 224 Gb de memória, 24 núcleos, mais de 1Tb de disco, e o melhor: 4 placas M80 para o seu deleite completo no treinamento com Deep Learning.
Tudo perfeito, certo? Quase.
Logo de início tentei usar um script para um treinamento e com um simples
htop para monitorar o treinamento, vi que o Tensor Flow estava despejando todo o treinamento nos processadores.
Mesmo com esses 24 processadores maravilhosos batendo 100% de processamento, isso não chega nem perto do que as nossas GPUs mastodônticas podem produzir. (Nota: Você não trocaria 4 Ferraris 2017 por 24 Fiat 147 modelo 1985, certo?)
Acessando a nossa maravilhosa máquina para ver o que tinha acontecido, verifiquei primeiro se as GPUs estavam na máquina, o que de fato aconteceu.
Tue Jun 27 18:21:05 2017 +-----------------------------------------------------------------------------+ | NVIDIA-SMI 375.66 Driver Version: 375.66 | |-------------------------------+----------------------+----------------------+ | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | |===============================+======================+======================| | 0 Tesla K80 Off | B2A3:00:00.0 Off | 0 | | N/A 47C P0 71W / 149W | 0MiB / 11439MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 1 Tesla K80 Off | C4D8:00:00.0 Off | 0 | | N/A 57C P0 61W / 149W | 0MiB / 11439MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 2 Tesla K80 Off | D908:00:00.0 Off | 0 | | N/A 52C P0 56W / 149W | 0MiB / 11439MiB | 0% Default | +-------------------------------+----------------------+----------------------+ | 3 Tesla K80 Off | EEAF:00:00.0 Off | 0 | | N/A 42C P0 69W / 149W | 0MiB / 11439MiB | 0% Default | +-------------------------------+----------------------+----------------------+ +-----------------------------------------------------------------------------+ | Processes: GPU Memory | | GPU PID Type Process name Usage | |=============================================================================| | No running processes found | +-----------------------------------------------------------------------------+
azure_teste@deep-learning:~/deep-learning-moderator-msft$ lspci | grep -i NVIDIA
b2a3:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) c4d8:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) d908:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1) eeaf:00:00.0 3D controller: NVIDIA Corporation GK210GL [Tesla K80] (rev a1)
OK, as Teslas K80 estão prontas para o combate, contudo a própria Azure reconhece que há problemas no processo como um todo, e pra fazer isso basta alguns procedimentos bem simples em dois passos segundo a documentação.
$ git clone https://github.com/leestott/Azure-GPU-Setup.git
2) Entre na pasta que foi clonada
$ cd azure-gpu-setup
3) Execute o script que irá instalar algumas libs da NVIDIA e posteriormente fará o servidor reiniciar.
$ bash gpu-setup-part1.sh
1) Vá na pasta do repositório do git
$ cd azure-gpu-setup
2) Execute o segundo script que fará a instalação do Tensorflow, do Toolkit do CUDA, e do CUDNN além de fazer o set de uma porção de variáveis de ambiente.
$ bash gpu-setup-part2.sh
3) Depois faça o teste da instalação
$ python gpu-test.py
Ele atesta em uma de suas palestras, que as Redes Deep Learning são datadas de 1971 quando o Aleksei Ivakhnenko e Valentin Lapa publicaram o primeiro trabalho com uma rede de 8 camadas.
E aqui para baixar no post a sua revisão sistemática em relação à Deep Learning.
É um tema bem recente o ataque em sensores e por consequência em modelos de Machine Learning (isso será abordado em algum momento do futuro por aqui, mas esse artigo mostra bem o potencial danoso disso).
O Cleverhans é uma lib em Python que insere de maneira artificial um pouco de ruído/distúrbio na rede como forma de treino para esse tipo de ataque.
This repository contains the source code for cleverhans, a Python library to benchmark machine learning systems’ vulnerability to adversarial examples. You can learn more about such vulnerabilities on the accompanying blog.
The cleverhans library is under continual development, always welcoming contributions of the latest attacks and defenses. In particular, we always welcome help towards resolving the issues currently open.
About the name
The name cleverhans is a reference to a presentation by Bob Sturm titled “Clever Hans, Clever Algorithms: Are Your Machine Learnings Learning What You Think?” and the corresponding publication, “A Simple Method to Determine if a Music Information Retrieval System is a ‘Horse’.” Clever Hans was a horse that appeared to have learned to answer arithmetic questions, but had in fact only learned to read social cues that enabled him to give the correct answer. In controlled settings where he could not see people’s faces or receive other feedback, he was unable to answer the same questions. The story of Clever Hans is a metaphor for machine learning systems that may achieve very high accuracy on a test set drawn from the same distribution as the training data, but that do not actually understand the underlying task and perform poorly on other inputs.
Neste post do Michael Elad que é editor chefe da SIAM da publicação Journal on Imaging Sciences ele faz uma série de reflexões bem ponderadas de como os métodos de Deep Learning estão resolvendo problemas reais e alcançando um alto grau de visibilidade mesmo com métodos não tão elegantes dentro da perspectiva matemática.
Ele coloca o ponto principal de que, no que tange o processamento de imagens, a academia teve sempre um lugar de destaque em uma abordagem na qual a interpretabilidade e o entendimento do modelos sempre teve precedência em relação aos resultados alcançados.
Isso fica claro no parágrafo abaixo:
A series of papers during the early 2000s suggested the successful application of this architecture, leading to state-of-the-art results in practically any assigned task. Key aspects in these contributions included the following: the use of many network layers, which explains the term “deep learning;” a huge amount of data on which to train; massive computations typically run on computer clusters or graphic processing units; and wise optimization algorithms that employ effective initializations and gradual stochastic gradient learning. Unfortunately, all of these great empirical achievements were obtained with hardly any theoretical understanding of the underlying paradigm. Moreover, the optimization employed in the learning process is highly non-convex and intractable from a theoretical viewpoint.
No final ele coloca uma visão sobre pragmatismo e agenda acadêmica:
Should we be happy about this trend? Well, if we are in the business of solving practical problems such as noise removal, the answer must be positive. Right? Therefore, a company seeking such a solution should be satisfied. But what about us scientists? What is the true objective behind the vast effort that we invested in the image denoising problem? Yes, we do aim for effective noise-removal algorithms, but this constitutes a small fraction of our motivation, as we have a much wider and deeper agenda. Researchers in our field aim to understand the data on which we operate. This is done by modeling information in order to decipher its true dimensionality and manifested phenomena. Such models serve denoising and other problems in image processing, but far more than that, they allow identifying new ways to extract knowledge from the data and enable new horizons.
Isso lembra uma passagem minha na RCB Investimentos quando eu trabalhava com o grande Renato Toledo no mercado de NPL em que ele me ensinou que bons modelos têm um alto grau de interpretabilidade e simplicidade, no qual esse fator deve ser o barômetro da tomada de decisão, dado que um modelo cujo a sua incerteza (ou erro) seja conhecido é melhor do que um modelo que ninguém sabe o que está acontecendo (Nota pessoal: Quem me conhece sabe que eu tenho uma frase sobre isso que é: se você não entende a dinâmica do modelo quando ele funciona, nunca vai saber o que deu errado quando ele falhar.)
Contudo é inegável que as redes Deep Learning estão resolvendo, ao meu ver, uma demanda reprimida de problemas que já existiam e que os métodos computacionais não conseguiam resolver de forma fácil, como reconhecimento facial, classificação de imagens, tradução, e problemas estruturados como fraude (a Fast.AI está fazendo um ótimo trabalho de clarificar isso).
Em que pese o fato dos pesquisadores de DL terem hardware infinito a preços módicos, o fato brutal é que esse campo de pesquisa durante aproximadamente 30 anos engoliu uma pílula bem amarga de ceticismo proveniente da própria academia: seja em colocar esse método em uma esfera de alto ceticismo levando a sua quase total extinção, ou mesmo com alguns jornals implicitamente não aceitarem trabalhos de DL; enquanto matemáticos estavam ganhando prêmios e tendo um alto nível de visibilidade por causa da acurácia dos seus métodos ao invés de uma pretensa ideia de que o mundo gostava da interpretabilidade de seus métodos.
Duas grandes questões estão em tela que são: 1) Será que os matemáticos e comunidades que estão chocadas com esse fenômeno podem aguentar o mesmo que a comunidade de Redes Neurais aguentou por mais de 30 anos? e 2) E em caso de um Math-Winter, a comunidade matemática consegue suportar uma potencial marginalização de sua pesquisa?
É esperar e ver.
Pra quem acompanha o Python Programming, sabe que sempre quando eles postam algo é que coisa boa vem aí; e dessa vez não foi diferente.
O Harrison está fazendo uma série de posts sobre como jogar GTA V usando Deep Learning com Tensor Flow usando CNN (convolutional neural network).
Este é o primeiro vídeo da série em que ele faz o setup da solução:
E essa é a última versão treinada:
Para quem estiver interessado o Harrison deixou uma playlist com todos os estágios do treinamento, e um BOT rodando sozinho em um livestream (vale a pena ver o quão divertido é ver o bot tentando dirigir).