A good case to replicate in Brazil.
BACKGROUND: In China, dengue remains an important public health issue with expanded areas and increased incidence recently. Accurate and timely forecasts of dengue incidence in China are still lacking. We aimed to use the state-of-the-art machine learning algorithms to develop an accurate predictive model of dengue.
METHODOLOGY/PRINCIPAL FINDINGS: Weekly dengue cases, Baidu search queries and climate factors (mean temperature, relative humidity and rainfall) during 2011-2014 in Guangdong were gathered. A dengue search index was constructed for developing the predictive models in combination with climate factors. The observed year and week were also included in the models to control for the long-term trend and seasonality. Several machine learning algorithms, including the support vector regression (SVR) algorithm, step-down linear regression model, gradient boosted regression tree algorithm (GBM), negative binomial regression model (NBM), least absolute shrinkage and selection operator (LASSO) linear regression model and generalized additive model (GAM), were used as candidate models to predict dengue incidence. Performance and goodness of fit of the models were assessed using the root-mean-square error (RMSE) and R-squared measures. The residuals of the models were examined using the autocorrelation and partial autocorrelation function analyses to check the validity of the models. The models were further validated using dengue surveillance data from five other provinces. The epidemics during the last 12 weeks and the peak of the 2014 large outbreak were accurately forecasted by the SVR model selected by a cross-validation technique. Moreover, the SVR model had the consistently smallest prediction error rates for tracking the dynamics of dengue and forecasting the outbreaks in other areas in China.
CONCLUSION AND SIGNIFICANCE: The proposed SVR model achieved a superior performance in comparison with other forecasting techniques assessed in this study. The findings can help the government and community respond early to dengue epidemics.
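The abstract above ranks the candidate models by RMSE and R-squared. As a minimal sketch of those two measures (not the authors' code, and with made-up weekly counts used purely for illustration), they can be computed like this:

```python
import math

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("observed values are constant; R-squared is undefined")
    return 1.0 - ss_res / ss_tot

# Hypothetical weekly case counts and forecasts (illustrative numbers only).
observed = [10, 20, 30, 40]
predicted = [12, 18, 33, 37]

print("RMSE:", rmse(observed, predicted))
print("R-squared:", r_squared(observed, predicted))
```

A lower RMSE and a higher R-squared both point to a better fit, which is presumably how a model such as SVR would be compared against the other candidates.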
In the age of Deep Learning, it is always good to see a paper built on good old Support Vector Machines. A post about this technique will be coming to the blog soon.
Abstract: Hyper-parameter tuning for support vector machines has been widely studied in the past decade. A variety of metaheuristics, such as Genetic Algorithms and Particle Swarm Optimization have been considered to accomplish this task. Notably, exhaustive strategies such as Grid Search or Random Search continue to be implemented for hyper-parameter tuning and have recently shown results comparable to sophisticated metaheuristics. The main reason for the success of exhaustive techniques is due to the fact that only two or three parameters need to be adjusted when working with support vector machines. In this chapter, we analyze two Estimation Distribution Algorithms, the Univariate Marginal Distribution Algorithm and the Boltzmann Univariate Marginal Distribution Algorithm, to verify if these algorithms preserve the effectiveness of Random Search and at the same time make more efficient the process of finding the optimal hyper-parameters without increasing the complexity of Random Search.
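To make the comparison in this abstract concrete, here is a rough sketch of Random Search next to a continuous Univariate Marginal Distribution Algorithm (independent Gaussians per parameter). Everything specific is an assumption for illustration: `toy_validation_error` is a hypothetical stand-in for a cross-validated SVM error, the search ranges for log10(C) and log10(gamma) are invented, and the chapter's actual algorithms may differ in their details.

```python
import random
import statistics

def toy_validation_error(log_c, log_gamma):
    """Hypothetical stand-in for cross-validated SVM error; minimum at (1, -2)."""
    return (log_c - 1.0) ** 2 + (log_gamma + 2.0) ** 2

def random_search(objective, bounds, n_samples, rng):
    """Sample each parameter uniformly within its bounds and keep the best point."""
    best_point, best_value = None, float("inf")
    for _ in range(n_samples):
        point = [rng.uniform(lo, hi) for lo, hi in bounds]
        value = objective(*point)
        if value < best_value:
            best_point, best_value = point, value
    return best_point, best_value

def umda(objective, bounds, pop_size, n_select, n_generations, rng):
    """Continuous UMDA: one independent Gaussian per parameter, refit each generation."""
    means = [(lo + hi) / 2 for lo, hi in bounds]
    stdevs = [(hi - lo) / 2 for lo, hi in bounds]
    best_point, best_value = None, float("inf")
    for _generation in range(n_generations):
        population = []
        for _individual in range(pop_size):
            # Sample each coordinate from its own marginal, clamped to the bounds.
            point = [min(max(rng.gauss(m, s), lo), hi)
                     for m, s, (lo, hi) in zip(means, stdevs, bounds)]
            population.append((objective(*point), point))
        population.sort(key=lambda pair: pair[0])
        if population[0][0] < best_value:
            best_value, best_point = population[0]
        selected = [point for _, point in population[:n_select]]
        # Re-estimate each marginal independently from the selected points.
        means = [statistics.fmean(column) for column in zip(*selected)]
        stdevs = [max(statistics.pstdev(column), 1e-6) for column in zip(*selected)]
    return best_point, best_value

# Assumed search ranges for log10(C) and log10(gamma).
bounds = [(-3.0, 3.0), (-5.0, 1.0)]

# Both strategies get the same budget of 200 objective evaluations.
rs_point, rs_value = random_search(toy_validation_error, bounds, 200, random.Random(0))
ed_point, ed_value = umda(toy_validation_error, bounds, 20, 5, 10, random.Random(0))

print("random search:", rs_point, rs_value)
print("UMDA:", ed_point, ed_value)
```

With only two parameters both approaches tend to land near the optimum, which matches the abstract's point that simple strategies remain competitive for SVMs.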
A great paper on how hardware will play a crucial role in core Machine Learning within a few years, especially in embedded systems.
Hardware for Machine Learning: Challenges and Opportunities
Abstract: Machine learning plays a critical role in extracting meaningful information out of the zettabytes of sensor data collected every day. For some applications, the goal is to analyze and understand the data to identify trends (e.g., surveillance, portable/wearable electronics); in other applications, the goal is to take immediate action based on the data (e.g., robotics/drones, self-driving cars, smart Internet of Things). For many of these applications, local embedded processing near the sensor is preferred over the cloud due to privacy or latency concerns, or limitations in the communication bandwidth. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to throughput and accuracy requirements. Furthermore, flexibility is often required such that the processing can be adapted for different applications or environments (e.g., update the weights and model in the classifier). In many applications, machine learning often involves transforming the input data into a higher dimensional space, which, along with programmable weights, increases data movement and consequently energy consumption. In this paper, we will discuss how these challenges can be addressed at various levels of hardware design ranging from architecture, hardware-friendly algorithms, mixed-signal circuits, and advanced technologies (including memories and sensors).
Conclusions: Machine learning is an important area of research with many promising applications and opportunities for innovation at various levels of hardware design. During the design process, it is important to balance the accuracy, energy, throughput and cost requirements. Since data movement dominates energy consumption, the primary focus of recent research has been to reduce the data movement while maintaining performance accuracy, throughput and cost. This means selecting architectures with favorable memory hierarchies like a spatial array, and developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. With joint design of algorithm and hardware, reduced bitwidth precision, increased sparsity and compression are used to minimize the data movement requirements. With mixed-signal circuit design and advanced technologies, computation is moved closer to the source by embedding computation near or within the sensor and the memories. One should also consider the interactions between these different levels. For instance, reducing the bitwidth through hardware-friendly algorithm design enables reduced precision processing with mixed-signal circuits and non-volatile memory. Reducing the cost of memory access with advanced technologies could result in more energy-efficient dataflows.
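The conclusions mention reduced bitwidth precision as one way to cut data movement. A minimal sketch of the idea, assuming a simple symmetric uniform quantizer (one common scheme, not necessarily the one discussed in the paper) and made-up weights:

```python
def quantize_uniform(values, n_bits):
    """Symmetric uniform quantization of floats to signed n-bit integer codes."""
    if n_bits < 2:
        raise ValueError("n_bits must be at least 2 for a signed code")
    q_max = 2 ** (n_bits - 1) - 1  # e.g. 127 for 8 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / q_max if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the representable range.
    codes = [max(-q_max, min(q_max, round(v / scale))) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point values."""
    return [c * scale for c in codes]

# Hypothetical classifier weights (illustrative values only).
weights = [0.5, -1.0, 0.25, 0.0]
codes, scale = quantize_uniform(weights, 8)
restored = dequantize(codes, scale)

print("codes:", codes)
print("max abs error:", max(abs(w - r) for w, r in zip(weights, restored)))
```

Each weight now fits in one byte instead of a 32-bit float, at the cost of a rounding error of at most half a quantization step.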
Yet another study pitting Machine Learning algorithms against traditional scoring methods, with Machine Learning coming out ahead.
Abstract: The benefits of cardiac surgery are sometimes difficult to predict and the decision to operate on a given individual is complex. Machine Learning and Decision Curve Analysis (DCA) are recent methods developed to create and evaluate prediction models.
Methods and findings: We conducted a retrospective cohort study using a prospectively collected database from December 2005 to December 2012, from a cardiac surgical center at University Hospital. The different models of prediction of in-hospital mortality after elective cardiac surgery, including EuroSCORE II, a logistic regression model and a machine learning model, were compared by ROC and DCA. Of the 6,520 patients having elective cardiac surgery with cardiopulmonary bypass, 6.3% died. Mean age was 63.4 years old (standard deviation 14.4), and mean EuroSCORE II was 3.7 (4.8) %. The area under the ROC curve (95% CI) for the machine learning model (0.795 (0.755–0.834)) was significantly higher than for EuroSCORE II or the logistic regression model (respectively, 0.737 (0.691–0.783) and 0.742 (0.698–0.785), p < 0.0001). Decision Curve Analysis showed that the machine learning model, in this monocentric study, has a greater benefit whatever the probability threshold.
Conclusions: According to ROC and DCA, machine learning model is more accurate in predicting mortality after elective cardiac surgery than EuroSCORE II. These results confirm the use of machine learning methods in the field of medical prediction.
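The study compares models by area under the ROC curve and by Decision Curve Analysis. As a hedged sketch (not the study's code), the AUC can be computed by pairwise comparison, and the quantity plotted in a decision curve is usually the net benefit, TP/n - (FP/n) * pt/(1 - pt), at a probability threshold pt. The labels and predicted risks below are invented for illustration.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count as 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def net_benefit(labels, scores, threshold):
    """Net benefit of treating everyone whose predicted risk is >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(labels)
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    return tp / n - (fp / n) * threshold / (1 - threshold)

# Hypothetical outcomes (1 = died) and predicted risks, for illustration only.
labels = [1, 0, 1, 0, 0, 1]
risks = [0.9, 0.2, 0.6, 0.7, 0.1, 0.4]

print("AUC:", auc(labels, risks))
for pt in (0.25, 0.5, 0.75):
    print("net benefit at", pt, "=", net_benefit(labels, risks, pt))
```

A decision curve is just this net benefit plotted across thresholds for each model, so "greater benefit whatever the probability threshold" means one curve stays above the others throughout.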
A tutorial for anyone who wants to apply it in day-to-day work.
The paper is seminal (that is, it should be reviewed with a bit more caution), but it represents a good advance in the use of ANNs (artificial neural networks), given that Random Forests and Support Vector Machines have been showing considerably better results, academically speaking.
Below is the article's abstract:
Artificial neural networks are powerful pattern classifiers; however, they have been surpassed in accuracy by methods such as support vector machines and random forests that are also easier to use and faster to train. Backpropagation, which is used to train artificial neural networks, suffers from the herd effect problem which leads to long training times and limits classification accuracy. We use the disjunctive normal form and approximate the boolean conjunction operations with products to construct a novel network architecture. The proposed model can be trained by minimizing an error function and it allows an effective and intuitive initialization which solves the herd-effect problem associated with backpropagation. This leads to state-of-the-art classification accuracy and fast training times. In addition, our model can be jointly optimized with convolutional features in a unified structure leading to state-of-the-art results on computer vision problems with fast convergence rates. A GPU implementation of LDNN with optional convolutional features is also available.
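To illustrate the idea of a disjunctive normal form with conjunctions approximated by products, here is a small forward-pass sketch. It is my reading of the abstract rather than the paper's implementation: each conjunction is taken to be a product of sigmoid half-space units, and the disjunction is obtained through De Morgan's law as 1 minus the product of the negated conjunctions. The weights in the example are hand-picked and hypothetical.

```python
import math

def sigmoid(z):
    """Logistic function, used here as a soft half-space indicator."""
    return 1.0 / (1.0 + math.exp(-z))

def dnf_forward(x, groups):
    """Soft DNF: OR over groups, each group an AND (product) of sigmoid units.

    groups is a list of conjunctions; each conjunction is a list of
    (weights, bias) pairs describing one half-space.
    """
    product_of_negations = 1.0
    for half_spaces in groups:
        conjunction = 1.0
        for weights, bias in half_spaces:
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            conjunction *= sigmoid(activation)  # product approximates AND
        product_of_negations *= 1.0 - conjunction
    return 1.0 - product_of_negations  # De Morgan: OR = NOT(AND(NOT ...))

# Hand-picked, hypothetical weights: the output is high when the 1-D input
# lies in (0, 1) OR in (3, 4), and low elsewhere.
groups = [
    [([20.0], 0.0), ([-20.0], 20.0)],    # x > 0 AND x < 1
    [([20.0], -60.0), ([-20.0], 80.0)],  # x > 3 AND x < 4
]

for value in (0.5, 2.0, 3.5):
    print(value, dnf_forward([value], groups))
```

Because every operation is a differentiable product or sigmoid, a model of this shape could be trained by minimizing an error function, and each group can be initialized to cover a region of the positive class, which is presumably the "intuitive initialization" the abstract refers to.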
In this paper, Rosillo, Giner, De la Fuente and Pino carry out an experimental study of applying SVM to a trading system. Broadly speaking, the system performed satisfactorily during market downturns. It is worth reading for anyone who wants to adapt the methodology.
This article presents a very elaborate framework in which Yang Liu covers the basic aspects of data mining. It comes with an excellent supporting bibliography. Overall, the article frames data mining as a means of analyzing portfolios through parametric and/or non-parametric inductive methods. The layout is excellent and gives significant support to what is being explained. Required reading for anyone who works with credit scoring in general.
A great study from BioDataMining that could be reproduced here in Brazil. One criticism I have of this work is that the attribute selection was, as Daniel Larose would say, a bit black-box, and in particular the Genetic Algorithms approach is unlikely to perform as well as SVM (the authors' point is that the data had a reasonable dimensionality).
As a classification technique, SVM has been widely used to build expert systems that recommend stop orders, as well as in other financial applications; these libraries are a very relevant addition for anyone who wants to work with this kind of technique, regardless of programming language. The implementations range from Java code to libraries for R and WEKA (implemented by Prof. Yasser of the University of Iowa).
This article, written by Phichhang Ou and Hengshan Wang, both of the University of Shanghai, presents a study on applying ten Data Mining techniques to predicting the indices of the Hong Kong stock exchange.
The article's main idea is an experimental, comparative analysis of ten Data Mining techniques (Linear discriminant analysis (LDA), Quadratic discriminant analysis (QDA), K-nearest neighbor classification, Naïve Bayes based on kernel estimation, Logit model, Tree based classification, Neural Network, Bayesian Classification with Gaussian Process, Support Vector Machine (SVM) and Least Squares Support Vector Machine (LS-SVM)), in which the researchers make a series of adjustments to the model for calculating the index's fluctuation over the course of the study.
As a result, the authors concluded that most of the techniques applied achieved a hit rate above 80%, which is a very good sign given the huge number of variables to consider and how difficult the domain is to map.
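For readers unfamiliar with the metric, the hit rate is commonly taken to be the fraction of periods in which the predicted direction of the index matches the actual one. That definition is an assumption on my part (the article may define it slightly differently), and the movements below are invented for illustration:

```python
def hit_rate(actual_moves, predicted_moves):
    """Fraction of periods where the predicted direction equals the actual one."""
    if len(actual_moves) != len(predicted_moves) or not actual_moves:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for a, p in zip(actual_moves, predicted_moves) if a == p)
    return hits / len(actual_moves)

# Hypothetical directions: +1 means the index rose, -1 means it fell.
actual = [1, -1, 1, 1, -1]
predicted = [1, -1, -1, 1, -1]

print("hit rate:", hit_rate(actual, predicted))
```

Since a coin flip would score around 50% on a balanced series, a hit rate above 80% would be a strong result if it holds out of sample.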
Overall, the article is well written and offers a very interesting perspective on mathematical modeling applied to this kind of domain. The only drawbacks are that the cross-validation method could have been described better and, of course, that the mathematical content is a barrier for beginners; but it is nothing that a bit of personal dedication cannot overcome.