Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such pipeline comprises clinical center profiling, predictive model targeting, predictive model construction and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy, at different time scenarios, at 3, 5, and 7 years from the first visit at the Hospital Center for Diabetes (not from the diagnosis). Considered variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. Final models, tailored in accordance with the complications, provided an accuracy up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models easy to translate to the clinical practice.
Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.
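The abstract mentions "suitable strategies to handle class imbalance" without naming them. As an illustration only, here is a minimal sketch of one common strategy, random oversampling of the minority class. The function name `random_oversample` and the toy data are assumptions for this example, not something taken from the paper.

```python
import random

def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until every
    class has as many rows as the largest one.

    Illustrative assumption: the paper does not say which imbalance
    strategy it used; this is just one common option.
    """
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)

    target = max(len(group) for group in by_class.values())
    out_rows, out_labels = [], []
    for label, group in by_class.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for row in group + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels

# Toy data: 8 patients without the complication, 2 with it.
rows = [[i, i * 0.1] for i in range(10)]
labels = [0] * 8 + [1] * 2
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(0), balanced_labels.count(1))
```

Oversampling like this should be applied to the training split only. Duplicating rows before splitting would leak copies of the same patient into the validation data and inflate the reported accuracy.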
Another study pitting a few Machine Learning algorithms against traditional scoring methods, with Machine Learning coming out ahead.
Abstract: The benefits of cardiac surgery are sometimes difficult to predict and the decision to operate on a given individual is complex. Machine Learning and Decision Curve Analysis (DCA) are recent methods developed to create and evaluate prediction models.
Methods and finding: We conducted a retrospective cohort study using a prospective collected database from December 2005 to December 2012, from a cardiac surgical center at University Hospital. The different models of prediction of mortality in-hospital after elective cardiac surgery, including EuroSCORE II, a logistic regression model and a machine learning model, were compared by ROC and DCA. Of the 6,520 patients having elective cardiac surgery with cardiopulmonary bypass, 6.3% died. Mean age was 63.4 years old (standard deviation 14.4), and mean EuroSCORE II was 3.7 (4.8) %. The area under ROC curve (IC95%) for the machine learning model (0.795 (0.755–0.834)) was significantly higher than EuroSCORE II or the logistic regression model (respectively, 0.737 (0.691–0.783) and 0.742 (0.698–0.785), p < 0.0001). Decision Curve Analysis showed that the machine learning model, in this monocentric study, has a greater benefit whatever the probability threshold.
Conclusions: According to ROC and DCA, machine learning model is more accurate in predicting mortality after elective cardiac surgery than EuroSCORE II. These results confirm the use of machine learning methods in the field of medical prediction.
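For readers who have not seen Decision Curve Analysis, the quantity it plots is the net benefit at a probability threshold pt: true positives per patient minus false positives per patient weighted by pt / (1 - pt). The sketch below computes it on made-up toy data. The function names and the numbers are illustrative assumptions and have nothing to do with the study's cohort.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of acting on every patient whose predicted risk is
    at or above `threshold` (the standard decision-curve quantity)."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # A false positive costs pt / (1 - pt) of what a true positive gains.
    return tp / n - (fp / n) * threshold / (1.0 - threshold)

def net_benefit_treat_all(y_true, threshold):
    """Reference curve: treat everyone regardless of the model."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)

# Toy outcomes (1 = event) and toy predicted risks, invented for this sketch.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
y_prob = [0.9, 0.2, 0.1, 0.7, 0.4, 0.05, 0.3, 0.6, 0.15, 0.25]

for pt in (0.1, 0.3, 0.5):
    print(pt, net_benefit(y_true, y_prob, pt), net_benefit_treat_all(y_true, pt))
```

A model is useful at a given threshold when its net benefit is above both the "treat all" curve and zero (the "treat none" curve). That is the comparison the abstract refers to when it says the machine learning model has a greater benefit whatever the threshold.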
A very recent paper in which the authors miss the crisis only in its timing (the year), which shows the potential of Random Forests.
Abstract: Even at the beginning of 2008, the economic recession of 2008/09 was not being predicted by the economic forecasting community. The failure to predict recessions is a persistent theme in economic forecasting. The Survey of Professional Forecasters (SPF) provides data on predictions made for the growth of total output, GDP, in the United States for one, two, three and four quarters ahead, going back to the end of the 1960s. Over a three quarters ahead horizon, the mean prediction made for GDP growth has never been negative over this period. The correlation between the mean SPF three quarters ahead forecast and the data is very low, and over the most recent 25 years is not significantly different from zero. Here, we show that the machine learning technique of random forests has the potential to give early warning of recessions. We use a small set of explanatory variables from financial markets which would have been available to a forecaster at the time of making the forecast. We train the algorithm over the 1970Q2-1990Q1 period, and make predictions one, three and six quarters ahead. We then re-train over 1970Q2-1990Q2 and make a further set of predictions, and so on. We did not attempt any optimisation of predictions, using only the default input parameters to the algorithm we downloaded in the package R. We compare the predictions made from 1990 to the present with the actual data. One quarter ahead, the algorithm is not able to improve on the SPF predictions. Three and six quarters ahead, the correlations between actual and predicted are low, but they are very significantly different from zero. Although the timing is slightly wrong, a serious downturn in the first half of 2009 could have been predicted six quarters ahead in late 2007. The algorithm never predicts a recession when one did not occur. We obtain even stronger results with random forest machine learning techniques in the case of the United Kingdom.
Conclusions: We have tried, as far as it is possible, to replicate an actual forecasting situation starting for the United States in 1990Q2 and moving forward a quarter at a time through to 2016. We use a small number of lags on a small number of financial variables in order to make predictions. In terms of one step ahead predictions of real GDP growth, we have not been able to improve upon the mean forecasts made by the Survey of Professional Forecasters. However, even just three quarters ahead, the SPF track record is very poor. A regression of actual GDP growth on the mean prediction made three quarters previously has zero explanatory power, and the SPF predictions never indicated a single quarter of negative growth. The random forest approach improves very considerably on this. Even more strikingly, over a six period ahead horizon, the random forest approach would have predicted, during the winter of 2007/08, a severe recession in the United States during 2009, ending in 2009Q4. Again to emphasise, we have not attempted in any way to optimise these results in an ex post manner. We use only the default values of the input parameters into the machine learning algorithm, and use only a small number of explanatory variables. We obtain qualitatively similar results for the UK, though the predictive power of the random forest algorithm is even better than it is for the United States. As Ormerod and Mounfield (2000) show, using modern signal processing techniques, the time series GDP growth data is dominated by noise rather than by signal. So there is almost certainly a quite restrictive upper bound on the degree of accuracy of prediction which can be achieved. However, machine learning techniques do seem to have considerable promise in extending useful forecasting horizons and providing better information to policy makers over such horizons.
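The "train, predict, extend the window by one quarter, re-train" procedure described in the abstract is an expanding-window evaluation. Below is a minimal sketch of that loop. The authors used random forests in R on lagged financial variables. Here the model is a deliberately trivial placeholder (the mean of the last four observations) and the series is invented, so only the loop structure should be read as related to the paper.

```python
def expanding_window_forecasts(series, first_train_size, horizon, fit, predict):
    """Re-fit on all data up to each cut-off, forecast `horizon` steps
    ahead, then extend the training window by one observation."""
    results = []
    for end in range(first_train_size, len(series) - horizon + 1):
        train = series[:end]                # only data available at forecast time
        model = fit(train)
        target_index = end + horizon - 1    # position of the quarter being forecast
        forecast = predict(model, train, horizon)
        results.append((target_index, forecast, series[target_index]))
    return results

# Placeholder model (an assumption for this sketch, not the paper's method):
# forecast the mean of the last four observed quarters.
def fit_mean(train):
    recent = train[-4:]
    return sum(recent) / len(recent)

def predict_mean(model, train, horizon):
    return model

# Invented quarterly growth figures, for illustration only.
growth = [0.5, 0.7, 0.6, 0.4, 0.8, -0.2, -1.0, 0.3, 0.6, 0.5]

results = expanding_window_forecasts(growth, 4, 3, fit_mean, predict_mean)
for index, forecast, actual in results:
    print(index, round(forecast, 3), actual)
```

The important property is that `train` never contains anything at or after the quarter being forecast, which is what makes the exercise comparable to a real forecasting situation.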
It goes without saying by now that the future of cities will run through data analysis and, above all, through applying intelligence to solve taxpayers' problems.
In this specific case, the problem was that Seattle's urban rail system makes 500 bicycles available at its stations, and the supply of those bicycles has to match the demand at each station.
Here is the original post and the approach used:
“From clustering, I discovered two distinct ecosystems of bike stations—Seattle, and the University District—based on traffic flows from station to station,” Sadler said. “It turned out that having separate models for each lent itself to much better predictions.”
Sadler modeled hourly supply and hourly demand separately for each of the two ecosystems, summing the result to predict the change in current bike count, based on the current bike count data from the Pronto API. To do this, he used multiple random forest algorithms, each tuned for a specific task.
“Having groups of smaller random forests worked much better than having a single large random forest try to predict everything,” Sadler said. “This is probably due to the different ecosystems having vastly different signals and different types of noise.”
The model—which is actually two models (a random forest for each ecosystem), of which the branches of each are composed of additional random forests—draws from historical demand based on the current season, current hour, and current weekend. It also uses meta information about each station, such as elevation, size, and proximity to other stations. The model leverages this information to discover signals and patterns in ride usage, then predicts based on the signal it finds.
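To make the structure concrete, here is a minimal sketch of the "separate supply and demand models per ecosystem" idea. Everything in it is an assumption made for illustration: the ecosystem keys, the `is_peak` feature, and the plain functions standing in for the tuned random forests. It also reads "supply" as bikes arriving and "demand" as bikes leaving in the next hour, which the post does not spell out.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Features = Dict[str, float]
Model = Callable[[Features], float]

@dataclass
class EcosystemModels:
    supply: Model  # expected bikes arriving at the station in the next hour
    demand: Model  # expected bikes leaving the station in the next hour

def predict_bike_count(current_count: float, ecosystem: str,
                       features: Features,
                       models: Dict[str, EcosystemModels]) -> float:
    """Combine the ecosystem's supply and demand predictions into an
    expected change, then apply it to the current bike count."""
    pair = models[ecosystem]
    change = pair.supply(features) - pair.demand(features)
    return max(0.0, current_count + change)  # a station cannot go below zero

# Stand-in models (hypothetical): in the post each of these would be a
# random forest trained on historical data for that ecosystem.
models = {
    "seattle": EcosystemModels(
        supply=lambda f: 2.0 + 0.5 * f["is_peak"],
        demand=lambda f: 3.0 + 1.5 * f["is_peak"],
    ),
    "university_district": EcosystemModels(
        supply=lambda f: 1.0,
        demand=lambda f: 2.0 * f["is_peak"],
    ),
}

print(predict_bike_count(8, "seattle", {"is_peak": 1.0}, models))
print(predict_bike_count(3, "university_district", {"is_peak": 0.0}, models))
```

Keeping one pair of models per ecosystem mirrors the point in the quote that the two groups of stations carry different signals and different noise, so each model only has to learn one of them.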
Much of the hype around Deep Learning comes from computer vision problems.
However, an approach that mixes Deep Learning with Decision Trees (they actually call them forests, since the same paradigm as Random Forests is applied) proves, at least in the linked paper, to be quite robust and applicable to structured Machine Learning problems (a.k.a. problems in which the data is modeled in a transactional or relational form).
The main idea is that at the end of a neural network's activation layer (the network may be deep or not), the output (or the object to be predicted) is routed to one side of the tree.
This not only opens up a huge space for multi-class classification problems, but also makes it possible to handle models with higher latency (i.e. models that do not need constant updates) with a more robust form of decision-making along the entire prediction chain.
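A minimal sketch of that routing idea, assuming the soft-routing scheme commonly used in this kind of model: a sigmoid turns each network output into the probability of going left at a tree node, and the prediction is the leaf class distributions weighted by the probability of reaching each leaf. The depth-2 tree, the scores and the leaf distributions below are invented for illustration and are not taken from the linked paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def leaf_probabilities(node_scores):
    """Probability of reaching each of the 4 leaves of a depth-2 tree.

    node_scores = [root, left child, right child], one network output
    per decision node; sigmoid(score) is the probability of going left.
    """
    d_root, d_left, d_right = (sigmoid(s) for s in node_scores)
    return [
        d_root * d_left,               # left, then left
        d_root * (1 - d_left),         # left, then right
        (1 - d_root) * d_right,        # right, then left
        (1 - d_root) * (1 - d_right),  # right, then right
    ]

def predict_proba(node_scores, leaf_distributions):
    """Class probabilities: leaf distributions weighted by routing probability."""
    mu = leaf_probabilities(node_scores)
    n_classes = len(leaf_distributions[0])
    return [
        sum(m * leaf[c] for m, leaf in zip(mu, leaf_distributions))
        for c in range(n_classes)
    ]

# Invented leaf class distributions for a 3-class problem.
leaves = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

# With all scores at zero every split is 50/50, so each leaf gets weight 0.25.
print(predict_proba([0.0, 0.0, 0.0], leaves))
# Scores such as a network might emit for one input (hypothetical values).
print(predict_proba([2.0, -1.0, 0.5], leaves))
```

Because every step is a product or sum of differentiable terms, the routing can be trained together with the network by gradient descent, and adding classes only means adding entries to the leaf distributions.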
The Random Forest™ is my shepherd; I shall not want.
He makes me watch the mean squared error decrease rapidly.
He leads me beside classification problems.
He restores my soul.
He leads me in paths of the power of ensembles
for his name’s sake.
Even though I walk through the valley of the curse of dimensionality,
I will fear no overfitting,
for you are with me;
your bootstrap and your randomness,
they comfort me.
You prepare a prediction before me
in the presence of complex interactions;
you anoint me data scientist;
my wallet overflows.
Surely goodness of fit and money shall follow me
all the days of my life,
and I shall use Random Forests™