Driver behavior profiling: An investigation with different smartphone sensors and machine learning

Driver behavior profiling: An investigation with different smartphone sensors and machine learning

Abstract: Driver behavior impacts traffic safety, fuel/energy consumption and gas emissions. Driver behavior profiling tries to understand and positively impact driver behavior. Usually driver behavior profiling tasks involve automated collection of driving data and application of computer models to generate a classification that characterizes the driver aggressiveness profile. Different sensors and classification methods have been employed in this task; however, low-cost solutions and high performance are still research targets. This paper presents an investigation with different Android smartphone sensors and classification algorithms in order to assess which sensor/method assembly enables classification with higher performance. The results show that specific combinations of sensors and intelligent methods allow classification performance improvement.
Results: We executed all combinations of the 4 MLAs and their configurations described in Table 1 over the 15 data sets described in Section 4.3 using 5 different nf values. We trained, tested, and assessed every evaluation assembly with 15 different random seeds. Finally, we calculated the mean AUC for these executions, grouped them by driving event type, and ranked the 5 best performing assemblies in the boxplot displayed in Fig 6. This figure shows the driving events on the left-hand side and the 5 best evaluation assemblies for each event on the right-hand side, with the best ones at the bottom. The assembly text identification in Fig 6 encodes, in this order: (i) the nf value; (ii) the sensor and its axis (if there is no axis indication, then all sensor axes are used); and (iii) the MLA and its configuration identifier.
Conclusions and future work: In this work we presented a quantitative evaluation of the performances of 4 MLAs (BN, MLP, RF, and SVM) with different configurations applied in the detection of 7 driving event types using data collected from 4 Android smartphone sensors (accelerometer, linear acceleration, magnetometer, and gyroscope). We collected 69 samples of these event types in a real-world experiment with 2 drivers. The start and end times of these events were recorded to serve as the experiment ground-truth. We also compared the performances when applying different sliding time window sizes.
We performed 15 executions with different random seeds of 3865 evaluation assemblies of the form EA = {1:sensor, 2:sensor axis(es), 3:MLA, 4:MLA configuration, 5:number of frames in sliding window}. As a result, we found the top 5 performing assemblies for each driving event type. In the context of our experiment, these results show that (i) bigger window sizes perform better; (ii) the gyroscope and the accelerometer are the best sensors to detect our driving events; (iii) as a general rule, using all sensor axes performs better than using a single one, except for aggressive left turn events; (iv) RF is by far the best performing MLA, followed by MLP; and (v) the performance of the top 35 combinations is both satisfactory and equivalent, varying from 0.980 to 0.999 mean AUC values.
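To make the evaluation-assembly idea concrete, the sketch below runs a small grid of sensor/classifier pairs and ranks them by cross-validated mean AUC. It is an illustration only: the sensor names, feature matrices, and labels are synthetic placeholders, not the paper's data sets or feature-extraction code.

```python
# Hypothetical sketch of the "evaluation assembly" grid described above.
# Sensor names, features, and labels are synthetic stand-ins.
import numpy as np
from itertools import product
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_windows = 600

# Fake per-window statistics (e.g. mean/std per axis) for two sensors; real
# features would be computed from sliding windows over the recorded signals.
sensors = {
    "accelerometer": rng.normal(size=(n_windows, 6)),
    "gyroscope": rng.normal(size=(n_windows, 6)),
}
y = rng.integers(0, 2, size=n_windows)   # 1 = window contains an aggressive event

models = {
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
    "MLP": MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0),
}

results = []
for (sensor, X), (name, model) in product(sensors.items(), models.items()):
    auc = cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()
    results.append((auc, sensor, name))

# Rank assemblies by mean AUC, best first
for auc, sensor, name in sorted(results, reverse=True):
    print(f"{sensor:>14} + {name:<3}  mean AUC = {auc:.3f}")
```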
As future work, we expect to collect a greater number of driving event samples using different vehicles, Android smartphone models, road conditions, weather, and temperature. We also expect to add more MLAs to our evaluation, including those based on fuzzy logic and DTW. Finally, we intend to use the best evaluation assemblies observed in this work to develop an Android smartphone application which can detect driving events in real time and calculate the driver behavior profile.
Driver behavior profiling: An investigation with different smartphone sensors and machine learning

Machine Learning Methods to Predict Diabetes Complications

Machine Learning Methods to Predict Diabetes Complications

Abstract: One of the areas where Artificial Intelligence is having more impact is machine learning, which develops algorithms able to learn patterns and decision rules from data. Machine learning algorithms have been embedded into data mining pipelines, which can combine them with classical statistical strategies, to extract knowledge from data. Within the EU-funded MOSAIC project, a data mining pipeline has been used to derive a set of predictive models of type 2 diabetes mellitus (T2DM) complications based on electronic health record data of nearly one thousand patients. Such a pipeline comprises clinical center profiling, predictive model targeting, predictive model construction, and model validation. After having dealt with missing data by means of random forest (RF) and having applied suitable strategies to handle class imbalance, we have used Logistic Regression with stepwise feature selection to predict the onset of retinopathy, neuropathy, or nephropathy at different time scenarios: at 3, 5, and 7 years from the first visit to the Hospital Center for Diabetes (not from the diagnosis). The considered variables are gender, age, time from diagnosis, body mass index (BMI), glycated hemoglobin (HbA1c), hypertension, and smoking habit. The final models, tailored in accordance with the complications, provided an accuracy of up to 0.838. Different variables were selected for each complication and time scenario, leading to specialized models that are easy to translate to clinical practice.
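The pipeline summarized above can be approximated with off-the-shelf components. The sketch below is a minimal illustration under stated assumptions: synthetic data with the listed variables, random-forest-based imputation via scikit-learn's IterativeImputer, class weighting as a stand-in for the paper's imbalance-handling strategies, and forward stepwise selection wrapped around a logistic model. It is not the MOSAIC pipeline itself.

```python
# Minimal sketch of the pipeline steps: RF-based imputation -> imbalance
# handling -> stepwise logistic regression. Data and settings are assumptions.
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
cols = ["gender", "age", "time_from_diagnosis", "bmi", "hba1c",
        "hypertension", "smoking"]
X = pd.DataFrame(rng.normal(size=(300, len(cols))), columns=cols)
X[X > 2.5] = np.nan                      # inject some missing values
y = rng.integers(0, 2, size=len(X))      # 1 = complication within the horizon

pipeline = Pipeline([
    # Random-forest-based imputation of missing entries
    ("impute", IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=50, random_state=1),
        random_state=1)),
    # Forward stepwise selection around a class-weighted logistic model
    ("select", SequentialFeatureSelector(
        LogisticRegression(class_weight="balanced", max_iter=1000),
        n_features_to_select=4, direction="forward")),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])

acc = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy").mean()
print(f"cross-validated accuracy on synthetic data: {acc:.3f}")
```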

Conclusions: This work shows how data mining and computational methods can be effectively adopted in clinical medicine to derive models that use patient-specific information to predict an outcome of interest. Predictive data mining methods may be applied to the construction of decision models for procedures such as prognosis, diagnosis and treatment planning, which—once evaluated and verified—may be embedded within clinical information systems. Developing predictive models for the onset of chronic microvascular complications in patients suffering from T2DM could contribute to evaluating the relation between exposure to individual factors and the risk of onset of a specific complication, to stratifying the patients’ population in a medical center with respect to this risk, and to developing tools for the support of clinical informed decisions in patients’ treatment.

Machine Learning Methods to Predict Diabetes Complications

Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

Abstract: There is a large literature explaining why AdaBoost is a successful classifier. The literature on AdaBoost focuses on classifier margins and boosting’s interpretation as the optimization of an exponential likelihood function. These existing explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests that proposes that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm which creates what we denote as a spiked-smooth classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that suggests that boosting algorithms for classification require regularization or early stopping and should be limited to low complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping.
Concluding Remarks: AdaBoost is an undeniably successful algorithm and random forests is at least as good, if not better. But AdaBoost is as puzzling as it is successful; it broke the basic rules of statistics by iteratively fitting even noisy data sets until every training set data point was fit without error. Even more puzzling, to statisticians at least, it will continue to iterate an already perfectly fit algorithm, which lowers generalization error. The statistical view of boosting understands AdaBoost to be a stagewise optimization of an exponential loss, which suggests (demands!) regularization of tree size and control on the number of iterations.
In contrast, a random forest is not an optimization; it appears to work best with large trees and as many iterations as possible. It is widely believed that AdaBoost is effective because it is an optimization, while random forests works—well, because it works. Breiman conjectured that “it is my belief that in its later stages AdaBoost is emulating a random forest” (Breiman, 2001). This paper sheds some light on this conjecture by providing a novel intuition, supported by examples, to show how AdaBoost and random forests are successful for the same reason.
A random forests model is a weighted ensemble of interpolating classifiers by construction. Although it is much less evident, we have shown that AdaBoost is also a weighted ensemble of interpolating classifiers. Viewed in this way, AdaBoost is actually a “random” forest of forests. The trees in random forests and the forests in AdaBoost each interpolate the data without error. As the number of iterations increases, the averaged decision surface becomes smoother but nevertheless still interpolates. This is accomplished by whittling down the decision boundary around error points. We hope to have cast doubt on the commonly held belief that the later iterations of AdaBoost only serve to overfit the data. Instead, we argue that these later iterations lead to an “averaging effect”, which causes AdaBoost to behave like a random forest.
A central part of our discussion also focused on the merits of interpolation of the training data, when coupled with averaging. Again, we hope to dispel the commonly held belief that interpolation always leads to overfitting. We have argued instead that fitting the training data in extremely local neighborhoods actually serves to prevent overfitting in the presence of averaging. The local fits serve to prevent noise points from having undue influence over the fit in other areas. Random forests and AdaBoost both achieve this desirable level of local interpolation by fitting deep trees. It is our hope that our emphasis on the “self-averaging” and interpolating aspects of AdaBoost will lead to a broader discussion of this classifier’s success that extends beyond the more traditional emphasis on margins and exponential loss minimization.
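The central claim, that ensembles of large trees interpolate the training data yet still generalize, can be checked informally. The sketch below, assuming a recent scikit-learn and a synthetic data set with 10% label noise, compares a random forest of unpruned trees with AdaBoost over large (depth-8) trees; it illustrates the behaviour discussed above rather than reproducing the paper's experiments.

```python
# Informal check of the "interpolate yet generalize" argument: a random forest
# of unpruned trees and AdaBoost over large trees both fit noisy training data
# (almost) perfectly while keeping reasonable test accuracy. Settings are
# illustrative only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, flip_y=0.1,  # ~10% label noise
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

models = {
    "Random forest (unpruned trees)": RandomForestClassifier(
        n_estimators=500, random_state=0),
    "AdaBoost (depth-8 trees)": AdaBoostClassifier(
        estimator=DecisionTreeClassifier(max_depth=8),
        n_estimators=500, random_state=0),
}

for name, model in models.items():
    model.fit(X_tr, y_tr)
    print(f"{name}: train acc = {model.score(X_tr, y_tr):.3f}, "
          f"test acc = {model.score(X_te, y_te):.3f}")
```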
Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

A comparison between a Machine Learning model and EuroSCORE II in predicting mortality after elective cardiac surgery

Yet another study pitting Machine Learning algorithms against traditional scoring methods, with Machine Learning coming out on top.

A Comparison of a Machine Learning Model with EuroSCORE II in Predicting Mortality after Elective Cardiac Surgery: A Decision Curve Analysis

Abstract: The benefits of cardiac surgery are sometimes difficult to predict and the decision to operate on a given individual is complex. Machine Learning and Decision Curve Analysis (DCA) are recent methods developed to create and evaluate prediction models.

Methods and findings: We conducted a retrospective cohort study using a prospectively collected database from December 2005 to December 2012, from a cardiac surgical center at University Hospital. The different models for predicting in-hospital mortality after elective cardiac surgery, including EuroSCORE II, a logistic regression model, and a machine learning model, were compared by ROC and DCA. Of the 6,520 patients having elective cardiac surgery with cardiopulmonary bypass, 6.3% died. Mean age was 63.4 years (standard deviation 14.4), and mean EuroSCORE II was 3.7 (4.8)%. The area under the ROC curve (95% CI) for the machine learning model (0.795 (0.755–0.834)) was significantly higher than that of EuroSCORE II or the logistic regression model (respectively, 0.737 (0.691–0.783) and 0.742 (0.698–0.785), p < 0.0001). Decision Curve Analysis showed that the machine learning model, in this monocentric study, has a greater benefit whatever the probability threshold.
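For readers unfamiliar with DCA, the comparison rests on the standard net-benefit formula NB(t) = TP/n - FP/n * t/(1 - t), evaluated across probability thresholds t. The sketch below computes it for two hypothetical risk models on synthetic data with roughly the study's 6.3% mortality rate; it illustrates the method and is not a reanalysis of the cohort.

```python
# Minimal sketch of Decision Curve Analysis: net benefit of a model at
# threshold t is TP/n - FP/n * t/(1 - t). The data and the two "models"
# below are synthetic stand-ins, not the study's cohort or models.
import numpy as np

def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating patients whose predicted risk exceeds threshold."""
    treat = y_prob >= threshold
    n = len(y_true)
    tp = np.sum(treat & (y_true == 1))
    fp = np.sum(treat & (y_true == 0))
    return tp / n - fp / n * threshold / (1 - threshold)

rng = np.random.default_rng(2)
n = 5000
y = rng.binomial(1, 0.063, size=n)                  # ~6.3% observed mortality
good_model = np.clip(0.063 + 0.15 * y + rng.normal(0, 0.05, n), 0.001, 0.999)
weak_model = np.clip(0.063 + 0.05 * y + rng.normal(0, 0.05, n), 0.001, 0.999)

for t in (0.05, 0.10, 0.20):
    print(f"threshold {t:.2f}: "
          f"good model NB = {net_benefit(y, good_model, t):.4f}, "
          f"weak model NB = {net_benefit(y, weak_model, t):.4f}, "
          f"treat-all NB = {net_benefit(y, np.ones(n), t):.4f}")
```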

Conclusions: According to ROC and DCA, the machine learning model is more accurate in predicting mortality after elective cardiac surgery than EuroSCORE II. These results confirm the use of machine learning methods in the field of medical prediction.

A comparison between a Machine Learning model and EuroSCORE II in predicting mortality after elective cardiac surgery

Predicting economic recessions using Machine Learning algorithms

A very recent paper in which the authors got the crisis wrong only with respect to the year, showing the potential of Random Forests.


Predicting Economic Recessions Using Machine Learning Algorithms – Rickard Nyman and Paul Ormerod

Abstract: Even at the beginning of 2008, the economic recession of 2008/09 was not being predicted by the economic forecasting community. The failure to predict recessions is a persistent theme in economic forecasting. The Survey of Professional Forecasters (SPF) provides data on predictions made for the growth of total output, GDP, in the United States for one, two, three and four quarters ahead, going back to the end of the 1960s. Over a three quarters ahead horizon, the mean prediction made for GDP growth has never been negative over this period. The correlation between the mean SPF three quarters ahead forecast and the data is very low, and over the most recent 25 years is not significantly different from zero. Here, we show that the machine learning technique of random forests has the potential to give early warning of recessions. We use a small set of explanatory variables from financial markets which would have been available to a forecaster at the time of making the forecast. We train the algorithm over the 1970Q2-1990Q1 period, and make predictions one, three and six quarters ahead. We then re-train over 1970Q2-1990Q2 and make a further set of predictions, and so on. We did not attempt any optimisation of predictions, using only the default input parameters to the algorithm as downloaded in the R package. We compare the predictions made from 1990 to the present with the actual data. One quarter ahead, the algorithm is not able to improve on the SPF predictions. Three and six quarters ahead, the correlations between actual and predicted are low, but they are very significantly different from zero. Although the timing is slightly wrong, a serious downturn in the first half of 2009 could have been predicted six quarters ahead in late 2007. The algorithm never predicts a recession when one did not occur. We obtain even stronger results with random forest machine learning techniques in the case of the United Kingdom.
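The expanding-window scheme described in the abstract (train on all quarters up to a given origin, forecast h quarters ahead, extend the window by one quarter, retrain) can be sketched as follows. The data are simulated and the variable names are placeholders; only the structure of the procedure is shown, using a random forest with default parameters in the spirit of the paper.

```python
# Sketch of the expanding-window forecasting scheme: at each origin quarter q,
# train on the pairs (features at t, GDP growth at t+h) already observed, then
# predict growth h quarters ahead. Data are simulated placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_quarters = 188                          # roughly 1970Q2 .. 2016
X = rng.normal(size=(n_quarters, 4))      # e.g. spreads, returns, rates (simulated)
gdp_growth = 0.5 + X @ np.array([0.3, -0.2, 0.1, 0.05]) + rng.normal(0, 0.4, n_quarters)

horizon = 3                               # quarters ahead
start = 80                                # first forecast origin (~1990)
preds, actuals = [], []
for q in range(start, n_quarters - horizon):
    model = RandomForestRegressor(random_state=0)       # default parameters
    # only training pairs whose target (t + horizon) is observed by origin q
    model.fit(X[:q - horizon + 1], gdp_growth[horizon:q + 1])
    preds.append(model.predict(X[q:q + 1])[0])
    actuals.append(gdp_growth[q + horizon])

corr = np.corrcoef(preds, actuals)[0, 1]
print(f"{horizon}-quarter-ahead correlation on simulated data: {corr:.2f}")
```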

Conclusions: We have tried, as far as it is possible, to replicate an actual forecasting situation starting for the United States in 1990Q2 and moving forward a quarter at a time through to 2016. We use a small number of lags on a small number of financial variables in order to make predictions. In terms of one step ahead predictions of real GDP growth, we have not been able to improve upon the mean forecasts made by the Survey of Professional Forecasters. However, even just three quarters ahead, the SPF track record is very poor. A regression of actual GDP growth on the mean prediction made three quarters previously has zero explanatory power, and the SPF predictions never indicated a single quarter of negative growth. The random forest approach improves very considerably on this. Even more strikingly, over a six period ahead horizon, the random forest approach would have predicted, during the winter of 2007/08, a severe recession in the United States during 2009, ending in 2009Q4. Again to emphasise, we have not attempted in any way to optimise these results in an ex post manner. We use only the default values of the input parameters into the machine learning algorithm, and use only a small number of explanatory variables. We obtain qualitatively similar results for the UK, though the predictive power of the random forest algorithm is even better than it is for the United States. As Ormerod and Mounfield (2000) show, using modern signal processing techniques, the time series GDP growth data is dominated by noise rather than by signal. So there is almost certainly a quite restrictive upper bound on the degree of accuracy of prediction which can be achieved. However, machine learning techniques do seem to have considerable promise in extending useful forecasting horizons and providing better information to policy makers over such horizons.

Predicting economic recessions using Machine Learning algorithms

Using Random Forests for the bike-sharing problem in Seattle

It hardly needs saying anymore that the future of cities will run through data analysis and, above all, through the application of intelligence to solve taxpayers' problems.

In this specific case, the problem is that Seattle's urban rail system makes 500 bicycles available at its stations, and the supply of these bicycles must be matched to the demand at each station.

Here is the original post, and the approach used:

“From clustering, I discovered two distinct ecosystems of bike stations—Seattle, and the University District—based on traffic flows from station to station,” Sadler said. “It turned out that having separate models for each lent itself to much better predictions.”

Sadler modeled hourly supply and hourly demand separately for each of the two ecosystems, summing the result to predict the change in current bike count, based on the current bike count data from the Pronto API. To do this, he used multiple random forest algorithms, each tuned for a specific task.

“Having groups of smaller random forests worked much better than having a single large random forest try to predict everything,” Sadler said. “This is probably due to the different ecosystems having vastly different signals and different types of noise.”

The model—which is actually two models (a random forest for each ecosystem), whose branches are themselves composed of additional random forests—draws from historical demand based on the current season, current hour, and current weekend. It also uses meta information about each station, such as elevation, size, and proximity to other stations. The model leverages this information to discover signals and patterns in ride usage, then predicts based on the signal it finds.
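A rough sketch of this "groups of smaller random forests" arrangement is shown below: one supply regressor and one demand regressor per ecosystem, with the predicted net change in bike count obtained by subtracting the two. The feature names and data are hypothetical stand-ins, not the Pronto data or Sadler's actual models.

```python
# Sketch of per-ecosystem supply/demand random forests whose predictions are
# combined into a net change in bike count. Data and features are hypothetical.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(4)
n = 2000
features = pd.DataFrame({
    "hour": rng.integers(0, 24, n),
    "is_weekend": rng.integers(0, 2, n),
    "season": rng.integers(0, 4, n),
    "elevation": rng.normal(50, 20, n),
    "ecosystem": rng.choice(["seattle", "university_district"], n),
})
arrivals = rng.poisson(3, n).astype(float)    # hourly supply (bikes arriving)
departures = rng.poisson(3, n).astype(float)  # hourly demand (bikes leaving)

# One pair of random forests per ecosystem
models = {}
for eco, idx in features.groupby("ecosystem").groups.items():
    X_eco = features.loc[idx].drop(columns="ecosystem")
    models[eco] = {
        "supply": RandomForestRegressor(n_estimators=100, random_state=0).fit(X_eco, arrivals[idx]),
        "demand": RandomForestRegressor(n_estimators=100, random_state=0).fit(X_eco, departures[idx]),
    }

def predict_net_change(eco, row):
    """Predicted change in bike count for one station-hour in a given ecosystem."""
    m = models[eco]
    return float(m["supply"].predict(row)[0] - m["demand"].predict(row)[0])

sample = features.drop(columns="ecosystem").iloc[[0]]
print("predicted net change:", predict_net_change(features["ecosystem"].iloc[0], sample))
```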

Using Random Forests for the bike-sharing problem in Seattle

An explanation of Deep Neural Decision Forests

Much of the hype around Deep Learning has come from computer vision problems.

However, an approach that blends Deep Learning with Decision Trees (they actually call them forests, since the same paradigm as Random Forests is applied) proves, at least in the linked paper, to be quite robust and applicable to structured Machine Learning problems (i.e., problems in which the data are modeled in transactional or relational form).

The main idea is that, at the end of the activation layer of a neural network (which may or may not be deep), the output (or the object to be predicted) is routed to a particular side of the tree.
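A toy forward pass of that idea is sketched below: sigmoid outputs of a small routing layer act as soft left/right decisions at each internal node of a depth-2 tree, and the prediction is a mixture of leaf class distributions weighted by the probability of reaching each leaf. This is a simplification of the model in the paper, which also learns the leaf distributions through an alternating optimization.

```python
# Toy forward pass of a soft decision tree with neural routing: sigmoid outputs
# decide P(go right) at each internal node, and the prediction mixes the leaf
# class distributions by the probability of reaching each leaf. Simplified
# sketch; weights and leaf distributions here are random placeholders.
import numpy as np

rng = np.random.default_rng(5)
n_features, n_classes = 8, 3
depth = 2
n_internal = 2 ** depth - 1            # 3 decision nodes
n_leaves = 2 ** depth                  # 4 leaves

W = rng.normal(size=(n_internal, n_features))                 # one routing unit per node
b = rng.normal(size=n_internal)
leaf_dist = rng.dirichlet(np.ones(n_classes), size=n_leaves)  # class distribution per leaf

def predict_proba(x):
    d = 1.0 / (1.0 + np.exp(-(W @ x + b)))      # P(go right) at each internal node
    # Leaf reach probabilities for a depth-2 tree: root is node 0, children are 1 and 2.
    reach = np.array([
        (1 - d[0]) * (1 - d[1]),   # left, left
        (1 - d[0]) * d[1],         # left, right
        d[0] * (1 - d[2]),         # right, left
        d[0] * d[2],               # right, right
    ])
    return reach @ leaf_dist                    # mixture of leaf class distributions

x = rng.normal(size=n_features)
print("class probabilities:", np.round(predict_proba(x), 3))
```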


This not only opens up a huge space for multi-class classification problems, but also makes it possible to handle models with a higher tolerance for latency (i.e., models that do not need constant updates) using a more robust form of decision-making along the entire prediction chain.

An explanation of Deep Neural Decision Forests

The Almighty Random Forest

A great post on the potential of Random Forests, complete with a Psalm at the end. The post also presents two references on the use of Random Forests.

The Random Forest™ is my shepherd; I shall not want.
He makes me watch the mean squared error decrease rapidly.
He leads me beside classification problems.
He restores my soul.
He leads me in paths of the power of ensembles
for his name’s sake.

Even though I walk through the valley of the curse of dimensionality,
I will fear no overfitting,
for you are with me;
your bootstrap and your randomness,
they comfort me.

You prepare a prediction before me
in the presence of complex interactions;
you anoint me data scientist;
my wallet overflows.
Surely goodness of fit and money shall follow me
all the days of my life,
and I shall use Random Forests™
forever.

The Almighty Random Forest