Tunability, Hyperparameters and a simple Initial Assessment Strategy

Most of the time we completely rely in the default parameters of Machine Learning Algorithm and this fact can hide that sometimes we can make wrong statements about the ‘efficiency’ of some algorithm.

The paper called Tunability: Importance of Hyperparameters of Machine Learning Algorithms from Philipp Probst, Anne-Laure Boulesteix, Bernd Bischl in Journal of Machine Learning Research (JMLR) bring some light in this subject. This is the abstract:

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

Probst, Boulesteix, Bischl in Tunability: Importance of Hyperparameters of Machine Learning Algorithms

I recognize that the abstract sounds not so appealing, but the most important part of the text for sure it’s related in one table and one graph about the Tunability, i.e. how tuneable one parameter is according the other default values.

As we can observe in the columns Def.P (package defaults) and Def.O (optimal defaults) even in some vanilla algorithms we have some big differences between them, specially in Part, XGBoost and Ranger.

If we check the variance across this hyper parameters, the results indicates that the problem can be worse that we imagined:

As we can see in a first sight there’s a huge variance in terms of AUC when we talk about the default parameters.

Checking these experiments two big questions arises:

  1. How much inefficiency it’s included in some process of algorithm assessment and selection because for the ‘initial model‘ (that most of the times becomes the last model) because of relying in the default values? and;
  2. Because of this misleading path to consider some algorithm based purely in defaults how many ML implementations out there is underperforming and wasting research/corporate resources (e.g. people’s time, computational time, money in cloud providers, etc…)?

Initial Assessment Strategy

A simple strategy that I use for this particular purpose it’s to use a two-phase hyperparameter search strategy where in the first phase I use to make a small knockout round with all algorithms using Random Search to grab the top 2 or 3 models, and in the second phase I use Grid Search where most of the time I explore a large number of parameters.

According the number of samples that I have in the Test and Validation sets, I usually let the search for at least 24 hours in some local machine or in some cloud provider.

I do that because with this ‘initial‘ assessment we can have a better idea which algorithm will learn more according the data that I have considering dimensionality, selectivity of the columns or complexity of the word embeddings in NLP tasks, data volume and so on.


The paper makes a great job to expose the variance in terms of AUC using default parameters for the practitioners and can give us a better heuristic path in terms to know which parameters are most tunable and with this information in hands we can perform better search strategies to have better implementations of Machine Learning Algorithms.

Tunability, Hyperparameters and a simple Initial Assessment Strategy

Multi Armed Bandit concept

This is the best no-tech concept available in internet.

By Datagenetics

Imagine you are standing in front of a row of slot machines, and wish to gamble. You have a bag full of coins. Your goal is to maximize the return on your investment. The problem is that you don’t know the payout percentages of any of the machines. Each has a, potentially, different expected return.

What is your strategy?

You could select one machine at random, and invest all your coins there, but what happens if you selected a poor payout machine? You could have done better.

You could spread your money out and divide it equally (or randomly) between all the different machines. However, if you did this, you’d spend some time investing in poorer payout machines and ‘wasting’ coins that could have be inserted into better machines. The benefit of this strategy, however, is diversification, and you’d be spreading your risk over many machines; you’re never going to be playing the best machine all the time, but you’re never going to be playing the worst all the time either!

Maybe a hybrid strategy is better? In a hybrid solution you could initially spend some time experimenting to estimate the payouts of the machines then, in an exploitation phase, you could put all your future investment into the best paying machine you’d discovered. The more you research, the more you learn about the machines (getting feedback on their individual payout percentages).

However, what is the optimal hybrid strategy? You could spend a long time researching the machines (increasing your confidence), and the longer you spend, certainly, the more accurate your prediction of the best machine would become. However, if you spend too long on research, you might not have many coins left to properly leverage this knowledge (and you’d have wasted many coins on lots of machines that are poor payers). Conversely, if you spend too short a time on research, your estimate for which is the best machine could be bogus (and if you are unlucky, you could become victim to a streak of ‘good-luck’ from a poor paying machine that tricks you into thinking it’s the best machine).

If you are playing a machine that is “good enough”, is it worth the risk of attempting to see if another machine is “better” (experiments to determine this might not be worth the effort).

Multi Armed Bandit concept

Algorithm over Regulations (?)

This scene is the best thing that can I relate to this particular topic.

“But, the bells have already been rung and they’ve heard it. Out in the dark. Among the stars. Ding dong, the God is dead. The bells, cannot be unrung! He’s hungry. He’s found us. And He’s coming!

Ding, ding, ding, ding, ding…”

(Hint Fellas: This is a great time to be not evil and check your models to avoid any kind of discrimination over their current or potential customers.)

European Union regulations on algorithmic decision-making and a “right to explanation – By Bryce Goodman, Seth Flaxman

Abstract: We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on userlevel predictors) which “significantly affect” users. The law will also effectively create a “right to explanation,” whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.

Conclusion: While the GDPR presents a number of problems for current applications in machine learning they are, we believe, good problems to have. The challenges described in this paper emphasize the importance of work that ensures that algorithms are not merely efficient, but transparent and fair. Research is underway in pursuit of rendering algorithms more amenable to ex post and ex ante inspection [11, 31, 20]. Furthermore, a number of recent studies have attempted to tackle the issue of discrimination within algorithms by introducing tools to both identify [5, 29] and rectify [9, 16, 32, 6, 12, 14] cases of unwanted bias. It remains to be seen whether these techniques are adopted in practice. One silver lining of this research is to show that, for certain types of algorithmic profiling, it is possible to both identify and implement interventions to correct for discrimination. This is in contrast to cases where discrimination arises from human judgment. The role of extraneous and ethically inappropriate factors in human decision making is well documented (e.g., [30, 10, 1]), and discriminatory decision making is pervasive in many of the sectors where algorithmic profiling might be introduced (e.g. [19, 7]). We believe that, properly applied, algorithms can not only make more accurate predictions, but offer increased transparency and fairness over their human counterparts (cf. [23]). Above all else, the GDPR is a vital acknowledgement that, when algorithms are deployed in society, few if any decisions are purely “technical”. Rather, the ethical design of algorithms requires coordination between technical and philosophical resources of the highest caliber. A start has been made, but there is far to go. And, with less than two years until the GDPR takes effect, the clock is ticking.

European Union regulations on algorithmic decision-making and a “right to explanation”


Algorithm over Regulations (?)

Loan products and Credit Scoring Methods by Commercial Banks

Abstract – This study describes the loan products offered by the commercial banks and credit scoring techniques used for classifying risks and granting credit to the applicants in India. The loan products offered by commercial banks are: Housing loans, Personal loans, Business loan, Education loans, Vehicle loans etc. All the loan products are categorized as secures and unsecured loans. Credit scoring techniques used for both secured as well as unsecured loans are broadly divided into two categories as Advanced Statistical Methods and Traditional Statistical Methods

Conclusion: In a new or emerging market, the operational, technical, business and cultural issues should be considered with the implementation of the credit scoring models for retail loan products. The operational issues relate to the use of the model and it is imperative that the staff and the management of the bank understand the purpose of the model. Application scoring models should be used for making credit decisions on new applications and behavioral models for retail loan products to supervise existing borrowers for limit expansion or for marketing of new products. The technical issues relate to the development of proper infrastructure, maintenance of historical data and software needed to build a credit scoring model for retail loan products within the bank. The business issues relate to whether the soundness and safety of the banks could be achieved through the adoption of the quantitative credit decision models, which would send a positive impact in the banking sector. The cultural issues relate to making credit irrespective of race, colour, sex, religion, marital status, age or ethnic origin. Further, the models have to be validated so as to ensure that the model performance is compatible in meeting the business as well as the regulatory requirements. Thus, the above issues have to be considered while developing and implementing credit scoring models for retail loan products within a new or emerging markets.

Loan products and Credit Scoring Methods by Commercial Banks

Improving the Forecasts of European Regional Banks’ Profitability with Machine Learning Algorithms

Abstract: Regional banks as savings and cooperative banks are widespread in continental Europe. In the aftermath of the financial crisis, however, they had problems keeping their profitability which is an important quantitative indicator for the health of a bank and the banking sector overall. We use a large data set of bank-level balance sheet items and regional economic variables to forecast profitability for about 2,000 regional banks. Machine learning algorithms are able to beat traditional estimators as ordinary least squares as well as autoregressive models in forecasting performance.

Conclusion: In the aftermath of the financial crisis regional banks had problems keeping up their profitability. Banks’ profitability is an important indicator for the stability of the banking sector. We use a data set of bank-level balance sheet items and regional economic variables to forecast profitability. For the 2,000 savings and cooperative banks from eight European countries and the 2000-2015 time period, we found that machine learning algorithms are able to beat traditional estimators as ordinary least squares as well as autoregressive models in forecasting performance. Therefore, our paper is in line with the literature on machine learning models and their superior forecasting performance (Khandani et al., 2010; Butaru et al., 2016; Fitzpatrick & Mues, 2016). The performance of the machine learning algorithms was particularly well during the European debt crisis which points out the importance of our forecasting exercise as during this time policy makers’ interest in banks’ profitability was enhanced as further potential rescue packages for banks could deteriorate fiscal stability. Policy makers and, especially, regulators should therefore use these algorithms instead of traditional estimators in combination with their even larger regulatory data sets in regard to size and frequency to forecast banks’ profitability or other balance sheet items of interest.

Improving the Forecasts of European Regional Banks_ Profi tability with Machine Learning Algorithms

Improving the Forecasts of European Regional Banks’ Profitability with Machine Learning Algorithms

Accelerating the XGBoost algorithm using GPU computing

A fronteira final em relação ao uso com GPU de um dos mais poderosos algoritmos de todos os tempos está aqui.

Abstract: We present a CUDA based implementation of a decision tree construction algorithm within the gradient boosting library XGBoost. The tree construction algorithm is executed entirely on the GPU and shows high performance with a variety of datasets and settings, including sparse input matrices. Individual boosting iterations are parallelized, combining two approaches. An interleaved approach is used for shallow trees, switching to a more conventional radix sort based approach for larger depths. We show speedups of between 3-6x using a Titan X compared to a 4 core i7 CPU, and 1.2x using a Titan X compared to 2x Xeon CPUs (24 cores). We show that it is possible to process the Higgs dataset (10 million instances, 28 features) entirely within GPU memory. The algorithm is made available as a plug-in within the XGBoost library and fully supports all XGBoost features including classification, regression and ranking tasks. 

Accelerating the XGBoost algorithm using GPU computing

Porque o xGBoost ganha todas as competições de Machine Learning

Uma (longa e) boa resposta está nesta tese de Didrik Nielsen.


Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling.
It has shown remarkable results for a vast array of problems.
For many years, MART has been the tree boosting method of choice.
More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.
In this thesis, we will investigate how XGBoost differs from the more traditional MART.
We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.
Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.
In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.
To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.
The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.

Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth is combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.

Porque o xGBoost ganha todas as competições de Machine Learning

Falhas na abordagem de Deep Learning: Arquiteturas e Meta-parametrização

O maior desafio corrente enfrentado pela indústria no que diz respeito à Deep Learning está sem sombra de dúvidas na parte computacional em que todo o mercado está absorvendo tanto os serviços de nuvem para realizar cálculos cada vez mais complexos como também bem como investindo em capacidade de computação das GPU.

Entretanto, mesmo com o hardware nos dias de hoje já ser um commodity, a academia está resolvendo um problema que pode revolucionar a forma na qual se faz Deep Learning que é no aspecto arquitetural/parametrização.

Esse comentário da thread diz muito a respeito desse problema em que o usuário diz:

The main problem I see with Deep Learning: too many parameters.

When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.

Ou seja, mesmo com toda a disponibilidade do hardware a questão de saber qual é o melhor arranjo arquitetural de uma rede neural profunda? ainda não está resolvido.

Este paper do Shai Shalev-Shwartz , Ohad Shamir, e Shaked Shammah chamado “Failures of Deep Learning” expõe esse problema de forma bastante rica inclusive com experimentos (este é o repositório no Git).

Os autores colocam que os pontos de falha das redes Deep Learning que são a) falta de métodos baseados em gradiente para otimização de parâmetros, b) problemas estruturais nos algoritmos de Deep Learning na decomposição dos problemas, c) arquitetura e d) saturação das funções de ativação.

Em outras palavras, o que pode estar acontecendo em grande parte das aplicações de Deep Learning é que o tempo de convergência poderia ser muito menor ainda, se estes aspectos já estivessem resolvidos.

Com isso resolvido, grande parte do que conhecemos hoje como indústria de hardware para as redes Deep Learning seria ou sub-utilizada ao extremo (i.e. dado que haverá uma melhora do ponto de vista de otimização arquitetural/algorítmica) ou poderia ser aproveitada para tarefas mais complexas (e.g. como reconhecimento de imagens com baixo número de pixels).

Desta forma mesmo adotando uma metodologia baseada em hardware como a indústria vem fazendo, há ainda muito espaço de otimização em relação às redes Deep Learning do ponto de vista arquitetural e algorítmico.

Abaixo uma lista de referências direto do Stack Exchange para quem quiser se aprofundar mais no assunto:

Algoritmos Neuro-Evolutivos

Aprendizado por Reforço:


PS: O WordPress retirou a opção de justificar texto, logo desculpem de antemão a aparência amadora do blog nos próximos dias.


Falhas na abordagem de Deep Learning: Arquiteturas e Meta-parametrização

Comparação entre um modelo de Machine Learning e EuroSCOREII na previsão de mortalidade após cirurgia cardíaca eletiva

Mais um estudo colocando  alguns algoritmos de Machine Learning contra métodos tradicionais de scoring, e levando a melhor.

A Comparison of a Machine Learning Model with EuroSCORE II in Predicting Mortality after Elective Cardiac Surgery: A Decision Curve Analysis

Abstract: The benefits of cardiac surgery are sometimes difficult to predict and the decision to operate on a given individual is complex. Machine Learning and Decision Curve Analysis (DCA) are recent methods developed to create and evaluate prediction models.

Methods and finding: We conducted a retrospective cohort study using a prospective collected database from December 2005 to December 2012, from a cardiac surgical center at University Hospital. The different models of prediction of mortality in-hospital after elective cardiac surgery, including EuroSCORE II, a logistic regression model and a machine learning model, were compared by ROC and DCA. Of the 6,520 patients having elective cardiac surgery with cardiopulmonary bypass, 6.3% died. Mean age was 63.4 years old (standard deviation 14.4), and mean EuroSCORE II was 3.7 (4.8) %. The area under ROC curve (IC95%) for the machine learning model (0.795 (0.755–0.834)) was significantly higher than EuroSCORE II or the logistic regression model (respectively, 0.737 (0.691–0.783) and 0.742 (0.698–0.785), p < 0.0001). Decision Curve Analysis showed that the machine learning model, in this monocentric study, has a greater benefit whatever the probability threshold.

Conclusions: According to ROC and DCA, machine learning model is more accurate in predicting mortality after elective cardiac surgery than EuroSCORE II. These results confirm the use of machine learning methods in the field of medical prediction.

Comparação entre um modelo de Machine Learning e EuroSCOREII na previsão de mortalidade após cirurgia cardíaca eletiva

Paper Ensemble methods for uplift modeling

Esse paper sobre a aplicação de métodos ensemble especificamente em modelagem uplift, é um ótimo guia de como técnicas não são canônicas em termos de resolução de problemas.

Abstract: Uplift modeling is a branch of machine learning which aims at predicting the causal effect of an action such as a marketing campaign or a medical treatment on a given individual by taking into account responses in a treatment group, containing individuals subject to the action, and a control group serving as a background. The resulting model can then be used to select individuals for whom the action will be most profitable. This paper analyzes the use of ensemble methods: bagging and random forests in uplift modeling. We perform an extensive experimental evaluation to demonstrate that the application of those methods often results in spectacular gains in model performance, turning almost useless single models into highly capable uplift ensembles. The gains are much larger than those achieved in case of standard classifi- cation. We show that those gains are a result of high ensemble diversity, which in turn is a result of the differences between class probabilities in the treatment and control groups being harder to model than the class probabilities themselves. The feature of uplift modeling which makes it difficult thus also makes it amenable to the application of ensemble methods. As a result, bagging and random forests emerge from our evaluation as key tools in the uplift modeling toolbox.

Ensemble methods for uplift modeling

Paper Ensemble methods for uplift modeling

Comparação de Algoritmos de Aprendizado de Máquina de acordo com o hiperplano gerado de acordo com suas regiões de fronteira

Mostra a importância de entender o que cada algortimo faz e as suas aplicações.

Comparação de Algoritmos de Aprendizado de Máquina de acordo com o hiperplano gerado de acordo com suas regiões de fronteira

A Mineração de Dados está proibida de falhar?

Pois é, parece que sim. Ao menos de acordo com a Nature.

Para quem não sabe o que aconteceu, alguns pesquisadores realizaram análises no Google Flu Trends e encontraram problemas em relação ao modelo.

Os resultados estão nos artigos abaixo:

Nature News – When Google got flu wrong

The Parable of Google Flu: Traps in Big Data Analysis 
In February 2013, Google Flu Trends (GFT) made headlines but not for a reason that Google executives or the creators of the flu tracking system would have hoped. Nature reported that GFT was predicting more than double the proportion of doctor visits for influenza-like illness (ILI) than the Centers for Disease Control and Prevention (CDC), which bases its estimates on surveillance reports from laboratories across the United States (1, 2). This happened despite the fact that GFT was built to predict CDC reports. Given that GFT is often held up as an exemplary use of big data (3, 4), what lessons can we draw from this error?

The Mystery of the Exploding Tongue

Why Google Flu Trends Will Not Replace the CDC Anytime Soon

Toward a more useful definition of Big Data


Se alguém quiser saber como funciona o (‘brilhante’) sistema de Peer-Review da Nature (assim como de muitas revistas) o Sydney Brenner fala um pouco sobre o assunto.

A Mineração de Dados está proibida de falhar?

Reprodutibilidade em Mineração de Dados e Aprendizado de Máquina

Esse post do Geomblog coloca esse assunto de uma maneira bem particular. Abaixo um pequeno relato:

So one thing I often look for when reviewing such papers is sensitivity: how well can the authors demonstrate robustness with respect to the parameter/algorithm choices. If they can, then I feel much more confident that the result is real and is not just an artifact of a random collection of knob settings combined with twirling around and around holding one’s nose and scratching one’s ear. 


Aqui no site falamos um pouco sobre isso neste post.

Reprodutibilidade em Mineração de Dados e Aprendizado de Máquina

Pruning Algos

Este paper Russel Reed é um ótimo artigo para quem deseja saber como funcionam os algoritmos de pruning (poda). Merece atenção especial para quem utiliza regras de associação.


Pruning Algos

Algoritmos e Transparência

Neste post do Tim Harford é abordado dentro do contexto da falta de transparência dos algoritmos de HTF, entra em uma seara interessante a respeito do uso de algoritmos no suporte à decisão.

É mais do que necessário que seja realizada a reflexão e estudo antes de aplicar uma determinada regra para tomada de decisão, principalmente se parte do custo de ordenação e seleção de informações forem gerados via algoritmos.

Isso leva a uma preocupação que parece que ainda não está em pauta dentro da comunidade de mineração de dados, em especial em ensino e aplicação de ferramentas; preocupação essa que é de entendimento do que o processo computacional está realizando. Em outras palavras, saber o que ocorre “por trás dos bastidores”.

O ponto é que hoje, qualquer um está apto a utilizar uma suíte de mineração de dados, e até mesmo um SGDB (sic.) para decisões de negócios; entretanto, isso não elimina o débito técnico que a comunidade acadêmica tem sobre a explicação desses algortimos e principalmente o seu entendimento; e isso, pode levar a decisões totalmente black-box que é o inverso do que qualquer analista de mineração de dados deseja.

Algoritmos e Transparência

Aprendizado de Máquina com Python – scikit-learn

Em uma das competições do Kaggle me chamou a atenção o crescimento de usuários que vem utilizando o scikit-learn como ferramenta de aprendizado de máquina.

A linguagem Python tem um grande diferencial que é a comunidade acadêmica por trás de seus desenvolvimentos e em sua comunidade. Muito legal e vale a pena para quem deseja conhecer um pouco mais sobre esse pacote.

Aprendizado de Máquina com Python – scikit-learn

Introdução à Técnica de Árvores de Decisão

Este post de  Antonios Chorianopoulos no Inside Data Mining apresenta uma introdução bem interessante sobre o assunto colocando em perspectiva os algoritmos CART, C5.0 e CHAID em uma explicação bem simples e didática.

Introdução à Técnica de Árvores de Decisão