Most of the time we completely rely in the default parameters of Machine Learning Algorithm and this fact can hide that sometimes we can make wrong statements about the ‘efficiency’ of some algorithm.
The paper called Tunability: Importance of Hyperparameters of Machine Learning Algorithms from Philipp Probst, Anne-Laure Boulesteix, Bernd Bischl in Journal of Machine Learning Research (JMLR) bring some light in this subject. This is the abstract:
Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.Probst, Boulesteix, Bischl in Tunability: Importance of Hyperparameters of Machine Learning Algorithms
I recognize that the abstract sounds not so appealing, but the most important part of the text for sure it’s related in one table and one graph about the Tunability, i.e. how tuneable one parameter is according the other default values.
As we can observe in the columns Def.P (package defaults) and Def.O (optimal defaults) even in some vanilla algorithms we have some big differences between them, specially in Part, XGBoost and Ranger.
If we check the variance across this hyper parameters, the results indicates that the problem can be worse that we imagined:
As we can see in a first sight there’s a huge variance in terms of AUC when we talk about the default parameters.
Checking these experiments two big questions arises:
- How much inefficiency it’s included in some process of algorithm assessment and selection because for the ‘initial model‘ (that most of the times becomes the last model) because of relying in the default values? and;
- Because of this misleading path to consider some algorithm based purely in defaults how many ML implementations out there is underperforming and wasting research/corporate resources (e.g. people’s time, computational time, money in cloud providers, etc…)?
Initial Assessment Strategy
A simple strategy that I use for this particular purpose it’s to use a two-phase hyperparameter search strategy where in the first phase I use to make a small knockout round with all algorithms using Random Search to grab the top 2 or 3 models, and in the second phase I use Grid Search where most of the time I explore a large number of parameters.
According the number of samples that I have in the Test and Validation sets, I usually let the search for at least 24 hours in some local machine or in some cloud provider.
I do that because with this ‘initial‘ assessment we can have a better idea which algorithm will learn more according the data that I have considering dimensionality, selectivity of the columns or complexity of the word embeddings in NLP tasks, data volume and so on.
The paper makes a great job to expose the variance in terms of AUC using default parameters for the practitioners and can give us a better heuristic path in terms to know which parameters are most tunable and with this information in hands we can perform better search strategies to have better implementations of Machine Learning Algorithms.