Machine Learning Tetrad = Business Knowledge + Statistical Understanding + ML Algos + Data

In his post Learning Market Dynamics for Optimal Pricing, Sharan Srinivasan describes how Airbnb combines ML with structural modeling (mathematical and statistical modelling) to offer guests optimal prices based on market dynamics, in particular on how far in advance a booking is made: the time between the booking date and the check-in date (also known as lead time).

This part of the post summarizes why they chose that approach:

Machine Learning vs Structural Modeling or Both?

Modern ML models fare very well in terms of predictive performance, but seldom model the underlying data generation mechanism. In contrast, structural models provide interpretability by allowing us to explicitly specify the relationships between the variables (features and responses) to reflect the process that gives rise to the data, but often fall short on predictive performance. Combining the two schools of thought allows us to exploit the strengths of each approach to better model the data generating process as well as achieve good model performance.

When we have good intuition for a modeling task, we can use our insights to reinforce an ML model with structural context. Imagine we are looking to predict a response Y based on features (X₀,…,Xn). Ordinarily, we would train our favorite ML model to predict. However, suppose we also know that Y is distributed over an input feature X₀ with a distribution F parameterized by θ, i.e. Y ~ F(X₀; θ), we could leverage this information and decompose the task to learning θ using features (X₀,…,Xn), and then simply plug our estimate of θ back into F to arrive at Y in the final step.

By employing this hybrid approach, we can leverage both the algorithmic powerhouse that ML provides and the informed intuition of statistical modeling. This is the approach we took to model lead time dynamics.
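
To make the decomposition concrete, here is a minimal sketch of my own, not Airbnb's actual model. It assumes that lead times are exponentially distributed with rate θ (that is the structural part F) and that log θ is linear in a single made-up listing feature. The function names, the synthetic data and every number are hypothetical, and plain least squares stands in for whatever ML model you would really use.

```python
import math
import random

def exp_cdf(lead_time, theta):
    """Structural part F(x0; theta): share of bookings made within `lead_time` days."""
    return 1.0 - math.exp(-theta * lead_time)

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for any ML regressor)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def simulate_listing(feature, rng, n_bookings=500):
    """Synthetic lead times; the true rate is log-linear in the feature (assumption)."""
    true_theta = math.exp(-3.0 + 0.5 * feature)
    return [rng.expovariate(true_theta) for _ in range(n_bookings)]

rng = random.Random(0)
features = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0]  # one hypothetical feature per listing

# Step 1: estimate theta per listing (MLE of an exponential rate is 1 / mean).
log_theta_hat = []
for f in features:
    lead_times = simulate_listing(f, rng)
    log_theta_hat.append(math.log(len(lead_times) / sum(lead_times)))

# Step 2: learn theta as a function of the features.
a, b = fit_line(features, log_theta_hat)

def predict_theta(feature):
    return math.exp(a + b * feature)

# Step 3: plug the predicted theta back into F for a listing we have not seen.
theta_new = predict_theta(1.2)
share_within_week = exp_cdf(7.0, theta_new)
print(f"theta={theta_new:.4f}, share booked within 7 days={share_within_week:.2f}")
```

The ML part only has to learn one parameter per listing, and the shape of the curve over lead time comes for free from the structural assumption.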

This post is a good technical compass. For every modelling problem in core machine learning, the best combination will always be the tetrad: Business Knowledge + Statistical understanding of the data + ML Algos + Data.

Practical advice about research modelling with Andrew

A post about ROC analysis becomes a small lecture about decision analysis:

It’s good for researchers to present their raw data, along with clean summary analyses. Report what your data show, and publish everything! But when it comes to decision making, including the decision of what lines of research to pursue further, I’d go Bayesian, incorporating prior information and making the sources and reasoning underlying that prior information clear, and laying out costs and benefits. Of course, that’s all a lot of work, and I don’t usually do it myself. Look at my applied papers and you’ll see tons of point estimates and uncertainty intervals, and only a few formal decision analyses. Still, I think it makes sense to think of Bayesian decision analysis as the ideal form and to interpret inferential summaries in light of these goals. Or, even more short-term than that, if people are using statistical significance to make publication decisions, we can do our best to correct for the resulting biases, as in section 2.1 of this paper.
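
A toy version of that kind of decision analysis, as I read it, could look like the sketch below. Everything in it is an assumption of mine: a Beta prior on the chance that a line of research pays off, a handful of pilot results, and made-up benefit and cost figures. It is only meant to show how prior, data, costs and benefits enter one calculation.

```python
def posterior_beta(prior_a, prior_b, successes, failures):
    """Conjugate update: Beta(a, b) prior plus binomial data gives Beta(a + s, b + f)."""
    return prior_a + successes, prior_b + failures

def expected_net_benefit(a, b, benefit, cost):
    """Expected payoff of pursuing the idea when success probability ~ Beta(a, b).

    The payoff is linear in the success probability, so only the mean is needed.
    """
    return (a / (a + b)) * benefit - cost

# Hypothetical inputs: a skeptical prior (mean 0.2) and a promising pilot (6 out of 10).
prior = (2.0, 8.0)
post = posterior_beta(*prior, successes=6, failures=4)
posterior_mean = post[0] / (post[0] + post[1])

benefit, cost = 100.0, 30.0  # made-up units
net_prior_only = expected_net_benefit(*prior, benefit=benefit, cost=cost)
net_raw_data = (6 / 10) * benefit - cost
net = expected_net_benefit(*post, benefit=benefit, cost=cost)

decision = "pursue" if net > 0 else "drop"
print(posterior_mean, net_prior_only, net_raw_data, net, decision)
```

With these numbers the prior alone says "drop", the raw pilot data looks very attractive, and the posterior lands in between, still positive but far less exciting than the point estimate suggests.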

Understandability of ML models and their applications

This Michael Kaminsky post, The Blacker the Box, nails the trade-off between understandability and formal modelling, using the speed of feedback as a way to decide which approach to take when implementing these models. The quote below is about fast feedback, which the author defines as “1) the ability to quickly evaluate the correctness of a prediction and 2) the ability to play the game near infinite amounts of time”:

If I am designing an application for optimizing landing-page content for my e-commerce site (i.e., choose the content that converts best), then I do not care if my data scientist has rigged up a prediction pipeline that involves passing a SVM prediction through an octopus so long as that model out-performs every other model we have tested in our production environment and we have confidence that we are measuring performance correctly.

The key to this scenario is that I am able to quickly and easily evaluate the performance of candidate prediction models and compare the current production model to new candidate models for evaluation. Because I have an objective measure of predictive success, I do not need any understanding of what the model is doing under-the-hood in order to make use of it.

Nice advice from Andrew about Multi-Armed Bandits

Small technical considerations about terminology:

First, and less importantly, each slot machine (or “bandit”) only has one arm. Hence it’s many one-armed bandits, not one multi-armed bandit.

Second, the basic strategy in these problems is to play on lots of machines until you find out which is the best, and then concentrate your plays on that best machine. This all presupposes that either (a) you’re required to play, or (b) at least one of the machines has positive expected value. But with slot machines, they all have negative expected value for the player (that’s why they’re called “bandits”), and the best strategy is not to play at all. So the whole analogy seems backward to me.
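
The basic strategy described in the quote can be sketched as a simple explore-then-commit loop. This is only an illustration under my own assumptions: the win probabilities, the cost per play and the function name `explore_then_commit` are all made up, and real bandit algorithms (epsilon-greedy, UCB, Thompson sampling) are more careful about how they split exploration and exploitation. The sketch also includes the "do not play at all" exit when no machine looks like it has positive expected value.

```python
import random

def explore_then_commit(win_probs, cost_per_play, explore_pulls, total_pulls, seed=0):
    """Try every machine `explore_pulls` times, then play only the best one.

    Returns (best_arm, net_winnings); best_arm is None when even the best
    machine has a non-positive estimated value, in which case we stop playing.
    """
    rng = random.Random(seed)

    def pull(arm):
        # A win pays 1 unit; every play costs `cost_per_play`.
        return (1.0 if rng.random() < win_probs[arm] else 0.0) - cost_per_play

    n_arms = len(win_probs)
    totals = [0.0] * n_arms
    for arm in range(n_arms):                # exploration phase
        for _ in range(explore_pulls):
            totals[arm] += pull(arm)

    means = [t / explore_pulls for t in totals]
    best = max(range(n_arms), key=lambda arm: means[arm])
    net = sum(totals)

    if means[best] <= 0:                     # every machine looks like a loser
        return None, net

    for _ in range(total_pulls - n_arms * explore_pulls):  # exploitation phase
        net += pull(best)
    return best, net

# One machine with positive expected value (0.8 - 0.6 > 0): commit to it.
good_arm, good_net = explore_then_commit([0.2, 0.5, 0.8], 0.6, 200, 2000)
# All machines have negative expected value: the best move is to walk away.
bad_arm, bad_net = explore_then_commit([0.2, 0.3, 0.4], 0.6, 200, 2000)
print(good_arm, bad_arm)
```

Note that in the second case the exploration itself already costs money, which is the point of the quote: with real slot machines you lose even while finding out that you should not play.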

Progressive Neural Architecture Search

Abstract: We propose a new method for learning the structure of convolutional neural networks (CNNs) that is more efficient than recent state-of-the-art methods based on reinforcement learning and evolutionary algorithms. Our approach uses a sequential model-based optimization (SMBO) strategy, in which we search for structures in order of increasing complexity, while simultaneously learning a surrogate model to guide the search through structure space. Direct comparison under the same search space shows that our method is up to 5 times more efficient than the RL method of Zoph et al. (2018) in terms of number of models evaluated, and 8 times faster in terms of total compute. The structures we discover in this way achieve state of the art classification accuracies on CIFAR-10 and ImageNet.

Conclusions: The main contribution of this work is to show how we can accelerate the search for good CNN structures by using progressive search through the space of increasingly complex graphs, combined with a learned prediction function to efficiently identify the most promising models to explore. The resulting models achieve the same level of performance as previous work but with a fraction of the computational cost. There are many possible directions for future work, including: the use of better surrogate predictors, such as Gaussian processes with string kernels; the use of model-based early stopping, such as [3], so we can stop the training of “unpromising” models before reaching E1 epochs; the use of “warm starting”, to initialize the training of a larger b+ 1-sized model from its smaller parent; the use of Bayesian optimization, in which we use an acquisition function, such as expected improvement or upper confidence bound, to rank the candidate models, rather than greedily picking the top K (see e.g., [31,30]); adaptively varying the number of models K evaluated at each step (e.g., reducing it over time); the automatic exploration of speed-accuracy tradeoffs (cf., [11]), etc.
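
To get a feel for the idea of searching in order of increasing complexity with a learned predictor, here is a toy sketch. It is not the paper's algorithm or search space: the operation names, the fake `train_and_evaluate` scoring function and the very simple surrogate (parent score plus the average gain observed so far for an operation) are all my own assumptions, chosen so the example runs in a few lines.

```python
OPS = ("conv3", "conv5", "pool", "identity")   # hypothetical operation set
OP_VALUE = {"conv3": 0.30, "conv5": 0.22, "pool": 0.10, "identity": 0.02}

def train_and_evaluate(structure):
    """Stand-in for expensive training: toy score with diminishing returns per block."""
    return sum(OP_VALUE[op] / (i + 1) for i, op in enumerate(structure))

def progressive_search(max_blocks, k):
    evaluated = {}                     # structure -> measured score
    gains = {op: [] for op in OPS}     # surrogate data: observed gain of appending op

    def evaluate(structure, parent_score):
        score = train_and_evaluate(structure)
        evaluated[structure] = score
        gains[structure[-1]].append(score - parent_score)

    def predict(parent, op):
        # Surrogate: parent's measured score plus the average gain seen for `op`.
        return evaluated[parent] + sum(gains[op]) / len(gains[op])

    # Level 1: the space is small, so evaluate every one-block structure.
    beam = [(op,) for op in OPS]
    for structure in beam:
        evaluate(structure, 0.0)

    # Levels 2..max_blocks: expand, rank with the surrogate, evaluate only the top k.
    for _ in range(max_blocks - 1):
        candidates = [parent + (op,) for parent in beam for op in OPS]
        candidates.sort(key=lambda c: predict(c[:-1], c[-1]), reverse=True)
        beam = candidates[:k]
        for structure in beam:
            evaluate(structure, evaluated[structure[:-1]])

    best = max(evaluated, key=evaluated.get)
    return best, evaluated[best], len(evaluated)

best_structure, best_score, n_evaluated = progressive_search(max_blocks=3, k=2)
print(best_structure, round(best_score, 2), n_evaluated)
```

In this toy setting an exhaustive search over all structures with up to three blocks would need 4 + 16 + 64 = 84 evaluations, while the progressive search with K = 2 only trains 4 + 2 + 2 = 8 of them, which is the kind of saving the abstract is talking about.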
