Not Safe for Work Detector using TensorFlow.js

From the official repository:

A simple JavaScript library to help you quickly identify unseemly images; all in the client’s browser. NSFWJS isn’t perfect, but it’s pretty accurate (~90% from our test set of 15,000 test images)… and it’s getting more accurate all the time.

The library categorizes image probabilities into the following 5 classes:

  • Drawing – safe for work drawings (including anime)
  • Hentai – hentai and pornographic drawings
  • Neutral – safe for work neutral images
  • Porn – pornographic images, sexual acts
  • Sexy – sexually explicit images, not pornography

The demo is continuously deployed from source – give it a go:


Machine Learning Model Degradation

In a very insightful article, David Talby discusses the fact that the second a Machine Learning model goes to production, it starts to degrade, because the model comes into contact with reality. The author puts it with the following statement:

The key is that, in contrast to a calculator, your ML system does interact with the real world. If you’re using ML to predict demand and pricing for your grocery store, you’d better consider this week’s weather, the upcoming national holiday and what your competitor across the street is doing. If you’re designing clothes or recommending music, you’d better follow opinion-makers, celebrities and current events. If you’re using AI for auto-trading, bidding for online ads or video gaming, you must constantly adapt to what everyone else is doing.

The takeaway from the article is that every ML model in production should follow some simple guidelines for monitoring and reassessment: 1) an online measure of accuracy to monitor model degradation; 2) ML Engineers must mind the gap between the distributions of the training and test sets; and 3) data quality alerts for unexpected growth in some groups of your sample, which is a sign that you are facing bad predictions.
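To make guideline 3 concrete, here is a minimal, hypothetical sketch of a data-quality/drift alert using the Population Stability Index (PSI) — a common drift metric, not something prescribed in Talby's article. The feature data and alert thresholds are made up for illustration; only NumPy is assumed.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a live sample.

    Common rule of thumb: PSI < 0.1 -> stable; 0.1-0.25 -> moderate shift;
    > 0.25 -> significant shift, raise a data-quality alert.
    """
    # Bin edges come from the training (expected) distribution.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # capture out-of-range live values
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor the proportions to avoid division by zero / log(0).
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(42)
train_feature = rng.normal(0.0, 1.0, 10_000)   # distribution seen at training time
live_same = rng.normal(0.0, 1.0, 10_000)       # production data, no drift
live_shifted = rng.normal(0.8, 1.0, 10_000)    # production data after drift

print(psi(train_feature, live_same))     # small -> no alert
print(psi(train_feature, live_shifted))  # large -> trigger a data-quality alert
```

Running the same check per feature on a schedule (e.g. daily) gives a cheap proxy for guideline 2 as well — a widening train/production distribution gap shows up as a growing PSI long before labels arrive to compute online accuracy.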


Tunability, Hyperparameters and a simple Initial Assessment Strategy

Most of the time we rely completely on the default parameters of Machine Learning algorithms, and this can hide the fact that we sometimes make wrong statements about the ‘efficiency’ of an algorithm.

The paper Tunability: Importance of Hyperparameters of Machine Learning Algorithms by Philipp Probst, Anne-Laure Boulesteix, and Bernd Bischl, published in the Journal of Machine Learning Research (JMLR), brings some light to this subject. This is the abstract:

Modern supervised machine learning algorithms involve hyperparameters that have to be set before running them. Options for setting hyperparameters are default values from the software package, manual configuration by the user or configuring them for optimal predictive performance by a tuning procedure. The goal of this paper is two-fold. Firstly, we formalize the problem of tuning from a statistical point of view, define data-based defaults and suggest general measures quantifying the tunability of hyperparameters of algorithms. Secondly, we conduct a large-scale benchmarking study based on 38 datasets from the OpenML platform and six common machine learning algorithms. We apply our measures to assess the tunability of their parameters. Our results yield default values for hyperparameters and enable users to decide whether it is worth conducting a possibly time consuming tuning strategy, to focus on the most important hyperparameters and to choose adequate hyperparameter spaces for tuning.

Probst, Boulesteix, Bischl in Tunability: Importance of Hyperparameters of Machine Learning Algorithms

I recognize that the abstract may not sound so appealing, but the most important part of the paper is surely one table and one graph about tunability, i.e. how tunable one hyperparameter is given that the other parameters are kept at their default values.

As we can observe in the columns Def.P (package defaults) and Def.O (optimal defaults), even for some vanilla algorithms there are big differences between them, especially for Part, XGBoost and Ranger.

If we check the variance across these hyperparameters, the results indicate that the problem can be worse than we imagined:

As we can see at first sight, there is a huge variance in terms of AUC when we talk about the default parameters.
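To make the tunability measure concrete: the paper's experiments use R packages on 38 OpenML datasets, but the idea translates directly — tunability of a hyperparameter is the gap between cross-validated performance at the package default and the best performance over a search space. Below is a toy scikit-learn sketch (the dataset, algorithm, and grid are illustrative assumptions, not the paper's setup).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Toy dataset standing in for one of the paper's 38 OpenML datasets.
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           random_state=0)

def cv_auc(**params):
    """Cross-validated AUC for a random forest with the given hyperparameters."""
    model = RandomForestClassifier(n_estimators=100, random_state=0, **params)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

default_auc = cv_auc()  # package default for max_features ("sqrt" in recent sklearn)

# Tunability of max_features: best AUC over a small grid minus the default AUC.
grid = [0.1, 0.25, 0.5, 0.75, 1.0]
best_auc = max(cv_auc(max_features=f) for f in grid)
tunability = best_auc - default_auc
print(f"default AUC={default_auc:.4f}  best AUC={best_auc:.4f}  "
      f"tunability={tunability:.4f}")
```

A near-zero (or negative) gap means the default is already close to optimal for this dataset and tuning that parameter is not worth the compute; a large gap is exactly the situation the paper's Def.P vs Def.O columns expose.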

Looking at these experiments, two big questions arise:

  1. How much inefficiency is built into the process of algorithm assessment and selection when the ‘initial model‘ (which most of the time becomes the final model) relies on the default values? and;
  2. Given this misleading path of judging an algorithm purely on its defaults, how many ML implementations out there are underperforming and wasting research/corporate resources (e.g. people’s time, computational time, money spent on cloud providers, etc.)?

Initial Assessment Strategy

A simple strategy that I use for this particular purpose is a two-phase hyperparameter search: in the first phase I run a small knockout round across all algorithms using Random Search to grab the top 2 or 3 models, and in the second phase I use Grid Search, where most of the time I explore a large number of parameters.

Depending on the number of samples that I have in the Test and Validation sets, I usually let the search run for at least 24 hours, either on a local machine or on some cloud provider.

I do that because with this ‘initial‘ assessment we can get a better idea of which algorithm will learn the most from the data at hand, considering dimensionality, selectivity of the columns or complexity of the word embeddings in NLP tasks, data volume, and so on.
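The two-phase strategy can be sketched with scikit-learn's `RandomizedSearchCV` and `GridSearchCV`. The candidate algorithms, search spaces, and budgets below are illustrative assumptions for a toy dataset — in practice the knockout round would cover your full algorithm shortlist and run for hours, not seconds.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Phase 1: a cheap Random Search "knockout round" across candidate algorithms.
candidates = {
    "logreg": (LogisticRegression(max_iter=1000),
               {"C": loguniform(1e-3, 1e2)}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}),
    "gbm": (GradientBoostingClassifier(random_state=0),
            {"learning_rate": loguniform(1e-2, 0.3), "n_estimators": [50, 100]}),
}
phase1 = {}
for name, (model, space) in candidates.items():
    search = RandomizedSearchCV(model, space, n_iter=5, cv=3,
                                scoring="roc_auc", random_state=0)
    phase1[name] = search.fit(X, y).best_score_

# Keep only the top 2 models for the expensive second phase.
top2 = sorted(phase1, key=phase1.get, reverse=True)[:2]
print("phase-1 scores:", phase1, "-> finalists:", top2)

# Phase 2: exhaustive Grid Search over a wider grid for one finalist (here: rf).
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    {"n_estimators": [100, 200],
                     "max_depth": [5, 10, None],
                     "max_features": ["sqrt", 0.5]},
                    cv=3, scoring="roc_auc")
grid.fit(X, y)
print("phase-2 best:", grid.best_params_, f"AUC={grid.best_score_:.4f}")
```

The design choice is the usual budget trade-off: Random Search covers wide, cheap exploration to eliminate weak algorithms early, so the exhaustive (and expensive) Grid Search budget is spent only on the finalists.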


The paper does a great job of exposing to practitioners the variance in AUC under default parameters, and it gives us a better heuristic path for knowing which parameters are the most tunable; with this information in hand we can run better search strategies and arrive at better implementations of Machine Learning algorithms.


Benchmark-ML: Cutting the Big Data Hype

This is the most important benchmark project ever done in Machine Learning. I'll leave you with the summary provided:

When I started this benchmark in March 2015, the “big data” hype was all the rage, and the fanboys wanted to do machine learning on “big data” with distributed computing (Hadoop, Spark etc.), while for the datasets most people had single-machine tools were not only good enough, but also faster, with more features and less bugs. I gave quite a few talks at conferences and meetups about these benchmarks starting 2015 and while at the beginning I had several people asking angrily about my results on Spark, by 2017 most people realized single machine tools are much better for solving most of their ML problems. While Spark is a decent tool for ETL on raw data (which often is indeed “big”), its ML libraries are totally garbage and outperformed (in training time, memory footprint and even accuracy) by much better tools by orders of magnitude. Furthermore, the increase in available RAM over the last years in servers and also in the cloud, and the fact that for machine learning one typically refines the raw data into a much smaller sized data matrix is making the mostly single-machine highly-performing tools (such as xgboost, lightgbm, VW but also h2o) the best choice for most practical applications now. The big data hype is finally over.

GitHub Repo