A Machine Learning version of the quote: “In God we trust, all others must bring data”

W. Edwards Deming said:

In God we trust, all others must bring data.

Source: Wikipedia

In light of a very nice Twitter thread by Cecile Janssens, I’m making this new statement for every ML Engineer, Data Analyst, and Data Scientist from now on:

“IN GOD WE TRUST, ALL OTHERS MUST BRING THE RAW DATA WITH THE SOURCE CODE OF THE EXTRACTION ON GITHUB.”

CLESIO, Flavio. 2019. Berlin.


The sunset of statistical significance

Brian Resnick hit the nail on the head in his latest Vox column, 800 scientists say it’s time to abandon “statistical significance”, where he brings up an important discussion of how the p-value is misleading science, especially for studies that have clear measurements of some particular effect but are thrown away because of a lack of statistical significance.

In the column, Mr. Resnick lists some alternatives for getting “(…) better, more nuanced approaches to evaluating science (…)” (a small sketch of the first two ideas follows the list below):

– Concentrating on effect sizes (how big of a difference does an intervention make, and is it practically meaningful?)

– Confidence intervals (what’s the range of doubt built into any given answer?)

– Whether a result is a novel study or a replication (put some more weight into a theory many labs have looked into)

– Whether a study’s design was preregistered (so that authors can’t manipulate their results post-test), and that the underlying data is freely accessible (so anyone can check the math)

– There are also alternative statistical techniques — like Bayesian analysis — that in some ways more directly evaluate a study’s results. (P-values ask the question “how rare are my results?” Bayes factors ask the question “what is the probability my hypothesis is the best explanation for the results we found?” Both approaches have trade-offs.)
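As a small, purely hypothetical illustration of the first two ideas (effect sizes and confidence intervals), here is a minimal Python sketch; the groups, sample sizes, and values are all made up.

import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements for a control and a treatment group
control = rng.normal(loc=10.0, scale=2.0, size=200)
treatment = rng.normal(loc=10.8, scale=2.0, size=200)

# Effect size (Cohen's d): how big is the difference, in pooled-standard-deviation units?
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

# 95% bootstrap confidence interval for the difference in means
boot_diffs = [
    rng.choice(treatment, treatment.size).mean() - rng.choice(control, control.size).mean()
    for _ in range(5000)
]
ci_low, ci_high = np.percentile(boot_diffs, [2.5, 97.5])

print(f"Cohen's d: {cohens_d:.2f}")
print(f"95% CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")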

PS: Frank Harrell (Founding Chair of Biostatistics, Vanderbilt U.; Expert Statistical Advisor, Office of Biostatistics) gave us this very delightful tweet:

Source: Twitter


Post-training quantization in FastText (or How to shrink your FastText model by 90%)

In one experiment using a very large text database, at the end of training with train_supervised() in FastText I ended up with a serialized model of more than 1 GB.

This happens because FastText embeds all of the computation in the model itself: label encoding, parsing, TF-IDF transformation, word embeddings, computing the WordNGrams using bag-of-tricks, fitting, computing probabilities, and re-applying the label encoding.

As you may have noticed, with a corpus of more than 200,000 words and wordNGrams > 3, this can escalate very quickly in terms of storage.

As I wrote before, it’s really nice when we have a good model, but the real value comes when you put this model in production; and productionizing machine learning is the barrier that separates the girls/boys from the women/men.

With a large storage and memory footprint it’s nearly impossible to build production-ready machine learning models, and in terms of high-performance APIs, large models with a huge memory footprint can be a big blocker in any decent ML project.

To solve this kind of problem, FastText provides a good way to compress the size of the model with little impact on performance. This is called post-training quantization.

The main idea of quantization is to reduce the size of the original model by compressing the embedding vectors, using several techniques ranging from simple truncation to hashing. This paper (Shu, Raphael, and Hideki Nakayama. “Compressing word embeddings via deep compositional code learning.”) is probably one of the best references on this kind of technique.
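To build some intuition for what this compression does, here is a toy product-quantization sketch (not fastText’s internal implementation; all sizes and parameters are made up): the embedding matrix is split into sub-vectors of size dsub, and each sub-vector is replaced by the one-byte index of its nearest centroid.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 20)).astype(np.float32)  # toy embedding matrix

dsub, k = 2, 16                        # sub-vector size and centroids per block (illustrative)
n_blocks = embeddings.shape[1] // dsub

codes = np.empty((embeddings.shape[0], n_blocks), dtype=np.uint8)
codebooks = []
for b in range(n_blocks):
    block = embeddings[:, b * dsub:(b + 1) * dsub]
    km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(block)
    codebooks.append(km.cluster_centers_)
    codes[:, b] = km.labels_           # one byte per sub-vector instead of dsub floats

# Rough compression ratio: full float32 matrix vs. codes plus the small codebooks
original_bytes = embeddings.nbytes
compressed_bytes = codes.nbytes + sum(c.nbytes for c in codebooks)
print(round(original_bytes / compressed_bytes, 1))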

This is the performance metric of the vanilla (full) model: Recall: 0.79

I used the following commands in Python for quantization, model saving, and reloading:

import fastText

# 'model' is the supervised model returned earlier by train_supervised()

# Quantize the model (dsub is the size of each sub-vector used in the product quantization)
model.quantize(input=None,
               qout=False,
               cutoff=0,
               retrain=False,
               epoch=None,
               lr=None,
               thread=None,
               verbose=None,
               dsub=2,
               qnorm=False)

# Save the quantized model
model.save_model('model_quantized.bin')

# Load the quantized model
model_quantized = fastText.load_model('model_quantized.bin')

I retrained using the quantized model and got the following results:

# Training Time: 00:02:46
# Recall: 0.78

import os

info_old_model = os.path.getsize('model.bin') / 1024.0  # size in KB
info_new_model = os.path.getsize('model_quantized.bin') / 1024.0  # size in KB

print(f'Old Model Size (KB): {round(info_old_model, 0)}')
print(f'New Model Size (KB): {round(info_new_model, 0)}')

# Old Model Size (KB): 1125236.0
# New Model Size (KB): 157190.0

As we can see, after shrinking the vanilla model with quantization we get a Recall of 0.78 against 0.79, with a model roughly 7x lighter (about 86% smaller) in terms of storage and memory footprint if we need to put this model in production.


Reproducibility in FastText

A few days ago I wrote about FastText, and one thing that is not clear in the docs is how to make the experiments reproducible in a deterministic way.

In the default settings of the train_supervised() method, I’m using the thread parameter with multiprocessing.cpu_count() - 1 as its value.

This means we’re using (almost) all the CPUs available for training. As a result, this implies a shorter training time if we’re using multicore servers or machines.

However, this leads to a totally non-deterministic result: because of the optimization algorithm used by fastText (asynchronous stochastic gradient descent, or Hogwild; paper here), the obtained vectors will be different, even if they are initialized identically.

This very gentle guide to FastText with Gensim states that:

for a fully deterministically-reproducible run, you must also limit the model to a single worker thread (workers=1), to eliminate ordering jitter from OS thread scheduling. (In Python 3, reproducibility between interpreter launches also requires use of the PYTHONHASHSEED environment variable to control hash randomization).

Radim Řehůřek in FastText Model

So, for that particular reason, the main assumption here is that even playing in a very stochastic experimentation environment, we’ll consider only the impact of the data volume itself and abstract this issue away from the results, since this stochasticity affects both experiments equally.

To make the experiments reproducible, the only thing needed is to change the value of the thread parameter from multiprocessing.cpu_count() - 1 to 1.

So, for the sake of reproducibility, training will take longer (in my experiments I’m facing an increase of about 8,000% in training time).
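A minimal sketch of what a reproducible run could look like, assuming a training file named train.txt in fastText’s __label__ format (the file name and hyperparameters are illustrative; the module name follows the fastText Python bindings used above):

import os

import fastText

# PYTHONHASHSEED controls Python's hash randomization; to fully take effect it must be set
# before the interpreter starts, so it is shown here only as a reminder
os.environ['PYTHONHASHSEED'] = '0'

# thread=1 removes the ordering jitter introduced by Hogwild-style asynchronous SGD,
# trading training speed for determinism
model = fastText.train_supervised(
    input='train.txt',   # hypothetical training file
    epoch=25,            # illustrative hyperparameters
    lr=0.5,
    wordNgrams=3,
    thread=1,            # single worker thread for reproducibility
)

model.save_model('model_reproducible.bin')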


Python tip: Dask

For anyone who can’t stand suffering with Pandas any longer and doesn’t want to deal with the countless limitations of Scala, Dask is a great library for data manipulation and computation in Python.

Straight from the documentation (a short usage sketch follows the list below):

Familiar: Provides parallelized NumPy array and Pandas DataFrame objects

Flexible: Provides a task scheduling interface for more custom workloads and integration with other projects.

Native: Enables distributed computing in pure Python with access to the PyData stack.

Fast: Operates with low overhead, low latency, and minimal serialization necessary for fast numerical algorithms

Scales up: Runs resiliently on clusters with 1000s of cores

Scales down: Trivial to set up and run on a laptop in a single process

Responsive: Designed with interactive computing in mind, it provides rapid feedback and diagnostics to aid humans
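A minimal usage sketch, assuming a set of CSV files matching a hypothetical pattern data-*.csv with hypothetical user_id and amount columns:

import dask.dataframe as dd

# Lazily read many CSV files into one parallel DataFrame (pattern and columns are hypothetical)
df = dd.read_csv('data-*.csv')

# Familiar Pandas-like API; nothing is executed until .compute() is called
total_by_user = df.groupby('user_id')['amount'].sum()

result = total_by_user.compute()  # triggers the parallel computation
print(result.head())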


FastText – A great tool for Text Classification

At some point I’ll post a field report about FastText in a Text Classification project. My opinion as of this moment (16.03.19): for a fast alpha version of a text classifier, with robust use of Bag-of-Tricks and WordNGrams, it’s amazing in terms of practical results (especially Recall) and speed of development.
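As a taste of how little code such an alpha version needs, here is a minimal sketch assuming training and validation files named train.txt and valid.txt in fastText’s __label__ format (file names and hyperparameters are illustrative):

import fastText

# Bag-of-tricks supervised classifier with word n-grams (illustrative hyperparameters)
model = fastText.train_supervised(input='train.txt', wordNgrams=3, epoch=10, lr=0.5)

# test() returns (number of samples, precision@1, recall@1)
n_samples, precision, recall = model.test('valid.txt')
print(n_samples, precision, recall)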


Chinese robot is the first machine to pass a medical exam

Interesting story from China Daily:

A robot has passed the written test of China’s national medical licensing examination, an essential entrance exam for doctors, making it the first robot in the world to pass such an exam.

Its developer iFlytek Co Ltd, a leading Chinese artificial intelligence company, said on Thursday that the robot scored 456 points, 96 points higher than the required marks.

The artificial-intelligence-enabled robot can automatically capture and analyze patient information and make initial diagnosis. It will be used to assist doctors to improve efficiency in future treatments, iFlytek said.

This is part of broader efforts by China to accelerate the application of AI in healthcare, consumer electronics, and other industries.

Liu Qingfeng, chairman of iFlytek, said, “We will officially launch the robot in March 2018. It is not meant to replace doctors. Instead, it is to promote better people-machine cooperation so as to boost efficiency.”

This piece of news is interesting for at least two simple reasons:

1) We are living in a time of AI Deniers (I call them the Flat-Earthers of Artificial Intelligence), who have Gary Marcus as their greatest exponent and who mix a very good skeptical discourse about the hype around Deep Learning and Artificial General Intelligence (AGI) with senseless criticism, as in this example where clear experimental results, published with a disclaimer about their methodological limitations, are simply denied; and

2) In terms of the allocation of medical and economic resources, the automation of these medical robot systems would bring a great social advance, in the sense that a) there would be greater democratization of access to preventive healthcare for the least advantaged people, since costs would drop drastically, and b) there would potentially be a better allocation of the time of health professionals (e.g. doctors and nurses) to tasks of greater value for the prevention or recovery/intervention of patients, instead of the execution of repetitive procedures, for example a greater focus on diagnosis and treatment (and this is really important in Brazil, given that 40% of newly graduated doctors fail the CREMESP exam, with 70% of doctors not knowing how to measure blood pressure and 86% getting the approach to a traffic-accident victim wrong).

Conclusion

In a reality where at least 50% of all jobs may be eliminated from the labor market by automation and Artificial Intelligence, along with the growing demand for goods and services (with ever-increasing cost competitiveness), news like this is very welcome to put into perspective for societies that a correct understanding of the potential and the limitations of Artificial Intelligence is the path to social development and economic prosperity.

As an author said at a conference I attended in Asia: “Artificial Intelligence is not here to take people’s jobs; it is here to end the jobs of those who don’t use it.”


Model explainability will be a security issue

Ben Lorica talks about security in terms of Software Engineering, but at least for me the most important aspect of security in Machine Learning in the future is model explainability, about which he says:

Model explainability has become an important area of research in machine learning. Understanding why a model makes specific decisions is important for several reasons, not the least of which is that it makes people more comfortable with using machine learning. That “comfort” can be deceptive, of course. But being able to ask models why they made particular decisions will conceivably make it easier to see when they’ve been compromised. During development, explainability will make it possible to test how easy it is for an adversary to manipulate a model, in applications from image classification to credit scoring. In addition to knowing what a model does, explainability will tell us why, and help us build models that are more robust, less subject to manipulation; understanding why a model makes decisions should help us understand its limitations and weaknesses. At the same time, it’s conceivable that explainability will make it easier to discover weaknesses and attack vectors. If you want to poison the data flowing into a model, it can only help to know how the model responds to data.

This discussion is very important, as here in Europe we’re seeing a huge government movement against the lack of explainability in the way algorithms work and in how automated decisions are made.


Project structure for Machine Learning and Data Science

Don’t know how to start? This repo can be a very useful resource.

This is the main project structure of Cookiecutter Data Science:

├── LICENSE
├── Makefile           <- Makefile with commands like `make data` or `make train`
├── README.md          <- The top-level README for developers using this project.
├── data
│   ├── external       <- Data from third party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling.
│   └── raw            <- The original, immutable data dump.
│
├── docs               <- A default Sphinx project; see sphinx-doc.org for details
│
├── models             <- Trained and serialized models, model predictions, or model summaries
│
├── notebooks          <- Jupyter notebooks. Naming convention is a number (for ordering),
│                         the creator's initials, and a short `-` delimited description, e.g.
│                         `1.0-jqp-initial-data-exploration`.
│
├── references         <- Data dictionaries, manuals, and all other explanatory materials.
│
├── reports            <- Generated analysis as HTML, PDF, LaTeX, etc.
│   └── figures        <- Generated graphics and figures to be used in reporting
│
├── requirements.txt   <- The requirements file for reproducing the analysis environment, e.g.
│                         generated with `pip freeze > requirements.txt`
│
├── setup.py           <- Make this project pip installable with `pip install -e`
├── src                <- Source code for use in this project.
│   ├── __init__.py    <- Makes src a Python module
│   │
│   ├── data           <- Scripts to download or generate data
│   │   └── make_dataset.py
│   │
│   ├── features       <- Scripts to turn raw data into features for modeling
│   │   └── build_features.py
│   │
│   ├── models         <- Scripts to train models and then use trained models to make
│   │   │                 predictions
│   │   ├── predict_model.py
│   │   └── train_model.py
│   │
│   └── visualization  <- Scripts to create exploratory and results oriented visualizations
│       └── visualize.py
│
└── tox.ini            <- tox file with settings for running tox; see tox.testrun.org

Some quick comments about Genevera Allen’s statements regarding Machine Learning

Opening note: Favio Vazquez did a great job in his article about this, with a lot of charts, showing that in the modern Machine Learning approach, with the tools we currently have, the problems of replication and methodology are being tackled.

It’s becoming quite a trend: some researcher has criticism of Machine Learning, starts cherry-picking (fallacy of incomplete evidence) potential issues, and opens with statements like “We have a problem in Machine Learning and the results are not reproducible“, “Machine Learning doesn’t work“, “Artificial intelligence faces a reproducibility crisis“, “AI researchers allege that machine learning is alchemy“, and boom: we have clickbait, rants, bashing, and a never-ending spiral of non-constructive criticism. Afterward this researcher gets some spotlight in the public debate about Machine Learning, goes to CNN to give some interviews, and becomes a “reference on issues in Machine Learning“.

Right now it’s Ms. Allen’s turn, with the question/statement “Can we trust scientific discoveries made using machine learning?”, where she brings good arguments to the debate, but I think she misses the point by 1) not bringing any solution or proposal, and 2) making a statement that is so broad and obvious that it could be applied to any field of science.

My main intention here is just to make some very short comments to show that these issues are well known by the Machine Learning community and that we already have several tools and methods to tackle them.

My second intention is to demonstrate that this kind of very broad and obvious argument brings more friction than light to the debate. I’ll include each statement and a short response below:

“The question is, ‘Can we really trust the discoveries that are currently being made using machine-learning techniques applied to large data sets?'” Allen said. “The answer in many situations is probably, ‘Not without checking,’ but work is underway on next-generation machine-learning systems that will assess the uncertainty and reproducibility of their predictions.”

Comment: More data does not imply more insights, and harder than getting more data is getting the right combination of hyperparameters, feature engineering, and ensembling/stacking of models. And every scientific statement must be checked (this is a basic assumption of the scientific method). But maybe this is no longer true in modern research, as we celebrate (and oversell) scientific statements while researchers intentionally hide their methods and findings. It’s like Hans Bethe hiding his discoveries about stellar nucleosynthesis because at some point in the future someone could potentially use them to make atomic bombs.

“A lot of these techniques are designed to always make a prediction,” she said. “They never come back with ‘I don’t know,’ or ‘I didn’t discover anything,’ because they aren’t made to.”

Comment: This is simply not true. A very quick check of Scikit-Learn, XGBoost, and Keras (three of the most popular ML libraries) shatters this argument.
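To make that concrete, here is a minimal scikit-learn sketch (with synthetic data and an arbitrary confidence threshold) of a classifier that answers “I don’t know” whenever it is not confident enough:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Abstain whenever the highest class probability is below an arbitrary threshold
proba = clf.predict_proba(X_test)
threshold = 0.8
predictions = np.where(proba.max(axis=1) >= threshold,
                       clf.classes_[proba.argmax(axis=1)],
                       -1)  # -1 stands for "I don't know"

print((predictions == -1).mean())  # fraction of abstentions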

“In precision medicine, it’s important to find groups of patients that have genomically similar profiles so you can develop drug therapies that are targeted to the specific genome for their disease,” Allen said. “People have applied machine learning to genomic data from clinical cohorts to find groups, or clusters, of patients with similar genomic profiles. “But there are cases where discoveries aren’t reproducible; the clusters discovered in one study are completely different than the clusters found in another,”

Comment: Here we have the classic case of misleading experience with a clear use of confirmation bias, caused by confusing the tool with the methodology. The ‘logic‘ of this argument is: a person wants to cut some vegetables to make a salad. This person uses a salad knife (the tool), but instead of using it properly (in the kitchen, with a proper cutting board), they cut the vegetables at the top of a staircase after drinking two bottles of vodka (the wrong method) and end up getting cut; after that, this person concludes that the knife is dangerous and doesn’t work.

There are a bunch of guidelines being proposed and several good resources: Machine Learning Mastery has already tackled this issue, this excellent post from Determined ML makes a good argument, and this repo has tons of reproducible papers, even using Deep Learning. The main point is: any junior Machine Learning Engineer knows that hashing the dataset and fixing a seed at the beginning of the experiment solves at least 90% of these problems.
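As an illustration of that last point, here is a minimal sketch of hashing the dataset and fixing seeds at the start of an experiment (the dataset file name is hypothetical):

import hashlib
import random

import numpy as np

# Fingerprint the exact dataset used in the experiment (hypothetical file name)
with open('dataset.csv', 'rb') as f:
    dataset_hash = hashlib.sha256(f.read()).hexdigest()
print(f'Dataset SHA-256: {dataset_hash}')

# Fix the seeds of the random number generators used downstream
SEED = 42
random.seed(SEED)
np.random.seed(SEED)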

Conclusion

There are a lot of researchers and journalists who cannot (or do not want to) understand that not only Machine Learning but all of science has a huge problem with the replication of studies (this is not the case for Ms. Allen, who has a very interesting track record of ML publications). In psychology, half of the studies cannot be replicated, and even medical findings are in some instances false, which shows that there is a very long road ahead to minimize this kind of problem.
