How to pass environment variables in Jupyter Notebook

(Sharing some personal suffering)

One thing that drives me mad is having to create several .csv/.txt files on my computer just to perform some analysis. I personally prefer to connect directly to the RDBMS (Redshift), fetch the data in a straightforward way, and keep the query inside the Jupyter Notebook.

The main problem with this approach is that many people end up putting their passwords inside the notebooks/scripts, and this is very unsafe. (You don’t need to take my word for it; check it for yourself.)

I was trying to pass the environment variables in the traditional way, using export VARIABLE_NAME=xptoSomeValue, but after starting the Jupyter Notebook I got the following error:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-13-2288aa3f6b7a> in <module>()
      2 import os
      3 
----> 4 HOST = os.environ['REDSHIFT_HOST']
      5 PORT = os.environ['REDSHIFT_PORT']
      6 USER = os.environ['REDSHIFT_USER']

/usr/local/Cellar/python/2.7.13/Frameworks/Python.framework/Versions/2.7/lib/python2.7/UserDict.pyc in __getitem__(self, key)
     38         if hasattr(self.__class__, "__missing__"):
     39             return self.__class__.__missing__(self, key)
---> 40         raise KeyError(key)
     41     def __setitem__(self, key, item): self.data[key] = item
     42     def __delitem__(self, key): del self.data[key]

KeyError: 'REDSHIFT_HOST'

For some reason, this approach didn’t work. As a small workaround, I pass the environment variables directly when calling the jupyter notebook command, like this:

env REDSHIFT_HOST='myRedshiftHost' REDSHIFT_USER='flavio.clesio' REDSHIFT_PORT='5439' REDSHIFT_DATA='myDatabase' REDSHIFT_PASS='myVeryHardPass' jupyter notebook
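
If you want to read those variables defensively inside the notebook, here is a minimal sketch (just one option, standard library only): os.environ.get() lets you fail with a clear message, or fall back to a prompt, instead of hitting a bare KeyError. The variable names follow the command above.

import os
from getpass import getpass

# Read the connection settings exported on the command line above
HOST = os.environ.get('REDSHIFT_HOST')
PORT = os.environ.get('REDSHIFT_PORT', '5439')   # Redshift default port as a fallback
USER = os.environ.get('REDSHIFT_USER')
DATA = os.environ.get('REDSHIFT_DATA')
# Never hard-code the password; prompt for it if the variable is missing
PASS = os.environ.get('REDSHIFT_PASS') or getpass('Redshift password: ')

if not all([HOST, USER, DATA]):
    raise RuntimeError('Redshift environment variables are not set; '
                       'start Jupyter with the env ... jupyter notebook command above.')

As an alternative, IPython’s %env magic can set a variable for the current kernel (e.g. %env REDSHIFT_HOST=myRedshiftHost), but that writes the value back into the notebook, which defeats the purpose for passwords.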

I hope it helps!


Six reasons your boss must send you to Spark Summit Europe 2017

It almost goes without saying that Apache Spark has become the most prominent open-source big data cluster-computing framework of the last two years; this technology not only shattered old paradigms of general-purpose distributed data processing, but also built a very vibrant, innovation-driven, and welcoming community.

This will be my first time at Spark Summit, and for me personally, as a Machine Learning professional, it is a great moment to be part of an event that has grown dramatically in just the last two years.

Here in Brazil we do not have much of a tradition of investing in conferences (there are some cultural reasons behind that which deserve their own blog post), but here are the six reasons your boss must send you to Spark Summit Europe 2017:

  1. Accomplish more than the rest: While some of your company’s competitors are busy reworking old frameworks, your company can stay focused on solving the real problems that let your business scale, using bleeding-edge technologies.
  2. Stay ahead of the game: You can choose one of these two sentences to put on your resumé: 1) “Worked with Apache Spark, the most prominent open-source cluster-computing framework for Big Data projects“; or 2) “Worked with <<some obsolete framework that needs a couple of million USD to deploy, has 70% fewer features than Apache Spark, had its last stable version written 9 years ago, and that the whole market is migrating away from>>”. It’s up to you.
  3. Connect with Apache Spark experts: At Spark Summit you’ll meet people who actually build Apache Spark, not someone with a marketing pitch (no offense) offering difficulties (e.g. a closed-source, buggy platform) in order to sell facilities (e.g. never-ending consulting until your entire budget is drained, (buggy) plugins, add-ons, etc.). Tim Hunter, Tathagata Das, Sue Ann Hong, and Holden Karau are some of the Spark experts who will be there, to name a few.
  4. A network that matters: I mean people who share an interest in and enthusiasm for Apache Spark and open-source technology, and headhunters from good companies that understand that data plays a strong role in business; not some B.S. artist or salesperson cloaked in pseudo-tech jargon.
  5. Applied knowledge produces innovation, and innovation produces results: Some cases of using Apache Spark to innovate and help the business: saving more than US$ 3 million using Apache Spark and Machine Learning, managing a 300 TB data workload using Apache Spark, real-time anomaly detection in production systems, changing the game of digital marketing using Apache Spark, and predicting traffic using weather data.
  6. Opting out will destroy your business and your career: Refusing to acquire and apply new knowledge is the fastest way to destroy your career, stagnating in old methods/processes/platforms and becoming obsolete in a few months. For your company, opting out of innovation, or of learning new methods and technologies that could help scale the business or enhance productivity, is a good way to go out of business in a few years.

To register and learn more about the event, please visit Spark Summit 2017 and follow spark_summit on Twitter.


See you at Spark Summit Europe 2017

On October 26, my friend Eiti Kimura and I will give a talk called Preventing leakage and monitoring distributed systems with Machine Learning at Spark Summit Europe 2017, where we’ll show our solution for monitoring a highly complex distributed system using Apache Spark as a Machine Learning tool.

We’re very excited to share our experience in this journey, and how we solved a complex problem using a simple solution that saved more than US$ 3 million in the last 19 months.

See you at Spark Summit in Dublin.


Anomaly Detection in Multivariate Non-stationary Time Series for Automatic DBMS Diagnosis

ABSTRACT— Anomaly detection in database management systems (DBMSs) is difficult because of increasing number of statistics (stat) and event metrics in big data system. In this paper, I propose an automatic DBMS diagnosis system that detects anomaly periods with abnormal DB stat metrics and finds causal events in the periods. Reconstruction error from deep autoencoder and statistical process control approach are applied to detect time period with anomalies. Related events are found using time series similarity measures between events and abnormal stat metrics. After training deep autoencoder with DBMS metric data, efficacy of anomaly detection is investigated from other DBMSs containing anomalies. Experiment results show effectiveness of proposed model, especially, batch temporal normalization layer. Proposed model is used for publishing automatic DBMS diagnosis reports in order to determine DBMS configuration and SQL tuning.

CONCLUSION AND FUTURE WORK I proposed a machine learning model for automatic DBMS diagnosis. The proposed model detects anomaly periods from reconstruct error with deep autoencoder. I also verified empirically that temporal normalization is essential when input data is non-stationary multivariate time series. With SPC approach, time period is considered anomaly period when reconstruction error is outside of control limit. According types or users of DBMSs, decision rules that are used in SPC can be added. For example, warning line with 2 sigma can be utilized to decide whether it is anomaly or not [12, 13]. In this paper, anomaly detection test is proceeded in other DBMSs whose data is not used in training, because performance of basic pre-trained model is important in service providers’ perspective. Efficacy of detection performance is validated with blind test and DBAs’ opinions. The result of automatic anomaly diagnosis would help DB consultants save time for anomaly periods and main wait events. Thus, they can concentrate on only making solution when DB disorders occur. For better performance of anomaly detection, additional training can be proceeded after pre-trained model is adopted. In addition, recurrent and convolutional neural network can be used in reconstruction part to capture hidden representation of sequential and local relationship. If anomaly labeled data is generated, detection result can be analyzed with numerical performance measures. However, in practice, it is hard to secure labeled anomaly dataset according to each DBMS. Proposed model is meaningful in unsupervised anomaly detection model that doesn’t need labeled data and can be generalized to other DBMSs with pre-trained model
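
Since the paper’s code is not included here, the snippet below is only a minimal, self-contained sketch of the general recipe it describes (reconstruction error from a deep autoencoder plus an SPC-style control limit); the synthetic data, layer sizes, and the 3-sigma limit are my own assumptions, not the authors’ settings.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
train = rng.normal(size=(2000, 20))                     # placeholder for normalized DB stat metrics
test = np.vstack([rng.normal(size=(200, 20)),
                  rng.normal(loc=4.0, size=(20, 20))])  # last 20 windows are "anomalous"

n_metrics = train.shape[1]
autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(n_metrics,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8, activation="relu"),         # bottleneck
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(n_metrics, activation="linear"),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(train, train, epochs=30, batch_size=64, verbose=0)

def reconstruction_error(x):
    return np.mean((x - autoencoder.predict(x, verbose=0)) ** 2, axis=1)

# SPC-style control limit: mean + 3 sigma of the training reconstruction error
err = reconstruction_error(train)
limit = err.mean() + 3 * err.std()
print("anomalous windows:", np.where(reconstruction_error(test) > limit)[0])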


Very Deep Convolutional Networks for Large-Scale Image Recognition

ABSTRACT In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small ( 3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision

CONCLUSION In this work we evaluated very deep convolutional networks (up to 19 weight layers) for largescale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.
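
Just to make the design concrete, here is a toy Keras sketch of the early stages of a VGG-style network (my own illustration, not the paper’s full 16/19-layer configuration): only 3×3 filters, with the number of channels doubling after each 2×2 max-pooling step. Two stacked 3×3 convolutions cover the same receptive field as a single 5×5 convolution with fewer parameters and an extra non-linearity.

import tensorflow as tf
from tensorflow.keras import layers

vgg_style = tf.keras.Sequential([
    tf.keras.Input(shape=(224, 224, 3)),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.Conv2D(64, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.Conv2D(128, 3, padding="same", activation="relu"),
    layers.MaxPooling2D(2),
    # the 16/19-layer configurations continue this pattern (256, 512, 512 channels)
    # before the fully connected classifier
])
vgg_style.summary()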


Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks

Abstract We develop an algorithm which exceeds the performance of board certified cardiologists in detecting a wide range of heart arrhythmias from electrocardiograms recorded with a single-lead wearable monitor. We build a dataset with more than 500 times the number of unique patients than previously studied corpora. On this dataset, we train a 34-layer convolutional neural network which maps a sequence of ECG samples to a sequence of rhythm classes. Committees of boardcertified cardiologists annotate a gold standard test set on which we compare the performance of our model to that of 6 other individual cardiologists. We exceed the average cardiologist performance in both recall (sensitivity) and precision (positive predictive value).

Conclusion We develop a model which exceeds the cardiologist performance in detecting a wide range of heart arrhythmias from single-lead ECG records. Key to the performance of the model is a large annotated dataset and a very deep convolutional network which can map a sequence of ECG samples to a sequence of arrhythmia annotations. On the clinical side, future work should investigate extending the set of arrhythmias and other forms of heart disease which can be automatically detected with high-accuracy from single or multiple lead ECG records. For example we do not detect Ventricular Flutter or Fibrillation. We also do not detect Left or Right Ventricular Hypertrophy, Myocardial Infarction or a number of other heart diseases which do not necessarily exhibit as arrhythmias. Some of these may be difficult or even impossible to detect on a single-lead ECG but can often be seen on a multiple-lead ECG. Given that more than 300 million ECGs are recorded annually, high-accuracy diagnosis from ECG can save expert clinicians and cardiologists considerable time and decrease the number of misdiagnoses. Furthermore, we hope that this technology coupled with low-cost ECG devices enables more widespread use of the ECG as a diagnostic tool in places where access to a cardiologist is difficult.
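
To make the sequence-to-sequence setup concrete, here is a heavily simplified sketch (far shallower than the paper’s 34-layer network; the sampling rate, window length, and class count are assumptions of mine): a 1D CNN that takes a window of single-lead ECG samples and emits one rhythm-class distribution per downsampled output step.

import tensorflow as tf
from tensorflow.keras import layers

n_samples, n_classes = 200 * 30, 12        # assumed: 30 s of ECG at 200 Hz, 12 rhythm classes
ecg_model = tf.keras.Sequential([
    tf.keras.Input(shape=(n_samples, 1)),
    layers.Conv1D(32, 16, strides=4, padding="same", activation="relu"),
    layers.Conv1D(64, 16, strides=4, padding="same", activation="relu"),
    layers.Conv1D(64, 16, strides=4, padding="same", activation="relu"),
    # one class distribution per downsampled time step (a sequence of rhythm labels)
    layers.Conv1D(n_classes, 1, activation="softmax"),
])
ecg_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")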


Learning to Optimize Neural Nets

Abstract Learning to Optimize (Li & Malik, 2016) is a recently proposed framework for learning optimization algorithms using reinforcement learning. In this paper, we explore learning an optimization algorithm for training shallow neural nets. Such high-dimensional stochastic optimization problems present interesting challenges for existing reinforcement learning algorithms. We develop an extension that is suited to learning optimization algorithms in this setting and demonstrate that the learned optimization algorithm consistently outperforms other known optimization algorithms even on unseen tasks and is robust to changes in stochasticity of gradients and the neural net architecture. More specifically, we show that an optimization algorithm trained with the proposed method on the problem of training a neural net on MNIST generalizes to the problems of training neural nets on the Toronto Faces Dataset, CIFAR-10 and CIFAR- 100

 


L2 Regularization versus Batch and Weight Normalization

Abstract: Batch Normalization is a commonly used trick to improve the training of deep neural networks. These neural networks use L2 regularization, also called weight decay, ostensibly to prevent overfitting. However, we show that L2 regularization has no regularizing effect when combined with normalization. Instead, regularization has an influence on the scale of weights, and thereby on the effective learning rate. We investigate this dependence, both in theory, and experimentally. We show that popular optimization methods such as ADAM only partially eliminate the in- fluence of normalization on the learning rate. This leads to a discussion on other ways to mitigate this issue.

Discussion: Normalization, either Batch Normalization, Layer Normalization, or Weight Normalization makes the learned function invariant to scaling of the weights w. This scaling is strongly affected by regularization. We know of no first order gradient method that can fully eliminate this effect. However, a direct solution of forcing kwk = 1 solves the problem. By doing this we also remove one hyperparameter from the training procedure. As noted by Salimans & Kingma (2016), the effect of weight and batch normalization on the effective learning rate might not necessarily be bad. If no regularization is used, then the norm of the weights tends to increase over time, and so the effective learning rate decreases. Often that is a desirable thing, and many training methods lower the learning rate explicitly. However, the decrease of effective learning rate can be hard to control, and can depend a lot on initial steps of training, which makes it harder to reproduce results. With batch normalization we have added two additional parameters, γ and β, and it of course makes sense to also regularize these. In our experiments we did not use regularization for these parameters, though preliminary experiments show that regularization here does not affect the results. This is not very surprising, since with rectified linear activation functions, scaling of γ also has no effect on the function value in subsequent layers. So the only parameters that are actually regularized are the γ’s for the last layer of the network.
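
The core argument (normalization makes the learned function invariant to the scale of the weights, so an L2 penalty only changes the effective learning rate) is easy to check numerically. Below is a small sketch under my own assumptions: a single linear layer followed by plain batch normalization over a batch; scaling the weights by any positive constant leaves the normalized output unchanged.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 10))          # a batch of inputs
w = rng.normal(size=(10, 4))            # layer weights

def batchnorm(z, eps=1e-5):
    # normalize each output unit over the batch dimension
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

out = batchnorm(x @ w)
out_scaled = batchnorm(x @ (10.0 * w))  # same weights, scaled by 10

print(np.allclose(out, out_scaled, atol=1e-5))  # True: the function did not change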


Analysis of dropout learning regarded as ensemble learning

Abstract: Deep learning is the state-of-the-art in fields such as visual object recognition and speech recognition. This learning uses a large number of layers, huge number of units, and connections. Therefore, overfitting is a serious problem. To avoid this problem, dropout learning is proposed. Dropout learning neglects some inputs and hidden units in the learning process with a probability, p, and then, the neglected inputs and hidden units are combined with the learned network to express the final output. We find that the process of combining the neglected hidden units with the learned network can be regarded as ensemble learning, so we analyze dropout learning from this point of view.

Results: After the learning, the ensemble output is calculated by using the average of the sub-network outputs. We showed that dropout learning can be regarded as ensemble learning except for using a different set of hidden units in every learning iteration. Using a different set of hidden unit outperforms ensemble learning. We also showed that dropout learning achieves the same performance as the L2 regularizer. Our future work is the theoretical analysis of dropout learning with ReLU activation function.
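
A small numeric sketch of this ensemble view (a toy example of mine, with one ReLU hidden layer, a linear output, and drop probability p): averaging the outputs of many randomly sampled sub-networks lands close to the usual test-time rule of keeping all units and scaling the hidden activations by the keep probability 1 − p.

import numpy as np

rng = np.random.default_rng(42)
p = 0.5                                       # probability of dropping a hidden unit
x = rng.normal(size=(1, 20))                  # a single input example
W1 = rng.normal(size=(20, 50))
W2 = rng.normal(size=(50, 1))

def forward(x, mask):
    hidden = np.maximum(x @ W1, 0.0) * mask   # ReLU hidden layer with a dropout mask
    return (hidden @ W2).item()

# "Ensemble" prediction: average the outputs of many sampled sub-networks
masks = rng.random((10000, 50)) > p           # True = unit is kept (probability 1 - p)
ensemble_out = np.mean([forward(x, m) for m in masks])

# Usual test-time approximation: keep every unit, scale activations by (1 - p)
scaled_out = forward(x, np.full(50, 1.0 - p))

print(ensemble_out, scaled_out)               # the two numbers should be close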


Deep Learning for Tumor Classification in Imaging Mass Spectrometry

Motivation: Tumor classification using Imaging Mass Spectrometry (IMS) data has a high potential for future applications in pathology. Due to the complexity and size of the data, automated feature extraction and classification steps are required to fully process the data. Deep learning offers an approach to learn feature extraction and classification combined in a single model. Commonly these steps are handled separately in IMS data analysis, hence deep learning offers an alternative strategy worthwhile to explore.

Results: Methodologically, we propose an adapted architecture based on deep convolutional networks to handle the characteristics of mass spectrometry data, as well as a strategy to interpret the learned model in the spectral domain based on a sensitivity analysis. The proposed methods are evaluated on two challenging tumor classification tasks and compared to a baseline approach. Competitiveness of the proposed methods are shown on both tasks by studying the performance via cross-validation. Moreover, the learned models are analyzed by the proposed sensitivity analysis revealing biologically plausible effects as well as confounding factors of the considered task. Thus, this study may serve as a starting point for further development of deep learning approaches in IMS classification tasks.

Source Code: https://gitlab.informatik.uni-bremen.de/digipath/Deep Learning

Data: https://seafile.zfn.uni-bremen.de/d/85c915784e/


Do you have a co-worker who wants to leave your company because he’s not working with bleeding-edge Deep Learning tools/algos?

Please show them this excerpt from Ben Lorica’s podcast:

Adoption of machine learning and deep learning in large companies

Everything in the enterprise space is ROI driven. They don’t know that the newest deep learning paper just came out from Google. They’re not going to clone some random GitHub repository and try it out, and just try to put it in production. They don’t do that. They want to understand ROI. They work a job, they have a goal, and they have a budget. They need to figure out what to do with that budget as it relates to their job at their company. Their company is usually a for-profit corporation trying to make money, or trying to increase margins for shareholders.

… Frankly, they don’t care if it’s linear regression, or random forest, either. … Machine learning has barely penetrated the Fortune 2000. Despite all these tools existing, most of them don’t have it in production because they don’t see a point in adopting it. I think Intel said this right: as far as enterprise adoption is concerned, it’s still fairly early for machine learning.


Lack of transparency is the bottleneck in academia

One of my biggest mistakes was to build my whole master’s dissertation on private data (provided by my former employer) using closed tools (e.g. Viscovery Mine).

This was a huge blocker for sharing my research with anyone in the community and getting a second opinion about my work with regard to reproducibility. I’m working on opening up my data and writing a new version, or a book, about this kind of analysis using Non-Performing Loans data.

In Denny’s blog, he talks about how engineering is the bottleneck in Deep Learning research, and he makes the following statements:

I will use the Deep Learning community as an example, because that’s what I’m familiar with, but this probably applies to other communities as well. As a community of researchers we all share a common goal: Move the field forward. Push the state of the art. There are various ways to do this, but the most common one is to publish research papers. The vast majority of published papers are incremental, and I don’t mean this in a degrading fashion. I believe that research is incremental by definition, which is just another way of saying that new work builds upon what other’s have done in the past. And that’s how it should be. To make this concrete, the majority of the papers I come across consist of more than 90% existing work, which includes datasets, preprocessing techniques, evaluation metrics, baseline model architectures, and so on. The authors then typically add a bit novelty and show improvement over well-established baselines.

So far nothing is wrong with this. The problem is not the process itself, but how it is implemented. There are two issues that stand out to me, both of which can be solved with “just engineering.” 1. Waste of research time and 2. Lack of rigor and reproducibility. Let’s look at each of them.

And the final musing:

Personally, I do not trust paper results at all. I tend to read papers for inspiration – I look at the ideas, not at the results. This isn’t how it should be. What if all researchers published code? Wouldn’t that solve the problem? Actually, no. Putting your 10,000 lines of undocumented code on Github and saying “here, run this command to reproduce my number” is not the same as producing code that people will read, understand, verify, and build upon. It’s like Shinichi Mochizuki’s proof of the ABC Conjecture, producing something that nobody except you understands.

Personally, I think this approach of discarding the results and focusing on the novelty of the methods is better than trying to make sense of whatever aspect of the results the researcher wants to cover up with academic BS complexity.

Why you need to enforce reproducibility as a habit

In my first few months at Movile, I noticed a strange pattern around several “data analyses”. The pattern was: once an analysis was delivered by someone without any data science background, its insights suddenly became a kind of dogma.

In other words, no one would check the information, and in most cases there was no code, no commit on GitHub, no .sql file, and no .R/.py file with the scripts used.

The practical problem is: what if this information was dead wrong? And worse: how would we discover whether it had harmed the business?

Seeing this, my first mission as Data Intelligence Tech Lead at the time was to enforce, for every BI Developer, Revenue Assurance Data Analyst, and Data Scientist, that all code must be reproducible, no matter the conditions. Every insight must be delivered with its code on GitHub.

Someone could say: “Wow… We have a little dictator here!”

With this simple rule, we have seen a non-exhaustive list of positive effects:

  • We are still collecting a huge dividend from reproducible science: every opinion has code behind it, and this code can be tested by anyone with access to GitHub. This prevents the “Excel kid” from driving any decision-making without a hand on their shoulder, BEFORE the decision is made;
  • We unmasked several “BS artists” who exploited the lack of data literacy of our internal clients (e.g. analysts, managers, et cetera), presenting unnecessary complexity or delusional estimates without any kind of method behind them; and
  • We developed a culture of being very skeptical about our estimates, especially about what we do not know from the data (a.k.a. exogenous factors such as the market, the Brazilian economy, and so on). In other words: we stopped guessing about what we did not know at the time and MADE THAT CLEAR to our internal clients.

To learn a little more about how we operate: this article was the key reference for building our culture of compliance and deployment.

No matter what time crunch you are facing, it’s not worth putting a flaky implementation of an analysis into production. As data scientists, we are working to create a culture of data-driven decision-making. If your application breaks without an explanation (likely because you are unable to reproduce the results), people will lose confidence in your application and stop making decisions based on the results of your application. Even if you eventually fix it, that confidence is very, very hard to win back.

Data science teams should require reproducibility in the same way they require unit testing, linting, code versioning, and review. Without consistently producing results as good or better than known results for known data, analyses should never be passed on to deployment. This performance can be measured via techniques similar to integration testing. Further, if possible, models can be run in parallel on current data running through your systems for a side-by-side comparison with current production models.

Don’t get me wrong: without any kind of compliance around your analyses, your organization will become a house of BS artists, and any benefit of extracting insights from the data will be contaminated with hidden bias, which can lead to several decision-making disasters, as we have already experienced.


Productionizing Machine Learning Models and taking care of the neighbors

At Movile we have a Machine Learning Squad composed of the following members:

  • 1 Tech Lead (mixed engineering and computational background)
  • 2 Core ML engineers (production side)
  • 1 Data Scientist (with statistical background) – (data analysis and prototyping side)
  • 1 Data Scientist (with computational background) – (data analysis and prototyping side)

As you can see, there are different backgrounds on the team, and to make the entire workflow as productive and smooth as possible, we need a good fence (a.k.a. a crystal-clear view of the roles) to keep everyone motivated and productive.

This article, written by Jhonatan Morra, gives a good view of this and of how we deal with it at Movile.

Here are some quotes:

One of the most important goals of any data science team is the ability to create machine learning models, evaluate them offline, and get them safely to production. The faster this process can be performed, the more effective most teams will be. In most organizations, the team responsible for scoring a model and the team responsible for training a model are separate. Because of this, a clear separation of concerns is necessary for these two teams to operate at whatever speed suits them best. This post will cover how to make this work: implementing your ML algorithms in such a way that they can be tested, improved, and updated without causing problems downstream or requiring changes upstream in the data pipeline.

We can get clarity about the requirements for the data and production teams by breaking the data-driven application down into its constituent parts. In building and deploying a real-time data application, the goal of the data science team is to produce a function that reliably and in real-time ingests each data point and returns a prediction. For instance, if the business concern is modeling churn, we might ingest the data about a user and return a predicted probability of churn. The fact that we have to featurize that user and then send them through a random forest, for instance, is not the concern of the scoring team and should not be exposed to them.
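
A sketch of that separation of concerns (not Movile’s actual code; the ChurnModel class, the feature names, and the random forest follow the churn example in the quote and are purely illustrative): the scoring team only calls predict(user), while featurization and the choice of model stay behind the interface on the data science side.

from dataclasses import dataclass
import numpy as np
from sklearn.ensemble import RandomForestClassifier

@dataclass
class ChurnModel:
    model: RandomForestClassifier

    def _featurize(self, user: dict) -> list:
        # All feature engineering is hidden behind this private method
        return [user["days_since_last_purchase"],
                user["monthly_spend"],
                int(user["is_on_promo_plan"])]

    def predict(self, user: dict) -> float:
        """Ingest one raw user record and return the predicted churn probability."""
        return self.model.predict_proba([self._featurize(user)])[0][1]

# Illustrative usage with toy data
X_train = np.array([[3, 120.0, 0], [45, 10.0, 1], [7, 80.0, 0], [60, 5.0, 1]])
y_train = np.array([0, 1, 0, 1])
churn = ChurnModel(RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train))
print(churn.predict({"days_since_last_purchase": 50,
                     "monthly_spend": 8.0,
                     "is_on_promo_plan": True}))

If the data science team later swaps the random forest for something else, or changes the features, nothing changes for the scoring side as long as predict(user) keeps the same contract.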


Best price to bill approach

For those looking for an initial approach to finding the best time to bill customers for subscription services, this paper can be a good starting point.

In my current company, this is a very challenging problem.

Machine-Learning System For Recurring Subscription Billing

Jack Greenberg, Thomas Price

Abstract: A system and method for recurring billing of periodic subscriptions are disclosed. The system attempts to maximize a metric like long term customer retention while tailoring the subscription billing to the customer, using machine learning. The system is initially trained with a set of training data — a large corpus of records of subscription billings — including successes, billing failures, and customer cancellations. Any available metadata about the users or the type of subscription is also attached and may be used as features for the machine learning model. Such metadata may include, for example, customers’ age, gender, demographics, interests, and online behavioral profile/history, as well as metadata to identify the type of service being billed, such as music subscriptions, delivery subscriptions or other types of subscriptions, or the payment instrument. The system is used to predict the subscription model for a given user with relevant user-related constraints, while optimizing acceptability to that user.


Algorithm over Regulations (?)

This scene is the best thing I can relate to this particular topic.

“But, the bells have already been rung and they’ve heard it. Out in the dark. Among the stars. Ding dong, the God is dead. The bells, cannot be unrung! He’s hungry. He’s found us. And He’s coming!

Ding, ding, ding, ding, ding…”

(Hint, fellas: this is a great time to not be evil and to check your models to avoid any kind of discrimination against your current or potential customers.)

European Union regulations on algorithmic decision-making and a “right to explanation” – By Bryce Goodman, Seth Flaxman

Abstract: We summarize the potential impact that the European Union’s new General Data Protection Regulation will have on the routine use of machine learning algorithms. Slated to take effect as law across the EU in 2018, it will restrict automated individual decision-making (that is, algorithms that make decisions based on userlevel predictors) which “significantly affect” users. The law will also effectively create a “right to explanation,” whereby a user can ask for an explanation of an algorithmic decision that was made about them. We argue that while this law will pose large challenges for industry, it highlights opportunities for computer scientists to take the lead in designing algorithms and evaluation frameworks which avoid discrimination and enable explanation.

Conclusion: While the GDPR presents a number of problems for current applications in machine learning they are, we believe, good problems to have. The challenges described in this paper emphasize the importance of work that ensures that algorithms are not merely efficient, but transparent and fair. Research is underway in pursuit of rendering algorithms more amenable to ex post and ex ante inspection [11, 31, 20]. Furthermore, a number of recent studies have attempted to tackle the issue of discrimination within algorithms by introducing tools to both identify [5, 29] and rectify [9, 16, 32, 6, 12, 14] cases of unwanted bias. It remains to be seen whether these techniques are adopted in practice. One silver lining of this research is to show that, for certain types of algorithmic profiling, it is possible to both identify and implement interventions to correct for discrimination. This is in contrast to cases where discrimination arises from human judgment. The role of extraneous and ethically inappropriate factors in human decision making is well documented (e.g., [30, 10, 1]), and discriminatory decision making is pervasive in many of the sectors where algorithmic profiling might be introduced (e.g. [19, 7]). We believe that, properly applied, algorithms can not only make more accurate predictions, but offer increased transparency and fairness over their human counterparts (cf. [23]). Above all else, the GDPR is a vital acknowledgement that, when algorithms are deployed in society, few if any decisions are purely “technical”. Rather, the ethical design of algorithms requires coordination between technical and philosophical resources of the highest caliber. A start has been made, but there is far to go. And, with less than two years until the GDPR takes effect, the clock is ticking.


Densely Connected Convolutional Networks – implementations

Abstract: Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections – one between each layer and its subsequent layer – our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance.
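
To make the “each layer receives the feature maps of all preceding layers” idea concrete, here is a minimal Keras sketch of a single dense block (my own illustration with assumed hyperparameters, not the authors’ reference implementation): four BN-ReLU-Conv layers with growth rate 12, each output concatenated onto everything that came before it.

import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=12):
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.ReLU()(y)
        y = layers.Conv2D(growth_rate, kernel_size=3, padding="same")(y)
        x = layers.Concatenate()([x, y])   # the next layer's input is the concat of all previous outputs
    return x

inputs = tf.keras.Input(shape=(32, 32, 16))
outputs = dense_block(inputs)
model = tf.keras.Model(inputs, outputs)
model.summary()   # channels grow by growth_rate at every layer: 16 -> 64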


Matching and the use of regression to analyze treatment effects

One of the thorniest subjects when we talk about statistics for producing estimates over populations with different characteristics is matching.

For those who don’t know, matching is basically a technique for observational comparison between a control group and a treatment group at the level of individual observations (i.e. each member of the treatment group is paired with a member of the control group and the differences in the estimates are examined), where the main objective is to assess the treatment effect given the characteristics of the observed data, isolating the analysis or conditioning it on the differences between the covariates.

An example of an application is the IPEA study that estimates the poor and indigent populations, in which the similar socioeconomic characteristics of the set of participating families are mapped.

In this post, Matt Bogard offers some considerations on regression as a variance-based weighted average (over the estimators) of the treatment effect.

Hence, regression gives us a variance based weighted average treatment effect, whereas matching provides a distribution weighted average treatment effect.

So what does this mean in practical terms? Angrist and Piscke explain that regression puts more weight on covariate cells where the conditional variance of treatment status is the greatest, or where there are an equal number of treated and control units. They state that differences matter little when the variation of δx is minimal across covariate combinations.

In his post The cardinal sin of matching, Chris Blattman puts it this way:

“For causal inference, the most important difference between regression and matching is what observations count the most. A regression tries to minimize the squared errors, so observations on the margins get a lot of weight. Matching puts the emphasis on observations that have similar X’s, and so those observations on the margin might get no weight at all….Matching might make sense if there are observations in your data that have no business being compared to one another, and in that way produce a better estimate”

 

We can see that those in the treatment group tend to have higher outcome values so a straight comparison between treatment and controls will overestimate treatment effects due to selection bias:

E[Y_i | d_i = 1] – E[Y_i | d_i = 0] = E[Y_{1i} – Y_{0i}] + { E[Y_{0i} | d_i = 1] – E[Y_{0i} | d_i = 0] }

However, if we estimate differences based on an exact matching scheme, we get a much smaller estimate of .67. If we run a regression using all of the data we get .75. If we consider 3.78 to be biased upward then both matching and regression have significantly reduced it, and depending on the application the difference between .67 and .75 may not be of great consequence. Of course if we run the regression including only matched variables, we get exactly the same results. (see R code below). This is not so different than the method of trimming based on propensity scores suggested in Angrist and Pischke.


A Note on the Validity of Cross-Validation for Evaluating Autoregressive Time Series Prediction

A good paper on applying cross-validation to time series.

Abstract: One of the most widely used standard procedures for model evaluation in classification and regression is K-fold cross-validation (CV). However, when it comes to time series forecasting, because of the inherent serial correlation and potential non-stationarity of the data, its application is not straightforward and often omitted by practitioners in favour of an out-of-sample (OOS) evaluation. In this paper, we show that in the case of a purely autoregressive model, the use of standard K-fold CV is possible as long as the models considered have uncorrelated errors. Such a setup occurs, for example, when the models nest a more appropriate model. This is very common when Machine Learning methods are used for prediction, where CV in particular is suitable to control for overfitting the data. We present theoretical insights supporting our arguments. Furthermore, we present a simulation study and a real-world example where we show empirically that K-fold CV performs favourably compared to both OOS evaluation and other time-series-specific techniques such as non-dependent cross-validation.

Conclusions: In this work we have investigated the use of cross-validation procedures for time series prediction evaluation when purely autoregressive models are used, which is a very common situation; e.g., when using Machine Learning procedures for time series forecasting. In a theoretical proof, we have shown that a normal K-fold cross-validation procedure can be used if the residuals of our model are uncorrelated, which is especially the case if the model nests an appropriate model. In the Monte Carlo experiments, we have shown empirically that even if the lag structure is not correct, as long as the data are fitted well by the model, cross-validation without any modification is a better choice than OOS evaluation. We have then in a real-world data example shown how these findings can be used in a practical situation. Cross-validation can adequately control overfitting in this application, and only if the models underfit the data and lead to heavily correlated errors, are the cross-validation procedures to be avoided as in such a case they may yield a systematic underestimation of the error. However, this case can be easily detected by checking the residuals for serial correlation, e.g., using the Ljung-Box test.
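
Below is a minimal sketch of the setting the paper analyzes (a toy example of mine, not the authors’ code): a purely autoregressive model fitted as a regression on lagged values, evaluated with standard shuffled K-fold CV, plus a Ljung-Box test on the residuals to check the “uncorrelated errors” condition the authors highlight.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from statsmodels.stats.diagnostic import acorr_ljungbox

# Simulated AR(2) series (placeholder data)
rng = np.random.default_rng(1)
y = np.zeros(500)
for t in range(2, 500):
    y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

# Lag-embed the series: predict y[t] from its previous `lags` values
lags = 3   # intentionally more lags than the true order, so the model nests an appropriate one
X = np.column_stack([y[lags - k - 1:-k - 1] for k in range(lags)])
target = y[lags:]

model = LinearRegression()
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, target, cv=cv, scoring="neg_mean_squared_error")
print("K-fold CV MSE:", -scores.mean())

# Check the key assumption: the residuals of the full fit should be uncorrelated
residuals = target - model.fit(X, target).predict(X)
print(acorr_ljungbox(residuals, lags=[10]))   # a large p-value means no serial correlation detected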


Loan products and Credit Scoring Methods by Commercial Banks

Abstract – This study describes the loan products offered by the commercial banks and credit scoring techniques used for classifying risks and granting credit to the applicants in India. The loan products offered by commercial banks are: Housing loans, Personal loans, Business loan, Education loans, Vehicle loans etc. All the loan products are categorized as secures and unsecured loans. Credit scoring techniques used for both secured as well as unsecured loans are broadly divided into two categories as Advanced Statistical Methods and Traditional Statistical Methods

Conclusion: In a new or emerging market, the operational, technical, business and cultural issues should be considered with the implementation of the credit scoring models for retail loan products. The operational issues relate to the use of the model and it is imperative that the staff and the management of the bank understand the purpose of the model. Application scoring models should be used for making credit decisions on new applications and behavioral models for retail loan products to supervise existing borrowers for limit expansion or for marketing of new products. The technical issues relate to the development of proper infrastructure, maintenance of historical data and software needed to build a credit scoring model for retail loan products within the bank. The business issues relate to whether the soundness and safety of the banks could be achieved through the adoption of the quantitative credit decision models, which would send a positive impact in the banking sector. The cultural issues relate to making credit irrespective of race, colour, sex, religion, marital status, age or ethnic origin. Further, the models have to be validated so as to ensure that the model performance is compatible in meeting the business as well as the regulatory requirements. Thus, the above issues have to be considered while developing and implementing credit scoring models for retail loan products within a new or emerging markets.
