# SLIM: Sparse Linear Methods for Top-N Recommender Systems

Um ótimo artigo de base teórica, relativo a geração de Top-N recomendações em cenários bem esparsos (e.g. sistema de rating 0-5 em que poucas pessoas fazem a anotação do rating, etc).

Recentemente, esse problema de recomendar dentro de uma matriz muito esparsa foi o motivo pelo qual o Netflix mudou o seu sistema de Rating que era de 1 a 5 para jóia ou ruim.

Em todo o caso vale a pena a leitura para ver a forma na qual os autores estão trabalhando nesse tipo de desafio.

Abstract: This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an 1-norm and 2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.

# Melhores papers de Deep Learning de 2012 até 2016

Para estudar com lápis na mão, e café na caneca.

Via Kdnuggets

1. Understanding / Generalization / Transfer

Distilling the knowledge in a neural network (2015), G. Hinton et al. [pdf]

2. Optimization / Training Techniques

Batch normalization: Accelerating deep network training by reducing internal covariate shift (2015), S. Loffe and C. Szegedy [pdf]

3. Unsupervised / Generative Models

Unsupervised representation learning with deep convolutional generative adversarial networks (2015), A. Radford et al. [pdf]

4. Convolutional Neural Network Models

Deep residual learning for image recognition (2016), K. He et al. [pdf]

5. Image: Segmentation / Object Detection

Fast R-CNN (2015), R. Girshick [pdf]

6. Image / Video / Etc.

Show and tell: A neural image caption generator (2015), O. Vinyals et al. [pdf]

7. Natural Language Processing / RNNs

Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014), K. Cho et al. [pdf]

8. Speech / Other Domain

Speech recognition with deep recurrent neural networks (2013), A. Graves [pdf]

9. Reinforcement Learning / Robotics

Human-level control through deep reinforcement learning (2015), V. Mnih et al. [pdf]

10. More Papers from 2016

Domain-adversarial training of neural networks (2016), Y. Ganin et al. [pdf]

# Porque o xGBoost ganha todas as competições de Machine Learning

Uma (longa e) boa resposta está nesta tese de Didrik Nielsen.

16128_FULLTEXT

Abstract: Tree boosting has empirically proven to be a highly effective approach to predictive modeling.
It has shown remarkable results for a vast array of problems.
For many years, MART has been the tree boosting method of choice.
More recently, a tree boosting method known as XGBoost has gained popularity by winning numerous machine learning competitions.
In this thesis, we will investigate how XGBoost differs from the more traditional MART.
We will show that XGBoost employs a boosting algorithm which we will term Newton boosting. This boosting algorithm will further be compared with the gradient boosting algorithm that MART employs.
Moreover, we will discuss the regularization techniques that these methods offer and the effect these have on the models.
In addition to this, we will attempt to answer the question of why XGBoost seems to win so many competitions.
To do this, we will provide some arguments for why tree boosting, and in particular XGBoost, seems to be such a highly effective and versatile approach to predictive modeling.
The core argument is that tree boosting can be seen to adaptively determine the local neighbourhoods of the model. Tree boosting can thus be seen to take the bias-variance tradeoff into consideration during model fitting. XGBoost further introduces some subtle improvements which allows it to deal with the bias-variance tradeoff even more carefully.

Conclusion: After determining the different boosting algorithms and regularization techniques these methods utilize and exploring the effects of these, we turned to providing arguments for why XGBoost seems to win “every” competition. To provide possible answers to this question, we first gave reasons for why tree boosting in general can be an effective approach. We provided two main arguments for this. First off, additive tree models can be seen to have rich representational abilities. Provided that enough trees of sufficient depth is combined, they are capable of closely approximating complex functional relationships, including high-order interactions. The most important argument provided for the versatility of tree boosting however, was that tree boosting methods are adaptive. Determining neighbourhoods adaptively allows tree boosting methods to use varying degrees of flexibility in different parts of the input space. They will consequently also automatically perform feature selection. This also makes tree boosting methods robust to the curse of dimensionality. Tree boosting can thus be seen actively take the bias-variance tradeoff into account when fitting models. They start out with a low variance, high bias model and gradually reduce bias by decreasing the size of neighbourhoods where it seems most necessary. Both MART and XGBoost have these properties in common. However, compared to MART, XGBoost uses a higher-order approximation at each iteration, and can thus be expected to learn “better” tree structures. Moreover, it provides clever penalization of individual trees. As discussed earlier, this can be seen to make the method even more adaptive. It will allow the method to adaptively determine the appropriate number of terminal nodes, which might vary among trees. It will further alter the learnt tree structures and leaf weights in order to reduce variance in estimation of the individual trees. Ultimately, this makes XGBoost a highly adaptive method which carefully takes the bias-variance tradeoff into account in nearly every aspect of the learning process.

# Novel Revenue Development and Forecasting Model using Machine Learning Approaches for Cosmetics Enterprises.

Abstract:In the contemporary information society, constructing an effective sales prediction model is challenging due to the sizeable amount of purchasing information obtained from diverse consumer preferences. Many empirical cases shown in the existing literature argue that the traditional forecasting methods, such as the index of smoothness, moving average, and time series, have lost their dominance of prediction accuracy when they are compared with modern forecasting approaches such as neural network (NN) and support vector machine (SVM) models. To verify these findings, this paper utilizes the Taiwanese cosmetic sales data to examine three forecasting models: i) the back propagation neural network (BPNN), ii) least-square support vector machine (LSSVM), and iii) auto regressive model (AR). The result concludes that the LS-SVM has the smallest mean absolute percent error (MAPE) and largest Pearson correlation coefficient ( R2 ) between model and predicted values.

# Falhas na abordagem de Deep Learning: Arquiteturas e Meta-parametrização

O maior desafio corrente enfrentado pela indústria no que diz respeito à Deep Learning está sem sombra de dúvidas na parte computacional em que todo o mercado está absorvendo tanto os serviços de nuvem para realizar cálculos cada vez mais complexos como também bem como investindo em capacidade de computação das GPU.

Entretanto, mesmo com o hardware nos dias de hoje já ser um commodity, a academia está resolvendo um problema que pode revolucionar a forma na qual se faz Deep Learning que é no aspecto arquitetural/parametrização.

Esse comentário da thread diz muito a respeito desse problema em que o usuário diz:

The main problem I see with Deep Learning: too many parameters.

When you have to find the best value for the parameters, that’s a gradient search by itself. The curse of meta-dimensionality.

Ou seja, mesmo com toda a disponibilidade do hardware a questão de saber qual é o melhor arranjo arquitetural de uma rede neural profunda? ainda não está resolvido.

Este paper do Shai Shalev-Shwartz , Ohad Shamir, e Shaked Shammah chamado “Failures of Deep Learning” expõe esse problema de forma bastante rica inclusive com experimentos (este é o repositório no Git).

Os autores colocam que os pontos de falha das redes Deep Learning que são a) falta de métodos baseados em gradiente para otimização de parâmetros, b) problemas estruturais nos algoritmos de Deep Learning na decomposição dos problemas, c) arquitetura e d) saturação das funções de ativação.

Em outras palavras, o que pode estar acontecendo em grande parte das aplicações de Deep Learning é que o tempo de convergência poderia ser muito menor ainda, se estes aspectos já estivessem resolvidos.

Com isso resolvido, grande parte do que conhecemos hoje como indústria de hardware para as redes Deep Learning seria ou sub-utilizada ao extremo (i.e. dado que haverá uma melhora do ponto de vista de otimização arquitetural/algorítmica) ou poderia ser aproveitada para tarefas mais complexas (e.g. como reconhecimento de imagens com baixo número de pixels).

Desta forma mesmo adotando uma metodologia baseada em hardware como a indústria vem fazendo, há ainda muito espaço de otimização em relação às redes Deep Learning do ponto de vista arquitetural e algorítmico.

Abaixo uma lista de referências direto do Stack Exchange para quem quiser se aprofundar mais no assunto:

Algoritmos Neuro-Evolutivos

Miscelânea:

PS: O WordPress retirou a opção de justificar texto, logo desculpem de antemão a aparência amadora do blog nos próximos dias.

# Akid: Uma biblioteca de Redes Neurais para pesquisa e produção

Finalmente começaram a pensar em eliminar esse vale entre ciência/academia e indústria.

Akid: A Library for Neural Network Research and Production from a Dataism Approach – Shuai Li
Abstract: Neural networks are a revolutionary but immature technique that is fast evolving and heavily relies on data. To benefit from the newest development and newly available data, we want the gap between research and production as small as possibly. On the other hand, differing from traditional machine learning models, neural network is not just yet another statistic model, but a model for the natural processing engine — the brain. In this work, we describe a neural network library named {\texttt akid}. It provides higher level of abstraction for entities (abstracted as blocks) in nature upon the abstraction done on signals (abstracted as tensors) by Tensorflow, characterizing the dataism observation that all entities in nature processes input and emit out in some ways. It includes a full stack of software that provides abstraction to let researchers focus on research instead of implementation, while at the same time the developed program can also be put into production seamlessly in a distributed environment, and be production ready. At the top application stack, it provides out-of-box tools for neural network applications. Lower down, akid provides a programming paradigm that lets user easily build customized models. The distributed computing stack handles the concurrency and communication, thus letting models be trained or deployed to a single GPU, multiple GPUs, or a distributed environment without affecting how a model is specified in the programming paradigm stack. Lastly, the distributed deployment stack handles how the distributed computing is deployed, thus decoupling the research prototype environment with the actual production environment, and is able to dynamically allocate computing resources, so development (Devs) and operations (Ops) could be separated.

# Hardware para Machine Learning: Desafios e oportunidades

Um ótimo paper de como o hardware vai exercer função crucial em alguns anos em relação à Core Machine Learning, em especial em sistemas embarcados.

Hardware for Machine Learning: Challenges and Opportunities

Abstract—Machine learning plays a critical role in extracting meaningful information out of the zetabytes of sensor data collected every day. For some applications, the goal is to analyze and understand the data to identify trends (e.g., surveillance, portable/wearable electronics); in other applications, the goal is to take immediate action based the data (e.g., robotics/drones, self-driving cars, smart Internet of Things). For many of these applications, local embedded processing near the sensor is preferred over the cloud due to privacy or latency concerns, or limitations in the communication bandwidth. However, at the sensor there are often stringent constraints on energy consumption and cost in addition to throughput and accuracy requirements. Furthermore, flexibility is often required such that the processing can be adapted for different applications or environments (e.g., update the weights and model in the classifier). In many applications, machine learning often involves transforming the input data into a higher dimensional space, which, along with programmable weights, increases data movement and consequently energy consumption. In this paper, we will discuss how these challenges can be addressed at various levels of hardware design ranging from architecture, hardware-friendly algorithms, mixed-signal circuits, and advanced technologies (including memories and sensors).

Conclusions: Machine learning is an important area of research with many promising applications and opportunities for innovation at various levels of hardware design. During the design process, it is important to balance the accuracy, energy, throughput and cost requirements. Since data movement dominates energy consumption, the primary focus of recent research has been to reduce the data movement while maintaining performance accuracy, throughput and cost. This means selecting architectures with favorable memory hierarchies like a spatial array, and developing dataflows that increase data reuse at the low-cost levels of the memory hierarchy. With joint design of algorithm and hardware, reduced bitwidth precision, increased sparsity and compression are used to minimize the data movement requirements. With mixed-signal circuit design and advanced technologies, computation is moved closer to the source by embedding computation near or within the sensor and the memories. One should also consider the interactions between these different levels. For instance, reducing the bitwidth through hardware-friendly algorithm design enables reduced precision processing with mixed-signal circuits and non-volatile memory. Reducing the cost of memory access with advanced technologies could result in more energy-efficient dataflows.

# Uma abordagem híbrida de aprendizado supervisionado com Machine Learning para composição de melodias de forma algorítmica

A hybrid approach to supervised machine learning for algorithmic melody composition

Abstract: In this work we present an algorithm for composing monophonic melodies similar in style to those of a given, phrase annotated, sample of melodies. For implementation, a hybrid approach incorporating parametric Markov models of higher order and a contour concept of phrases is used. This work is based on the master thesis of Thayabaran Kathiresan (2015). An online listening test conducted shows that enhancing a pure Markov model with musically relevant context, like count and planed melody contour, improves the result significantly.

Conclusions: Even though Markov models alone are seen as no proper method for algorithmic composition, we successfully showed that when combined with further methods they can yield much better results in terms of being closer to human composed melodies. This can be seen when comparing our results with the ones of Kathiresan [Kat15], whose basic algorithm solely relies on Markov models. Apart from the previous works, our algorithm outperforms a random guessing baseline, meaning that humans are not able to clearly distinguish its compositions from humans anymore.

# Deep Learning para análise de séries temporais

Por mais que problemas de reconhecimento de imagens, ou mesmo de segmentação sonora estejam em alta em Deep Learning, 90% dos problemas do mundo quando falamos de dados, passam por dados estruturados, em especial séries temporais. Esse paper mostra uma metodologia pouco convencional (a transformação de séries temporais em uma ‘imagem’ para o uso de uma Rede Coevolucionária) mas que pode mostrar que o céu é o limite quando falamos de arranjos para solução de problemas de predição usando dados estruturados.

Deep Learning for Time-Series Analysis – John Cristian Borges Gamboa

Abstract: In many real-world application, e.g., speech recognition or sleep stage classification, data are captured over the course of time, constituting a Time-Series. Time-Series often contain temporal dependencies that cause two otherwise identical points of time to belong to different classes or predict different behavior. This characteristic generally increases the difficulty of analysing them. Existing techniques often depended on hand-crafted features that were expensive to create and required expert knowledge of the field. With the advent of Deep Learning new models of unsupervised learning of features for Time-series analysis and forecast have been developed. Such new developments are the topic of this paper: a review of the main Deep Learning techniques is presented, and some applications on Time-Series analysis are summaried. The results make it clear that Deep Learning has a lot to contribute to the field.

Conclusions: When applying Deep Learning, one seeks to stack several independent neural network layers that, working together, produce better results than the already existing shallow structures. In this paper, we have reviewed some of these modules, as well the recent work that has been done by using them, found in the literature. Additionally, we have discussed some of the main tasks normally performed when manipulating Time-Series data using deep neural network structures. Finally, a more specific focus was given on one work performing each one of these tasks. Employing Deep Learning to Time-Series analysis has yielded results in these cases that are better than the previously existing techniques, which is an evidence that this is a promising field for improvement.

deep-learning-for-time-series-analysis

# Learning Pulse: Uma abordagem de Machine Learning para previsão de performance em regimes auto-regulados de aprendizado usando dados multimodais

Todo mundo sabe que educação é um assunto muito atual nos dias de hoje, e o principal: como usar os smartphones para que os mesmos saiam de vilões da atenção para uma ferramenta de monitoramento e acompanhamento do desempenho acadêmico?

Esse artigo trás uma resposta interessante sobre esse tema.

Learning Pulse: a machine learning approach for predicting performance in self-regulated learning using multimodal data

Abstract: Learning Pulse explores whether using a machine learning approach on multimodal data such as heart rate, step count, weather condition and learning activity can be used to predict learning performance in self-regulated learning settings. An experiment was carried out lasting eight weeks involving PhD students as participants, each of them wearing a Fitbit HR wristband and having their application on their computer recorded during their learning and working activities throughout the day. A software infrastructure for collecting multimodal learning experiences was implemented. As part of this infrastructure a Data Processing Application was developed to pre-process, analyse and generate predictions to provide feedback to the users about their learning performance. Data from different sources were stored using the xAPI standard into a cloud-based Learning Record Store. The participants of the experiment were asked to rate their learning experience through an Activity Rating Tool indicating their perceived level of productivity, stress, challenge and abilities. These self-reported performance indicators were used as markers to train a Linear Mixed Effect Model to generate learner-specific predictions of the learning performance. We discuss the advantages and the limitations of the used approach, highlighting further development points.

learningpulse_lak17_preprint

# Abordagem de Machine Learning para descoberta de regras para performance de guitarra jazz

Um estudo muito interessante de padrões de guitarra Jazz.

A Machine Learning Approach to Discover Rules for Expressive Performance Actions in Jazz Guitar Music

Expert musicians introduce expression in their performances by manipulating sound properties such as timing, energy, pitch, and timbre. Here, we present a data driven computational approach to induce expressive performance rule models for note duration, onset, energy, and ornamentation transformations in jazz guitar music. We extract high-level features from a set of 16 commercial audio recordings (and corresponding music scores) of jazz guitarist Grant Green in order to characterize the expression in the pieces. We apply machine learning techniques to the resulting features to learn expressive performance rule models. We (1) quantitatively evaluate the accuracy of the induced models, (2) analyse the relative importance of the considered musical features, (3) discuss some of the learnt expressive performance rules in the context of previous work, and (4) assess their generailty. The accuracies of the induced predictive models is significantly above base-line levels indicating that the audio performances and the musical features extracted contain sufficient information to automatically learn informative expressive performance patterns. Feature analysis shows that the most important musical features for predicting expressive transformations are note duration, pitch, metrical strength, phrase position, Narmour structure, and tempo and key of the piece. Similarities and differences between the induced expressive rules and the rules reported in the literature were found. Differences may be due to the fact that most previously studied performance data has consisted of classical music recordings. Finally, the rules’ performer specificity/generality is assessed by applying the induced rules to performances of the same pieces performed by two other professional jazz guitar players. Results show a consistency in the ornamentation patterns between Grant Green and the other two musicians, which may be interpreted as a good indicator for generality of the ornamentation rules.

3.1.2. Duration Rules

• D1: IF note is the final note of a phrase AND the note appears in the third position of an IP (Narmour) structure THEN shorten note
• D2: IF note duration is longer than a dotted half note AND tempo is Medium (90–160 BPM) THEN shorten note
• D3: IF note duration is less than an eighth note AND note is in a very strong metrical position THEN lengthen note.
3.1.3. Onset Deviation Rules

• T1: IF the note duration is short AND piece is up-tempo (≥ 180 BPM) THEN advance note
• T2: IF the duration of the previous note is nominal AND the note’s metrical strength is very strong THEN advance note
• T3: IF the duration of the previous note is short AND piece is up-tempo (≥ 180 BPM) THEN advance note
• T4: IF the tempo is medium (90–160 BPM) AND the note is played within a tonic chord AND the next note’s duration is not short nor long THEN delay note
3.1.4. Energy Deviation Rules

• E1: IF the interval with next note is ascending AND the note pitch not high (lower than B3) THEN play piano
• E2: IF the interval with next note is descending AND the note pitch is very high (higher than C5) THEN play forte
• E3: IF the note is an eight note AND note is the initial note of a phrase THEN play forte.

Conclusões do estudo

Concretely, the obtained accuracies (over the base-line) for the ornamentation, duration, onset, and energy models of 70%(67%), 56%(50%), 63%(54%), and 52%(43%), respectively. Both the features selected and model rules showed musical significance. Similarities and differences among the obtained rules and the ones reported in the literature were discussed. Pattern similarities between classical and jazz music expressive rules were identified, as well as expected dissimilarities expected by the inherent particular musical aspects of each tradition. The induced rules specificity/generality was assessed by applying them to performances of the same pieces performed by two other professional jazz guitar players. Results show a consistency in the ornamentation patterns between Grant Green and the other two musicians, which may be interpreted as a good indicator for generality of the ornamentation rules.

# Hibridização de modelos de Machine Learning pessoais e impessoais para reconhecimento de atividades nos dispositivos móveis

Para quem ainda tem dúvidas que em breve termos modelos de Machine Learning em nossos dispositivos móveis para identificar diversos comportamentos como andar, estar movimento em um veículo automotor, ou mesmo em situações de buffer (i.e. filas, ou outras situações que estamos parados) esse paper mostra um ótimo caminho de implementação.

Hybridizing Personal and Impersonal Machine Learning Models for Activity Recognition on Mobile Devices

Abstract: Recognition of human activities, using smart phones and wearable devices, has attracted much attention recently. The machine learning (ML) approach to human activity recognition can broadly be classified into two categories: training an ML model on (i) an impersonal dataset or (ii) a personal dataset. Previous research shows that models learned from personal datasets can provide better activity recognition accuracy compared to models trained on impersonal datasets. In this paper, we develop a hybrid incremental (HI) method with logistic regression models. This method uses incremental learning of logistic regression to combine the advantages of the impersonal and personal approaches. We investigate two essential issues for this method, which are the selection of the learning rate schedule and the class imbalance problem. Our experiments show that the models learned using our HI method give better accuracy than the models learned from personal or impersonal data only. Besides, the techniques of adaptive learning rate and cost-sensitive learning generally give faster updates and more robust ML models in incremental learning. Our method also has potential bene- fits in the area of privacy preservation.

Conclusions: In this paper, we propose a novel hybrid incremental (HI) method for activity recognition. Traditionally, activity recognition models have been trained on either impersonal or personal datasets. Our HI method effectively combines the advantages of these two approaches. After learning a model on an impersonal dataset in servers, the mobile devices can apply incremental learning on the model using personal data. We focus on logistic regression due to its several benefits, including its small model size that saves bandwidth, good performance in activity recognition, and easy incremental update. We address two important problems that are likely to arise in practical implementations of this incremental learning task. The first problem is associated with user diversity, making it very difficult to tune the learning-rate for each user. The second issue is related to personal data being so imbalanced at times that it may spoil the impersonal model. To overcome those problems, we applied an adaptive learning rate and a cost-sensitive technique. Finally, experimental results are used to validate our solutions.

# The Predictron: End-To-End Learning and Planning

Por David Silver, Hado van Hasselt, Matteo Hessel, Tom Schaul, Arthur Guez, Tim Harley, Gabriel Dulac-Arnold, David Reichert, Neil Rabinowitz, Andre Barreto, Thomas Degris

One of the key challenges of artificial intelligence is to learn models that are effective in the context of planning. In this document we introduce the predictron architecture. The predictron consists of a fully abstract model, represented by a Markov reward process, that can be rolled forward multiple “imagined” planning steps. Each forward pass of the predictron accumulates internal rewards and values over multiple planning depths. The predictron is trained end-to-end so as to make these accumulated values accurately approximate the true value function. We applied the predictron to procedurally generated random mazes and a simulator for the game of pool. The predictron yielded significantly more accurate predictions than conventional deep neural network architectures.

Um review do resultado em relação à arquitetura:

The predictron is a single differentiable architecture that rolls forward an internal model to estimate values. This internal model may be given both the structure and the semantics of traditional reinforcement learning models. But unlike most approaches to model-based reinforcement learning, the model is fully abstract: it need not correspond to the real environment in any human understandable fashion, so long as its rolled-forward “plans” accurately predict outcomes in the true environment.
The predictron may be viewed as a novel network architecture that incorporates several separable ideas. First, the predictron outputs a value by accumulating rewards over a series of internal planning steps. Second, each forward pass of the predictron outputs values at multiple planning depths. Third, these values may be combined together, also within a single forward pass, to output an overall ensemble value. Finally, the different values output by the predictron may be encouraged to be self-consistent with each other, to provide an additional signal during learning. Our experiments demonstrate that these differences result in more accurate predictions of value, in reinforcement learning environments, than more conventional network architectures.
We have focused on value prediction tasks in uncontrolled environments. However, these ideas may transfer to the control setting, for example by using the predictron as a Q-network (Mnih et al., 2015). Even more intriguing is the possibility of learning an internal MDP with abstract internal actions, rather than the MRP model considered in this paper. We aim to explore these ideas in future work.