Predicting Criminal Movement

A good article on modeling criminal events and their movement.

[…] Data available on distance between criminals’ homes and their targets shows that burglars are willing to travel longer distances for high-value targets, and tend to employ different means of transportation to make these long trips. Of course, this tendency differs among types of criminals. Professionals and older criminals may travel further than younger amateurs. A group of professional burglars planning to rob a bank, for instance, would reasonably be expected to follow a Lévy flight.

“There is actually a relationship between how far these criminals are willing to travel for a target and the ability for a hotspot to form,” explain Kolokolnikov and McCalla. […]
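The Lévy flight the excerpt mentions is a random walk whose step lengths follow a heavy-tailed power-law distribution, so occasional very long jumps dominate the trajectory. As a minimal illustrative sketch (not the model from the article; the tail exponent `alpha` and the uniform step directions are assumptions):

```python
import math
import random

def levy_flight(n_steps, alpha=1.5, seed=42):
    """Simulate a 2-D Levy flight: isotropic random headings with
    Pareto (power-law) distributed step lengths, tail exponent alpha."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    path = [(x, y)]
    for _ in range(n_steps):
        step = rng.paretovariate(alpha)      # heavy-tailed step length
        theta = rng.uniform(0, 2 * math.pi)  # uniform random direction
        x += step * math.cos(theta)
        y += step * math.sin(theta)
        path.append((x, y))
    return path

path = levy_flight(1000)
```

Unlike a Brownian walk with fixed-scale steps, a handful of long flights account for most of the total displacement, which is why the pattern fits a burglar who mostly works locally but occasionally travels far for a high-value target.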


Strategic Metaphors from Football

A short excerpt from what Matthew Hurst is planning for his book:

[…] When attacking, you always have a little more time to set up a shot. I see less experienced players who, when the ball is at their feet and the goal is available to them, panic and shoot. The lack of preparation often means the shot is misfired, the ball goes off target and the opponent gains possession. You always have more time than you think because you know something the defender doesn’t – which is precisely when you are going to shoot. Every moment you prepare improves your chances and keeps them guessing.

When defending, take time away from the attacker. I see this rather awkward movement of a defender standing their ground and moving backwards at the same speed as the attacker. You are giving the attacker that extra time. By taking the time away from them – by aiming to take the ball aggressively – you force their hand (foot).

Own the direction of attack – you’re dribbling the ball and a defender runs back to protect the goal; they are running in front of you watching the ball; you dribble left, they turn left to follow – you turn right, they turn right to follow – they will never gain ownership of the direction of attack and you simply have to decide how long to run them around before shooting.

Use your brain not your legs – the fastest thing on the pitch is the ball. It is more efficient to pass to your team than to run, run, run. Your team needs the skill of making and owning space (options). Let the other team run.

Core competencies are not optional (here I’m talking above my station) – running, trapping the ball and passing are some of the basics of football. It is surprising that some players I see have trouble with these basics, including running (running efficiently is a learned skill). […]


A Bit About the Data Engineer

Also known as DBAs, Database Analysts, DBMs – and here is what may be the future of the nature of their work:

[…] The most important thing in data engineering (the job of building systems that aggregate data and improve it in some regard) is building a system that can respond to change and apply updates and improvements in a fluid manner. When evaluating a data provider, while it is important to ask them for details on the quality of their data (surprisingly, many of them won’t be able to tell you) it is equally important to learn about the processes they have in place to update and correct data with as low a latency as possible. […]


Towards OLAP in Graph Databases

Direto do Another Word for It:

Towards OLAP in Graph Databases (MSc. Thesis) by Michal Bachman.


Graph databases are becoming increasingly popular as an alternative to relational databases for managing complex, densely-connected, semi-structured data. Whilst primarily optimised for online transactional processing, graph databases would greatly benefit from online analytical processing capabilities. Since relational databases were introduced over four decades ago, they have acquired online analytical processing facilities; this is not the case with graph databases, which have only drawn mainstream attention in the past few years.

In this project, we study the problem of online analytical processing in graph databases that use the property graph data model, which is a graph with properties attached to both vertices and edges. We use vertex degree analysis as a simple example problem, create a formal definition of vertex degree in a property graph, and develop a theoretical vertex degree cache with constant space and read time complexity, enabled by a cache compaction operation and a property change frequency heuristic.
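The core idea of a vertex degree cache can be illustrated with a toy sketch: keep per-vertex counters keyed by relationship type and direction, updated on every edge write, so degree reads are O(1) instead of scanning incident edges. This is only an illustration of the caching principle, not the thesis's actual implementation (which adds cache compaction and a property change frequency heuristic); all names below are assumptions.

```python
from collections import defaultdict

class DegreeCache:
    """Toy vertex-degree cache for a property graph: maintains
    per-vertex counts keyed by (relationship type, direction) so
    degree reads are constant time instead of edge scans."""
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def add_edge(self, src, dst, rel_type):
        # Update both endpoints' counters on every write transaction.
        self.counts[src][(rel_type, "OUT")] += 1
        self.counts[dst][(rel_type, "IN")] += 1

    def degree(self, vertex, rel_type=None, direction=None):
        # Constant-time read: sum the cached counters that match
        # the (optional) type and direction filters.
        return sum(
            n for (t, d), n in self.counts[vertex].items()
            if (rel_type is None or t == rel_type)
            and (direction is None or d == direction)
        )

cache = DegreeCache()
cache.add_edge("alice", "bob", "FOLLOWS")
cache.add_edge("carol", "alice", "FOLLOWS")
cache.degree("alice")                   # -> 2
cache.degree("alice", direction="OUT")  # -> 1
```

The trade-off the thesis quantifies follows directly from this design: every write pays a small extra cost to maintain the counters, in exchange for degree reads that no longer depend on how many edges a vertex has.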

We then apply the theory to Neo4j, an open-source property graph database, by developing a Relationship Count Module, which implements the theoretical vertex degree caching. We also design and implement a framework, called GraphAware, which provides supporting functionality for the module and serves as a platform for additional development, particularly of modules that store and maintain graph metadata.

Finally, we show that for certain use cases, for example those in which vertices have relatively high degrees and edges are created in separate transactions, vertex degree analysis can be performed several orders of magnitude faster, whilst sacrificing less than 20% of the write throughput, when using GraphAware Framework with the Relationship Count Module.

By demonstrating the extent of possible performance improvements, exposing the true complexity of a seemingly simple problem, and providing a starting point for future analysis and module development, we take an important step towards online analytical processing in graph databases.

The MSc. thesis: GraphAware: Towards Online Analytical Processing in Graph Databases.

Framework at Github: GraphAware Neo4j Framework.

Michal laments:

It’s not an easy, cover-to-cover read, but there might be some interesting parts, even if you don’t go through all the (over 100) pages.

It’s one hundred and forty-nine pages according to my PDF viewer.

I don’t think Michal needs to worry. If anyone thinks it is too long to read, it’s their loss.

Definitely going on my short list of things to read in detail sooner rather than later.


Statistics vs. Data Mining

This post by Piatetsky-Shapiro sums up the discussion about these two disciplines.



“Statistics vs. Data Mining: Statistics begins after data cleaning is done, whereas Data Mining includes data cleaning and data engineering.”


Machine Learning Course by Hal Daumé III

This machine learning course focuses on introductory aspects of the discipline. The supporting material includes a draft of Hal Daumé III’s book and covers a range of topics, from artificial neural networks to semi-supervised learning.

The course book is available at the link below.


The Playground

This week Kaggle launched a new competition format called Playground. Rather than focusing on a specific solution, this type of competition takes an approach geared much more toward extracting previously unknown information from the datasets.

In data analysis environments there is usually no demand for approaches like this, due not only to pressure for results but also to a certain rigidity in the strategic departments.

Successful data mining environments are not those that search for a needle in the haystack (that is, torturing the data: overfitting, spurious patterns) but those that ‘play’ in the haystack until they feel a ‘sting’ (that is, analyzing patterns, trends, and rules).



The Bursting of the Big Data Bubble

This is probably one of the best posts in the blogosphere on the subject. Cathy O’Neil touches a sore spot for many of the sales-engineer salespeople regarding the high volume of publications, posts, and other White Advertised Papers released about Big Data.

The issue as a whole deserves reflection in homeopathic doses, but here are some of the post’s interesting points:

[…] Unfortunately, this process rarely actually happens the right way, often because the business people ask their data people the wrong questions to begin with, and since they think of their data people as little more than pieces of software – data in, magic out – they don’t get their data people sufficiently involved with working on something that data can address. […]

[…] Also, since there are absolutely no standards for what constitutes a data scientist, and anyone who’s taken a machine learning class at college can claim to be one, the data scientists walking around often have no clue how to actually form the right questions to ask anyway. They are lopsided data people, and only know how to answer already well-defined questions like the ones that Kaggle comes up with. That’s less than half of what a good data scientist does, but people have no idea what a good data scientist does.[…] 

[…] Here’s what I see happening. People have invested some real money in data, and they’ve gotten burned with a lack of medium-term results. Now they’re getting impatient for proof that data is an appropriate place to invest what little money their VC’s have offered them. That means they want really short-term results, which means they’re lowballing data science expertise, which means they only attract people who’ve taken one machine learning class and fancy themselves experts.[…] 

[…] In other words, data science expertise has been commodified, and it’s a race to the bottom. Who will solve my business-critical data problem on a short-term consulting basis for less than $5000? Less than $4000?[…] 

[…] My forecast is that, once the hype wave of big data is dead and gone, there will emerge reasonable standards of what a data scientist should actually be able to do, and moreover a standard of when and how to hire a good one. It’ll be a rubric, and possibly some tests, of both problem solving and communication. […]
