A ROC curve in this post shows why custom models are more efficient.
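As a minimal sketch of the kind of ROC comparison the linked post makes, the snippet below contrasts a generic baseline against a more heavily tuned ("custom") model. scikit-learn, the synthetic data, and the choice of models are all assumptions for illustration; the post's actual models and data are not reproduced here.

```python
# Sketch: ROC points and AUC for a baseline vs. a "custom" model.
# Data is synthetic; model choices are illustrative assumptions.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20,
                           n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

models = {
    "baseline": LogisticRegression(max_iter=1000),
    "custom": GradientBoostingClassifier(random_state=0),
}
results = {}
for name, model in models.items():
    scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, scores)  # points along the ROC curve
    results[name] = (fpr, tpr, roc_auc_score(y_te, scores))
```

The `(fpr, tpr)` pairs in `results` can be fed straight into any plotting library; the AUC values give a single-number summary of each curve.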
A nice study by Peter Hauck on how to use WEKA for sports data analysis.
Straight from R & Bioconductor: probably the best manual available on the web.
This article by Roger Peng illustrates this explosive combination.
Below are the three main problems listed in the article:
- Big Data are often “Wrong” Data. The students used the sensors to measure something, but the sensors didn’t give them everything they needed. Part of this is that the sensors were cheap, and budget was likely a big constraint here. But Big Data are often big because they are cheap. And of course, they still couldn’t tell that the elevator was broken.
- A failure of interrogation. With all the data the students collected with their multitude of sensors, they were unable to answer the question “What else could explain what I’m observing?”
- A strong desire to tell a story. Upon looking at the data, they seemed to “make sense” or to at least match a preconceived notion of what they should look like. This is related to #2 above, which is that you have to challenge what you see. It’s very easy and tempting to let the data tell an interesting story rather than the right story.
KDnuggets has a very interesting list of reasons to read Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die.
1. New case studies. Find detailed stories you have never before heard from Hewlett-Packard, Chase, and the Obama Campaign. And did you know that John Elder once invested all his own personal money into a blackbox stock market system of his own design? That’s the opening story of Chapter 1.
2. Complete conceptual coverage. Although packaged with catchy chapter titles, the conceptual outline is fundamental: 1) deployment, 2) civil liberties, 3) data, 4) core modeling, 5) ensemble models, 6) IBM’s Jeopardy!-playing Watson, and 7) uplift modeling (aka net lift or persuasion modeling).
3. A cross-industry compendium of 147 cases. This comprehensive collection of mini-case studies serves to illustrate just how wide the field’s reach extends. A color insert, it includes a table for each of the verticals: Personal Life, Marketing, Finance, Healthcare, Crime Fighting, Reliability Modeling, Government and Nonprofit, Human Language and Thought, and Human Resources. One reviewer said, “The tables alone are worth the price of admission.”
4. Privacy and other civil liberty concerns. The author’s treatise on predictive analytics’ ethical realm, a chapter entitled “With Power Comes Responsibility,” addresses the questions: In what ways does predictive analytics fuel the contentious flames surrounding data privacy, raising its already-high stakes? What civil liberty concerns arise beyond privacy per se? What about predictive crime models that help decide who stays in prison?
The Deep Data Mining Blog addresses an interesting topic in this post: how to compare and choose among classification models. In the post, the authors compare several classification methods and use a Lift table to compare their performance.
The results are quite clear: although the Gradient Boosted Tree method is the most performant in terms of accuracy, model selection should also take into account the complexity of performing the walk-through in production environments.
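A decile Lift table like the one used in the post can be sketched as follows: score a held-out set, sort by predicted probability, split into ten bins, and divide each bin's response rate by the overall base rate. scikit-learn, the synthetic data, and the Gradient Boosted Tree as scorer are assumptions here, not the post's actual setup.

```python
# Sketch: a decile lift table from classifier scores.
# Data is synthetic and imbalanced (~10% positives) for illustration.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.9], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4,
                                          random_state=0)

# Score the held-out set with the model being evaluated.
scores = (GradientBoostingClassifier(random_state=0)
          .fit(X_tr, y_tr)
          .predict_proba(X_te)[:, 1])

order = np.argsort(-scores)                 # highest scores first
deciles = np.array_split(y_te[order], 10)   # ten roughly equal bins
base_rate = y_te.mean()                     # overall response rate
lift = [d.mean() / base_rate for d in deciles]
```

A lift well above 1.0 in the top deciles means the model concentrates responders where a ranked campaign would act first, which is exactly what the post's Lift-table comparison measures across methods.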