Clustering Algorithms for Massive Data Sets

Via Big Data Central.

Potential applications:

  • Creating a keyword taxonomy to categorize the entire universe of cleaned (standardized), valuable English keywords. We are talking about 10 million keywords made up of one, two or three tokens, that is, about 300 times the number of keywords found in a good English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon on Google (for pay-per-click ad campaigns), to better price them. This is the application discussed in this article (see the sketch after this list).
  • Clustering millions of documents (e.g. books on Amazon.com), or
  • Clustering web pages, or even the entire Internet, which consists of about 100 million top websites – and billions of web pages.
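
As a rough illustration of the scale involved (not the article's actual method), here is a minimal sketch of keyword clustering in Python with scikit-learn; the file name keywords.txt, the number of clusters and all other parameters are placeholder assumptions:

```python
# Minimal sketch: clustering a large keyword list at bounded memory cost.
# "keywords.txt" (one keyword per line) and all parameters are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import MiniBatchKMeans

with open("keywords.txt") as f:
    keywords = [line.strip() for line in f if line.strip()]

# Character n-grams capture near-duplicate spellings among short keywords.
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5))
X = vectorizer.fit_transform(keywords)  # sparse matrix; millions of rows OK

# MiniBatchKMeans fits on small random batches, so it scales to ~10M rows.
km = MiniBatchKMeans(n_clusters=1000, batch_size=10_000, random_state=0)
labels = km.fit_predict(X)

for kw, lab in list(zip(keywords, labels))[:10]:
    print(lab, kw)
```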

What is the difference between LASSO and Ridge Regression?

I know this question is old, but whenever someone cannot understand something, it is a new opportunity to convey that knowledge more intelligently, in a new format.

These two regression-derived techniques are known as shrinkage regression methods.

This becomes necessary for the following reason: a regression with many coefficients makes the model as a whole far more complex and can hurt its interpretability; it can also absorb noise in the data and cause overfitting. These methods address the problem by retaining only a subset of the regression coefficients, which not only reduces the complexity of the model and the way it is computed and built, but also reduces the error and, as a bonus, minimizes any chance of the model overfitting.

Among these shrinkage methods, two stand out in machine learning: Ridge Regression and the LASSO (Least Absolute Shrinkage and Selection Operator).

Ridge Regression is a model regularization method whose main goal is to smooth attributes that are correlated with one another and that increase the noise in the model (a.k.a. multicollinearity). By damping these attributes, the model converges to a much more stable result, with accuracy essentially unchanged. The algorithmic mechanism behind this is a penalty that introduces a bias and shrinks the beta values toward, but never exactly to, zero. Attributes that contribute little to the model's predictive power are thus driven into irrelevance by this bias penalty.

LASSO uses the same mechanism of penalizing coefficients that are highly correlated with one another, but it penalizes the coefficients according to their absolute value (the sum of the absolute values of the estimators) while minimizing the squared error. The penalty can shrink a coefficient all the way to zero, which naturally eliminates the attribute and reduces the dimensionality of the model.
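
A minimal sketch of this difference in practice, using scikit-learn on synthetic data (the alpha values are arbitrary): Ridge keeps every coefficient small but non-zero, while LASSO zeroes the uninformative ones out.

```python
# Minimal sketch comparing the shrinkage behaviour of Ridge (L2 penalty)
# and LASSO (L1 penalty); alpha values and data are illustrative only.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=1.0).fit(X, y)

# Ridge shrinks coefficients toward zero but rarely makes them exactly zero;
# LASSO drives uninformative coefficients all the way to zero.
print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))
print("LASSO zero coefficients:", np.sum(lasso.coef_ == 0))
```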

The question everyone will ask at the end of this post: “But Flavio, when should I apply one or the other?”

The answer is a big “it depends.”

As a working heuristic, I particularly like LASSO to validate preprocessing done with some other technique, such as Rough Sets, or when I have models that first need the error well isolated (e.g. precision-oriented problems). Ridge Regression is more appropriate for problems where the error must be contained as part of a solution (one that will never be the best) and I need to inject a bit more randomness into the model (e.g. accuracy-oriented problems).

Below are the original LASSO and Ridge Regression papers.

Regression Shrinkage and Selection via LASSO

Ridge Regression – Biased Estimation for Nonorthogonal Problems


STR: A Seasonal-Trend Decomposition Procedure Based on Regression

One of the biggest challenges in time series forecasting/decomposition (within the machine learning spectrum) is including multiple seasonal effects, or even knowing which cyclic effects are contained in the series.

This paper by Dokumentov and Rob J Hyndman attacks this question by creating STR, a seasonal-trend decomposition procedure based on regression.

Abstract
We propose new generic methods for decomposing seasonal data: STR (a Seasonal-Trend decomposition procedure based on Regression) and Robust STR. In some ways, STR is similar to Ridge Regression and Robust STR can be related to LASSO. Our new methods are much more general than any alternative time series decomposition methods. They allow for multiple seasonal and cyclic components, and multiple linear regressors with constant, flexible, seasonal and cyclic influence. Seasonal patterns (for both seasonal components and seasonal regressors) can be fractional and flexible over time; moreover they can be either strictly periodic or have a more complex topology. We also provide confidence intervals for the estimated components, and discuss how STR can be used for forecasting.

wp13-15
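
STR itself ships as an R package, so as a rough, swapped-in Python analogue here is a minimal sketch using the classic loess-based STL decomposition from statsmodels on synthetic data; it only illustrates the general idea of separating trend and seasonal components, not the paper's method:

```python
# Minimal sketch of seasonal-trend decomposition on synthetic daily data.
# Uses STL (a classic relative of STR); period and values are illustrative.
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

rng = np.random.default_rng(0)
t = np.arange(365 * 3)
series = pd.Series(
    0.01 * t                               # slow linear trend
    + 5 * np.sin(2 * np.pi * t / 365)      # yearly seasonal component
    + rng.normal(scale=0.5, size=t.size),  # noise
    index=pd.date_range("2020-01-01", periods=t.size, freq="D"),
)

result = STL(series, period=365).fit()
print(result.trend.head())
print(result.seasonal.head())
```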


7 techniques for dimensionality reduction

In the current Big Data era, in which the cost of storage has practically been driven down to commodity levels, many corporations that brag about being Big Data 'adopters' end up paying for/storing noise instead of signal.

For the reason above, from the Data Engineering standpoint the problem of absorbing/retaining this information is solved.

However, when it is necessary to scale a business through intelligence built on data (recalling what was said in the past: Data > Information > Knowledge > Wisdom), what was an inherent feature of the technological advance of data engineering becomes a huge problem within data science.

With this horizontal growth of databases (dimensions/attributes), a serious problem is the Curse of Dimensionality, in which we face multicollinearity, heteroscedasticity and autocorrelation, to stick to simple statistical examples. In computational terms, it goes without saying that more attributes force Data Mining or Computational Intelligence algorithms to process a much larger volume of data (greater processing complexity = higher time cost).

Given this short introduction, this is why dimensionality reduction is so important for any data miner.

This post from Knime presents 7 techniques for dimensionality reduction:

Missing Values Ratio. Data columns with too many missing values are unlikely to carry much useful information. Thus data columns with number of missing values greater than a given threshold can be removed. The higher the threshold, the more aggressive the reduction.

Low Variance Filter. Similarly to the previous technique, data columns with little changes in the data carry little information. Thus all data columns with variance lower than a given threshold are removed. A word of caution: variance is range dependent; therefore normalization is required before applying this technique.

High Correlation Filter. Data columns with very similar trends are also likely to carry very similar information. In this case, only one of them will suffice to feed the machine learning model. Here we calculate the correlation coefficient between numerical columns and between nominal columns as the Pearson’s Product Moment Coefficient and the Pearson’s chi square value respectively. Pairs of columns with correlation coefficient higher than a threshold are reduced to only one. A word of caution: correlation is scale sensitive; therefore column normalization is required for a meaningful correlation comparison.

Random Forests / Ensemble Trees. Decision Tree Ensembles, also referred to as random forests, are useful for feature selection in addition to being effective classifiers. One approach to dimensionality reduction is to generate a large and carefully constructed set of trees against a target attribute and then use each attribute’s usage statistics to find the most informative subset of features. Specifically, we can generate a large set (2000) of very shallow trees (2 levels), with each tree being trained on a small fraction (3) of the total number of attributes. If an attribute is often selected as best split, it is most likely an informative feature to retain. A score calculated on the attribute usage statistics in the random forest tells us ‒ relative to the other attributes ‒ which are the most predictive attributes.

Principal Component Analysis (PCA). Principal Component Analysis (PCA) is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components. Keeping only the first m < n components reduces the data dimensionality while retaining most of the data information, i.e. the variation in the data. Notice that the PCA transformation is sensitive to the relative scaling of the original variables. Data column ranges need to be normalized before applying PCA. Also notice that the new coordinates (PCs) are not real system-produced variables anymore. Applying PCA to your data set loses its interpretability. If interpretability of the results is important for your analysis, PCA is not the transformation for your project.

Backward Feature Elimination. In this technique, at a given iteration, the selected classification algorithm is trained on n input features. Then we remove one input feature at a time and train the same model on n-1 input features n times. The input feature whose removal has produced the smallest increase in the error rate is removed, leaving us with n-1 input features. The classification is then repeated using n-2 features, and so on. Each iteration k produces a model trained on n-k features and an error rate e(k). Selecting the maximum tolerable error rate, we define the smallest number of features necessary to reach that classification performance with the selected machine learning algorithm.

Forward Feature Construction. This is the inverse process to the Backward Feature Elimination. We start with 1 feature only, progressively adding 1 feature at a time, i.e. the feature that produces the highest increase in performance. Both algorithms, Backward Feature Elimination and Forward Feature Construction, are quite time and computationally expensive. They are practically only applicable to a data set with an already relatively low number of input columns.
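
For reference, here is a minimal sketch of three of these filters in Python with pandas/scikit-learn; all thresholds are arbitrary and the data is synthetic:

```python
# Minimal sketch of three of the techniques above; thresholds are arbitrary.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.default_rng(0).normal(size=(500, 10)),
                  columns=[f"x{i}" for i in range(10)])
df["x9"] = df["x0"] * 0.99  # inject a highly correlated column

# 1. Missing Values Ratio: drop columns with too many NaNs.
df = df.loc[:, df.isna().mean() < 0.4]

# 2. High Correlation Filter: drop one column of each highly correlated pair.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
df = df.drop(columns=[c for c in upper.columns if (upper[c] > 0.95).any()])

# 3. PCA: keep enough components to explain 95% of the variance
#    (normalize first, since PCA is scale sensitive).
X = StandardScaler().fit_transform(df)
pca = PCA(n_components=0.95).fit(X)
print("columns kept:", list(df.columns))
print("principal components kept:", pca.n_components_)
```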

The results obtained in terms of accuracy were:

dimensionality_reduction

Some insights regarding the results:

  • Despite its mathematical robustness, PCA produced a less satisfactory result than simpler attribute selection methods. This may indicate that the method does not cope so well with databases containing inconsistencies.
  • The low variance and missing values filters are absolutely simple techniques, yet they matched the results of algorithmically more complex techniques such as Random Forests.
  • Forward feature construction and backward feature elimination showed lower performance and are prohibitive in terms of processing.
  • Basic statistics is still a great tool for any data miner, helping not only to reduce time cost (processing) but also space cost (storage).

The study methodology can be found below.

knime_seventechniquesdatadimreduction


Resampling methods for error estimation in machine learning

These slides from Tanagra show, in a very didactic way, how error estimates are computed using resampling methods such as Cross-Validation, Bootstrap, and Leave-One-Out (LOO) Cross-Validation.

resampling_evaluation
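
A minimal sketch of the three schemes covered by the slides, using scikit-learn (the model and dataset are placeholders):

```python
# Minimal sketch of the three resampling schemes named above.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# k-fold Cross-Validation
cv_scores = cross_val_score(model, X, y,
                            cv=KFold(10, shuffle=True, random_state=0))
print("10-fold CV error:", 1 - cv_scores.mean())

# Leave-One-Out Cross-Validation
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOO CV error:", 1 - loo_scores.mean())

# Bootstrap: train on a resampled set, test on the out-of-bag rows.
rng = np.random.default_rng(0)
idx = rng.integers(0, len(X), len(X))
oob = np.setdiff1d(np.arange(len(X)), idx)
model.fit(X[idx], y[idx])
print("Bootstrap OOB error:", 1 - model.score(X[oob], y[oob]))
```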


Probabilistic time series forecasting

Abstract

A large body of the forecasting literature so far has been focused on forecasting the conditional mean of future observations. However, there is an increasing need for generating the entire conditional distribution of future observations in order to effectively quantify the uncertainty in time series data. We present two different methods for probabilistic time series forecasting that allow the inclusion of a possibly large set of exogenous variables. One method is based on forecasting both the conditional mean and variance of the future distribution using a traditional regression approach. The other directly computes multiple quantiles of the future distribution using quantile regression. We propose an implementation for the two methods based on boosted additive models, which enjoy many useful properties including accuracy, flexibility, interpretability and automatic variable selection. We conduct extensive experiments using electricity smart meter data, on both aggregated and disaggregated scales, to compare the two forecasting methods for the challenging problem of forecasting the distribution of future electricity consumption. The empirical results demonstrate that the mean and variance forecasting provides better forecasts for aggregated demand, while the flexibility of the quantile regression approach is more suitable for disaggregated demand. These results are particularly useful since more energy data will become available at the disaggregated level in the future.

Probabilistic time series forecasting with boosted additive models
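
The quantile-regression route described in the abstract can be sketched with gradient boosted trees in scikit-learn; this is a generic illustration on synthetic data, not the paper's boosted additive model implementation:

```python
# Minimal sketch: forecasting several quantiles of a future distribution
# with boosted trees; quantiles, hyperparameters and data are placeholders.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 500)).reshape(-1, 1)
y = np.sin(X).ravel() + rng.normal(scale=0.3 * (1 + X.ravel() / 10))

# One model per quantile of the conditional distribution.
quantiles = [0.1, 0.5, 0.9]
preds = {}
for q in quantiles:
    gbr = GradientBoostingRegressor(loss="quantile", alpha=q,
                                    n_estimators=200, random_state=0)
    preds[q] = gbr.fit(X, y).predict(X)

# The 10%-90% band quantifies the uncertainty around the median forecast.
print({q: round(float(p.mean()), 3) for q, p in preds.items()})
```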


Neural Networks and Deep Neural Networks

Since the topic is so hot, there is even an online book about it (with code and everything):

Neural Networks and Deep Learning is a free online book. The book will teach you about:

Neural networks, a beautiful biologically-inspired programming paradigm which enables a computer to learn from observational data

Deep learning, a powerful set of techniques for learning in neural networks

Neural networks and deep learning currently provide the best solutions to many problems in image recognition, speech recognition, and natural language processing. This book will teach you the core concepts behind neural networks and deep learning.

The book is currently an incomplete beta draft. More chapters will be added over the coming months. For now, you can:

Read Chapter 1, which explains how neural networks can learn to recognize handwriting

Read Chapter 2, which explains backpropagation, the most important algorithm used to learn in neural networks.

Read Chapter 3, which explains many techniques which can be used to improve the performance of backpropagation.

Read Chapter 4, which explains why neural networks can compute any function.

Learn more about the approach taken in this book
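
As a taste of what Chapter 2 covers, here is a minimal numpy sketch of backpropagation on the XOR problem; the network size and learning rate are arbitrary choices, not the book's code:

```python
# Minimal backpropagation sketch: a tiny 2-4-1 network learning XOR.
# Hyperparameters (4 hidden units, learning rate 0.5) are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1, b1 = rng.normal(size=(2, 4)), np.zeros((1, 4))
W2, b2 = rng.normal(size=(4, 1)), np.zeros((1, 1))
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for _ in range(10_000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the error gradient layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out); b2 -= 0.5 * d_out.sum(0, keepdims=True)
    W1 -= 0.5 * (X.T @ d_h);   b1 -= 0.5 * d_h.sum(0, keepdims=True)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```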


Data Scientists don't scale!

This HBR article argues that natural language is the last frontier for truly scaling what they call data science, and it also shows that 'manual' data scientists exist in a work arrangement that simply does not scale.

It also throws a little more wood on the fire of the hating (here, here, here, and here) that has been taking over the analytics/data mining/data science community regarding GUI-based analysis tools and the new analytics workhorses such as Amazon, Google and Microsoft.

Much of what the article says has a lot to do with the old but excellent Continental Airlines article which, extending Richard Hackathorn's work, places the types of latency in the decision-action context:

Action-Latency

Everything starts with the business event, which can be a sale or any transaction with monetary value for the company in question. From the business event onward, data latency begins: the time required to capture, transform and cleanse the data from some transactional system and store it in the DW, which leads to the second point on the action timeline, the stored data.

Once the data is stored, analysis latency begins: the time spent analyzing and disseminating the results of the analysis to the appropriate people, at the end of which we have what is called delivered information. After the information reaches the right people, decision latency begins: the time for the decision maker to understand the context and the situation, build an action plan and start the set of tasks listed in the plan.

In the current scenario, in which the data storage problem is almost solved by new technologies, it can be said that data latency is definitively solved (and can be scaled with money), which leaves analysis and decision latency.

Much of what is presented as Data Science is not directly tied to business questions in which, most of the time, time is the most decisive variable. That is, the X axis of the chart is extremely compressed.

As a result, much of what gets done is an optimal solution to a problem that often should already have been solved, or worse: the solution took so long that the organization lost the timing to solve the problem. That can mean anything from a lost opportunity (e.g. opportunity cost) to millions of reais/dollars (e.g. lost revenue that could have been secured using the data intelligence asset).

And this is the point: most corporations do not need the perfect solution, but rather the solution that answers a business question within a pre-established time limit; and it is in this context that Data Mining suites and GUI tools come in to solve, or help solve, this problem.

Moreover, as Julia Evans put it, understanding the problem is often as important as, or more important than, the solution itself.

Thus, within this scenario the HBR piece is right: data scientists do not scale, for two reasons: (i) although intelligence is scalable, the human agent (the cognitive piece of the process) does not scale (not in industrial terms, as the article puts it), and (ii) the solutions are constrained to a finite, short time window.


Applications of Deep Learning and challenges in Big Data Analytics

An interesting thing about this article is that it is one of the few with a Deep Learning strategy based not on algorithms but on semantic indexing.

Abstract

Big Data Analytics and Deep Learning are two high-focus of data science. Big Data has become important as many organizations both public and private have been collecting massive amounts of domain-specific information, which can contain useful information about problems such as national intelligence, cyber security, fraud detection, marketing, and medical informatics. Companies such as Google and Microsoft are analyzing large volumes of data for business analysis and decisions, impacting existing and future technology. Deep Learning algorithms extract high-level, complex abstractions as data representations through a hierarchical learning process. Complex abstractions are learnt at a given level based on relatively simpler abstractions formulated in the preceding level in the hierarchy. A key benefit of Deep Learning is the analysis and learning of massive amounts of unsupervised data, making it a valuable tool for Big Data Analytics where raw data is largely unlabeled and un-categorized. In the present study, we explore how Deep Learning can be utilized for addressing some important problems in Big Data Analytics, including extracting complex patterns from massive volumes of data, semantic indexing, data tagging, fast information retrieval, and simplifying discriminative tasks. We also investigate some aspects of Deep Learning research that need further exploration to incorporate specific challenges introduced by Big Data Analytics, including streaming data, high-dimensional data, scalability of models, and distributed computing. We conclude by presenting insights into relevant future works by posing some questions, including defining data sampling criteria, domain adaptation modeling, defining criteria for obtaining useful data abstractions, improving semantic indexing, semi-supervised learning, and active learning.

s40537-014-0007-7


Using Amazon Machine Learning to predict the weather

I still have to prepare a post about these workhorses produced by Amazon, Microsoft, and Google; but in the meantime, here is the read.


Cross-Validation, and the estimate of the estimate

The excellent post on Andrew Gelman's blog titled “Cross-Validation != Magic” contains a very important observation about the definition of Cross-Validation:

“2. Cross-validation is a funny thing. When people tune their models using cross-validation they sometimes think that because it’s an optimum that it’s the best. Two things I like to say, in an attempt to shake people out of this attitude:

(a) The cross-validation estimate is itself a statistic, i.e. it is a function of data, it has a standard error etc.

(b) We have a sample and we’re interested in a population. Cross-validation tells us what performs best on the sample, or maybe on the hold-out sample, but our goal is to use what works best on the population. A cross-validation estimate might have good statistical properties for the goal of prediction for the population, or maybe it won’t.

Just cos it’s “cross-validation,” that doesn’t necessarily make it a good estimate. An estimate is an estimate, and it can and should be evaluated based on its statistical properties. We can accept cross-validation as a useful heuristic for estimation (just as Bayes is another useful heuristic) without buying into it as necessarily best.”

Cross-Validation is quite a thorny subject when it comes to sampling and/or model estimation, since there are many opinions both for and against it.

Knowing each sampling method, its mathematical/statistical properties, its advantages and disadvantages and, above all, its limitations is rule number 1 for any data miner.

Personally, I see Cross-Validation as an excellent method when the data universe is limited (few training instances), or even when I run validations with the usual 80-10-10 sampling split; but that is more a working heuristic than a rule properly speaking.
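
Gelman's point (a) is easy to demonstrate: repeat cross-validation under different shuffles and the "optimal" estimate moves around. A minimal sketch (model and dataset are placeholders):

```python
# Minimal sketch of point (a) above: the cross-validation estimate is itself
# a statistic with sampling variability. Model and dataset are placeholders.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Repeat 10-fold CV with different shuffles and look at the spread.
estimates = [
    cross_val_score(model, X, y,
                    cv=KFold(10, shuffle=True, random_state=seed)).mean()
    for seed in range(20)
]
print("CV accuracy: %.4f +/- %.4f" % (np.mean(estimates), np.std(estimates)))
```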


91 questions for data scientists

Via Data Science Central.

  1. What is the biggest data set that you processed, and how did you process it, what were the results?
  2. Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
  3. What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
  4. What is: collaborative filtering, n-grams, map reduce, cosine distance?
  5. How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
  6. How would you come up with a solution to identify plagiarism?
  7. How to detect individual paid accounts shared by multiple users?
  8. Should click data be handled in real time? Why? In which contexts?
  9. What is better: good data or good models? And how do you define “good”? Is there a universal good model? Are there any models that are definitely not so good?
  10. What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation? 
  11. How do you handle missing data? What imputation techniques do you recommend?
  12. What is your favorite programming language / vendor? why?
  13. Tell me 3 things positive and 3 things negative about your favorite statistical software.
  14. Compare SAS, R, Python, Perl
  15. What is the curse of big data?
  16. Have you been involved in database design and data modeling?
  17. Have you been involved in dashboard creation and metric selection? What do you think about Birt?
  18. What features of Teradata do you like?
  19. You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)
  20. Toad or Brio or any other similar clients are quite inefficient to query Oracle databases. Why? How would you do to increase speed by a factor 10, and be able to handle far bigger outputs? 
  21. How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
  22. What are hash table collisions? How are they avoided? How frequently do they happen?
  23. How to make sure a mapreduce application has good load balance? What is load balance?
  24. Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC’s solution offering an hybrid approach – both internal and external cloud – to mitigate the risks and offer other advantages (which ones)?
  25. Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
  26. Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
  27. Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
  28. What is star schema? Lookup tables? 
  29. Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it’s very interactive)
  30. Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
  31. Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
  32. Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
  33. What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
  34. Do you think 50 small decision trees are better than a large one? Why?
  35. Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
  36. Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?
  37. Why is mean square error a bad measure of model performance? What would you suggest instead?
  38. How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
  39. What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
  40. Compare logistic regression w. decision trees, neural networks. How have these technologies been vastly improved over the last 15 years?
  41. Do you know / used data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or sample?
  42. How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
  43. Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
  44. What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
  45. How would you define and measure the predictive power of a metric?
  46. How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set – the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
  47. How to create a keyword taxonomy?
  48. What is a Botnet? How can it be detected?
  49. Any experience with using API’s? Programming API’s? Google or Amazon API’s? AaaS (Analytics as a service)?
  50. When is it better to write your own code than using a data science software package?
  51. Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?
  52. What is POC (proof of concept)?
  53. What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
  54. Are you familiar with software life cycle? With IT project life cycle – from gathering requests to maintenance? 
  55. What is a cron job? 
  56. Are you a lone coder? A production guy (developer)? Or a designer (architect)?
  57. Is it better to have too many false positives, or too many false negatives?
  58. Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples. 
  59. How does Zillow’s algorithm work? (to estimate the value of any home in US)
  60. How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
  61. How would you create a new anonymous digital currency?
  62. Have you ever thought about creating a startup? Around which idea / concept?
  63. Do you think that typed login / password will disappear? How could they be replaced?
  64. Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
  65. Which data scientists do you admire most? which startups?
  66. How did you become interested in data science?
  67. What is an efficiency curve? What are its drawbacks, and how can they be overcome?
  68. What is a recommendation engine? How does it work?
  69. What is an exact test? How and when can simulations help us when we do not use an exact test?
  70. What do you think makes a good data scientist?
  71. Do you think data science is an art or a science?
  72. What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points – each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
  73. Give a few examples of “best practices” in data science.
  74. What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
  75. Do you know a few “rules of thumb” used in statistical or computer science? Or in business analytics?
  76. What are your top 5 predictions for the next 20 years?
  77. How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
  78. Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
  79. You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations? 
  80. More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate’s ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).  
  81. How many “useful” votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location – two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as “Stop Forum Spam”. Create honeypot to catch fraudsters.  Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users).  Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
  82. What did you do today? Or what did you do this week / last week?
  83. What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
  84. What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
  85. What/when/where is the last data science blog post you wrote? 
  86. In your opinion, what is data science? Machine learning? Data mining?
  87. Who are the best people you recruited and where are they today?
  88. Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
  89. What’s wrong with this picture?
  90. Should removing stop words be Step 1 rather than Step 3, in the search engine algorithm described here? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1. 
  91. Experimental design and a bit of computer science with Lego’s

The top 10 worst graphs

This shows that substance without presentation makes no sense at all.

1. Roeder K (1994) DNA fingerprinting: A review of the controversy (with discussion). Statistical Science 9:222-278, Figure 4
[The article | The figure | Discussion]
2. Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. American Journal of Human Genetics 76:967-986, Figure 1
[The article | Fig 1AB | Fig 1CD | Discussion]
3. Epstein MP, Satten GA (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73:1316-1329, Figure 1
[The article | The figure | Discussion]
4. Mykland P, Tierney L, Yu B (1995) Regeneration in Markov chain samplers. Journal of the American Statistical Association 90:233-241, Figure 1
[The article | The figure | Discussion]
5. Hummer BT, Li XL, Hassel BA (2001) Role for p53 in gene induction by double-stranded RNA. J Virol 75:7774-7777, Figure 4
[The article | The figure | Discussion]
6. Cawley S, et al. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116:499-509, Figure 1
[The article | The figure | Discussion]
7. Kim OY, et al. (2012) Higher levels of serum triglyceride and dietary carbohydrate intake are associated with smaller LDL particle size in healthy Korean women. Nutrition Research and Practice 6:120-125, Figure 1
[The article | The figure | Discussion]
8. Jorgenson E, et al. (2005) Ethnicity and human genetic linkage maps. American Journal of Human Genetics 76:276-290, Figure 2
[The article | Figure 2a | Figure 2b | Discussion]
9. Cotter DJ, et al. (2004) Hematocrit was not validated as a surrogate endpoint for survival among epoetin-treated hemodialysis patients. Journal of Clinical Epidemiology 57:1086-1095, Figure 2
[The article | The figure | Discussion]
10. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL (1998) Comprehensive human genetic maps: Individual and sex-specific variation in recombination. American Journal of Human Genetics 63:861-869, Figure 1
[The article | The figure | Discussion]


DL-Learner – A Machine Learning Framework

The DL-Learner software learns concepts in Description Logics (DLs) from examples. Equivalently, it can be used to learn classes in OWL ontologies from selected objects. It extends Inductive Logic Programming to Description Logics and the Semantic Web. The goal of DL-Learner is to provide a DL/OWL based machine learning tool to solve supervised learning tasks and support knowledge engineers in constructing knowledge and learning about the data they created.

Purposes of Class Expression Learning

  1. Learn Definitions for Classes: Based on existing instances of an OWL class, DL-Learner can make suggestions for class definitions to be included as an owl:equivalentClass or rdfs:subClassOf Axiom. As the algorithm is biased towards short and human readable definitions, a knowledge engineer can be supported when editing the TBox of an ontology (see Protege Plugin).
  2. Find similar instances: DL-Learner’s suggested class expressions can be used to find similar instances via retrieval (Concept definitions as search). Scalable methods allow the generation of recommendations on the fly, e.g. in a web scenario (see DBpedia Navigator – in experimental stage).
  3. Classify instances: The learned class descriptions can be used in a typical classification scenario, i.e. to decide for unknown instances whether they belong to a certain class. Common ILP benchmarks have been tested with DL-Learner. On the Carcinogenesis page, DL-Learner competes with other state-of-the-art ILP algorithms.

Example

  1. Instance Classification: A user maintains a list of favorites. Based on these favorites, OWL Concepts are learned with DL-Learner and presented to the user in Natural Language. Such a concept could be all articles about proteins that are written by researchers from Germany (e.g. in Manchester syntax: Proteins and hasAuthor some (Person and hasLocation some Germany)). New articles, which fall in this category and are added to the knowledge base, are presented to the user automatically, like a customized RSS feed.
  2. Protégé: In a family ontology, a Protégé user wants to create a definition for the Concept ‘Father’. He / She already asserted some instances to the class Father. Now, the DL-Learner Protege plugin presents the definition (in Manchester OWL syntax): Male and hasChild some Thing.

Implementation

The application is written in Java. A user manual can be found here. There is also an overview, a page about its architecture, and a feature list. DL-Learner is available as open source on SourceForge.

It has different learning algorithms, which offer several parameters for fine-tuning. It can solve four closely related learning problems: learning based on positive and negative examples, positive only learning, and learning definitions and subclass relationships in ontologies.

Scalability

As reasoning is heavily used by DL-Learner's algorithms, special methods were introduced to increase performance:

  1. Fast Instance Checker is a reasoning component that is custom tailored to the needs of DL-Learner. After an initial reasoning step on the basis of Pellet, results are pre-calculated and cached. Besides the significant performance boost, the component can optionally apply a form of closed world reasoning, which allows learning expressions like forall and max/min cardinality. It is an approximate reasoning method, where the usual rare cases of incomplete reasoning results are justified by a huge increase in performance.
  2. DL-Learner can also provide class suggestions for very large knowledge bases, since it uses local fragment reasoning, i.e. only the relevant part (which is small) is used for learning new classes. This enables class learning in real time on knowledge bases like DBpedia. More information can be found here.