- What is the biggest data set that you processed, and how did you process it, what were the results?
- Tell me two success stories about your analytic or computer science projects? How was lift (or success) measured?
- What is: lift, KPI, robustness, model fitting, design of experiments, 80/20 rule?
- What is: collaborative filtering, n-grams, map reduce, cosine distance?
- How to optimize a web crawler to run much faster, extract better information, and better summarize data to produce cleaner databases?
- How would you come up with a solution to identify plagiarism?
- How to detect individual paid accounts shared by multiple users?
- Should click data be handled in real time? Why? In which contexts?
- What is better: good data or good models? And how do you define “good”? Is there a universal good model? Are there any models that are definitely not so good?
- What is probabilistic merging (AKA fuzzy merging)? Is it easier to handle with SQL or other languages? Which languages would you choose for semi-structured text data reconciliation?
- How do you handle missing data? What imputation techniques do you recommend?
- What is your favorite programming language / vendor? Why?
- Tell me 3 things positive and 3 things negative about your favorite statistical software.
- Compare SAS, R, Python, Perl
- What is the curse of big data?
- Have you been involved in database design and data modeling?
- Have you been involved in dashboard creation and metric selection? What do you think about Birt?
- What features of Teradata do you like?
- You are about to send one million emails (marketing campaign). How do you optimize delivery? How do you optimize response? Can you optimize both separately? (answer: not really)
- Toad, Brio, and similar clients are quite inefficient for querying Oracle databases. Why? What would you do to increase speed by a factor of 10, and to handle far bigger outputs?
- How would you turn unstructured data into structured data? Is it really necessary? Is it OK to store data as flat text files rather than in an SQL-powered RDBMS?
- What are hash table collisions? How are they avoided? How frequently do they happen?
- How to make sure a mapreduce application has good load balance? What is load balance?
- Examples where mapreduce does not work? Examples where it works very well? What are the security issues involved with the cloud? What do you think of EMC’s solution offering a hybrid approach – both internal and external cloud – to mitigate the risks and offer other advantages (which ones)?
- Is it better to have 100 small hash tables or one big hash table, in memory, in terms of access speed (assuming both fit within RAM)? What do you think about in-database analytics?
- Why is naive Bayes so bad? How would you improve a spam detection algorithm that uses naive Bayes?
- Have you been working with white lists? Positive rules? (In the context of fraud or spam detection)
- What is star schema? Lookup tables?
- Can you perform logistic regression with Excel? (yes) How? (use linest on log-transformed data)? Would the result be good? (Excel has numerical issues, but it’s very interactive)
- Have you optimized code or algorithms for speed: in SQL, Perl, C++, Python etc. How, and by how much?
- Is it better to spend 5 days developing a 90% accurate solution, or 10 days for 100% accuracy? Depends on the context?
- Define: quality assurance, six sigma, design of experiments. Give examples of good and bad designs of experiments.
- What are the drawbacks of general linear model? Are you familiar with alternatives (Lasso, ridge regression, boosted trees)?
- Do you think 50 small decision trees are better than a large one? Why?
- Is actuarial science not a branch of statistics (survival analysis)? If not, how so?
- Give examples of data that does not have a Gaussian distribution, nor log-normal. Give examples of data that has a very chaotic distribution?
- Why is mean square error a bad measure of model performance? What would you suggest instead?
- How can you prove that one improvement you’ve brought to an algorithm is really an improvement over not doing anything? Are you familiar with A/B testing?
- What is sensitivity analysis? Is it better to have low sensitivity (that is, great robustness) and low predictive power, or the other way around? How to perform good cross-validation? What do you think about the idea of injecting noise in your data set to test the sensitivity of your models?
- Compare logistic regression with decision trees and neural networks. How have these technologies been vastly improved over the last 15 years?
- Do you know, or have you used, data reduction techniques other than PCA? What do you think of step-wise regression? What kind of step-wise techniques are you familiar with? When is full data better than reduced data or a sample?
- How would you build non parametric confidence intervals, e.g. for scores? (see the AnalyticBridge theorem)
- Are you familiar either with extreme value theory, monte carlo simulations or mathematical statistics (or anything else) to correctly estimate the chance of a very rare event?
- What is root cause analysis? How to identify a cause vs. a correlation? Give examples.
- How would you define and measure the predictive power of a metric?
- How to detect the best rule set for a fraud detection scoring technology? How do you deal with rule redundancy, rule discovery, and the combinatorial nature of the problem (for finding optimum rule set – the one with best predictive power)? Can an approximate solution to the rule set problem be OK? How would you find an OK approximate solution? How would you decide it is good enough and stop looking for a better one?
- How to create a keyword taxonomy?
- What is a Botnet? How can it be detected?
- Any experience with using APIs? Programming APIs? Google or Amazon APIs? AaaS (Analytics as a service)?
- When is it better to write your own code than using a data science software package?
- Which tools do you use for visualization? What do you think of Tableau? R? SAS? (for graphs). How to efficiently represent 5 dimensions in a chart (or in a video)?
- What is POC (proof of concept)?
- What types of clients have you been working with: internal, external, sales / finance / marketing / IT people? Consulting experience? Dealing with vendors, including vendor selection and testing?
- Are you familiar with software life cycle? With IT project life cycle – from gathering requests to maintenance?
- What is a cron job?
- Are you a lone coder? A production guy (developer)? Or a designer (architect)?
- Is it better to have too many false positives, or too many false negatives?
- Are you familiar with pricing optimization, price elasticity, inventory management, competitive intelligence? Give examples.
- How does Zillow’s algorithm work? (to estimate the value of any home in US)
- How to detect bogus reviews, or bogus Facebook accounts used for bad purposes?
- How would you create a new anonymous digital currency?
- Have you ever thought about creating a startup? Around which idea / concept?
- Do you think that typed login / password will disappear? How could they be replaced?
- Have you used time series models? Cross-correlations with time lags? Correlograms? Spectral analysis? Signal processing and filtering techniques? In which context?
- Which data scientists do you admire most? Which startups?
- How did you become interested in data science?
- What is an efficiency curve? What are its drawbacks, and how can they be overcome?
- What is a recommendation engine? How does it work?
- What is an exact test? How and when can simulations help us when we do not use an exact test?
- What do you think makes a good data scientist?
- Do you think data science is an art or a science?
- What is the computational complexity of a good, fast clustering algorithm? What is a good clustering algorithm? How do you determine the number of clusters? How would you perform clustering on one million unique keywords, assuming you have 10 million data points – each one consisting of two keywords, and a metric measuring how similar these two keywords are? How would you create this 10 million data points table in the first place?
- Give a few examples of “best practices” in data science.
- What could make a chart misleading, difficult to read or interpret? What features should a useful chart have?
- Do you know a few “rules of thumb” used in statistical or computer science? Or in business analytics?
- What are your top 5 predictions for the next 20 years?
- How do you immediately know when statistics published in an article (e.g. newspaper) are either wrong or presented to support the author’s point of view, rather than correct, comprehensive factual information on a specific subject? For instance, what do you think about the official monthly unemployment statistics regularly discussed in the press? What could make them more accurate?
- Testing your analytic intuition: look at these three charts. Two of them exhibit patterns. Which ones? Do you know that these charts are called scatter-plots? Are there other ways to visually represent this type of data?
- You design a robust non-parametric statistic (metric) to replace correlation or R square, that (1) is independent of sample size, (2) always between -1 and +1, and (3) based on rank statistics. How do you normalize for sample size? Write an algorithm that computes all permutations of n elements. How do you sample permutations (that is, generate tons of random permutations) when n is large, to estimate the asymptotic distribution for your newly created metric? You may use this asymptotic distribution for normalizing your metric. Do you think that an exact theoretical distribution might exist, and therefore, we should find it, and use it rather than wasting our time trying to estimate the asymptotic distribution using simulations?
- More difficult, technical question related to previous one. There is an obvious one-to-one correspondence between permutations of n elements and integers between 1 and n! Design an algorithm that encodes an integer less than n! as a permutation of n elements. What would be the reverse algorithm, used to decode a permutation and transform it back into a number? Hint: An intermediate step is to use the factorial number system representation of an integer. Feel free to check this reference online to answer the question. Even better, feel free to browse the web to find the full answer to the question (this will test the candidate’s ability to quickly search online and find a solution to a problem without spending hours reinventing the wheel).
- How many “useful” votes will a Yelp review receive? My answer: Eliminate bogus accounts (read this article), or competitor reviews (how to detect them: use taxonomy to classify users, and location – two Italian restaurants in same Zip code could badmouth each other and write great comments for themselves). Detect fake likes: some companies (e.g. FanMeNow.com) will charge you to produce fake accounts and fake likes. Eliminate prolific users who like everything, those who hate everything. Have a blacklist of keywords to filter fake reviews. See if IP address or IP block of reviewer is in a blacklist such as “Stop Forum Spam”. Create honeypot to catch fraudsters. Also watch out for disgruntled employees badmouthing their former employer. Watch out for 2 or 3 similar comments posted the same day by 3 users regarding a company that receives very few reviews. Is it a brand new company? Add more weight to trusted users (create a category of trusted users). Flag all reviews that are identical (or nearly identical) and come from same IP address or same user. Create a metric to measure distance between two pieces of text (reviews). Create a review or reviewer taxonomy. Use hidden decision trees to rate or score review and reviewers.
- What did you do today? Or what did you do this week / last week?
- What/when is the latest data mining book / article you read? What/when is the latest data mining conference / webinar / class / workshop / training you attended? What/when is the most recent programming skill that you acquired?
- What are your favorite data science websites? Who do you admire most in the data science community, and why? Which company do you admire most?
- What/when/where is the last data science blog post you wrote?
- In your opinion, what is data science? Machine learning? Data mining?
- Who are the best people you recruited and where are they today?
- Can you estimate and forecast sales for any book, based on Amazon public data? Hint: read this article.
- What’s wrong with this picture?
- Should removing stop words be Step 1 rather than Step 3, in the search engine algorithm described here? Answer: Have you thought about the fact that mine and yours could also be stop words? So in a bad implementation, data mining would become data mine after stemming, then data. In practice, you remove stop words before stemming. So Step 3 should indeed become step 1.
- Experimental design and a bit of computer science with Lego’s
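For the permutation-encoding question above (the one hinting at the factorial number system), here is a minimal sketch of one possible answer. It assumes integers are numbered from 0 to n! - 1 rather than 1 to n!, and the function names `int_to_perm` and `perm_to_int` are illustrative, not taken from any referenced article.

```python
from math import factorial


def int_to_perm(k, n):
    """Encode integer k (0 <= k < n!) as a permutation of 0..n-1.

    The digits of k in the factorial number system select, one by one,
    which of the remaining items comes next.
    """
    if not 0 <= k < factorial(n):
        raise ValueError("k must satisfy 0 <= k < n!")
    items = list(range(n))
    perm = []
    for i in range(n - 1, -1, -1):
        # digit is the coefficient of i! in the factorial representation
        digit, k = divmod(k, factorial(i))
        perm.append(items.pop(digit))
    return perm


def perm_to_int(perm):
    """Decode a permutation of 0..n-1 back into its integer index."""
    n = len(perm)
    items = list(range(n))
    k = 0
    for i, p in enumerate(perm):
        # position of p among the items not yet used = factorial digit
        digit = items.index(p)
        k += digit * factorial(n - 1 - i)
        items.pop(digit)
    return k
```

Both directions cost O(n²) here because of the list `pop` and `index` calls, which is fine for moderate n. For sampling random permutations when n is large, shuffling with `random.shuffle` is simpler than drawing a random integer below n! and decoding it.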
1. Roeder K (1994) DNA fingerprinting: A review of the controversy (with discussion). Statistical Science 9:222-278, Figure 4
   [The article | The figure | Discussion]
2. Wittke-Thompson JK, Pluzhnikov A, Cox NJ (2005) Rational inferences about departures from Hardy-Weinberg equilibrium. American Journal of Human Genetics 76:967-986, Figure 1
   [The article | Fig 1AB | Fig 1CD | Discussion]
3. Epstein MP, Satten GA (2003) Inference on haplotype effects in case-control studies using unphased genotype data. American Journal of Human Genetics 73:1316-1329, Figure 1
   [The article | The figure | Discussion]
4. Mykland P, Tierney L, Yu B (1995) Regeneration in Markov chain samplers. Journal of the American Statistical Association 90:233-241, Figure 1
   [The article | The figure | Discussion]
5. Hummer BT, Li XL, Hassel BA (2001) Role for p53 in gene induction by double-stranded RNA. J Virol 75:7774-7777, Figure 4
   [The article | The figure | Discussion]
6. Cawley S, et al. (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116:499-509, Figure 1
   [The article | The figure | Discussion]
7. Kim OY, et al. (2012) Higher levels of serum triglyceride and dietary carbohydrate intake are associated with smaller LDL particle size in healthy Korean women. Nutrition Research and Practice 6:120-125, Figure 1
   [The article | The figure | Discussion]
8. Jorgenson E, et al. (2005) Ethnicity and human genetic linkage maps. American Journal of Human Genetics 76:276-290, Figure 2
   [The article | Figure 2a | Figure 2b | Discussion]
9. Cotter DJ, et al. (2004) Hematocrit was not validated as a surrogate endpoint for survival among epoetin-treated hemodialysis patients. Journal of Clinical Epidemiology 57:1086-1095, Figure 2
   [The article | The figure | Discussion]
10. Broman KW, Murray JC, Sheffield VC, White RL, Weber JL (1998) Comprehensive human genetic maps: Individual and sex-specific variation in recombination. American Journal of Human Genetics 63:861-869, Figure 1
   [The article | The figure | Discussion]
The DL-Learner software learns concepts in Description Logics (DLs) from examples. Equivalently, it can be used to learn classes in OWL ontologies from selected objects. It extends Inductive Logic Programming to Description Logics and the Semantic Web. The goal of DL-Learner is to provide a DL/OWL based machine learning tool to solve supervised learning tasks and support knowledge engineers in constructing knowledge and learning about the data they created.
- Learn Definitions for Classes: Based on existing instances of an OWL class, DL-Learner can make suggestions for class definitions to be included as an owl:equivalentClass or rdfs:subClassOf Axiom. As the algorithm is biased towards short and human readable definitions, a knowledge engineer can be supported when editing the TBox of an ontology (see Protege Plugin).
- Find similar instances: DL-Learner’s suggested class expressions can be used to find similar instances via retrieval (Concept definitions as search). Scalable methods allow the generation of recommendations on the fly, e.g. in a web scenario (see DBpedia Navigator – in experimental stage).
- Classify instances: The learned class descriptions can be used in a typical classification scenario, i.e. to decide for unknown instances whether they belong to a certain class. Common ILP benchmarks have been tested with DL-Learner. On the Carcinogenesis page, DL-Learner competes with other state-of-the-art ILP algorithms.
- Instance Classification: A user maintains a list of favorites. Based on these favorites, OWL Concepts are learned with DL-Learner and presented to the user in Natural Language. Such a concept could be all articles about proteins that are written by researchers from Germany (e.g. in Manchester syntax: Proteins and hasAuthor some (Person and hasLocation some Germany)). New articles, which fall in this category and are added to the knowledge base, are presented to the user automatically, like a customized RSS feed.
- Protégé: In a family ontology, a Protégé user wants to create a definition for the Concept ‘Father’. He / She already asserted some instances to the class Father. Now, the DL-Learner Protege plugin presents the definition (in Manchester OWL syntax): Male and hasChild some Thing.
The application is written in Java. A user manual can be found here. There is also an overview, a page about its architecture, and a feature list. DL-Learner is available as open source at Sourceforge.
It has different learning algorithms, which offer several parameters for fine-tuning. It can solve four closely related learning problems: learning based on positive and negative examples, positive only learning, and learning definitions and subclass relationships in ontologies.
- Fast Instance Checker is a reasoning component that is custom tailored to the needs of DL-Learner. After an initial reasoning step on the basis of Pellet, results are pre-calculated and cached. Besides the significant performance boost, the component can optionally apply a form of closed world reasoning, which makes it possible to learn expressions like forall and max/min cardinality. It is an approximate reasoning method in which, as usual, rare cases of incomplete reasoning results are justified by a huge increase in performance.
- DL-Learner can also provide class suggestions for very large knowledge bases, since it uses local fragment reasoning, i.e. only the relevant part (which is small) is used for learning new classes. This enables class learning in real time on knowledge bases like DBpedia. More information can be found here.
R is a free, freely distributed language and environment for statistical analysis. Although it runs from the command line, which at first looks like an obstacle to the beginner, it is widely used. It makes it possible not only to adapt the system to the specific needs of each research project, through the creation of add-ons, but also to automate routine analyses involving large amounts of data.
Since I intend to present some possibilities for its use here, as I did in the article about John Snow’s map, here is a brief tutorial for installing it. To make it easier to use, I also show how to install RStudio, a graphical interface created to make using R friendlier.
Given how easy installation is, I see this article more as an encouragement to those who have not yet ventured in than anything else.
[…]Deep Learning methods use a composition of multiple non-linear transformations to model high-level abstractions in data. Multi-layer feed-forward artificial neural networks are some of the oldest and yet most useful such techniques. We are now reaping the benefits of over 60 years of evolution in Deep Learning that began in the late 1950s when the term Machine Learning was coined. Large parts of the growing success of Deep Learning in the past decade can be attributed to Moore’s law and the exponential speedup of computers, but there were also many algorithmic breakthroughs that enabled robust training of deep learners.
Compared to more interpretable Machine Learning techniques such as tree-based methods, conventional Deep Learning (using stochastic gradient descent and back-propagation) is a rather “brute-force” method that optimizes lots of coefficients (it is a parametric method) starting from random noise by continuously looking at examples from the training data. It follows the basic idea of “(good) practice makes perfect” (similar to a real brain) without any strong guarantees on the quality of the model. […]
In this excerpt he discusses some applications of Deep Learning:
[…]Deep Learning is really effective at learning non-linear derived features from the raw input features, unlike standard Machine Learning methods such as linear or tree-based methods. For example, if age and income are the two features used to predict spending, then a linear model would greatly benefit from manually splitting age and income ranges into distinct groups; while a tree-based model would learn to automatically dissect the two-dimensional space.
A Deep Learning model builds hierarchies of (hidden) derived non-linear features that get composed to approximate arbitrary functions such as sqrt((age-40)^2+0.3*log(income+1)-4) with much less effort than with other methods. Traditionally, data scientists perform many of these transformations explicitly based on domain knowledge and experience, but Deep Learning has been shown to be extremely effective at coming up with those transformations, often outperforming standard Machine Learning models by a substantial margin.
Deep Learning is also very good at predicting high-cardinality class memberships, such as in image or voice recognition problems, or in predicting the best item to recommend to a user. Another strength of Deep Learning is that it can also be used for unsupervised learning where it just learns the intrinsic structure of the data without making predictions (remember the Google cat?). This is useful in cases where there are no training labels, or for various other use cases such as anomaly detection. […]
Via John D. Cook.
Elementary numerical integration algorithms, such as Gaussian quadrature, are based on polynomial approximations. The method aims to exactly integrate a polynomial that approximates the integrand. But likelihood functions are not approximately polynomial, and they become less like polynomials when they contain more data. They become more like a normal density, asymptotically flat in the tails, something no polynomial can do. With better integration techniques, the integration accuracy will improve with more data rather than degrade.
With more data, the posterior distribution becomes more concentrated. This means that a naive approach to integration might entirely miss the part of the integrand where nearly all the mass is concentrated. You need to make sure your integration method is putting its effort where the action is. Fortunately, it’s easy to estimate where the mode should be.
The third problem is that software calculating the likelihood function can underflow with even a moderate amount of data. The usual solution is to work with the logarithm of the likelihood function, but with numerical integration the solution isn’t quite that simple. You need to integrate the likelihood function itself, not its logarithm. I describe how to deal with this situation in Avoiding underflow in Bayesian computations.
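One common way to integrate a likelihood without underflow is to compute it on the log scale, subtract the maximum log-likelihood before exponentiating, and add that maximum back to the log of the integral. The sketch below illustrates this for a binomial likelihood with a simple grid rule; it is an assumption about a reasonable approach, not necessarily the method described in the linked article, and the function name `log_marginal_likelihood` is hypothetical.

```python
import math


def log_marginal_likelihood(k, n, grid_size=100_000):
    """Log of the integral over (0, 1) of theta^k * (1 - theta)^(n - k).

    Assumes 0 < k < n, so the integrand vanishes at both endpoints and
    only interior grid points contribute to the trapezoid rule.
    """
    thetas = [i / grid_size for i in range(1, grid_size)]
    log_lik = [k * math.log(t) + (n - k) * math.log(1 - t) for t in thetas]
    # Shift so the largest term is exp(0) = 1; tiny terms underflow to 0
    # harmlessly instead of the whole integrand becoming 0.
    shift = max(log_lik)
    total = math.fsum(math.exp(v - shift) for v in log_lik)
    return shift + math.log(total / grid_size)


# The likelihood itself underflows at its mode with only 5000 observations.
naive_value = 0.6 ** 3000 * 0.4 ** 2000

# Closed form for comparison: the integral is the Beta function B(k+1, n-k+1).
exact = math.lgamma(3001) + math.lgamma(2001) - math.lgamma(5002)
approx = log_marginal_likelihood(3000, 5000)
```

A uniform grid is used only to keep the example short. In line with the second point above, a practical integrator would concentrate its evaluation points around the mode.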
- If the goal is prediction accuracy, average many prediction models together. In general, the prediction algorithms that most frequently win Kaggle competitions or the Netflix prize blend multiple models together. The idea is that by averaging (or majority voting) multiple good prediction algorithms you can reduce variability without giving up bias. One of the earliest descriptions of this idea was of a much simplified version based on bootstrapping samples and building multiple prediction functions – a process called bagging (short for bootstrap aggregating). Random forests, another incredibly successful prediction algorithm, is based on a similar idea with classification trees.
- Know what your real sample size is. It can be easy to be tricked by the size of a data set. Imagine you have an image of a simple black circle on a white background stored as pixels. As the resolution increases the size of the data increases, but the amount of information may not (hence vector graphics). Similarly in genomics, the number of reads you measure (which is a main determinant of data size) is not the sample size, it is the number of individuals. In social networks, the number of people in the network may not be the sample size. If the network is very dense, the sample size might be much less. In general the bigger the sample size the better and sample size and data size aren’t always tightly correlated.
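The bagging idea from the first point above can be sketched in a few lines. This toy version assumes one-dimensional inputs and uses single-split regression stumps as the base model instead of full trees; the names `fit_stump` and `bagged_predictor` are made up for illustration.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump and return a predict function."""
    pairs = sorted(zip(xs, ys))
    overall = mean(ys)
    best = (float("inf"), None, overall, overall)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if sse < best[0]:
            best = (sse, (pairs[i - 1][0] + pairs[i][0]) / 2, lm, rm)
    _, threshold, lm, rm = best
    if threshold is None:
        return lambda x: lm  # no valid split: predict the overall mean
    return lambda x: lm if x <= threshold else rm


def bagged_predictor(xs, ys, n_models=50, seed=0):
    """Average stumps fitted on bootstrap resamples of (xs, ys)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean(m(x) for m in models)


# Usage: a noisy-free step function, 0 for x < 10 and 1 otherwise.
xs = list(range(20))
ys = [0.0] * 10 + [1.0] * 10
predict = bagged_predictor(xs, ys)
```

Each stump sees a slightly different resample, so the split points vary; averaging them smooths the prediction near the step, which is where the variance reduction comes from.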
…this file (probably the largest downloadable file on the web) may be a good place for you to start.
It is a fact that inconsistent data ruins any kind of modeling in Data Mining.
Therefore, BEFORE any data mining experiment, it is always desirable to carry out an exploratory data analysis using descriptive statistics, charts, and hypothesis formulation, in order to define clearly which techniques will be used.
A short guide on how to communicate issues such as Opportunity vs. Risk; Relative Risk vs. Absolute Risk; and Conditional Probability.
This table for translating probabilities into words shows how to express the numbers clearly:
And finally, this is the table on how to express confidence:
In this post by Frank Pasquale, a discussion is opened on the regulation of data mining, specifically the practice of scoring, which in the author’s view is a problem because it can discriminate against people.
The article’s argument is that many forms of data-based decision making render the process less transparent from a social standpoint.
The author states:
Data-driven decision making is usually framed as a way of rewarding high performers and shaming shirkers. But it’s not so simple. Most of us don’t know that we’re being profiled, or, if we do, how the profiling works. We can’t anticipate, for instance, when an apparently innocuous action — such as joining the wrong group on Facebook — will trigger a red flag on some background checker that renders us in effect unemployable. We’ll likely never know what that action was, either, because we aren’t allowed to see our records.
It’s only complaints, investigations and leaks that give us occasional peeks into these black boxes of data mining. But what has emerged is terrifying.
Naturally, just as we’ve lost control of data, a plethora of new services are offering “credit repair” and “reputation optimization.” But can they really help? Credit scoring algorithms are secret, so it’s hard to know whether today’s “fix” will be tomorrow’s total fail. And no private company can save us from the thousands of other firms intent on mashing up whatever data is at hand to score and pigeonhole us. New approaches are needed.
In general, we need what technology law and policy researcher Meg Leta Jones calls “fair automation practices” to complement the “fair data practices” President Barack Obama is proposing. We can’t hope to prevent the collection or creation of inappropriate or inaccurate databases. But we can ensure the use of that data by employers, insurers and other decision makers is made clear to us when we are affected by it.
A statement that, if not naive, borders on malice. Data-based decision making is the most transparent form there is, because it does not follow subjective criteria or biases of any kind, and it puts everyone on an equal footing; not to mention that it is the fairest.
Trusting the data, and even more so understanding the decision context, is the fairest way to decide.
The topic deserves much discussion about guidelines for how this information is handled, but more regulation will not change the fact that people are being scored.