Feature Screening in Large Scale Cluster Analysis

Mais trabalhos sobre clustering.

Feature Screening in Large Scale Cluster Analysis – Trambak Banerjee, Gourab Mukherjee, Peter Radchenko

Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the “noise” features. These bounds imply perfect screening of non-informative features with high probability and are derived via careful analysis of the empirical processes corresponding to the clustering trees that are constructed for each of the features by the associated clustering procedure. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches, popularly used in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.

Conclusions: We propose COSCI, a novel feature screening method for large scale cluster analysis problems that are characterized by both large sample sizes and high dimensionality of the observations. COSCI efficiently ranks the candidate features in a non-parametric fashion and, under mild regularity conditions, is robust to the distributional form of the true noise coordinates. We establish theoretical results supporting ideal feature screening properties of our proposed procedure and provide a data driven approach for selecting the screening threshold parameter. Extensive simulation experiments and real data studies demonstrate encouraging performance of our proposed approach. An interesting topic for future research is extending our marginal screening method by means of utilizing multivariate objective criteria, which are more potent in detecting multivariate cluster information among marginally unimodal features. Preliminary analysis of the corresponding `2 fusion penalty based criterion, which, unlike the `1 based approach used in this paper, is non-separable across dimensions, suggests that this criterion can provide a way to move beyond marginal screening.

Anúncios
Feature Screening in Large Scale Cluster Analysis

Enciclopédia das Distâncias (Michel Deza & Elena Deza)

Para quem está interessado em conhecer mais sobre as distâncias matemáticas (ex: Encludiana, Mahalanobis, ou a Minkowski) esse livro é essencial.

É um compêndio de inúmeras distâncias matemáticas, e além disso contém inúmeras informações de quais distâncias devem ser usadas de acordo com inúmeros contextos.

Enciclopédia das Distâncias (Michel Deza & Elena Deza)

Algoritmos de Clustering para conjuntos de dados massivos

Via Big Data Central.

Potential applications:

  • Creating a keyword taxonomy to categorize the entire universe of cleaned (standardized), valuable English keywords. We are talking of about 10 million keywords made up of one, two or three tokens, that is, about 300 times the number of keywords found in a good English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon on Google (for pay-per-click ad campaigns), to better price them. This is the application discussed in this article.
  • Clustering millions of documents (e.g. books on Amazon.com) or
  • Clustering web pages, or even the entire Internet, which consist of about 100 million top websites – and billions of web pages.
Algoritmos de Clustering para conjuntos de dados massivos

Agrupamento usando Mean Shift

Neste post do Matt Nedrich fica quase impossível não entender as vantagens de se usar o Mean Shift que é uma técnica pouco conhecida e muito útil no caso de clustering em que o cálculo de cada um dos clusters não é possível.

Agrupamento usando Mean Shift

Análise de Whiskies usando K-Means

Uma ótima análise usando K-Means com o R. Mais do que a análise, esse post é uma aula de como proceder com uma análise de cluster usando a determinação arbitrária de clusters como o K-means exige.

Com isso a geração dos resultados e da análise ficam muito mais ‘walk-thru’ e muito menos black-box.

O resultado final?

“[…]The results indicate that there is a lot of variation in flavor profiles within the different scotch whisky regions. Note that initial cluster centers are chosen at random. In order to replicate the results, you will need to run the following code before your analysis.
set.seed(1) Further data analysis would be required to determine whether proximity to types of water sources or terrain types drive common flavor profiles. This could be done by obtaining shape files and adding them as an additional layer to the ggmap plot.
For me, I have identified my next to-try single malt. Talisker is still within the familiar realm of cluster 4 but a little more malty, fruity and spicy. Sounds like the perfect holiday mix. […]”

Análise de Whiskies usando K-Means

Predição de Movimentações Criminais

Um bom artigo sobre a modelagem de eventos criminais e a sua movimentação.

[…] Data available on distance between criminals’ homes and their targets shows that burglars are willing to travel longer distances for high-value targets, and tend to employ different means of transportation to make these long trips. Of course, this tendency differs among types of criminals. Professionals and older criminals may travel further than younger amateurs. A group of professional burglars planning to rob a bank, for instance, would reasonably be expected to follow a Lévy flight.

“There is actually a relationship between how far these criminals are willing to travel for a target and the ability for a hotspot to form,” explain Kolokolnikov and McCalla. […]

Predição de Movimentações Criminais