A good paper to be applied in Brazil – Predicting Public Corruption with Neural Networks

Predicting Public Corruption with Neural Networks: An Analysis of Spanish Provinces

Abstract: We contend that corruption must be detected as soon as possible so that corrective and preventive measures may be taken. Thus, we develop an early warning system based on a neural network approach, specifically self-organizing maps, to predict public corruption based on economic and political factors. Unlike previous research, which is based on the perception of corruption, we use data on actual cases of corruption. We apply the model to Spanish provinces in which actual cases of corruption were reported by the media or went to court between 2000 and 2012. We find that the taxation of real estate, economic growth, the increase in real estate prices, the growing number of deposit institutions and non-financial firms, and the same political party remaining in power for long periods seem to induce public corruption. Our model provides different profiles of corruption risk depending on the economic conditions of a region conditional on the timing of the prediction. Our model also provides different time frameworks to predict corruption up to 3 years before cases are detected.

Concluding Remarks: We develop a model of neural networks to predict public corruption based on economic and political factors. We apply this model to the Spanish provinces in which corruption cases have been uncovered by the media or have gone to trial. Unlike previous research, which is based on the perception of corruption, we use data on actual cases of corruption. The output of our model is a set of SOMs, which allow us to predict corruption in different time scenarios before corruption cases are detected.

Our model provides two main insights. First, we identify some underlying economic and political factors that can result in public corruption. Taxation of real estate, economic growth, increases in real estate prices and in the number of deposit institutions, and the same party remaining in office for a long time seem to induce public corruption. Second, our model provides different time frameworks to predict corruption. In some regions, we are able to detect latent corruption long before it emerges (up to 3 years), and in other regions our model provides short-term alerts and suggests the need to take urgent preventive or corrective measures.

Given the connection we find between economic and political factors and public corruption, some caveats must be applied to our results. Our model does not mean that economic growth or a given party remaining in power causes public corruption but that the fastest growing regions or the ones ruled by the same party for a long time are the most likely to be involved in corruption cases. Economic growth per se is not a sign of corruption, but rather it increases the interactions between economic agents and public officers. Similarly, being in office too long might prove to be an incentive for creating a network of unfair relations between politicians and economic agents. In addition, more competitive markets may induce some agents to pay bribes in order to obtain public concessions or a better competitive position. These results are consistent with some research exploring the relation between economic growth and corruption (Kuo et al. 2002; Kaufman and Rousseeuw 2009; Chen et al. 2002).

Since corruption remains a widespread global concern, a key issue in our research is the generalizability of our model and the proposed actions. We have used fairly common macroeconomic and political variables that are widely available from public sources in many countries. In turn, our model can be applied to other regions and countries as well. Of course, the model could be improved if country- or region-specific factors were taken into account.

Our approach is interesting both for academia and public authorities. For academia, we provide an innovative way to predict public corruption using neural networks. These methods have often been used to predict corporate financial distress and other economic events, but, as far as we are aware, no studies have yet attempted to use neural networks to predict public corruption. Consequently, we extend the domain of neural network application. For public authorities, we provide a model that improves the efficiency of the measures aimed at fighting corruption. Because the resources available to combat corruption are limited, authorities can use the early corruption warning system, which categorizes each province according to its corruption profile, in order to narrow their focus and better implement preventive and corrective policies. In addition, our model predicts corruption cases long before they are discovered, which enhances anticipatory measures. Our model can be especially relevant in countries suffering the severest corruption problems. In fact, European Union authorities are highly concerned about widespread corruption in certain countries. The study of new methodologies based on neural networks is a fertile field to be applied to a number of legal and economic issues.

One possible direction for future research is to extend our model to the international framework and to take into account country-specific factors. Another application may be the detection of patterns of corruption and money laundering across different countries in the European Union.
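
For readers unfamiliar with self-organizing maps (SOMs), the following is a minimal sketch of how one is trained, written in Python with NumPy. It is not the authors' model: the grid size, the decay schedules, the function names (`train_som`, `best_matching_unit`) and the synthetic three-feature data are all assumptions made purely for illustration.

```python
import numpy as np

def best_matching_unit(weights, x):
    """Return the (row, col) of the map unit whose weight vector is closest to x."""
    dist2 = np.sum((weights - x) ** 2, axis=-1)
    return np.unravel_index(np.argmin(dist2), dist2.shape)

def train_som(data, rows=4, cols=4, epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Train a small rectangular SOM (hypothetical settings, illustration only)."""
    rng = np.random.default_rng(seed)
    weights = rng.random((rows, cols, data.shape[1]))
    # Grid coordinates of every unit, used to measure distance on the map itself.
    grid = np.stack(
        np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"), axis=-1
    )
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)              # learning rate decays to zero
            sigma = sigma0 * (1.0 - frac) + 0.5  # neighbourhood radius shrinks
            bmu = best_matching_unit(weights, x)
            grid_dist2 = np.sum((grid - np.array(bmu)) ** 2, axis=-1)
            influence = np.exp(-grid_dist2 / (2.0 * sigma ** 2))
            # Pull the winning unit and its map neighbours towards the sample.
            weights += lr * influence[..., None] * (x - weights)
            step += 1
    return weights

# Synthetic stand-in for two groups of regions with different indicator profiles.
rng = np.random.default_rng(1)
low = rng.normal(0.2, 0.05, size=(50, 3))
high = rng.normal(0.8, 0.05, size=(50, 3))
data = np.vstack([low, high])

weights = train_som(data)
bmu_low = best_matching_unit(weights, low.mean(axis=0))
bmu_high = best_matching_unit(weights, high.mean(axis=0))
print("low-profile group maps to unit", bmu_low)
print("high-profile group maps to unit", bmu_high)
```

After training, observations with similar profiles land on the same or neighbouring map units, which is what makes the map readable as a set of risk profiles.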

Feature Screening in Large Scale Cluster Analysis

More work on clustering.

Feature Screening in Large Scale Cluster Analysis – Trambak Banerjee, Gourab Mukherjee, Peter Radchenko

Abstract: We propose a novel methodology for feature screening in clustering massive datasets, in which both the number of features and the number of observations can potentially be very large. Taking advantage of a fusion penalization based convex clustering criterion, we propose a very fast screening procedure that efficiently discards non-informative features by first computing a clustering score corresponding to the clustering tree constructed for each feature, and then thresholding the resulting values. We provide theoretical support for our approach by establishing uniform non-asymptotic bounds on the clustering scores of the “noise” features. These bounds imply perfect screening of non-informative features with high probability and are derived via careful analysis of the empirical processes corresponding to the clustering trees that are constructed for each of the features by the associated clustering procedure. Through extensive simulation experiments we compare the performance of our proposed method with other screening approaches, popularly used in cluster analysis, and obtain encouraging results. We demonstrate empirically that our method is applicable to cluster analysis of big datasets arising in single-cell gene expression studies.

Conclusions: We propose COSCI, a novel feature screening method for large scale cluster analysis problems that are characterized by both large sample sizes and high dimensionality of the observations. COSCI efficiently ranks the candidate features in a non-parametric fashion and, under mild regularity conditions, is robust to the distributional form of the true noise coordinates. We establish theoretical results supporting ideal feature screening properties of our proposed procedure and provide a data driven approach for selecting the screening threshold parameter. Extensive simulation experiments and real data studies demonstrate encouraging performance of our proposed approach. An interesting topic for future research is extending our marginal screening method by means of utilizing multivariate objective criteria, which are more potent in detecting multivariate cluster information among marginally unimodal features. Preliminary analysis of the corresponding $\ell_2$ fusion penalty based criterion, which, unlike the $\ell_1$ based approach used in this paper, is non-separable across dimensions, suggests that this criterion can provide a way to move beyond marginal screening.

Encyclopedia of Distances (Michel Deza & Elena Deza)

For anyone interested in learning more about mathematical distances (e.g., Euclidean, Mahalanobis, or Minkowski), this book is essential.

It is a compendium of a vast number of mathematical distances, and it also offers extensive guidance on which distances should be used in which contexts.
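
As a quick illustration of the three distances named above, here is a minimal sketch in Python with NumPy. The function names and the sample vectors are assumptions chosen for the example; they are not taken from the book.

```python
import numpy as np

def minkowski(x, y, p):
    """Minkowski distance of order p (p=1 is Manhattan, p=2 is Euclidean)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))

def euclidean(x, y):
    """Euclidean distance, i.e. the Minkowski distance with p=2."""
    return minkowski(x, y, 2)

def mahalanobis(x, y, cov):
    """Mahalanobis distance: Euclidean distance rescaled by a covariance matrix."""
    diff = x - y
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])

print(euclidean(x, y))                        # 3-4-5 triangle, so about 5
print(minkowski(x, y, 1))                     # |3| + |4|, so about 7
print(mahalanobis(x, y, np.eye(2)))           # identity covariance: same as Euclidean
print(mahalanobis(x, y, np.diag([9.0, 16.0])))  # each axis scaled by its variance
```

With an identity covariance the Mahalanobis distance reduces to the Euclidean one; with unequal variances, differences along high-variance axes count for less.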

Clustering algorithms for massive datasets

Via Big Data Central.

Potential applications:

  • Creating a keyword taxonomy to categorize the entire universe of cleaned (standardized), valuable English keywords. We are talking about roughly 10 million keywords made up of one, two or three tokens, that is, about 300 times the number of keywords found in a good English dictionary. The purpose might be to categorize all bid keywords that could be purchased by eBay and Amazon on Google (for pay-per-click ad campaigns), to better price them. This is the application discussed in this article.
  • Clustering millions of documents (e.g. books on Amazon.com).
  • Clustering web pages, or even the entire Internet, which consists of about 100 million top websites and billions of web pages.

Clustering using Mean Shift

After reading this post by Matt Nedrich, it is almost impossible not to understand the advantages of Mean Shift, a little-known but very useful technique for clustering problems in which the number of clusters cannot be determined in advance.
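
To make the idea concrete, here is a minimal mean shift sketch in Python with NumPy. It is not the code from Matt Nedrich's post: the Gaussian kernel, the `bandwidth` and `merge_tol` values, and the two synthetic blobs are assumptions chosen for illustration. Note that the number of clusters is never passed in; it falls out of where the points converge.

```python
import numpy as np

def mean_shift(points, bandwidth=1.0, n_iter=50, merge_tol=0.5):
    """Shift every point towards the local density peak, then group by peak."""
    shifted = points.copy()
    for _ in range(n_iter):
        for i, p in enumerate(shifted):
            # Gaussian kernel weights of all original points around p.
            d2 = np.sum((points - p) ** 2, axis=1)
            w = np.exp(-d2 / (2.0 * bandwidth ** 2))
            # Move p to the weighted mean of its neighbourhood.
            shifted[i] = (w[:, None] * points).sum(axis=0) / w.sum()
    # Points that converged to (almost) the same location share a cluster.
    modes = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(shifted):
        for k, m in enumerate(modes):
            if np.linalg.norm(p - m) < merge_tol:
                labels[i] = k
                break
        else:
            modes.append(p.copy())
            labels[i] = len(modes) - 1
    return np.array(modes), labels

# Two synthetic blobs; the algorithm is not told that there are two.
rng = np.random.default_rng(0)
blob_a = rng.normal([0.0, 0.0], 0.3, size=(40, 2))
blob_b = rng.normal([5.0, 5.0], 0.3, size=(40, 2))
points = np.vstack([blob_a, blob_b])

modes, labels = mean_shift(points, bandwidth=1.0)
print("clusters found:", len(modes))
```

The bandwidth plays the role that k plays in K-means: a larger value merges nearby peaks, a smaller one splits them.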

Whisky analysis using K-Means

A great analysis using K-Means in R. More than the analysis itself, this post is a lesson in how to carry out a cluster analysis when the number of clusters has to be chosen arbitrarily, as K-means requires.

As a result, the generation of the results and the analysis become much more of a walk-through and much less of a black box.

The final result?

“[…]The results indicate that there is a lot of variation in flavor profiles within the different scotch whisky regions. Note that initial cluster centers are chosen at random. In order to replicate the results, you will need to run the following code before your analysis.
set.seed(1)
Further data analysis would be required to determine whether proximity to types of water sources or terrain types drive common flavor profiles. This could be done by obtaining shape files and adding them as an additional layer to the ggmap plot.
For me, I have identified my next to-try single malt. Talisker is still within the familiar realm of cluster 4 but a little more malty, fruity and spicy. Sounds like the perfect holiday mix. […]”
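
The point in the quote about random initial centers can be sketched as follows. The original post uses R and `set.seed(1)`; this is a rough Python analogue with NumPy, not the post's code. The `kmeans` function, the choice of `k=4`, and the random "flavor" matrix are assumptions made for illustration only.

```python
import numpy as np

def kmeans(data, k, seed, n_iter=100):
    """Plain Lloyd's algorithm with randomly chosen initial centers."""
    rng = np.random.default_rng(seed)
    # Initial centers are k distinct observations picked at random,
    # which is why the seed determines the final clustering.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dist2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist2.argmin(axis=1)
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Synthetic stand-in for a table of flavor scores (rows: items, columns: flavors).
flavors = np.random.default_rng(42).random((60, 5))

centers_a, labels_a = kmeans(flavors, k=4, seed=1)
centers_b, labels_b = kmeans(flavors, k=4, seed=1)
centers_c, labels_c = kmeans(flavors, k=4, seed=2)

print("same seed, same labels:", np.array_equal(labels_a, labels_b))
print("different seed, same labels:", np.array_equal(labels_a, labels_c))
```

Fixing the seed makes a run repeatable; it does not make the chosen number of clusters any less arbitrary.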

Predicting Criminal Movements

A good article on modeling criminal events and how they move.

[…] Data available on distance between criminals’ homes and their targets shows that burglars are willing to travel longer distances for high-value targets, and tend to employ different means of transportation to make these long trips. Of course, this tendency differs among types of criminals. Professionals and older criminals may travel further than younger amateurs. A group of professional burglars planning to rob a bank, for instance, would reasonably be expected to follow a Lévy flight.

“There is actually a relationship between how far these criminals are willing to travel for a target and the ability for a hotspot to form,” explain Kolokolnikov and McCalla. […]
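
For intuition about what a Lévy flight looks like compared with an ordinary random walk, here is a small Python sketch with NumPy. It is not the model from the article: the Pareto tail exponent of 1.5, the Gaussian comparison walk, and the function name `random_walk_2d` are assumptions chosen for illustration.

```python
import numpy as np

def random_walk_2d(step_lengths, rng):
    """Build a 2-D path from step lengths, each taken in a uniformly random direction."""
    angles = rng.uniform(0.0, 2.0 * np.pi, size=len(step_lengths))
    steps = np.column_stack(
        [step_lengths * np.cos(angles), step_lengths * np.sin(angles)]
    )
    return np.cumsum(steps, axis=0)

rng = np.random.default_rng(0)
n = 10_000

# Ordinary random walk: step lengths have a light (Gaussian) tail.
brownian_lengths = np.abs(rng.normal(0.0, 1.0, size=n))
# Lévy-flight-like walk: step lengths have a heavy power-law (Pareto) tail,
# so most trips are short but a few are very long.
levy_lengths = rng.pareto(1.5, size=n) + 1.0

brownian_path = random_walk_2d(brownian_lengths, rng)
levy_path = random_walk_2d(levy_lengths, rng)

print("longest ordinary step:", brownian_lengths.max())
print("longest heavy-tailed step:", levy_lengths.max())
```

The occasional very long jump is the feature the excerpt associates with offenders who travel far for high-value targets.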
