If the answer is “no”, take a look at this tutorial from Algobeans.
The term ‘self-organizing map’ might conjure up a militaristic image of data points marching towards their contingents on a map, which is a rather apt analogy of how the algorithm actually works.
A self-organizing map (SOM) is a clustering technique that helps you uncover categories in large datasets, for example finding customer profiles from a list of past purchases. It is a special type of unsupervised neural network in which neurons (also called nodes or reference vectors) are arranged in a single, 2-dimensional grid whose cells can be either rectangular or hexagonal.
HOW DOES SOM WORK?
In a nutshell, an SOM comprises neurons in the grid, which gradually adapt to the intrinsic shape of our data. The final result allows us to visualize data points and identify clusters in a lower dimension.
So how does the SOM grid learn the shape of our data? Well, this is done in an iterative process, which is summarized in the following steps, and visualized in the animated GIF below:
Step 0: Randomly position the grid’s neurons in the data space.
Step 1: Select one data point, either randomly or by systematically cycling through the dataset in order.
Step 2: Find the neuron that is closest to the chosen data point. This neuron is called the Best Matching Unit (BMU).
Step 3: Move the BMU closer to that data point. The distance moved by the BMU is determined by a learning rate, which decreases after each iteration.
Step 4: Move the BMU’s neighbors closer to that data point as well, with farther away neighbors moving less. Neighbors are identified using a radius around the BMU, and the value for this radius decreases after each iteration.
Step 5: Update the learning rate and BMU radius, then repeat Steps 1 to 4. Iterate until the positions of the neurons have stabilized.
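Steps 0 to 5 can be sketched in plain Python as follows. This is a minimal illustration rather than the tutorial's own implementation. The function names `train_som` and `best_matching_unit`, the rectangular grid, the exponential decay schedule, and the Gaussian neighborhood weighting are all assumptions made for brevity; the steps only say that the learning rate and radius decrease and that farther neighbors move less. The sketch also runs a fixed number of iterations instead of checking whether the neurons have stabilized.

```python
import math
import random


def best_matching_unit(grid, x):
    """Step 2: return the grid coordinate of the neuron closest to x."""
    return min(grid, key=lambda k: math.dist(grid[k], x))


def train_som(data, rows=4, cols=4, n_iter=2000, lr0=0.5, radius0=None, seed=0):
    """Train a rectangular SOM; returns a dict mapping (row, col) -> weight vector."""
    rng = random.Random(seed)
    dim = len(data[0])
    if radius0 is None:
        radius0 = max(rows, cols) / 2.0  # assumed starting radius
    # Step 0: place every neuron at a random position inside the data's bounding box.
    lo = [min(p[d] for p in data) for d in range(dim)]
    hi = [max(p[d] for p in data) for d in range(dim)]
    grid = {(r, c): [rng.uniform(lo[d], hi[d]) for d in range(dim)]
            for r in range(rows) for c in range(cols)}

    for t in range(n_iter):
        # Step 5 (computed at the top of each pass): decay learning rate and radius.
        decay = math.exp(-t / n_iter)  # assumed schedule
        lr = lr0 * decay
        radius = radius0 * decay
        # Step 1: pick one data point at random.
        x = rng.choice(data)
        # Step 2: find the BMU.
        bmu = best_matching_unit(grid, x)
        # Steps 3 and 4: pull the BMU and its grid neighbors toward x.
        for (r, c), w in grid.items():
            grid_dist = math.hypot(r - bmu[0], c - bmu[1])
            if grid_dist > radius:
                continue  # outside the neighborhood: this neuron stays put
            # Gaussian falloff: the BMU moves most, farther neighbors move less.
            influence = math.exp(-(grid_dist ** 2) / (2 * radius ** 2))
            for d in range(dim):
                w[d] += lr * influence * (x[d] - w[d])
    return grid


# Usage with hypothetical 2-D data: two loose groups of points.
rng = random.Random(42)
data = ([(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
        + [(rng.gauss(8, 1), rng.gauss(8, 1)) for _ in range(100)])
som = train_som(data)
print(best_matching_unit(som, data[0]), best_matching_unit(som, data[-1]))
```

Note that neighbors are found by distance on the grid, while the BMU is found by distance in the data space. This distinction is what lets the grid unfold over the data while keeping nearby neurons similar.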
Abstract
We contend that corruption must be detected as soon as possible so that corrective and preventive measures may be taken. Thus, we develop an early warning system based on a neural network approach, specifically self-organizing maps, to predict public corruption based on economic and political factors. Unlike previous research, which is based on the perception of corruption, we use data on actual cases of corruption. We apply the model to Spanish provinces in which actual cases of corruption were reported by the media or went to court between 2000 and 2012. We find that the taxation of real estate, economic growth, the increase in real estate prices, the growing number of deposit institutions and non-financial firms, and the same political party remaining in power for long periods seem to induce public corruption. Our model provides different profiles of corruption risk depending on the economic conditions of a region conditional on the timing of the prediction. Our model also provides different time frameworks to predict corruption up to 3 years before cases are detected.
Concluding Remarks
We develop a model of neural networks to predict public corruption based on economic and political factors. We apply this model to the Spanish provinces in which corrupt cases have been uncovered by the media or have gone to trial. Unlike previous research, which is based on the perception of corruption, we use data on actual cases of corruption. The output of our model is a set of SOMs, which allow us to predict corruption in different time scenarios before corruption cases are detected. Our model provides two main insights. First, we identify some underlying economic and political factors that can result in public corruption. Taxation of real estate, economic growth, and an increase in real estate prices, in the number of deposit institutions, and the same party remaining in office for a long time seem to induce public corruption. Second, our model provides different time frameworks to predict corruption. In some regions, we are able to detect latent corruption long before it emerges (up to 3 years), and in other regions our model provides short-term alerts, and suggests the need to take urgent preventive or corrective measures. Given the connection we find between economic and political factors and public corruption, some caveats must be applied to our results. Our model does not mean that economic growth or a given party remaining in power causes public corruption but that the fastest growing regions or the ones ruled by the same party for a long time are the most likely to be involved in corruption cases. Economic growth per se is not a sign of corruption, but rather it increases the interactions between economic agents and public officers. Similarly, being in office too long might prove to be an incentive for creating a network of unfair relations between politicians and economic agents. In addition, more competitive markets may induce some agents to pay bribes in order to obtain public concessions or a better competitive position.
These results are consistent with some research exploring the relation between economic growth and corruption (Kuo et al. 2002; Kaufman and Rousseeuw 2009; Chen et al. 2002). Since corruption remains a widespread global concern, a key issue in our research is the generalizability of our model and the proposed actions. We have used fairly common macroeconomic and political variables that are widely available from public sources in many countries. In turn, our model can be applied to other regions and countries as well. Of course, the model could be improved if a country or region-specific factors were taken into account. Our approach is interesting both for academia and public authorities. For academia, we provide an innovative way to predict public corruption using neural networks. These methods have often been used to predict corporate financial distress and other economic events, but, as far as we are aware, no studies have yet attempted to use neural networks to predict public corruption. Consequently, we extend the domain of neural network application. For public authorities, we provide a model that improves the efficiency of the measures aimed at fighting corruption. Because the resources available to combat corruption are limited, authorities can use the early corruption warning system, which categorizes each province according to its corruption profile, in order to narrow their focus and better implement preventive and corrective policies. In addition, our model predicts corruption cases long before they are discovered, which enhances anticipatory measures. Our model can be especially relevant in countries suffering the severest corruption problems. In fact, European Union authorities are highly concerned about widespread corruption in certain countries. The study of new methodologies based on neural networks is a fertile field to be applied to a number of legal and economic issues. 
One possible direction for future research is to extend our model to the international framework and to take into account country-specific factors. Another application may be the detection of patterns of corruption and money laundering across different countries in the European Union.
Although exploratory data analysis takes up a great deal of space in data science problems, unsupervised learning methods still have their value, even if the scientific and professional communities discuss them far less often than predictive methods.
One of the most underrated techniques in machine learning is clustering (also called cluster analysis).
This post by Kunal Jain offers one of the best reviews of cluster analysis and its peculiarities.
Connectivity models: As the name suggests, these models are based on the notion that data points closer together in data space are more similar to each other than data points lying farther apart. These models can follow two approaches. In the first, every data point starts in its own cluster, and clusters are then merged as the distance between them decreases. In the second, all data points start in a single cluster, which is then partitioned as the distance increases. The choice of distance function is subjective. These models are very easy to interpret but lack the scalability to handle big datasets. Examples are the hierarchical clustering algorithm and its variants.
Centroid models: These are iterative clustering algorithms in which similarity is derived from the closeness of a data point to the centroid of a cluster. The K-Means clustering algorithm is a popular example. In these models, the number of clusters has to be specified beforehand, which makes prior knowledge of the dataset important. These models run iteratively to find a local optimum.
Distribution models: These clustering models are based on how probable it is that all data points in a cluster belong to the same distribution (for example, a normal, i.e. Gaussian, distribution). These models often suffer from overfitting. A popular example is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
Density models: These models search the data space for regions of differing data point density. They isolate the different density regions and assign the data points within each region to the same cluster. Popular examples are DBSCAN and OPTICS.
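To make the centroid family concrete, here is a minimal sketch of K-Means (Lloyd's algorithm) in plain Python. It is an illustration only, not code from Kunal Jain's post. The function name `k_means`, the random choice of initial centroids, and the toy points are assumptions. It shows the two properties mentioned above: `k` must be supplied up front, and the loop stops at a local optimum that depends on the starting centroids.

```python
import math
import random


def k_means(points, k, n_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of equal-length tuples."""
    rng = random.Random(seed)
    # k is fixed in advance; the initial centroids are k randomly sampled data points.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing: a local optimum was reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Usage with hypothetical points forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = k_means(points, k=2)
print(centroids, labels)
```

Which cluster receives index 0 depends on the random initialization, so only the grouping is meaningful, not the label values themselves.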
This paper, published in the academic journal Expert Systems with Applications, presents an interesting study in which Indian researchers used clustering techniques to build and manage portfolios of stocks from the Indian stock market and compared the results against the Sensex index.
The study bases its asset selection on ideas from Markowitz's article Portfolio Selection, in which a portfolio is composed not only of the assets with the best financial return but also of those with low risk.
Building on this principle, companies are grouped into clusters according to certain technical analysis indicators; then, in a subsequent step, portfolios are formed, with a weight for each company, according to the value of the cluster validity index.
The article offers interesting ideas. Its weak point (which the authors probably did not address, whether through unawareness or abstraction) is that technical factors are ill-suited to this kind of classification because of their high transaction volume, and the approach is impractical in terms of updating the data for asset allocation. Had the article focused on fundamental, macroeconomic, and sector indicators to frame portfolio construction and management, it would have shown better results.
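The return-versus-risk trade-off in Markowitz's framework can be illustrated with the standard mean-variance quantities: a portfolio's expected return is the weighted sum of the asset returns, and its risk is the variance computed from the weights and the covariance matrix of returns. The sketch below is not the paper's procedure; the function names and all numbers are hypothetical and chosen only to show the arithmetic.

```python
def portfolio_return(weights, mean_returns):
    """Expected portfolio return: sum of w_i * mu_i."""
    return sum(w * m for w, m in zip(weights, mean_returns))


def portfolio_variance(weights, cov):
    """Portfolio variance: sum over i, j of w_i * cov[i][j] * w_j."""
    n = len(weights)
    return sum(weights[i] * cov[i][j] * weights[j]
               for i in range(n) for j in range(n))


# Hypothetical two-asset example (made-up figures, not market data).
mean_returns = [0.10, 0.06]   # expected returns of assets A and B
cov = [[0.040, 0.006],        # covariance matrix of their returns
       [0.006, 0.010]]
weights = [0.5, 0.5]          # portfolio weights, summing to 1

print(portfolio_return(weights, mean_returns))
print(portfolio_variance(weights, cov))
```

In this made-up example the equal-weight portfolio has a variance lower than that of asset A alone (0.040), which is the diversification effect that motivates weighing risk alongside return.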