Explaining the Success of AdaBoost and Random Forests as Interpolating Classifiers

Abstract: There is a large literature explaining why AdaBoost is a successful classifier, focused on classifier margins and on boosting’s interpretation as the optimization of an exponential likelihood function. These explanations, however, have been pointed out to be incomplete. A random forest is another popular ensemble method for which there is substantially less explanation in the literature. We introduce a novel perspective on AdaBoost and random forests, proposing that the two algorithms work for similar reasons. While both classifiers achieve similar predictive accuracy, random forests cannot be conceived of as a direct optimization procedure. Rather, random forests is a self-averaging, interpolating algorithm that creates what we denote a spiked-smooth classifier, and we view AdaBoost in the same light. We conjecture that both AdaBoost and random forests succeed because of this mechanism. We provide a number of examples to support this explanation. In the process, we question the conventional wisdom that boosting algorithms for classification require regularization or early stopping and should be limited to low-complexity classes of learners, such as decision stumps. We conclude that boosting should be used like random forests: with large decision trees, without regularization or early stopping.
Concluding Remarks: AdaBoost is an undeniably successful algorithm, and random forests is at least as good, if not better. But AdaBoost is as puzzling as it is successful: it broke the basic rules of statistics by iteratively fitting even noisy data sets until every training set data point was fit without error. Even more puzzling, to statisticians at least, continuing to iterate after the training data are already perfectly fit further lowers generalization error. The statistical view of boosting understands AdaBoost as a stagewise optimization of an exponential loss, which suggests (indeed demands) regularization of tree size and control of the number of iterations.
In contrast, a random forest is not an optimization; it appears to work best with large trees and as many iterations as possible. It is widely believed that AdaBoost is effective because it is an optimization, while random forests works, well, because it works. Breiman conjectured that “it is my belief that in its later stages AdaBoost is emulating a random forest” (Breiman, 2001). This paper sheds some light on this conjecture by providing a novel intuition, supported by examples, to show how AdaBoost and random forests are successful for the same reason.
A random forest is a weighted ensemble of interpolating classifiers by construction. Although it is much less evident, we have shown that AdaBoost is also a weighted ensemble of interpolating classifiers. Viewed in this way, AdaBoost is actually a “random” forest of forests. The trees in random forests and the forests in AdaBoost each interpolate the data without error. As the number of iterations increases, the averaged decision surface becomes smoother but nevertheless still interpolates; this is accomplished by whittling down the decision boundary around error points. We hope to have cast doubt on the commonly held belief that the later iterations of AdaBoost only serve to overfit the data. Instead, we argue that these later iterations lead to an “averaging effect”, which causes AdaBoost to behave like a random forest.
A central part of our discussion also focused on the merits of interpolating the training data when interpolation is coupled with averaging. Again, we hope to dispel the commonly held belief that interpolation always leads to overfitting. We have argued instead that fitting the training data in extremely local neighborhoods actually serves to prevent overfitting in the presence of averaging: the local fits prevent noise points from having undue influence over the fit in other areas. Random forests and AdaBoost both achieve this desirable level of local interpolation by fitting deep trees. It is our hope that our emphasis on the “self-averaging” and interpolating aspects of AdaBoost will lead to a broader discussion of this classifier’s success, one that extends beyond the more traditional emphasis on margins and exponential loss minimization.
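The interpolate-then-average mechanism described above can be illustrated with a toy sketch (ours, not the paper's experiments). Here each base learner is a 1-nearest-neighbor rule fit to a bootstrap sample, so by construction it interpolates that sample with zero training error; majority voting over many such learners produces a fit that agrees with the true rule away from a mislabeled point, with only a local spike around it.

```python
import random

# Toy sketch: an ensemble of interpolating 1-NN classifiers, each fit to
# a bootstrap resample, combined by majority vote.
random.seed(0)
X = [i / 20 for i in range(20)]          # 1-D inputs on a grid
y = [1 if x >= 0.5 else 0 for x in X]    # true rule: class 1 iff x >= 0.5
y[2] = 1                                 # one noise point at x = 0.10
data = list(zip(X, y))

def one_nn(train, x):
    """Predict the label of the nearest training point (an interpolator)."""
    return min(train, key=lambda p: abs(p[0] - x))[1]

# Each member is a bootstrap resample; 1-NN on it interpolates it exactly.
members = [[random.choice(data) for _ in data] for _ in range(200)]

def ensemble_predict(x):
    votes = sum(one_nn(boot, x) for boot in members)
    return 1 if 2 * votes > len(members) else 0
```

Every member fits its own bootstrap sample without error, yet the majority vote recovers the true rule away from the noise point; the noise point's influence is confined to a narrow spike near x = 0.10, which is the “spiked-smooth” behavior described above.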

The Credit Scoring Model Based on Logistic-BP-AdaBoost Algorithm and its Application in P2P Credit Platform

Abstract: We apply logistic regression, a BP neural network, and the AdaBoost algorithm to build a model (the Logistic-BP-AdaBoost model) that estimates an applicant's credit score from multidimensional personal data. Compared with other methods, the model can better estimate an applicant's probability of loan default and provides a score for each applicant. We apply this model to a website and establish an online loan platform, which is expected to improve the efficiency and reduce the costs of traditional lending.
Conclusion: Building on data mining techniques and other researchers’ work, we studied logistic regression, BP neural networks, and AdaBoost in order to simplify the complex approval workflow and reduce prediction error in traditional lending. In this paper we combine logistic regression with a BP neural network and then use AdaBoost to strengthen the combined model. For the traditional loan-approval problem, we take full account of user registration information and user sources to predict an applicant's loan outcome more accurately. The multidimensional user data give a clear picture of each user, and analyzing user sources together with a fraud score allows an accurate judgment of each applicant. Finally, the L-B-A model was deployed on a P2P loan platform, and practice showed that the model is highly practical and achieves the goal of simplifying the loan approval process.
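As a concrete illustration of the boosting step used to strengthen the model, here is a minimal sketch of one round of discrete-AdaBoost reweighting. This shows generic AdaBoost mechanics with a hypothetical fixed decision stump and illustrative data, not the paper's exact Logistic-BP-AdaBoost pipeline: misclassified examples gain weight so the next learner concentrates on them.

```python
import math

# One discrete-AdaBoost round (generic mechanics; labels are +/-1).
X = [0.1, 0.4, 0.35, 0.8]           # hypothetical normalized applicant features
y = [-1, 1, 1, -1]                  # illustrative labels: -1 default, +1 repay
w = [1 / len(X)] * len(X)           # uniform initial sample weights

def stump(x):
    return 1 if x >= 0.3 else -1    # a fixed weak learner; wrong on the last example

# Weighted training error of the weak learner, and its vote weight alpha.
err = sum(wi for xi, yi, wi in zip(X, y, w) if stump(xi) != yi)
alpha = 0.5 * math.log((1 - err) / err)

# Reweight: misclassified examples get heavier, then renormalize so the
# next round's learner focuses on the hard cases.
w = [wi * math.exp(-alpha * yi * stump(xi)) for xi, yi, wi in zip(X, y, w)]
total = sum(w)
w = [wi / total for wi in w]
```

After one round the single misclassified example carries half of the total weight, which is exactly the mechanism by which AdaBoost forces later learners to attend to hard cases.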

Top 10 Algorithms in Data Mining

This 2007 article presents the 10 most widely used data mining algorithms across a broad range of domains. The algorithms were selected through an ACM KDD survey in which many researchers gave their opinions. The algorithms covered by the survey are described succinctly and objectively, and the article is well worth reading.


