Cesare Furlanello, Stefano Merler, Annapaola Rizzoli*, Claudio Chemini*, Claudio Genchi**.
ITC-IRST, Trento, *Centro di Ecologia Alpina, Trento, **Institute of General Pathology and Parasitology, Veterinary Medicine, Milano University, Italy
A statistical procedure called bagging (for Bootstrap AGGregatING) has recently been proposed
for combining an ensemble of predictive models [25]. The key idea is to grow many versions of
the same predictor over resampled versions of the data set L available for model development.
Let this learning set be L = {(y_n, x_n), n = 1, ..., N}, a set of N observations in which the
y_n are the class labels and the x_n are the multivariate vectors of predictor variables, and
assume there exists a procedure that develops from L a model f(x, L) for predicting the class y
of a generic and possibly unseen vector x. When small changes in L can result in large changes
in the classification model f, future classification can be improved by using a combination of
different models f(x, L_b), each developed on a different learning set L_b. The output of the
bagged classifier f_B(x) is defined by voting, i.e. it is the class receiving the most votes:
if N_j = #{b : f(x, L_b) = j}, then f_B(x) = argmax_j N_j. For regression models, the output is
the average of the individual predictions. Additional data are not required: the replicate
learning sets L_b are obtained as bootstrap samples of L, i.e. data sets of N cases drawn at
random, but with replacement, from L.
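The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this work: the function names (bootstrap_sample, fit_stump, bagged_classifier), the number of replicates, and the toy data are all hypothetical, and a one-split threshold rule stands in for the base procedure f(x, L) to keep the example short.

```python
import random
from collections import Counter


def majority(labels, default=None):
    """Most frequent label in a list (ties go to the label seen first)."""
    return Counter(labels).most_common(1)[0][0] if labels else default


def bootstrap_sample(learning_set, rng):
    """Replicate learning set L_b: N cases drawn at random, with replacement, from L."""
    n = len(learning_set)
    return [learning_set[rng.randrange(n)] for _ in range(n)]


def fit_stump(learning_set):
    """Hypothetical stand-in for f(x, L): the single split with fewest training errors.

    learning_set is a list of (y, x) pairs, with x a tuple of numeric predictors.
    """
    overall = majority([y for y, _ in learning_set])
    best = None  # (errors, feature index, threshold, left label, right label)
    for j in range(len(learning_set[0][1])):
        for t in sorted({x[j] for _, x in learning_set}):
            left = [y for y, x in learning_set if x[j] <= t]
            right = [y for y, x in learning_set if x[j] > t]
            l_lab = majority(left, overall)
            r_lab = majority(right, overall)
            errors = sum(y != l_lab for y in left) + sum(y != r_lab for y in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    _, j, t, l_lab, r_lab = best
    return lambda x: l_lab if x[j] <= t else r_lab


def bagged_classifier(learning_set, n_replicates=25, seed=0):
    """Build f_B: fit one model per bootstrap replicate, then predict by voting."""
    rng = random.Random(seed)
    models = [fit_stump(bootstrap_sample(learning_set, rng))
              for _ in range(n_replicates)]

    def f_B(x):
        # N_j = number of models voting for class j; return argmax_j N_j.
        votes = Counter(model(x) for model in models)
        return votes.most_common(1)[0][0]

    return f_B


# Toy learning set (assumed for illustration): one predictor, class 1 when v >= 10.
toy = [(int(v >= 10), (float(v),)) for v in range(20)]
f_B = bagged_classifier(toy)
print(f_B((3.0,)), f_B((17.0,)))
```

For a regression model, the vote in f_B would be replaced by the mean of the individual predictions. The fixed seed only makes the sketch reproducible; it plays no role in the method itself.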
The bagging process is illustrated in figure 1. The technique can reduce output variability and
thus induce a surprisingly large reduction in error rates. The reduction is especially marked
with recursive partitioning procedures (or tree-based classifiers), which have low bias but
depend strongly on the training data. A single classification tree is appropriate for modeling
phenomena described by both discrete and continuous variables, and it also yields a simple
representation of the model. The bagging procedure, in which the output is obtained by voting or
averaging, can reduce the error variability on novel data; however, it generally produces models
that are less immediately interpretable. In our landscape epidemiology problem, the model outputs
constitute a digital map: using a geographical information system (GIS) as an interface thus
eliminates the problem of interpreting bagged models.
Fig. 1: The general bagging procedure.