Note: This article is loosely a copy of this post.
Model training and tuning
The caret package has several functions that attempt to streamline the model building and evaluation process.
The train function can be used to
- evaluate, using re-sampling, the effect of model tuning parameters on performance
- choose the "optimal" model across these parameters
- estimate model performance from a training set
Generic algorithm for model building:
Define sets of model parameter values to evaluate
for each parameter set:
    for each resampling iteration:
        Hold out specific samples (the validation set)
        [Optional] Pre-process the data
        Fit the model on the remainder
        Predict the held-out samples
    end
end
Determine the optimal parameter set
Fit the model to all the training data using the optimal parameter set
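A minimal hand-rolled sketch of this loop, written here with a k-nearest-neighbour classifier from the class package on the Sonar data introduced below (the five folds and the grid of k values are arbitrary assumptions, not from the original post), might look like this; caret's train automates exactly this bookkeeping:

library(class)     # for knn()
library(mlbench)
data(Sonar)

x <- Sonar[, -61]   # 60 numeric predictors
y <- Sonar$Class    # outcome factor

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(Sonar)))   # random 5-fold assignment
kGrid <- c(1, 3, 5, 7)                                # parameter sets to evaluate

cvAcc <- sapply(kGrid, function(k) {
  mean(sapply(1:5, function(f) {
    holdOut <- folds == f                             # hold out these samples
    pred <- knn(train = x[!holdOut, ], test = x[holdOut, ],
                cl = y[!holdOut], k = k)              # fit on the remainder, predict the hold-out
    mean(pred == y[holdOut])                          # hold-out accuracy
  }))                                                 # resampled accuracy for this k
})

bestK <- kGrid[which.max(cvAcc)]                      # determine the optimal parameter set
# For kNN the final "fit" is simply keeping all of x and y together with bestK.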
Notes:
- First we need to choose the modeling technique.
- In the case of non-parametric models there is model tuning, but we still need cross-validation to comment on how well the model's performance generalizes.
- Re-sampling techniques available in caret: k-fold cross-validation (once or repeated), leave-one-out cross-validation (a very costly operation) and the bootstrap (simple estimation or the 632 rule); see the sketch after these notes.
- By default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.
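As an illustrative sketch (not code from the original post), each of these resampling schemes is requested through the method argument of trainControl, which is covered in more detail later:

library(caret)

ctrlRepCV <- trainControl(method = "repeatedcv", number = 10, repeats = 3)  # repeated k-fold CV
ctrlLOO   <- trainControl(method = "LOOCV")                                 # leave-one-out CV (costly)
ctrlBoot  <- trainControl(method = "boot",    number = 25)                  # simple bootstrap estimate
ctrlB632  <- trainControl(method = "boot632", number = 25)                  # bootstrap with the 632 rule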
Let's get into an example:
> library(caret)
> library(mlbench)
> data("Sonar")
> dim(Sonar)
[1] 208 61
> 100 * prop.table ( table ( Sonar$Class) )
M R
53.36538 46.63462
The function createDataPartition can be used to create a stratified random sample of the data into training and test sets.
> inTraining <- createDataPartition( Sonar$Class, p = 0.75, list = FALSE)
> training <- Sonar[ inTraining, ]
> testing <- Sonar[ -inTraining, ]
> 100 * prop.table ( table(training$Class))
M R
53.50318 46.49682
> 100 * prop.table ( table(testing$Class))
M R
52.94118 47.05882
Cautionary note: in highly class-imbalanced data sets, stratified sampling may not be the right thing to do.
Basic Parameter Tuning:
By default, simple bootstrap resampling is used for estimation. The function trainControl can be used to specify the type of resampling:
> fitControl <- trainControl(## 10-fold CV
+ method = "repeatedcv",
+ number = 10,
+ ## repeated ten times
+ repeats = 10)
> set.seed(825)
> gbmFit1 <- train(Class ~ ., data = training,
+ method = "gbm",
+ trControl = fitControl,
+ ## This last option is actually one
+ ## for gbm() that passes through
+ verbose = FALSE)
> gbmFit1
Stochastic Gradient Boosting
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ...
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
1 50 0.7505588 0.4951522 0.10073200 0.2041972
1 100 0.7876520 0.5702598 0.09028830 0.1833782
1 150 0.8047647 0.6047684 0.08520638 0.1733785
2 50 0.7878235 0.5701625 0.09287982 0.1890531
2 100 0.8055882 0.6054776 0.09520934 0.1943406
2 150 0.8161495 0.6260536 0.09111650 0.1873189
3 50 0.7987598 0.5916571 0.09573996 0.1964155
3 100 0.8205098 0.6353613 0.09711836 0.1994434
3 150 0.8251912 0.6450311 0.09312786 0.1911133
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
> trellis.par.set( caretTheme() )
> plot ( gbmFit1 )
> plot ( gbmFit1, metric = "Kappa") # alternate performance metric
> plot ( gbmFit1, metric = "Kappa", plotType = "level", scales = list (x = list(rot = 90)) )
The trainControl Function:
The function trainControl generates parameters that further control how models are created (a combined example follows this list), with possible values:
- method: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included (the gbm package maintainer has indicated that it would not be a good idea to choose tuning parameter values based on the model OOB error estimates with boosted trees). Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
- number and repeats: number controls the number of folds in K-fold cross-validation or the number of resampling iterations for bootstrapping and leave-group-out cross-validation. repeats applies only to repeated K-fold cross-validation: if method = "repeatedcv", number = 10 and repeats = 3, then three separate 10-fold cross-validations are used as the resampling scheme.
- verboseIter: A logical for printing a training log.
- returnData: a logical for saving the data into a slot called trainingData.
- p: for leave-group-out cross-validation: the training percentage.
- For method = "timeslice", trainControl has options initialWindow, horizon and fixedWindow that govern how cross-validation can be used for time series data.
- classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resampling.
- index and indexOut: optional lists with elements for each resampling iteration. Each list element gives the sample rows used for training at that iteration (index) or the rows to be held out (indexOut). When these values are not specified, train will generate them.
- summaryFunction: a function to compute alternate
performance summaries.
- selectionFunction: a function to choose the optimal
tuning parameters.
- PCAthresh, ICAcomp and k: these are
all options to pass to the preProcess function (when used).
- returnResamp: a character string containing one of
the following values: "all", "final" or
"none". This specifies how much of the resampled
performance measures to save.
- allowParallel: a logical that governs whether train should use parallel processing (if available).
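As a hedged illustration (this exact combination is not part of the original post), several of the options above can be combined in a single trainControl call:

library(caret)

ctrl <- trainControl(
  method            = "repeatedcv",     # resampling scheme
  number            = 10,               # 10 folds
  repeats           = 3,                # repeated 3 times
  verboseIter       = TRUE,             # print a training log
  returnData        = FALSE,            # do not keep a copy of the data
  classProbs        = TRUE,             # compute class probabilities on hold-outs
  summaryFunction   = twoClassSummary,  # alternate performance summaries
  selectionFunction = "best",           # how to pick the optimal tuning parameters
  returnResamp      = "final",          # keep resampled measures for the final model only
  allowParallel     = TRUE              # use a parallel back-end if one is registered
)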
Alternate Performance Metrics :
The user can change the metric used to determine the best settings. By
default, RMSE and R2 are computed for regression while accuracy and
Kappa are computed for classification. Also by default, the parameter
values are chosen using RMSE and accuracy, respectively for
regression and classification. The metric argument of the
train function allows the user to control which optimality criterion is used. For example, in problems where there is a low percentage of samples in one class, using metric = "Kappa" can improve the quality of the final model.
If none of these metrics are satisfactory, the user can also compute custom performance metrics. The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:
- data is a reference to a data frame or matrix with columns called obs and pred for the observed and predicted outcome values. Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns will be present in data that contain the class probabilities. The names of these columns are the same as the class levels. Also, if weights were specified in the call to train, a column called weights will also be in the data set.
- lev is a character string that has the outcome factor levels taken from the training data.
- model is a character string for the model being used.
The output of the function should be a vector of numeric summary metrics with non-null names. By default, train evaluates classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within the resampling process, the argument classProbs in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample (there is a column per class and the column names are the class names).
As mentioned previously, custom functions can be used to calculate performance scores that are averaged over the resamples.
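For instance, a minimal custom summary function (the name errorRateSummary is illustrative, not part of caret) must accept the data, lev and model arguments described above and return a named numeric vector:

library(caret)

# Illustrative custom summary function: reports accuracy and error rate
# computed from the held-out predictions in `data`.
errorRateSummary <- function(data, lev = NULL, model = NULL) {
  acc <- mean(data$pred == data$obs)
  c(Accuracy = acc, ErrorRate = 1 - acc)   # named numeric vector, as train expects
}

# Plugged into the resampling specification:
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = errorRateSummary)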
Rebuilding the boosted tree model using this criterion, we can see the relationship between the tuning parameters and the area under the ROC curve using the following code:
> fitControl <- trainControl(method = "repeatedcv",
+ number = 10,
+ repeats = 10,
+ ## Estimate class probabilities
+ classProbs = TRUE,
+ ## Evaluate performance using
+ ## the following function
+ summaryFunction = twoClassSummary)
> set.seed(825)
> gbmFit3 <- train(
+ Class ~ .,
+ data = training,
+ method = "gbm",
+ trControl = fitControl,
+ verbose = FALSE,
+ tuneGrid = gbmGrid,
+ metric = "ROC")
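One caveat: gbmGrid, passed to tuneGrid in the call above, is never defined in this excerpt. A plausible definition, consistent with the tuning parameter values reported further down (shrinkage = 0.1, n.minobsinnode = 20), would need to be run before the train call, for example:

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)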
Choosing the Final Model :
Another method for customizing the tuning process is to modify the algorithm used to select the best parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value (or smallest, for RMSE in regression models). Other schemes for selecting a model can be used. Breiman et al. (1984) suggested the "one standard error rule" for simple tree-based models: the model with the best performance value is identified and, using resampling, we can estimate the standard error of its performance. The final model is then the simplest model within one standard error of the (empirically) best model.
train allows the user to specify alternate rules for
selecting the final model. The argument selectionFunction
can be used to supply a function to algorithmically determine the
final model. There are three existing functions in the package:
best simply chooses the largest/smallest value, oneSE
attempts to capture the spirit of Breiman et al (1984) and
tolerance selects the least complex model within some percent
tolerance of the best value.
User-defined functions can be used, as long as they have the
following arguments:
- x is a data frame containing the tuning parameters and
their associated performance metrics. Each row corresponds to a
different tuning parameter combination.
- metric is a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train).
- maximize is a single logical value indicating
whether larger values of the performance metric are better
(this is also directly passed from the call to
train).
The function should output a single integer indicating which row in x is chosen.
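As a sketch under the assumption of a gbm-style tuning grid (the function name pickSmallForest and the 1% threshold are illustrative, not part of caret), a user-defined selection function could prefer the smallest number of boosting iterations whose performance is within 1% of the best:

library(caret)

# Illustrative selection function: assumes the tuning grid in x has an n.trees column (gbm).
pickSmallForest <- function(x, metric, maximize) {
  best <- if (maximize) max(x[, metric]) else min(x[, metric])
  tol  <- 0.01 * abs(best)                        # within 1% of the best value
  ok   <- if (maximize) x[, metric] >= best - tol else x[, metric] <= best + tol
  candidates <- which(ok)
  candidates[which.min(x$n.trees[candidates])]    # return a single row index of x
}

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     selectionFunction = pickSmallForest)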
The tolerance function could be used to find a less complex model based on (x - x_best)/x_best * 100, which is the percent difference. For example, to select parameter values based on a 2% loss of performance:
> whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC",
+ tol = 2, maximize = TRUE)
> whichTwoPct
[1] 2
> gbmFit3$results[whichTwoPct,1:6]
shrinkage interaction.depth n.minobsinnode n.trees ROC Sens
6 0.1 5 20 50 0.8795213 0.8154167
Extracting Predictions and Class Probabilities:
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
> predict( gbmFit3, head(Sonar), type = "prob")
Exploring and Comparing Resampling Distributions:
Within-Model :
There are several lattice functions that can be used to explore relationships between tuning parameters and the resampling results for a specific model.
- xyplot and stripplot can be used to plot resampling statistics
against (numeric) tuning parameters.
- histogram and densityplot can also be used
to look at distributions.
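For example, applied to the gbmFit1 object fit earlier (a hedged one-liner; plotting resampling statistics against all of the tuning parameters additionally requires returnResamp = "all" in trainControl):

trellis.par.set( caretTheme() )
densityplot( gbmFit1, pch = "|" )   # distribution of the resampled accuracy for the final model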
Between-Models :
The caret package also includes functions to characterize the differences
between models (generated using train, sbf or
rfe) via their resampling distributions.
First, a support vector machine model is fit to the Sonar data. The data are centered and scaled using the preProc argument. Note that the same random number seed is set prior to fitting, identical to the seed used for the boosted tree model. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.
> set.seed(825)
> svmFit <- train(Class ~ ., data = training,
+ method = "svmRadial",
+ trControl = fitControl,
+ preProc = c("center", "scale"),
+ tuneLength = 8,
+ metric = "ROC")
Also, a regularized discriminant analysis model was fit.
> set.seed(825)
> rdaFit <- train(Class ~ ., data = training,
+ method = "rda",
+ trControl = fitControl,
+ tuneLength = 4,
+ metric = "ROC")
Given these models, can we make statistical statements about their
performance differences? To do this, we first collect the resampling
results using resamples.
resamps <- resamples(list(GBM = gbmFit3,
SVM = svmFit,
RDA = rdaFit))
resamps
summary(resamps)
There are several lattice plot methods that can be used to visualize the resampling
distributions: density plots, box-whisker plots, scatterplot matrices and scatterplots
of summary statistics.
trellis.par.set( caretTheme() )
bwplot ( resamps, layout = c(3,1) )
trellis.par.set ( caretTheme() )
dotplot( resamps, metric="ROC" )
trellis.par.set ( caretTheme() )
xyplot ( resamps, what="BlandAltman")
Other visualizations are available via densityplot.resamples and parallel.resamples.
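For example (again a hedged sketch rather than code from the original post):

trellis.par.set( caretTheme() )
densityplot( resamps, metric = "ROC", auto.key = list(columns = 3) )  # per-model ROC distributions
splom( resamps, metric = "ROC" )                                      # scatterplot matrix of resampled ROC values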
Since models are fit on the same versions of the training data, it
makes sense to make inferences on the differences between models. In
this way we reduce the within-resample correlation that may exist. We
can compute the differences, then use a simple t-test to evaluate
the null hypothesis that there is no difference between models.
diffValues <- diff ( resamps )
summary ( diffValues )
trellis.par.set ( caretTheme() )
bwplot ( diffValues, layout = c( 3,1 ) )
Fitting Models without Parameter Tuning:
In cases where the tuning parameter values are already known, train can be used to fit the model to the entire training set without any resampling or parameter tuning, using the method = "none" option in trainControl.
fitControl <- trainControl ( method ="none", classProbs = TRUE )
set.seed ( 825 )
gbmFit4 <- train (
  Class ~ .,
  data = training,
  method = "gbm",
  trControl = fitControl,
  verbose = FALSE,
  tuneGrid = data.frame ( interaction.depth = 4, n.trees = 100,
                          shrinkage = 0.1, n.minobsinnode = 20),
  metric = "ROC" )
predict ( gbmFit4, newdata = head(Sonar) )
predict ( gbmFit4, newdata = head(Sonar), type="prob" )
Finally, this long article is coming to an end. Thanks for following along to this point.
Regards
Viswanath Gangavaram