Sunday, 10 January 2016

Statistical Learning MOOC by Trevor Hastie and Robert Tibshirani

Trevor Hastie and Robert Tibshirani Offering their course third time in succession. Here goes the course link https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about


It's time to refresh and strengthen my machine learning / statistical learning fundamentals, along with the course work, I am planning to complete following books/resources:
  1. An Introduction to Statistical Learning By course authors
  2. The elements of Statistical Learning
  3. Applied Predictive Modeling by Max Kuhn ( R Caret package author )
  4. In & Out of caret package
  5. The Art of R programming

Thursday, 7 January 2016

Model Training and Tuning in caret package


Note: This article loosely a copy of this post

Model training and tuning

The caret package has several functions that attempt to streamline the model building and evaluation process.

The train function can be used to
  • evaluate, using re-sampling, the effect of model tuning parameters on performance
  • choose the "optimal" model across these parameters
  • Estimate model performance from a training set

Generic Algorithm for Model Building:-

Define sets of model parameter values to evaluate
for each parameter set:
   for each resampling iteration:
      Hold out specific samples(Validation Set)
      [Optional] Pre-Process the data
      Fit the model on the remainder
      Predict the Hold-Out samples
   end
end
Determine the Optimal parameter set
Fit the model to all the training data using the optimal parameter set

Notes: 
- First we need to choose the modeling technique. 
- In the case of non-parametric models, there is model turning(But we need to do cross-validation to comment about generalization of the model performance).
-  Available re-sampling techniques in caret : k-fold cross validation (once/repeated), leave-one-out cross-validation(this is very costly operation) & bootstrap(simple estimation or 632 rule).
- Be default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.

Let's getting into an example :

> library(mlbench)
> data("Sonar")
> dim(Sonar)
[1] 208  61
> 100 * prop.table ( table ( Sonar$Class) )

       M        R 
53.36538 46.63462 
 
The function createDataPartition can be used to create a stratified random sample of the data into training and test sets.
 


> inTraing <- createDataPartition( Sonar$Class, = 0.75, list=FALSE)
> inTraing <- createDataPartition( Sonar$Class, p = 0.75, list=FALSE)
> training <- Sonar [ inTraing, ]
> testing <- Sonar[ - inTraing, ]
> 100 * prop.table ( table(training$Class))

       M        R 
53.50318 46.49682 
> 100 * prop.table ( table(testing$Class))

       M        R 
52.94118 47.05882   

Cautionary note: In highly class-imbalance data sets, stratified sampling may not be the right thing to do.

Basic Parameter Tuning:

By default, simple bootstrap resampling is used for estimation. The function trainControl can be used to specify the type of resampling:
 
> fitControl <- trainControl(## 10-fold CV
+     method = "repeatedcv",
+     number = 10,
+     ## repeated ten times
+     repeats = 10)
 
> gbmFit1 <- train(Class ~ ., data = training,
+                  method = "gbm",
+                  trControl = fitControl,
+                  ## This last option is actually one
+                  ## for gbm() that passes through
+                  verbose = FALSE
 
> set.seed(825)
> gbmFit1 <- train(Class ~ ., data = training,
+                  method = "gbm",
+                  trControl = fitControl,
+                  ## This last option is actually one
+                  ## for gbm() that passes through
+                  verbose = FALSE)
 
> gbmFit1
Stochastic Gradient Boosting 

157 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD 
  1                   50      0.7505588  0.4951522  0.10073200   0.2041972
  1                  100      0.7876520  0.5702598  0.09028830   0.1833782
  1                  150      0.8047647  0.6047684  0.08520638   0.1733785
  2                   50      0.7878235  0.5701625  0.09287982   0.1890531
  2                  100      0.8055882  0.6054776  0.09520934   0.1943406
  2                  150      0.8161495  0.6260536  0.09111650   0.1873189
  3                   50      0.7987598  0.5916571  0.09573996   0.1964155
  3                  100      0.8205098  0.6353613  0.09711836   0.1994434
  3                  150      0.8251912  0.6450311  0.09312786   0.1911133

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. 
 
> trellis.par.set( caretTheme())
> plot ( gbmFit1, metric = "Kappa")
> plot ( gbmFit1, metric =  "Kappa", plotType = "level")
> plot ( gbmFit1, metric =  "Kappa", plotType = "level", scales = list (x=list(rot=90)) )
> plot ( gbmFit1)
> plot ( gbmFit1, metric = "Kappa") # Alternate Performance Metrics
> plot ( gbmFit1, metric =  "Kappa", plotType = "level", scales = list (x=list(rot=90)) )   
 
The trainControl Function:


The function trainControl generates parameters that further control how models are created, with possible values:
  • method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included (the gbm package maintainer has indicated that it would not be a good idea to choose tuning parameter values based on the model OOB error estimates with boosted trees). Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
  • number and repeats: number controls with the number of folds in K-fold cross-validation or number of resampling iterations for bootstrapping and leave-group-out cross-validation. repeats applied only to repeated K-fold cross-validation. Suppose that method = "repeatedcv", number = 10 and repeats = 3,then three separate 10-fold cross-validations are used as the resampling scheme.
  • verboseIter: A logical for printing a training log.
  • returnData: A logical for saving the data into a slot called trainingData.
  • p: For leave-group out cross-validation: the training percentage
  • For method = "timeslice", trainControl has options initialWindow, horizon and fixedWindow that govern how cross-validation can be used for time series data.
  • classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resample.
  • index and indexOut: optional lists with elements for each resampling iteration. Each list element is the sample rows used for training at that iteration or should be held-out. When these values are not specified, train will generate them.
  • summaryFunction: a function to compute alternate performance summaries.
  • selectionFunction: a function to choose the optimal tuning parameters. and examples.
  • PCAthresh, ICAcomp and k: these are all options to pass to the preProcess function (when used).
  • returnResamp: a character string containing one of the following values: "all", "final" or "none". This specifies how much of the resampled performance measures to save.
  • allowParallel: a logical that governs whether train should use parallel processing (if availible) 

Alternate Performance Metrics :

The user can change the metric used to determine the best settings. By default, RMSE and R2 are computed for regression while accuracy and Kappa are computed for classification. Also by default, the parameter values are chosen using RMSE and accuracy, respectively for regression and classification. The metric argument of the train function allows the user to control which the optimality criterion to be used. For example, in problems where there are a low percentage of samples in one class, using metric = "Kappa" can improve quality of the final model


If none of these parameters are not satisfactory, the user can also compute performance metrics. The trainControl function has a argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:

  • data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values. Currently, class probabilities are not passed to the function. The values in data are the held-out predictions(and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns in data will be present that contains the class probabilities. The names of these columns are the same as the class levels. Also if weights were specified in the call to train, a column called weights will also be in the datasets.
  • lev is a character string that has the outcome factor levels taken from the training data. 
  • model is a character string for the model being used.
The output of the function should be a vector of numeric summary metrics with non-null names. By default, train evaluate classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within in the resampling process, the argument classProbs in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample(there is a column per class and the column names are the class names).
As mentioned previously, custom functions can be used to calculate performances scores that are averaged over the resamples.

To rebuild the boosted tree model using this criterion, we can see the relationship b/w the tuning parameters and the area under the ROC using the following code: 


> fitControl <- trainControl(method = "repeatedcv",
+                            number = 10,
+                            repeats = 10,
+                            ## Estimate class probabilities
+                            classProbs = TRUE,
+                            ## Evaluate performance using 
+                            ## the following function
+                            summaryFunction = twoClassSummary)
 
> gbmFit3 <- train( 
+ Class ~ .,
+ data = Sonar,
+ method = "gbm",
+ trControl = fitControl,
+ verbose = FALSE,
+ tuneGrid = gbmGrid,
+ metric = "ROC")
 

Choosing the Final Model :
Another method for customizing the tuning process is to modify the algorithm that is used to select the best parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value( or smallest, for RMSE in regression models). Other schemes for selecting model can be used, Breiman, suggested the "one standard error rule", for simple tree based models. In this case, the model with the best performing value is identified and, using resampling, we can estimate the standard error of performance. The final model used was the simplest model within one standard error of the (empirically)best model.  

train allows the user to specify alternate rules for selecting the final model. The argument selectionFunction can be used to supply a function to algorithmically determine the final model. There are three existing functions in the package: best is chooses the largest/smallest value, oneSE attempts to capture the spirit of Breiman et al (1984) and tolerance selects the least complex model within some percent tolerance of the best value. 

User-defined functions can be used, as long as they have the following arguments:  
  • x is a data frame containing the tune parameters and their associated performance metrics. Each row corresponds to a different tuning parameter combination. 
  • metric a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train.
  • maximize is a single logical value indicating whether larger values of the performance metric are better (this is also directly passed from the call to train). 
The function should output a single integer indicating which row in x is chosen.

The tolerance function could be used to find a less complex model based on (x-xbest)/xbestx 100, which is the percent difference. For example, to select parameter values based on a 2% loss of performance: 

> whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC",
+                          tol = 2, maximize = TRUE)
> whichTwoPct
[1] 2
> gbmFit3$results[whichTwoPct,1:6]
  shrinkage interaction.depth n.minobsinnode n.trees       ROC      Sens
6       0.1                 5             20      50 0.8795213 0.8154167

Extracting Predictions and Class Probabilities:
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
 
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
 
 
Exploring and Comparing Resampling Distributions:

Within-Model :

There are several lattice functions that can be used to explore relationships between tuning parameters and the resampling results for a specific model.
  • xyplot and stripplot can be used to plot resampling statistics against (numeric) tuning parameters.
  • histogram and densityplot can also be used to look at distributions.
Between-Models :

The caret package also includes functions to characterize the differences between models (generated using train, sbf or rfe) via their resampling distributions.

First, a support vector machine model is fit to the Sonar data. The data are centered and scaled using the preProc argument. Note that the same random number seed is set prior to the model that is identical to the seed used for the boosted tree model. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.

> svmFit <- train(Class ~ ., data = training,
+                 method = "svmRadial",
+                 trControl = fitControl,
+                 preProc = c("center", "scale"),
+                 tuneLength = 8,
+                 metric = "ROC")

Also, a regularized discriminant analysis model was fit.  
> set.seed(825)
> rdaFit <- train(Class ~ ., data = training,
+                 method = "rda",
+                 trControl = fitControl,
+                 tuneLength = 4,
+                 metric = "ROC")

Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using resamples.  

resamps <- resamples(list(GBM = gbmFit3,
                          SVM = svmFit,
                          RDA = rdaFit))
resamps
 

summary(resamps)
 
There are several lattice plot methods that can be used to visualize the resampling 
distributions: density plots, box-whisker plots, scatterplot matrices and scatterplots
of summary statistics.
 
trellis.par.set( caretTheme() )
bwplot ( resamps, layout = c(3,1) )

trellis.par.set ( caretTheme() )
dotplot( resamps, metric="ROC" )

trellis.par.set ( caretTheme() )
xyplot ( resamps, what="BlandAltman")

Other visualizations are available in densityolot.resamples & parallel.resamples;

Since models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models.  

diffValues <- diff ( resamps )
summary ( diffValues )

trellis.par.set ( caretTheme() )
bwplot ( diffValues, layout = c( 3,1 ) )
 
Fitting Models without Parameter Tuning:

In cases, where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning. Using the method="none" option in the trainControl.

fitControl <- trainControl ( method ="none", classProbs = TRUE )
set.seed ( 825 )

gbmFit4 <- train (
Class ~ .,
data = Sonar,
trControl = fitControl,
verbose = FALSE,
tuneGrid = data.frame ( interaction.depth = 4, n.trees = 100, shrinkage=0.1, n.minobsinnode=20), metric = "ROC" )

predict ( gbmFit4, newdata = head(Sonar) )

predict ( gbmFit4, newdata = head(Sonar), type="prob" )


Finally this huge article is coming to END.

Thanks for following to till this point.

Regards
Viswanath Gangavaram