Sunday, 10 January 2016

Statistical Learning MOOC by Trevor Hastie and Robert Tibshirani

Trevor Hastie and Robert Tibshirani Offering their course third time in succession. Here goes the course link https://lagunita.stanford.edu/courses/HumanitiesSciences/StatLearning/Winter2016/about


It's time to refresh and strengthen my machine learning / statistical learning fundamentals, along with the course work, I am planning to complete following books/resources:
  1. An Introduction to Statistical Learning By course authors
  2. The elements of Statistical Learning
  3. Applied Predictive Modeling by Max Kuhn ( R Caret package author )
  4. In & Out of caret package
  5. The Art of R programming

Thursday, 7 January 2016

Model Training and Tuning in caret package


Note: This article loosely a copy of this post

Model training and tuning

The caret package has several functions that attempt to streamline the model building and evaluation process.

The train function can be used to
  • evaluate, using re-sampling, the effect of model tuning parameters on performance
  • choose the "optimal" model across these parameters
  • Estimate model performance from a training set

Generic Algorithm for Model Building:-

Define sets of model parameter values to evaluate
for each parameter set:
   for each resampling iteration:
      Hold out specific samples(Validation Set)
      [Optional] Pre-Process the data
      Fit the model on the remainder
      Predict the Hold-Out samples
   end
end
Determine the Optimal parameter set
Fit the model to all the training data using the optimal parameter set

Notes: 
- First we need to choose the modeling technique. 
- In the case of non-parametric models, there is model turning(But we need to do cross-validation to comment about generalization of the model performance).
-  Available re-sampling techniques in caret : k-fold cross validation (once/repeated), leave-one-out cross-validation(this is very costly operation) & bootstrap(simple estimation or 632 rule).
- Be default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.

Let's getting into an example :

> library(mlbench)
> data("Sonar")
> dim(Sonar)
[1] 208  61
> 100 * prop.table ( table ( Sonar$Class) )

       M        R 
53.36538 46.63462 
 
The function createDataPartition can be used to create a stratified random sample of the data into training and test sets.
 


> inTraing <- createDataPartition( Sonar$Class, = 0.75, list=FALSE)
> inTraing <- createDataPartition( Sonar$Class, p = 0.75, list=FALSE)
> training <- Sonar [ inTraing, ]
> testing <- Sonar[ - inTraing, ]
> 100 * prop.table ( table(training$Class))

       M        R 
53.50318 46.49682 
> 100 * prop.table ( table(testing$Class))

       M        R 
52.94118 47.05882   

Cautionary note: In highly class-imbalance data sets, stratified sampling may not be the right thing to do.

Basic Parameter Tuning:

By default, simple bootstrap resampling is used for estimation. The function trainControl can be used to specify the type of resampling:
 
> fitControl <- trainControl(## 10-fold CV
+     method = "repeatedcv",
+     number = 10,
+     ## repeated ten times
+     repeats = 10)
 
> gbmFit1 <- train(Class ~ ., data = training,
+                  method = "gbm",
+                  trControl = fitControl,
+                  ## This last option is actually one
+                  ## for gbm() that passes through
+                  verbose = FALSE
 
> set.seed(825)
> gbmFit1 <- train(Class ~ ., data = training,
+                  method = "gbm",
+                  trControl = fitControl,
+                  ## This last option is actually one
+                  ## for gbm() that passes through
+                  verbose = FALSE)
 
> gbmFit1
Stochastic Gradient Boosting 

157 samples
 60 predictor
  2 classes: 'M', 'R' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times) 
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ... 
Resampling results across tuning parameters:

  interaction.depth  n.trees  Accuracy   Kappa      Accuracy SD  Kappa SD 
  1                   50      0.7505588  0.4951522  0.10073200   0.2041972
  1                  100      0.7876520  0.5702598  0.09028830   0.1833782
  1                  150      0.8047647  0.6047684  0.08520638   0.1733785
  2                   50      0.7878235  0.5701625  0.09287982   0.1890531
  2                  100      0.8055882  0.6054776  0.09520934   0.1943406
  2                  150      0.8161495  0.6260536  0.09111650   0.1873189
  3                   50      0.7987598  0.5916571  0.09573996   0.1964155
  3                  100      0.8205098  0.6353613  0.09711836   0.1994434
  3                  150      0.8251912  0.6450311  0.09312786   0.1911133

Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using  the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10. 
 
> trellis.par.set( caretTheme())
> plot ( gbmFit1, metric = "Kappa")
> plot ( gbmFit1, metric =  "Kappa", plotType = "level")
> plot ( gbmFit1, metric =  "Kappa", plotType = "level", scales = list (x=list(rot=90)) )
> plot ( gbmFit1)
> plot ( gbmFit1, metric = "Kappa") # Alternate Performance Metrics
> plot ( gbmFit1, metric =  "Kappa", plotType = "level", scales = list (x=list(rot=90)) )   
 
The trainControl Function:


The function trainControl generates parameters that further control how models are created, with possible values:
  • method: The resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included (the gbm package maintainer has indicated that it would not be a good idea to choose tuning parameter values based on the model OOB error estimates with boosted trees). Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
  • number and repeats: number controls with the number of folds in K-fold cross-validation or number of resampling iterations for bootstrapping and leave-group-out cross-validation. repeats applied only to repeated K-fold cross-validation. Suppose that method = "repeatedcv", number = 10 and repeats = 3,then three separate 10-fold cross-validations are used as the resampling scheme.
  • verboseIter: A logical for printing a training log.
  • returnData: A logical for saving the data into a slot called trainingData.
  • p: For leave-group out cross-validation: the training percentage
  • For method = "timeslice", trainControl has options initialWindow, horizon and fixedWindow that govern how cross-validation can be used for time series data.
  • classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resample.
  • index and indexOut: optional lists with elements for each resampling iteration. Each list element is the sample rows used for training at that iteration or should be held-out. When these values are not specified, train will generate them.
  • summaryFunction: a function to compute alternate performance summaries.
  • selectionFunction: a function to choose the optimal tuning parameters. and examples.
  • PCAthresh, ICAcomp and k: these are all options to pass to the preProcess function (when used).
  • returnResamp: a character string containing one of the following values: "all", "final" or "none". This specifies how much of the resampled performance measures to save.
  • allowParallel: a logical that governs whether train should use parallel processing (if availible) 

Alternate Performance Metrics :

The user can change the metric used to determine the best settings. By default, RMSE and R2 are computed for regression while accuracy and Kappa are computed for classification. Also by default, the parameter values are chosen using RMSE and accuracy, respectively for regression and classification. The metric argument of the train function allows the user to control which the optimality criterion to be used. For example, in problems where there are a low percentage of samples in one class, using metric = "Kappa" can improve quality of the final model


If none of these parameters are not satisfactory, the user can also compute performance metrics. The trainControl function has a argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:

  • data is a reference for a data frame or matrix with columns called obs and pred for the observed and predicted outcome values. Currently, class probabilities are not passed to the function. The values in data are the held-out predictions(and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns in data will be present that contains the class probabilities. The names of these columns are the same as the class levels. Also if weights were specified in the call to train, a column called weights will also be in the datasets.
  • lev is a character string that has the outcome factor levels taken from the training data. 
  • model is a character string for the model being used.
The output of the function should be a vector of numeric summary metrics with non-null names. By default, train evaluate classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within in the resampling process, the argument classProbs in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample(there is a column per class and the column names are the class names).
As mentioned previously, custom functions can be used to calculate performances scores that are averaged over the resamples.

To rebuild the boosted tree model using this criterion, we can see the relationship b/w the tuning parameters and the area under the ROC using the following code: 


> fitControl <- trainControl(method = "repeatedcv",
+                            number = 10,
+                            repeats = 10,
+                            ## Estimate class probabilities
+                            classProbs = TRUE,
+                            ## Evaluate performance using 
+                            ## the following function
+                            summaryFunction = twoClassSummary)
 
> gbmFit3 <- train( 
+ Class ~ .,
+ data = Sonar,
+ method = "gbm",
+ trControl = fitControl,
+ verbose = FALSE,
+ tuneGrid = gbmGrid,
+ metric = "ROC")
 

Choosing the Final Model :
Another method for customizing the tuning process is to modify the algorithm that is used to select the best parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value( or smallest, for RMSE in regression models). Other schemes for selecting model can be used, Breiman, suggested the "one standard error rule", for simple tree based models. In this case, the model with the best performing value is identified and, using resampling, we can estimate the standard error of performance. The final model used was the simplest model within one standard error of the (empirically)best model.  

train allows the user to specify alternate rules for selecting the final model. The argument selectionFunction can be used to supply a function to algorithmically determine the final model. There are three existing functions in the package: best is chooses the largest/smallest value, oneSE attempts to capture the spirit of Breiman et al (1984) and tolerance selects the least complex model within some percent tolerance of the best value. 

User-defined functions can be used, as long as they have the following arguments:  
  • x is a data frame containing the tune parameters and their associated performance metrics. Each row corresponds to a different tuning parameter combination. 
  • metric a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train.
  • maximize is a single logical value indicating whether larger values of the performance metric are better (this is also directly passed from the call to train). 
The function should output a single integer indicating which row in x is chosen.

The tolerance function could be used to find a less complex model based on (x-xbest)/xbestx 100, which is the percent difference. For example, to select parameter values based on a 2% loss of performance: 

> whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC",
+                          tol = 2, maximize = TRUE)
> whichTwoPct
[1] 2
> gbmFit3$results[whichTwoPct,1:6]
  shrinkage interaction.depth n.minobsinnode n.trees       ROC      Sens
6       0.1                 5             20      50 0.8795213 0.8154167

Extracting Predictions and Class Probabilities:
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
 
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
 
 
Exploring and Comparing Resampling Distributions:

Within-Model :

There are several lattice functions that can be used to explore relationships between tuning parameters and the resampling results for a specific model.
  • xyplot and stripplot can be used to plot resampling statistics against (numeric) tuning parameters.
  • histogram and densityplot can also be used to look at distributions.
Between-Models :

The caret package also includes functions to characterize the differences between models (generated using train, sbf or rfe) via their resampling distributions.

First, a support vector machine model is fit to the Sonar data. The data are centered and scaled using the preProc argument. Note that the same random number seed is set prior to the model that is identical to the seed used for the boosted tree model. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.

> svmFit <- train(Class ~ ., data = training,
+                 method = "svmRadial",
+                 trControl = fitControl,
+                 preProc = c("center", "scale"),
+                 tuneLength = 8,
+                 metric = "ROC")

Also, a regularized discriminant analysis model was fit.  
> set.seed(825)
> rdaFit <- train(Class ~ ., data = training,
+                 method = "rda",
+                 trControl = fitControl,
+                 tuneLength = 4,
+                 metric = "ROC")

Given these models, can we make statistical statements about their performance differences? To do this, we first collect the resampling results using resamples.  

resamps <- resamples(list(GBM = gbmFit3,
                          SVM = svmFit,
                          RDA = rdaFit))
resamps
 

summary(resamps)
 
There are several lattice plot methods that can be used to visualize the resampling 
distributions: density plots, box-whisker plots, scatterplot matrices and scatterplots
of summary statistics.
 
trellis.par.set( caretTheme() )
bwplot ( resamps, layout = c(3,1) )

trellis.par.set ( caretTheme() )
dotplot( resamps, metric="ROC" )

trellis.par.set ( caretTheme() )
xyplot ( resamps, what="BlandAltman")

Other visualizations are available in densityolot.resamples & parallel.resamples;

Since models are fit on the same versions of the training data, it makes sense to make inferences on the differences between models. In this way we reduce the within-resample correlation that may exist. We can compute the differences, then use a simple t-test to evaluate the null hypothesis that there is no difference between models.  

diffValues <- diff ( resamps )
summary ( diffValues )

trellis.par.set ( caretTheme() )
bwplot ( diffValues, layout = c( 3,1 ) )
 
Fitting Models without Parameter Tuning:

In cases, where the model tuning values are known, train can be used to fit the model to the entire training set without any resampling or parameter tuning. Using the method="none" option in the trainControl.

fitControl <- trainControl ( method ="none", classProbs = TRUE )
set.seed ( 825 )

gbmFit4 <- train (
Class ~ .,
data = Sonar,
trControl = fitControl,
verbose = FALSE,
tuneGrid = data.frame ( interaction.depth = 4, n.trees = 100, shrinkage=0.1, n.minobsinnode=20), metric = "ROC" )

predict ( gbmFit4, newdata = head(Sonar) )

predict ( gbmFit4, newdata = head(Sonar), type="prob" )


Finally this huge article is coming to END.

Thanks for following to till this point.

Regards
Viswanath Gangavaram 

Tuesday, 1 December 2015

webinar notes "Apache Spark Release 1.6"

General notes :
What's coming in Apache Spark 1.6 :
  •  Key themes
    • Out of the box performance
    • Previews of key new APIs
  •  Two separate memory managers ( Spark 1.5 )
    • Execution memory : Computation of shuffles,
    • Storage memory
  • Goal : Allow memory regions to shrink / grow dynamically
  • Unified Memory Management in Spark 1.6
    • Can cross between execution and storage memory
      • Borrowing can happen from both sides
  • History of Spark API's
    • RDD API
      • Distributed collections of JVM objects
      • Functional operators (map, filter, etc.)
    • Data Frame API ( 2013 )
      • Distribute collection of Row Objects
      • Expression-based operations and UDFs
      • Logical Plans and optimizer
      • Fast / Efficient internal representations
    • DataSet API ( 2015 ) 
      • Internally Rows, externally JVM objects
      • "Best of both worlds : type safe + fast "
  • Encoder : Converts from JVM object into a DataSet Row
  • High Level APIs -> Data Frame ( & DataSet ) -> Tungsten Execution
  • SQL directly over files
    • select * from text.`fileName` where value != ''
  • Advanced JSON parsing
  • Better instrumentation for SQL operators
    • Tracking memory usage ( How much used on each machine )
  • Display the failed output op in streaming
  • Persist ML pipelines to
  • R-like statistics for GLMs
    • Provide R-like summary statistics
  • New algos added to MLlib
    • Bisecting K-Means
    • Online Hypothesis testing : A/B testing in Spark Streaming
    • Survival analysi
    • etc ..

Monday, 2 November 2015

Understanding fold() and aggregate() functions in Apache Spark

Note: I was trying to understand fold() and aggregate() function in Apache Spark, it took for a while to understand what are this two functions are really doing and how are they executing stuff. Manually traced, how this two functions are getting evaluated with couple of scenarios, thought of sharing this hard earned experience, with fellow folks.
 

An extract from "Learning Spark"

"Similar to reduce() is fold(), which also takes a function with the same signature as needed for reduce(), but in addition takes a "Zero Value" to be used for the initial call on each partition. The Zero value you should be the identity element for your operation; that is applying it multiple times your function should not change the value ( e.g. 0 for +, 1 for *, or an empty list for concatenation )"

Let's understand this with some examples

sc.parallelize ( [1,2,3,4] )
print nums.fold ( 0 , lambda a,b : a+b )# Which evaluates to 10 

First parameter to fold function is Zero Value, what gets used in the function that follows. Here the Zero Value  is going to get used in function addition(+), so Zero Value is 0.

So Why it evaluates to 10 ?

nums.getNumPartitions() # Which evaluates to 2

As we can see, from above our RDD distributed in two partitions, say [1,2] in partition1 and [3,4] are in partition2. Addition(+) function is going to get evaluate with each partition data independently with initial value of 0. Now accumulator1 has a value of 3 ( 0 + 1 + 2 ) and accumulator2 has a value of 7 ( 0 + 3 + 4). Now this two accumulators are going to evaluates with addition with initial value of 0, that's get evaluated to 10 ( 0 + 3 + 7 ).

Now let's try with some other non "Zero Value"

print nums.fold ( 5 , lambda a,b : a+b ) # Which evaluates to 25
 

So Why it evaluates to 25 ?

As we already now that our RDD is distributed in two partitions, [1,2] in partition1 and [3,4] are in partition2. Addition(+) function is going to get evaluate with each partition data independently with initial value of 5. Now accumulator1 has a value of 8 ( 5 + 1 + 2) and accumulator2 has a value of 12 ( 5 + 3 + 4). Now this two accumulators are going to evaluates with addition(+) with initial value of 5, that's get evaluated to 25 ( 5 + 8 + 12 ).

Let's try with multiplication ( * )

print nums.fold ( 1 , lambda a,b : a*b ) # which evaluates to 24

So Why it evaluates to 24 ?

multiplication(*) function is going to get evaluate with each partition data independently with initial value of 1. Now accumulator1 has a value of 2 ( 1 * 1 * 2) and accumulator2 has a value of 12 ( 1 * 3 * 4). Now this two accumulators are going to evaluates with multiplication(*) with an initial value of 1, that's get evaluated to 24 ( 1 * 2 * 12 ).

Question : So what is this evalutes to  nums.fold (5 , lambda a,b : a * b )


Caution: As mentioned previously, "The Zero value should be the identity element for your operation; that is applying it multiple times your function should not change the value ( e.g. 0 for +, 1 for *, or an empty list for concatenation )", just for understanding, used non identity values in above scenarios. 

This article is getting bigger than expected, let's get to know about aggregate() function in following article.


Thanks & Regards
Viswanath G.
Senior Data Scientist
    





 

 

Sunday, 1 November 2015

Four Short Links

  1. Beginners guide: Apache Spark Machine Learning Scenario with A Large Input DataSet
  2. How to find Simple and Interesting Multi-Gigabytes Dataset
  3. Supervised Term Weighting Schemes for Text Classification
  4. Apache Zeppelin Overview
    • Zeppelin is focusing on providing analytical environment on top of Hadoop eco-system.  Zeppelin is a analytical tool that supports multi language backand. Directly display data with many different charts. Parameterized query support.  Screen share through WebSocket





Saturday, 31 October 2015

Standalone Apache Spark application

The best way to learn Apache Spark, is through various interactive shells that it supports. As of now,  Apache Spark supports following interactive shells Python/IPython, Scala and R. 

But the best way to deploy Spark programs in production systems is through Standalone applications. The main difference from using  it in the shell is that we need to initialize your own SparkContext(where as interactive shells gives one to us).

Note: SparkContext represents  a connection to a computing cluster.

Python standalone application

In Python, Spark standalone applications are simple Python scripts, but need to run this scripts through bin/spark-submit script which included in Spark installation. The spark-submit includes the Spark dependencies for us in Python. This script sets up the environment for Spark's Python API to function.

"HelloWorld.py" :


from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext ( conf = conf )
lines = sc.textFile ("/home/viswanath/spark/README.md")
lines.count()

To run the script
spark-submit HelloWorld.py




Finally, to shut down Spark, we can either call the stop() method on SparkContext, or simply exit the application ( e.g. with System.exit(0) or sys.exit()).