Note: This article is loosely a copy of this post.
Model training and tuning
The caret package has several functions that attempt to streamline the model building and evaluation process.
The train function can be used to
- evaluate, using re-sampling, the effect of model tuning parameters on performance
- choose the "optimal" model across these parameters
- estimate model performance from a training set
Generic algorithm for model building:
Define sets of model parameter values to evaluate
for each parameter set:
    for each resampling iteration:
        Hold out specific samples (the validation set)
        [Optional] Pre-process the data
        Fit the model on the remainder
        Predict the held-out samples
    end
end
Determine the optimal parameter set
Fit the model to all the training data using the optimal parameter set
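A minimal hand-rolled sketch of this loop, written here with a k-nearest-neighbour classifier from the class package on the Sonar data introduced below (the five folds and the grid of k values are arbitrary assumptions, not from the original post), might look like this; caret's train automates exactly this bookkeeping:

library(class)     # for knn()
library(mlbench)
data(Sonar)

x <- Sonar[, -61]   # 60 numeric predictors
y <- Sonar$Class    # outcome factor

set.seed(1)
folds <- sample(rep(1:5, length.out = nrow(Sonar)))   # random 5-fold assignment
kGrid <- c(1, 3, 5, 7)                                # parameter sets to evaluate

cvAcc <- sapply(kGrid, function(k) {
  mean(sapply(1:5, function(f) {
    holdOut <- folds == f                             # hold out these samples
    pred <- knn(train = x[!holdOut, ], test = x[holdOut, ],
                cl = y[!holdOut], k = k)              # fit on the remainder, predict the hold-out
    mean(pred == y[holdOut])                          # hold-out accuracy
  }))                                                 # resampled accuracy for this k
})

bestK <- kGrid[which.max(cvAcc)]                      # determine the optimal parameter set
# For kNN the final "fit" is simply keeping all of x and y together with bestK.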
Notes:
- First we need to choose the modeling technique.
- In the case of non-parametric models there is model tuning, but we still need cross-validation to comment on how well the model's performance generalizes.
- Re-sampling techniques available in caret: k-fold cross-validation (once or repeated), leave-one-out cross-validation (a very costly operation) and the bootstrap (simple estimation or the 632 rule); see the sketch after these notes.
- By default, the function automatically chooses the tuning parameters associated with the best value, although different algorithms can be used.
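As an illustrative sketch (not code from the original post), each of these resampling schemes is requested through the method argument of trainControl, which is covered in more detail later:

library(caret)

ctrlRepCV <- trainControl(method = "repeatedcv", number = 10, repeats = 3)  # repeated k-fold CV
ctrlLOO   <- trainControl(method = "LOOCV")                                 # leave-one-out CV (costly)
ctrlBoot  <- trainControl(method = "boot",    number = 25)                  # simple bootstrap estimate
ctrlB632  <- trainControl(method = "boot632", number = 25)                  # bootstrap with the 632 rule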
Let's get into an example:
> library(caret)
> library(mlbench)
> data("Sonar")
> dim(Sonar)
[1] 208 61
> 100 * prop.table ( table ( Sonar$Class) )
M R
53.36538 46.63462
The function createDataPartition can be used to create a stratified random sample of the data into training and test sets.
> inTraining <- createDataPartition( Sonar$Class, p = 0.75, list = FALSE)
> training <- Sonar[ inTraining, ]
> testing <- Sonar[ -inTraining, ]
> 100 * prop.table ( table(training$Class))
M R
53.50318 46.49682
> 100 * prop.table ( table(testing$Class))
M R
52.94118 47.05882
Cautionary note: in highly class-imbalanced data sets, stratified sampling may not be the right thing to do.
Basic Parameter Tuning:
By default, simple bootstrap resampling is used for estimation. The function trainControl can be used to specify the type of resampling:
> fitControl <- trainControl(## 10-fold CV
+ method = "repeatedcv",
+ number = 10,
+ ## repeated ten times
+ repeats = 10)
> set.seed(825)
> gbmFit1 <- train(Class ~ ., data = training,
+ method = "gbm",
+ trControl = fitControl,
+ ## This last option is actually one
+ ## for gbm() that passes through
+ verbose = FALSE)
> gbmFit1
Stochastic Gradient Boosting
157 samples
60 predictor
2 classes: 'M', 'R'
No pre-processing
Resampling: Cross-Validated (10 fold, repeated 10 times)
Summary of sample sizes: 142, 142, 140, 142, 142, 141, ...
Resampling results across tuning parameters:
interaction.depth n.trees Accuracy Kappa Accuracy SD Kappa SD
1 50 0.7505588 0.4951522 0.10073200 0.2041972
1 100 0.7876520 0.5702598 0.09028830 0.1833782
1 150 0.8047647 0.6047684 0.08520638 0.1733785
2 50 0.7878235 0.5701625 0.09287982 0.1890531
2 100 0.8055882 0.6054776 0.09520934 0.1943406
2 150 0.8161495 0.6260536 0.09111650 0.1873189
3 50 0.7987598 0.5916571 0.09573996 0.1964155
3 100 0.8205098 0.6353613 0.09711836 0.1994434
3 150 0.8251912 0.6450311 0.09312786 0.1911133
Tuning parameter 'shrinkage' was held constant at a value of 0.1
Tuning parameter 'n.minobsinnode' was held constant at a value of 10
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were n.trees = 150, interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
> trellis.par.set( caretTheme() )
> plot ( gbmFit1 )
> plot ( gbmFit1, metric = "Kappa") # alternate performance metric
> plot ( gbmFit1, metric = "Kappa", plotType = "level", scales = list (x = list(rot = 90)) )
The trainControl Function:
The function trainControl generates parameters that further control how models are created (a combined example follows this list), with possible values:
- method: the resampling method: "boot", "cv", "LOOCV", "LGOCV", "repeatedcv", "timeslice", "none" and "oob". The last value, out-of-bag estimates, can only be used by random forest, bagged trees, bagged earth, bagged flexible discriminant analysis, or conditional tree forest models. GBM models are not included (the gbm package maintainer has indicated that it would not be a good idea to choose tuning parameter values based on the model OOB error estimates with boosted trees). Also, for leave-one-out cross-validation, no uncertainty estimates are given for the resampled performance measures.
- number and repeats: number controls the number of folds in K-fold cross-validation or the number of resampling iterations for bootstrapping and leave-group-out cross-validation. repeats applies only to repeated K-fold cross-validation: if method = "repeatedcv", number = 10 and repeats = 3, then three separate 10-fold cross-validations are used as the resampling scheme.
- verboseIter: A logical for printing a training log.
- returnData: a logical for saving the data into a slot called trainingData.
- p: for leave-group-out cross-validation: the training percentage.
- For method = "timeslice", trainControl has options initialWindow, horizon and fixedWindow that govern how cross-validation can be used for time series data.
- classProbs: a logical value determining whether class probabilities should be computed for held-out samples during resampling.
- index and indexOut: optional lists with elements for each resampling iteration. Each list element gives the sample rows used for training at that iteration (index) or the rows to be held out (indexOut). When these values are not specified, train will generate them.
- summaryFunction: a function to compute alternate
performance summaries.
- selectionFunction: a function to choose the optimal
tuning parameters.
- PCAthresh, ICAcomp and k: these are
all options to pass to the preProcess function (when used).
- returnResamp: a character string containing one of
the following values: "all", "final" or
"none". This specifies how much of the resampled
performance measures to save.
- allowParallel: a logical that governs whether train should use parallel processing (if available).
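As a hedged illustration (this exact combination is not part of the original post), several of the options above can be combined in a single trainControl call:

library(caret)

ctrl <- trainControl(
  method            = "repeatedcv",     # resampling scheme
  number            = 10,               # 10 folds
  repeats           = 3,                # repeated 3 times
  verboseIter       = TRUE,             # print a training log
  returnData        = FALSE,            # do not keep a copy of the data
  classProbs        = TRUE,             # compute class probabilities on hold-outs
  summaryFunction   = twoClassSummary,  # alternate performance summaries
  selectionFunction = "best",           # how to pick the optimal tuning parameters
  returnResamp      = "final",          # keep resampled measures for the final model only
  allowParallel     = TRUE              # use a parallel back-end if one is registered
)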
Alternate Performance Metrics :
The user can change the metric used to determine the best settings. By
default, RMSE and R2 are computed for regression while accuracy and
Kappa are computed for classification. Also by default, the parameter
values are chosen using RMSE and accuracy, respectively for
regression and classification. The metric argument of the
train function allows the user to control which optimality criterion is used. For example, in problems where there is a low percentage of samples in one class, using metric = "Kappa" can improve the quality of the final model.
If none of these metrics are satisfactory, the user can also compute custom performance metrics. The trainControl function has an argument called summaryFunction that specifies a function for computing performance. The function should have these arguments:
- data is a reference to a data frame or matrix with columns called obs and pred for the observed and predicted outcome values. Currently, class probabilities are not passed to the function. The values in data are the held-out predictions (and their associated reference values) for a single combination of tuning parameters. If the classProbs argument of the trainControl object is set to TRUE, additional columns will be present in data that contain the class probabilities. The names of these columns are the same as the class levels. Also, if weights were specified in the call to train, a column called weights will also be in the data set.
- lev is a character string that has the outcome factor levels taken from the training data.
- model is a character string for the model being used.
The output of the function should be a vector of numeric summary metrics with non-null names. By default, train evaluates classification models in terms of the predicted classes. Optionally, class probabilities can also be used to measure performance. To obtain predicted class probabilities within the resampling process, the argument classProbs in trainControl must be set to TRUE. This merges columns of probabilities into the predictions generated from each resample (there is a column per class and the column names are the class names).
As mentioned previously, custom functions can be used to calculate performance scores that are averaged over the resamples.
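For instance, a minimal custom summary function (the name errorRateSummary is illustrative, not part of caret) must accept the data, lev and model arguments described above and return a named numeric vector:

library(caret)

# Illustrative custom summary function: reports accuracy and error rate
# computed from the held-out predictions in `data`.
errorRateSummary <- function(data, lev = NULL, model = NULL) {
  acc <- mean(data$pred == data$obs)
  c(Accuracy = acc, ErrorRate = 1 - acc)   # named numeric vector, as train expects
}

# Plugged into the resampling specification:
ctrl <- trainControl(method = "cv", number = 10,
                     summaryFunction = errorRateSummary)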
Rebuilding the boosted tree model using this criterion, we can see the relationship between the tuning parameters and the area under the ROC curve using the following code:
> fitControl <- trainControl(method = "repeatedcv",
+ number = 10,
+ repeats = 10,
+ ## Estimate class probabilities
+ classProbs = TRUE,
+ ## Evaluate performance using
+ ## the following function
+ summaryFunction = twoClassSummary)
> set.seed(825)
> gbmFit3 <- train(
+ Class ~ .,
+ data = training,
+ method = "gbm",
+ trControl = fitControl,
+ verbose = FALSE,
+ tuneGrid = gbmGrid,
+ metric = "ROC")
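One caveat: gbmGrid, passed to tuneGrid in the call above, is never defined in this excerpt. A plausible definition, consistent with the tuning parameter values reported further down (shrinkage = 0.1, n.minobsinnode = 20), would need to be run before the train call, for example:

gbmGrid <- expand.grid(interaction.depth = c(1, 5, 9),
                       n.trees = (1:30) * 50,
                       shrinkage = 0.1,
                       n.minobsinnode = 20)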
Choosing the Final Model :
Another method for customizing the tuning process is to modify the algorithm used to select the best parameter values, given the performance numbers. By default, the train function chooses the model with the largest performance value (or smallest, for RMSE in regression models). Other schemes for selecting a model can be used. Breiman et al. (1984) suggested the "one standard error rule" for simple tree-based models: the model with the best performance value is identified and, using resampling, we can estimate the standard error of its performance. The final model is then the simplest model within one standard error of the (empirically) best model.
train allows the user to specify alternate rules for
selecting the final model. The argument selectionFunction
can be used to supply a function to algorithmically determine the
final model. There are three existing functions in the package:
best simply chooses the largest/smallest value, oneSE
attempts to capture the spirit of Breiman et al (1984) and
tolerance selects the least complex model within some percent
tolerance of the best value.
User-defined functions can be used, as long as they have the
following arguments:
- x is a data frame containing the tuning parameters and
their associated performance metrics. Each row corresponds to a
different tuning parameter combination.
- metric is a character string indicating which performance metric should be optimized (this is passed in directly from the metric argument of train).
- maximize is a single logical value indicating
whether larger values of the performance metric are better
(this is also directly passed from the call to
train).
The function should output a single integer indicating which row in x is chosen.
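As a sketch under the assumption of a gbm-style tuning grid (the function name pickSmallForest and the 1% threshold are illustrative, not part of caret), a user-defined selection function could prefer the smallest number of boosting iterations whose performance is within 1% of the best:

library(caret)

# Illustrative selection function: assumes the tuning grid in x has an n.trees column (gbm).
pickSmallForest <- function(x, metric, maximize) {
  best <- if (maximize) max(x[, metric]) else min(x[, metric])
  tol  <- 0.01 * abs(best)                        # within 1% of the best value
  ok   <- if (maximize) x[, metric] >= best - tol else x[, metric] <= best + tol
  candidates <- which(ok)
  candidates[which.min(x$n.trees[candidates])]    # return a single row index of x
}

ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                     selectionFunction = pickSmallForest)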
The tolerance function could be used to find a less complex model based on (x - x_best)/x_best * 100, which is the percent difference. For example, to select parameter values based on a 2% loss of performance:
> whichTwoPct <- tolerance(gbmFit3$results, metric = "ROC",
+ tol = 2, maximize = TRUE)
> whichTwoPct
[1] 2
> gbmFit3$results[whichTwoPct,1:6]
shrinkage interaction.depth n.minobsinnode n.trees ROC Sens
6 0.1 5 20 50 0.8795213 0.8154167
Extracting Predictions and Class Probabilities:
> predict( gbmFit3, head(Sonar))
[1] R R R R R R
Levels: M R
> predict( gbmFit3, head(Sonar), type = "prob")
Exploring and Comparing Resampling Distributions:
Within-Model :
There are several lattice functions that can be used to explore relationships between tuning parameters and the resampling results for a specific model.
- xyplot and stripplot can be used to plot resampling statistics
against (numeric) tuning parameters.
- histogram and densityplot can also be used
to look at distributions.
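For example, applied to the gbmFit1 object fit earlier (a hedged one-liner; plotting resampling statistics against all of the tuning parameters additionally requires returnResamp = "all" in trainControl):

trellis.par.set( caretTheme() )
densityplot( gbmFit1, pch = "|" )   # distribution of the resampled accuracy for the final model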
Between-Models :
The caret package also includes functions to characterize the differences
between models (generated using train, sbf or
rfe) via their resampling distributions.
First, a support vector machine model is fit to the Sonar data. The data are centered and scaled using the preProc argument. Note that the same random number seed is set prior to fitting, identical to the seed used for the boosted tree model. This ensures that the same resampling sets are used, which will come in handy when we compare the resampling profiles between models.
> set.seed(825)
> svmFit <- train(Class ~ ., data = training,
+ method = "svmRadial",
+ trControl = fitControl,
+ preProc = c("center", "scale"),
+ tuneLength = 8,
+ metric = "ROC")
Also, a regularized discriminant analysis model was fit.
> set.seed(825)
> rdaFit <- train(Class ~ ., data = training,
+ method = "rda",
+ trControl = fitControl,
+ tuneLength = 4,
+ metric = "ROC")
Given these models, can we make statistical statements about their
performance differences? To do this, we first collect the resampling
results using resamples.
resamps <- resamples(list(GBM = gbmFit3,
SVM = svmFit,
RDA = rdaFit))
resamps
summary(resamps)
There are several lattice plot methods that can be used to visualize the resampling
distributions: density plots, box-whisker plots, scatterplot matrices and scatterplots
of summary statistics.
trellis.par.set( caretTheme() )
bwplot ( resamps, layout = c(3,1) )
trellis.par.set ( caretTheme() )
dotplot( resamps, metric="ROC" )
trellis.par.set ( caretTheme() )
xyplot ( resamps, what="BlandAltman")
Other visualizations are available via densityplot.resamples and parallel.resamples.
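For example (again a hedged sketch rather than code from the original post):

trellis.par.set( caretTheme() )
densityplot( resamps, metric = "ROC", auto.key = list(columns = 3) )  # per-model ROC distributions
splom( resamps, metric = "ROC" )                                      # scatterplot matrix of resampled ROC values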
Since models are fit on the same versions of the training data, it
makes sense to make inferences on the differences between models. In
this way we reduce the within-resample correlation that may exist. We
can compute the differences, then use a simple t-test to evaluate
the null hypothesis that there is no difference between models.
diffValues <- diff ( resamps )
summary ( diffValues )
trellis.par.set ( caretTheme() )
bwplot ( diffValues, layout = c( 3,1 ) )
Fitting Models without Parameter Tuning:
In cases where the tuning parameter values are already known, train can be used to fit the model to the entire training set without any resampling or parameter tuning, using the method = "none" option in trainControl.
fitControl <- trainControl ( method ="none", classProbs = TRUE )
set.seed ( 825 )
gbmFit4 <- train (
  Class ~ .,
  data = training,
  method = "gbm",
  trControl = fitControl,
  verbose = FALSE,
  tuneGrid = data.frame ( interaction.depth = 4, n.trees = 100,
                          shrinkage = 0.1, n.minobsinnode = 20),
  metric = "ROC" )
predict ( gbmFit4, newdata = head(Sonar) )
predict ( gbmFit4, newdata = head(Sonar), type="prob" )
Finally, this long article is coming to an end. Thanks for following along to this point.
Regards
Viswanath Gangavaram