Thursday 18 February 2016

The next three months: March, April & May 2016

Putting a plan in place to learn Scala & Apache Spark the right way:

Note: From what I can see, it seems I need to pick up Scala & Spark in parallel rather than waiting until I have learned Scala before moving on to Spark.

Online Courses:
 
 Reading Wishlist:
  • Scala for the Impatient (complete by Feb 28th, 2016)
  • Programming in Scala (March 31st)
  • Scala in Depth (April 30th)
  • Scala in Action (May 31st)
  • Programming Scala (June 30th)
  • Scala Puzzlers (July 15th)
  • http://twitter.github.io/scala_school/ (July 31st)
  • Scala Cookbook (July 31st)

  • Learning Spark (Feb 29th)
  • Machine Learning with Spark (March 15th)
  • Advanced Analytics with Spark (March 31st)
  • Mastering Spark (April 15th)
  • Spark Documentation (April 30th)
Online Sources:
Advanced Sources (after mastering the above resources):

Wednesday 17 February 2016

Notes from "6 POINTS TO COMPARE PYTHON AND SCALA FOR DATA SCIENCE USING APACHE SPARK" Post

Original post available here

Introduction:

Apache Spark is a distributed computation framework that simplifies and speeds up the data crunching and analytics workflow for data scientists and engineers working over large datasets.

It offers a unified interface for prototyping as well as for building production-quality applications, which makes it particularly suitable for an agile approach.

The author of the original post restricts the comparison to building data products that leverage Apache Spark in an agile workflow.

From the author's perspective, there are six important aspects that a data science programming language in this context should provide:
  1. Productivity
  2. Safe Refactoring
  3. Spark Integration
  4. Out of the box machine learning / statistics packages
  5. Documentation / Community
  6. Interactive Exploratory Data Analysis & Built in Visualization tools
Productivity :
  • Especially in the initial MVP phase we want to achieve high productivity with the fewest possible lines of code, ideally guided by a smart IDE.
  • Python is very simple to learn and a highly productive language for getting things done quickly, from day one.
  • Scala requires a little more thinking and abstraction because of its high-level functional features, but once you are familiar with those, your productivity boosts dramatically.
  • Code conciseness is quite comparable; both languages can be very concise depending on how good you are at coding.
  • Python reads more explicitly: it shows you step by step what your code executes and the state of each variable.
  • Scala, on the other hand, focuses more on describing what you are trying to achieve as the final result, hiding most of the implementation details and the execution order.
  • But remember with great power comes great responsibility. 
  • Whilst pattern matching is a very cool way to extract variables (see the sketch after this section's conclusion), advanced features like implicits or custom DSLs can be confusing to the non-expert user.
  • Nevertheless, Scala can take advantage of its type system and compile-time cross-references to provide some extra functionality more naturally and without ambiguity, unlike scripting languages.
  • Just to name a few:
    • find classes/methods by name in the project and linked dependencies,
    • find usages,
    • auto-completion based on type compatibility,
    • development-time errors or warnings.
  • On the other hand, all of those compile-time features come at a cost: IntelliJ, sbt and all of the related tools are very slow and memory/CPU hungry.
  • You shouldn't be surprised if 2 GB of your RAM is allocated just to keep multiple Scala projects open in parallel; Python is much more lightweight in this respect.
Conclusion: Both score very well here. The author's recommendation: if you are developing simple, intuitive logic, Python does the job well; if you want to do something more complex, it may be worth investing in learning and writing functional code in Scala.
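
To make the pattern-matching point concrete, here is a minimal Scala sketch (the Event hierarchy and all names are hypothetical) showing how match expressions extract variables concisely while the compiler checks the match for exhaustiveness:

    // Hypothetical event hierarchy; `sealed` lets the compiler warn
    // when a match does not cover every case.
    sealed trait Event
    case class Click(userId: String, url: String) extends Event
    case class Purchase(userId: String, amount: Double) extends Event

    def describe(event: Event): String = event match {
      case Click(user, url)                 => s"$user clicked $url"
      case Purchase(user, amt) if amt > 100 => s"$user made a large purchase: $amt"
      case Purchase(user, amt)              => s"$user purchased for $amt"
    }
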
Safe Refactoring:
  • This requirement mainly comes from the agile methodology: we want to change our code safely as we perform data explorations and adjust the requirements at each iteration.
  • Very commonly you first write some code with associated tests, and at the next iteration the tests, implementations and APIs break.
  • Every time we perform a refactoring we face the risk of introducing bugs and silently breaking the previous logic. 
  • Both languages require tests (unit tests, integration tests, property-based tests, etc.) in order to be refactored safely; Scala's compiler additionally catches a whole class of breakages (renamed fields, changed signatures, mismatched types) before any test even runs.
  • Conclusion: Scala scores very well, Python average.
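
A minimal sketch of why the compiler helps here (all names are hypothetical): renaming a case-class field is flagged at every call site before the code runs.

    // Hypothetical model: the `email` field was just renamed to `emailAddress`.
    case class User(id: Long, emailAddress: String)

    // Any call site still using the old name, e.g. `u.email`, now fails
    // to compile, so the refactoring cannot break callers silently.
    def notify(u: User): String = s"Mailing ${u.emailAddress}"

In Python the same rename would only surface at runtime or through a failing test.
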
Spark Integration:
  • Conclusion: Scala is better when it comes to engineering; the two languages are equivalent in terms of Spark integration and functionality.
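
For context, Spark itself is written in Scala, so the Scala API is the native one. A minimal word-count sketch against the classic RDD API (the input path is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}

    object WordCount {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("WordCount").setMaster("local[*]"))

        sc.textFile("hdfs:///data/input.txt")   // placeholder path
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .take(10)
          .foreach(println)

        sc.stop()
      }
    }
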
Out-of-the-box machine learning/statistics packages
  • When you marry a language, you marry the whole family. 
  • And Python has much more to bring to the table when it comes to out-of-the-box packages implementing most of the standard procedures and models you generally find in the literature and/or broadly adopted in the industry.
  • Scala is still way behind here, yet it can benefit from Java library compatibility and from the community developing distributed versions of some of the popular machine learning algorithms directly on top of Spark (see MLlib, H2O Sparkling Water, DeepLearning4j, ...).
  • A little note regarding MLlib: from my experience its implementation is a bit hacky and often hard to modify or extend, due to a mediocre design and nonsensical restrictions from private fields and classes.
  • Regarding Java compatibility, honestly I don't see any Java framework that comes anywhere close to what Python provides today with its amazing scikit-learn and related libraries.
  • On the other hand, many of those Python implementations only work locally (unless you use some bootstrapping/bagging plus model-ensembling technique).
  • Scala, by contrast, provides only a few implementations, but they are already scalable and production-ready.
  • Nevertheless, do not forget that many big data problems can be reduced to small data problems, especially after accurate feature selection, filtering and aggregation.
  • It might make sense in some scenarios to crunch your large dataset into a vector space that fits perfectly in memory and then take advantage of the richness and advanced algorithms available in Python.

  • Conclusion: It really depends on the size of your data.
    • Prefer Python whenever the data can fit in memory;
    • but also keep in mind the requirements of your project:
      • Is it just a prototype, or
      • is it something you want to deploy/maintain in a production system?
      • Python offers a complete selection of already-implemented packages that can satisfy almost any need.
      • Scala will only provide the basics, but in the case of "productionisation" it is the better engineering choice.
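
To illustrate what "the basics" look like on the Scala side, here is a minimal sketch of MLlib's DataFrame-based pipeline API; it assumes an existing DataFrame df, and the column names ("age", "income", "label") are hypothetical:

    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.VectorAssembler

    // Assumes an existing DataFrame `df`; column names are hypothetical.
    val assembler = new VectorAssembler()
      .setInputCols(Array("age", "income"))
      .setOutputCol("features")

    val lr = new LogisticRegression()
      .setMaxIter(10)
      .setRegParam(0.01)

    // Scalable out of the box: fitting runs distributed over the cluster.
    val model = lr.fit(assembler.transform(df))
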
Documentation / Community:
  • Conclusion: Both have good and comparable communities in terms of software development. When it comes to the data science community and cool data science projects, Python is hard to beat.
Interactive Exploratory Analysis and built-in visualization tools:
  • Conclusion: Python wins; Scala is not mature enough yet, even though the SparkNotebook does a good job. We haven't yet considered the recent Apache Zeppelin, which provides some fancy visualization features, supports the concept of a language-agnostic notebook where each cell can hold any type of code (Scala, Python, SQL, ...), and is specifically designed to integrate well with Spark.
 

Final Verdict:
 
  • Shall I use Scala or Python? The answer is: Yes!
  • Give both of them a try and test for yourself what works better for your specific use case. As a rule of thumb: Python is more analytics-oriented while Scala is more engineering-oriented, but both are great languages for building data science applications. The ideal scenario is a data science team confident with both and able to swap when needed.
  • What most data scientists care about at the end of the day is delivering, using whatever means does the job.
  • If you do have to decide, my view is: if your scope is research, a scripting language is complete enough for experimentation and prototyping; if your goal is to build a product, you want something more robust that allows experimentation and at the same time delivers a product.
  • Since the best solution is never black or white, I encourage trying hybrid approaches that can adapt to each project's specification.
  • A typical scenario could be developing the whole ETL, data cleansing and feature extraction in Scala, then distributing the data over multiple partitions and learning with algorithms written in Python, and finally collecting the results and presenting them in a Jupyter notebook (a sketch of the Scala half follows this list).
  • My motto is "the best tool for each task". Whatever balance you choose, avoid splitting into two teams: Data Science Engineers (the Big Data/Scala guys) and Data Science Analysts (the Python and SQL folks). Aim to build a cross-functional team with the full skillset to operate on the end-to-end development of your product, from the raw data to the manual analysis, and from the modelling to a scalable deployment.
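
As a concrete illustration of the hybrid scenario above, the Scala half might cleanse the raw data and persist partitioned features for the Python side to train on; this is a minimal sketch, and every path and column name in it is made up:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object FeatureEtl {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("FeatureEtl"))
        val sqlContext = new SQLContext(sc)

        val raw = sqlContext.read.json("hdfs:///raw/events")  // placeholder input

        raw
          .filter(raw("userId").isNotNull)               // basic cleansing
          .select("userId", "age", "income", "label")    // feature extraction
          .repartition(8)                                // spread across partitions
          .write
          .parquet("hdfs:///features/events")            // picked up by Python/Jupyter

        sc.stop()
      }
    }

The Python workers can then each load a subset of the Parquet partitions, train locally, and the resulting models can be ensembled and inspected in a notebook.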