Learning from Data & Big Data Technologies: December 2015

Tuesday 1 December 2015

Webinar notes : "Apache Spark Release 1.6"

General notes :

http://go.databricks.com/hubfs/notebooks/Spark_1.6_Improvements.html
Huge open-source ecosystem

Data Sources, Applications, Environments

Large number of distributors
Three-month release cycle between versions

What's coming in Apache Spark 1.6 :

Key themes

Out of the box performance
Previews of key new APIs

Two separate memory managers ( Spark 1.5 )

Execution memory : computation such as shuffles
Storage memory : cached data

Goal : Allow memory regions to shrink / grow dynamically
Unified Memory Management in Spark 1.6

The boundary between execution and storage memory can be crossed

Borrowing can happen from both sides
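
A toy model of the borrowing idea, in plain Python rather than Spark code. The class and parameter names (UnifiedMemoryPool, storage_protected) are made up for illustration. The asymmetric rule it uses (execution may evict cached data above a protected threshold, while storage only takes memory that is free) is an assumption based on my reading of the Spark 1.6 tuning documentation, not something stated in the webinar.

```python
class UnifiedMemoryPool:
    """Toy model of one memory region shared by execution and storage."""

    def __init__(self, total, storage_protected):
        self.total = total                          # size of the shared region
        self.storage_protected = storage_protected  # cached data below this is never evicted
        self.execution_used = 0
        self.storage_used = 0

    def free(self):
        return self.total - self.execution_used - self.storage_used

    def acquire_storage(self, amount):
        """Storage may take any free memory, but never evicts execution."""
        granted = min(amount, self.free())
        self.storage_used += granted
        return granted

    def acquire_execution(self, amount):
        """Execution takes free memory first, then evicts unprotected storage."""
        shortfall = amount - self.free()
        if shortfall > 0:
            evictable = max(0, self.storage_used - self.storage_protected)
            self.storage_used -= min(shortfall, evictable)
        granted = min(amount, self.free())
        self.execution_used += granted
        return granted


pool = UnifiedMemoryPool(total=100, storage_protected=30)
cached = pool.acquire_storage(80)      # storage grows past its protected share
shuffled = pool.acquire_execution(60)  # execution reclaims part of that space
print(cached, shuffled, pool.storage_used, pool.execution_used)
```

In this run storage first grows to 80 units, then shrinks to 40 when execution asks for 60, so neither side needs a fixed size chosen up front.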

History of Spark APIs

RDD API

Distributed collections of JVM objects
Functional operators (map, filter, etc.)

DataFrame API ( 2013 )

Distributed collection of Row objects
Expression-based operations and UDFs
Logical Plans and optimizer
Fast / Efficient internal representations

Dataset API ( 2015 )

Internally Rows, externally JVM objects
"Best of both worlds : type safe + fast"

Encoder : converts a JVM object into a Dataset row
High-level APIs -> DataFrame ( & Dataset ) -> Tungsten execution
SQL directly over files

select * from text.`fileName` where value != ''
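
A minimal sketch of issuing the query above from PySpark. The helper names (text_file_query, non_empty_lines) and the path /tmp/notes.txt are hypothetical, and sql_context is assumed to be an already created pyspark.sql.SQLContext, so the block itself does not import Spark.

```python
def text_file_query(path):
    """Build the SQL string that queries a text file directly by its path."""
    # Backticks quote the path; the text source exposes one column named `value`.
    return "SELECT * FROM text.`{0}` WHERE value != ''".format(path)


def non_empty_lines(sql_context, path):
    """Return a DataFrame holding the non-empty lines of the file at `path`."""
    # `sql_context` is assumed to be an existing pyspark.sql.SQLContext.
    return sql_context.sql(text_file_query(path))


print(text_file_query("/tmp/notes.txt"))
```

In a Spark 1.6 shell this would be used as non_empty_lines(sqlContext, "/tmp/notes.txt").show(), with no table registered beforehand.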

Advanced JSON parsing
Better instrumentation for SQL operators

Tracking memory usage ( how much is used on each machine )

Display the failed output operation in Spark Streaming
Persist ML pipelines to
R-like statistics for GLMs

Provide R-like summary statistics

New algos added to MLlib

Bisecting K-Means
Online Hypothesis testing : A/B testing in Spark Streaming
Survival analysis
etc.
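
To make the online hypothesis testing item concrete, here is a plain-Python sketch of an A/B test that is updated batch by batch. It uses Welch's t statistic with Welford's running mean and variance. This is my own illustration of the idea, not MLlib's implementation, and every name and data value in it is invented.

```python
import math


class RunningStats:
    """Count, mean and variance of a stream, updated one value at a time (Welford)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1)  # sample variance, needs n >= 2


def welch_t(a, b):
    """Welch's t statistic for the difference in means of groups a and b."""
    standard_error = math.sqrt(a.variance() / a.n + b.variance() / b.n)
    return (a.mean - b.mean) / standard_error


control, treatment = RunningStats(), RunningStats()
batches = [  # each micro-batch holds (is_treatment, observed_value) pairs
    [(False, 1.0), (True, 2.0), (False, 2.0)],
    [(True, 4.0), (False, 3.0), (True, 6.0)],
]
for batch in batches:
    for is_treatment, value in batch:
        (treatment if is_treatment else control).update(value)
    # Re-evaluate the test after every batch once both groups have enough data.
    if control.n >= 2 and treatment.n >= 2:
        print("t =", welch_t(treatment, control))
```

Only the running counts, means and squared deviations are kept between batches, which is what lets the test run over an unbounded stream.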
