General notes:
- http://go.databricks.com/hubfs/notebooks/Spark_1.6_Improvements.html
- Huge open-source ecosystem
- Data Sources, Applications, Environments
- Large number of distributors
- 3-month release cycle for each version
- Key themes
- Out of the box performance
- Previews of key new APIs
- Two separate memory managers (Spark 1.5)
- Execution memory: computation in shuffles, joins, sorts, and aggregations
- Storage memory: caching data (e.g. persisted RDDs) across the cluster
- Goal: allow memory regions to shrink / grow dynamically
- Unified Memory Management in Spark 1.6
- Memory can cross between the execution and storage regions
- Borrowing can happen in both directions
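- A minimal sketch of the unified-memory knobs introduced with Spark 1.6; the values shown are just the 1.6 defaults, not tuning advice:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Unified memory management (Spark 1.6+): execution and storage share one
// region whose size and eviction behaviour are controlled by two settings.
val conf = new SparkConf()
  .setAppName("unified-memory-sketch")
  .setMaster("local[*]")
  // Fraction of the heap shared by execution and storage (1.6 default: 0.75).
  .set("spark.memory.fraction", "0.75")
  // Part of that region protected from eviction by execution (default: 0.5).
  .set("spark.memory.storageFraction", "0.5")

val sc = new SparkContext(conf)
```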
- History of Spark APIs
- RDD API
- Distributed collections of JVM objects
- Functional operators (map, filter, etc.)
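- A minimal RDD sketch (toy data, names are illustrative), reusing the SparkContext `sc` from the config sketch above:

```scala
// RDD API: a distributed collection of JVM objects (here, Ints) transformed
// with functional operators.
val rdd = sc.parallelize(1 to 100)
val evenSquares = rdd
  .filter(_ % 2 == 0)   // keep even numbers
  .map(n => n * n)      // square each one

println(evenSquares.take(5).mkString(", "))
```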
- DataFrame API (2013)
- Distributed collection of Row objects
- Expression-based operations and UDFs
- Logical Plans and optimizer
- Fast / Efficient internal representations
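- A minimal DataFrame sketch against the Spark 1.6 SQLContext API; the `people.json` path and column names are assumptions made for illustration:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions._

// DataFrame API: a distributed collection of Row objects queried with
// expression-based operations; the optimizer turns this into a logical plan.
val sqlContext = new SQLContext(sc)
val people = sqlContext.read.json("people.json")

people
  .filter(col("age") > 21)
  .groupBy(col("country"))
  .agg(avg(col("age")).as("avgAge"))
  .show()
```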
- Dataset API (2015)
- Internally Rows, externally JVM objects
- "Best of both worlds: type-safe + fast"
- Encoder: converts a JVM object into a Dataset Row (and back)
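- A minimal Dataset sketch (Spark 1.6 API, reusing `sqlContext` from the DataFrame sketch above; the Person case class and values are illustrative):

```scala
// Dataset API: externally typed as Person objects, internally stored as rows
// via an Encoder, so it is both type-safe and fast.
case class Person(name: String, age: Int)

import sqlContext.implicits._   // provides Encoders for case classes and primitives

val ds = Seq(Person("Ann", 30), Person("Bob", 17)).toDS()

ds.filter(_.age >= 18)   // checked at compile time against Person
  .map(_.name)           // still typed: Dataset[String]
  .show()
```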
- High-level APIs -> DataFrame (& Dataset) -> Tungsten execution
- SQL directly over files
- select * from text.`fileName` where value != ''
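- The query above can be issued straight from Scala, reusing `sqlContext`; `fileName` is a placeholder path to a text file:

```scala
// Spark 1.6 lets SQL run directly over a file without registering a table
// first: the data source name ("text") is used as the table qualifier.
val nonEmptyLines = sqlContext.sql("select * from text.`fileName` where value != ''")
nonEmptyLines.show()
```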
- Advanced JSON parsing
- Better instrumentation for SQL operators
- Tracking memory usage (how much is used on each machine)
- Display the failed output op in streaming
- Persist ML pipelines to disk (save / load)
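- A hedged sketch of the pipeline save/load added in Spark 1.6; the stages and the `/tmp/lr-pipeline` path are illustrative assumptions:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Build a small pipeline, write it to storage, and read it back.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

pipeline.save("/tmp/lr-pipeline")                // persist the (unfitted) pipeline
val restored = Pipeline.load("/tmp/lr-pipeline") // read it back
```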
- R-like statistics for GLMs
- Provide R-like summary statistics
- New algorithms added to MLlib
- Bisecting K-Means (see the sketch after this list)
- Online hypothesis testing: A/B testing in Spark Streaming
- Survival analysis
- etc.
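- A minimal Bisecting K-Means sketch against the RDD-based MLlib API added in 1.6, reusing `sc`; the toy points and k = 2 are illustrative:

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Two obvious clusters of 2-D points; BisectingKMeans splits clusters
// top-down until k leaf clusters remain.
val points = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

val model = new BisectingKMeans().setK(2).run(points)
model.clusterCenters.foreach(println)
```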