Tuesday 1 December 2015

Webinar notes: "Apache Spark Release 1.6"

General notes :
What's coming in Apache Spark 1.6 :
  • Key themes
    • Out of the box performance
    • Previews of key new APIs
  • Two separate memory managers ( Spark 1.5 )
    • Execution memory : Computation in shuffles, joins, sorts, aggregations
    • Storage memory
  • Goal : Allow memory regions to shrink / grow dynamically
  • Unified Memory Management in Spark 1.6
    • Can cross between execution and storage memory
      • Borrowing can happen from both sides
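
To make the borrowing idea concrete, here is a toy model in plain Python. It is not Spark's implementation. The class name, the `storage_floor` parameter, and the policy shown (storage only takes idle space, execution may also evict cached data above a protected floor) are assumptions made for this sketch.

```python
class UnifiedPool:
    """Toy model of one shared region for execution and storage memory."""

    def __init__(self, total, storage_floor):
        self.total = total                  # size of the shared region
        self.storage_floor = storage_floor  # assumed: cached data below this is never evicted
        self.execution = 0                  # memory currently held by execution
        self.storage = 0                    # memory currently held by storage (cache)

    def free(self):
        return self.total - self.execution - self.storage

    def acquire_storage(self, amount):
        """Storage borrows whatever is idle, but never evicts execution."""
        granted = min(amount, self.free())
        self.storage += granted
        return granted

    def acquire_execution(self, amount):
        """Execution uses idle memory first, then evicts storage above the floor."""
        shortfall = amount - self.free()
        if shortfall > 0:
            evictable = max(0, self.storage - self.storage_floor)
            self.storage -= min(shortfall, evictable)
        granted = min(amount, self.free())
        self.execution += granted
        return granted


pool = UnifiedPool(total=100, storage_floor=30)
cached = pool.acquire_storage(80)     # storage grows into idle execution space
running = pool.acquire_execution(60)  # execution reclaims space by evicting cache
```

With two fixed-size regions, as in the Spark 1.5 model noted above, the first request would have been capped at the storage region's size even while execution memory sat idle.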
  • History of Spark APIs
    • RDD API
      • Distributed collections of JVM objects
      • Functional operators (map, filter, etc.)
    • DataFrame API ( 2013 )
      • Distributed collection of Row objects
      • Expression-based operations and UDFs
      • Logical Plans and optimizer
      • Fast / Efficient internal representations
    • Dataset API ( 2015 )
      • Internally Rows, externally JVM objects
      • "Best of both worlds : type safe + fast "
  • Encoder : Converts from JVM object into a Dataset Row
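
The encoder idea can be illustrated with a small plain-Python analogy. Spark's encoders work on JVM objects and a binary row format, so this only sketches the concept. `Person`, `Encoder`, `to_row` and `from_row` are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class Person:
    name: str
    age: int


class Encoder:
    """Maps typed objects to flat rows and back (conceptual stand-in only)."""

    def __init__(self, cls):
        self.cls = cls
        self.columns = [f.name for f in fields(cls)]  # column order follows field order

    def to_row(self, obj):
        return tuple(getattr(obj, column) for column in self.columns)

    def from_row(self, row):
        return self.cls(*row)


encoder = Encoder(Person)
people = [Person("Ann", 31), Person("Bob", 17)]

# Internally the data is held as rows ...
rows = [encoder.to_row(p) for p in people]
age_index = encoder.columns.index("age")
adult_rows = [r for r in rows if r[age_index] >= 18]

# ... while the user gets typed objects back at the API boundary.
adults = [encoder.from_row(r) for r in adult_rows]
```

The filter runs on the row form and only the result is decoded, which is roughly what "internally Rows, externally JVM objects" describes.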
  • High Level APIs -> DataFrame ( & Dataset ) -> Tungsten Execution
  • SQL directly over files
    • select * from text.`fileName` where value != ''
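
In the query above, `text` names the data source and `value` is the single column holding each line of the file. As a rough plain-Python picture of what that query returns (a stand-in, not Spark code; the function name and sample file are made up for this sketch):

```python
import os
import tempfile


def select_non_empty(path):
    """Rough equivalent of: select * from text.`path` where value != ''"""
    rows = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            value = line.rstrip("\n")  # one row per line, single column "value"
            if value != "":
                rows.append({"value": value})
    return rows


# Hypothetical sample file, created only for this example.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first\n\nsecond\n")
    sample_path = tmp.name

rows = select_non_empty(sample_path)
os.remove(sample_path)
```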
  • Advanced JSON parsing
  • Better instrumentation for SQL operators
    • Tracking memory usage ( How much used on each machine )
  • Display the failed output op in streaming
  • Persist ML pipelines to
  • R-like statistics for GLMs
    • Provide R-like summary statistics
  • New algos added to MLlib
    • Bisecting K-Means
    • Online Hypothesis testing : A/B testing in Spark Streaming
    • Survival analysis
    • etc.