Saturday, 31 October 2015

Standalone Apache Spark application

The best way to learn Apache Spark, is through various interactive shells that it supports. As of now,  Apache Spark supports following interactive shells Python/IPython, Scala and R. 

But the best way to deploy Spark programs in production systems is through Standalone applications. The main difference from using  it in the shell is that we need to initialize your own SparkContext(where as interactive shells gives one to us).

Note: SparkContext represents  a connection to a computing cluster.

Python standalone application

In Python, Spark standalone applications are simple Python scripts, but need to run this scripts through bin/spark-submit script which included in Spark installation. The spark-submit includes the Spark dependencies for us in Python. This script sets up the environment for Spark's Python API to function.

"HelloWorld.py" :


from pyspark import SparkConf, SparkContext
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext ( conf = conf )
lines = sc.textFile ("/home/viswanath/spark/README.md")
lines.count()

To run the script
spark-submit HelloWorld.py




Finally, to shut down Spark, we can either call the stop() method on SparkContext, or simply exit the application ( e.g. with System.exit(0) or sys.exit()).

Friday, 30 October 2015

How to configure IPython Notebook server with Apache Spark

IPython provides a rich architecture for interactive computing with a powerful interactive shell, a kernel for Jupyter Sand others.

Apache Spark  is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


Steps for setting up IPython notebook server with Apache Spark:

  • Install Spark
  • Create a PySpark profile for IPython
  • WordCount example

Spark installation:

Installing Spark is pretty much a straight forward task, get the latest version from here, to make things simple, download pre-built version to your working directory, at the time of writing this article it is version 1.5.1(spark-1.5.1-bin-hadoop2.6.tgz).  Extract compressed file(.tgz) with tar -xvzf spark-1.5.1-bin-hadoop2.6.tgz. That's it,  now you have working Spark installation on your machine.

Some configurations :
To easily adapt  for different versions of Spark, let's create symbolic link for Spark installation directory
ln -s spark-1.5.1-bin-hadoop2.6 spark 

Now let's set up some environmental variables, 

export JAVA_HOME=/usr
export SPARK_HOME=/home/viswanath/spark
export PATH=$SPARK_HOME/bin:$PATH

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[4]"
Note: In the case of Linux, you have to add above variables to .bashrc file

Create a PySpark profile for IPython:

Let's create IPython profile for PySpark
ipython profile create pyspark

Above command will create pyspark direcory in ~/.ipython/profile_pyspark.

Now open ipython_notebook_config.py file in profile_pyspark directory, add following lines to the file and save 


# Kernel config

c.IPKernelApp.pylab = 'inline'  # if you want plotting support always


# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 9999


Now create a file in "00-pyspark-setup.py" in startup (~/.ipython/profile_pyspark/startup) directory and add the following code snippet to it

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)

    pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
    if not "pyspark-shell" in pyspark_submit_args: pyspark_submit_args += " pyspark-shell"
    os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))



Now you are ready to launch your IPython notebook server with PySpark support

ipython notebook --profile=pyspark

Here goes, your IPython notebook http://server_ip:9999/


WordCount example :


inputRDD = sc.textFile ( "~/spark/README.md" )

inputRDD.flatMap( lambda line : line.split( ' ')).map ( lambda word : (word,1)).reduceByKey ( lambda a,b : a+b).collect()


Tip: To keep alive your notebook server after you left the current command line interaction, use screen.


Thanks & Regards
Viswanath G.
Senior Data Scientist