Friday 30 October 2015

How to configure IPython Notebook server with Apache Spark

IPython provides a rich architecture for interactive computing, with a powerful interactive shell, a kernel for Jupyter, and more.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


Steps for setting up an IPython notebook server with Apache Spark:

  • Install Spark
  • Create a PySpark profile for IPython
  • WordCount example

Spark installation:

Installing Spark is a pretty straightforward task: get the latest version from here. To keep things simple, download the pre-built version to your working directory; at the time of writing this article it is version 1.5.1 (spark-1.5.1-bin-hadoop2.6.tgz). Extract the compressed file (.tgz) with tar -xvzf spark-1.5.1-bin-hadoop2.6.tgz. That's it, you now have a working Spark installation on your machine.
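
If you would rather script this step, here is a minimal Python sketch of the same download-and-extract procedure. The archive URL is an assumption based on the Apache archive layout and may need adjusting for other versions or mirrors:

# A scripted version of the manual download/extract step (optional convenience)
import tarfile
import urllib

# Assumed Apache archive URL for the 1.5.1 pre-built package
url = "http://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz"
filename = "spark-1.5.1-bin-hadoop2.6.tgz"

urllib.urlretrieve(url, filename)   # download into the working directory
tgz = tarfile.open(filename, "r:gz")
tgz.extractall()                    # creates spark-1.5.1-bin-hadoop2.6/
tgz.close()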

Some configurations:
To adapt easily to different Spark versions, let's create a symbolic link to the Spark installation directory:
ln -s spark-1.5.1-bin-hadoop2.6 spark

Now let's set up some environment variables:

export JAVA_HOME=/usr
export SPARK_HOME=/home/viswanath/spark
export PATH=$SPARK_HOME/bin:$PATH

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[4]"
Note: On Linux, add the above variables to your .bashrc file so they persist across sessions.
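
As a quick sanity check, you can confirm from a fresh shell that the variables are visible to Python (a minimal sketch):

import os

# Print each variable, or a placeholder if it is missing
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS"):
    print var, "=", os.environ.get(var, "<not set>")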

Create a PySpark profile for IPython:

Let's create an IPython profile for PySpark:
ipython profile create pyspark

The above command will create a profile directory at ~/.ipython/profile_pyspark.
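
If you want to double-check that the profile was created, a quick look from Python (assuming the default IPython directory location):

import os

profile_dir = os.path.expanduser("~/.ipython/profile_pyspark")
print os.path.isdir(profile_dir)   # should print True
print os.listdir(profile_dir)      # config files and the startup/ directory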

Now open the ipython_notebook_config.py file in the profile_pyspark directory, add the following lines to it, and save:


# Kernel config

c.IPKernelApp.pylab = 'inline'  # if you want plotting support always


# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 9999


Now create a file named 00-pyspark-setup.py in the startup directory (~/.ipython/profile_pyspark/startup) and add the following code snippet to it:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Spark 1.4+ expects "pyspark-shell" in PYSPARK_SUBMIT_ARGS
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add PySpark to the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
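
Once a notebook starts under this profile, the startup script leaves a ready SparkContext bound to sc. A quick check in the first cell (a sketch, assuming everything loaded correctly):

# 'sc' is predefined by 00-pyspark-setup.py
print sc.version   # e.g. 1.5.1
print sc.master    # e.g. local[4], taken from PYSPARK_SUBMIT_ARGS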



Now you are ready to launch your IPython notebook server with PySpark support:

ipython notebook --profile=pyspark

There you go: your IPython notebook is available at http://server_ip:9999/


WordCount example:

Note that sc.textFile does not expand the ~ shortcut, so expand the path in Python first:

import os

inputRDD = sc.textFile(os.path.expanduser("~/spark/README.md"))

inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
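
As a small follow-up, instead of collecting every pair you can ask for the ten most frequent words directly; takeOrdered is a standard RDD action:

# Same pipeline, but return only the top 10 words by count
inputRDD.flatMap(lambda line: line.split(' ')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .takeOrdered(10, key=lambda pair: -pair[1])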


Tip: To keep your notebook server alive after you leave the current shell session, run it inside screen.


Thanks & Regards
Viswanath G.
Senior Data Scientist


