Friday 30 October 2015

How to configure IPython Notebook server with Apache Spark

IPython provides a rich architecture for interactive computing, with a powerful interactive shell, a kernel for Jupyter, and more.

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


Steps for setting up an IPython notebook server with Apache Spark:

  • Install Spark
  • Create a PySpark profile for IPython
  • WordCount example

Spark installation:

Installing Spark is a pretty straightforward task: get the latest version from here. To keep things simple, download the pre-built version to your working directory; at the time of writing this article it is version 1.5.1 (spark-1.5.1-bin-hadoop2.6.tgz). Extract the compressed file (.tgz) with tar -xvzf spark-1.5.1-bin-hadoop2.6.tgz. That's it, you now have a working Spark installation on your machine.
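
If you would rather script this step, here is a minimal Python sketch of the same download-and-extract procedure. The archive URL is an assumption based on the Apache archive layout and may need adjusting for other versions or mirrors:

# A scripted version of the manual download/extract step (optional convenience)
import tarfile
import urllib

# Assumed Apache archive URL for the 1.5.1 pre-built package
url = "http://archive.apache.org/dist/spark/spark-1.5.1/spark-1.5.1-bin-hadoop2.6.tgz"
filename = "spark-1.5.1-bin-hadoop2.6.tgz"

urllib.urlretrieve(url, filename)   # download into the working directory
tgz = tarfile.open(filename, "r:gz")
tgz.extractall()                    # creates spark-1.5.1-bin-hadoop2.6/
tgz.close()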

Some configurations:
To adapt easily to different Spark versions, let's create a symbolic link to the Spark installation directory:
ln -s spark-1.5.1-bin-hadoop2.6 spark

Now let's set up some environment variables:

export JAVA_HOME=/usr
export SPARK_HOME=/home/viswanath/spark
export PATH=$SPARK_HOME/bin:$PATH

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[4]"
Note: On Linux, add the above variables to your .bashrc file so they persist across sessions.
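
As a quick sanity check, you can confirm from a fresh shell that the variables are visible to Python (a minimal sketch):

import os

# Print each variable, or a placeholder if it is missing
for var in ("JAVA_HOME", "SPARK_HOME", "PYSPARK_SUBMIT_ARGS"):
    print var, "=", os.environ.get(var, "<not set>")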

Create a PySpark profile for IPython:

Let's create an IPython profile for PySpark:
ipython profile create pyspark

The above command will create a profile directory at ~/.ipython/profile_pyspark.
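
If you want to double-check that the profile was created, a quick look from Python (assuming the default IPython directory location):

import os

profile_dir = os.path.expanduser("~/.ipython/profile_pyspark")
print os.path.isdir(profile_dir)   # should print True
print os.listdir(profile_dir)      # config files and the startup/ directory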

Now open the ipython_notebook_config.py file in the profile_pyspark directory, add the following lines to it, and save:


# Kernel config

c.IPKernelApp.pylab = 'inline'  # if you want plotting support always


# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 9999


Now create a file named 00-pyspark-setup.py in the startup directory (~/.ipython/profile_pyspark/startup) and add the following code snippet to it:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Spark 1.4+ expects "pyspark-shell" in PYSPARK_SUBMIT_ARGS
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add PySpark to the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
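
Once a notebook starts under this profile, the startup script leaves a ready SparkContext bound to sc. A quick check in the first cell (a sketch, assuming everything loaded correctly):

# 'sc' is predefined by 00-pyspark-setup.py
print sc.version   # e.g. 1.5.1
print sc.master    # e.g. local[4], taken from PYSPARK_SUBMIT_ARGS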



Now you are ready to launch your IPython notebook server with PySpark support:

ipython notebook --profile=pyspark

There you go: your IPython notebook is available at http://server_ip:9999/


WordCount example:

Note that sc.textFile does not expand the ~ shortcut, so expand the path in Python first:

import os

inputRDD = sc.textFile(os.path.expanduser("~/spark/README.md"))

inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
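
As a small follow-up, instead of collecting every pair you can ask for the ten most frequent words directly; takeOrdered is a standard RDD action:

# Same pipeline, but return only the top 10 words by count
inputRDD.flatMap(lambda line: line.split(' ')) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .takeOrdered(10, key=lambda pair: -pair[1])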


Tip: To keep your notebook server alive after you leave the current shell session, run it inside screen.


Thanks & Regards
Viswanath G.
Senior Data Scientist


