Saturday 31 October 2015

Standalone Apache Spark application

The best way to learn Apache Spark is through the interactive shells it supports. As of now, Apache Spark ships interactive shells for Python/IPython, Scala and R.

But the best way to deploy Spark programs to production systems is as standalone applications. The main difference from using the shell is that we need to initialize our own SparkContext (whereas the interactive shells give one to us).

Note: SparkContext represents  a connection to a computing cluster.

Python standalone application

In Python, Spark standalone applications are simple Python scripts, but we need to run these scripts through the bin/spark-submit script included in the Spark installation. spark-submit pulls in the Spark dependencies for us and sets up the environment for Spark's Python API to function.

"HelloWorld.py" :


from pyspark import SparkConf, SparkContext

# Initialize our own SparkContext, running Spark locally
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext(conf=conf)

# Load the README as an RDD of lines and print the line count
lines = sc.textFile("/home/viswanath/spark/README.md")
print(lines.count())

To run the script:
spark-submit HelloWorld.py
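
The same script can also target a real cluster instead of the local machine just by changing the master URL passed to setMaster. A minimal sketch, where spark://master-host:7077 is only a placeholder for your standalone cluster's master (7077 is the default port):

from pyspark import SparkConf, SparkContext

# Connect to a Spark standalone cluster instead of running locally;
# "master-host" is a placeholder for the real master's host name
conf = SparkConf().setMaster("spark://master-host:7077").setAppName("HelloWorld")
sc = SparkContext(conf=conf)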




Finally, to shut down Spark, we can either call the stop() method on the SparkContext or simply exit the application (e.g. with System.exit(0) in Java/Scala or sys.exit() in Python).
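
For instance, a minimal sketch of the HelloWorld.py script above with an explicit shutdown at the end:

from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext(conf=conf)
try:
    print(sc.textFile("/home/viswanath/spark/README.md").count())
finally:
    # Release the application's resources instead of relying on process exit
    sc.stop()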

Friday 30 October 2015

How to configure IPython Notebook server with Apache Spark

IPython provides a rich architecture for interactive computing, with a powerful interactive shell, a kernel for Jupyter, and more.

Apache Spark  is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


Steps for setting up IPython notebook server with Apache Spark:

  • Install Spark
  • Create a PySpark profile for IPython
  • WordCount example

Spark installation:

Installing Spark is a pretty straightforward task. Get the latest version from here; to keep things simple, download the pre-built version to your working directory. At the time of writing this article it is version 1.5.1 (spark-1.5.1-bin-hadoop2.6.tgz). Extract the compressed file (.tgz) with tar -xvzf spark-1.5.1-bin-hadoop2.6.tgz. That's it, now you have a working Spark installation on your machine.

Some configuration:
To easily adapt to different versions of Spark, let's create a symbolic link to the Spark installation directory:
ln -s spark-1.5.1-bin-hadoop2.6 spark

Now let's set up some environment variables:

export JAVA_HOME=/usr
export SPARK_HOME=/home/viswanath/spark
export PATH=$SPARK_HOME/bin:$PATH

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[4]"
Note: On Linux, add the above variables to your .bashrc file.

Create a PySpark profile for IPython:

Let's create an IPython profile for PySpark:
ipython profile create pyspark

The above command will create the profile directory ~/.ipython/profile_pyspark.

Now open the ipython_notebook_config.py file in the profile_pyspark directory, add the following lines to it and save:


# Kernel config

c.IPKernelApp.pylab = 'inline'  # if you want plotting support always


# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 9999


Now create a file named "00-pyspark-setup.py" in the startup directory (~/.ipython/profile_pyspark/startup) and add the following code snippet to it:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Spark 1.4+ expects "pyspark-shell" in PYSPARK_SUBMIT_ARGS
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add the PySpark sources to the Python path
sys.path.insert(0, spark_home + "/python")

# Add the py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))



Now you are ready to launch your IPython notebook server with PySpark support

ipython notebook --profile=pyspark

That's it, your IPython notebook is now available at http://server_ip:9999/.
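
Once a notebook is open, you can run a quick sanity check in a cell to confirm that the predefined SparkContext sc is working. A minimal sketch:

# 'sc' is created by the 00-pyspark-setup.py startup script
print(sc.version)                        # e.g. 1.5.1
print(sc.parallelize(range(100)).sum())  # runs a small job; should print 4950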


WordCount example :


# Load the file as an RDD of lines (Spark does not expand "~", so use a full path)
inputRDD = sc.textFile("/home/viswanath/spark/README.md")

# Split lines into words, pair each word with 1, then sum the counts per word
inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b).collect()
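
To see only the most frequent words instead of the full list, takeOrdered can replace collect(). A small sketch, reusing the same inputRDD:

# Build the (word, count) pairs once, then take the top 10 by count
counts = inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
counts.takeOrdered(10, key=lambda pair: -pair[1])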


Tip: To keep your notebook server alive after you leave the current command-line session, run it inside a screen session.


Thanks & Regards
Viswanath G.
Senior Data Scientist