Saturday, 31 October 2015

Standalone Apache Spark application

The best way to learn Apache Spark is through the interactive shells it supports. As of now, Apache Spark ships with interactive shells for Python/IPython, Scala and R.

But the best way to deploy Spark programs to production systems is as standalone applications. The main difference from working in the shell is that we need to initialize our own SparkContext (whereas the interactive shells create one for us).

Note: SparkContext represents  a connection to a computing cluster.

Python standalone application

In Python, Spark standalone applications are simple Python scripts, but they need to be run through the bin/spark-submit script included in the Spark installation. spark-submit pulls in the Spark dependencies for us and sets up the environment so that Spark's Python API can function.

"HelloWorld.py" :


from pyspark import SparkConf, SparkContext

# "local" runs Spark with a single thread on the local machine
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext(conf=conf)

# Load a text file as an RDD of lines and count them
lines = sc.textFile("/home/viswanath/spark/README.md")
print(lines.count())

To run the script:
spark-submit HelloWorld.py
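
spark-submit also accepts the usual Spark options ahead of the script name; for example, to run the job on four local cores:

spark-submit --master local[4] HelloWorld.py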




Finally, to shut down Spark, we can either call the stop() method on the SparkContext, or simply exit the application (e.g. with System.exit(0) in Java/Scala or sys.exit() in Python).
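
For the HelloWorld.py script above, the explicit shutdown is just one extra line at the end of the script:

# Release the resources held by the SparkContext once the job is done
sc.stop()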

Friday, 30 October 2015

How to configure IPython Notebook server with Apache Spark

IPython provides a rich architecture for interactive computing, with a powerful interactive shell, a kernel for Jupyter, and more.

Apache Spark  is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.


Steps for setting up an IPython notebook server with Apache Spark:

  • Install Spark
  • Create a PySpark profile for IPython
  • WordCount example

Spark installation:

Installing Spark is pretty much a straightforward task. Get the latest version from here; to keep things simple, download the pre-built version to your working directory. At the time of writing this article it is version 1.5.1 (spark-1.5.1-bin-hadoop2.6.tgz). Extract the compressed file (.tgz) with tar -xvzf spark-1.5.1-bin-hadoop2.6.tgz. That's it, you now have a working Spark installation on your machine.

Some configuration:
To make it easy to switch between different versions of Spark, let's create a symbolic link to the Spark installation directory:
ln -s spark-1.5.1-bin-hadoop2.6 spark 

Now let's set up some environment variables:

export JAVA_HOME=/usr
export SPARK_HOME=/home/viswanath/spark
export PATH=$SPARK_HOME/bin:$PATH

# Where you specify options you would normally add after bin/pyspark
export PYSPARK_SUBMIT_ARGS="--master local[4]"
Note: On Linux, add the above variables to your .bashrc file so they persist across sessions.

Create a PySpark profile for IPython:

Let's create an IPython profile for PySpark:
ipython profile create pyspark

The above command creates a profile directory at ~/.ipython/profile_pyspark.

Now open the ipython_notebook_config.py file in the profile_pyspark directory, add the following lines to it, and save:


# Kernel config

c.IPKernelApp.pylab = 'inline'  # if you want plotting support always


# Notebook config
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False
# It is a good idea to put it on a known, fixed port
c.NotebookApp.port = 9999


Now create a file named "00-pyspark-setup.py" in the startup directory (~/.ipython/profile_pyspark/startup) and add the following code snippet to it:

# Configure the necessary Spark environment
import os
import sys

spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME environment variable is not set')

# Recent PySpark versions expect "pyspark-shell" at the end of PYSPARK_SUBMIT_ARGS
pyspark_submit_args = os.environ.get("PYSPARK_SUBMIT_ARGS", "")
if "pyspark-shell" not in pyspark_submit_args:
    pyspark_submit_args += " pyspark-shell"
os.environ["PYSPARK_SUBMIT_ARGS"] = pyspark_submit_args

# Add PySpark to the Python path
sys.path.insert(0, os.path.join(spark_home, 'python'))

# Add py4j to the path.
# You may need to change the version number to match your install
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

# Initialize PySpark to predefine the SparkContext variable 'sc'
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))



Now you are ready to launch your IPython notebook server with PySpark support:

ipython notebook --profile=pyspark

There you go: your IPython notebook server is now available at http://server_ip:9999/.
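
Inside a new notebook, the startup script should already have created the SparkContext as sc; a quick sanity check (a minimal example, any small job will do):

# Sum the numbers 0..99 with Spark; the result should be 4950
sc.parallelize(range(100)).sum()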


WordCount example:


# Use an absolute path; "~" is not expanded by Spark's textFile()
inputRDD = sc.textFile("/home/viswanath/spark/README.md")

# Split lines into words, pair each word with 1, and sum the counts per word
inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b).collect()
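
To go a step further and look at the most frequent words, the same pair RDD can be sorted with takeOrdered, a standard RDD action; a small sketch building on the example above:

wordCounts = inputRDD.flatMap(lambda line: line.split(' ')).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# Top 10 words by count, largest first
wordCounts.takeOrdered(10, key=lambda pair: -pair[1])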


Tip: To keep your notebook server alive after you leave the current command-line session, run it inside screen.


Thanks & Regards
Viswanath G.
Senior Data Scientist



Friday, 12 December 2014

Four Short Links: Dec 12, 2014

1. Tab (A Linux Shell Utility):-
   A modern text processing language that's similar to awk in spirit. 
    - Designed for concise one-liner aggregation and manipulation of tabular text data.

    - Makes no compromises on performance; aims to be no slower than traditional old-school UNIX shell utilities whenever possible.

    - Feature-rich enough to support even very complex queries. (Also includes a good set of mathematical operations.)

    - Statically typed, type-inferred, declarative.

   Note:- By the end of this weekend, I will write a blog post, "Tutorial on Tab:- A Linux Shell Utility".


2. Deep Learning Tutorial:- From Perceptron to Deep Networks
    - In this tutorial, the author introduces the reader to the key concepts and algorithms behind deep learning, beginning with the simplest unit of composition and building up to the concepts of machine learning in Java.

3. Faster Apache Pig with Apache Tez
    - Apache Pig 0.14.0 was released on Nov 20th, 2014, and the good news is that Tez is now one of the execution engines.

4. 10 Data Science Newsletters to subscribe to 

Friday, 29 August 2014

Pig UDFs in Jython and Python

In this post we are going to see how to write Pig UDFs in Jython and Python, using a running example: an n-grams generator.

Problem Definition:-
Given a concatenated string with some delimiter, generate n-grams.

For example:-
concatenatedString = 'a_b_c_d'
1-gram:- a, b, c, d
2-grams:- a b, b c, c d
3-grams:- a b c, b c d
4-grams:- a b c d

Jython UDF to do the above computation:-

@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    listOfNGrams = []
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        # Keep only full windows of exactly nGramValue tokens
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams
     
Content of the file "test.txt":-
a_b_c_d
1_2_3_4
a_b_c_d
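
Before wiring the UDF into a Pig script, it can be handy to exercise the same logic from a plain Python (or Jython) interpreter. The sketch below stubs out Pig's outputSchema decorator, which is normally only available inside Pig, so the function body can be tested locally against the sample data above:

# Stand-in for Pig's outputSchema decorator, so the UDF can run outside Pig
def outputSchema(schema):
    def wrap(func):
        return func
    return wrap

@outputSchema("y:bag{t:tuple(nGram:chararray)}")
def nGramsOnConcatenatedText(concatenatedString, delimiter, nGramValue):
    listOfNGrams = []
    allCategories = str(concatenatedString).split(delimiter)
    for i in range(len(allCategories)):
        if len(allCategories[i:nGramValue + i]) == nGramValue:
            tVariable = ''
            for eE in allCategories[i:nGramValue + i]:
                tVariable += '\t' + eE
            listOfNGrams.append(tVariable.strip())
    return listOfNGrams

# Tokens inside each n-gram are tab-separated by the UDF
print(nGramsOnConcatenatedText('a_b_c_d', '_', 2))   # ['a\tb', 'b\tc', 'c\td']
print(nGramsOnConcatenatedText('1_2_3_4', '_', 3))   # ['1\t2\t3', '2\t3\t4']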