The best way to learn Apache Spark is through the interactive shells it supports. Currently these are Python/IPython, Scala, and R.
The best way to deploy Spark programs in production systems, however, is as standalone applications. The main difference from using the shell is that we need to initialize our own SparkContext (whereas the interactive shells provide one for us).
Note: SparkContext represents a connection to a computing cluster.
Python standalone application
In Python, Spark standalone applications are simple Python scripts, but they must be run through the bin/spark-submit script included in the Spark installation. This script sets up the environment for Spark's Python API to function and includes the Spark dependencies for us.
"HelloWorld.py" :
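from pyspark import SparkConf, SparkContext
# Run Spark locally and name the application "HelloWorld"
conf = SparkConf().setMaster("local").setAppName("HelloWorld")
sc = SparkContext(conf=conf)
# Load the file as an RDD of lines
lines = sc.textFile("/home/viswanath/spark/README.md")
# Unlike the shell, a script does not echo results, so print the count
print(lines.count())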
To run the script:
spark-submit HelloWorld.py
Finally, to shut down Spark, we can either call the stop() method on the SparkContext or simply exit the application (e.g. with sys.exit() in Python, or System.exit(0) in Scala or Java).
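One way to make sure stop() is always called is to wrap the job in try/finally. The following is only a minimal sketch of that idea: run_and_stop, count_lines, and main are hypothetical helper names introduced here for illustration and are not part of Spark. The sketch assumes the same file path and local master as HelloWorld.py.

```python
def run_and_stop(sc, job):
    """Run job(sc) and stop the context afterwards, even if the job raises."""
    try:
        return job(sc)
    finally:
        # Release the connection to the cluster in every case
        sc.stop()


def count_lines(sc, path="/home/viswanath/spark/README.md"):
    """Count the lines of a text file (same path as HelloWorld.py)."""
    return sc.textFile(path).count()


def main():
    # Imported here so the helpers above can be loaded without Spark installed
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster("local").setAppName("HelloWorld")
    sc = SparkContext(conf=conf)
    print(run_and_stop(sc, count_lines))
```

To use this as a standalone application, end the script with a call to main() and launch it with spark-submit as before. Because stop() sits in the finally clause, the context is shut down whether the count succeeds or an exception is raised.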