Installing Apache Spark on Windows 7 environment

Apache Spark is a lightning-fast cluster computing engine well suited to big data processing. UC Berkeley currently runs a MOOC on it here. However, the course uses a pre-configured VM set up specifically for the MOOC and its lab exercises, and I wanted to get a taste of this technology on my personal computer. I spent two days searching the internet for how to install and configure it in a Windows-based environment, and finally came up with the following brief steps, which led me to a working installation of Apache Spark.

To install Spark in a Windows-based environment, the following prerequisites should be fulfilled first.

Requirement 1:

Install a Java JDK, since Spark runs on the JVM. I used JDK 1.7.0_79; its install path is needed for JAVA_HOME below.
If you are a Python user, install Python 2.6 or above. Otherwise this step is not required, and you do not need to set the Python path as an environment variable.
Download a pre-built Spark binary for Hadoop. I chose Spark release 1.2.1, package type "Pre-built for Hadoop 2.3 or later", from here.
Once it was downloaded, I extracted the *.tar archive to the D drive using WinRAR. (You can extract it to any drive on your computer.)
The benefit of using a pre-built binary is that you do not have to go through the trouble of building Spark from source.
If you are a Scala user, download and install Scala version 2.10.4 from here. Otherwise this step is not required, and you do not need to set the Scala path as an environment variable.
Download winutils.exe and place it in a bin folder inside a directory of your choice, for example C:\winutils\bin\winutils.exe. The official Hadoop 2.6 release does not include the Windows binaries (such as winutils.exe) that Hadoop's libraries need on Windows. Spark does not need a Hadoop cluster to run locally, but it uses Hadoop's client libraries, and those look for this file under HADOOP_HOME\bin.
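
If you installed Python, the winutils step can be sanity-checked with a few lines of code. This is only a minimal sketch: the function name find_winutils is my own invention (it is not part of Spark or Hadoop), and it assumes the usual layout in which Hadoop's libraries look for winutils.exe in a bin folder under the directory that HADOOP_HOME points to.

```python
import os

def find_winutils(hadoop_home):
    """Return the full path to winutils.exe under <hadoop_home>/bin, or None."""
    candidate = os.path.join(hadoop_home, "bin", "winutils.exe")
    return candidate if os.path.isfile(candidate) else None

# HADOOP_HOME is only available once it has been created (see the next section).
home = os.environ.get("HADOOP_HOME")
if home is None:
    print("HADOOP_HOME is not set")
elif find_winutils(home) is None:
    print("winutils.exe was not found in the bin folder under " + home)
else:
    print("Found " + find_winutils(home))
```

If the script reports that the file was not found, move winutils.exe into a bin folder under the HADOOP_HOME directory, or point HADOOP_HOME one level higher.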

Setting up the PATH variable in a Windows environment:

This is the most important step. If the Path variable is not set up properly, you will not be able to start the Spark shell. To reach the environment variables:

Right-click on Computer, then left-click on Properties
Click on Advanced system settings
On the Advanced tab, below Startup and Recovery, click the button labelled "Environment Variables"
You will see a window divided into two parts: the upper part reads "User variables for <username>" and the lower part reads "System variables". We will create several new system variables, so click the "New" button under System variables
Set the variable name to

JAVA_HOME

and the variable value to your JDK path, for example

C:\Program Files\Java\jdk1.7.0_79\

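Similarly, create a new system variable named

PYTHON_PATH

with the value set to your Python directory, for example

C:\Python27\

Create a new system variable named

HADOOP_HOME

with the value set to the directory whose bin folder contains winutils.exe, for example

C:\winutils

Create a new system variable named

SPARK_HOME

with the value set to the bin folder of the directory where you extracted Spark, for example

C:\SPARK\BIN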
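
A typo in any of these values is easy to miss, so here is a small Python sketch that reports which variables are set and whether each one points to an existing directory. The helper name report_variables and the three status strings are my own choices, and the list of names is simply the four variables created above.

```python
import os

# Variable names taken from the steps above.
REQUIRED = ["JAVA_HOME", "PYTHON_PATH", "HADOOP_HOME", "SPARK_HOME"]

def report_variables(names, env):
    """Map each name to 'ok', 'not set', or 'missing directory'."""
    report = {}
    for name in names:
        value = env.get(name)
        if not value:
            report[name] = "not set"
        elif not os.path.isdir(value):
            report[name] = "missing directory"
        else:
            report[name] = "ok"
    return report

for name, status in sorted(report_variables(REQUIRED, os.environ).items()):
    print("%-12s %s" % (name, status))
```

Run it from a freshly opened command prompt, because a prompt that was already open does not pick up newly created variables.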

NOTE:

Apache Maven installation is an optional step.

Download Apache Maven 3.1.1 from here
Choose Maven 3.1.1 (binary zip) and unpack it using WinZip or WinRAR. Create two new system variables, MAVEN_HOME and M2_HOME, both pointing to the Maven installation directory (the Path entry adds the \BIN part):

MAVEN_HOME=D:\APACHE-MAVEN-3.1.1

M2_HOME=D:\APACHE-MAVEN-3.1.1

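Finally, edit the existing Path system variable and append the following entries, separated by semicolons (the last two only if you installed Maven):

%JAVA_HOME%\BIN;%PYTHON_PATH%;%HADOOP_HOME%;%SPARK_HOME%;%M2_HOME%\BIN;%MAVEN_HOME%\BIN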
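
To see what a Path string like this resolves to, the following Python sketch splits it on semicolons and substitutes each %NAME% reference from a dictionary of variables. It is an illustration only: expand_path is a hypothetical helper of mine, not a Windows or Python API, and it leaves unknown names unexpanded so that a misspelled variable is easy to spot.

```python
import re

def expand_path(template, env):
    """Split a Windows Path string on ';' and expand %NAME% references.

    Lookup ignores case, as Windows variable names do. Names missing
    from env are left as %NAME% in the result.
    """
    upper_env = dict((key.upper(), value) for key, value in env.items())

    def substitute(match):
        return upper_env.get(match.group(1).upper(), match.group(0))

    entries = [entry.strip() for entry in template.split(";")]
    return [re.sub(r"%([^%]+)%", substitute, entry) for entry in entries if entry]

# Example values copied from this post; M2_HOME is deliberately left out.
example_env = {
    "JAVA_HOME": r"C:\Program Files\Java\jdk1.7.0_79",
    "SPARK_HOME": r"C:\SPARK\BIN",
}
for entry in expand_path(r"%JAVA_HOME%\BIN;%SPARK_HOME%;%M2_HOME%\BIN", example_env):
    print(entry)
```

Passing os.environ instead of example_env shows how the entries resolve on your own machine.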

How to start Spark on Windows

Open a command prompt
Change directory to the location of the Spark directory. In my case it is on the D drive
Navigate into the bin directory with cd bin
Run the command spark-shell and you should see the Spark logo followed by the scala prompt

Open a web browser and type localhost:4040 in the address bar, and you will see the Spark shell application UI
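
If the page does not load, it helps to know whether anything is listening on that port at all. Here is a minimal Python sketch that attempts a TCP connection. The function name ui_is_up and the two-second timeout are my own choices, and the check only shows that some program accepts connections on the port, not that it is Spark.

```python
import socket

def ui_is_up(host="localhost", port=4040, timeout=2.0):
    """Return True if something accepts TCP connections on host:port."""
    try:
        conn = socket.create_connection((host, port), timeout)
    except (socket.error, socket.timeout):
        return False
    conn.close()
    return True

print("Spark UI reachable: %s" % ui_is_up())
```

Run it in a second command prompt while the Spark shell is still open. If port 4040 is already taken by another application, Spark moves on to the next port, so ui_is_up(port=4041) is worth a try as well.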

To quit Spark, type exit at the scala prompt
