Achieving High Availability in Apache Spark using Apache Zookeeper quorum

Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. 

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.

This article is all about setting up High Availability (HA) in Apache Spark.


1. Pre-requisites of Spark High Availability

Apache Zookeeper
Apache Spark
JDK1.7

1.1 To set in .bashrc

export JAVA_HOME=/home/datadotz/jdk1.7.0_45
(Note: This can be set in .bash_profile too.)
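For completeness, the JDK's bin directory can also be added to PATH in the same file; a minimal sketch:

export PATH=$JAVA_HOME/bin:$PATH

After editing, run source ~/.bashrc (or open a new shell) so the settings take effect.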

2. Installation of Apache Spark 

(Kindly refer to chennaihug.org for the Apache Spark installation steps.)

http://chennaihug.org/knowledgebase/spark-master-and-slaves-single-node-installation/

3. Installing Apache Zookeeper on Multiple Nodes

3.1. Download Apache Zookeeper from the Apache Zookeeper site. If the link below is broken, check the Apache Zookeeper website, and also check for the current version of Apache Zookeeper.

http://www.us.apache.org/dist/zookeeper/zookeeper-3.3.6/

3.2 Installation

3.2.1 Do the installation on both machines

The host name of Machine 1 is spark1 and the host name of Machine 2 is spark2.

a. Untar zookeeper-3.3.6.tar.gz
b. Change to the conf directory
c. Create a new file named zoo.cfg

Add the following content:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/home/dd/zookeeper/data/
# the port at which the clients will connect
clientPort=2181
dataLogDir=/home/dd/zookeeper/logs/
server.1=spark1:2888:3888
server.2=spark2:2889:3889

Note: Create a file named myid in the "dataDir" (/home/dd/zookeeper/data) on both machines.

In Machine 1, just type 1 in the myid file and save.

In Machine 2, just type 2 in the myid file and save.
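A minimal sketch of these steps as shell commands, assuming the directory layout from the zoo.cfg above:

# create the data and log directories referenced in zoo.cfg (run on both machines)
mkdir -p /home/dd/zookeeper/data /home/dd/zookeeper/logs

# on Machine 1 (spark1)
echo 1 > /home/dd/zookeeper/data/myid

# on Machine 2 (spark2)
echo 2 > /home/dd/zookeeper/data/myid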

4. Configuration of HA in Apache Spark, a Quickstart

4.1 Configure on both the Machines

a. Create a file ha.conf in SPARK_HOME
b. Add the following content to ha.conf
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=spark1:2181,spark2:2181
spark.deploy.zookeeper.dir=/home/dd/spark-1.5.1-bin-hadoop2.6/spark
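In these properties, spark.deploy.recoveryMode=ZOOKEEPER tells the Master to keep its recovery state in ZooKeeper, spark.deploy.zookeeper.url lists the ZooKeeper ensemble, and spark.deploy.zookeeper.dir is the path under which that state is stored. As an alternative sketch (not used in this QuickStart), the same properties can be passed through SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh instead of a separate properties file:

export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=spark1:2181,spark2:2181 -Dspark.deploy.zookeeper.dir=/home/dd/spark-1.5.1-bin-hadoop2.6/spark"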

5. Steps

5a) To set in /etc/hosts

The /etc/hosts file is used to map the IP addresses to host names, a one-time setup.

Machine 1 – spark1

Machine 2 – spark2

(and so on for any additional machines)
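A sketch of the corresponding /etc/hosts entries; the IP addresses below are placeholders, so substitute the real addresses of your machines:

192.168.1.101   spark1
192.168.1.102   spark2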

5b) Start Zookeeper on both the Machines

Check for the Zookeeper Leader and Follower, since it is a multi-node setup.
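For example, from the ZooKeeper directory on each machine:

bin/zkServer.sh start
bin/zkServer.sh status

With a healthy quorum, zkServer.sh status should report "Mode: leader" on one node and "Mode: follower" on the other.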


5c). Start Masters

Start the Spark Master on all the nodes with the command below, passing the file (ha.conf) in which you configured the Zookeeper properties.

Machine1:

sbin/start-master.sh -h spark1 -p 7077 --webui-port 8080 --properties-file ha.conf

Machine2:

sbin/start-master.sh -h spark2 -p 17077 --webui-port 18080 --properties-file ha.conf
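To verify that the Master daemon came up on each machine, the JDK's jps tool can be used; the process list should include a "Master" entry on both spark1 and spark2 (the PIDs will differ):

jps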


5d). Start the Workers on both Machines

Workers are the slaves.

Machine1 - sbin/start-slave.sh spark://spark1:7077

Machine2 - sbin/start-slave.sh spark://spark2:17077
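As a variation worth noting, a Worker can also be handed both Masters at once in the master URL, so that it registers with whichever Master is currently active; a sketch using the host names and ports from this article:

sbin/start-slave.sh spark://spark1:7077,spark2:17077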


5e). Check the web UIs at "http://spark1:8080" and "http://spark2:18080"

Monitor the Spark UI and check the number of Alive Workers. The number of Workers is 2 with respect to this QuickStart.

In spark1, the Master web UI shows the status as ALIVE (the active Master).

In spark2, the Master web UI shows the status as STANDBY (the passive Master).

5f). To check for Active and Passive Master

Kill the Active Master manually, so that the Passive node becomes Active.
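For example, on the machine running the active Master, the process can be located with jps and killed; the PID below is a placeholder:

jps | grep Master
kill -9 <Master-PID>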


5g). Check for Web UI again

The Active Master is down, as it was killed.

In spark1, the Master web UI is no longer reachable.

In spark2, the formerly passive Master now shows the status as ALIVE; the Passive node has become the active Master.
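Applications can take advantage of this failover too: if they are submitted with both Masters in the master URL, they keep running when the active Master changes. A sketch using spark-shell with the host names and ports from this article:

bin/spark-shell --master spark://spark1:7077,spark2:17077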

———————————-

Article written by DataDotz Team

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training on technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search and Cloud Computing.

Note: DataDotz also provides classroom-based Apache Kafka training in Chennai. The course includes Cassandra, MongoDB, Scala and Apache Spark training. For more details on Apache Spark training in Chennai, please visit http://datadotz.com/training/