Achieving High Availability in Apache Spark using an Apache ZooKeeper Quorum

Spark has an advanced DAG execution engine that supports cyclic data flow and in-memory computing. 

Spark offers over 80 high-level operators that make it easy to build parallel apps. And you can use it interactively from the Scala, Python and R shells.

This article is all about setting up High Availability (HA) for the Apache Spark standalone Master using a ZooKeeper quorum.


1. Prerequisites for Spark High Availability

Apache Zookeeper
Apache Spark

1.1 Set JAVA_HOME in .bashrc

export JAVA_HOME=/home/datadotz/jdk1.7.0_45
(Note: this can be set in .bash_profile too.)
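As a quick sketch, the lines to append to ~/.bashrc look like the following (the JDK path is the example path used in this article; substitute the path where your own JDK is installed):

```shell
# Append to ~/.bashrc (or ~/.bash_profile). The JDK path below is the
# example path used in this article -- use your own install path.
export JAVA_HOME=/home/datadotz/jdk1.7.0_45
export PATH="$JAVA_HOME/bin:$PATH"

# Verify the variable is visible in the current shell:
echo "$JAVA_HOME"    # prints /home/datadotz/jdk1.7.0_45
```

Run `source ~/.bashrc` (or open a new shell) so the change takes effect.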

2. Installation of Apache Spark 

(Kindly refer to our Apache Spark installation guide.)

3. Installing Apache Zookeeper in Multiple Nodes

3.1 Download Apache ZooKeeper from the Apache ZooKeeper website. If a download link is broken, check the Apache ZooKeeper website directly, and check for the current version of Apache ZooKeeper (this article uses zookeeper-3.3.6).

3.2 Installation

3.2.1 Perform the installation on both machines

The hostname of Machine 1 is spark1 and the hostname of Machine 2 is spark2.

a. Untar zookeeper-3.3.6.tar.gz
b. Change to the conf directory
c. Create a new file named zoo.cfg
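The three steps above might be run as follows (version and paths as used in this article):

```shell
tar -xzf zookeeper-3.3.6.tar.gz
cd zookeeper-3.3.6/conf
# zoo.cfg does not exist by default; create it (zoo_sample.cfg in the
# same directory can be used as a starting point) and add the content
# shown below.
touch zoo.cfg
```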

Add the following content:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
dataDir=/home/dd/zookeeper/data/
# the port at which the clients will connect
clientPort=2181
dataLogDir=/home/dd/zookeeper/logs/
server.1=spark1:2888:3888
server.2=spark2:2889:3889

Note: create a file named myid in the "dataDir" (/home/dd/zookeeper/data) on both machines.

On Machine 1, type just 1 in the myid file and save.

On Machine 2, type just 2 in the myid file and save.
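A minimal sketch of creating the myid file (a temporary directory is used here for illustration; on the real machines the directory is the dataDir from zoo.cfg):

```shell
# On the real machines use the dataDir from zoo.cfg
# (/home/dd/zookeeper/data); a temp dir stands in for it here.
DATA_DIR=$(mktemp -d)

# On Machine 1 (spark1) the file contains only the digit 1:
echo 1 > "$DATA_DIR/myid"

cat "$DATA_DIR/myid"    # prints 1; on Machine 2 the file would contain 2
```

The number in myid must match the `server.N` entry for that machine in zoo.cfg.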

4. Configuration of HA in Apache Spark, a Quickstart

4.1 Configure both the machines

a. Create a file named ha.conf in SPARK_HOME
b. Add the following content to ha.conf:
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=spark1:2181,spark2:2181
spark.deploy.zookeeper.dir=/home/dd/spark-1.5.1-bin-hadoop2.6/spark

5. Steps

5a) To set in /etc/hosts

The /etc/hosts file maps hostnames to IP addresses; this is a one-time setup.

Machine 1 – spark1

Machine 2 – spark2
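For example, /etc/hosts on both machines might contain entries like the following (the IP addresses here are placeholders; use the real addresses of your machines):

```
192.168.1.101   spark1
192.168.1.102   spark2
```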


5b) Start Zookeeper in both the Machines

Since this is a multi-node quorum, check which ZooKeeper node is the Leader and which is the Follower.
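ZooKeeper ships a control script for this; run the following on each machine from the ZooKeeper directory:

```shell
bin/zkServer.sh start

# After starting both nodes, check the role of each:
bin/zkServer.sh status   # "Mode: leader" on one node, "Mode: follower" on the other
```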


5c). Start Masters

Start the Spark Master on all the nodes with the command below, passing the properties file (ha.conf) in which you configured ZooKeeper.


sbin/start-master.sh -h spark1 -p 7077 --webui-port 8080 --properties-file ha.conf


sbin/start-master.sh -h spark2 -p 17077 --webui-port 18080 --properties-file ha.conf


5d). Start the Workers on both Machines

Workers are the slaves.

Machine1 - sbin/start-slave.sh spark://spark1:7077

Machine2 - sbin/start-slave.sh spark://spark2:17077
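With a ZooKeeper-backed standalone cluster, workers and applications can also be pointed at the full, comma-separated list of Masters so they can find the current leader and fail over; for example:

```shell
# Register a worker against both masters:
sbin/start-slave.sh spark://spark1:7077,spark2:17077

# Likewise, an application can be submitted against the full master list:
bin/spark-shell --master spark://spark1:7077,spark2:17077
```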


5e). Check the web UIs at "http://spark1:8080" and "http://spark2:18080"

Monitor the Spark UI and check the number of Alive Workers. The number of Workers is 2 with respect to this quickstart.

In spark1


In spark2


5f). To check for Active and Passive Master

Kill the Active Master manually, so that the Passive node becomes Active.
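One way to do this on the machine currently running the Active Master:

```shell
# Find the Master's PID (jps lists running JVM processes), then kill it:
jps | grep Master
kill <pid-of-master>

# Or simply stop it with the bundled script:
sbin/stop-master.sh
```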


5g). Check for Web UI again

The Active Master is down, as it was killed




The Passive node becomes the new Active Master.



Article written by DataDotz Team

DataDotz is a Chennai-based Big Data team primarily focused on consulting and training in technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search and Cloud Computing.

Note: DataDotz also provides classroom-based Apache Kafka training in Chennai. The course includes Cassandra, MongoDB, Scala and Apache Spark training. For more details related to Apache Spark training in Chennai, please visit the DataDotz website.