Spark has an advanced DAG execution engine that supports acyclic data flow and in-memory computing. It offers over 80 high-level operators that make it easy to build parallel apps, and you can use it interactively from the Scala, Python, and R shells.
1. Prerequisites for Spark High Availability
1.1 To set in .bashrc
(Note: this can be set in .bash_profile too.)
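The original post does not list the exact variables to set; a typical sketch for this guide, assuming Spark is unpacked under /home/dd/spark-1.5.1-bin-hadoop2.6 (the path used in ha.conf later) and a standard JDK location, would be:

```shell
# Hypothetical paths -- adjust JAVA_HOME and SPARK_HOME to your own layout.
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
export SPARK_HOME=/home/dd/spark-1.5.1-bin-hadoop2.6
# Put the Spark launch scripts (bin/ and sbin/) on the PATH.
export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

After editing .bashrc, run `source ~/.bashrc` (or open a new shell) for the variables to take effect.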
2. Installation of Apache Spark
(For the Apache Spark installation, kindly refer to our site, chennaihug.org.)
3. Installing Apache Zookeeper in Multiple Nodes
3.1. Download Apache ZooKeeper from the Apache ZooKeeper website, and check for the current version of Apache ZooKeeper before downloading.
3.2 Do the installation on both Machines
The host name of Machine 1 is spark1 and the host name of Machine 2 is spark2.
a. Untar zookeeper-3.3.6.tar.gz
b. Change to the conf directory
c. Create a new file named zoo.cfg and add the following content:
# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored
dataDir=/home/dd/zookeeper/data/
# the port at which the clients will connect
clientPort=2181
dataLogDir=/home/dd/zookeeper/logs/
server.1=spark1:2888:3888
server.2=spark2:2889:3889
Note: Create a file named myid in the dataDir (/home/dd/zookeeper/data) on both Machines.
On Machine 1, just type 1 in the myid file and save.
On Machine 2, just type 2 in the myid file and save.
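The myid step above can be scripted as follows; the path matches the dataDir from zoo.cfg, so adjust it if yours differs:

```shell
# Create the myid file inside the ZooKeeper dataDir (from zoo.cfg).
mkdir -p /home/dd/zookeeper/data
# Each server's myid must match its server.N entry in zoo.cfg.
echo 1 > /home/dd/zookeeper/data/myid   # on Machine 2, write 2 instead
```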
4. Configuration of HA in Apache Spark, a Quickstart
4.1 Configure in both the Machines
a. Create a file named ha.conf in SPARK_HOME
b. Add the following content to ha.conf:
spark.deploy.recoveryMode=ZOOKEEPER
spark.deploy.zookeeper.url=spark1:2181,spark2:2181
spark.deploy.zookeeper.dir=/home/dd/spark-1.5.1-bin-hadoop2.6/spark
5a) To set in /etc/hosts
The /etc/hosts file maps host names to IP addresses; this is a one-time setup.
Machine 1 – spark1
Machine 2 – spark2
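As a sketch, the /etc/hosts entries on both machines would look like the following; the IP addresses shown are placeholders, so substitute the actual addresses of your two machines:

```
# /etc/hosts (example IPs -- replace with your machines' real addresses)
192.168.1.101   spark1
192.168.1.102   spark2
```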
5b) Start Zookeeper in both the Machines
Start ZooKeeper on each machine, then check which node is the leader and which is the follower, since this is a multi-node ensemble:
bin/zkServer.sh start
bin/zkServer.sh status
5c). Start Masters
Start the Spark Master on each node with the commands below, passing the properties file (ha.conf) in which you configured ZooKeeper.
sbin/start-master.sh -h spark1 -p 7077 --webui-port 8080 --properties-file ha.conf
sbin/start-master.sh -h spark2 -p 17077 --webui-port 18080 --properties-file ha.conf
5d). Start the Workers on both Machines
Workers are the slave processes that run the tasks. Pass the URLs of both Masters (with the spark:// scheme) so that each Worker can register with whichever Master is currently the leader.
Machine1 - sbin/start-slave.sh spark://spark1:7077,spark2:17077
Machine2 - sbin/start-slave.sh spark://spark1:7077,spark2:17077
5e). Check the web UIs at "http://spark1:8080" and "http://spark2:18080"
Monitor the Spark web UI and check the number of alive Workers; for this Quickstart, the number of Workers is 2.
5f). To check for Active and Passive Master
Kill the Active Master process manually (for example, find its PID with jps and kill it), so that the Passive Master becomes Active.
5g). Check for Web UI again
The Active Master is down because it was killed, and the Passive Master has taken over as the new Active Master.
Article written by DataDotz Team
DataDotz is a Chennai-based Big Data team primarily focused on consulting and training in technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), Search, and Cloud Computing.