MultiNode Installation on AWS-EC2 Hadoop-2.x

Dear Hadoop Enthusiast,

As part of this tutorial, we will set up a 3-node Hadoop-2.x cluster on AWS EC2. Please make sure you have launched three EC2 instances; t2.micro instances work fine and are cheap. Let's name the three machines Machine-1, Machine-2, and Machine-3 for easy identification.

AWS EC2 Hadoop Cluster Wiki

MACHINE – 1

1. Download the Hadoop tarball from the link below (you can also take the link from the Apache Hadoop site). The current version is hadoop-2.6.0 as of writing this wiki (April 12, 2015); please check the Hadoop site for the latest release.
wget http://mirrors.gigenet.com/apache/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz

2. Download the JDK from Oracle. If the link is broken, or to get the current version, please check the Oracle site. The -O flag saves the download under a clean file name despite the query string in the URL:
wget -O jdk-7u60-linux-x64.tar.gz "http://download.java.net/jdk7u60/archive/b11/binaries/jdk-7u60-ea-bin-b11-linux-x64-19_mar_2014.tar.gz?q=download/jdk7u60/archive/b11/binaries/jdk-7u60-ea-bin-b11-linux-x64-19_mar_2014.tar.gz"

3. Unpack the archives:
$ tar -zxvf hadoop-2.6.0.tar.gz
$ tar -zxf jdk-7u60-linux-x64.tar.gz
(The JDK extracts to a directory named jdk1.7.0_60.)

4. Set the Java path in the Linux environment. You can add it in any one of /etc/profile, ~/.bash_profile, or ~/.bashrc; here we use ~/.bashrc. Edit .bashrc and add the two lines below:
$ vi ~/.bashrc
export JAVA_HOME=/home/ec2-user/jdk1.7.0_60
export PATH=$HOME/bin:$JAVA_HOME/bin:$PATH
Source the .bashrc file so the changes take effect immediately in the current ssh session:
$ source ~/.bashrc
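
To confirm the environment took effect, check that the shell now picks up the new JDK (paths assume the JDK unpacked to the directory above):

$ echo $JAVA_HOME    # should print /home/ec2-user/jdk1.7.0_60
$ java -version      # should report version 1.7.0_60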

5. Core Hadoop 2.x is made up of five daemons: NameNode (NN), SecondaryNameNode (SN), DataNode (DN), ResourceManager (RM), and NodeManager (NM). We need to modify the configuration files under etc/hadoop. Below are the files relevant to each daemon.

NAMENODE           core-site.xml
SECONDARYNAMENODE  masters
DATANODE           slaves
RESOURCEMANAGER    yarn-site.xml & mapred-site.xml
NODEMANAGER        slaves & yarn-site.xml

Ports used by Hadoop Daemons
Remote Procedure Call (RPC) is a protocol that one program can use to request a service from a program on another computer in a network without having to understand the network's details. The Web UI column lists the HTTP port each daemon's web interface is served on. On EC2, these ports must also be allowed in the instances' security group (at minimum between the cluster's nodes, plus the web UI ports from wherever you browse).

Hadoop Daemon      RPC Port                                Web UI
NameNode           50000 (8020 by default; we bind 50000)  50070
SecondaryNameNode  -                                       50090
DataNode           50010                                   50075
ResourceManager    8030                                    8088
NodeManager        8040                                    8042
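
Once the daemons are running (later in this guide), a quick way to confirm a daemon is actually listening on its port (a sketch; ss ships with most modern Linux distributions, otherwise use netstat -tlnp):

$ ss -tln | grep 50070    # NameNode web UI should show up as LISTEN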

It's best practice to add each machine's IP address and hostname to the hosts file. If your network is backed by a DNS server, the changes below are not needed. Edit /etc/hosts:
$ sudo vi /etc/hosts

10.0.0.2 datadotz_master
10.0.0.3 datadotz_slave1
10.0.0.4 datadotz_slave2
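
A quick check that the names resolve as intended (the IPs above are examples; use your instances' actual private IPs):

$ ping -c 1 datadotz_master      # should resolve to 10.0.0.2
$ getent hosts datadotz_slave1   # shows the mapping even if ICMP is blocked by the security group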

Below are the minimal changes to the Hadoop configuration files.
$cd hadoop-2.6.0

$ vi etc/hadoop/core-site.xml

<!-- Default filesystem URI: which host/IP and port the NameNode binds to.
(fs.default.name is deprecated in Hadoop 2.x in favor of fs.defaultFS; both still work.) -->
<property>
<name>fs.default.name</name>
<value>hdfs://datadotz_master:50000</value>
</property>

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ vi etc/hadoop/mapred-site.xml

<!-- Run MapReduce jobs on the YARN framework -->
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

$ vi etc/hadoop/hdfs-site.xml

<!-- Directory where the NameNode stores its metadata -->
<property>
<name>dfs.namenode.name.dir</name>
<value>/home/ec2-user/hadoop2-dir/namenode-dir</value>
</property>

<!-- Directory where the DataNode stores blocks and related data -->
<property>
<name>dfs.datanode.data.dir</name>
<value>/home/ec2-user/hadoop2-dir/datanode-dir</value>
</property>

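<!-- Disable HDFS permission checks (convenient for a tutorial cluster; leave enabled in production) -->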
<property>
<name>dfs.permissions</name>
<value>false</value>
</property>

$ vi etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

(On a multi-node cluster you may also need to set yarn.resourcemanager.hostname to the master's hostname in this file; by default, NodeManagers on the slave machines look for the ResourceManager on the local machine.)

$ vi etc/hadoop/hadoop-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/yarn-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/mapred-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

The masters file names the host that runs the SecondaryNameNode (Machine-2 here):

$ vi etc/hadoop/masters

datadotz_slave1

The slaves file lists every host that runs a DataNode and a NodeManager; the master is included, so it runs both as well:

$ vi etc/hadoop/slaves

datadotz_master
datadotz_slave1
datadotz_slave2

————————————————————————————————————–

Passwordless Authentication. Scripts such as start-all.sh and stop-all.sh log in (over ssh) to the other machines from the machine where you run them, typically the NameNode machine. Each login prompts for a password, so on a 10-node cluster you would type a password at least 10 times. To avoid this, we set up passwordless authentication: generate an ssh key pair and append the public key to the authorized_keys file of each destination machine.

Below are the steps.

Install the OpenSSH server if it is not already present (it usually is on EC2 AMIs). On Debian/Ubuntu:

$ sudo apt-get install openssh-server

(On Amazon Linux, use sudo yum install openssh-server instead.)

$ cd
$ ssh-keygen -t rsa
$ cd .ssh
$ cat id_rsa.pub >> authorized_keys
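
sshd is strict about key file permissions; if passwordless login still prompts for a password, tighten them (a common fix, assuming the default ~/.ssh layout):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys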

Set up passwordless ssh to localhost (the same machine) and to the slaves (the other machines), then verify:

$ ssh localhost

(or use the machine's IP address; it should log in without prompting for a password)
————————————————————————————————————–

MACHINE – 2

Steps 1 through 4 are identical to Machine-1: download and unpack the Hadoop and JDK tarballs, set JAVA_HOME and PATH in ~/.bashrc, and add the same three /etc/hosts entries. Only the Hadoop configuration below differs.

Hadoop Configuration
$cd hadoop-2.6.0

$ vi etc/hadoop/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://datadotz_master:50000</value>
</property>

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ vi etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

$ vi etc/hadoop/hdfs-site.xml

<property>
<name>dfs.datanode.data.dir</name>
<value>/home/ec2-user/hadoop2-dir/datanode-dir</value>
</property>

$ vi etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

$ vi etc/hadoop/hadoop-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/yarn-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/mapred-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60
————————————————————————————————————–
Set up passwordless authentication exactly as on Machine-1: generate the ssh key pair and append the public key to authorized_keys.

$ cd
$ ssh-keygen -t rsa
$ cd .ssh
$ cat id_rsa.pub >> authorized_keys

Set up passwordless ssh to localhost and to the slaves, then verify:

$ ssh localhost

(it should log in without prompting for a password)
————————————————————————————————————–

Copy the NameNode machine's id_rsa.pub key and append it to each slave machine's authorized_keys file.
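
One way to do this from Machine-1, assuming password logins are disabled (the EC2 default) so ssh-copy-id won't work directly; the .pem path below is a placeholder for the EC2 key pair used to launch the instances:

$ cat ~/.ssh/id_rsa.pub | ssh -i /path/to/ec2-keypair.pem ec2-user@datadotz_slave1 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
$ cat ~/.ssh/id_rsa.pub | ssh -i /path/to/ec2-keypair.pem ec2-user@datadotz_slave2 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'

(If password authentication is enabled on the slaves, ssh-copy-id ec2-user@datadotz_slave1 achieves the same in one step.)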

————————————————————————————————————–

MACHINE – 3

Steps 1 through 4 are identical to Machine-1: download and unpack the Hadoop and JDK tarballs, set JAVA_HOME and PATH in ~/.bashrc, and add the same three /etc/hosts entries. Only the Hadoop configuration below differs.

Hadoop Configuration

$cd hadoop-2.6.0

$ vi etc/hadoop/core-site.xml

<property>
<name>fs.default.name</name>
<value>hdfs://datadotz_master:50000</value>
</property>

$ cp etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml
$ vi etc/hadoop/mapred-site.xml

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

$ vi etc/hadoop/hdfs-site.xml

<property>
<name>dfs.datanode.data.dir</name>
<value>/home/ec2-user/hadoop2-dir/datanode-dir</value>
</property>

$ vi etc/hadoop/yarn-site.xml

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

<property>
<name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>

$ vi etc/hadoop/hadoop-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/yarn-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60

$ vi etc/hadoop/mapred-env.sh

export JAVA_HOME=/home/ec2-user/jdk1.7.0_60
————————————————————————————————————–
Set up passwordless authentication exactly as on Machine-1: generate the ssh key pair and append the public key to authorized_keys.

$ cd
$ ssh-keygen -t rsa
$ cd .ssh
$ cat id_rsa.pub >> authorized_keys

Set up passwordless ssh to localhost and to the slaves, then verify:

$ ssh localhost

(it should log in without prompting for a password)

————————————————————————————————————–
Copy the NameNode machine's id_rsa.pub key and append it to each slave machine's authorized_keys file, as shown after Machine-2's section.
————————————————————————————————————–
Master Node (Namenode Machine) (MACHINE – 1)
————————————————————————————————————–

Format the Hadoop NameNode (run once, on Machine-1 only; this initializes the HDFS filesystem):

$ cd hadoop-2.6.0
$ bin/hdfs namenode -format

(Note the single dash in -format. The older bin/hadoop namenode -format still works in 2.x but prints a deprecation warning.)

Start all Hadoop related services:
$ sbin/start-all.sh
(starts the daemons for DFS and YARN)
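
start-all.sh is deprecated in Hadoop 2.x (it still works, and simply delegates to the scripts below); the non-deprecated equivalent is to start HDFS and YARN separately:

$ sbin/start-dfs.sh
$ sbin/start-yarn.sh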

$ jps (jps lists the Java processes running on the machine)
MACHINE – 1

NameNode
DataNode
ResourceManager
NodeManager

MACHINE – 2

$ jps

DataNode
SecondaryNameNode
NodeManager

MACHINE – 3

$ jps

DataNode
NodeManager
All the Java processes are running successfully. Let's confirm the same through the web UIs by browsing to the NameNode and ResourceManager pages. (From a browser outside AWS, substitute the master's EC2 public DNS name and make sure the security group allows the port.)

NameNode : datadotz_master:50070
ResourceManager : datadotz_master:8088
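
As a quick sanity check before stopping the cluster, write a file into HDFS and confirm all three DataNodes report in (a minimal sketch; the HDFS paths are arbitrary):

$ bin/hdfs dfs -mkdir -p /tmp/smoke
$ bin/hdfs dfs -put etc/hadoop/core-site.xml /tmp/smoke/
$ bin/hdfs dfs -ls /tmp/smoke
$ bin/hdfs dfsadmin -report    # should show 3 live DataNodes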

$ sbin/stop-all.sh (stops all Hadoop related services)

———————————-

Article written by DataDotz Team

DataDotz is a Chennai-based big data team primarily focused on consulting and training in technologies such as Apache Hadoop, Apache Spark, NoSQL (HBase, Cassandra, MongoDB), search, and cloud computing.

Note: DataDotz also provides classroom-based Apache Kafka training in Chennai. The course includes Cassandra, MongoDB, Scala, and Apache Spark training. For more details on Apache Spark training in Chennai, please visit http://datadotz.com/training/