Hadoop for DBAs (3/13): A Simple 3-Node Hadoop Cluster

After « Building Hadoop for Oracle Linux 7 (2/13) », the natural next step is to deploy Hadoop on Oracle Linux 7. This third article of the series shows how to create a 3-node Hadoop cluster. Once the cluster is up and running, the article digs into the processes, logs and consoles. The last step is to execute a sample MapReduce job.
The setup is very basic. Once the 3 servers are installed, it should not take more than a few minutes to deploy, configure and start Hadoop. The final infrastructure does not implement any of the security features one would configure on most production environments. In particular, it leaves the more advanced authentication and authorization mechanisms for later…

Architecture

You’ll find a diagram of the final configuration below. The 3 servers are named yellow, pink and green. If you want to test it yourself, you can use 3 VirtualBox VMs, each one configured with 2GB of RAM and Oracle Linux 7.
Hadoop Cluster
To speed up the configuration, it does not integrate with any Kerberos domain controller. There is no connection key and no data encryption. SELinux is disabled. All the network ports are open. Service high availability and backups are not used. It is not properly sized for memory, CPU or network. The underlying storage technology is not even mentioned. This is, in a way, the simplest Hadoop cluster you will ever meet: not suitable for production use at all!

System Prerequisites

System and network infrastructure requires a few configuration steps before you proceed with Hadoop:

  • Install the 3 servers with Oracle Linux 7. Make sure you install Java, openssh and rsync
  • Deactivate SELinux and open the firewall
  • IP connectivity should be operational as well as name resolution
  • A system user should be present to install Hadoop as a non-root user. This article assumes it is named hadoop. Create an SSH equivalency between all the servers for that user
  • A file system must be created to hold the Hadoop Distributed File System (HDFS)

The sections below provide more details about each of these steps.

Yellow, pink and green

Install Oracle Linux 7 on the 3 servers. You should install the required RPMs including openssh, rsync and Java SE 8 as described in the previous article.

SELinux and Firewalld

Change the firewall from the public zone to the trusted zone and open that zone for any traffic:

systemctl enable firewalld
systemctl start firewalld
firewall-cmd --get-active-zones
firewall-cmd --permanent --zone trusted --add-port=1-65535/tcp
firewall-cmd --permanent --zone trusted --add-port=1-65535/udp
firewall-cmd --zone trusted --add-port=1-65535/tcp
firewall-cmd --zone trusted --add-port=1-65535/udp
firewall-cmd --set-default-zone trusted
firewall-cmd --get-active-zones
trusted
  interfaces: enp0s3
firewall-cmd --list-all
trusted (default, active)
  interfaces: enp0s3
  sources:
  services:
  ports: 1-65535/udp 1-65535/tcp
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:

Disable SELinux by editing /etc/selinux/config as shown below, then reboot the server and check the result with getenforce:

sed -i 's/^[ \t]*SELINUX[ \t]*=.*$/SELINUX=disabled/' \
    /etc/selinux/config
reboot
getenforce

IP Configuration and Name Resolution

Configure the network on the 3 servers and register all the IP Addresses and names in /etc/hosts for every machine. If you prefer to use a DNS server, make sure names can be resolved from everywhere.
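
As an illustration, the /etc/hosts file could contain entries like the ones below on every machine; the IP addresses are placeholders for this example, adjust them to your own network:

# Sample /etc/hosts entries for the 3 nodes (addresses are examples only)
192.168.56.101   yellow
192.168.56.102   pink
192.168.56.103   green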

The Hadoop user

Usually, one user is created per installed component, e.g. one for HDFS and one for YARN. To make it simple, you can install everything as a single hadoop user:

groupadd -g 501 hadoop
useradd -u 501 -g hadoop -m -s /bin/bash hadoop
# useradd -p expects an already-encrypted password; set it explicitly instead
echo "hadoop" | passwd --stdin hadoop

SSH equivalencies

The script below shows how to create SSH equivalencies for the hadoop user. Run it from every server to create the SSH private keys and authorize connections on the remote hosts:

cd ~
mkdir .ssh
chmod 700 .ssh
ssh-keygen -t dsa -q -N "" -f .ssh/id_dsa
ssh-copy-id -o StrictHostKeyChecking=no yellow
ssh-copy-id -o StrictHostKeyChecking=no green
ssh-copy-id -o StrictHostKeyChecking=no pink

Test equivalencies; the script below tests the 9 SSH combinations. It should not prompt for any password:

for i in green pink yellow; do
   for j in green pink yellow; do
      ssh $i ssh $j "echo $i to $j: `date`"
   done
done
green to green: Wed Aug 6 20:51:25 CEST 2014
green to pink: Wed Aug 6 20:51:25 CEST 2014
green to yellow: Wed Aug 6 20:51:25 CEST 2014
pink to green: Wed Aug 6 20:51:25 CEST 2014
pink to pink: Wed Aug 6 20:51:26 CEST 2014
pink to yellow: Wed Aug 6 20:51:26 CEST 2014
yellow to green: Wed Aug 6 20:51:26 CEST 2014
yellow to pink: Wed Aug 6 20:51:26 CEST 2014
yellow to yellow: Wed Aug 6 20:51:26 CEST 2014

Hadoop Installation and Configuration

Installing and configuring Hadoop is straightforward:

  • Untar the distribution,
  • Set a few properties including server names, local storage for HDFS, Name Node address and Resource Manager host name.

Hadoop Installation

Untar Hadoop in /usr/local/hadoop. Before you proceed, make sure you’ve copied the distribution as /home/hadoop/hadoop-2.4.1.tar.gz on every server:

# As root:
cd /usr/local
mkdir hadoop
chown hadoop:hadoop hadoop
su - hadoop
# As hadoop:
cd /usr/local
tar -zxvf /home/hadoop/hadoop-2.4.1.tar.gz \
   --transform 's/hadoop-2.4.1/hadoop/'
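
As a quick sanity check of the extraction, and assuming the paths used above, you can display the Hadoop version; if the command complains about JAVA_HOME, export it in your environment or in etc/hadoop/hadoop-env.sh:

# Quick check of the extracted distribution
/usr/local/hadoop/bin/hadoop version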

HDFS underlying storage

Unlike regular filesystems, HDFS relies on underlying local filesystems. With Oracle Linux 7, you may want to use XFS. However, to keep it simple, this configuration only relies on 2 directories under /, one for the data and one for the metadata. On yellow, the server holding the HDFS Name Node, run:

# As root
mkdir /u01
chown hadoop:hadoop /u01
su - hadoop
# As hadoop
mkdir /u01/metadata
mkdir /u01/files

On Pink and Green, the metadata directory is not needed for now, but you can create the same structure for consistency:

# As root
mkdir /u01
chown hadoop:hadoop /u01
su - hadoop
# As hadoop
mkdir /u01/metadata
mkdir /u01/files

Cluster Properties

The number of Hadoop parameters is quite significant. This configuration sets the very few that are mandatory for this simple cluster:

  • HDFS underlying directories
  • The Name Node address and port
  • The Resource Manager server name
  • The list of slave servers

Note:
You could separate the Name Node and Resource Manager from the slave servers. However, in order to get a large-enough cluster out of only 3 servers, Data Nodes and Node Managers run on every server here.

By default, the configuration directory is etc/hadoop under the Hadoop home. The files are XML files, and the few changes that are needed are described below.

HDFS directories

Local directory names for HDFS data and metadata are stored in hdfs-site.xml. dfs.datanode.data.dir and dfs.namenode.name.dir are the properties to change for your configuration. Below is an example of that file for the sample cluster:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>/u01/files</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>/u01/metadata</value>
  </property>
</configuration>

Name Node Access Point

Name Node address and port are defined by the fs.defaultFS property in core-site.xml. Below is an example of the file for the sample cluster:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
   <property>
        <name>fs.defaultFS</name>
        <value>hdfs://yellow:9000</value>
    </property>
</configuration>
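
To make sure the XML file is well-formed and the property is taken into account, you can ask Hadoop to echo it back; a quick check, run from /usr/local/hadoop:

# Should print hdfs://yellow:9000 if core-site.xml is correct
bin/hdfs getconf -confKey fs.defaultFS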

Resource Manager Server

Resource Manager name is defined by the yarn.resourcemanager.hostname property in yarn-site.xml. Below is an example of the file for the sample cluster:

<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>pink</value>
    </property>
</configuration>

Slave servers

Slave names are kept in the slaves file. This file is not mandatory, but it allows all the slaves to be started with a single command from the master node. In this sample cluster, the slaves file contains:

yellow
pink
green
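
One possible way to create that file, assuming the installation path used above, is a simple here-document:

# Create etc/hadoop/slaves with the 3 node names
cat > /usr/local/hadoop/etc/hadoop/slaves <<EOF
yellow
pink
green
EOF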

Configuration File Copies

Once the settings are stored on one of the nodes, push them to the other ones. Assuming yellow was used to change the files, push them to the 2 other nodes:

for i in pink green; do
   scp /usr/local/hadoop/etc/hadoop/slaves \
          $i:/usr/local/hadoop/etc/hadoop/slaves
   scp /usr/local/hadoop/etc/hadoop/*-site.xml \
          $i:/usr/local/hadoop/etc/hadoop/.
done

Note:
This is not the only way to perform the replication: you can also rely on the HADOOP_MASTER variable and rsync to synchronize the configuration at startup time, as sketched below.
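
As a sketch of that alternative, which is not used in this setup, you would export HADOOP_MASTER on the slave nodes so that hadoop-daemon.sh rsyncs the installation from the master before starting the daemons:

# In etc/hadoop/hadoop-env.sh on pink and green (sketch only, not used here)
export HADOOP_MASTER=yellow:/usr/local/hadoop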

Formatting HDFS

Last, format the HDFS filesystem from the Name Node, yellow in this case:

cd /usr/local/hadoop
bin/hdfs namenode -format color

HDFS and YARN startup

You can start HDFS Name Node and Data Nodes from the Name Node. Run the 3 lines below from Yellow:

cd /usr/local/hadoop
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
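
Once the daemons are started, a quick way to check that the 3 Data Nodes have registered with the Name Node is the dfsadmin report, still run from /usr/local/hadoop:

# Lists the live Data Nodes and their capacity
bin/hdfs dfsadmin -report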

You can start the YARN Resource Manager, the Job History Server as well as all the Node Managers from the Resource Manager node. Run the 4 commands below from Pink:

cd /usr/local/hadoop
sbin/yarn-daemon.sh start resourcemanager
sbin/mr-jobhistory-daemon.sh start historyserver
sbin/yarn-daemons.sh start nodemanager
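
Likewise, you can list the Node Managers registered with the Resource Manager, from /usr/local/hadoop on Pink:

# Lists the running Node Managers
bin/yarn node -list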

Hadoop Post-Installation Checks

A first level of checks for the configuration consists of making sure the Hadoop Java processes are running as expected:

for i in yellow pink green; do
   echo "${i}:"
   ssh $i "ps -u hadoop -o user,pid,args |grep [j]ava|cut -c1-80"
done
yellow:
hadoop    3664 /usr/java/jdk1.8.0_11/bin/java -Dproc_namenode -Xmx1000m -Djava.n
hadoop    3786 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop    3931 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
pink:
hadoop    5037 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop    5143 /usr/java/jdk1.8.0_11/bin/java -Dproc_resourcemanager -Xmx1000m -
hadoop    5203 /usr/java/jdk1.8.0_11/bin/java -Dproc_historyserver -Xmx1000m -Dj
hadoop    5264 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
green:
hadoop    3260 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop    3399 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
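
Alternatively, the jps utility shipped with the JDK gives a more compact view of the same Java processes, provided the JDK bin directory is in the hadoop user's PATH; a possible variant of the loop above:

# List the Hadoop daemons by class name on every node
for i in yellow pink green; do
   echo "${i}:"
   ssh $i "jps | grep -v Jps"
done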

To go deeper, check the log files in the log directories and connect to the web consoles. By default, the Name Node console listens on port 50070, here on yellow; its « Data Nodes » menu lists the running Data Nodes. Every Data Node also has a console, listening on port 50075. The YARN Resource Manager console listens on port 8088 by default; its « Nodes » menu shows the Node Managers that are running. The Job History console can be accessed on port 19888, and each Node Manager console listens on port 8042.
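
If you prefer a quick command-line check that the consoles answer, a curl against each port should return an HTTP 200 status code; the host names and ports follow the configuration above, and -L follows the initial redirect some consoles send:

# The Name Node, Resource Manager and Job History consoles should all answer with 200
curl -s -L -o /dev/null -w "%{http_code}\n" http://yellow:50070/
curl -s -L -o /dev/null -w "%{http_code}\n" http://pink:8088/
curl -s -L -o /dev/null -w "%{http_code}\n" http://pink:19888/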

Sample MapReduce Job

To check the cluster, execute a MapReduce job. To begin, create a /user/hadoop directory in HDFS:

bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop

Then copy the configuration directory into an HDFS input directory. That directory will, by default, be created under /user/hadoop:

bin/hdfs dfs -put etc/hadoop input
bin/hdfs dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:44 input

Run the « grep » example that comes with the Hadoop distribution. That program is a MapReduce job that searches a directory for a regular expression and counts the number of matching expressions. The result is stored in another HDFS directory:

bin/hadoop jar \
    share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar \
    grep input output 'dfs[a-z.]+'
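
While the job is running, you can also follow it from the command line, for instance with:

# Shows the applications currently tracked by the Resource Manager
bin/yarn application -list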

Check that the output directory has been created and that the results are stored in the part-r-00000 file:

bin/hdfs dfs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:44 input
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:46 output
bin/hdfs dfs -ls output
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-08-06 19:46 output/_SUCCESS
-rw-r--r--   3 hadoop supergroup        227 2014-08-06 19:46 output/part-r-00000
bin/hdfs dfs -cat output/part-r-00000
6	dfs.audit.logger
4	dfs.class
3	dfs.server.namenode.
2	dfs.period
2	dfs.audit.log.maxfilesize
2	dfs.audit.log.maxbackupindex
1	dfsmetrics.log
1	dfsadmin
1	dfs.servers
1	dfs.namenode.name.dir
1	dfs.file
1	dfs.datanode.data.dir

To finish with the demo, delete all HDFS directories:

bin/hdfs dfs -rm -skipTrash -R /user
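
You can check that nothing is left:

# /user should no longer appear
bin/hdfs dfs -ls /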

HDFS and YARN shutdown

To stop YARN, stop the Job History Server, the Resource Manager and the Node Managers on all the slave servers. You only need to connect to Pink as the hadoop user:

cd /usr/local/hadoop
sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemons.sh stop nodemanager

To stop HDFS, stop the Name Node and the Data Nodes on all the slave servers. You only need to connect to Yellow as the hadoop user:

cd /usr/local/hadoop
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemons.sh stop datanode

In this article, you’ve installed, configured and tested a 3-node Hadoop cluster. The next articles will explore the different components, from HDFS to Spark…
References:

To know more about Hadoop installation, refer to
[1] Hadoop MapReduce Next Generation – Cluster Setup
