After « Building Hadoop for Oracle Linux 7 (2/13) », the next logical step is to deploy Hadoop on Oracle Linux 7. This third article of the series shows how to create a 3-node Hadoop cluster. Once the cluster is up and running, it digs into the processes, logs and consoles. The last step is to execute a sample MapReduce job.
The setup is very basic. Once the 3 servers are installed, it should not take more than a few minutes to deploy, configure and start Hadoop. The final infrastructure does not implement any of the security features one would configure on most production environments; more advanced authentication and authorization mechanisms are kept for later…
Architecture
You’ll find a schema of the final configuration below. The 3 servers are named yellow, pink and green. If you want to test it for yourself, you can use 3 VirtualBox VMs, each one configured with 2GB of RAM and Oracle Linux 7.
To speed up the configuration, it does not integrate with any Kerberos domain controller. There is no connection key and no data encryption. SELinux is disabled. All the network ports are opened. Service high availability and backups are not used. It is not properly sized for memory, CPU or network. The underlying storage technology is not even mentioned. This is, somehow, the simplest Hadoop cluster you will ever meet: not suitable for production purposes at all!
System Prerequisites
System and network infrastructure requires a few configuration steps before you proceed with Hadoop:
- Install the 3 servers with Oracle Linux 7. Make sure you install Java, openssh and rsync
- Deactivate SELinux and open the firewall
- IP connectivity should be operational as well as name resolution
- A system user should be present to install Hadoop as non-root. This article assumes it is named hadoop. Create an SSH equivalency between all the servers for that user
- A file system must be created to hold the Hadoop Distributed File System (HDFS)
Sections below provide some details about these configurations.
Yellow, pink and green
Install Oracle Linux 7 on the 3 servers. You should install the required RPMs, including openssh, rsync and Java SE 8, as described in the previous article.
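Before going further, a quick check on each server can confirm those prerequisites are in place. This is only a sketch: the exact Java version and package names depend on your installation:
# Sketch only: versions and package names depend on your installation
java -version
rpm -q openssh rsync
systemctl is-active sshd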
SELinux and Firewalld
Change the firewall from the public zone to the trusted zone and open that zone for any traffic:
systemctl enable firewalld
systemctl start firewalld
firewall-cmd --get-active-zones
firewall-cmd --permanent --zone trusted --add-port=1-65535/tcp
firewall-cmd --permanent --zone trusted --add-port=1-65535/udp
firewall-cmd --zone trusted --add-port=1-65535/tcp
firewall-cmd --zone trusted --add-port=1-65535/udp
firewall-cmd --set-default-zone trusted
firewall-cmd --get-active-zones
trusted
  interfaces: enp0s3
firewall-cmd --list-all
trusted (default, active)
  interfaces: enp0s3
  sources:
  services:
  ports: 1-65535/udp 1-65535/tcp
  masquerade: no
  forward-ports:
  icmp-blocks:
  rich rules:
Edit /etc/selinux/config and change its content to disable SELinux. Once done, reboot the server:
sed -i 's/^[ \t]*SELINUX[ \t]*=.*$/SELINUX=disabled/' \
  /etc/selinux/config
reboot
getenforce
IP Configuration and Name Resolution
Configure the network on the 3 servers and register all the IP addresses and names in /etc/hosts on every machine. If you prefer to use a DNS server, make sure names can be resolved from everywhere.
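For reference, a minimal /etc/hosts could look like the sketch below; the addresses are hypothetical and must match your own network:
# Hypothetical addresses for the 3 cluster nodes
127.0.0.1        localhost localhost.localdomain
192.168.56.101   yellow
192.168.56.102   pink
192.168.56.103   green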
The Hadoop user
Usually, a separate user is created per installed component, i.e. one for HDFS and one for YARN. To keep it simple, you can install everything as a single hadoop user:
groupadd -g 501 hadoop
useradd -u 501 -g hadoop -m -p hadoop -s /bin/bash hadoop
SSH equivalencies
The script below shows how to create SSH equivalencies for the hadoop user. Run it from every server to create the SSH private keys and authorize connections on the remote hosts:
cd ~
mkdir .ssh
chmod 700 .ssh
ssh-keygen -t dsa -q -N "" -f .ssh/id_dsa
ssh-copy-id -o StrictHostKeyChecking=no yellow
ssh-copy-id -o StrictHostKeyChecking=no green
ssh-copy-id -o StrictHostKeyChecking=no pink
Test the equivalencies: the script below tests the 9 SSH combinations. It should not prompt for any password:
for i in green pink yellow; do
  for j in green pink yellow; do
    ssh $i ssh $j "echo $i to $j: `date`"
  done
done
green to green: Wed Aug 6 20:51:25 CEST 2014
green to pink: Wed Aug 6 20:51:25 CEST 2014
green to yellow: Wed Aug 6 20:51:25 CEST 2014
pink to green: Wed Aug 6 20:51:25 CEST 2014
pink to pink: Wed Aug 6 20:51:26 CEST 2014
pink to yellow: Wed Aug 6 20:51:26 CEST 2014
yellow to green: Wed Aug 6 20:51:26 CEST 2014
yellow to pink: Wed Aug 6 20:51:26 CEST 2014
yellow to yellow: Wed Aug 6 20:51:26 CEST 2014
Hadoop Installation and Configuration
Installing and configuring Hadoop is straightforward:
- Untar the distribution,
- Set a few properties including server names, local storage for HDFS, Name Node address and Resource Manager host name.
Hadoop Installation
Untar Hadoop in /usr/local/hadoop. Before you proceed, make sure you’ve copied the distribution as /home/hadoop/hadoop-2.4.1.tar.gz on every server:
# As root:
cd /usr/local
mkdir hadoop
chown hadoop:hadoop hadoop
su - hadoop
# As hadoop:
cd /usr/local
tar -zxvf /home/hadoop/hadoop-2.4.1.tar.gz \
  --transform 's/hadoop-2.4.1/hadoop/'
HDFS underlying storage
Unlike regular filesystems, HDFS relies on underlying local filesystems. With Oracle Linux 7, you may want to use XFS. However, to keep it simpler, this configuration only relies on 2 directories in /, one for the data and one for the metadata. On yellow, the server holding the HDFS Name Node, run:
# As root
mkdir /u01
chown hadoop:hadoop /u01
su - hadoop
# As hadoop
mkdir /u01/metadata
mkdir /u01/files
You won’t need the metadata directory for now on pink and green:
# As root
mkdir /u01
chown hadoop:hadoop /u01
su - hadoop
# As hadoop
mkdir /u01/files
Cluster Properties
The number of Hadoop parameters is quite significant. This configuration sets the very few that are mandatory for this simple cluster:
- HDFS underlying directories
- The Name Node address and port
- The Resource Manager server name
- The list of slave servers
Note:
You can separate the Name Node and Resource Manager from the other slave servers. However, in order to have a large enough configuration, Data Nodes and Node Managers run on every server.
By default, the configuration directory is located in etc/hadoop in the Hadoop home. Files are XML files and the few changes that are needed are described below.
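As a quick orientation, assuming Hadoop has been untarred in /usr/local/hadoop as above, the files edited in the next sections can be listed with:
cd /usr/local/hadoop/etc/hadoop
ls -l core-site.xml hdfs-site.xml yarn-site.xml slaves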
HDFS directories
Local directory names for HDFS data and metadata are stored in hdfs-site.xml. dfs.datanode.data.dir and dfs.namenode.name.dir are the properties to change for your configuration. Below is an example of that file for the sample cluster:
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>dfs.datanode.data.dir</name> <value>/u01/files</value> </property> <property> <name>dfs.namenode.name.dir</name> <value>/u01/metadata</value> </property> </configuration>
Name Node Access Point
The Name Node address and port are defined by the fs.defaultFS property in core-site.xml. Below is an example of the file for the sample cluster:
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property> <name>fs.defaultFS</name> <value>hdfs://yellow:9000</value> </property> </configuration>
Resource Manager Server
The Resource Manager name is defined by the yarn.resourcemanager.hostname property in yarn-site.xml. Below is an example of the file for the sample cluster:
<?xml version="1.0"?> <configuration> <property> <name>yarn.resourcemanager.hostname</name> <value>pink</value> </property> </configuration>
Slave servers
Slave names are kept in the slaves file. This file is not mandatory, but it allows you to start all the slaves with a single command from the master node. In this sample cluster, the slaves file contains:
yellow
pink
green
Configuration File Copies
Once the settings are stored on one of the nodes, push them to the other ones. Assuming yellow was used to change the files, push them to the 2 other nodes:
for i in pink green; do
  scp /usr/local/hadoop/etc/hadoop/slaves \
    $i:/usr/local/hadoop/etc/hadoop/slaves
  scp /usr/local/hadoop/etc/hadoop/*-site.xml \
    $i:/usr/local/hadoop/etc/hadoop/.
done
Note:
This is not the only way to perform the replication; one can also rely on the HADOOP_MASTER variable and rsync to synchronize the configuration at startup time.
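For illustration only, a minimal sketch of that alternative with the /usr/local/hadoop layout used here is to set the variable in etc/hadoop/hadoop-env.sh on the slave servers; when it is set, the start scripts rsync the Hadoop home from that location before launching a daemon:
# Sketch only: add to etc/hadoop/hadoop-env.sh on pink and green
# hadoop-daemon.sh rsyncs this location over the local Hadoop home at startup
export HADOOP_MASTER=yellow:/usr/local/hadoop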
Formatting HDFS
Last, format the HDFS filesystem from the Name Node, yellow in this case:
bin/hdfs namenode -format color
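If the format succeeds, the metadata directory created earlier on yellow should now be populated; as a quick sanity check (file names may vary slightly between releases):
ls /u01/metadata/current
# Expected: a VERSION file and an initial fsimage, among others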
HDFS and YARN startup
You can start the HDFS Name Node and the Data Nodes from the Name Node. Run the 3 lines below from yellow:
cd /usr/local/hadoop
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
You can start the YARN Resource Manager, the Job History server as well as all the Node Managers from the Resource Manager node. Run the 4 commands below from pink:
cd /usr/local/hadoop
sbin/yarn-daemon.sh start resourcemanager
sbin/mr-jobhistory-daemon.sh start historyserver
sbin/yarn-daemons.sh start nodemanager
Hadoop Post-Installation Checks
A first level of checks for the configuration consists in making sure the Hadoop Java processes are running as expected:
for i in yellow pink green; do
  echo "${i}:"
  ssh $i "ps -u hadoop -o user,pid,args |grep [j]ava|cut -c1-80"
done
yellow:
hadoop 3664 /usr/java/jdk1.8.0_11/bin/java -Dproc_namenode -Xmx1000m -Djava.n
hadoop 3786 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop 3931 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
pink:
hadoop 5037 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop 5143 /usr/java/jdk1.8.0_11/bin/java -Dproc_resourcemanager -Xmx1000m -
hadoop 5203 /usr/java/jdk1.8.0_11/bin/java -Dproc_historyserver -Xmx1000m -Dj
hadoop 5264 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
green:
hadoop 3260 /usr/java/jdk1.8.0_11/bin/java -Dproc_datanode -Xmx1000m -Djava.n
hadoop 3399 /usr/java/jdk1.8.0_11/bin/java -Dproc_nodemanager -Xmx1000m -Dhad
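As an alternative sketch, assuming the JDK jps utility is in the hadoop user's PATH on every server, the same check can be done with jps, which prints the Java processes with their main class:
for i in yellow pink green; do
  echo "${i}:"
  ssh $i jps
done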
To go deeper, check the log files in the log directories and connect to the web consoles. By default, the Name Node console listens on port 50070, here on yellow; its « Data Nodes » menu presents the running Data Nodes. Every Data Node also has a console listening on port 50075. The YARN Resource Manager listens on port 8088 by default; its « Nodes » menu shows the Node Managers that are running. The Job History console can be accessed on port 19888 and the Node Manager consoles listen on port 8042.
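If you prefer the command line to a browser, a small sketch (assuming curl is installed on the machine you test from) to verify that each console answers on its default port:
curl -s -o /dev/null -w "%{http_code}\n" http://yellow:50070   # Name Node
curl -s -o /dev/null -w "%{http_code}\n" http://yellow:50075   # Data Node on yellow
curl -s -o /dev/null -w "%{http_code}\n" http://pink:8088      # Resource Manager
curl -s -o /dev/null -w "%{http_code}\n" http://pink:19888     # Job History
curl -s -o /dev/null -w "%{http_code}\n" http://yellow:8042    # Node Manager on yellow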
Sample MapReduce Job
To check the cluster, execute a MapReduce job. To begin, create a /user/hadoop directory in HDFS:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
Then copy the configuration directory into an HDFS input directory. That directory will, by default, be stored in /user/hadoop:
bin/hdfs dfs -put etc/hadoop input
bin/hdfs dfs -ls
Found 1 items
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:44 input
Run the « grep » example that comes with the Hadoop distribution. That program is a MapReduce job that searches a directory for a regular expression and counts the number of matching expressions. The result is stored in another directory in HDFS:
bin/hadoop jar \
  share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.1.jar \
  grep input output 'dfs[a-z.]+'
Check that the output directory has been created and the results are stored in the part-r-00000 file:
bin/hdfs dfs -ls
Found 2 items
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:44 input
drwxr-xr-x   - hadoop supergroup          0 2014-08-06 19:46 output
bin/hdfs dfs -ls output
Found 2 items
-rw-r--r--   3 hadoop supergroup          0 2014-08-06 19:46 output/_SUCCESS
-rw-r--r--   3 hadoop supergroup        227 2014-08-06 19:46 output/part-r-00000
bin/hdfs dfs -cat output/part-r-00000
6       dfs.audit.logger
4       dfs.class
3       dfs.server.namenode.
2       dfs.period
2       dfs.audit.log.maxfilesize
2       dfs.audit.log.maxbackupindex
1       dfsmetrics.log
1       dfsadmin
1       dfs.servers
1       dfs.namenode.name.dir
1       dfs.file
1       dfs.datanode.data.dir
To finish with the demo, delete all HDFS directories:
bin/hdfs dfs -rm -skipTrash -R /user
HDFS and YARN shutdown
To stop YARN, stop the Job History server, the Resource Manager and the Node Managers on all the slave servers. You only need to connect to pink as hadoop:
cd /usr/local/hadoop
sbin/mr-jobhistory-daemon.sh stop historyserver
sbin/yarn-daemon.sh stop resourcemanager
sbin/yarn-daemons.sh stop nodemanager
To stop HDFS, stop the Name Node and the Data Nodes on all the slave servers. You only need to connect to yellow as hadoop:
cd /usr/local/hadoop
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemons.sh stop datanode
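To double-check that everything is down, rerun the process check used earlier; it should not return any Java process for the hadoop user:
for i in yellow pink green; do
  echo "${i}:"
  ssh $i "ps -u hadoop -o user,pid,args |grep [j]ava|cut -c1-80"
done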
In this article, you’ve installed, configured and tested a 3-node Hadoop cluster. The next ones will explore the different components, from HDFS to Spark…
References:
To know more about Hadoop installation, refer to
[1] Hadoop MapReduce Next Generation – Cluster Setup