HDFS, the Hadoop Distributed File System, provides high availability through replication and automatic failover. This 4th article of the Hadoop for DBAs series digs into HDFS concepts. It is probably the most fun part for DBAs to learn, and also the closest to what you are used to. There are dozens of features worth exploring. This article focuses on some of the concepts and on what happens under the hood. It also demonstrates HDFS high availability for DataNodes and the NameNode. Its 3 sections are independent and demonstrate:
- A First Look at HDFS Underlying Structures;
- HDFS Replication and DataNode Failure;
- Quorum Journal Manager and NameNode Failure.
HDFS is a file system and, in some ways, something you are already very used to. It provides parameters to bypass caches or to force data into caches. You can influence block distribution and geo-distribution, as well as the level of replication. It is built to feed processes running in parallel. It can be accessed in many different ways, including a CLI, Java, C and REST APIs. It provides advanced features like snapshots and rolling upgrades. You can move data dynamically and also extend or shrink your cluster. It is impressive!
On the other hand, you probably don't know anything quite like HDFS yet. It could be because there are not that many distributed file systems anyway. Plus, it is written in Java to store blocks of files across thousands of servers. It is built for large files written by one process at a time. It only supports appending data to existing files, and it relies on the local file systems of every node. So it is not something you would have needed anyway, unless you want to deal with at least a few terabytes of data and don't want to invest in proprietary hardware and software. But enough of the overview, let us look inside… Before you go ahead, make sure you have installed and started HDFS as described in Hadoop for DBAs (3/13): A Simple 3-Node Hadoop Cluster.
A First Look at HDFS Underlying Structures
As you’ve figured out by installing and starting it, HDFS relies on 2 main components:
- The NameNode manages metadata, including directories and files, file-to-block mappings, ongoing and some past operations, as well as the status of DataNodes
- DataNodes store, as you can expect, data blocks that are chunks of files
The number of operations performed by the NameNode is kept minimal so that HDFS can scale out without making the infrastructure more complex by adding namespaces. As a result, clients exchange data directly with DataNodes. The diagram below shows how a client interacts with HDFS when storing a file (the WebHDFS sketch after the list illustrates this two-step exchange from the client's point of view):
- The client coordinates with the NameNode to get and store metadata. Once done, it sends the file blocks to a DataNode
- The DataNode replicates the data to other DataNodes, which send feedback to the NameNode so that it can keep track of the metadata
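You can observe this direct client-to-DataNode exchange with the REST API mentioned earlier. The sketch below assumes WebHDFS is enabled on the cluster (dfs.webhdfs.enabled set to true); the file name, the redirect target and the truncated query strings are only illustrative:

# ask the NameNode to create a file; it answers with a redirect to a DataNode
curl -i -X PUT "http://yellow:50070/webhdfs/v1/user/hadoop/sample.txt?op=CREATE&user.name=hadoop"
# HTTP/1.1 307 TEMPORARY_REDIRECT
# Location: http://pink.resetlogs.com:50075/webhdfs/v1/user/hadoop/sample.txt?op=CREATE&...

# send the file content directly to the DataNode returned in the Location header
curl -i -X PUT -T /tmp/sample.txt "http://pink.resetlogs.com:50075/webhdfs/v1/user/hadoop/sample.txt?op=CREATE&..."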
To check the details of those operations, clean up your HDFS file system and restart the NameNode. You will find an fsImage file in the metadata directory. From the yellow server, run:
bin/hdfs dfs -rm -R -f -skipTrash /user/hadoop/*
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh start namenode
Once the NameNode has restarted, you should find an fsImage file that would be used in case of a crash. That file contains an image of the HDFS files and directories. You can use the Offline Image Viewer, or oiv, to check its content:
ls -tr /u01/metadata/current
[...]
fsimage_0000000000000000503
fsimage_0000000000000000503.md5
VERSION
edits_inprogress_0000000000000000504
seen_txid

bin/hdfs oiv -i /u01/metadata/current/fsimage_0000000000000000503 \
   -o /tmp/fsimage -p Indented
cat /tmp/fsimage
drwxr-xr-x  - hadoop supergroup 1409403918690 0 /
drwxrwx---  - hadoop supergroup 1409421293892 0 /tmp
drwxr-xr-x  - hadoop supergroup 1409403927142 0 /user
drwxr-xr-x  - hadoop supergroup 1409421326069 0 /user/hadoop
To figure out what happens when a file is stored, put one in HDFS by running hdfs dfs -put from pink:
dd if=/dev/zero of=/tmp/data bs=64M count=3
bin/hdfs dfs -put /tmp/data data
bin/hdfs dfs -ls /user/hadoop
Found 1 items
-rw-r--r--   3 hadoop supergroup  201326592 2014-08-30 20:25 /user/hadoop/data
Once the file is in HDFS, you can check how it is stored with hdfs fsck:
bin/hdfs fsck /user/hadoop/data -files -blocks -locations
Connecting to namenode via http://yellow:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.4 for path /user/hadoop/data at Sat Aug 30 20:34:11 CEST 2014
/user/hadoop/data 201326592 bytes, 2 block(s):  OK
0. BP-6177126-192.168.56.3-1407174901393:blk_1073741890_1066 len=134217728 repl=3 [192.168.56.4:50010, 192.168.56.5:50010, 192.168.56.3:50010]
1. BP-6177126-192.168.56.3-1407174901393:blk_1073741891_1067 len=67108864 repl=3 [192.168.56.4:50010, 192.168.56.5:50010, 192.168.56.3:50010]

Status: HEALTHY
 Total size:    201326592 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):      2 (avg. block size 100663296 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    3
 Average block replication:     3.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sat Aug 30 20:34:11 CEST 2014 in 11 milliseconds

The filesystem under path '/user/hadoop/data' is HEALTHY
Stop the NameNode again and check the content of the metadata directory. The edit files keep track of the operations managed by the NameNode. You can use the Offline Edits Viewer, or oev, to understand what the NameNode manages and how and when operations are performed on DataNodes:
sbin/hadoop-daemon.sh stop namenode
ls -tr /u01/metadata/current
[...]
fsimage_0000000000000000503
[...]
seen_txid
edits_inprogress_0000000000000000517

bin/hdfs oev -i /u01/metadata/current/edits_inprogress_0000000000000000517 \
   -o /tmp/edit.xml -p XML
cat /tmp/edit.xml
[...]
  <RECORD>
    <OPCODE>OP_CLOSE</OPCODE>
    <DATA>
      <TXID>525</TXID>
      <LENGTH>0</LENGTH>
      <INODEID>0</INODEID>
      <PATH>/user/hadoop/data._COPYING_</PATH>
      <REPLICATION>3</REPLICATION>
      <MTIME>1409423158612</MTIME>
      <ATIME>1409423152498</ATIME>
      <BLOCKSIZE>134217728</BLOCKSIZE>
      <CLIENT_NAME></CLIENT_NAME>
      <CLIENT_MACHINE></CLIENT_MACHINE>
      <BLOCK>
        <BLOCK_ID>1073741890</BLOCK_ID>
        <NUM_BYTES>134217728</NUM_BYTES>
        <GENSTAMP>1066</GENSTAMP>
      </BLOCK>
      <BLOCK>
        <BLOCK_ID>1073741891</BLOCK_ID>
        <NUM_BYTES>67108864</NUM_BYTES>
        <GENSTAMP>1067</GENSTAMP>
      </BLOCK>
      <PERMISSION_STATUS>
        <USERNAME>hadoop</USERNAME>
        <GROUPNAME>supergroup</GROUPNAME>
        <MODE>420</MODE>
      </PERMISSION_STATUS>
    </DATA>
  </RECORD>
[...]
As you can guess from the above, the default block size is 128 MB; only the last block of a file can be smaller. If you look for those blocks in the underlying file system, you should find them named "blk_<blockid>" as shown below:
find /u01/files/current -iname "*107374189?"
.../BP-6177126-192.168.56.3-1407174901393/current/finalized/blk_1073741890
.../BP-6177126-192.168.56.3-1407174901393/current/finalized/blk_1073741891
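The block size itself is controlled by the dfs.blocksize property (134217728 bytes by default). As a hedged sketch, you could override it for a single copy from the command line; the 256 MB value and the data_256m target name are only examples:

bin/hdfs dfs -D dfs.blocksize=268435456 -put /tmp/data data_256m
bin/hdfs fsck /user/hadoop/data_256m -files -blocks -locations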
HDFS Replication and DataNode Loss
As you can see from the above, the default replication factor for every file block is 3. Because we have a 3-node cluster, keeping it at 3 won't help demonstrate that blocks are automatically re-replicated in the event of a server crash. To begin this section, change the level of replication to 2:
- Stop the HDFS NameNode and DataNodes
- Add a dfs.replication property in etc/hadoop/hdfs-site.xml and set it to 2 (see the snippet after this list)
- Remove the content of the data and metadata directories
- Format the new file system
- Start the HDFS NameNode and DataNodes again
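As a reminder, the property added to etc/hadoop/hdfs-site.xml should look like the snippet below:

<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>

The format and start commands are the same ones used in the previous article, shown here only as a sketch; note that on a live cluster you could instead change the replication of an existing file with bin/hdfs dfs -setrep 2 /user/hadoop/data:

bin/hdfs namenode -format
sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode

Once the cluster is back up, recreate the user directory, store the file again and check it with fsck: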
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/hadoop
dd if=/dev/zero of=/tmp/data bs=64M count=3
bin/hdfs dfs -put /tmp/data data
bin/hdfs fsck /user/hadoop/data -files -blocks -locations
Connecting to namenode via http://yellow:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.3 for path /user/hadoop/data at Sun Aug 31 06:47:24 CEST 2014
/user/hadoop/data 201326592 bytes, 2 block(s):  OK
0. BP-399541416-192.168.56.3-1409430021329:blk_1073741827_1003 len=134217728 repl=2 [192.168.56.4:50010, 192.168.56.3:50010]
1. BP-399541416-192.168.56.3-1409430021329:blk_1073741828_1004 len=67108864 repl=2 [192.168.56.3:50010, 192.168.56.5:50010]

Status: HEALTHY
 Total size:    201326592 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):      2 (avg. block size 100663296 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sun Aug 31 06:47:24 CEST 2014 in 4 milliseconds

The filesystem under path '/user/hadoop/data' is HEALTHY
As you can see above, there are now only 2 copies of each block. Stop one DataNode that holds some of them; in this example, we will stop yellow (192.168.56.3):
sbin/hadoop-daemon.sh stop datanode
As you can see from hdfs dfsadmin -report, the failed DataNode is not marked as dead right away. By default, it takes 10 minutes and 30 seconds for it to be detected:
bin/hdfs dfsadmin -report
Configured Capacity: 72417030144 (67.44 GB)
Present Capacity: 61966094336 (57.71 GB)
DFS Remaining: 61560250368 (57.33 GB)
DFS Used: 405843968 (387.04 MB)
DFS Used%: 0.65%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 3 (3 total, 0 dead)

Live datanodes:
Name: 192.168.56.5:50010 (green.resetlogs.com)
Hostname: green
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 67649536 (64.52 MB)
Non DFS Used: 3213279232 (2.99 GB)
DFS Remaining: 20858081280 (19.43 GB)
DFS Used%: 0.28%
DFS Remaining%: 86.41%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 06:57:04 CEST 2014

Name: 192.168.56.3:50010 (yellow.resetlogs.com)
Hostname: yellow.resetlogs.com
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 202915840 (193.52 MB)
Non DFS Used: 3726766080 (3.47 GB)
DFS Remaining: 20209328128 (18.82 GB)
DFS Used%: 0.84%
DFS Remaining%: 83.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 06:56:23 CEST 2014

Name: 192.168.56.4:50010 (pink.resetlogs.com)
Hostname: pink
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 135278592 (129.01 MB)
Non DFS Used: 3510890496 (3.27 GB)
DFS Remaining: 20492840960 (19.09 GB)
DFS Used%: 0.56%
DFS Remaining%: 84.90%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 06:57:04 CEST 2014
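Those 10 minutes and 30 seconds come from the heartbeat settings: the NameNode declares a DataNode dead after 2 x dfs.namenode.heartbeat.recheck-interval (300000 ms by default) plus 10 x dfs.heartbeat.interval (3 seconds by default), that is 10.5 minutes. As a hedged sketch, you could shorten the detection delay for testing purposes by lowering the recheck interval in etc/hadoop/hdfs-site.xml; the 30000 ms value below is only an example and would bring the delay down to about 90 seconds:

<property>
  <name>dfs.namenode.heartbeat.recheck-interval</name>
  <value>30000</value>
</property>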
During that period, if you try to get files, the client might report some exceptions as shown below; however, it should still be able to retrieve the file from the remaining block copies:
bin/hdfs dfs -get /user/hadoop/data /tmp/greg2
14/08/31 07:00:40 WARN hdfs.BlockReaderFactory: I/O error constructing remote block reader.
java.net.ConnectException: Connection refused
[...]
14/08/31 07:00:40 WARN hdfs.DFSClient: Failed to connect to /192.168.56.3:50010 for block, add to deadNodes and continue. java.net.ConnectException: Connection refused
java.net.ConnectException: Connection refused
[...]
14/08/31 07:00:40 INFO hdfs.DFSClient: Successfully connected to /192.168.56.4:50010 for BP-399541416-192.168.56.3-1409430021329:blk_1073741827_1003
After a while, the missing DataNode should be reported as dead:
bin/hdfs dfsadmin -report
Configured Capacity: 72417030144 (67.44 GB)
Present Capacity: 61966077952 (57.71 GB)
DFS Remaining: 61560233984 (57.33 GB)
DFS Used: 405843968 (387.04 MB)
DFS Used%: 0.65%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0
-------------------------------------------------
Datanodes available: 2 (3 total, 1 dead)
Live datanodes:
Name: 192.168.56.5:50010 (green.resetlogs.com)
Hostname: green
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 67649536 (64.52 MB)
Non DFS Used: 3213287424 (2.99 GB)
DFS Remaining: 20858073088 (19.43 GB)
DFS Used%: 0.28%
DFS Remaining%: 86.41%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 07:09:01 CEST 2014
Name: 192.168.56.4:50010 (pink.resetlogs.com)
Hostname: pink
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 135278592 (129.01 MB)
Non DFS Used: 3510898688 (3.27 GB)
DFS Remaining: 20492832768 (19.09 GB)
DFS Used%: 0.56%
DFS Remaining%: 84.90%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 07:09:01 CEST 2014
Dead datanodes:
Name: 192.168.56.3:50010 (yellow.resetlogs.com)
Hostname: yellow.resetlogs.com
Decommission Status : Normal
Configured Capacity: 24139010048 (22.48 GB)
DFS Used: 202915840 (193.52 MB)
Non DFS Used: 3726766080 (3.47 GB)
DFS Remaining: 20209328128 (18.82 GB)
DFS Used%: 0.84%
DFS Remaining%: 83.72%
Configured Cache Capacity: 0 (0 B)
Cache Used: 0 (0 B)
Cache Remaining: 0 (0 B)
Cache Used%: 100.00%
Cache Remaining%: 0.00%
Last contact: Sun Aug 31 06:56:23 CEST 2014
File blocks are then re-replicated to the remaining DataNodes so that you don't get any exception when reading the file:
bin/hdfs fsck /user/hadoop/data -files -blocks -locations
Connecting to namenode via http://yellow:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.3 for path /user/hadoop/data at Sun Aug 31 07:11:32 CEST 2014
/user/hadoop/data 201326592 bytes, 2 block(s):  OK
0. BP-399541416-192.168.56.3-1409430021329:blk_1073741827_1003 len=134217728 repl=2 [192.168.56.4:50010, 192.168.56.5:50010]
1. BP-399541416-192.168.56.3-1409430021329:blk_1073741828_1004 len=67108864 repl=2 [192.168.56.5:50010, 192.168.56.4:50010]

Status: HEALTHY
 Total size:    201326592 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):      2 (avg. block size 100663296 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          2
 Number of racks:               1
FSCK ended at Sun Aug 31 07:11:32 CEST 2014 in 1 milliseconds

The filesystem under path '/user/hadoop/data' is HEALTHY

bin/hdfs dfs -get /user/hadoop/data /tmp/copy3
If you restart the failed DataNode, the NameNode will report it alive again right away and each block will have 3 copies, one more than the replication factor, until HDFS removes the excess replicas:
bin/hdfs fsck /user/hadoop/data -files -blocks -locations
Connecting to namenode via http://yellow:50070
FSCK started by hadoop (auth:SIMPLE) from /192.168.56.3 for path /user/hadoop/data at Sun Aug 31 07:17:25 CEST 2014
/user/hadoop/data 201326592 bytes, 2 block(s):  OK
0. BP-399541416-192.168.56.3-1409430021329:blk_1073741827_1003 len=134217728 repl=2 [192.168.56.4:50010, 192.168.56.5:50010, 192.168.56.3:50010]
1. BP-399541416-192.168.56.3-1409430021329:blk_1073741828_1004 len=67108864 repl=2 [192.168.56.5:50010, 192.168.56.4:50010, 192.168.56.3:50010]

Status: HEALTHY
 Total size:    201326592 B
 Total dirs:    0
 Total files:   1
 Total symlinks:        0
 Total blocks (validated):      2 (avg. block size 100663296 B)
 Minimally replicated blocks:   2 (100.0 %)
 Over-replicated blocks:        0 (0.0 %)
 Under-replicated blocks:       0 (0.0 %)
 Mis-replicated blocks:         0 (0.0 %)
 Default replication factor:    2
 Average block replication:     2.0
 Corrupt blocks:                0
 Missing replicas:              0 (0.0 %)
 Number of data-nodes:          3
 Number of racks:               1
FSCK ended at Sun Aug 31 07:17:25 CEST 2014 in 1 milliseconds

The filesystem under path '/user/hadoop/data' is HEALTHY
NameNode Failure and QJM
In the previous configuration, the NameNode remains a single point of failure, especially if disks are local to servers and not protected by any kind of RAID/LVM. Keep in mind that HDFS metadata is only accessible from the NameNode server. To secure that configuration, HDFS provides a way to replicate metadata to a standby NameNode. For now, and at least up to release 2.5.0, there can only be 2 NameNodes per configuration: one active and one standby. In case of a server loss, or to manage server maintenance, you can fail over between the active and standby NameNodes. That replication infrastructure is named the Quorum Journal Manager and it works as described below:
- The active NameNode sends edits to JournalNodes so that the metadata can be maintained by the other NameNode
- The standby NameNode pulls edits from the JournalNodes and maintains a copy of the metadata
- DataNodes send block reports and heartbeats not only to the active NameNode but to all NameNodes in the configuration, so that a standby node can be made active with a simple failover command
NameNode Standby Configuration
In order to configure the Quorum Journal Manager for manual failover, proceed as below:
- Stop the NameNode and DataNodes
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop datanode
- Create the Journal Directory on every node
for i in yellow pink green; do
  ssh $i mkdir /u01/journal
done
- Create the metadata directory on green
ssh green mkdir /u01/metadata
- Make sure fuser is installed on every one of your servers; the sshfence fencing method relies on it to kill the former active NameNode process
yum install -y psmisc
- Add the following properties to etc/hadoop/hdfs-site.xml
<property>
  <name>dfs.nameservices</name>
  <value>colorcluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.colorcluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.colorcluster.nn1</name>
  <value>yellow.resetlogs.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.colorcluster.nn2</name>
  <value>green.resetlogs.com:8020</value>
</property>
<property>
  <name>dfs.namenode.http-address.colorcluster.nn1</name>
  <value>yellow.resetlogs.com:50070</value>
</property>
<property>
  <name>dfs.namenode.http-address.colorcluster.nn2</name>
  <value>green.resetlogs.com:50070</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://yellow.resetlogs.com:8485;pink.resetlogs.com:8485;green.resetlogs.com:8485/colorcluster</value>
</property>
<property>
  <name>dfs.journalnode.edits.dir</name>
  <value>/u01/journal</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.colorcluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
<property>
  <name>dfs.ha.fencing.methods</name>
  <value>sshfence</value>
</property>
<property>
  <name>dfs.ha.fencing.ssh.private-key-files</name>
  <value>/home/hadoop/.ssh/id_dsa</value>
</property>
Note:
For details about those parameters, refer to HDFS High Availability Using the Quorum Journal Manager.
- Change the fs.defaultFS property in core-site.xml to match the cluster name
cat etc/hadoop/core-site.xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://colorcluster</value>
  </property>
</configuration>
- Distribute those 2 files to all the nodes of the cluster
for i in pink green; do
  scp etc/hadoop/hdfs-site.xml $i:/usr/local/hadoop/etc/hadoop/.
  scp etc/hadoop/core-site.xml $i:/usr/local/hadoop/etc/hadoop/.
done
- Start the journalnode and datanode processes
sbin/hadoop-daemons.sh start journalnode
sbin/hadoop-daemons.sh start datanode
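You can check that the daemons are running on every node with jps from the JDK, assuming java is on the PATH for non-interactive SSH sessions; the expected process names are only indicative:

for i in yellow pink green; do
  ssh $i jps
done
# every node should list, among others, a DataNode and a JournalNode process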
- Initialize the shared edits directory (format the JournalNodes) from yellow
bin/hdfs namenode -initializeSharedEdits
- Start the NameNode on yellow
sbin/hadoop-daemon.sh start namenode
- Configure the NameNode on green to synchronize with the primary NameNode
# on green
bin/hdfs namenode -bootstrapStandby
- Start the NameNode on green
sbin/hadoop-daemon.sh start namenode
- Check the status on both NameNodes:
bin/hdfs haadmin -getServiceState nn1
standby
bin/hdfs haadmin -getServiceState nn2
standby
- Activate one of the NameNodes
bin/hdfs haadmin -transitionToActive nn1
- Test that the client can access HDFS
bin/hdfs dfs -ls /user/hadoop
-rw-r--r--   2 hadoop supergroup  201326592 2014-08-31 13:29 /user/hadoop/data
Manual Failover
If you lose the active NameNode on yellow, you won't be able to access HDFS anymore:
ps -ef |grep [n]amenode |awk '{print "kill -9",$2}'|sh
However, you can easily fail over to the standby node with a single command:
bin/hdfs haadmin -failover nn1 nn2
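Behind the scenes, the failover command transitions nn2 to active, fencing the former active NameNode if needed with the sshfence method configured earlier. You can check the new role with the same status command as before; the output below is what you should expect rather than a capture, and the same query against nn1 will fail until the NameNode process on yellow is restarted:

bin/hdfs haadmin -getServiceState nn2
active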
The client should then be able to use HDFS again:
bin/hdfs dfs -ls /user/hadoop
-rw-r--r--   2 hadoop supergroup  201326592 2014-08-31 13:29 /user/hadoop/data
Startup and Shutdown Procedure
Once you've configured QJM, the procedures to start and stop the whole HDFS cluster change. To stop the whole cluster, connect to yellow and run:
cd /usr/local/hadoop
ssh green `pwd`/sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemon.sh stop namenode
sbin/hadoop-daemons.sh stop journalnode
sbin/hadoop-daemons.sh stop datanode
To restart HDFS, run:
cd /usr/local/hadoop
sbin/hadoop-daemons.sh start journalnode
sbin/hadoop-daemon.sh start namenode
ssh green `pwd`/sbin/hadoop-daemon.sh start namenode
sbin/hadoop-daemons.sh start datanode
bin/hdfs haadmin -failover nn2 nn1
bin/hdfs haadmin -getServiceState nn1
Conclusion
This 4th part of the series shows how HDFS works. It also shows how DataNode replication behaves and how to set up and test the Quorum Journal Manager to handle manual failover of the NameNode. There are many other aspects of HDFS you may want to dig into, like the use of ZooKeeper to automate NameNode failover or rack-aware replication. Next week you'll start developing MapReduce jobs on this colorcluster. Stay tuned!
Reference
To know more about HDFS, refer to HDFS Users Guide.