HADOOP: Interacting with HDFS
For University Program on Apache Hadoop & Apache Apex
What's the Need?
- Big data ocean
- Expensive hardware
- Frequent failures and difficult recovery
- Scaling up with more machines
Hadoop
- Open-source software: a Java framework (first released in 2006; version 1.0.0 in December 2011)
- Provides both:
  - Storage [HDFS]
  - Processing [MapReduce]
- HDFS: Hadoop Distributed File System
How Does Hadoop Address the Need?
- Big data ocean: Have multiple machines. Each stores some portion of the data, not the entire data.
- Expensive hardware: Use commodity hardware. Simple and cheap.
- Frequent failures and difficult recovery: Keep multiple copies of the data, on different machines.
- Scaling up with more machines: If more processing is needed, add new machines on the fly.
HDFS
- Runs on commodity hardware: doesn't require expensive machines
- Large files; write-once, read-many (WORM)
- Files are split into blocks
  - Actual blocks go to DataNodes
  - The metadata is stored at the NameNode
- Blocks are replicated to different nodes
- Default configuration:
  - Block size = 128 MB
  - Replication factor = 3
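The defaults above determine how a file is physically laid out. As a sketch (the 1 GB file size is a hypothetical example, not from the slides), the arithmetic works out like this:

```shell
# How HDFS defaults translate into blocks and raw storage,
# for a hypothetical 1 GB file.
FILE_SIZE=$((1024 * 1024 * 1024))   # 1 GB
BLOCK_SIZE=$((128 * 1024 * 1024))   # default block size: 128 MB
REPLICATION=3                       # default replication factor

# Number of blocks, rounding up for a partial last block
BLOCKS=$(( (FILE_SIZE + BLOCK_SIZE - 1) / BLOCK_SIZE ))
# Raw cluster storage consumed once every block is replicated
RAW_BYTES=$(( FILE_SIZE * REPLICATION ))

echo "blocks=$BLOCKS"                                # 8 blocks
echo "raw_gb=$(( RAW_BYTES / 1024 / 1024 / 1024 ))"  # 3 GB of raw storage
```

So a 1 GB file occupies 8 blocks, and with replication factor 3 it consumes 3 GB of raw disk across the cluster.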
Where NOT to Use HDFS
- Low-latency data access: HDFS is optimized for high throughput of data at the expense of latency.
- Large numbers of small files: the NameNode holds the entire file-system metadata in memory, so many small files mean too much metadata relative to actual data.
- Multiple writers / arbitrary file modifications: there is no support for multiple writers to a file, and writes always append to the end of a file.
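The small-files problem can be made concrete with back-of-the-envelope arithmetic. The figures below are assumptions: ~150 bytes of NameNode heap per namespace object is a commonly cited rule of thumb, not an exact number, and the ten-million-file count is hypothetical:

```shell
# Rough sketch of NameNode memory pressure from many small files.
# Assumption: ~150 bytes of heap per namespace object (file or block),
# a commonly cited rule of thumb.
NUM_FILES=10000000          # hypothetical: ten million small files
BYTES_PER_OBJECT=150
# Each small file costs at least two objects: one file entry + one block
OBJECTS=$(( NUM_FILES * 2 ))
HEAP_BYTES=$(( OBJECTS * BYTES_PER_OBJECT ))
echo "approx_heap_mb=$(( HEAP_BYTES / 1024 / 1024 ))"   # roughly 3 GB of heap
```

Ten million tiny files burn gigabytes of NameNode heap regardless of how little actual data they hold, which is why HDFS favors fewer, larger files.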
Some Key Concepts
- NameNode
- DataNodes
- JobTracker
- TaskTrackers
- ResourceManager (MRv2)
- NodeManager (MRv2)
- ApplicationMaster (MRv2)
NameNode & DataNodes
NameNode:
- Centerpiece of HDFS: the master
- Stores only the block metadata: block name, block location, etc.
- Critical component; when it is down, the whole cluster is considered down: a single point of failure
- Should be configured with more RAM
DataNode:
- Stores the actual data: the slave
- In constant communication with the NameNode
- When one is down, it does not affect the availability of data or the cluster
- Should be configured with more disk space
SecondaryNameNode:
- Does not actually act as a NameNode
- Stores a checkpoint image of the primary NameNode's namespace
- Can be used to help restore the NameNode
JobTracker & TaskTrackers
JobTracker:
- Talks to the NameNode to determine the location of the data
- Monitors all TaskTrackers and submits the status of the job back to the client
- When it is down, HDFS is still functional, but no new MR jobs start and existing jobs are halted
- Replaced by ResourceManager/ApplicationMaster in MRv2
TaskTracker:
- Runs on all DataNodes
- Communicates with the JobTracker, signaling task progress
- A TaskTracker failure is not considered fatal
- Replaced by NodeManager in MRv2
ResourceManager & NodeManager
- Present in Hadoop v2.0; the equivalent of JobTracker & TaskTracker in v1.0
ResourceManager (RM):
- Usually runs on the master node; distributes resources among applications
- Two main components: Scheduler and ApplicationsManager
NodeManager (NM):
- Per-node framework agent
- Responsible for containers: monitors their resource usage and reports the stats to the RM
The central ResourceManager and the node-specific NodeManagers together are called YARN.
Hadoop 1.0 vs. 2.0
HDFS 1.0:
- Single point of failure
- Horizontal scaling performance issues
HDFS 2.0:
- HDFS High Availability
- HDFS Snapshot
- Improved performance
- HDFS Federation
HDFS Federation
Interacting with HDFS
- Command prompt: commands similar to Linux terminal commands (Unix is the model, though HDFS is not fully POSIX-compliant)
- Web interface: similar to browsing an FTP site in a web browser
Interacting With HDFS on the Command Prompt
Notes
File paths on HDFS can be written several ways:
- Full URI: hdfs://127.0.0.1:8020/user/username/demo/data/file.txt
- Full URI with hostname: hdfs://localhost:8020/user/username/demo/data/file.txt
- Absolute path: /user/username/demo/file.txt
- Relative path: demo/file.txt (resolved against the user's home directory, /user/username)
File systems:
- Local: the local (Linux) file system
- HDFS: the Hadoop file system
In some places, the terms "file" and "directory" are used interchangeably.
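How a relative path expands to an absolute one can be sketched with plain string manipulation; no cluster is needed. The username `alice` and the `localhost:8020` endpoint are hypothetical examples:

```shell
# Sketch: how a relative HDFS path resolves against the user's home
# directory /user/<username>. Pure string work, no Hadoop required.
USERNAME=alice                    # hypothetical user
REL_PATH="demo/file.txt"
ABS_PATH="/user/$USERNAME/$REL_PATH"
FULL_URI="hdfs://localhost:8020$ABS_PATH"
echo "$ABS_PATH"    # /user/alice/demo/file.txt
echo "$FULL_URI"    # hdfs://localhost:8020/user/alice/demo/file.txt
```

All three spellings (relative path, absolute path, full URI) name the same file once resolved.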
Before We Start
Command: hdfs
Usage: hdfs [--config confdir] COMMAND
Examples:
hdfs dfs
hdfs dfsadmin
hdfs fsck
hdfs namenode
hdfs datanode
hdfs `dfs` Commands
In General: Syntax for `dfs` Commands
hdfs dfs -<COMMAND> [-OPTIONS] <PARAMETERS>
e.g. hdfs dfs -ls -R /user/username/demo/data/
0. Do It Yourself
Syntax:
hdfs dfs -help [COMMAND]
hdfs dfs -usage [COMMAND]
Examples:
hdfs dfs -help cat
hdfs dfs -usage cat
1. List a File/Directory
Syntax:
hdfs dfs -ls [-d] [-h] [-R] <hdfs-dir-path>
Examples:
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls /user/username/demo/list-dir-example
hdfs dfs -ls -R /user/username/demo/list-dir-example
2. Create a Directory
Syntax:
hdfs dfs -mkdir [-p] <hdfs-dir-path>
Examples:
hdfs dfs -mkdir /user/username/demo/create-dir-example
hdfs dfs -mkdir -p /user/username/demo/create-dir-example/dir1/dir2/dir3
3. Create a File Locally & Put It on HDFS
Syntax:
vi filename.txt
hdfs dfs -put [options] <local-file-path> <hdfs-dir-path>
Example:
vi file-copy-to-hdfs.txt
hdfs dfs -put file-copy-to-hdfs.txt /user/username/demo/put-example/
4. Get a File from HDFS to Local
Syntax:
hdfs dfs -get <hdfs-file-path> [local-dir-path]
Example:
hdfs dfs -get /user/username/demo/get-example/file-copy-from-hdfs.txt ~/demo/
5. Copy from Local to HDFS
Syntax:
hdfs dfs -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -copyFromLocal file-copy-to-hdfs.txt /user/username/demo/copyfromlocal-example/
6. Copy to Local from HDFS
Syntax:
hdfs dfs -copyToLocal <hdfs-file-path> <local-file-path>
Example:
hdfs dfs -copyToLocal /user/username/demo/copytolocal-example/file-copy-from-hdfs.txt ~/demo/
7. Move a File from Local to HDFS
Syntax:
hdfs dfs -moveFromLocal <local-file-path> <hdfs-dir-path>
Example:
hdfs dfs -moveFromLocal /path/to/file.txt /user/username/demo/movefromlocal-example/
8. Copy a File Within HDFS
Syntax:
hdfs dfs -cp <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -cp /user/username/demo/copy-within-hdfs/file-copy.txt /user/username/demo/data/
9. Move a File Within HDFS
Syntax:
hdfs dfs -mv <hdfs-source-file-path> <hdfs-dest-file-path>
Example:
hdfs dfs -mv /user/username/demo/move-within-hdfs/file-move.txt /user/username/demo/data/
10. Merge Files on HDFS
Syntax:
hdfs dfs -getmerge [-nl] <hdfs-dir-path> <local-file-path>
Example:
hdfs dfs -getmerge -nl /user/username/demo/merge-example/ /path/to/all-files.txt
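What `-getmerge` does can be mimicked locally: it concatenates every file in the source directory into one local file, and `-nl` appends a newline after each file. A sketch with throwaway local files (all paths and contents below are hypothetical):

```shell
# Local analogy for: hdfs dfs -getmerge -nl <dir> all.txt
mkdir -p /tmp/merge-demo
printf 'alpha' > /tmp/merge-demo/part-0
printf 'beta'  > /tmp/merge-demo/part-1

out=/tmp/merge-demo-all.txt
: > "$out"                          # start with an empty output file
for f in /tmp/merge-demo/part-*; do
  cat "$f" >> "$out"                # append each part in order
  printf '\n' >> "$out"             # the -nl behavior: newline per file
done
cat "$out"    # prints "alpha" and "beta" on separate lines
```

Without `-nl`, the parts would run together as `alphabeta`, which is why the flag matters for line-oriented data.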
11. View File Contents
Syntax:
hdfs dfs -cat <hdfs-file-path>
hdfs dfs -tail <hdfs-file-path>
hdfs dfs -text <hdfs-file-path>
Examples:
hdfs dfs -cat /user/username/demo/data/cat-example.txt
hdfs dfs -cat /user/username/demo/data/cat-example.txt | head
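The `| head` pipe works because `hdfs dfs -cat` streams the file to stdout, so ordinary Unix filters apply. The same shape, sketched with a local file (the file and its contents are hypothetical):

```shell
# Local analogy for: hdfs dfs -cat <path> | head -n 3
seq 1 100 > /tmp/cat-demo.txt       # a 100-line sample file
cat /tmp/cat-demo.txt | head -n 3   # prints only: 1 2 3
```

The same pipelining applies to `tail`, `grep`, `wc -l`, and any other stdin filter.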
12. Remove Files/Directories from HDFS
Syntax:
hdfs dfs -rm [options] <hdfs-file-path>
Examples:
hdfs dfs -rm /user/username/demo/remove-example/remove-file.txt
hdfs dfs -rm -R /user/username/demo/remove-example/
hdfs dfs -rm -R -skipTrash /user/username/demo/remove-example/
13. Change File/Directory Properties
Syntax:
hdfs dfs -chgrp [-R] <NewGroupName> <hdfs-file-path>
hdfs dfs -chmod [-R] <permissions> <hdfs-file-path>
hdfs dfs -chown [-R] <NewOwnerName> <hdfs-file-path>
Example:
hdfs dfs -chmod -R 777 /user/username/demo/data/file-change-properties.txt
14. Check the File Size
Syntax:
hdfs dfs -du <hdfs-file-path>
Examples:
hdfs dfs -du /user/username/demo/data/file.txt
hdfs dfs -du -s -h /user/username/demo/data/
15. Create a Zero-Byte File in HDFS
Syntax:
hdfs dfs -touchz <hdfs-file-path>
Example:
hdfs dfs -touchz /user/username/demo/data/zero-byte-file.txt
16. File Test Operations
Syntax:
hdfs dfs -test -[defsz] <hdfs-file-path>
Example:
hdfs dfs -test -e /user/username/demo/data/file.txt
echo $?
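`hdfs dfs -test` follows the same convention as the shell's own `test` builtin: exit status 0 when the check passes, non-zero otherwise, which is why `echo $?` shows the result. The convention, sketched with local `test -e` (the file names are hypothetical):

```shell
# Exit-code convention shared by `test` and `hdfs dfs -test`:
# 0 = check passed, non-zero = check failed.
touch /tmp/exists.txt
rm -f /tmp/missing.txt

test -e /tmp/exists.txt;  echo "exists: $?"    # exists: 0
test -e /tmp/missing.txt; echo "missing: $?"   # missing: 1
```

This makes the command scriptable: `hdfs dfs -test -e <path> && hdfs dfs -cat <path>` only cats the file if it exists.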
17. Get File Statistics
Syntax:
hdfs dfs -stat [format] <hdfs-file-path>
Format options:
%b - file size in bytes
%n - filename
%r - replication
%y - modification date
%g - group name of owner
%o - block size
%u - user name of owner
18. Get File/Directory Counts
Syntax:
hdfs dfs -count [-q] [-h] [-v] <hdfs-file-path>
Example:
hdfs dfs -count -v /user/username/demo/
19. Set Replication Factor
Syntax:
hdfs dfs -setrep [-w] [-R] <numReplicas> <hdfs-file-path>
Example:
hdfs dfs -setrep -w 2 /user/username/demo/data/file.txt
(-w waits until the replication completes; -R is accepted only for backwards compatibility)
20. Set Block Size
Syntax:
hdfs dfs -D dfs.blocksize=<blocksize> -copyFromLocal <local-file-path> <hdfs-file-path>
Example:
hdfs dfs -D dfs.blocksize=67108864 -copyFromLocal /path/to/file.txt /user/username/demo/block-example/
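The magic number 67108864 is just 64 MB expressed in bytes, since `dfs.blocksize` takes a byte count. The arithmetic, plus a hypothetical 200 MB file as a worked example:

```shell
# dfs.blocksize is in bytes: 64 MB = 64 * 1024 * 1024
BLOCK_BYTES=$(( 64 * 1024 * 1024 ))
echo "$BLOCK_BYTES"                 # 67108864

# Hypothetical 200 MB file at a 64 MB block size:
# ceil(200 / 64) = 4 blocks (the last one only partially full)
FILE_BYTES=$(( 200 * 1024 * 1024 ))
echo $(( (FILE_BYTES + BLOCK_BYTES - 1) / BLOCK_BYTES ))   # 4
```

Note the setting applies only to files written with it; it does not re-block files already in HDFS.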
21. Empty the HDFS Trash
Syntax:
hdfs dfs -expunge
Location: /user/<username>/.Trash
Other hdfs Commands (Admin)
22. HDFS Admin Commands: fsck
Syntax:
hdfs fsck <hdfs-file-path>
Options:
[-list-corruptfileblocks | [-move | -delete | -openforwrite] [-files [-blocks [-locations | -racks]]]]
[-includeSnapshots]
23. HDFS Admin Commands: dfsadmin
Syntax:
hdfs dfsadmin
Options:
[-report [-live] [-dead] [-decommissioning]]
[-safemode <enter | leave | get | wait>]
[-refreshNodes]
[-refresh <host:ipc_port> <key> [arg1..argn]]
[-shutdownDatanode <datanode:port> [upgrade]]
[-getDatanodeInfo <datanode_host:ipc_port>]
[-help [cmd]]
Example:
hdfs dfsadmin -report -live
24. HDFS Admin Commands: namenode
Syntax:
hdfs namenode
Options:
[-checkpoint]
[-format [-clusterid cid] [-force] [-nonInteractive]]
[-upgrade [-clusterid cid]]
[-rollback]
[-recover [-force]]
[-metadataVersion]
Example:
hdfs namenode -help
25. HDFS Admin Commands: getconf
Syntax:
hdfs getconf [-options]
Options:
[-namenodes]
[-secondarynamenodes]
[-backupnodes]
[-includeFile]
[-excludeFile]
[-nnRpcAddresses]
[-confKey [key]]
Again, THE Most Important Command!
Syntax:
hdfs dfs -help [COMMAND]
hdfs dfs -usage [COMMAND]
Examples:
hdfs dfs -help help
hdfs dfs -usage usage
Interacting With HDFS in a Web Browser
Web HDFS
URL: http://namenode:50070/explorer.html
Examples:
http://localhost:50070/explorer.html
http://ec2-52-23-214-111.compute-1.amazonaws.com:50070/explorer.html
Thank You!
Please send your questions to:
pradeep@datatorrent.com
pradeep.n.kumbhar@gmail.com
Resources
- Apache Apex website - http://apex.incubator.apache.org/
- Subscribe - http://apex.incubator.apache.org/community.html
- Download - http://apex.incubator.apache.org/downloads.html
- Twitter - @ApacheApex; Follow - https://twitter.com/apacheapex
- Facebook - https://www.facebook.com/apacheapex/
- Meetup - http://www.meetup.com/topics/apache-apex
- Startup Program: Free Enterprise License for Startups, Educational Institutions, Non-Profits - https://www.datatorrent.com/product/startup-accelerator/
- Cloud Trial - http://web.datatorrent.com/cloudtrial.html
2016 DataTorrent
We Are Hiring
jobs@datatorrent.com
- Developers/Architects
- QA Automation Developers
- Information Developers
- Build and Release
Upcoming Events
- March 15th
- March 17th, 6pm PST - Title
- March 24th, 9am PST - Title
APPENDIX
Copy Data from One Cluster to Another (distcp)
Description: copy data between HDFS clusters
Syntax/Examples:
hadoop distcp hdfs://nn1:8020/foo/bar hdfs://nn2:8020/bar/foo
hadoop distcp hdfs://nn1:8020/foo/a hdfs://nn1:8020/foo/b hdfs://nn2:8020/bar/foo
hadoop distcp -f hdfs://nn1:8020/srclist.file hdfs://nn2:8020/bar/foo
where srclist.file contains:
hdfs://nn1:8020/foo/a
hdfs://nn1:8020/foo/b