A quick post that explains the following, with samples:
- Create a HAR file
- List the Contents of a HAR file
- Read the contents of a file that is within a HAR
Listed below is the input directory structure in HDFS I’ll be using to create a HAR
hadoop fs -ls /bejoyks/test/har/source_files/*
Found 2 items
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file1.tsv
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir01/file2.tsv
Found 2 items
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file3.tsv
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:25 /bejoyks/test/har/source_files/srcDir02/file4.tsv
CLI Command to create a HAR
Syntax
hadoop archive -archiveName <archiveName.har> -p <ParentDirHDFS> -r <ReplicationFactor> <childDir01> <childDir02> <DestinationDirectoryHDFS>
Command Used
hadoop archive -archiveName tsv_daily.har -p /bejoyks/test/har/source_files -r 3 srcDir01 srcDir02 /bejoyks/test/har/destination
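The argument order in the syntax above is easy to get wrong, since the child directories sit between the replication factor and the destination. Below is a minimal sketch of a dry-run helper that only prints the command it would run. The function name build_har_cmd and its argument order are my own assumptions for illustration and are not part of Hadoop.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): prints the `hadoop archive`
# command for the given arguments without executing it.
# Usage: build_har_cmd <archiveName.har> <parentDir> <replication> <destDir> <childDir>...
build_har_cmd() {
  name=$1; parent=$2; repl=$3; dest=$4
  shift 4
  # The remaining arguments are the child directories, relative to the parent.
  printf 'hadoop archive -archiveName %s -p %s -r %s %s %s\n' \
    "$name" "$parent" "$repl" "$*" "$dest"
}

# Reproduces the command used in this post.
build_har_cmd tsv_daily.har /bejoyks/test/har/source_files 3 \
  /bejoyks/test/har/destination srcDir01 srcDir02
```

Once the printed command looks right, it can be copied and run as-is on a machine with the Hadoop client installed.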
LISTING DIRS and FILES in HAR
Syntax
hadoop fs -ls har://<AbsolutePathOfHarFile>
Command Used and Output
Command 01 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har
Found 2 items
drwxr-xr-x - hadoop supergroup 0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
drwxr-xr-x - hadoop supergroup 0 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir02
Command 02 :
hadoop fs -ls har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01
Found 2 items
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file1.tsv
-rw-r--r-- 3 hadoop supergroup 22 2015-06-29 20:39 har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv
READING a File within a HAR
hadoop fs -text har:///bejoyks/test/har/destination/tsv_daily.har/srcDir01/file2.tsv
file2 row1
file2 row2
** Common mistakes while reading a HAR file
Always use the har:// URI while reading a HAR file.
Since we are used to listing directories and files in HDFS without a URI scheme, it is tempting to follow the same pattern here. But a HAR file doesn’t behave as an archive unless its path is prefixed with the har:// scheme. If you list it with a plain HDFS path, you’ll get the HAR metadata under the hood, something like below.
hadoop fs -ls /bejoyks/test/har/destination/tsv_daily.har
Found 3 items
-rw-r--r-- 5 hadoop supergroup 277 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_index
-rw-r--r-- 5 hadoop supergroup 23 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/_masterindex
-rw-r--r-- 3 hadoop supergroup 88 2015-06-29 20:39 /bejoyks/test/har/destination/tsv_daily.har/part-0
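One way to avoid this mistake in scripts is to build the har:// URI from the plain HDFS path before handing it to hadoop fs. The sketch below assumes the archive lives on the default filesystem, as in the examples in this post. The function name har_uri is hypothetical and not something Hadoop provides.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hadoop): turn an absolute HDFS path
# into a har:// URI so the archive is read through the HAR filesystem.
har_uri() {
  case "$1" in
    har://*) printf '%s\n' "$1" ;;        # already a HAR URI, pass it through
    /*)      printf 'har://%s\n' "$1" ;;  # absolute path: prefix the scheme
    *)       echo "har_uri: expected an absolute HDFS path: $1" >&2
             return 1 ;;
  esac
}

# Prints the URI form of the archive created in this post.
har_uri /bejoyks/test/har/destination/tsv_daily.har
```

With this in place, a listing could be written as hadoop fs -ls "$(har_uri /bejoyks/test/har/destination/tsv_daily.har)", which is equivalent to Command 01 above.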