Thursday, April 28, 2011

How to run Hadoop MapReduce jobs without a cluster, using the Cloudera VM

This document is intended to help Java developers start practical investigation of Hadoop MapReduce jobs without setting up a cluster. To follow it you need basic theoretical knowledge of Hadoop, HDFS and MapReduce jobs. Some prior knowledge of basic Linux commands is also advisable.
It is possible to try sample MapReduce jobs on your Windows PC without any cumbersome Hadoop setup if you have the Cloudera training VM. This environment is ideal for functionality testing on very small sample data volumes; larger input data won't be supported because of the memory constraints of the VM.
To test your sample MapReduce job on a local Hadoop environment (the Cloudera training VM), follow the steps below in order.

1.       Download VMware Player and the Cloudera Training VM (an Ubuntu image) onto your Windows PC.

2.       Install VMware Player. Extract the Cloudera Training VM; among the extracted files you will find a *.vmx file. Open it (it opens in VMware Player).
User credentials for the Cloudera VM
User name: training
Password: training

3.       Copy the jar and the required input files into the Cloudera Linux box.
Here I have copied the jar to  home -> training -> use-case -> source-code and
the input files to  home -> training -> use-case -> input
(To browse to a folder in the Linux box, click the Places link on the top menu bar and select Home Folder. You are now in the /home/training folder of your local Linux file system; from there you can browse and create directories as you do in Windows.)
You can copy and paste files from your Windows file system to the Cloudera Linux box just as you do between folders in Windows.

4.       Open a Linux terminal (the terminal icon is on the desktop of the Cloudera Linux box).

5.       Create an input directory in HDFS
Command Syntax:
hadoop fs -mkdir  <full path of directory in hdfs>
Example
hadoop fs -mkdir  /userdata/bejoy/input

6.       Copy the contents of the Linux box input folder to the HDFS input folder
Command Syntax:
hadoop fs -copyFromLocal  <source directory from local linux box>  <destination directory in hdfs>
Example
hadoop fs -copyFromLocal  /home/training/use-case/input  /userdata/bejoy/input/

7.       Check that the input files are available in HDFS (not a mandatory step)
Command Syntax:
hadoop fs -ls  <full path of hdfs directory>
Example
hadoop fs -ls  /userdata/bejoy/input/
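
If you stage input often, steps 5 to 7 can be wrapped in a small shell function. The sketch below is only an illustration: the function name stage_input and the HADOOP override variable are my own assumptions and not part of Hadoop, while the hadoop fs commands are the same ones used in the steps above.

```shell
# Stage a local input folder into HDFS (steps 5 to 7 in one go).
# HADOOP is a hypothetical override variable for dry runs; when it is
# unset, the real "hadoop" command is used.
stage_input() {
    local local_dir="$1"   # e.g. /home/training/use-case/input
    local hdfs_dir="$2"    # e.g. /userdata/bejoy/input
    local hadoop_cmd="${HADOOP:-hadoop}"

    # Stop at the first failing command so later steps do not run blindly.
    "$hadoop_cmd" fs -mkdir "$hdfs_dir" || return 1
    "$hadoop_cmd" fs -copyFromLocal "$local_dir" "$hdfs_dir/" || return 1
    "$hadoop_cmd" fs -ls "$hdfs_dir/"
}
```

Usage: stage_input /home/training/use-case/input /userdata/bejoy/input
Prefixing the call with HADOOP=echo prints the commands instead of running them, which is handy for checking the paths first.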

8.       Run the jar file with the required arguments
Command Syntax:
hadoop jar  <full path of jar with jar name>  <full package name of the Hadoop Driver Class>  <full path of hdfs input directory>  <full path of hdfs output directory>

Example
hadoop jar  /home/training/use-case/source-code/salarycalculator.jar com.ge.corp.hadoop.salcalc.SalaryCalculator  -files  /home/training/use-case/reference/location.txt  /userdata/bejoy/input  /userdata/bejoy/output
Note:  the -files option is used if our MR program refers to other reference files while processing the input files

hadoop jar  /home/training/use-case/source-code/salarycalculator.jar com.ge.corp.hadoop.salcalc.SalaryCalculator  -files  /home/training/use-case/reference/location.txt  -D mapred.reduce.tasks=17  /userdata/bejoy/input  /userdata/bejoy/output
Note:  -D mapred.reduce.tasks specifies the number of reducers used to run our MR job. (The number of output files is the same as the number of reducers used to run the job.)
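
The placement of these options matters: -files and -D are Hadoop generic options, so they go after the driver class name and before the job's own arguments, and they take effect only when the driver parses them (for example by running through ToolRunner). Whether your own driver does so is something to check in its code. A minimal sketch of a wrapper that keeps the order right, where the function name run_job and the HADOOP override variable are hypothetical names of mine:

```shell
# Run a MapReduce job with a chosen number of reducers.
# Generic options such as -D are placed after the driver class and
# before the job's own arguments (input and output paths).
# HADOOP is a hypothetical override variable; it defaults to "hadoop".
run_job() {
    local jar="$1" driver="$2" input="$3" output="$4"
    local reducers="${5:-1}"   # falls back to a single reducer
    "${HADOOP:-hadoop}" jar "$jar" "$driver" \
        -D mapred.reduce.tasks="$reducers" \
        "$input" "$output"
}
```

Usage: run_job /home/training/use-case/source-code/salarycalculator.jar com.ge.corp.hadoop.salcalc.SalaryCalculator /userdata/bejoy/input /userdata/bejoy/output 17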

9.       If you see a log of the following format in your terminal, proceed to step 10; otherwise, troubleshoot the error in your MR code and rerun the job
10/07/23 03:51:17 INFO mapred.FileInputFormat: Total input paths to process : 2
10/07/23 03:51:18 INFO mapred.JobClient: Running job: job_201007230305_0001
10/07/23 03:51:19 INFO mapred.JobClient:  map 0% reduce 0%
10/07/23 03:51:33 INFO mapred.JobClient:  map 100% reduce 0%
10/07/23 03:51:42 INFO mapred.JobClient:  map 100% reduce 100%
10/07/23 03:51:44 INFO mapred.JobClient: Job complete: job_201007230305_0001
10/07/23 03:51:44 INFO mapred.JobClient: Counters: 18
.
.
.
 Note: a log in this format indicates that the MR program ran successfully
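
If you save the terminal output to a file, the same check can be scripted. This is a rough heuristic based only on the sample log above, not a guarantee of success; the function name job_completed and the log file name job.log are assumptions of mine.

```shell
# Return success (exit status 0) when a saved job log contains both the
# final progress line and the "Job complete" line from the sample log.
job_completed() {
    local log_file="$1"
    grep -qF 'map 100% reduce 100%' "$log_file" &&
        grep -qF 'mapred.JobClient: Job complete: job_' "$log_file"
}
```

Usage: run the job as hadoop jar ... 2>&1 | tee job.log and then call job_completed job.log && echo "job looks fine"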

10.   Copy the contents of the HDFS output directory to the local Linux file system
Command Syntax:
hadoop fs -copyToLocal  <source directory in hdfs>  <destination directory in local linux file system>
Example
hadoop fs -copyToLocal  /userdata/bejoy/output  /home/training/use-case/

11.   Open the corresponding directory on your local Linux box and verify the output
(the output file name would be part-00000)
Here my directory path on the local Linux box would be
home -> training -> use-case -> output

NOTE: Steps 10 and 11 are needed only if you need the output file in the local file system (LFS), either to transfer it to another location or to make it accessible to other applications.
To just view the file, we can use the cat command in HDFS
hadoop fs -cat  <full path of the file in hdfs>

To list the contents of a directory in HDFS, use
hadoop fs -ls <full path of the directory in hdfs>
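
For a quick look at the results without copying anything to the local file system, the cat command can be combined with head. A minimal sketch, where preview_output and the HADOOP override variable are hypothetical names of mine:

```shell
# Print the first few lines of the part files in an HDFS output directory.
# HADOOP is a hypothetical override variable; it defaults to "hadoop".
preview_output() {
    local hdfs_dir="$1"
    local lines="${2:-10}"   # number of lines to show, 10 by default
    # The glob stays quoted so that hadoop, not the local shell, expands it.
    "${HADOOP:-hadoop}" fs -cat "$hdfs_dir/part-*" | head -n "$lines"
}
```

Usage: preview_output /userdata/bejoy/output 20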
