Kick Start Hadoop: How to run Hadoop MapReduce jobs without a cluster? With the Cloudera VM.

This document is intended to help Java developers with basic experience start hands-on work with Hadoop MapReduce jobs without setting up a cluster. To follow it, you need basic theoretical knowledge of Hadoop, HDFS, and MapReduce jobs. Some prior knowledge of basic Linux commands is also advisable.

You can try sample MapReduce jobs on your Windows PC without any cumbersome Hadoop setup if you have the Cloudera training VM. This environment is ideal for functional testing on very small sample data volumes; larger inputs won't be supported because of the memory constraints of the VM.

To test your sample MapReduce job in the local Hadoop environment (the Cloudera training VM), follow the steps below in order.

1. Download the following software and Ubuntu image to your Windows PC.

a. Cloudera Training VM
https://ccp.cloudera.com/display/SUPPORT/Cloudera%27s+Hadoop+Demo+VM

b. VMware Player
http://downloads.vmware.com/d/info/desktop_downloads/vmware_player/3_0

2. Install VMware Player. Extract the Cloudera Training VM archive; it contains a *.vmx file. Open that file (it opens in VMware Player).

User credentials for Cloudera VM

User name: training

Password: training

3. Copy the jar and the required input files into the Cloudera Linux box.

Here I have copied the jar to home -> training -> use-case -> source-code and

the input files to home -> training -> use-case -> input

(To browse to a folder on the Linux box, click Places in the top menu bar and select Home Folder. You are now in the /home/training folder of the local Linux file system; from there you can browse and create directories as you do in Windows.)

You can copy and paste files from your Windows file system to the Cloudera Linux box just as you do between folders in Windows.

4. Open a Linux terminal (the terminal icon is on the desktop of the Cloudera Linux box)

5. Create an input directory in HDFS

Command Syntax:

hadoop fs -mkdir <full path of directory in hdfs>

Example

hadoop fs -mkdir /userdata/bejoy/input

6. Copy the contents of the input folder on the Linux box to the HDFS input folder

Command Syntax:

hadoop fs -copyFromLocal <source directory from local linux box> <destination directory in hdfs>

Example

hadoop fs -copyFromLocal /home/training/use-case/input /userdata/bejoy/input/

7. Check that the input files are available in HDFS (optional step)

Command Syntax:

hadoop fs -ls <full path of hdfs directory>

Example

hadoop fs -ls /userdata/bejoy/input/
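Steps 5 to 7 can be wrapped in one small shell function. This is only a sketch: stage_input is a hypothetical helper name, not a Hadoop command, and the paths in the sample call are the ones used in this post. It assumes the hadoop command is on the PATH, as in the training VM terminal.

```shell
#!/bin/sh
# Sketch: stage local input files into HDFS (steps 5 to 7).
# stage_input is a hypothetical helper name, not part of Hadoop.
stage_input() {
    local_dir=$1   # e.g. /home/training/use-case/input
    hdfs_dir=$2    # e.g. /userdata/bejoy/input

    # Step 5: create the HDFS input directory.
    hadoop fs -mkdir "$hdfs_dir" || return 1

    # Step 6: copy the local input folder into HDFS.
    hadoop fs -copyFromLocal "$local_dir" "$hdfs_dir"/ || return 1

    # Step 7: list the HDFS directory to confirm the copy.
    hadoop fs -ls "$hdfs_dir"/
}

# Example call, using the paths from this post:
# stage_input /home/training/use-case/input /userdata/bejoy/input
```

The function stops at the first command that fails, so a failed mkdir is not followed by a copy into a directory that does not exist.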

8. Run the jar file with the required arguments

Command Syntax:

hadoop jar <full path of jar with jar name> <fully qualified name of the Hadoop driver class> <full path of hdfs input directory> <full path of hdfs output directory>

Example

hadoop jar /home/training/use-case/source-code/salarycalculator.jar com.ge.corp.hadoop.salcalc.SalaryCalculator -files /home/training/use-case/reference/location.txt /userdata/bejoy/input /userdata/bejoy/output

Note: The -files option is used when the MR program refers to other reference files while processing the input files.

Note: -D mapred.reduce.tasks specifies the number of reducers used to run the MR job, for example -D mapred.reduce.tasks=2. (The number of output files equals the number of reducers used to run the job.)
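The two notes above can be combined in one command. The sketch below assumes the driver class parses Hadoop's generic options (for example by running through ToolRunner); otherwise -D and -files reach the program as plain arguments. run_salary_job is a hypothetical helper name, and the reducer count passed to it is an arbitrary example value. Because a job normally refuses to start when its output directory already exists, the function checks for that first with hadoop fs -test -e.

```shell
#!/bin/sh
# run_salary_job is a hypothetical wrapper around the command from step 8.
# Assumption: the driver handles generic options (-D, -files), e.g. via ToolRunner.
run_salary_job() {
    reducers=$1                       # number of reducers, e.g. 2
    input_dir=/userdata/bejoy/input
    output_dir=/userdata/bejoy/output

    # Stop early if the HDFS output directory is already there.
    if hadoop fs -test -e "$output_dir"; then
        echo "Output directory already exists: $output_dir" >&2
        return 1
    fi

    # Generic options go after the class name and before the job's own arguments.
    hadoop jar /home/training/use-case/source-code/salarycalculator.jar \
        com.ge.corp.hadoop.salcalc.SalaryCalculator \
        -D mapred.reduce.tasks="$reducers" \
        -files /home/training/use-case/reference/location.txt \
        "$input_dir" "$output_dir"
}

# Example call: run the job with 2 reducers.
# run_salary_job 2
```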

9. If you get a log in the format below in your terminal, proceed to step 10; otherwise troubleshoot the error in your MR code and rerun the job

10/07/23 03:51:17 INFO mapred.FileInputFormat: Total input paths to process : 2

10/07/23 03:51:18 INFO mapred.JobClient: Running job: job_201007230305_0001

10/07/23 03:51:19 INFO mapred.JobClient: map 0% reduce 0%

10/07/23 03:51:33 INFO mapred.JobClient: map 100% reduce 0%

10/07/23 03:51:42 INFO mapred.JobClient: map 100% reduce 100%

10/07/23 03:51:44 INFO mapred.JobClient: Job complete: job_201007230305_0001

10/07/23 03:51:44 INFO mapred.JobClient: Counters: 18

Note: A log in this format indicates that the MR program ran successfully
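If the terminal output is saved to a file, for example by appending 2>&1 | tee job.log to the hadoop jar command, the lines shown above can be checked with grep. The helper below is a rough heuristic that only pattern-matches this post's sample log. log_looks_successful and job.log are hypothetical names, and the check for a "job failed" line is an assumption about how a failure might be reported.

```shell
#!/bin/sh
# log_looks_successful is a hypothetical helper. It only pattern-matches
# the sample log lines from this post and is not an official Hadoop check.
log_looks_successful() {
    log_file=$1

    # Both map and reduce must have reached 100%.
    grep -q 'map 100% reduce 100%' "$log_file" || return 1

    # The job client must have reported the job as complete.
    grep -q 'Job complete: job_' "$log_file" || return 1

    # Assumption: a failed job would mention "job failed" somewhere in the log.
    if grep -qi 'job failed' "$log_file"; then
        return 1
    fi
    return 0
}

# Example call on a saved log:
# log_looks_successful job.log && echo "Log looks fine, go to step 10"
```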

10. Copy the contents of the HDFS output directory to the local Linux file system

Command Syntax:

hadoop fs -copyToLocal <source directory in hdfs> <destination directory in local linux file system>

Example

hadoop fs -copyToLocal /userdata/bejoy/output /home/training/use-case/
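Step 10 can be scripted with a guard against overwriting an earlier copy. This is a minimal sketch: fetch_output is a hypothetical helper, and it assumes that copying the HDFS directory into an existing local folder creates a subfolder with the same name (output in this post).

```shell
#!/bin/sh
# fetch_output is a hypothetical helper for step 10.
# Assumption: the copy lands in <local_parent>/<name of the hdfs directory>.
fetch_output() {
    hdfs_out=$1       # e.g. /userdata/bejoy/output
    local_parent=$2   # e.g. /home/training/use-case
    target="$local_parent/$(basename "$hdfs_out")"

    # Refuse to continue if a previous copy is still in place.
    if [ -e "$target" ]; then
        echo "Local path already exists: $target" >&2
        return 1
    fi

    # Copy the HDFS output directory to the local file system.
    hadoop fs -copyToLocal "$hdfs_out" "$local_parent"/ || return 1

    # Show what was copied.
    ls -l "$target"
}

# Example call, using the paths from this post:
# fetch_output /userdata/bejoy/output /home/training/use-case
```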

11. Open the corresponding directory on your local Linux box and verify the output

(the output file would be named part-00000)

Here the directory path on my local Linux box is

home -> training -> use-case -> output

NOTE: Steps 10 and 11 are needed only if the output file has to be put into the local file system (LFS) and then transferred to another location or made accessible to other applications.

To only view a file, we can use the cat command on HDFS

hadoop fs -cat <full path of the file in hdfs>

To list the contents of a directory in HDFS, we use

hadoop fs -ls <full path of the directory in hdfs>
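The two commands above can be combined to inspect a job's output without copying it to the local file system. This is a sketch under assumptions: show_output is a hypothetical helper, and the part-* pattern assumes the output files follow the part-00000 naming mentioned in step 11.

```shell
#!/bin/sh
# show_output is a hypothetical helper that lists and prints job output in HDFS.
show_output() {
    hdfs_out=$1   # e.g. /userdata/bejoy/output

    # List the output directory first.
    hadoop fs -ls "$hdfs_out" || return 1

    # The pattern is quoted so the local shell does not expand part-*;
    # the hadoop fs command resolves it against HDFS.
    hadoop fs -cat "$hdfs_out/part-*"
}

# Example call, using the output path from this post:
# show_output /userdata/bejoy/output
```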

Thursday, April 28, 2011
