Thursday, June 2, 2011

Analyzing Apache logs with Pig


Analyzing log files and extracting meaningful information from them is a common use case for Hadoop. We don’t have to write MapReduce programs for this kind of analysis; tools like Pig and Hive can do the job instead. This post gives you a starting point for the analysis, using Pig on Apache logs. Pig comes with libraries that help us load Apache log files and clean up string values taken from raw log lines. These functions live in piggybank.jar, which is usually found under the pig/contrib/piggybank/java/ directory. As a first step we need to register this jar file with our Pig session; only then can we use its functions in our Pig Latin.
1. Register the PiggyBank jar
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
Once we have registered the jar file we need to define the functions we are going to use in our Pig Latin. Any basic Apache log analysis needs a loader that splits each log line into fields (columns) in Pig. We can define an Apache log loader as follows:
2. Define a log loader
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
(PiggyBank has other log loaders as well, such as CombinedLogLoader.)
In Apache log files the default date format is ‘dd/MMM/yyyy:HH:mm:ss Z’. A full timestamp like that is not much help when we want to analyze by day; we need the date without the time. For that we use DateExtractor(), passing it the output format we want.
3. Define a date extractor
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
Once we have defined the required functions, we first need to load the log file into Pig.
4. Load the Apache log file into Pig
--load the log files from hdfs into pig using CommonLogLoader
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);
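
Before moving on, it can help to check what actually got loaded. The sketch below repeats the setup from steps 1 to 4 so that it can run on its own, then uses DESCRIBE, LIMIT and DUMP to look at the schema and at a few extracted day values. The aliases days and first_rows are made up for this illustration, and the paths are simply the ones used in this post.

```pig
-- setup from steps 1 to 4 (same jar location and log file as in this post)
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);

-- print the schema Pig has recorded for the relation
DESCRIBE logs;

-- keep the raw timestamp next to the extracted day (days is a hypothetical alias)
days = FOREACH logs GENERATE dt, DayExtractor(dt) AS day;

-- look at only a handful of rows instead of the whole file (first_rows is a hypothetical alias)
first_rows = LIMIT days 5;
DUMP first_rows;
```

If the day column comes back empty or the row count looks too low, that is a hint that some log lines do not match the format the chosen loader expects.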

Now we are ready to dive into the actual log analysis. There are many pieces of information you may need to extract from a log; we’ll look at a couple of common requirements here.

Note: You need to first register the jar, define the functions to be used and load the log files into Pig before trying out any of the Pig Latin below.

Requirement 1: Find unique hits per day
Pig Latin
--Extracting the day alone and grouping records based on days
--(GROUP ... BY does not accept an AS alias on the key; the key is available as "group" afterwards)
grpd = GROUP logs BY DayExtractor(dt);
--looping through each group to get the unique no of userIds
cntd = FOREACH grpd
{
                tempId =  logs.userId;
                uniqueUserId = DISTINCT tempId;
                GENERATE group AS day,COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt desc;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult1';
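
In many access logs the user field is just '-' unless HTTP authentication is in use, so counting distinct userIds may not tell you much. As a variation, here is a sketch that counts distinct IP addresses per day next to the total number of requests. It repeats the setup statements so it can run on its own. The aliases ips, uniqueIps, ipcnt and ipsrtd, the column names visitors and hits, and the output directory ApacheLogResult1_ips are hypothetical names chosen for this example.

```pig
-- setup from steps 1 to 4 (same jar location and log file as in this post)
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);

-- group all records by day
grpd = GROUP logs BY DayExtractor(dt);

-- per day: number of distinct client IPs and total number of log records
ipcnt = FOREACH grpd
{
    ips = logs.ip_address;
    uniqueIps = DISTINCT ips;
    GENERATE group AS day, COUNT(uniqueIps) AS visitors, COUNT(logs) AS hits;
}

-- busiest days first, judged by distinct IPs
ipsrtd = ORDER ipcnt BY visitors DESC;

-- hypothetical output directory, pick one that does not exist yet
STORE ipsrtd INTO '/userdata/bejoys/pig/ApacheLogResult1_ips';
```

Distinct IPs are only a rough stand-in for distinct visitors, since several users can share one address behind a proxy.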

Requirement 2: Find unique hits to websites (IPs) per day
Pig Latin

--Extracting the day alone and grouping records based on day and ip address
grpd = GROUP logs BY (DayExtractor(dt), ip_address);
--looping through each group to get the unique no of userIds
--(group is a (day, ip_address) tuple here, so it is flattened into two columns)
cntd = FOREACH grpd
{
                tempId =  logs.userId;
                uniqueUserId = DISTINCT tempId;
                GENERATE FLATTEN(group) AS (day, ip_address),COUNT(uniqueUserId) AS cnt;
}
--sorting the processed records based on no of unique user ids in descending order
srtd = ORDER cntd BY cnt desc;
--storing the final result into a hdfs directory
STORE srtd INTO '/userdata/bejoys/pig/ApacheLogResult2';
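
When the grouping key has two parts, it is often handy to pull out a single day and look at its busiest IPs right in the shell. The following is a minimal sketch of that idea, repeating the setup so it can run on its own. It assumes the log actually contains entries for 2011-01-01 (guessed from the file name), and the aliases oneDay, oneDaySrtd and top10 are hypothetical.

```pig
-- setup from steps 1 to 4 (same jar location and log file as in this post)
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);

-- group by day and ip address; the key becomes a two-field tuple called group
grpd = GROUP logs BY (DayExtractor(dt), ip_address);

-- flatten the key into separate day and ip_address columns and count distinct users
cntd = FOREACH grpd
{
    tempId = logs.userId;
    uniqueUserId = DISTINCT tempId;
    GENERATE FLATTEN(group) AS (day, ip_address), COUNT(uniqueUserId) AS cnt;
}

-- keep a single day (assumed to be present in this log file)
oneDay = FILTER cntd BY day == '2011-01-01';

-- show the ten IPs with the most distinct users for that day
oneDaySrtd = ORDER oneDay BY cnt DESC;
top10 = LIMIT oneDaySrtd 10;
DUMP top10;
```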

Note: When you use Pig Latin in the grunt shell, there are a few things to keep in mind
1. When we enter a Pig statement in grunt and press enter, only the syntax and semantic checks are done; no execution is triggered.
2. The Pig statements are executed only when a STORE (or DUMP) statement is submitted, i.e. the MapReduce jobs are triggered only at that point.
3. You don’t have to repeat the register, define and load steps for every analysis. Once they have been issued, the same logs alias can be used for all related operations in that session. Once you are out of the grunt shell these definitions are lost, and you’d have to perform the register and log file loading steps all over again.
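
The points above can be seen in a short session. This is only a sketch, repeating the setup so it can run on its own; the alias hits is hypothetical. It uses EXPLAIN, which prints the plans Pig has built for an alias without running them, and DUMP, which like STORE triggers execution but prints the result to the console instead of writing it to HDFS.

```pig
-- setup from steps 1 to 4 (same jar location and log file as in this post)
REGISTER /usr/lib/pig/contrib/piggybank/java/piggybank.jar;
DEFINE ApacheCommonLogLoader org.apache.pig.piggybank.storage.apachelog.CommonLogLoader();
DEFINE DayExtractor org.apache.pig.piggybank.evaluation.util.apachelogparser.DateExtractor('yyyy-MM-dd');
logs = LOAD '/userdata/bejoys/pig/p01/access.log.2011-01-01' USING ApacheCommonLogLoader AS (ip_address, rfc, userId, dt, request, serverstatus, returnobject, referersite, clientbrowser);

-- these statements are only parsed and checked; no MapReduce job starts yet
grpd = GROUP logs BY DayExtractor(dt);
hits = FOREACH grpd GENERATE group AS day, COUNT(logs) AS cnt;

-- print the plans Pig would run for hits, still without running any job
EXPLAIN hits;

-- this is the point where the jobs are actually launched
DUMP hits;
```

DUMP is convenient for small results while experimenting; for anything large, STORE into an HDFS directory as in the requirements above.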

7 comments:

  1. This is really informative ... Thanks

  2. This is really good. Do you have the sample access.log file (userdata/bejoys/pig/p01/access.log.2011-01-01)? I want to test it in a local environment. You can email it to srikrishna94atyahoo.com

  3. Thanks a lot. In most cases Hive is used to analyze server logs, but compared with Hive, Pig makes it easy to analyze log files. Thanks for sharing valuable information about Pig and its operations. Keep going and share more tips, thanks.


  5. This is a very nice article. When I tried to extract the date using DayExtractor(dt) as day, I got an "expected semicolon" error, and some of the records were missed when I processed an access log file with more than a thousand rows.
