Friday, May 27, 2011

Mahout Recommendations in Distributed mode with Hadoop Map Reduce


The implementation of Mahout recommendations in a distributed environment is completely different from the standalone one. In a distributed environment the concepts of DataModel and neighborhood cease to exist, because the data is spread across multiple machines and computations are not based on local data alone. At its core, running Mahout in distributed mode involves a series of mappers and reducers with multiple intermediate result sets. The recommendation process begins by computing the co-occurrence matrix and the user vectors.
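To make the co-occurrence idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration of the arithmetic, not Mahout's MapReduce implementation: the preference data and the function names (cooccurrence, recommend) are hypothetical, and everything runs on a single machine. It counts how often two items appear together in a user's history, then scores unseen items for a user by summing the co-occurrence counts of the items that user already has.

```python
from collections import defaultdict
from itertools import permutations

# Hypothetical boolean preference data: user id -> set of item ids
prefs = {
    1: {101, 102, 103},
    2: {101, 102},
    3: {102, 103, 104},
}

def cooccurrence(prefs):
    """Count, for every ordered item pair, how many users have both items."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in prefs.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts

def recommend(user, prefs, counts, n=5):
    """Multiply the co-occurrence matrix by the user's boolean vector
    and rank the items the user does not have yet."""
    owned = prefs[user]
    scores = defaultdict(int)
    for item in owned:
        for other, count in counts[item].items():
            if other not in owned:
                scores[other] += count
    # Highest score first; ties broken by item id for a stable order
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

counts = cooccurrence(prefs)
top = recommend(2, prefs, counts)
print(top)  # [(103, 3), (104, 1)]
```

In the distributed job the same two steps (building the matrix, multiplying it by each user vector) are split across mappers and reducers, which is why no single machine needs the whole data set.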
The Mahout distribution already provides a job for running recommenders in a distributed environment. Follow the steps below to run Mahout recommendations on Hadoop.

1.       Go to the core directory of the Mahout distribution and run ‘mvn clean package’
(You should have Maven installed on your machine.)
Once this is done, verify that the job file mahout-core-0.4-SNAPSHOT.job has been created in the target directory. This is the MapReduce jar for computing recommendations.

2.       Copy the input data set (input.txt) into HDFS
hadoop fs -copyFromLocal input.txt /userdata/input/input.txt

3.       Copy the file users.txt into HDFS. users.txt should contain the list of user IDs for which recommendations are required, one per line.
hadoop fs -copyFromLocal users.txt /userdata/input/users.txt

4.       Run the recommender job

hadoop jar target/mahout-core-0.4-SNAPSHOT.job org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
-Dmapred.input.dir=/userdata/input/input.txt \
-Dmapred.output.dir=/userdata/bejoy/output \
--usersFile /userdata/input/users.txt \
--numRecommendations 5 \
-b true \
-s SIMILARITY_TANIMOTOCOEFFICIENT

All the parameters passed to the job are self-explanatory except a few:
‘--numRecommendations’ is the number of recommendations to be generated for each user ID listed in users.txt
‘-b true’ indicates that the data set is Boolean (it records only user-item associations, without preference values)
‘-s’ specifies which similarity algorithm is used for generating recommendations

5.       You can find the output in the HDFS directory given as mapred.output.dir, i.e. /userdata/bejoy/output
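The SIMILARITY_TANIMOTOCOEFFICIENT option used in step 4 deserves a short illustration. The Tanimoto coefficient of two items is the number of users who have both items divided by the number of users who have at least one of them, which is why it suits Boolean data. The sketch below is a plain Python illustration of that formula, not Mahout code; the item_users data and the tanimoto function name are made up for the example.

```python
def tanimoto(users_a, users_b):
    """Tanimoto (Jaccard) coefficient: |A intersect B| / |A union B|."""
    union = len(users_a | users_b)
    return len(users_a & users_b) / union if union else 0.0

# Hypothetical data: item id -> set of user ids who interacted with the item
item_users = {
    101: {1, 2},
    102: {1, 2, 3},
    103: {1, 3},
}

sim_101_102 = tanimoto(item_users[101], item_users[102])  # 2 shared users out of 3
sim_101_103 = tanimoto(item_users[101], item_users[103])  # 1 shared user out of 3
print(sim_101_102, sim_101_103)
```

A value of 1.0 means exactly the same users have both items, and 0.0 means no user has both.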

In distributed mode, the recommendations are computed offline and stored in an RDBMS or HBase for retrieval by real-time applications. For offline computation it is better to choose item-based recommendations, because on high-traffic sites the list of items grows much more slowly than the list of users and user-item relationships. On an e-commerce site, user-based recommendations computed offline may not be accurate: n users buy m items every minute, and these new entries are not considered when forming the user neighborhood for recommendations served the next moment.
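As a rough sketch of the "store offline results for real-time retrieval" step, the Python below loads recommender output into a table and looks up a user's top items. Several things here are assumptions: the output lines are assumed to look like userID, a tab, then [itemID:score,itemID:score,...]; SQLite stands in for whatever RDBMS or HBase table you actually use; and the table name, column names, and sample lines are hypothetical.

```python
import sqlite3

def parse_line(line):
    """Parse 'userID<TAB>[itemID:score,...]' into (user, [(item, score), ...]).
    The line format is an assumption about the job's text output."""
    user, _, recs = line.strip().partition("\t")
    pairs = []
    for entry in recs.strip("[]").split(","):
        if entry:
            item, _, score = entry.partition(":")
            pairs.append((int(item), float(score)))
    return int(user), pairs

def load(lines, conn):
    """Store every (user, item, score) triple, replacing older results."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id INTEGER, item_id INTEGER, score REAL, "
        "PRIMARY KEY (user_id, item_id))"
    )
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        user, pairs = parse_line(line)
        conn.executemany(
            "INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?)",
            [(user, item, score) for item, score in pairs],
        )
    conn.commit()

def top_items(conn, user, n=5):
    """Real-time lookup: the best n precomputed items for one user."""
    rows = conn.execute(
        "SELECT item_id FROM recommendations WHERE user_id = ? "
        "ORDER BY score DESC LIMIT ?",
        (user, n),
    )
    return [row[0] for row in rows]

# Hypothetical sample lines standing in for the job's part files
sample = ["1\t[103:3.0,104:1.0]", "2\t[105:4.5]"]
conn = sqlite3.connect(":memory:")
load(sample, conn)
print(top_items(conn, 1))  # [103, 104]
```

The web application then only performs the cheap lookup in top_items, while the expensive computation is rerun offline on whatever schedule the data requires.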

9 comments:

  1. Hi, for user-based recommendations, what would be the best way of using
    Mahout + Hadoop?
    Offline calculations are great, but preferences change very fast.
    Online calculations will take more time.
    What would be the most optimal solution for this?
