Friday, May 27, 2011
Mahout Recommendations in Distributed mode with Hadoop Map Reduce
The implementation of Mahout recommendations in a distributed environment is completely different from the standalone one. In a distributed environment the concepts of data model and neighborhood cease to exist, because the data is spread across multiple machines and computations cannot rely on local data alone. At its core, running Mahout in distributed mode involves a series of mappers and reducers with multiple intermediate result sets. The whole recommendation process begins with computing the co-occurrence matrix and the user vectors.
The Mahout distribution already provides a job that runs recommenders in a distributed environment. Follow the steps below to run Mahout recommendations on Hadoop.
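To make the co-occurrence matrix and user vectors concrete, here is a minimal single-machine sketch in Python. It is not Mahout's code: the function names and the sample data are made up for illustration, and the real job spreads this work over several map and reduce phases. It only shows the idea of counting how often two items appear together and then scoring unseen items against a user's own items.

```python
from collections import defaultdict
from itertools import permutations


def build_user_vectors(prefs):
    """Group (user, item) pairs into one set of items per user."""
    vectors = defaultdict(set)
    for user, item in prefs:
        vectors[user].add(item)
    return dict(vectors)


def build_cooccurrence(user_vectors):
    """For each ordered item pair, count the users who have both items."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in user_vectors.values():
        for a, b in permutations(sorted(items), 2):
            counts[a][b] += 1
    return counts


def recommend(user, user_vectors, cooccurrence, how_many=2):
    """Multiply the co-occurrence matrix by the user's Boolean vector,
    then rank the items the user does not have yet."""
    seen = user_vectors[user]
    scores = defaultdict(int)
    for item in seen:
        for other, count in cooccurrence[item].items():
            if other not in seen:
                scores[other] += count
    # Highest score first; ties broken by item ID for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:how_many]


# Hypothetical Boolean preferences: (user ID, item ID).
prefs = [(1, 101), (1, 102),
         (2, 101), (2, 102), (2, 103),
         (3, 101), (3, 103), (3, 104)]

user_vectors = build_user_vectors(prefs)
cooccurrence = build_cooccurrence(user_vectors)
top = recommend(1, user_vectors, cooccurrence)
print(top)
```

In the distributed job, each of these three functions corresponds roughly to one or more MapReduce passes, and the intermediate dictionaries correspond to the intermediate result sets written to HDFS.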
1. Go to the core directory of the Mahout distribution and run ‘mvn clean package’
(You need Maven installed on your machine.)
Once this is done, verify that the job file mahout-core-0.4-SNAPSHOT.job has been created in the target directory. This is the MapReduce jar for computing recommendations.
2. Copy the input data set (input.txt) into HDFS
hadoop fs -copyFromLocal input.txt /userdata/input/input.txt
3. Copy the file users.txt into HDFS. users.txt should contain the list of user IDs for which recommendations are required, one per line.
hadoop fs -copyFromLocal users.txt /userdata/input/users.txt
4. Run the recommender job
hadoop jar target/mahout-core-0.4-SNAPSHOT.job org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
All the parameters passed to the job are self-explanatory except a few:
‘--numRecommendations’: the number of recommendations to generate for each user ID listed in users.txt
‘-b true’: indicates that the data set is Boolean
‘-s’: specifies which similarity algorithm to use for generating recommendations
5. You can find the output in the following HDFS directory: userdata/bejoy/mahout/output
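Steps 2 and 3 assume that input.txt and users.txt already exist on the local disk. The post does not show their layout, so the sketch below is an assumption: it writes input.txt as comma-separated userID,itemID lines (the layout Mahout's text-based input uses for Boolean data, with no preference value) and users.txt as one user ID per line. The helper name and the sample data are hypothetical.

```python
import os
import tempfile


def write_input_files(prefs, user_ids, directory):
    """Write input.txt (userID,itemID per line) and users.txt (one ID per line)."""
    input_path = os.path.join(directory, "input.txt")
    users_path = os.path.join(directory, "users.txt")
    with open(input_path, "w") as f:
        for user, item in prefs:
            f.write(f"{user},{item}\n")
    with open(users_path, "w") as f:
        for user in user_ids:
            f.write(f"{user}\n")
    return input_path, users_path


# Hypothetical sample data; a temporary directory keeps the example self-contained.
prefs = [(1, 101), (1, 102), (2, 101), (2, 103)]
workdir = tempfile.mkdtemp()
input_path, users_path = write_input_files(prefs, [1, 2], workdir)

with open(input_path) as f:
    input_lines = f.read().splitlines()
with open(users_path) as f:
    user_lines = f.read().splitlines()
print(input_lines, user_lines)
```

The two files can then be copied into HDFS with the copyFromLocal commands from steps 2 and 3.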
In distributed mode, recommendations are computed offline and stored in an RDBMS or HBase for retrieval by real-time applications. For offline computation it is better to choose item-based recommendations, because on high-traffic sites the list of items grows much more slowly than the list of users and user-item relationships. For an e-commerce site, user-based recommendations computed offline need not be accurate: n users are buying m items every minute, and those purchases are not considered when forming the user neighborhood for recommendations served the next moment.
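The offline-then-serve pattern described above can be sketched as follows. This is an illustration under stated assumptions, not part of Mahout: it assumes each output line looks like userID, a tab, then [itemID:score,itemID:score,...], and it uses an in-memory SQLite database as a stand-in for the RDBMS or HBase store. The table name, function names, and sample lines are hypothetical.

```python
import sqlite3


def parse_line(line):
    """Parse 'user<TAB>[item:score,item:score]' into (user, [(item, score), ...])."""
    user, _, rest = line.strip().partition("\t")
    recs = []
    for pair in rest.strip("[]").split(","):
        if pair:
            item, _, score = pair.partition(":")
            recs.append((int(item), float(score)))
    return int(user), recs


def load(conn, lines):
    """Store the precomputed recommendations so they can be looked up by user."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recommendations ("
        "user_id INTEGER, item_id INTEGER, score REAL, "
        "PRIMARY KEY (user_id, item_id))"
    )
    for line in lines:
        if not line.strip():
            continue
        user, recs = parse_line(line)
        conn.executemany(
            "INSERT OR REPLACE INTO recommendations VALUES (?, ?, ?)",
            [(user, item, score) for item, score in recs],
        )
    conn.commit()


def top_items(conn, user, limit=10):
    """Real-time lookup: no computation, just a read of the stored results."""
    rows = conn.execute(
        "SELECT item_id FROM recommendations WHERE user_id = ? "
        "ORDER BY score DESC, item_id LIMIT ?",
        (user, limit),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Hypothetical job output lines (assumed format, see above).
sample = ["1\t[103:3.0,104:1.0]", "2\t[104:2.0]"]
load(conn, sample)
print(top_items(conn, 1))
```

Because the web application only reads from the store, the batch job can be rerun on whatever schedule suits how quickly the item catalog changes.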