Kick Start Hadoop: Generating Recommendations with Mahout for Boolean Data Sets (data sets with no preference value)

Thursday, May 26, 2011

Boolean Data Sets

A Boolean data set is an input data set that has no preference value. Each line holds only a user ID and an item ID, in the format

UserId1,ItemId1

UserId2,ItemId2

The data records only whether a user likes an item (the pair is present) or does not (the pair is absent); no preference value is associated with the pair.
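To make the format concrete, here is a minimal sketch in plain Java (no Mahout dependency) that groups such lines by user. The class name BooleanInputSketch and the method itemsByUser are hypothetical helpers for illustration only, and the four sample lines are made-up data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: groups "userId,itemId" lines by user.
class BooleanInputSketch {

    static Map<Long, Set<Long>> itemsByUser(List<String> lines) {
        Map<Long, Set<Long>> result = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = line.split(",");
            long userId = Long.parseLong(fields[0].trim());
            long itemId = Long.parseLong(fields[1].trim());
            // The presence of the pair is the only signal; there is no third column.
            result.computeIfAbsent(userId, k -> new TreeSet<>()).add(itemId);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,10", "1,20", "2,30", "2,40");
        System.out.println(itemsByUser(lines)); // prints {1=[10, 20], 2=[30, 40]}
    }
}
```

Each user ends up with a plain set of item IDs, which is all the information a Boolean data set carries.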

When we use Boolean data sets, we need to choose the similarity algorithm and the recommender appropriately.

Similarity Algorithms

For Boolean data sets we can use either Tanimoto coefficient similarity (TanimotoCoefficientSimilarity) or log-likelihood similarity (LogLikelihoodSimilarity).
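As an illustration of why the Tanimoto coefficient suits Boolean data, the sketch below computes it by hand for two users' item sets: the number of items both users have, divided by the number of items either user has. This is plain Java rather than Mahout's own implementation; the class name TanimotoSketch and the choice to return NaN for two empty sets are assumptions of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration, not Mahout code.
class TanimotoSketch {

    // |A intersect B| / |A union B|; needs only set membership, no preference values.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        // Assumption of this sketch: similarity is undefined when both sets are empty.
        return unionSize == 0 ? Double.NaN : (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> userA = new HashSet<>(Arrays.asList(10L, 20L, 30L));
        Set<Long> userB = new HashSet<>(Arrays.asList(20L, 30L, 40L));
        // Two shared items out of four distinct items.
        System.out.println(tanimoto(userA, userB)); // prints 0.5
    }
}
```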

Recommender

We need to use GenericBooleanPrefUserBasedRecommender or GenericBooleanPrefItemBasedRecommender.

Sample code for generating user-based and item-based recommendations is given below.

User Based Recommender for Boolean Data Sets

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommender {

    public static void main(String[] args) {
        // ID of the user for whom recommendations are generated
        int userId = 510;
        // number of recommendations to generate
        int noOfRecommendations = 5;
        // minimum similarity a user must have to be included in the neighborhood
        double thresholdValue = 0.7;

        try {
            // data model built from the input file
            FileDataModel dataModel = new FileDataModel(new File("D://input.txt"));

            /* TanimotoCoefficientSimilarity is intended for "binary" data sets
               where a user either expresses a generic "yes" preference for an
               item or has no preference. */
            UserSimilarity userSimilarity = new TanimotoCoefficientSimilarity(dataModel);

            /* ThresholdUserNeighborhood selects neighbors by a minimum similarity
               value rather than by a fixed number of neighbors. */
            UserNeighborhood neighborhood =
                    new ThresholdUserNeighborhood(thresholdValue, userSimilarity, dataModel);

            /* GenericBooleanPrefUserBasedRecommender is appropriate for use when
               no notion of preference value exists in the data. */
            Recommender recommender =
                    new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, userSimilarity);

            // call the recommend method to generate recommendations
            List<RecommendedItem> recommendations =
                    recommender.recommend(userId, noOfRecommendations);

            for (RecommendedItem recommendedItem : recommendations) {
                System.out.println(recommendedItem.getItemID());
            }
        } catch (IOException e) {
            // the input file could not be read
            e.printStackTrace();
        } catch (TasteException e) {
            // the recommender could not compute a result
            e.printStackTrace();
        }
    }
}
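To show what a similarity threshold does in the user-based case, the following plain-Java sketch picks neighbors whose Tanimoto similarity to the target user is at least the threshold, then ranks the items those neighbors have that the target user lacks. It is an illustration under stated assumptions, not Mahout's implementation: the class ThresholdNeighborhoodSketch, its methods, and the sample data are hypothetical, and scoring an item by the sum of its neighbors' similarities is simply one reasonable rule chosen for the sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of threshold-based neighbor selection on Boolean data.
class ThresholdNeighborhoodSketch {

    static double tanimoto(Set<Long> a, Set<Long> b) {
        Set<Long> common = new HashSet<>(a);
        common.retainAll(b);
        int union = a.size() + b.size() - common.size();
        return union == 0 ? 0.0 : (double) common.size() / union;
    }

    static List<Long> recommend(Map<Long, Set<Long>> data, long userId,
                                double threshold, int howMany) {
        Set<Long> own = data.getOrDefault(userId, Collections.<Long>emptySet());
        Map<Long, Double> scores = new HashMap<>();
        for (Map.Entry<Long, Set<Long>> other : data.entrySet()) {
            if (other.getKey() == userId) {
                continue; // a user is not their own neighbor
            }
            double similarity = tanimoto(own, other.getValue());
            if (similarity < threshold) {
                continue; // below the threshold: not part of the neighborhood
            }
            for (long item : other.getValue()) {
                if (!own.contains(item)) {
                    // Assumed scoring rule: sum the similarities of neighbors having the item.
                    scores.merge(item, similarity, Double::sum);
                }
            }
        }
        List<Long> items = new ArrayList<>(scores.keySet());
        items.sort((x, y) -> Double.compare(scores.get(y), scores.get(x)));
        return new ArrayList<>(items.subList(0, Math.min(howMany, items.size())));
    }

    // Made-up sample data: user ID -> set of item IDs.
    static Map<Long, Set<Long>> sampleData() {
        Map<Long, Set<Long>> data = new HashMap<>();
        data.put(1L, new HashSet<>(Arrays.asList(10L, 20L, 30L, 40L, 50L)));
        data.put(2L, new HashSet<>(Arrays.asList(10L, 20L, 30L, 40L, 50L, 60L)));      // similarity 5/6
        data.put(3L, new HashSet<>(Arrays.asList(10L, 20L, 30L, 40L, 50L, 60L, 70L))); // similarity 5/7
        data.put(4L, new HashSet<>(Arrays.asList(10L, 80L)));                          // similarity 1/6
        return data;
    }

    public static void main(String[] args) {
        // Users 2 and 3 pass the 0.7 threshold; user 4 does not, so item 80 is never considered.
        System.out.println(recommend(sampleData(), 1L, 0.7, 5)); // prints [60, 70]
    }
}
```

With a threshold, the neighborhood can be empty for users who resemble nobody closely enough, in which case no recommendations come back.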

Item Based Recommender for Boolean Data Sets

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommender {

    public static void main(String[] args) {
        // ID of the user for whom recommendations are generated
        int userId = 510;
        // number of recommendations to generate
        int noOfRecommendations = 5;

        try {
            // data model built from the input file
            FileDataModel dataModel = new FileDataModel(new File("D://input.txt"));

            // similarity algorithm; LogLikelihoodSimilarity works on Boolean data
            ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);

            /* Initializing the recommender. This sample was written against
               Mahout 0.4, which had only GenericItemBasedRecommender; later
               releases provide GenericBooleanPrefItemBasedRecommender, which
               should be used instead for Boolean data. */
            ItemBasedRecommender recommender =
                    new GenericItemBasedRecommender(dataModel, itemSimilarity);

            // call the recommend method to generate recommendations
            List<RecommendedItem> recommendations =
                    recommender.recommend(userId, noOfRecommendations);

            for (RecommendedItem recommendedItem : recommendations) {
                System.out.println(recommendedItem.getItemID());
            }
        } catch (IOException e) {
            // the input file could not be read
            e.printStackTrace();
        } catch (TasteException e) {
            // the recommender could not compute a result
            e.printStackTrace();
        }
    }
}
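For readers curious about what a log-likelihood similarity measures, here is a hedged, self-contained sketch of the log-likelihood ratio (G² statistic) for two items, computed from four counts: users with both items (k11), with only the first (k12), with only the second (k21), and with neither (k22). It is written from the standard formula rather than copied from Mahout; the class LogLikelihoodSketch is hypothetical, and the mapping 1 - 1/(1 + ratio) used to squash the ratio into the range [0, 1) is an assumption of this sketch.

```java
// Hypothetical illustration of a log-likelihood ratio on co-occurrence counts.
class LogLikelihoodSketch {

    // x * ln(x), with the usual convention that 0 * ln(0) is 0
    static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized entropy of a list of counts: N ln N minus the sum of k ln k
    static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed mapping of the ratio into [0, 1): larger ratio, higher similarity.
    static double similarity(long k11, long k12, long k21, long k22) {
        return 1.0 - 1.0 / (1.0 + logLikelihoodRatio(k11, k12, k21, k22));
    }

    public static void main(String[] args) {
        // The two items always occur together: strong association.
        System.out.println(similarity(2, 0, 0, 2));
        // The two items occur independently: ratio is (approximately) zero.
        System.out.println(similarity(1, 1, 1, 1));
    }
}
```

The ratio is large when two items co-occur more often than chance would explain, which is why it needs only presence or absence of a user-item pair and no preference values.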

Note: For more details on basic recommendations, refer to the post "Generating Recommendations with Mahout".

14 comments:

JoNorman, July 19, 2011 at 12:53 AM
GenericItemBasedRecommender or GenericBooleanPrefItemBasedRecommender ??
Bejoy KS, July 19, 2011 at 3:26 AM
Hi JoNorman
You can, and should, use GenericBooleanPrefItemBasedRecommender, since it is the newer implementation. The code samples on this blog were based on an earlier release of Mahout (0.4); I was working on this about a year ago. That version had only GenericItemBasedRecommender, not GenericBooleanPrefItemBasedRecommender.
Ashish, February 23, 2012 at 4:09 AM
I tried above code but it displays following error

java.io.FileNotFoundException: \org\apache\mahout\cf\taste\example\grouplens\input.txt

how to solve it? I tried all options:

1) File ratingsFile = new File("/org/apache/mahout/cf/taste/example/grouplens/input.txt");
   DataModel dataModel = new FileDataModel(ratingsFile);
2) FileDataModel dataModel = new FileDataModel(new File("/org/apache/mahout/cf/taste/example/grouplens/input1.csv"));
Bejoy KS, February 23, 2012 at 10:38 AM
Hi Ashish
It is just a minor issue with the path of your input file. Did you run ls on the path from the CLI and confirm that the file is present at that location?

ls -l /org/apache/mahout/cf/taste/example/grouplens/

If you get the desired output from the above ls command then the code should run without any exceptions.

Please revert if you still have any issues.
Aditya Raghuwanshi, April 9, 2012 at 1:31 AM
I was just starting with Mahout and found this approach pretty good for what I'm trying to do. My question is this: I have a dataset of users liking or disliking an item. What would the input file look like? Currently it's in the form userid,itemid,(1/-1), depending on whether the user is following or not following.
Bejoy KS, May 15, 2012 at 7:04 AM
Hi Aditya
Your data set is of Boolean type. The better approach is to use a preprocessor that removes all records with a preference value of -1, and then to feed the result to the recommendation algorithms available in Mahout.
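A minimal sketch of such a preprocessor, in plain Java with no Mahout dependency, might look like the following. The class DislikeFilterSketch and the method keepLikes are hypothetical names, and the sketch assumes every line has the userid,itemid,value form described in the question; it also drops the value column so the output is in the two-column Boolean format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical preprocessor: removes dislike records and the preference column.
class DislikeFilterSketch {

    static List<String> keepLikes(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            if (fields.length < 3) {
                continue; // assumption: lines without a value column are malformed and skipped
            }
            if (Integer.parseInt(fields[2].trim()) == -1) {
                continue; // a dislike: remove the record entirely
            }
            kept.add(fields[0].trim() + "," + fields[1].trim());
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("1,10,1", "1,20,-1", "2,10,1");
        for (String line : keepLikes(input)) {
            System.out.println(line); // prints 1,10 then 2,10
        }
    }
}
```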
Jaspreet, June 11, 2012 at 1:52 PM
Replacing the "-1" with "0" might be a good idea as the zero will ensure that the said user is not recommended that item.
Masoud_mj, July 11, 2012 at 12:28 AM
In mahout we need to save vectors elements as double values while we have binary data and this makes files very large and takes large memory. I have a binary dataset 20k * 200k. How can I reduce the clustering (kmeans) memory usage. Is there any dimension reduction algorithm? Note that I need Manhattan distance on binary data and the reduced dimensions should maintain that characteristics
Mr Gull, October 22, 2012 at 1:15 PM
Thanks for the great article!

But what if we have large dataset (for example 10^6 rows of user_id item_id)? How to speed up recommendation calculations? For me it takes about a minute

രജിത്ത് രവി, January 18, 2013 at 3:36 AM
Hi Bejoy

you have mentioned that

input data set would be of the format UserId1,ItemId1
UserId2,ItemId2

Are you saying that the data need not have a binary value (indicating like or dislike)?

is an input file of below mentioned format correct?

UserId2,ItemId2
1, 10
1, 20
2, 30
2, 40

രജിത്ത് രവി, January 18, 2013 at 3:57 AM
The reason I was asking is that when I set the preference value (as binary 1 or 0) for the user-based recommendation system, I do not get any recommendations,

whereas for the item-based recommendations I do get predictions.

I shall experiment with the rest in the meantime.

Thanks for this article, it has helped me a lot
ANTARIP BISWAS, July 11, 2013 at 12:04 AM
When I am using an attribute having string values in the training data for a Recommender in Mahout, I am getting a NumberFormatException which is happening during the building of the FileDataModel from the data in the file. If the string attribute value is "1.0" which is basically a number represented as string, then the Recommender is not throwing the NumberFormatException. But if the attribute value is "Washington", then the NumberFormatException is thrown.
Is there any solution by which I can pass string attribute values as itemID/userID in the training data for Recommenders in Mahout?
jack, October 22, 2014 at 5:39 AM
I've tried to use the `GenericItemBasedRecommender` with several similarity types (llr, tanimoto, euclidean) - AND it always gave me the SAME results.. (precision, recall, etc)..
May it be a bug? Has anyone encountered this problem?
Pls help.