Thursday, May 26, 2011

Generating Recommendations with Mahout for Boolean Data Sets (data sets with no preference value)



Boolean Data Sets
                A Boolean data set is an input data set that doesn’t have a preference value, i.e. each line of the input has the format
UserId1,ItemId1
UserId2,ItemId2
Here the data records only whether a user likes an item or not; there is no preference value associated with the pair.
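To make the input format concrete, here is a minimal sketch in plain Java (no Mahout classes) that groups userId,itemId lines into the set of items each user likes. The class name BooleanDataSketch and the sample IDs are assumptions made for illustration only; this is not how FileDataModel is implemented.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout: illustrates the Boolean input format.
public class BooleanDataSketch {

    // Groups "userId,itemId" lines into userId -> set of liked itemIds.
    static Map<Long, Set<Long>> parse(List<String> lines) {
        Map<Long, Set<Long>> itemsByUser = new TreeMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] parts = line.split(",");
            long userId = Long.parseLong(parts[0].trim());
            long itemId = Long.parseLong(parts[1].trim());
            // No third column: the presence of the pair is the whole signal.
            itemsByUser.computeIfAbsent(userId, k -> new TreeSet<>()).add(itemId);
        }
        return itemsByUser;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("1,10", "1,20", "2,30", "2,40");
        System.out.println(parse(lines)); // prints {1=[10, 20], 2=[30, 40]}
    }
}
```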

                When we use Boolean data sets, we need to choose the similarity algorithm and the recommender appropriately.

Similarity Algorithms
                For Boolean data sets we can use either Tanimoto Coefficient Similarity (TanimotoCoefficientSimilarity) or Log Likelihood Similarity (LogLikelihoodSimilarity).
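As a rough illustration of why the Tanimoto coefficient suits Boolean data, the sketch below computes it for two users from their item sets alone: the number of items both users like divided by the number of items either user likes. It is plain Java with a hypothetical class name (TanimotoSketch) and made-up item IDs, and it is not Mahout's own implementation. Returning NaN when both sets are empty is a choice of this sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not part of Mahout: Tanimoto coefficient on item sets.
public class TanimotoSketch {

    // |A ∩ B| / |A ∪ B|, needing only set membership and no preference values.
    static double tanimoto(Set<Long> a, Set<Long> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return Double.NaN; // undefined for two empty sets (assumption of this sketch)
        }
        Set<Long> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    public static void main(String[] args) {
        Set<Long> user1 = new HashSet<>(Arrays.asList(10L, 20L, 30L));
        Set<Long> user2 = new HashSet<>(Arrays.asList(20L, 30L, 40L));
        // 2 shared items out of 4 distinct items
        System.out.println(tanimoto(user1, user2)); // prints 0.5
    }
}
```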

Recommender
                We need to use GenericBooleanPrefUserBasedRecommender or GenericBooleanPrefItemBasedRecommender.

                Sample code for generating user-based and item-based recommendations is given below.

User Based Recommender for Boolean Data Sets

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.ThresholdUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class UserRecommender {
     
      public static void main(String args[])
      {
            // ID of the user for whom recommendations are generated
            int userId=510;
           
            // number of recommendations to generate
            int noOfRecommendations=5;
           
            // similarity threshold: only users at least this similar join the neighborhood
            double thresholdValue=0.7;
           
            try
            {
                  // Data model created to accept the input file
                  FileDataModel dataModel = new FileDataModel(new File("D://input.txt"));
                 
                  /*TanimotoCoefficientSimilarity is intended for "binary" data sets
                  where a user either expresses a generic "yes" preference for an item or has no preference.*/
                  UserSimilarity userSimilarity = new TanimotoCoefficientSimilarity(dataModel);
                 
                  /*ThresholdUserNeighborhood selects neighbors by a minimum similarity
                   value rather than by a fixed number of neighbors*/
                  UserNeighborhood neighborhood = new ThresholdUserNeighborhood(thresholdValue, userSimilarity, dataModel);
                 
                  /*GenericBooleanPrefUserBasedRecommender is appropriate for use when no notion
                  of preference value exists in the data. */
                  Recommender recommender =new GenericBooleanPrefUserBasedRecommender(dataModel, neighborhood, userSimilarity);
                 
                  // calling the recommend method to generate recommendations
                  List<RecommendedItem> recommendations = recommender.recommend(userId, noOfRecommendations);
           
                  // printing the ID of each recommended item
                  for (RecommendedItem recommendedItem : recommendations)
                        System.out.println(recommendedItem.getItemID());
            }
            catch (IOException e) {
                  // the input file could not be read
                  e.printStackTrace();
            } catch (TasteException e) {
                  // Mahout failed while computing recommendations
                  e.printStackTrace();
            }
           
                 
      }

}

Item Based Recommender for Boolean Data Sets

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ItemRecommender {
     
      public static void main(String args[])
      {
            // ID of the user for whom recommendations are generated
            int userId=510;
           
            // number of recommendations to generate
            int noOfRecommendations=5;
           
            try
            {
                  // Data model created to accept the input file
                  FileDataModel dataModel = new FileDataModel(new File("D://input.txt"));
                 
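                  /*Specifies the similarity algorithm*/
                  ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
                 
                  /*Initializing the recommender. This sample targets Mahout 0.4; later releases
                  provide GenericBooleanPrefItemBasedRecommender, which should be used instead.*/
                  ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);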
                 
                  // calling the recommend method to generate recommendations
                  List<RecommendedItem> recommendations = recommender.recommend(userId, noOfRecommendations);
           
                  // printing the ID of each recommended item
                  for (RecommendedItem recommendedItem : recommendations)
                        System.out.println(recommendedItem.getItemID());
            }
            catch (IOException e) {
                  // the input file could not be read
                  e.printStackTrace();
            } catch (TasteException e) {
                  // Mahout failed while computing recommendations
                  e.printStackTrace();
            }
           
      }
}


Note: For more details on basic recommendations, refer to Generating Recommendations with Mahout.

14 comments:

  1. GenericItemBasedRecommender or GenericBooleanPrefItemBasedRecommender ??

  2. Hi JoNorman
    You can and should use GenericBooleanPrefItemBasedRecommender, since it is the latest, updated implementation. The code samples on this blog were based on an earlier release of Mahout (0.4), which I was working with about a year ago. That version had only GenericItemBasedRecommender, not GenericBooleanPrefItemBasedRecommender.

  3. I tried above code but it displays following error

    java.io.FileNotFoundException: \org\apache\mahout\cf\taste\example\grouplens\input.txt

    how to solve it? I tried all options:

    1)File ratingsFile = new File("/org/apache/mahout/cf/taste/example/grouplens/input.txt");
    DataModel dataModel = new FileDataModel(ratingsFile);
    2)FileDataModel dataModel = new FileDataModel(new File("/org/apache/mahout/cf/taste/example/grouplens/input1.csv"));

  4. Hi Ashish
    It is just a minor issue with the path of your input file. Did you try an ls on the path from the CLI to confirm that the file is present at the mentioned location?

    ls -l /org/apache/mahout/cf/taste/example/grouplens/

    If you get the desired output from the above ls command then the code should run without any exceptions.

    Please reply if you still have any issues.

  5. I was just starting with Mahout and found this approach pretty good for what I'm trying to do. My question is this: I have a data set of users liking or disliking an item. What should the input file look like? Currently it's in the form userid,itemid,(1/-1), depending on whether the user is following or not following.

  6. Hi Aditya
    Your data set is of Boolean type. The better approach is to use a pre-processor that removes all records that have a preference value of -1, and then use the remaining records with the recommendation algorithms available in Mahout.

  7. Replacing the "-1" with "0" might be a good idea as the zero will ensure that the said user is not recommended that item.

  8. In Mahout we need to save vector elements as double values even though our data is binary, which makes the files very large and uses a lot of memory. I have a binary data set of 20k * 200k. How can I reduce the memory usage of clustering (k-means)? Is there any dimension reduction algorithm? Note that I need Manhattan distance on binary data, and the reduced dimensions should preserve that characteristic.

  9. Thanks for the great article!

    But what if we have a large data set (for example 10^6 rows of user_id item_id)? How can we speed up the recommendation calculations? For me it takes about a minute.

  10. Hi Bejoy

    You have mentioned that

    input data set would be of the format UserId1,ItemId1
    UserId2,ItemId2

    Are you saying that the data need not have a binary value (indicating like or dislike)?

    Is an input file of the format below correct?

    UserId2,ItemId2
    1, 10
    1, 20
    2, 30
    2, 40

  11. The reason I was asking is that when I set a preference value (binary 1 or 0) for the user-based recommender, I am not getting any recommendations,

    whereas for the item-based recommender I am getting predictions.

    I shall experiment with the rest in the meantime.

    Thanks for this article, it has helped me a lot.

  12. When I am using an attribute having string values in the training data for a Recommender in Mahout, I am getting a NumberFormatException which is happening during the building of the FileDataModel from the data in the file. If the string attribute value is "1.0" which is basically a number represented as string, then the Recommender is not throwing the NumberFormatException. But if the attribute value is "Washington", then the NumberFormatException is thrown.
    Is there any solution by which I can pass string attribute values as itemID/userID in the training data for Recommenders in Mahout?

  13. I've tried to use the `GenericItemBasedRecommender` with several similarity types (LLR, Tanimoto, Euclidean), and it always gave me the SAME results (precision, recall, etc.).
    Could it be a bug? Has anyone encountered this problem?
    Please help.
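
The reply in comment 6 suggests a pre-processor that drops every record whose preference value is -1 before the data is handed to Mahout. A minimal sketch of that step in plain Java follows, assuming input lines of the form userid,itemid,value as described in comment 5. The class and method names (DislikeFilterSketch, toBooleanRecords) are hypothetical and are not part of Mahout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical pre-processor, not part of Mahout.
public class DislikeFilterSketch {

    // Keeps only records with a positive value and strips the value column,
    // producing Boolean "userId,itemId" lines.
    static List<String> toBooleanRecords(List<String> lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            String[] parts = line.split(",");
            if (parts.length < 3) {
                continue; // skip malformed or blank lines
            }
            double value = Double.parseDouble(parts[2].trim());
            if (value > 0) {
                kept.add(parts[0].trim() + "," + parts[1].trim());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("1,10,1", "1,20,-1", "2,10,1");
        // Prints one surviving record per line: "1,10" then "2,10"
        for (String record : toBooleanRecords(raw)) {
            System.out.println(record);
        }
    }
}
```

The filtered lines can then be written to a file and used as the Boolean input described at the top of the post.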
