problem in recommender similarity computation (taste)
Hi, I've noticed a problem in the non-Hadoop (Taste) version of the recommender package. The problem is in AbstractSimilarity (in package org.apache.mahout.cf.taste.impl.similarity). This class is the base class for computing similarity values between vectors of users or items, and it assumes that the similarity between two vectors is computed using only the commonly rated items/users. Consider the following two vectors:

V1: _, 3, 4, _, 2
V2: 3, 5, _, 2, 4

where _ means no rating. For these two vectors, the cosine or Pearson similarity is computed on the following sub-vectors: 3, 2 and 5, 4. However, if the number of common ratings is small, the similarity result will be very unreliable. This is indeed the case: if you run the code on the MovieLens dataset and measure recall, the results are very bad. There are two possible solutions: 1. There should be a parameter n which determines the minimum number of common ratings needed to compute a similarity; otherwise the system should return NaN. 2. The similarity should be computed using all the ratings; for the two vectors above, the cosine similarity should be (3*5+2*4)/(sqrt(3^2+4^2+2^2)*sqrt(3^2+5^2+2^2+4^2)). Tevfik
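For concreteness, here is a minimal standalone sketch of solution 2 in plain Java. This is illustrative code, not Mahout's; the class and method names and the NaN encoding for missing ratings are made-up choices. The dot product uses only co-rated positions, but each norm covers all of that vector's own ratings.

public class AllRatingsCosine {

  static double cosineAllRatings(double[] v1, double[] v2) {
    double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
    for (int i = 0; i < v1.length; i++) {
      boolean rated1 = !Double.isNaN(v1[i]);
      boolean rated2 = !Double.isNaN(v2[i]);
      if (rated1 && rated2) {
        dot += v1[i] * v2[i]; // only co-rated positions contribute to the numerator
      }
      if (rated1) norm1 += v1[i] * v1[i]; // each norm uses all of a vector's own ratings
      if (rated2) norm2 += v2[i] * v2[i];
    }
    return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
  }

  public static void main(String[] args) {
    double[] v1 = {Double.NaN, 3, 4, Double.NaN, 2};
    double[] v2 = {3, 5, Double.NaN, 2, 4};
    // (3*5 + 2*4) / (sqrt(29) * sqrt(54)) is approximately 0.581
    System.out.println(cosineAllRatings(v1, v2));
  }
}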
Re: Can user id and item id be negative integers?
AbstractIDMigrator is for being able to use String IDs (it converts Strings to longs). IDs are stored in long types, so there should not be any problems with negative IDs, but in practice I have not worked with negative IDs before. Tevfik

On Wed, Aug 6, 2014 at 3:51 AM, Peng Zhang pzhang.x...@gmail.com wrote: Hi, Does this support the possibility that user/item ids can be negative? I am reading through the source code of org.apache.mahout.cf.taste.impl.model.AbstractIDMigrator. The hash() function converts a string id to a long id like this. It's quite possible that the long id returned is a negative one, when the leading bit is 1 :)

protected final long hash(String value) {
  byte[] md5hash;
  synchronized (md5Digest) {
    md5hash = md5Digest.digest(value.getBytes(Charsets.UTF_8));
    md5Digest.reset();
  }
  long hash = 0L;
  for (int i = 0; i < 8; i++) {
    hash = hash << 8 | (md5hash[i] & 0x00FFL);
  }
  return hash;
}

Hi Ted, I am running the in-memory versions of GenericItemBasedRecommender and SVDRecommender, i.e. I am using them in my Java code. Hi Pat, Not all user ids are negative. Input file sample:

...
-1250,6929,1
-1250,7059,1
-1250,7654,1
-1250,8094,1
-1250,9486,1
-1250,9563,3
10018000,11080,1
10018000,11176,1
10018000,11196,1
10018000,12220,1
10018000,12447,1
10018000,13213,1
...

Item-based recommender output sample:

User,Brand,Scoring
-1250,12352,5.0
-1250,14261,5.0
-1250,15934,4.309238
-1250,16463,3.0
-1250,3627,1.0
1025250,29099,1.0
1025250,18741,1.0
1025250,14261,1.0
...

SVD recommender output sample:

User,Brand,Scoring
-1250,3627,3.9108906
-1250,27791,3.8262475
-1250,251,3.744943
-1250,20979,3.5778444
-1250,14482,3.5494242
1025250,27791,2.2692947
1025250,251,1.9651389
1025250,14482,1.9196383
1025250,12220,1.9153352
...

Thank you, Peng Zhang M: +86 186-1658-7856 pzhang.x...@gmail.com

On Aug 6, 2014, at 7:26 AM, Pat Ferrel p...@occamsmachete.com wrote: Are they ALL negative? Maybe only the non-negatives are working or there are some conditions where negatives work. I certainly wouldn't count on it because I'll bet it isn't working as it should.

On Aug 5, 2014, at 4:03 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Tue, Aug 5, 2014 at 3:21 AM, Peng Zhang pzhang.x...@gmail.com wrote: But today I am trying to use negative user ids and item ids, and they are working well with the item recommender and SVD recommender. Which programs are you using?
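To see Peng's point in isolation, here is a small self-contained demo (the class name and input string are hypothetical) that packs the first eight MD5 bytes into a long the same way hash() above does. Roughly half of all digests have the leading bit set, so negative ids are expected.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class NegativeIdDemo {
  public static void main(String[] args) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    byte[] d = md5.digest("some-user-id".getBytes(StandardCharsets.UTF_8));
    // Pack the first 8 digest bytes into a long, as hash() does.
    long id = 0L;
    for (int i = 0; i < 8; i++) {
      id = id << 8 | (d[i] & 0x00FFL);
    }
    System.out.println(id); // negative whenever the leading bit of d[0] is 1
  }
}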
Re: Recommender Systems - RecommenderIRStatsEvaluator
- Is there a way to specify the train and test set like you can with the *RecommenderEvaluator*? No, though you can specify the evaluation percentage. This follows from the logic of the evaluation: take away relevant items, make recommendations, and see whether the relevant items appear in the top-N lists. It is also possible (and I think in some ways better) to first split the data into test and training sets and select the relevant items from the test set, but this is not how it is implemented. - Is it possible to perform k-fold cross-validation with the *RecommenderIRStatsEvaluator*? I don't think so. - How does the default way of evaluation work with *RecommenderIRStatsEvaluator*? I tried to explain it above. I would also note that it is not difficult to write your own evaluation code for your specific purposes. Tevfik

On Tue, May 20, 2014 at 3:51 PM, Floris Devriendt florisdevrie...@gmail.com wrote: Hey all, The *RecommenderEvaluator* has the option to choose how big your training set is (and thereby the test set size as well), but the *RecommenderIRStatsEvaluator* does not seem to have this argument in its *evaluate()* method. That's why I was wondering how the internals of the *RecommenderIRStatsEvaluator* work. I have the following questions on *RecommenderIRStatsEvaluator*: - Is there a way to specify the train and test set like you can with the *RecommenderEvaluator*? - Is it possible to perform k-fold cross-validation with the *RecommenderIRStatsEvaluator*? - How does the default way of evaluation work with *RecommenderIRStatsEvaluator*? If somebody has an answer to any of these questions it would be greatly appreciated. Kind regards, Floris Devriendt
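For readers looking for a starting point, here is a sketch of how the evaluator is typically invoked against the Mahout 0.x Taste API. The file name is a placeholder and the user-based recommender inside the builder is just one possible choice; note that the last argument controls how many users are evaluated, not the train/test split.

import java.io.File;
import org.apache.mahout.cf.taste.eval.IRStatistics;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.eval.GenericRecommenderIRStatsEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class IRStatsExample {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    RecommenderBuilder builder = dataModel -> {
      UserSimilarity sim = new PearsonCorrelationSimilarity(dataModel);
      return new GenericUserBasedRecommender(dataModel,
          new NearestNUserNeighborhood(10, sim, dataModel), sim);
    };
    IRStatistics stats = new GenericRecommenderIRStatsEvaluator().evaluate(
        builder, null, model, null,
        10,                                                  // size of the top-N list
        GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, // pick relevance threshold per user
        0.2);                                                // evaluate 20% of the users
    System.out.println("precision " + stats.getPrecision() + ", recall " + stats.getRecall());
  }
}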
Re: Number of features for ALS
Interesting topic. Ted, can you give examples of those mathematical assumptions underpinning ALS which are violated by the real world? On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning ted.dunn...@gmail.com wrote: How can there be any other practical method? Essentially all of the mathematical assumptions underpinning ALS are violated by the real world. Why would any mathematical consideration of the number of features be much more than heuristic? That said, you can make an information content argument. You can also make the argument that if you take too many features, it doesn't much hurt, so you should always take as many as you can compute. On Thu, Mar 27, 2014 at 6:33 AM, Sebastian Schelter s...@apache.org wrote: Hi, does anyone know of a principled approach to choosing the number of features for ALS (other than cross-validation)? --sebastian
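In the spirit of the cross-validation answer, here is a hedged sketch of a simple feature-count sweep using the in-memory Taste API. The file name, lambda value, iteration count, and candidate feature counts are placeholder choices, not recommendations.

import java.io.File;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.ALSWRFactorizer;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;

public class FeatureCountSweep {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();
    for (int numFeatures : new int[] {5, 10, 20, 50, 100}) {
      final int k = numFeatures;
      RecommenderBuilder builder = dataModel ->
          new SVDRecommender(dataModel, new ALSWRFactorizer(dataModel, k, 0.065, 15));
      // 90% train / 10% test, one run per k; average several runs in practice.
      double score = evaluator.evaluate(builder, null, model, 0.9, 1.0);
      System.out.println(k + " features -> mean absolute error " + score);
    }
  }
}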
Re: Recommend items not rated by any user
Sorry, there was a typo in the previous paragraph. If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Juan, If I remember correctly, AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and the similarity metric returns a non-NaN similarity value that is with at least one of the items preferred by the user. Tevfik

On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org wrote: On 03/05/2014 01:23 PM, Juan José Ramos wrote: Thanks for the reply, Sebastian. I am not sure that should be implemented in the abstract base class, though, because PreferredItemsNeighborhoodCandidateItemsStrategy, by definition, returns the items not rated by the user and rated by somebody else. Good point. So we seem to need special implementations. Back to my last post: I have been playing around with AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy, and although they both do what I wanted (recommend items not previously rated by any user), I honestly can't tell the difference between the two strategies. In my tests the output was always the same. If the eventual output of the recommender will not include items already rated by the user, as pointed out here ( http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E ), AllSimilarItemsCandidateItemsStrategy should be equivalent to AllUnknownItemsCandidateItemsStrategy, shouldn't it? AllSimilarItems returns all items that are similar to any item that the user already knows. AllUnknownItems simply returns all items that the user has not interacted with yet. These are two different things, although they might overlap in some scenarios. Best, Sebastian

Thanks. On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org wrote: Hi Juan, that is a good catch. CandidateItemsStrategy is the right place to implement this. Maybe we should simply extend its interface to add a parameter that says whether to keep or remove the current user's items? We could even do this in the abstract base class then. --sebastian

On 03/05/2014 10:42 AM, Juan José Ramos wrote: In case somebody runs into the same situation, the key seems to be in the CandidateItemsStrategy being passed to the constructor of GenericItemBasedRecommender. Looking into the code, if no CandidateItemsStrategy is specified in the constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used, and as the documentation says, the doGetCandidateItems method returns all items that have not been rated by the user and that were preferred by another user that has preferred at least one item that the current user has preferred too. So a different CandidateItemsStrategy needs to be passed. For this problem, it seems to me that AllSimilarItemsCandidateItemsStrategy and AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody know where to find some documentation about the different CandidateItemsStrategy implementations? Based on the name I would say that: 1) AllSimilarItemsCandidateItemsStrategy returns all similar items regardless of whether they have already been rated by someone or not.
2) AllUnknownItemsCandidateItemsStrategy returns all similar items that have not been rated by anyone yet. Does anybody know if it works like that? Thanks. On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com wrote: First thing is that I know this requirement would not make sense in a CF recommender. In my case, I am trying to use Mahout to create something closer to a content-based recommender. In particular, I am pre-computing a similarity matrix between all the documents (items) of my catalogue and using that matrix as the ItemSimilarity for my item-based recommender. So, when a user rates a document, how could I make the recommender output documents similar to the ones the user has already rated, even if no other user in the system has rated them yet? Is that even possible in the first place? Thanks a lot.
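Pulling the thread together, here is a sketch of the setup Juan describes: a precomputed GenericItemSimilarity standing in for the content-based matrix, and a non-default candidate strategy passed to the recommender so that items nobody has rated can still surface. The similarity values are placeholders.

import java.util.Arrays;
import org.apache.mahout.cf.taste.impl.recommender.AllUnknownItemsCandidateItemsStrategy;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.GenericItemSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class ContentBasedSetup {
  public static Recommender build(DataModel model) throws Exception {
    // Precomputed document-to-document similarities stand in for the content-based matrix.
    ItemSimilarity similarity = new GenericItemSimilarity(Arrays.asList(
        new GenericItemSimilarity.ItemItemSimilarity(1L, 2L, 0.1),
        new GenericItemSimilarity.ItemItemSimilarity(1L, 3L, 0.2)));
    // Candidates are all items the user has not rated, whether or not anyone else rated them.
    AllUnknownItemsCandidateItemsStrategy strategy = new AllUnknownItemsCandidateItemsStrategy();
    return new GenericItemBasedRecommender(model, similarity, strategy, strategy);
  }
}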
Re: Recommend items not rated by any user
Juan, You got me wrong. AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user and for which the similarity metric returns a non-NaN similarity value with at least one of the items preferred by the user. So, it does not simply return all items that have not been rated by the user. For example, if there is an item X which has not been rated by the user and the similarity values between X and all of the items rated (preferred) by the user are NaN, then X will not be returned by AllSimilarItemsCandidateItemsStrategy, but it will be returned by AllUnknownItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote: Hi Tevfik, Thanks for the response. I think what you say contradicts what Sebastian pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns all items that have not been rated by the user, what would AllUnknownItemsCandidateItemsStrategy return?
Re: Recommend items not rated by any user
If the similarities between item 5 and two of the items user 1 preferred are not NaN, then it will be returned; that is what I'm saying. If the similarities were all NaN then it would not be returned. But, you might wonder, if all similarities between an item and the user's items are NaN, then even with AllUnknownItemsCandidateItemsStrategy the item probably will not be recommended (its preference estimate will be NaN). So both strategies seem to be effectively the same; I don't know what the implementers had in mind when designing AllSimilarItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote: @Tevfik, running this recommender:

GenericItemBasedRecommender itemRecommender =
    new GenericItemBasedRecommender(dataModel, itemSimilarity,
        new AllSimilarItemsCandidateItemsStrategy(itemSimilarity),
        new AllSimilarItemsCandidateItemsStrategy(itemSimilarity));

With this dataModel:

1,1,1.0
1,2,2.0
1,3,1.0
1,4,2.0
2,1,1.0
2,2,4.0

And these similarities:

1,2,0.1
1,3,0.2
1,4,0.3
2,3,0.5
3,4,0.5
5,1,0.2
5,2,1.0

Returns item 5 for user 1. So item 5 has not been preferred by user 1, and the similarities between item 5 and two of the items user 1 preferred are not NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So, I'm truly sorry to insist on this, but I still really do not get the difference.
Re: Recommend items not rated by any user
Hi Sebastian, But in order not to select items that are not similar to at least one of the items the user interacted with, you have to compute the similarity with all of the user's items (which is the main task in estimating the preference of an item in the item-based method). So it seems to me that AllSimilarItemsCandidateItemsStrategy does not bring much advantage over AllUnknownItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote: So both strategies seem to be effectively the same; I don't know what the implementers had in mind when designing AllSimilarItemsCandidateItemsStrategy. It can take a long time to estimate preferences for all items a user doesn't know, especially if you have a lot of items. Traditional item-based recommenders will not recommend any item that is not similar to at least one of the items the user interacted with, so AllSimilarItemsCandidateItemsStrategy already selects the maximum set of items that could potentially be recommended to the user. --sebastian
Re: Recommend items not rated by any user
It can even make things worse in SVD-based algorithms, for which preference estimation is very fast.

On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Sebastian, But in order not to select items that are not similar to at least one of the items the user interacted with, you have to compute the similarity with all of the user's items (which is the main task in estimating the preference of an item in the item-based method). So it seems to me that AllSimilarItemsCandidateItemsStrategy does not bring much advantage over AllUnknownItemsCandidateItemsStrategy.
Re: Why some userId has no recommendations?
In some cases users might not get any recommendations, and there can be different reasons for this. In your case, only item 107 can be recommended to user 5 (since user 5 rated all other items). Item 107 got two ratings, both 5. In this case the Pearson correlation between this item and the others is undefined. I think this is the reason why user 5 is not getting any recommendations. Tevfik

On Thu, Feb 13, 2014 at 9:08 AM, jobin wilson jobinwil...@gmail.com wrote: Hi Jiang, Mahout's user-based recommender makes use of the similarity of a user with other users to arrive at what to recommend to him. In this specific case, it uses the Pearson correlation coefficient calculated from the user ratings as a similarity measure to form a neighborhood. It then estimates ratings for unpicked items based on user similarity and the ratings provided by neighbors. A short answer is that whether a user gets any recommendations depends entirely on the training data that you provide as input to the model. In this case, if you expect 107 as a recommendation for user 5, there aren't enough ratings available for 107 in user 5's neighborhood. If you modify your data as below, you will get recommendations for user 5 (just add a dummy rating 2,107,5). I have included a code snippet which demonstrates this idea of user similarity and neighborhood. Hope this helps.

*Code:*

public class Test {
  public static void main(String[] args) throws Exception {
    String inFile = "F:\\hadoop\\data\\recsysinput.txt";
    DataModel dataModel = new FileDataModel(new File(inFile));
    UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
    UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(100, userSimilarity, dataModel);
    Recommender recommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
    for (int i = 1; i <= 5; i++) {
      List<RecommendedItem> recommendations = recommender.recommend(i, 1);
      for (int j = 1; j <= 5; j++) {
        System.out.println("Similarity between user:" + i + " and user:" + j + " = "
            + userSimilarity.userSimilarity(i, j));
      }
      System.out.println("recommend for user:" + i + " Neighborhood Size:"
          + userNeighborhood.getUserNeighborhood(i).length);
      for (RecommendedItem recommendation : recommendations) {
        System.out.println(recommendation);
      }
    }
  }
}

*Input:*

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2
2,102,2.5
2,103,5
2,104,2
2,107,5
3,101,2.5
3,104,4
3,105,4.5
3,107,5
4,101,5
4,103,3
4,104,4.5
4,106,4
5,101,4
5,102,3
5,103,2
5,104,4
5,105,3.5
5,106,4

*Output:*

Similarity between user:1 and user:1= 1.0
Similarity between user:1 and user:2= -0.7642652566278799
Similarity between user:1 and user:3= NaN
Similarity between user:1 and user:4= 0.9998
Similarity between user:1 and user:5= 0.944911182523068
recommend for user:1 Neighborhood Size:3
RecommendedItem[item:104, value:5.0]
Similarity between user:2 and user:1= -0.7642652566278799
Similarity between user:2 and user:2= 0.9998
Similarity between user:2 and user:3= 0.8029550685469666
Similarity between user:2 and user:4= -0.9707253433941515
Similarity between user:2 and user:5= -0.9393939393939394
recommend for user:2 Neighborhood Size:4
RecommendedItem[item:106, value:4.0]
Similarity between user:3 and user:1= NaN
Similarity between user:3 and user:2= 0.8029550685469666
Similarity between user:3 and user:3= 1.0
Similarity between user:3 and user:4= -1.0
Similarity between user:3 and user:5= -0.6933752452815484
recommend for user:3 Neighborhood Size:3
RecommendedItem[item:106, value:4.0]
Similarity between user:4 and user:1= 0.9998
Similarity between user:4 and user:2= -0.9707253433941515
Similarity between user:4 and user:3= -1.0
...
Re: Why some userId has no recommendations?
You are right, Koobas. My answer was based on the assumption that item-based NN is used (but I now notice that user-based NN is being used), so my answer is not correct, sorry. At the moment I cannot see the exact reason why user 5 is not getting any recommendations; as you said, user 5 should get 107.

On Thu, Feb 13, 2014 at 3:21 PM, Koobas koo...@gmail.com wrote: User 3 gave a recommendation to item 107. User 5 did not rate 107.

On Thu, Feb 13, 2014 at 1:57 AM, Suresh M suresh4mas...@gmail.com wrote: user 5 has given ratings for all 5 books, so there will be no recommendations for him.

On 12 February 2014 08:55, jiangwen jiang jiangwen...@gmail.com wrote: Hi, all: I am trying to use the Mahout API to make recommendations, but I find that some user ids get no recommendations. Why? Here is my code:

public static void main(String[] args) throws Exception {
  String inFile = "F:\\hadoop\\data\\recsysinput.txt";
  DataModel dataModel = new FileDataModel(new File(inFile));
  UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(dataModel);
  UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(100, userSimilarity, dataModel);
  Recommender recommender = new GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);
  for (int i = 1; i <= 5; i++) {
    List<RecommendedItem> recommendations = recommender.recommend(i, 1);
    System.out.println("recommend for user:" + i);
    for (RecommendedItem recommendation : recommendations) {
      System.out.println(recommendation);
    }
  }
}

input data (recsysinput.txt):

1,101,5.0
1,102,3.0
1,103,2.5
2,101,2
2,102,2.5
2,103,5
2,104,2
3,101,2.5
3,104,4
3,105,4.5
3,107,5
4,101,5
4,103,3
4,104,4.5
4,106,4
5,101,4
5,102,3
5,103,2
5,104,4
5,105,3.5
5,106,4

output:

recommend for user:1
RecommendedItem[item:104, value:5.0]
recommend for user:2
RecommendedItem[item:106, value:4.0]
recommend for user:3
RecommendedItem[item:106, value:4.0]
recommend for user:4
RecommendedItem[item:105, value:5.0]
recommend for user:5

User 5 has no recommendations. Is that right? Can I get some recommendations for user 5, even if the recommendation results are not good enough? thanks Regards!
Re: Popularity of recommender items
Well, I think what you are suggesting is to define popularity as being similar to other items. In this way the most popular items will be those which are most similar to all other items, like the centroids in k-means. I would first check the correlation between this definition and the standard one (that is, defining popularity as having the highest number of ratings), but my intuition is that they are different things. For example, an item might lie at the center of the similarity space but not be a popular item. However, there might still be some correlation; it would be interesting to check. Hope it helps.

On Wed, Feb 5, 2014 at 3:27 AM, Pat Ferrel p...@occamsmachete.com wrote: Trying to come up with a relative measure of popularity for items in a recommender, something that could be used to rank items. The user-item preference matrix would be the obvious thought: just add up the number of preferences per item. Maybe transpose the preference matrix (the temp DRM created by the recommender), then for each row vector (now that a row = item) grab the number of non-zero preferences. This corresponds to the number of preferences, and would give one measure of popularity. In the case where the items are not boolean you'd sum the weights. However it might be a better idea to look at the item-item similarity matrix. It doesn't need to be transposed and contains the important similarities--as calculated by LLR, for example. Here similarity means similarity in which users preferred an item, so summing the non-zero weights would give perhaps an even better relative popularity measure. For the same reason, clustering the similarity matrix would yield important clusters. Anyone have intuition about this? I started to think about this because transposing the user-item matrix seems to yield a format that cannot be sent directly into clustering.
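For the standard definition of popularity (number of ratings), the in-memory API already exposes what is needed. Here is a small sketch against the Taste DataModel interface that could serve as the baseline to correlate the similarity-based definition against; class and method names are illustrative.

import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;

public class ItemPopularity {
  // Standard definition: popularity = number of users with a preference for the item.
  public static List<long[]> rankByPreferenceCount(DataModel model) throws Exception {
    List<long[]> counts = new ArrayList<>();
    LongPrimitiveIterator items = model.getItemIDs();
    while (items.hasNext()) {
      long itemID = items.nextLong();
      counts.add(new long[] {itemID, model.getNumUsersWithPreferenceFor(itemID)});
    }
    counts.sort((a, b) -> Long.compare(b[1], a[1])); // most preferred first
    return counts;
  }
}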
Re: generic latent variable recommender question
Thanks for the answers. I actually worked on a similar issue, increasing the diversity of top-N lists (http://link.springer.com/article/10.1007%2Fs10844-013-0252-9). Clustering-based approaches produce good results and are very fast compared to some optimization-based techniques. It also turned out that introducing randomization (such as choosing 20 random items from among the top 100) might decrease diversity, namely when the diversity of the top-N lists is already better than the diversity of a set of random items, which can sometimes be the case.

On Sun, Jan 26, 2014 at 8:49 PM, Ted Dunning ted.dunn...@gmail.com wrote: On Sun, Jan 26, 2014 at 9:36 AM, Pat Ferrel p...@occamsmachete.com wrote: I think I'll leave dithering out until it goes live because it would seem to make the eyeball test easier. I doubt all these experiments will survive. With anti-flood, if you turn the epsilon parameter to 1 (which makes log(epsilon) = 0), then no re-ordering is done. I like knobs that go to 11, but also have an off position.
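Since dithering comes up again below, here is one common rank-based formulation as a sketch. The exact scheme Ted and Pat use is not spelled out in this thread, so the formula and parameter names here are assumptions; note how epsilon = 1 turns the noise off, matching the "knob with an off position" idea.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;

public class Dithering {
  // Rank-based dithering: replace each item's score with -log(rank) plus Gaussian
  // noise whose scale is log(epsilon). epsilon = 1 gives log(epsilon) = 0, i.e.
  // no noise and no re-ordering; larger epsilon shuffles the tail more than the head.
  public static <T> List<T> dither(List<T> ranked, double epsilon, Random rng) {
    double sd = Math.log(epsilon);
    List<double[]> keyed = new ArrayList<>();
    for (int rank = 1; rank <= ranked.size(); rank++) {
      keyed.add(new double[] {rank, -Math.log(rank) + rng.nextGaussian() * sd});
    }
    keyed.sort(Comparator.comparingDouble((double[] k) -> k[1]).reversed());
    List<T> out = new ArrayList<>();
    for (double[] k : keyed) {
      out.add(ranked.get((int) k[0] - 1));
    }
    return out;
  }
}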
Re: generic latent variable recommender question
Case 1 is fine. In case 2, I don't think that a dot product (without normalization) will yield a meaningful distance measure; cosine distance or a Pearson correlation would be better. The situation is similar to Latent Semantic Indexing, in which documents are represented by their low-rank approximations and the similarities between them (that is, the approximations) are computed using cosine similarity. There is no need for any normalization in case 1, since the values in the feature vectors are formed to approximate the rating values.

On Sat, Jan 25, 2014 at 5:08 AM, Koobas koo...@gmail.com wrote: A generic latent variable recommender question. I passed the user-item matrix through a low rank approximation, with either something like ALS or SVD, and now I have the feature vectors for all users and all items. Case 1: I want to recommend items to a user. I compute the dot product of the user's feature vector with the feature vectors of all the items, eliminate the ones that the user already has, and find the largest value among the others, right? Case 2: I want to find similar items for an item. Should I compute the dot product of the item's feature vector against the feature vectors of all the other items? Or should I compute the ANGLE between each pair of feature vectors, i.e., compute the cosine similarity, i.e., normalize the vectors before computing the dot products? If "yes" for case 2, is that something I should also do for case 1?
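A compact sketch of the two cases in plain Java, assuming the factor vectors come from an ALS or SVD factorization (class and method names are illustrative):

public class LatentScores {
  // Case 1: predicted preference is the plain dot product of user and item factors,
  // because the factorization was fit so that this product approximates the rating.
  static double predict(double[] user, double[] item) {
    double dot = 0;
    for (int i = 0; i < user.length; i++) {
      dot += user[i] * item[i];
    }
    return dot;
  }

  // Case 2: item-item similarity is the cosine of the angle between item factors,
  // i.e. the dot product after normalization, so vector length no longer dominates.
  static double cosine(double[] a, double[] b) {
    double dot = 0, na = 0, nb = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
  }
}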
Re: generic latent variable recommender question
Hi Ted, Could you explain what you mean by a dithering step and an anti-flood step? By dithering I guess you mean adding some sort of noise in order not to show the same results every time, but I have no clue about the anti-flood step. Tevfik

On Sat, Jan 25, 2014 at 11:05 PM, Koobas koo...@gmail.com wrote: On Sat, Jan 25, 2014 at 3:51 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Case 1 is fine. In case 2, I don't think that a dot product (without normalization) will yield a meaningful distance measure; cosine distance or a Pearson correlation would be better. The situation is similar to Latent Semantic Indexing, in which documents are represented by their low-rank approximations and the similarities between them (that is, the approximations) are computed using cosine similarity. There is no need for any normalization in case 1, since the values in the feature vectors are formed to approximate the rating values. That's exactly what I was thinking. Thanks for your reply. On Sat, Jan 25, 2014 at 5:08 AM, Koobas koo...@gmail.com wrote: A generic latent variable recommender question. I passed the user-item matrix through a low rank approximation, with either something like ALS or SVD, and now I have the feature vectors for all users and all items. Case 1: I want to recommend items to a user. I compute the dot product of the user's feature vector with the feature vectors of all the items, eliminate the ones that the user already has, and find the largest value among the others, right? Case 2: I want to find similar items for an item. Should I compute the dot product of the item's feature vector against the feature vectors of all the other items? Or should I compute the ANGLE between each pair of feature vectors, i.e., compute the cosine similarity, i.e., normalize the vectors before computing the dot products? If "yes" for case 2, is that something I should also do for case 1?
Re: Hadoop implementation of ParallelSGDFactorizer
Thanks Sebastian.

On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: IIRC the algorithm behind ParallelSGDFactorizer needs shared memory, which is not available in a shared-nothing environment.

On 07.09.2013 19:08, Tevfik Aytekin wrote: Hi, There seems to be no Hadoop implementation of ParallelSGDFactorizer, while ALSWRFactorizer has one. ParallelSGDFactorizer (since it is based on stochastic gradient descent) is much faster than ALSWRFactorizer. I don't know Hadoop much, but it seems to me that a Hadoop implementation of ParallelSGDFactorizer would also be much faster than the Hadoop implementation of ALSWRFactorizer. Is there a specific reason why there is no Hadoop implementation of ParallelSGDFactorizer? Is it because Hadoop operations are already so slow that the slowness of ALSWRFactorizer does not matter much? Or is it simply because nobody has implemented it yet? Thanks Tevfik
Hadoop implementation of ParallelSGDFactorizer
Hi, There seems to be no Hadoop implementation of ParallelSGDFactorizer, while ALSWRFactorizer has one. ParallelSGDFactorizer (since it is based on stochastic gradient descent) is much faster than ALSWRFactorizer. I don't know Hadoop much, but it seems to me that a Hadoop implementation of ParallelSGDFactorizer would also be much faster than the Hadoop implementation of ALSWRFactorizer. Is there a specific reason why there is no Hadoop implementation of ParallelSGDFactorizer? Is it because Hadoop operations are already so slow that the slowness of ALSWRFactorizer does not matter much? Or is it simply because nobody has implemented it yet? Thanks Tevfik
Re: Hadoop implementation of ParallelSGDFactorizer
Sebastian, what is IIRC?

On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter ssc.o...@googlemail.com wrote: IIRC the algorithm behind ParallelSGDFactorizer needs shared memory, which is not available in a shared-nothing environment.

On 07.09.2013 19:08, Tevfik Aytekin wrote: Hi, There seems to be no Hadoop implementation of ParallelSGDFactorizer, while ALSWRFactorizer has one. ParallelSGDFactorizer (since it is based on stochastic gradient descent) is much faster than ALSWRFactorizer. I don't know Hadoop much, but it seems to me that a Hadoop implementation of ParallelSGDFactorizer would also be much faster than the Hadoop implementation of ALSWRFactorizer. Is there a specific reason why there is no Hadoop implementation of ParallelSGDFactorizer? Is it because Hadoop operations are already so slow that the slowness of ALSWRFactorizer does not matter much? Or is it simply because nobody has implemented it yet? Thanks Tevfik
Re: Which database should I use with Mahout
Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The simplest solution is a file. On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: Hi, I would like to use Mahout to make recommendations on my web site. Since the data is going to be big, hopefully, I plan to use hadoop implementations of the recommender algorithms. I'm currently storing the data in mysql. Should I continue with it or should I switch to a nosql database such as mongodb or something else? Thanks Ahmet
Re: Which database should I use with Mahout
ok, got it, thanks. On Sun, May 19, 2013 at 8:20 PM, Sean Owen sro...@gmail.com wrote: I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from, because it is read just once, serially. A file is just as fine as a fancy database. In fact it's probably easier and faster. On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Thanks Sean, but I could not get your answer. Can you please explain it again? On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote: It doesn't matter, in the sense that it is never going to be fast enough for real-time at any reasonable scale if actually run off a database directly. One operation results in thousands of queries. It's going to read data into memory anyway and cache it there. So, whatever is easiest for you. The simplest solution is a file. On Sun, May 19, 2013 at 9:52 AM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: Hi, I would like to use Mahout to make recommendations on my web site. Since the data is going to be big, hopefully, I plan to use hadoop implementations of the recommender algorithms. I'm currently storing the data in mysql. Should I continue with it or should I switch to a nosql database such as mongodb or something else? Thanks Ahmet
Re: Which database should I use with Mahout
Hi Manuel, But if one uses matrix factorization and stores the user and item factors in memory, then there will be no database access during recommendation. I thought the original question was where to store the data and how to feed it to Hadoop.

On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt manuel.blechschm...@gmx.de wrote: Hi Tevfik, one request to the recommender could become more than 1,000 queries to the database, depending on which recommender you use and the amount of preferences for the given user. The problem is not whether you are using SQL, NoSQL, or any other query language. The problem is the latency of the answers. An average TCP packet in the same data center takes 500 µs; a main memory reference takes 0.1 µs. This means that the main memory of your Java process can be accessed 5000 times faster than any other process, such as a database connected via TCP/IP. http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html Here you can see a screenshot showing that database communication is by far (99%) the slowest component of a recommender request: https://source.apaxo.de/MahoutDatabaseLowPerformance.png If you do not want to cache your data in your Java process, you can use a completely in-memory database technology like SAP HANA http://www.saphana.com/welcome or EXASOL http://www.exasol.com/. Nevertheless, if you are using these you do not need Mahout anymore. An architecture of a Mahout system can be seen here: https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png Hope that helps Manuel

On 19.05.2013 at 19:20, Sean Owen wrote: I'm first saying that you really don't want to use the database as a data model directly. It is far too slow. Instead you want to use a data model implementation that reads all of the data, once, serially, into memory. And in that case, it makes no difference where the data is being read from, because it is read just once, serially. A file is just as fine as a fancy database. In fact it's probably easier and faster.

-- Manuel Blechschmidt M.Sc. IT Systems Engineering Dortustr. 57 14467 Potsdam Mobil: 0173/6322621 Twitter: http://twitter.com/Manuel_B
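Summarizing the practical advice in this thread as a sketch (the file path is a placeholder): load everything into memory once, and only involve the database through a caching wrapper if the data must stay there.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class DataModelChoice {
  public static DataModel load() throws Exception {
    // Simplest and usually fastest: read the whole dataset once, serially, into memory.
    // If the data must stay in MySQL, the usual alternative is to wrap a JDBCDataModel
    // in ReloadFromJDBCDataModel so queries happen once per reload, not per request.
    return new FileDataModel(new File("ratings.csv")); // placeholder path
  }
}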
Re: parallelALS and RMSE TEST
This problem is called the one-class classification problem. In the domain of collaborative filtering it is called one-class collaborative filtering (since what you have are only positive preferences). You may search the web with these keywords to find papers providing solutions. I'm not sure whether Mahout has algorithms for one-class collaborative filtering.

On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote: ALS-WR weights the error on each term differently, so the average error doesn't really have meaning here, even if you are comparing the difference with 1. I think you will need to fall back to mean average precision or something.

On Mon, May 6, 2013 at 11:24 AM, William icswilliam2...@gmail.com wrote: Sean Owen srowen at gmail.com writes: If you have no ratings, how are you using RMSE? This typically measures error in reconstructing ratings. I think you are probably measuring something meaningless. I suppose the rating of seen movies is 1. Is that right? If I use collaborative filtering with ALS-WR to get some recommendations, must I have a real rating matrix?
Re: parallelALS and RMSE TEST
Hi Sean, Aren't boolean preferences supported in the context of memory-based recommendation algorithms in Mahout? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, data which consists of users and the movies they have seen)?

On Mon, May 6, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote: Yes, it goes by the name 'boolean prefs' in the project, since target variables don't have values -- they just exist or don't. So yes, it's certainly supported, but the question here is how to evaluate the output.

On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: This problem is called the one-class classification problem. In the domain of collaborative filtering it is called one-class collaborative filtering (since what you have are only positive preferences). You may search the web with these keywords to find papers providing solutions. I'm not sure whether Mahout has algorithms for one-class collaborative filtering.
Re: parallelALS and RMSE TEST
But the data under consideration here is not 0/1 data; it contains only 1's.

On Mon, May 6, 2013 at 11:29 PM, Sean Owen sro...@gmail.com wrote: Parallel ALS is exactly an example of where you can use matrix factorization for 0/1 data.

On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: Hi Sean, Aren't boolean preferences supported in the context of memory-based recommendation algorithms in Mahout? Are there matrix factorization algorithms in Mahout which can work with this kind of data (that is, data which consists of users and the movies they have seen)?
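For completeness, here is a sketch of the 'boolean prefs' route in the in-memory API. The file name is a placeholder; LogLikelihoodSimilarity is one common choice for data that has only 1's, since it ignores rating values entirely.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class BooleanPrefExample {
  public static void main(String[] args) throws Exception {
    // "user,item" pairs with no rating value: every observed pair just means "seen".
    DataModel raw = new FileDataModel(new File("seen.csv")); // placeholder path
    DataModel model = new GenericBooleanPrefDataModel(GenericBooleanPrefDataModel.toDataMap(raw));
    UserSimilarity similarity = new LogLikelihoodSimilarity(model); // ignores rating values
    Recommender recommender = new GenericBooleanPrefUserBasedRecommender(
        model, new NearestNUserNeighborhood(10, similarity, model), similarity);
    System.out.println(recommender.recommend(1L, 5));
  }
}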
Re: User Based recommender - strange behaviour of Pearson
You are correct: since centeredSumX2 equals zero, the Pearson similarity will be undefined (because of a division by zero in the Pearson formula). If you do not center the data, you get cosine similarity, which is another common similarity metric used in recommender systems, and it will not be undefined when a user has the same rating for all items.

On Tue, Apr 9, 2013 at 6:19 PM, yamo93 yam...@gmail.com wrote: Hi, I use a user-based recommender. I've just discovered a strange behaviour of Pearson when a user has the same rating for all rated items: the system doesn't recommend anything for this user in that case. My attempt at an explanation: it is due to the centering of the data (centeredSumX2 equals 0 in this case). Is this correct? Is using UncenteredCosineSimilarity as a workaround a good idea? Thanks, Yann.
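A small sketch of the workaround (the file name and user ids are placeholders): the same model queried with Pearson, which centers the data, and with uncentered cosine, which does not.

import java.io.File;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.impl.similarity.UncenteredCosineSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class ConstantRaterDemo {
  public static void main(String[] args) throws Exception {
    // Suppose user 1 rated every item 3.0: centering makes the vector all zeros,
    // so the Pearson denominator is 0 and the similarity comes out NaN.
    DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder path
    UserSimilarity pearson = new PearsonCorrelationSimilarity(model);
    System.out.println(pearson.userSimilarity(1L, 2L)); // NaN for a constant rater
    UserSimilarity cosine = new UncenteredCosineSimilarity(model); // skips centering
    System.out.println(cosine.userSimilarity(1L, 2L)); // defined as long as norms are non-zero
  }
}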
Re: Problems with Mahout's RecommenderIRStatsEvaluator
I think it is better to choose the ratings of the test user in a random fashion.

On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably quite too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing?

On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: But modeling a user only by his/her low ratings can be problematic, since people generally are more precise (I believe) in their high ratings. Another problem is that recommender algorithms in general first mean-normalize the ratings for each user. Suppose that we have the following ratings of 3 people (A, B, and C) on 5 items:

A's ratings: 1 2 3 4 5
B's ratings: 1 3 5 2 4
C's ratings: 1 2 3 4 5

Suppose that A is the test user. Now if we put only the low ratings of A (1, 2, and 3) into the training set and mean-normalize the ratings, then A will be more similar to B than to C, which is not true.

From: Sean Owen sro...@gmail.com To: Mahout User List user@mahout.apache.org; Ahmet Ylmaz ahmetyilmazefe...@yahoo.com Sent: Saturday, February 16, 2013 8:41 PM Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator

No, this is not a problem. Yes, it builds a model for each user, which takes a long time. It's accurate, but time-consuming; it's meant for small data. You could rewrite your own test to hold out data for all test users at once. That's what I did when I rewrote a lot of this, just because it was more useful to have larger tests. There are several ways to choose the test data. One common way is by time, but there is no time information here by default. The problem is that, for example, recent ratings may be low -- or at least not high ratings. But the evaluation is of course asking the recommender for items that are predicted to be highly rated. Random selection has the same problem. Choosing by rating at least makes the test coherent. It does bias the training set, but the test set is supposed to be small. There is no way to actually know, a priori, what the top recommendations are. You have no information to evaluate most recommendations. This makes a precision/recall test fairly uninformative in practice. Still, it's better than nothing and commonly understood. While precision/recall won't be high on tests like this, because of this, I don't get these values for MovieLens data on any normal algo; but you may be, if choosing an algorithm or parameters that don't work well.

On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Ylmaz ahmetyilmazefe...@yahoo.com wrote: Hi, I have looked at the internals of Mahout's RecommenderIRStatsEvaluator code. I think that there are two important problems here. According to my understanding, the experimental protocol used in this code is something like this: It takes away a certain percentage of users as test users. For each test user it builds a training set consisting of the ratings given by all other users plus the ratings of the test user which are below the relevanceThreshold. It then builds a model, makes a recommendation to the test user, and finds the intersection between this recommendation list and the items which are rated above the relevanceThreshold by the test user. It then calculates precision and recall in the usual way. Problems: 1. (mild) It builds a model for every test user, which can take a lot of time. 2. (severe) Only the ratings (of the test user) which are below the relevanceThreshold are put into the training set. This means that the algorithm only knows the preferences of the test user for items which s/he doesn't like. This is not a good representation of the user's ratings. Moreover, when I ran this evaluator on the MovieLens 1M data, the precision and recall turned out to be, respectively, 0.011534185658699288 and 0.007905982905982885, and the run took about 13 minutes on my Intel Core i3. (I used user-based recommendation with k=2.) Although I know that it is not OK to judge the performance of a recommendation algorithm by looking at these absolute precision and recall values, still these numbers seem too low to me, which might be the result of the second problem I mentioned above. Am I missing something? Thanks Ahmet
Re: Problems with Mahout's RecommenderIRStatsEvaluator
No, rating prediction is clearly a supervised ML problem.

On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote: This is a good answer for evaluation of supervised ML, but this is unsupervised. Choosing randomly is choosing the 'right answers' randomly, and that's plainly problematic.

On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: I think it is better to choose the ratings of the test user in a random fashion.

On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote: Yes. But: the test sample is small. Using 40% of your data to test is probably quite too much. My point is that it may be the least-bad thing to do. What test are you proposing instead, and why is it coherent with what you're testing?
Re: Problems with Mahout's RecommenderIRStatsEvaluator
I'm suggesting the second one. In that way, the test user's ratings in the training set will consist of both low- and high-rated items, which prevents the problem pointed out by Ahmet.

On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro...@gmail.com wrote: If you're suggesting that you hold out only high-rated items, and then sample them, then that's what is done already in the code, except without the sampling. The sampling doesn't buy anything that I can see. If you're suggesting holding out a random subset and then throwing away the held-out items with low ratings, then it's also the same idea, except you're randomly throwing away some lower-rated data from both test and train. I don't see what that helps either.

On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote: What I mean is that you can choose ratings randomly and try to recommend the ones above the threshold.