problem in recommender similarity computation (taste)

2015-03-07 Thread Tevfik Aytekin
Hi,

I've noticed a problem in the non-Hadoop (taste) version of the
recommender package. The problem is in the AbstractSimilarity (in
package org.apache.mahout.cf.taste.impl.similarity).

This class is the base class for computing the similarity values
between vectors of users or items. It assumes that the similarity
between the vectors is computed using only the commonly rated
items/users.

Consider the following two vectors:
V1: _, 3, 4, _, 2
V2: 3, 5, _, 2, 4

where _ means no ratings. For these two vectors, the cosine or
Pearson similarity is computed on the following vectors:

3, 2
5, 4

However, if the number of common ratings is small, the similarity
result will be very unreliable. This is indeed the case in practice:
if you run the code on the MovieLens dataset and measure recall, the
results are very bad.

There can be two solutions:
1. Add a parameter n that determines the minimum number of common
ratings needed to compute a similarity; otherwise the method should
return NaN.
2. Compute the similarity using all the ratings. For the two vectors
above, the cosine similarity would then be

(3*5+2*4) / (sqrt(3^2+4^2+2^2) * sqrt(3^2+5^2+2^2+4^2))
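To make the two options above concrete, here is a minimal, framework-free sketch (a hypothetical helper class, not Mahout code): cosineCommon implements option 1 via a minOverlap cutoff, and cosineAll implements option 2, with NaN marking a missing rating.

```java
public class CosineVariants {
    // Option 1: cosine over only the co-rated dimensions; returns NaN when
    // fewer than minOverlap common ratings exist.
    static double cosineCommon(double[] a, double[] b, int minOverlap) {
        double dot = 0, na = 0, nb = 0;
        int overlap = 0;
        for (int i = 0; i < a.length; i++) {
            if (!Double.isNaN(a[i]) && !Double.isNaN(b[i])) {
                dot += a[i] * b[i];
                na += a[i] * a[i];
                nb += b[i] * b[i];
                overlap++;
            }
        }
        if (overlap < minOverlap) return Double.NaN;
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Option 2: dot product over the co-rated dimensions, but the norms run
    // over ALL of each vector's own ratings.
    static double cosineAll(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            boolean hasA = !Double.isNaN(a[i]), hasB = !Double.isNaN(b[i]);
            if (hasA && hasB) dot += a[i] * b[i];
            if (hasA) na += a[i] * a[i];
            if (hasB) nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        double NA = Double.NaN;
        double[] v1 = {NA, 3, 4, NA, 2};
        double[] v2 = {3, 5, NA, 2, 4};
        System.out.printf("common-only: %.4f%n", cosineCommon(v1, v2, 2)); // 0.9962
        System.out.printf("all ratings: %.4f%n", cosineAll(v1, v2));       // 0.5812
        System.out.println(cosineCommon(v1, v2, 3));                       // NaN
    }
}
```

With only two common ratings the common-only cosine is almost 1, while the full-vector cosine is about 0.58, which illustrates how small overlaps inflate similarity.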

Tevfik


Re: Can user id and item id be negative integers?

2014-08-09 Thread Tevfik Aytekin
AbstractIDMigrator is for being able to use String IDs (it converts
Strings to longs).
IDs are stored as long values, so there should not be any problem with
negative IDs, but in practice I have not worked with negative IDs
before.
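For what it's worth, the sign behavior Peng describes below is easy to confirm: packing the first eight MD5 bytes into a long sets the sign bit whenever the first digest byte is 0x80 or above, so roughly half of all string IDs map to negative longs. A standalone sketch of the same folding scheme (plain JDK only, not the Mahout class itself):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class HashSignDemo {
    // Same folding as AbstractIDMigrator.hash(): pack the first 8 MD5 bytes
    // into one long, most significant byte first.
    static long hash(String value) {
        try {
            byte[] md5 = MessageDigest.getInstance("MD5")
                    .digest(value.getBytes(StandardCharsets.UTF_8));
            long h = 0L;
            for (int i = 0; i < 8; i++) {
                h = h << 8 | (md5[i] & 0x00FFL);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is always available
        }
    }

    public static void main(String[] args) {
        // MD5("") starts with byte 0xd4, so the sign bit is set -> negative
        System.out.println(hash(""));  // negative
        // MD5("a") starts with byte 0x0c -> positive
        System.out.println(hash("a")); // positive
    }
}
```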

Tevfik

On Wed, Aug 6, 2014 at 3:51 AM, Peng Zhang pzhang.x...@gmail.com wrote:
 Hi,

 Does this support the possibility that user/item id can be negative?

 I am reading through the source code of 
 org.apache.mahout.cf.taste.impl.model.AbstractIDMigrator. The hash() function 
 is trying to convert a string id to a long id like this. It’s quite possible 
 that the long id returned is a negative one, when the leading bit is 1:)

 protected final long hash(String value) {
   byte[] md5hash;
   synchronized (md5Digest) {
     md5hash = md5Digest.digest(value.getBytes(Charsets.UTF_8));
     md5Digest.reset();
   }
   long hash = 0L;
   for (int i = 0; i < 8; i++) {
     hash = hash << 8 | md5hash[i] & 0x00FFL;
   }
   return hash;
 }


 Hi Ted,
 I am running the in memory version of GenericItemBasedRecommender and 
 SVDRecommender, i.e. I am using them in my Java code.


 Hi Pat,
 Not all user id are negative. Input file sample:
 ...
 -1250,6929,1
 -1250,7059,1
 -1250,7654,1
 -1250,8094,1
 -1250,9486,1
 -1250,9563,3
 10018000,11080,1
 10018000,11176,1
 10018000,11196,1
 10018000,12220,1
 10018000,12447,1
 10018000,13213,1
 ...

 Item based recommender output sample:
 User,Brand,Scoring
 -1250,12352,5.0
 -1250,14261,5.0
 -1250,15934,4.309238
 -1250,16463,3.0
 -1250,3627,1.0
 1025250,29099,1.0
 1025250,18741,1.0
 1025250,14261,1.0
 …

 SVD recommender output sample:
 User,Brand,Scoring
 -1250,3627,3.9108906
 -1250,27791,3.8262475
 -1250,251,3.744943
 -1250,20979,3.5778444
 -1250,14482,3.5494242
 1025250,27791,2.2692947
 1025250,251,1.9651389
 1025250,14482,1.9196383
 1025250,12220,1.9153352
 ...


 Thank you,

 Peng Zhang
 M: +86 186-1658-7856
 pzhang.x...@gmail.com





 On Aug 6, 2014, at 7:26 AM, Pat Ferrel p...@occamsmachete.com wrote:

 Are they ALL negative? Maybe only the non-negatives are working or there are 
 some conditions where negatives work. I certainly wouldn’t count on it 
 because I’ll bet it isn’t working as it should.


 On Aug 5, 2014, at 4:03 PM, Ted Dunning ted.dunn...@gmail.com wrote:

 On Tue, Aug 5, 2014 at 3:21 AM, Peng Zhang pzhang.x...@gmail.com wrote:

 But today I am trying to use negative user id and item id, and they are
 working well with the item recommender and SVD recommender.


 Which programs are you using?




Re: Recommender Systems - RecommenderIRStatsEvaluator

2014-05-20 Thread Tevfik Aytekin
- Is there a way to specify the train and test set like you can with the
*RecommenderEvaluator*?
No, though you can specify the evaluation percentage. This follows
from the logic of the evaluation: it takes away relevant items, makes
recommendations, and then checks whether the relevant items appear in
the top-N lists. It would also be possible (and I think in some ways
better) to first split the data into training and test sets and select
the relevant items from the test set, but this is not how it is
implemented.

- Is it possible to perform k-fold cross-validation with the
*RecommenderIRStatsEvaluator*?
I don't think so.
- How does the default way of evaluation work with
*RecommenderIRStatsEvaluator*?
I tried to explain it above.

I would also like to point out that it is not difficult to write your
own evaluation code for your specific purposes.
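As a starting point for such custom evaluation code, here is a minimal, framework-free sketch (a hypothetical helper class, not Mahout's RecommenderIRStatsEvaluator) of precision@N and recall@N over a held-out set of relevant items:

```java
import java.util.List;
import java.util.Set;

public class IRMetrics {
    // Precision@N: fraction of the top-N recommendations that are relevant.
    static double precisionAtN(List<Long> recommended, Set<Long> relevant, int n) {
        int k = Math.min(n, recommended.size());
        if (k == 0) return 0.0;
        long hits = recommended.subList(0, k).stream()
                .filter(relevant::contains).count();
        return (double) hits / k;
    }

    // Recall@N: fraction of the relevant items that appear in the top-N.
    static double recallAtN(List<Long> recommended, Set<Long> relevant, int n) {
        if (relevant.isEmpty()) return 0.0;
        int k = Math.min(n, recommended.size());
        long hits = recommended.subList(0, k).stream()
                .filter(relevant::contains).count();
        return (double) hits / relevant.size();
    }

    public static void main(String[] args) {
        List<Long> recs = List.of(5L, 9L, 2L, 7L); // ranked recommendation list
        Set<Long> rel = Set.of(9L, 7L, 3L);        // held-out relevant items
        // Only item 9 is a hit in the top 3, so both metrics are 1/3
        System.out.println(precisionAtN(recs, rel, 3));
        System.out.println(recallAtN(recs, rel, 3));
    }
}
```

For k-fold cross-validation you would repeat this per user over k different held-out splits and average the per-user values.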

Tevfik


On Tue, May 20, 2014 at 3:51 PM, Floris Devriendt
florisdevrie...@gmail.com wrote:
 Hey all,

 The *RecommenderEvaluator *has the option to choose how big your training
 set is (and so choosing the test set size as well), but the
 *RecommenderIRStatsEvaluator* does not seem to have this argument in its
 *.evaluate()*-method. That's why I was wondering how the internals of the
 *RecommenderIRStatsEvaluator* work.

 I have the following questions on *RecommenderIRStatsEvaluator*:

- Is there a way to specify the train and test set like you can with the
*RecommenderEvaluator*?
- Is it possible to perform k-fold cross-validation with the
*RecommenderIRStatsEvaluator*?
- How does the default way of evaluation work with
*RecommenderIRStatsEvaluator*?

 If somebody has an answer to any of these questions it would be greatly
 appreciated.

 Kind regards,
 Floris Devriendt


Re: Number of features for ALS

2014-03-27 Thread Tevfik Aytekin
Interesting topic,
Ted, can you give examples of those mathematical assumptions
underpinning ALS which are violated by the real world?

On Thu, Mar 27, 2014 at 3:43 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 How can there be any other practical method?  Essentially all of the
 mathematical assumptions underpinning ALS are violated by the real world.
  Why would any mathematical consideration of the number of features be much
 more than heuristic?

 That said, you can make an information content argument.  You can also make
 the argument that if you take too many features, it doesn't much hurt so
 you should always take as many as you can compute.



 On Thu, Mar 27, 2014 at 6:33 AM, Sebastian Schelter s...@apache.org wrote:

 Hi,

 does anyone know of a principled approach of choosing the number of
 features for ALS (other than cross-validation?)

 --sebastian



Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Sorry there was a typo in the previous paragraph.

If I remember correctly, AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and for which
the similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:
 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik

 On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org wrote:
 On 03/05/2014 01:23 PM, Juan José Ramos wrote:

 Thanks for the reply, Sebastian.

 I am not sure if that should be implemented in the Abstract base class
 though because for
 instance PreferredItemsNeighborhoodCandidateItemsStrategy, by definition,
 it returns the item not rated by the user and rated by somebody else.


 Good point. So we seem to need special implementations.



 Back to my last post, I have been playing around with
 AllSimilarItemsCandidateItemsStrategy
 and AllUnknownItemsCandidateItemsStrategy, and although they both do what
 I
 wanted (recommend items not previously rated by any user), I honestly
 can't
 tell the difference between the two strategies. In my tests the output was
 always the same. If the eventual output of the recommender will not
 include
 items already rated by the user as pointed out here (

 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E),
 AllSimilarItemsCandidateItemsStrategy should be equivalent to
 AllUnkownItemsCandidateItemsStrategy, shouldn't it?


 AllSimilarItems returns all items that are similar to any item that the user
 already knows. AllUnknownItems simply returns all items that the user has
 not interacted with yet.

 These are two different things, although they might overlap in some
 scenarios.

 Best,
 Sebastian




 Thanks.

 On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
 wrote:


 Hi Juan,

 that is a good catch. CandidateItemsStrategy is the right place to

 implement this. Maybe we should simply extend its interface to add a
 parameter that says whether to keep or remove the current users items?


 We could even do this in the abstract base class then.

 --sebastian


 On 03/05/2014 10:42 AM, Juan José Ramos wrote:


 In case somebody runs into the same situation, the key seems to be in
 the
 CandidateItemStrategy being passed to the constructor
 of GenericItemBasedRecommender. Looking into the code, if no
 CandidateItemStrategy is specified in the
 constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used
 and
 as the documentation says, the doGetCandidateItems method: returns all
 items that have not been rated by the user and that were preferred by
 another user that has preferred at least one item that the current user

 has

 preferred too.

 So, a different CandidateItemStrategy needs to be passed. For this

 problem,

 it seems to me that AllSimilarItemsCandidateItemsStrategy,
 AllUnknownItemsCandidateItemsStrategy are good candidates. Does anybody
 know where to find some documentation about the different
 CandidateItemStrategy? Based on the name I would say that:
 1) AllSimilarItemsCandidateItemsStrategy returns all similar items
 regardless of whether they have been already rated by someone or not.
 2) AllUnknownItemsCandidateItemsStrategy returns all similar items that
 have not been rated by anyone yet.

 Does anybody know if it works like that?
 Thanks.


 On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com

 wrote:


 First thing is thatI know this requirement would not make sense in a CF
 Recommender. In my case, I am trying to use Mahout to create something
 closer to a Content-Based Recommender.

 In particular, I am pre-computing a similarity matrix between all the
 documents (items) of my catalogue and using that matrix as the
 ItemSimilarity for my Item-Based Recommender.

 So, when a user rates a document, how could I make the recommender

 outputs

 similar documents to that ones the user has already rated even if no

 other

 user in the system has rated them yet? Is that even possible in the

 first

 place?

 Thanks a lot.







Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Juan,
You got me wrong.

AllSimilarItemsCandidateItemsStrategy

returns all items that have not been rated by the user and for which
the similarity metric returns a non-NaN similarity value with at
least one of the items preferred by the user.

So, it does not simply return all items that have not been rated by
the user. For example, if there is an item X which has not been rated
by the user and the similarity values between X and all of the items
rated (preferred) by the user are NaN, then X will not be returned by
AllSimilarItemsCandidateItemsStrategy, but it will be returned by
AllUnknownItemsCandidateItemsStrategy.
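A small framework-free sketch (my own illustration, not the actual Mahout implementation) of the difference as I understand it:

```java
import java.util.*;

public class CandidateStrategiesSketch {
    // AllUnknownItems-style: every item the user has not rated.
    static Set<Long> allUnknown(Set<Long> allItems, Set<Long> userItems) {
        Set<Long> out = new TreeSet<>(allItems);
        out.removeAll(userItems);
        return out;
    }

    // AllSimilarItems-style: unrated items that have a defined (non-NaN)
    // similarity with at least one of the user's items.
    // sim.get(a).get(b) holds the similarity between items a and b; absent = undefined.
    static Set<Long> allSimilar(Set<Long> allItems, Set<Long> userItems,
                                Map<Long, Map<Long, Double>> sim) {
        Set<Long> out = new TreeSet<>();
        for (long candidate : allUnknown(allItems, userItems)) {
            for (long rated : userItems) {
                Double s = sim.getOrDefault(candidate, Map.of()).get(rated);
                if (s == null) s = sim.getOrDefault(rated, Map.of()).get(candidate);
                if (s != null && !s.isNaN()) { out.add(candidate); break; }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Set<Long> all = Set.of(1L, 2L, 3L, 4L);
        Set<Long> userItems = Set.of(1L);
        // item 3 is similar to item 1; items 2 and 4 have no defined similarity
        Map<Long, Map<Long, Double>> sim = Map.of(1L, Map.of(3L, 0.5));
        System.out.println(allUnknown(all, userItems));      // [2, 3, 4]
        System.out.println(allSimilar(all, userItems, sim)); // [3]
    }
}
```

The two strategies only diverge on items whose similarity with every one of the user's items is undefined; when the similarity matrix is dense they return the same set.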



On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:
 Hi Tevfik,

 Thanks for the response. I think what you say contradicts what Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy returns
 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.comwrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
 wrote:
  Hi Juan,
 
  If I remember correctly, AllSimilarItemsCandidateItemsStrategy
 
  returns all items that have not been rated by the user and the
  similarity metric returns a non-NaN similarity value that is with at
  least one of the items preferred by the user.
 
  Tevfik
 
  On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org
 wrote:
  On 03/05/2014 01:23 PM, Juan José Ramos wrote:
 
  Thanks for the reply, Sebastian.
 
  I am not sure if that should be implemented in the Abstract base class
  though because for
  instance PreferredItemsNeighborhoodCandidateItemsStrategy, by
 definition,
  it returns the item not rated by the user and rated by somebody else.
 
 
  Good point. So we seem to need special implementations.
 
 
 
  Back to my last post, I have been playing around with
  AllSimilarItemsCandidateItemsStrategy
  and AllUnknownItemsCandidateItemsStrategy, and although they both do
 what
  I
  wanted (recommend items not previously rated by any user), I honestly
  can't
  tell the difference between the two strategies. In my tests the output
 was
  always the same. If the eventual output of the recommender will not
  include
  items already rated by the user as pointed out here (
 
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
 ),
  AllSimilarItemsCandidateItemsStrategy should be equivalent to
  AllUnkownItemsCandidateItemsStrategy, shouldn't it?
 
 
  AllSimilarItems returns all items that are similar to any item that the
 user
  already knows. AllUnknownItems simply returns all items that the user
 has
  not interacted with yet.
 
  These are two different things, although they might overlap in some
  scenarios.
 
  Best,
  Sebastian
 
 
 
 
  Thanks.
 
  On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
  wrote:
 
 
  Hi Juan,
 
  that is a good catch. CandidateItemsStrategy is the right place to
 
  implement this. Maybe we should simply extend its interface to add a
  parameter that says whether to keep or remove the current users items?
 
 
  We could even do this in the abstract base class then.
 
  --sebastian
 
 
  On 03/05/2014 10:42 AM, Juan José Ramos wrote:
 
 
  In case somebody runs into the same situation, the key seems to be in
  the
  CandidateItemStrategy being passed to the constructor
  of GenericItemBasedRecommender. Looking into the code, if no
  CandidateItemStrategy is specified in the
  constructor, PreferredItemsNeighborhoodCandidateItemsStrategy is used
  and
  as the documentation says, the doGetCandidateItems method: returns
 all
  items that have not been rated by the user and that were preferred by
  another user that has preferred at least one item that the current
 user
 
  has
 
  preferred too.
 
  So, a different CandidateItemStrategy needs to be passed. For this
 
  problem,
 
  it seems to me that AllSimilarItemsCandidateItemsStrategy,
  AllUnknownItemsCandidateItemsStrategy are good candidates. Does
 anybody
  know where to find some documentation about the different
  CandidateItemStrategy? Based on the name I would say that:
  1) AllSimilarItemsCandidateItemsStrategy returns all similar items
  regardless of whether they have been already rated by someone or not.
  2) AllUnknownItemsCandidateItemsStrategy returns all similar items
 that
  have not been rated by anyone yet.
 
  Does anybody know if it works like that?
  Thanks.
 
 
  On Tue, Mar 4, 2014 at 9:16 AM, Juan José Ramos jjar...@gmail.com
 
  wrote

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
If the similarities between item 5 and two of the items user 1
preferred are not NaN, then it will return item 5; that is what I'm
saying. If the similarities were all NaN, then it would not return it.

But surely, you might wonder: if all similarities between an item and
the user's items are NaN, then AllUnknownItemsCandidateItemsStrategy
will still return that item, although it probably will not end up
being recommended.

So both strategies seem to be effectively the same; I don't know what
the implementers had in mind when designing
AllSimilarItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:
 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1, and
 the similarity between item 5 and two of the items user 1 preferred are not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item. So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.comwrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com wrote:
  Hi Tefik,
 
  Thanks for the response. I think what you says contradicts what Sebastian
  pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy
 returns
  all items that have not been rated by the user, what would
  AllUnknownItemsCandidateItemsStrategy return?
 
 
  On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin tevfik.ayte...@gmail.com
 wrote:
 
  Sorry there was a typo in the previous paragraph.
 
  If I remember correctly, AllSimilarItemsCandidateItemsStrategy
 
  returns all items that have not been rated by the user and the
  similarity metric returns a non-NaN similarity value with at
  least one of the items preferred by the user.
 
  On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.com
  wrote:
   Hi Juan,
  
   If I remember correctly, AllSimilarItemsCandidateItemsStrategy
  
   returns all items that have not been rated by the user and the
   similarity metric returns a non-NaN similarity value that is with at
   least one of the items preferred by the user.
  
   Tevfik
  
   On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org
  wrote:
   On 03/05/2014 01:23 PM, Juan José Ramos wrote:
  
   Thanks for the reply, Sebastian.
  
   I am not sure if that should be implemented in the Abstract base
 class
   though because for
   instance PreferredItemsNeighborhoodCandidateItemsStrategy, by
  definition,
   it returns the item not rated by the user and rated by somebody
 else.
  
  
   Good point. So we seem to need special implementations.
  
  
  
   Back to my last post, I have been playing around with
   AllSimilarItemsCandidateItemsStrategy
   and AllUnknownItemsCandidateItemsStrategy, and although they both do
  what
   I
   wanted (recommend items not previously rated by any user), I
 honestly
   can't
   tell the difference between the two strategies. In my tests the
 output
  was
   always the same. If the eventual output of the recommender will not
   include
   items already rated by the user as pointed out here (
  
  
 
 http://mail-archives.apache.org/mod_mbox/mahout-user/201403.mbox/%3CCABHkCkuv35dbwF%2B9sK88FR3hg7MAcdv0MP10v-5QWEvwmNdY%2BA%40mail.gmail.com%3E
  ),
   AllSimilarItemsCandidateItemsStrategy should be equivalent to
   AllUnkownItemsCandidateItemsStrategy, shouldn't it?
  
  
   AllSimilarItems returns all items that are similar to any item that
 the
  user
   already knows. AllUnknownItems simply returns all items that the user
  has
   not interacted with yet.
  
   These are two different things, although they might overlap in some
   scenarios.
  
   Best,
   Sebastian
  
  
  
  
   Thanks.
  
   On Wed, Mar 5, 2014 at 10:23 AM, Sebastian Schelter s...@apache.org
 
   wrote:
  
  
   Hi Juan,
  
   that is a good catch. CandidateItemsStrategy is the right place to
  
   implement this. Maybe we should simply extend its

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
Hi Sebastian,
But in order not to select items that are not similar to at least one
of the items the user interacted with, you have to compute the
similarity with all of the user's items (which is the main task in
estimating the preference of an item in the item-based method). So, it
seems to me that AllSimilarItemsStrategy does not bring much advantage
over AllUnknownItemsCandidateItemsStrategy.

On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:
 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

 It can take a long time to estimate preferences for all items a user doesn't
 know. Especially if you have a lot of items. Traditional item-based
 recommenders will not recommend any item that is not similar to at least one
 of the items the user interacted with, so AllSimilarItemsStrategy already
 selects the maximum set of items that could be potentially recommended to
 the user.

 --sebastian




 On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

 If the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN then it will return 1, that is what I'm saying. If the
 similarities were all NaN then
 it will not return it.

 But surely, you might wonder if all similarities between an item and
 user's items are NaN, then
 AllUnknownItemsCandidateItemsStrategy probably will not return it.


 On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1,
 and
 the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item.
 So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.comwrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com
 wrote:

 Hi Tefik,

 Thanks for the response. I think what you says contradicts what
 Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

 returns

 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com
 wrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

 tevfik.ayte...@gmail.com

 wrote:

 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik

 On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

 wrote:

 On 03/05/2014 01:23 PM, Juan José Ramos wrote:


 Thanks for the reply, Sebastian.

 I am not sure if that should be implemented in the Abstract base

 class

 though because for
 instance PreferredItemsNeighborhoodCandidateItemsStrategy, by

 definition,

 it returns the item not rated by the user and rated by somebody

 else.



 Good point. So we seem to need special implementations.



 Back to my last post, I have been playing around with
 AllSimilarItemsCandidateItemsStrategy
 and AllUnknownItemsCandidateItemsStrategy, and although they both
 do

 what

 I
 wanted (recommend items not previously rated by any user), I

 honestly

 can't
 tell the difference between the two strategies. In my tests the

 output

 was

 always the same. If the eventual output of the recommender will not
 include
 items already rated by the user

Re: Recommend items not rated by any user

2014-03-05 Thread Tevfik Aytekin
It can even make things worse in SVD-based algorithms for which
preference estimation is very fast.

On Wed, Mar 5, 2014 at 7:00 PM, Tevfik Aytekin tevfik.ayte...@gmail.com wrote:
 Hi Sebastian,
 But in order not to select items that is not similar to at least one
 of the items the user interacted with you have to compute the
 similarity with all user items (which is the main task for estimating
 the preference of an item in item-based method). So, it seems to me
 that AllSimilarItemsStrategy does not bring much advantage over
 AllUnknownItemsCandidateItemsStrategy.

 On Wed, Mar 5, 2014 at 6:46 PM, Sebastian Schelter s...@apache.org wrote:
 So both strategies seems to be effectively the same, I don't know what
 the implementers had in mind when designing
 AllSimilarItemsCandidateItemsStrategy.

 It can take a long time to estimate preferences for all items a user doesn't
 know. Especially if you have a lot of items. Traditional item-based
 recommenders will not recommend any item that is not similar to at least one
 of the items the user interacted with, so AllSimilarItemsStrategy already
 selects the maximum set of items that could be potentially recommended to
 the user.

 --sebastian




 On 03/05/2014 05:38 PM, Tevfik Aytekin wrote:

 If the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN then it will return 1, that is what I'm saying. If the
 similarities were all NaN then
 it will not return it.

 But surely, you might wonder if all similarities between an item and
 user's items are NaN, then
 AllUnknownItemsCandidateItemsStrategy probably will not return it.


 On Wed, Mar 5, 2014 at 6:06 PM, Juan José Ramos jjar...@gmail.com wrote:

 @Tevfik, running this recommender:

 GenericItemBasedRecommender itemRecommender = new
 GenericItemBasedRecommender(dataModel, itemSimilarity, new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity), new
 AllSimilarItemsCandidateItemsStrategy(itemSimilarity));


 With this dataModel:
 1,1,1.0
 1,2,2.0
 1,3,1.0
 1,4,2.0
 2,1,1.0
 2,2,4.0


 And these similarities
 1,2,0.1
 1,3,0.2
 1,4,0.3
 2,3,0.5
 3,4,0.5
 5,1,0.2
 5,2,1.0

 Returns item 5 for User 1. So item 5 has not been preferred by user 1,
 and
 the similarity between item 5 and two of the items user 1 preferred are
 not
 NaN, but AllSimilarItemsCandidateItemsStrategy is returning that item.
 So,
 I'm truly sorry to insist on this, but I still really do not get the
 difference.


 On Wed, Mar 5, 2014 at 2:53 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.comwrote:

 Juan,
 You got me wrong,

 AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 So, it does not simply return all items that have not been rated by
 the user. For example, if there is an item X which has not been rated
 by the user and if the similarity value between X and at least one of
 the items rated (preferred) by the user is not NaN, then X will be not
 be returned by AllSimilarItemsCandidateItemsStrategy, but it will be
 returned by AllUnknownItemsCandidateItemsStrategy.



 On Wed, Mar 5, 2014 at 4:42 PM, Juan José Ramos jjar...@gmail.com
 wrote:

 Hi Tefik,

 Thanks for the response. I think what you says contradicts what
 Sebastian
 pointed out before. Also, if AllSimilarItemsCandidateItemsStrategy

 returns

 all items that have not been rated by the user, what would
 AllUnknownItemsCandidateItemsStrategy return?


 On Wed, Mar 5, 2014 at 1:40 PM, Tevfik Aytekin
 tevfik.ayte...@gmail.com
 wrote:

 Sorry there was a typo in the previous paragraph.

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value with at
 least one of the items preferred by the user.

 On Wed, Mar 5, 2014 at 3:38 PM, Tevfik Aytekin 

 tevfik.ayte...@gmail.com

 wrote:

 Hi Juan,

 If I remember correctly, AllSimilarItemsCandidateItemsStrategy

 returns all items that have not been rated by the user and the
 similarity metric returns a non-NaN similarity value that is with at
 least one of the items preferred by the user.

 Tevfik

 On Wed, Mar 5, 2014 at 2:30 PM, Sebastian Schelter s...@apache.org

 wrote:

 On 03/05/2014 01:23 PM, Juan José Ramos wrote:


 Thanks for the reply, Sebastian.

 I am not sure if that should be implemented in the Abstract base

 class

 though because for
 instance PreferredItemsNeighborhoodCandidateItemsStrategy, by

 definition,

 it returns the item not rated by the user and rated by somebody

 else.



 Good point. So we seem to need special implementations.



 Back to my last post, I have been playing around with
 AllSimilarItemsCandidateItemsStrategy
 and AllUnknownItemsCandidateItemsStrategy, and although they both
 do

 what

 I
 wanted (recommend items not previously rated by any user), I

 honestly

 can't
 tell

Re: Why some userId has no recommendations?

2014-02-13 Thread Tevfik Aytekin
In some cases users might not get any recommendations, and there can
be different reasons for this. In your case, the only item that can be
recommended to user 5 is item 107 (since user 5 has rated all the
other items). Item 107 got two ratings, both of which are 5, so the
Pearson correlation between this item and the others is undefined. I
think this is the reason why user 5 is not getting any
recommendations.

Tevfik

On Thu, Feb 13, 2014 at 9:08 AM, jobin wilson jobinwil...@gmail.com wrote:
 Hi Jiang,

 Mahout's user-based recommender makes use of the similarity of a user with
 other users to arrive at what to recommend to him. In this specific case, it
 uses the Pearson correlation coefficient calculated from the user ratings as
 a similarity measure to form a neighborhood. It then estimates ratings for
 unrated items based on user similarity and the ratings provided by neighbors.

 A short answer is that whether a user gets any recommendations depends
 entirely on the training data that you provide as input to the model. In this
 case, if you expect 107 as a recommendation for user 5, there aren't enough
 ratings available for 107 in user 5's neighborhood. If you modify your data
 as below, you will get recommendations for user 5. (Just add a dummy rating
 2,107,5.)

 I have included some code snippet which demonstrate this idea of user
 similarity and neighborhood .Hope this helps.

 *Code:*
 public class Test {

 public static void main(String args[]) throws Exception {
 String inFile = "F:\\hadoop\\data\\recsysinput.txt";
 DataModel dataModel = new FileDataModel(new File(inFile));
 UserSimilarity userSimilarity = new
 PearsonCorrelationSimilarity(dataModel);
 UserNeighborhood userNeighborhood = new
 NearestNUserNeighborhood(100, userSimilarity, dataModel);
 Recommender recommender = new
 GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

 for (int i = 1; i <= 5; i++) {
 List<RecommendedItem> recommendations =
 recommender.recommend(i, 1);
 for (int j = 1; j <= 5; j++) {
 System.out.println("Similarity between user:" + i + " and user:" + j
 + " = " + userSimilarity.userSimilarity(i, j));
 }
 System.out.println("recommend for user:" + i + " Neighborhood Size:"
 + userNeighborhood.getUserNeighborhood(i).length);

 for (RecommendedItem recommendation : recommendations) {
 System.out.println(recommendation);
 }
 }
 }
 }

 *Input:*
 1,101,5.0
 1,102,3.0
 1,103,2.5
 2,101,2
 2,102,2.5
 2,103,5
 2,104,2
 2,107,5
 3,101,2.5
 3,104,4
 3,105,4.5
 3,107,5
 4,101,5
 4,103,3
 4,104,4.5
 4,106,4
 5,101,4
 5,102,3
 5,103,2
 5,104,4
 5,105,3.5
 5,106,4

 *Output:*
 SLF4J: Class path contains multiple SLF4J bindings.
 SLF4J: Found binding in
 [jar:file:/D:/from%20D/MSR/Coursework/SEM2/Pattern%20Recognition/project/acadnet/mahout-distribution-0.7/mahout-distribution-0.7/mahout-examples-0.7-job.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/D:/from%20D/MSR/Coursework/SEM2/Pattern%20Recognition/project/acadnet/mahout-distribution-0.7/mahout-distribution-0.7/lib/slf4j-jcl-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: Found binding in
 [jar:file:/D:/from%20D/MSR/Coursework/SEM2/Pattern%20Recognition/project/acadnet/mahout-distribution-0.7/mahout-distribution-0.7/lib/slf4j-log4j12-1.6.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
 SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
 explanation.
 log4j:WARN No appenders could be found for logger
 (org.apache.mahout.cf.taste.impl.model.file.FileDataModel).
 log4j:WARN Please initialize the log4j system properly.
 log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
 more info.
 Similarity between user:1 and user:1= 1.0
 Similarity between user:1 and user:2= -0.7642652566278799
 Similarity between user:1 and user:3= NaN
 Similarity between user:1 and user:4= 0.9998
 Similarity between user:1 and user:5= 0.944911182523068
 recommend for user:1 Neighborhood Size:3
 RecommendedItem[item:104, value:5.0]
 Similarity between user:2 and user:1= -0.7642652566278799
 Similarity between user:2 and user:2= 0.9998
 Similarity between user:2 and user:3= 0.8029550685469666
 Similarity between user:2 and user:4= -0.9707253433941515
 Similarity between user:2 and user:5= -0.9393939393939394
 recommend for user:2 Neighborhood Size:4
 RecommendedItem[item:106, value:4.0]
 Similarity between user:3 and user:1= NaN
 Similarity between user:3 and user:2= 0.8029550685469666
 Similarity between user:3 and user:3= 1.0
 Similarity between user:3 and user:4= -1.0
 Similarity between user:3 and user:5= -0.6933752452815484
 recommend for user:3 Neighborhood Size:3
 RecommendedItem[item:106, value:4.0]
 Similarity between user:4 and user:1= 0.9998
 Similarity between user:4 and user:2= -0.9707253433941515
 Similarity between user:4 and user:3= -1.0
 Similarity 

Re: Why some userId has no recommendations?

2014-02-13 Thread Tevfik Aytekin
You are right Koobas, my answer was on the assumption that item-based
NN is used (but I noticed that user-based NN is being used). So my
answer is not correct, sorry.
At the moment I cannot see the exact reason why user 5 is not
getting any recommendations; as you said, user 5 should get 107.

On Thu, Feb 13, 2014 at 3:21 PM, Koobas koo...@gmail.com wrote:
 User 3 gave a recommendation to item 107.
 User 5 did not rate 107.


 On Thu, Feb 13, 2014 at 1:57 AM, Suresh M suresh4mas...@gmail.com wrote:

 user 5 has given ratings for all 5 books,
 so there will be no recommendations for him.



 On 12 February 2014 08:55, jiangwen jiang jiangwen...@gmail.com wrote:

  Hi, all:
 
  I try to user mahout api to make recommendations, but I find some userId
  has no recommendations, why?
 
  here is my code
  public static void main(String args[]) throws Exception {
  String inFile = "F:\\hadoop\\data\\recsysinput.txt";
  DataModel dataModel = new FileDataModel(new File(inFile));
  UserSimilarity userSimilarity = new
  PearsonCorrelationSimilarity(dataModel);
  UserNeighborhood userNeighborhood = new
  NearestNUserNeighborhood(100, userSimilarity, dataModel);
  Recommender recommender = new
  GenericUserBasedRecommender(dataModel, userNeighborhood, userSimilarity);

  for (int i = 1; i <= 5; i++) {
  List<RecommendedItem> recommendations =
  recommender.recommend(i, 1);

  System.out.println("recommend for user:" + i);
  for (RecommendedItem recommendation : recommendations) {
  System.out.println(recommendation);
  }
  }
  }
 
 
  input data(recsysinput.txt):
  1,101,5.0
  1,102,3.0
  1,103,2.5
  2,101,2
  2,102,2.5
  2,103,5
  2,104,2
  3,101,2.5
  3,104,4
  3,105,4.5
  3,107,5
  4,101,5
  4,103,3
  4,104,4.5
  4,106,4
  5,101,4
  5,102,3
  5,103,2
  5,104,4
  5,105,3.5
  5,106,4
 
  output:
  recommend for user:1
  RecommendedItem[item:104, value:5.0]
  recommend for user:2
  RecommendedItem[item:106, value:4.0]
  recommend for user:3
  RecommendedItem[item:106, value:4.0]
  recommend for user:4
  RecommendedItem[item:105, value:5.0]
  recommend for user:5
 
  UserId 5 has no recommendations, is it right?
  Can I get some recommendations for userId 5, even if the recommendation
  results are not good enough?
 
  thanks
  Regards!
 



Re: Popularity of recommender items

2014-02-06 Thread Tevfik Aytekin
Well, I think what you are suggesting is to define popularity as being
similar to other items. So in this way most popular items will be
those which are most similar to all other items, like the centroids in
K-means.

I would first check the correlation between this definition and the
standard one (that is, the definition of popularity as having the
highest number of ratings). But my intuition is that they are
different things. For example, an item might lie at the center in the
similarity space but it might not be a popular item. However, there
might still be some correlation, it would be interesting to check it.
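The check suggested above can be prototyped in a few lines. The following is a toy-data sketch (made-up ratings, plain cosine instead of LLR, not Mahout code) comparing the two popularity measures: rating counts versus an item's summed item-item similarity.

```java
public class PopularityCheck {
    // Toy user-item rating matrix; 0 means "no rating". Made-up data.
    static final double[][] R = {
        {5, 3, 0, 1},
        {4, 0, 0, 1},
        {1, 1, 0, 5},
        {1, 0, 0, 4},
    };

    // Popularity measure 1: number of ratings an item received.
    public static int ratingCount(int item) {
        int c = 0;
        for (double[] row : R) if (row[item] != 0) c++;
        return c;
    }

    // Popularity measure 2: the item's summed cosine similarity to all
    // other items (NaN similarities, e.g. against an unrated item, skipped).
    public static double similaritySum(int item) {
        double s = 0;
        for (int j = 0; j < R[0].length; j++) {
            if (j == item) continue;
            double sim = cosine(column(item), column(j));
            if (!Double.isNaN(sim)) s += sim;
        }
        return s;
    }

    static double[] column(int j) {
        double[] c = new double[R.length];
        for (int i = 0; i < R.length; i++) c[i] = R[i][j];
        return c;
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        for (int j = 0; j < R[0].length; j++)
            System.out.println("item " + j + ": count=" + ratingCount(j)
                               + " simSum=" + similaritySum(j));
    }
}
```

On real data one would compute the rank correlation between the two columns to see how far they diverge.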

hope it helps




On Wed, Feb 5, 2014 at 3:27 AM, Pat Ferrel p...@occamsmachete.com wrote:
 Trying to come up with a relative measure of popularity for items in a 
 recommender. Something that could be used to rank items.

 The user - item preference matrix would be the obvious thought. Just add the 
 number of preferences per item. Maybe transpose the preference matrix (the 
 temp DRM created by the recommender), then for each row vector (now that a 
 row = item) grab the number of non zero preferences. This corresponds to the 
 number of preferences, and would give one measure of popularity. In the case 
 where the items are not boolean you'd sum the weights.

 However it might be a better idea to look at the item-item similarity matrix. 
 It doesn't need to be transposed and contains the important 
 similarities--as calculated by LLR for example. Here similarity means 
 similarity in which users preferred an item. So summing the non-zero weights 
 would give perhaps an even better relative popularity measure. For the same 
 reason clustering the similarity matrix would yield important clusters.

 Anyone have intuition about this?

 I started to think about this because transposing the user-item matrix seems 
 to yield a format that cannot be sent directly into clustering.


Re: generic latent variable recommender question

2014-01-26 Thread Tevfik Aytekin
Thanks for the answers, actually I worked on a similar issue,
increasing the diversity of top-N lists
(http://link.springer.com/article/10.1007%2Fs10844-013-0252-9).
Clustering-based approaches produce good results and they are very
fast compared to some optimization based techniques. Also it turned
out that introducing randomization (such as choosing random 20 items
among the top 100 items) might decrease diversity if the diversity of
the top-N lists is better than the diversity of a set of random items,
which might sometimes be the case.

On Sun, Jan 26, 2014 at 8:49 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 On Sun, Jan 26, 2014 at 9:36 AM, Pat Ferrel p...@occamsmachete.com wrote:

 I think I’ll leave dithering out until it goes live because it would seem
 to make the eyeball test easier. I doubt all these experiments will survive.


 With anti-flood if you turn the epsilon parameter to 1 (makes log(epsilon)
 = 0), then no re-ordering is done.

 I like knobs that go to 11, but also have an off position.
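For readers following along: the epsilon knob described above is usually sketched as re-ranking by noisy log-rank scores. This is an illustrative sketch of that commonly described scheme (not Mahout code); at epsilon = 1 the noise term vanishes and the original order is preserved, matching the "off position" above.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

public class DitherDemo {
    // Re-rank positions 0..n-1 by score = log(rank) + N(0, 1) * log(epsilon).
    // epsilon = 1 makes log(epsilon) = 0, i.e. zero noise and no re-ordering;
    // larger epsilon shuffles the head of the list more aggressively.
    public static int[] ditheredOrder(int n, double epsilon, long seed) {
        Random rnd = new Random(seed);
        double[] score = new double[n];
        Integer[] idx = new Integer[n];
        for (int i = 0; i < n; i++) {
            idx[i] = i;
            score[i] = Math.log(i + 1) + rnd.nextGaussian() * Math.log(epsilon);
        }
        Arrays.sort(idx, Comparator.comparingDouble(i -> score[i]));
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = idx[i];
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(ditheredOrder(8, 1.0, 7))); // unchanged order
        System.out.println(Arrays.toString(ditheredOrder(8, 3.0, 7))); // reshuffled near the top
    }
}
```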


Re: generic latent variable recommender question

2014-01-25 Thread Tevfik Aytekin
Case 1 is fine, in case 2, I don't think that a dot product (without
normalization) will yield a meaningful distance measure. Cosine
distance or a Pearson correlation would be better. The situation is
similar to Latent Semantic Indexing in which documents are represented
by their low rank approximations and similarities between them (that
is, approximations) are computed using cosine similarity.
There is no need to make any normalization in case 1 since the values
in the feature vectors are formed to approximate the rating values.
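To make the two cases concrete, here is a small self-contained sketch (the factor values are made up, not real ALS/SVD output). Case 1 scores items by raw dot products; Case 2 shows why normalization matters for item-item similarity, since a scaled-up factor vector wins on the raw dot product even though its direction is nearly identical.

```java
public class LatentFactorDemo {
    public static double dot(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += a[i] * b[i];
        return s;
    }

    public static double cosine(double[] a, double[] b) {
        return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
    }

    public static void main(String[] args) {
        double[]   user  = {0.9, 0.1};                      // user factor vector
        double[][] items = {{1.0, 0.0}, {2.0, 0.1}, {0.0, 1.0}}; // item factors

        // Case 1: predicted preferences are plain dot products, no
        // normalization needed -- the factors were fit to approximate ratings.
        for (double[] item : items) System.out.println(dot(user, item));

        // Case 2: item 1 is roughly a scaled-up item 0. The raw dot product
        // rewards the larger norm, while cosine treats near-parallel
        // vectors as highly similar regardless of scale.
        System.out.println(dot(items[0], items[1]));    // 2.0
        System.out.println(cosine(items[0], items[1])); // close to 1
    }
}
```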

On Sat, Jan 25, 2014 at 5:08 AM, Koobas koo...@gmail.com wrote:
 A generic latent variable recommender question.
 I passed the user-item matrix through a low rank approximation,
 with either something like ALS or SVD, and now I have the feature
 vectors for all users and all items.

 Case 1:
 I want to recommend items to a user.
 I compute a dot product of the user’s feature vector with all feature
 vectors of all the items.
 I eliminate the ones that the user already has, and find the largest value
 among the others, right?

 Case 2:
 I want to find similar items for an item.
 Should I compute dot product of the item’s feature vector against feature
 vectors of all the other items?
OR
 Should I compute the ANGLE between each pair of feature vectors?
 I.e., compute the cosine similarity?
 I.e., normalize the vectors before computing the dot products?

 If “yes” for case 2, is that something I should also do for case 1?


Re: generic latent variable recommender question

2014-01-25 Thread Tevfik Aytekin
Hi Ted,
Could you explain what do you mean by a dithering step and an
anti-flood step?
By dithering I guess you mean adding some sort of noise in order not
to show the same results every time.
But I have no clue about the anti-flood step.

Tevfik

On Sat, Jan 25, 2014 at 11:05 PM, Koobas koo...@gmail.com wrote:
 On Sat, Jan 25, 2014 at 3:51 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.comwrote:

 Case 1 is fine, in case 2, I don't think that a dot product (without
 normalization) will yield a meaningful distance measure. Cosine
 distance or a Pearson correlation would be better. The situation is
 similar to Latent Semantic Indexing in which documents are represented
 by their low rank approximations and similarities between them (that
 is, approximations) are computed using cosine similarity.
 There is no need to make any normalization in case 1 since the values
 in the feature vectors are formed to approximate the rating values.

 That's exactly what I was thinking.
 Thanks for your reply.


 On Sat, Jan 25, 2014 at 5:08 AM, Koobas koo...@gmail.com wrote:
  A generic latent variable recommender question.
  I passed the user-item matrix through a low rank approximation,
  with either something like ALS or SVD, and now I have the feature
  vectors for all users and all items.
 
  Case 1:
  I want to recommend items to a user.
  I compute a dot product of the user’s feature vector with all feature
  vectors of all the items.
  I eliminate the ones that the user already has, and find the largest
 value
  among the others, right?
 
  Case 2:
  I want to find similar items for an item.
  Should I compute dot product of the item’s feature vector against feature
  vectors of all the other items?
 OR
  Should I compute the ANGLE between each pair of feature vectors?
  I.e., compute the cosine similarity?
  I.e., normalize the vectors before computing the dot products?
 
  If “yes” for case 2, is that something I should also do for case 1?



Re: Hadoop implementation of ParallelSGDFactorizer

2013-09-08 Thread Tevfik Aytekin
Thanks Sebastian.

On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter
ssc.o...@googlemail.com wrote:
 IIRC the algorithm behind ParallelSGDFactorizer needs shared memory,
 which is not given in a shared-nothing environment.


 On 07.09.2013 19:08, Tevfik Aytekin wrote:
 Hi,
 There seems to be no Hadoop implementation of ParallelSGDFactorizer.
 ALSWRFactorizer has a Hadoop implementation.

 ParallelSGDFactorizer (since it is based on stochastic gradient
 descent) is much faster than ALSWRFactorizer.

 I don't know Hadoop much. But it seems to me that a Hadoop
 implementation of ParallelSGDFactorizer will also be much faster than
 the Hadoop implementation of ALSWRFactorizer.

 Is there a specific reason for why there is no Hadoop implementation
 of ParallelSGDFactorizer? Is it because since Hadoop operations are
 already slow the slowness of ALSWRFactorizer does not matter much. Or
 is it simply because nobody has implemented it yet?

 Thanks
 Tevfik




Hadoop implementation of ParallelSGDFactorizer

2013-09-07 Thread Tevfik Aytekin
Hi,
There seems to be no Hadoop implementation of ParallelSGDFactorizer.
ALSWRFactorizer has a Hadoop implementation.

ParallelSGDFactorizer (since it is based on stochastic gradient
descent) is much faster than ALSWRFactorizer.

I don't know Hadoop much. But it seems to me that a Hadoop
implementation of ParallelSGDFactorizer will also be much faster than
the Hadoop implementation of ALSWRFactorizer.

Is there a specific reason for why there is no Hadoop implementation
of ParallelSGDFactorizer? Is it because since Hadoop operations are
already slow the slowness of ALSWRFactorizer does not matter much. Or
is it simply because nobody has implemented it yet?

Thanks
Tevfik


Re: Hadoop implementation of ParallelSGDFactorizer

2013-09-07 Thread Tevfik Aytekin
Sebastian, what is IIRC?

On Sat, Sep 7, 2013 at 8:24 PM, Sebastian Schelter
ssc.o...@googlemail.com wrote:
 IIRC the algorithm behind ParallelSGDFactorizer needs shared memory,
 which is not given in a shared-nothing environment.


 On 07.09.2013 19:08, Tevfik Aytekin wrote:
 Hi,
 There seems to be no Hadoop implementation of ParallelSGDFactorizer.
 ALSWRFactorizer has a Hadoop implementation.

 ParallelSGDFactorizer (since it is based on stochastic gradient
 descent) is much faster than ALSWRFactorizer.

 I don't know Hadoop much. But it seems to me that a Hadoop
 implementation of ParallelSGDFactorizer will also be much faster than
  the Hadoop implementation of ALSWRFactorizer.

 Is there a specific reason for why there is no Hadoop implementation
 of ParallelSGDFactorizer? Is it because since Hadoop operations are
 already slow the slowness of ALSWRFactorizer does not matter much. Or
 is it simply because nobody has implemented it yet?

 Thanks
 Tevfik




Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Thanks Sean, but I could not get your answer. Can you please explain it again?


On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote:
 It doesn't matter, in the sense that it is never going to be fast
 enough for real-time at any reasonable scale if actually run off a
 database directly. One operation results in thousands of queries. It's
 going to read data into memory anyway and cache it there. So, whatever
 is easiest for you. The simplest solution is a file.

 On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
 ahmetyilmazefe...@yahoo.com wrote:
 Hi,
 I would like to use Mahout to make recommendations on my web site. Since the 
 data is going to be big, hopefully, I plan to use hadoop implementations of 
 the recommender algorithms.

 I'm currently storing the data in mysql. Should I continue with it or should 
 I switch to a nosql database such as mongodb or something else?

 Thanks
 Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
ok, got it, thanks.

On Sun, May 19, 2013 at 8:20 PM, Sean Owen sro...@gmail.com wrote:
 I'm first saying that you really don't want to use the database as a
 data model directly. It is far too slow.
 Instead you want to use a data model implementation that reads all of
 the data, once, serially, into memory. And in that case, it makes no
 difference where the data is being read from, because it is read just
 once, serially. A file is just as fine as a fancy database. In fact
 it's probably easier and faster.

 On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
 tevfik.ayte...@gmail.com wrote:
 Thanks Sean, but I could not get your answer. Can you please explain it 
 again?


 On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote:
 It doesn't matter, in the sense that it is never going to be fast
 enough for real-time at any reasonable scale if actually run off a
 database directly. One operation results in thousands of queries. It's
 going to read data into memory anyway and cache it there. So, whatever
 is easiest for you. The simplest solution is a file.

  On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
 ahmetyilmazefe...@yahoo.com wrote:
 Hi,
 I would like to use Mahout to make recommendations on my web site. Since 
 the data is going to be big, hopefully, I plan to use hadoop 
 implementations of the recommender algorithms.

 I'm currently storing the data in mysql. Should I continue with it or 
 should I switch to a nosql database such as mongodb or something else?

 Thanks
 Ahmet


Re: Which database should I use with Mahout

2013-05-19 Thread Tevfik Aytekin
Hi Manuel,
But if one uses matrix factorization and stores the user and item
factors in memory then there will be no database access during
recommendation.
I thought that the original question was where to store the data and
how to give it to hadoop.

On Sun, May 19, 2013 at 9:01 PM, Manuel Blechschmidt
manuel.blechschm...@gmx.de wrote:
 Hi Tevfik,
 one request to the recommender could become more than 1000 queries to the 
 database depending on which recommender you use and the amount of preferences 
 for the given user.

 The problem is not if you are using SQL, NoSQL, or any other query language. 
 The problem is the latency of the answers.

 An average TCP packet in the same data center takes 500 µs; a main memory 
 reference takes 0.1 µs. This means that the main memory of your Java process 
 can be accessed 5000 times faster than any other process, such as a database 
 connected via TCP/IP.

 http://www.eecs.berkeley.edu/~rcs/research/interactive_latency.html
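Plugging the quoted figures together gives a quick sanity check of the argument (this is just the arithmetic from the numbers cited above, not a measurement):

```java
public class LatencyMath {
    // Figures quoted in this thread; illustrative, not measured values.
    static final double TCP_ROUND_TRIP_US = 500.0; // same-data-center TCP packet
    static final double MEM_REF_US = 0.1;          // main memory reference

    // Memory is roughly this many times faster than a TCP round trip.
    public static double slowdownFactor() {
        return TCP_ROUND_TRIP_US / MEM_REF_US;
    }

    // Wall-clock seconds spent on network latency alone for one
    // recommendation request that issues this many database queries.
    public static double secondsPerRequest(int queries) {
        return queries * TCP_ROUND_TRIP_US / 1_000_000.0;
    }

    public static void main(String[] args) {
        System.out.println(slowdownFactor());        // ~5000x
        System.out.println(secondsPerRequest(1000)); // 0.5 s per request
    }
}
```

So a single request fanning out into 1000 database queries pays about half a second in network latency before any actual computation happens, which is why caching the data model in process memory dominates.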

 Here you can see a screenshot that shows that database communication is by 
 far (99%) the slowest component of a recommender request:

 https://source.apaxo.de/MahoutDatabaseLowPerformance.png

 If you do not want to cache your data in your Java process, you can use a 
 completely in-memory database technology like SAP HANA 
 http://www.saphana.com/welcome or EXASOL http://www.exasol.com/

 Nevertheless if you are using these you do not need Mahout anymore.

 An architecture of a Mahout system can be seen here:
 https://github.com/ManuelB/facebook-recommender-demo/blob/master/docs/RecommenderArchitecture.png

 Hope that helps
 Manuel

 Am 19.05.2013 um 19:20 schrieb Sean Owen:

 I'm first saying that you really don't want to use the database as a
 data model directly. It is far too slow.
 Instead you want to use a data model implementation that reads all of
 the data, once, serially, into memory. And in that case, it makes no
 difference where the data is being read from, because it is read just
 once, serially. A file is just as fine as a fancy database. In fact
 it's probably easier and faster.

 On Sun, May 19, 2013 at 10:14 AM, Tevfik Aytekin
 tevfik.ayte...@gmail.com wrote:
 Thanks Sean, but I could not get your answer. Can you please explain it 
 again?


 On Sun, May 19, 2013 at 8:00 PM, Sean Owen sro...@gmail.com wrote:
 It doesn't matter, in the sense that it is never going to be fast
 enough for real-time at any reasonable scale if actually run off a
 database directly. One operation results in thousands of queries. It's
 going to read data into memory anyway and cache it there. So, whatever
 is easiest for you. The simplest solution is a file.

  On Sun, May 19, 2013 at 9:52 AM, Ahmet Yılmaz
 ahmetyilmazefe...@yahoo.com wrote:
 Hi,
 I would like to use Mahout to make recommendations on my web site. Since 
 the data is going to be big, hopefully, I plan to use hadoop 
 implementations of the recommender algorithms.

 I'm currently storing the data in mysql. Should I continue with it or 
 should I switch to a nosql database such as mongodb or something else?

 Thanks
 Ahmet

 --
 Manuel Blechschmidt
 M.Sc. IT Systems Engineering
 Dortustr. 57
 14467 Potsdam
 Mobil: 0173/6322621
 Twitter: http://twitter.com/Manuel_B



Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
This problem is called one-class classification problem. In the domain
of collaborative filtering it is called one-class collaborative
filtering (since what you have are only positive preferences). You may
search the web with these key words to find papers providing
solutions. I'm not sure whether Mahout has algorithms for one-class
collaborative filtering.

On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote:
 ALS-WR weights the error on each term differently, so the average
 error doesn't really have meaning here, even if you are comparing the
 difference with 1. I think you will need to fall back to mean
 average precision or something.

 On Mon, May 6, 2013 at 11:24 AM, William icswilliam2...@gmail.com wrote:
 Sean Owen srowen at gmail.com writes:


 If you have no ratings, how are you using RMSE? this typically
 measures error in reconstructing ratings.
 I think you are probably measuring something meaningless.



 I suppose the rating of seen movies is 1. Is that right?
 If I use Collaborative Filtering with ALS-WR to get some recommendations, I
 must have a real rating-matrix?





Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
Hi Sean,
Aren't boolean preferences supported in the context of memory-based
recommendation algorithms in Mahout?
Are there matrix factorization algorithms in Mahout which can work
with this kind of data (that is, the kind of data which consists of
users and the movies they have seen).




On Mon, May 6, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote:
 Yes, it goes by the name 'boolean prefs' in the project since target
 variables don't have values -- they just exist or don't.
 So, yes it's certainly supported but the question here is how to
 evaluate the output.

 On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.ayte...@gmail.com 
 wrote:
 This problem is called one-class classification problem. In the domain
 of collaborative filtering it is called one-class collaborative
 filtering (since what you have are only positive preferences). You may
 search the web with these key words to find papers providing
 solutions. I'm not sure whether Mahout has algorithms for one-class
 collaborative filtering.

 On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote:
 ALS-WR weights the error on each term differently, so the average
 error doesn't really have meaning here, even if you are comparing the
 difference with 1. I think you will need to fall back to mean
 average precision or something.

 On Mon, May 6, 2013 at 11:24 AM, William icswilliam2...@gmail.com wrote:
 Sean Owen srowen at gmail.com writes:


 If you have no ratings, how are you using RMSE? this typically
 measures error in reconstructing ratings.
 I think you are probably measuring something meaningless.



  I suppose the rating of seen movies is 1. Is that right?
 If I use Collaborative Filtering with ALS-WR to get some recommendations, I
 must have a real rating-matrix?





Re: parallelALS and RMSE TEST

2013-05-06 Thread Tevfik Aytekin
But the data under consideration here is not 0/1 data, it contains only 1's.

On Mon, May 6, 2013 at 11:29 PM, Sean Owen sro...@gmail.com wrote:
 Parallel ALS is exactly an example of where you can use matrix
 factorization for 0/1 data.

 On Mon, May 6, 2013 at 9:22 PM, Tevfik Aytekin tevfik.ayte...@gmail.com 
 wrote:
 Hi Sean,
  Aren't boolean preferences supported in the context of memory-based
 recommendation algorithms in Mahout?
 Are there matrix factorization algorithms in Mahout which can work
 with this kind of data (that is, the kind of data which consists of
 users and the movies they have seen).




 On Mon, May 6, 2013 at 10:34 PM, Sean Owen sro...@gmail.com wrote:
 Yes, it goes by the name 'boolean prefs' in the project since target
 variables don't have values -- they just exist or don't.
 So, yes it's certainly supported but the question here is how to
 evaluate the output.

 On Mon, May 6, 2013 at 8:29 PM, Tevfik Aytekin tevfik.ayte...@gmail.com 
 wrote:
 This problem is called one-class classification problem. In the domain
 of collaborative filtering it is called one-class collaborative
 filtering (since what you have are only positive preferences). You may
 search the web with these key words to find papers providing
 solutions. I'm not sure whether Mahout has algorithms for one-class
 collaborative filtering.

 On Mon, May 6, 2013 at 1:42 PM, Sean Owen sro...@gmail.com wrote:
 ALS-WR weights the error on each term differently, so the average
 error doesn't really have meaning here, even if you are comparing the
 difference with 1. I think you will need to fall back to mean
 average precision or something.

 On Mon, May 6, 2013 at 11:24 AM, William icswilliam2...@gmail.com wrote:
 Sean Owen srowen at gmail.com writes:


 If you have no ratings, how are you using RMSE? this typically
 measures error in reconstructing ratings.
 I think you are probably measuring something meaningless.



  I suppose the rating of seen movies is 1. Is that right?
 If I use Collaborative Filtering with ALS-WR to get some 
 recommendations, I
 must have a real rating-matrix?





Re: User Based recommender - strange behaviour of Pearson

2013-04-09 Thread Tevfik Aytekin
You are correct, since centeredSumX2 equals zero, the Pearson
similarity will be undefined (because of division by zero in the
Pearson formula).

If you do not center the data that will be cosine similarity which is
another common similarity metric used in recommender systems and it
will not be undefined when a user has the same ratings for all items.
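To illustrate, here is a minimal self-contained sketch (not the Mahout implementation) of why Pearson breaks down for a constant rating vector while uncentered cosine does not:

```java
public class ConstantRatingsDemo {
    // Pearson correlation; the centered sum of squares of a constant
    // vector is 0, so the formula divides 0 by 0 and yields NaN.
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double mx = 0, my = 0;
        for (int i = 0; i < n; i++) { mx += x[i]; my += y[i]; }
        mx /= n; my /= n;
        double num = 0, dx2 = 0, dy2 = 0;
        for (int i = 0; i < n; i++) {
            num += (x[i] - mx) * (y[i] - my);
            dx2 += (x[i] - mx) * (x[i] - mx);   // centeredSumX2
            dy2 += (y[i] - my) * (y[i] - my);
        }
        return num / Math.sqrt(dx2 * dy2);      // 0/0 -> NaN for constant x
    }

    // Uncentered cosine: defined as long as neither norm is zero.
    public static double cosine(double[] x, double[] y) {
        double dot = 0, nx = 0, ny = 0;
        for (int i = 0; i < x.length; i++) {
            dot += x[i] * y[i]; nx += x[i] * x[i]; ny += y[i] * y[i];
        }
        return dot / (Math.sqrt(nx) * Math.sqrt(ny));
    }

    public static void main(String[] args) {
        double[] constant = {4, 4, 4};   // user rated everything 4
        double[] other    = {1, 3, 5};
        System.out.println(pearson(constant, other)); // NaN: zero variance
        System.out.println(cosine(constant, other));  // well-defined
    }
}
```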

On Tue, Apr 9, 2013 at 6:19 PM, yamo93 yam...@gmail.com wrote:
 Hi,

 I use a user based recommender.
 I've just discovered a strange behaviour of Pearson when a user has the same
 ratings for all rated items. The system doesn't recommend anything in this
 case for this user.

 I'll try an explanation: it is due to the centered data (centeredSumX2 equals 0 in
 this case). Is that correct?

 Is using UncenteredCosine as a workaround a good idea?

 Thanks,
 Yann.


Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
I think it is better to choose the ratings of the test user in a random fashion.

On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote:
 Yes. But: the test sample is small. Using 40% of your data to test is
 probably quite too much.

 My point is that it may be the least-bad thing to do. What test are you
 proposing instead, and why is it coherent with what you're testing?




  On Sat, Feb 16, 2013 at 8:26 PM, Ahmet Yılmaz 
 ahmetyilmazefe...@yahoo.comwrote:

 But modeling a user only by his/her low ratings can be problematic since
 people generally are more precise (I believe) in their high ratings.
 Another problem is that recommender algorithms in general first mean
 normalize the ratings for each user. Suppose that we have the following
 ratings of 3 people (A, B, and C) on 5 items.

 A's ratings: 1 2 3 4 5
 B's ratings: 1 3 5 2 4
 C's ratings: 1 2 3 4 5


 Suppose that A is the test user. Now if we put only the low ratings of A
 (1, 2, and 3) into the training set and mean normalize the ratings then A
 will be
 more similar to B than C, which is not true.




 
  From: Sean Owen sro...@gmail.com
 To: Mahout User List user@mahout.apache.org; Ahmet Yılmaz 
 ahmetyilmazefe...@yahoo.com
 Sent: Saturday, February 16, 2013 8:41 PM
 Subject: Re: Problems with Mahout's RecommenderIRStatsEvaluator

 No, this is not a problem.

 Yes it builds a model for each user, which takes a long time. It's
 accurate, but time-consuming. It's meant for small data. You could rewrite
 your own test to hold out data for all test users at once. That's what I
 did when I rewrote a lot of this just because it was more useful to have
 larger tests.

 There are several ways to choose the test data. One common way is by time,
 but there is no time information here by default. The problem is that, for
 example, recent ratings may be low -- or at least not high ratings. But the
 evaluation is of course asking the recommender for items that are predicted
 to be highly rated. Random selection has the same problem. Choosing by
 rating at least makes the test coherent.

 It does bias the training set, but, the test set is supposed to be small.

 There is no way to actually know, a priori, what the top recommendations
 are. You have no information to evaluate most recommendations. This makes a
 precision/recall test fairly uninformative in practice. Still, it's better
 than nothing and commonly understood.

 While precision/recall won't be high on tests like this, because of this, I
 don't get these values for movielens data on any normal algo, but, you may
 be, if choosing an algorithm or parameters that don't work well.




 On Sat, Feb 16, 2013 at 7:30 PM, Ahmet Yılmaz ahmetyilmazefe...@yahoo.com
 wrote:

  Hi,
 
  I have looked at the internals of Mahout's RecommenderIRStatsEvaluator
  code. I think that there are two important problems here.
 
  According to my understanding the experimental protocol used in this code
  is something like this:
 
  It takes away a certain percentage of users as test users.
  For
   each test user it builds a training set consisting of ratings given by
  all other users + the ratings of the test user which are below the
  relevanceThreshold.
  It then builds a model and makes a
  recommendation to the test user and finds the intersection between this
  recommendation list and the items which are rated above the
  relevanceThreshold by the test user.
  It then calculates the precision and recall in the usual way.
 
  Probems:
  1. (mild) It builds a model for every test user which can take a lot of
  time.
 
  2. (severe) Only the ratings (of the test user) which are below the
  relevanceThreshold are put into the training set. This means that the
  algorithm only knows the preferences of the test user about the items
  which s/he doesn't like. This is not a good representation of the user's
  ratings.
 
  Moreover, when I ran this evaluator on the MovieLens 1M data, the precision
  and recall turned out to be, respectively,
 
  0.011534185658699288
  0.007905982905982885
 
  and the run took about 13 minutes on my Intel Core i3. (I used user-based
  recommendation with k=2.)
 
 
  Although I know that it is not OK to judge the performance of a
  recommendation algorithm by looking at these absolute precision and recall
  values, these numbers still seem too low to me, which might be the result
  of the second problem I mentioned above.
 
  Am I missing something?
 
  Thanks
  Ahmet
 



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
No, rating prediction is clearly a supervised ML problem

On Sat, Feb 16, 2013 at 10:15 PM, Sean Owen sro...@gmail.com wrote:
 This is a good answer for evaluation of supervised ML, but this is
 unsupervised. Choosing randomly is choosing the 'right answers' randomly,
 and that's plainly problematic.


 On Sat, Feb 16, 2013 at 8:53 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.comwrote:

 I think, it is better to choose ratings of the test user in a random
 fashion.

 On Sat, Feb 16, 2013 at 9:37 PM, Sean Owen sro...@gmail.com wrote:
  Yes. But: the test sample is small. Using 40% of your data to test is
  probably far too much.
 
  My point is that it may be the least-bad thing to do. What test are you
  proposing instead, and why is it coherent with what you're testing?
 



Re: Problems with Mahout's RecommenderIRStatsEvaluator

2013-02-16 Thread Tevfik Aytekin
I'm suggesting the second one. In that way the test user's ratings in
the training set will be composed of both low- and high-rated items, which
prevents the problem pointed out by Ahmet.
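
[Editor's note: a minimal sketch of this random-holdout alternative, with
illustrative names and data rather than Mahout API. A random fraction of the
test user's ratings is held out; held-out ratings at or above the threshold
become the relevant set, held-out low ratings are simply discarded, and the
rest -- a mix of the user's low AND high ratings -- stay in training.]

```java
import java.util.*;

public class RandomHoldoutSketch {

    /**
     * Randomly holds out a fraction of one user's ratings. Held-out ratings
     * >= threshold form the returned relevant set; held-out low ratings are
     * dropped; all remaining ratings are copied into trainingOut.
     */
    static Set<String> split(Map<String, Double> ratings, double holdoutFraction,
                             double threshold, long seed,
                             Map<String, Double> trainingOut) {
        List<String> items = new ArrayList<>(ratings.keySet());
        Collections.shuffle(items, new Random(seed)); // reproducible random order
        int holdOut = (int) Math.round(items.size() * holdoutFraction);
        Set<String> relevant = new HashSet<>();
        for (int i = 0; i < items.size(); i++) {
            String item = items.get(i);
            double r = ratings.get(item);
            if (i < holdOut && r >= threshold) {
                relevant.add(item);        // held out and relevant
            } else if (i >= holdOut) {
                trainingOut.put(item, r);  // stays in training, low or high
            }
            // held out but below threshold: discarded from both sets
        }
        return relevant;
    }

    public static void main(String[] args) {
        Map<String, Double> ratings = new LinkedHashMap<>();
        ratings.put("A", 5.0); ratings.put("B", 2.0);
        ratings.put("C", 4.0); ratings.put("D", 1.0);

        Map<String, Double> training = new LinkedHashMap<>();
        Set<String> relevant = split(ratings, 0.5, 4.0, 42L, training);
        System.out.println("relevant=" + relevant + " training=" + training);
    }
}
```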

On Sat, Feb 16, 2013 at 11:19 PM, Sean Owen sro...@gmail.com wrote:
 If you're suggesting that you hold out only high-rated items, and then
 sample them, then that's what is done already in the code, except without
 the sampling. The sampling doesn't buy anything that I can see.

 If you're suggesting holding out a random subset and then throwing away the
 held-out items with low rating, then it's also the same idea, except you're
 randomly throwing away some lower-rated data from both test and train. I
 don't see how that helps either.


 On Sat, Feb 16, 2013 at 9:41 PM, Tevfik Aytekin 
 tevfik.ayte...@gmail.comwrote:

 What I mean is that you can choose ratings randomly and try to recommend
 the ones above the threshold.