Re: Sparse data & Item Similarity

Matthew Runo Wed, 16 Feb 2011 15:35:52 -0800

So I've only processed a tiny fraction of my data with the
LogLikelihoodSimilarity but already the output looks a lot better.


Do you think there's any benefit to storing things with small
similarities? For example, would it make sense to just filter out
things that are - say - less than 0.5? I would probably not recommend
items that are so dissimilar.
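
A minimal Python sketch of that filtering step, for illustration only. The
function name, the triple layout, and the sample scores are hypothetical and
not part of Mahout; the 0.5 cutoff is just the example value mentioned above
and would need tuning against real output.

```python
def filter_similarities(pairs, threshold=0.5):
    """Keep only (item_a, item_b, similarity) triples at or above the threshold."""
    return [(a, b, s) for (a, b, s) in pairs if s >= threshold]


# Hypothetical similarity output; the scores are made up for illustration.
pairs = [(1244, 2319, 0.93), (1244, 5001, 0.12), (2319, 5001, 0.50)]
kept = filter_similarities(pairs)
```

Dropping low-scoring pairs before storing them shrinks the similarity table,
at the cost of having no fallback candidates for items whose best matches all
fall below the cutoff.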

-Matthew Runo

On Wed, Feb 16, 2011 at 2:39 PM, Matthew Runo <[email protected]> wrote:
> Thank you for that suggestion. I have a few different actions that
> users can take ("view", "add to cart", and "buy"), to which I've assigned
> different preference values. Perhaps it would be better to simply
> use boolean yes/no in my case?
>
> I'll give the log likelihood stuff a try tonight and I'll report back
> in case anyone else runs into this issue.
>
> -Matthew Runo
>
> On Wed, Feb 16, 2011 at 2:31 PM, Chris Schilling <[email protected]> wrote:
>> Matthew,
>>
>> I was running into a similar issue with my data.  I discussed it with Sean 
>> Owen offline and his advice was, in a nutshell, to use the log-likelihood 
>> similarity metric.  Since you describe your users as having only links, I 
>> assume you are not dealing with preference data.  So, with the boolean data, 
>> the log-likelihood metric works very well (in my case, I am also
>> dealing with very sparse data). How do your results look if you try the
>> log-likelihood approach?
>>
>> Hope this helps,
>> Chris
>>
>>
>> On Feb 16, 2011, at 2:24 PM, Matthew Runo wrote:
>>
>>> Hello folks -
>>>
>>> (I think that) I'm running into an issue with my user data being too
>>> sparse with my item-item similarity calculations. A typical item_id in
>>> my data might have about 2000 links to other items, but very few
>>> "combinations" of users have viewed the same products.
>>>
>>> For example we have two items, 1244 and 2319 - and there are only
>>> three users in common between them.
>>>
>>> So, there are only those three users who viewed both items. I'm
>>> assigning preferences to different types of actions in my data, and
>>> since all three users performed the same action on the items, they have
>>> the same preference value. Maybe I just need to start with a bigger
>>> set of data to get more links between items in different "actions" in
>>> order to spread out the generated similarities? I'm using the
>>> EuclideanDistanceSimilarity to do the final computation.
>>>
>>> I think this is leading to a huge number of "1" values being returned.
>>> Nearly 72% of my item-item similarities are 1.0. I feel that this is
>>> invalid, but I'm not quite sure of the best way to attack it.
>>>
>>> There are some similarities of 1 where the items do not appear to be
>>> similar at all. The best explanation I've been able to come up with
>>> is that only one user had a link between them, so that one user
>>> alone determined the similarity.
>>>
>>> How many item-user-item combinations per item pair does it take to get
>>> good output?
>>>
>>> Sorry if I'm not quite describing my problem in the proper terms..
>>>
>>> --Matthew Runo
>>
>>
>
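
The effect described in this thread can be sketched in a few lines of Python.
This is not Mahout code: the Euclidean function assumes the common 1/(1+d)
form, and the log-likelihood function assumes Dunning's ratio on a 2x2
co-occurrence table mapped to [0, 1) via 1 - 1/(1 + LLR). Mahout's
EuclideanDistanceSimilarity and LogLikelihoodSimilarity may differ in detail.
All counts and user IDs below are made up for illustration.

```python
import math


def euclidean_similarity(prefs_a, prefs_b):
    """Similarity 1/(1+d) over the users who rated both items (assumed form)."""
    common = prefs_a.keys() & prefs_b.keys()
    if not common:
        return 0.0
    d = math.sqrt(sum((prefs_a[u] - prefs_b[u]) ** 2 for u in common))
    return 1.0 / (1.0 + d)


def x_log_x(x):
    return 0.0 if x == 0 else x * math.log(x)


def entropy(*counts):
    """Unnormalized entropy: xlogx(total) - sum of xlogx(count)."""
    return x_log_x(sum(counts)) - sum(x_log_x(c) for c in counts)


def log_likelihood_ratio(k11, k12, k21, k22):
    """k11: users with both items, k12: only A, k21: only B, k22: neither."""
    row = entropy(k11 + k12, k21 + k22)
    col = entropy(k11 + k21, k12 + k22)
    mat = entropy(k11, k12, k21, k22)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * (row + col - mat))


def llr_similarity(k11, k12, k21, k22):
    return 1.0 - 1.0 / (1.0 + log_likelihood_ratio(k11, k12, k21, k22))


# Three shared users with identical preference values: distance 0, similarity 1.
item_a = {"u1": 1.0, "u2": 1.0, "u3": 1.0, "u4": 1.0}
item_b = {"u1": 1.0, "u2": 1.0, "u3": 1.0, "u9": 1.0}
euclid = euclidean_similarity(item_a, item_b)

# Hypothetical counts: 3 shared users among 2000 viewers each is about what
# chance alone would produce in 1,000,000 users, so the score stays low.
weak = llr_similarity(3, 1997, 1997, 996003)
# 1500 shared users is far more than chance would produce, so the score is high.
strong = llr_similarity(1500, 500, 500, 997500)
```

The Euclidean score only looks at the users two items share, so a handful of
identical preference values gives a perfect score. The log-likelihood score
also accounts for how many users saw only one item or neither, which is why
it separates chance overlap from real association on sparse boolean data.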
