Please don't send patches to the mailing list. Here's a guide that explains how to contribute to Mahout; it involves filing a JIRA ticket:
https://cwiki.apache.org/MAHOUT/how-to-contribute.html

Your patch uses Java 7, while Mahout is based on Java 6. Furthermore, you
can't simply throw UnsupportedOperationException from most methods.

--sebastian

On 02.10.2012 21:01, yamo93 wrote:
> You'll find attached a class that implements cosine distance as in
> Hadoop. I've only implemented the core method, itemSimilarity().
>
> On 10/02/2012 02:59 PM, yamo93 wrote:
>> Ok, I'll try this evening.
>>
>> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>>> Would you like to create a patch for this?
>>>
>>> On 02.10.2012 14:36, yamo93 wrote:
>>>> +1 for the implementation over all entries.
>>>>
>>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>>> I don't see why documents with only one word in common should have
>>>>> a similarity of 1.0 in RowSimilarityJob. consider() is only invoked
>>>>> if you specify a threshold for the similarity.
>>>>>
>>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>>> problematic for documents, as empty entries have a meaning (0 term
>>>>> occurrences), as opposed to collaborative filtering data.
>>>>>
>>>>> Maybe we should remove UncenteredCosine and create another
>>>>> similarity implementation that computes the cosine correctly over
>>>>> all entries.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>>> Hello Seb,
>>>>>>
>>>>>> As I understand it, the algorithm is the same (except for the
>>>>>> normalization part) as UncenteredCosine (with the drawback that
>>>>>> vectors with only one word in common have a similarity of 1.0),
>>>>>> but the results are quite different. Is this just an effect of the
>>>>>> consider() method, which removes irrelevant values?
>>>>>>
>>>>>> I looked at the code, but there is almost nothing in
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>>> the code seems to be in SimilarityReducer, which is not so simple
>>>>>> to understand.
>>>>>>
>>>>>> Thanks for helping,
>>>>>>
>>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>>> The cosine similarity as computed by RowSimilarityJob is the
>>>>>>> cosine similarity between the whole vectors. See
>>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>>> for details.
>>>>>>>
>>>>>>> First, both vectors are scaled to unit length in normalize();
>>>>>>> after this, their dot product in similarity() (which can be
>>>>>>> computed from elements that exist in both vectors) gives the
>>>>>>> cosine between them.
>>>>>>>
>>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>>> I think it's better to understand how RowSimilarityJob gets its
>>>>>>>> result. For two items
>>>>>>>>
>>>>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>>>>>
>>>>>>>> the computation uses only the overlapping non-zero entries
>>>>>>>> (a1, a2 from itemA and b2, b3 from itemB). The cosine similarity
>>>>>>>> is thus
>>>>>>>>
>>>>>>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>>>>>>>>
>>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>>> 2) if the values of the vectors are almost the same, the value
>>>>>>>> would also be nearly 1.
>>>>>>>> For the two cases above, you could consider using association
>>>>>>>> rules to address the problem.
>>>>>>>>
>>>>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>>>>>>> but I also use CosineSimilarity. Why?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor
>>>>>>>>>> of this particular similarity metric. The sparser the data,
>>>>>>>>>> the worse the problem, in general. There are some band-aid
>>>>>>>>>> solutions, like applying some kind of weight against
>>>>>>>>>> similarities based on small intersection size. Or you can
>>>>>>>>>> pretend that missing values are 0 (PreferenceInferrer), which
>>>>>>>>>> can introduce its own problems, or perhaps some mean value.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>> Thanks for replying.
>>>>>>>>>>>
>>>>>>>>>>> So, documents with only one word in common have a better
>>>>>>>>>>> chance of being similar than documents with more words in
>>>>>>>>>>> common, right?
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>>> Similar items, right? You should look at the vectors that
>>>>>>>>>>>> have 1.0 similarity and see if they are in fact collinear.
>>>>>>>>>>>> This is still by far the most likely explanation. Remember
>>>>>>>>>>>> that the vector similarity is computed over elements that
>>>>>>>>>>>> exist in both vectors only. They just have to have 2
>>>>>>>>>>>> identical values for this to happen.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>> For each item, I have 10 recommended items with a value of
>>>>>>>>>>>>> 1.0. It sounds like a bug somewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum
>>>>>>>>>>>>>> similarity, and it occurs when two vectors are just scalar
>>>>>>>>>>>>>> multiples of each other (0 angle between them). It's
>>>>>>>>>>>>>> possible there are several of these, and so their 1.0
>>>>>>>>>>>>>> similarities dominate the result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>>>> I saw something strange: all recommended items, returned
>>>>>>>>>>>>>>> by mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>>>>> Is this normal?
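
To make the difference discussed in this thread concrete, here is a
minimal, self-contained Java sketch (plain Java, not Mahout's actual
classes; the class name, method names, and document vectors are made up
for illustration). It contrasts the cosine computed over matching
entries only, as UncenteredCosineSimilarity does, with the cosine
computed over all entries:

/** Illustrative sketch only -- not Mahout code. */
public class CosineSketch {

  /** Cosine over all entries; absent terms count as 0 occurrences. */
  static double cosineAllEntries(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Cosine over entries present in both vectors only (the behaviour
      of UncenteredCosineSimilarity criticized in this thread).
      Note: yields NaN if no terms are shared; real code would guard. */
  static double cosineMatchingEntries(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) { // only co-occurring terms
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // Two documents sharing exactly one term (index 2); weights made up.
    double[] docA = {5, 0, 2, 0, 0};
    double[] docB = {0, 7, 9, 0, 0};
    // Prints 1.0 regardless of the actual weights:
    System.out.println(cosineMatchingEntries(docA, docB));
    // Prints ~0.29, reflecting how little the documents overlap:
    System.out.println(cosineAllEntries(docA, docB));
  }
}

With a single shared term, the matching-entries variant returns 1.0 no
matter what the weights are, which is exactly the "one word in common"
pathology discussed above. One plausible form of the intersection-size
weighting Sean mentions (an illustrative choice, not something Mahout
prescribes) would be to multiply the matching-entries result by
min(1, overlap / k) for some minimum overlap k.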

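Sebastian's two-phase description of RowSimilarityJob's cosine (scale
each whole vector to unit length first, then compute pairwise dot
products over co-occurring entries) can be sketched as follows. The
class name and the map-based sparse representation are assumptions for
illustration; the actual code lives in CosineSimilarity and
SimilarityReducer:

import java.util.HashMap;
import java.util.Map;

/** Sketch of normalize-then-dot-product; illustrative only. */
public class TwoPhaseCosine {

  /** Phase 1 (per row): scale a sparse vector (termId -> weight) to
      unit length. The norm is taken over ALL entries, which is why the
      later dot product still yields the true whole-vector cosine. */
  static Map<Integer, Double> normalize(Map<Integer, Double> v) {
    double norm = 0;
    for (double x : v.values()) {
      norm += x * x;
    }
    norm = Math.sqrt(norm); // assumes a non-empty vector
    Map<Integer, Double> unit = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> e : v.entrySet()) {
      unit.put(e.getKey(), e.getValue() / norm);
    }
    return unit;
  }

  /** Phase 2 (per pair): the cosine of two unit vectors is their dot
      product, computable from entries present in both vectors alone,
      since every other entry contributes a zero term. */
  static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    return dot;
  }
}

The point of the split is that the pairwise step only ever touches
co-occurring entries, yet the result equals the cosine over the whole
vectors, because the full norms were already folded in during
normalization.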