Please don't send patches to the mailing list. Here's a guide that explains how to contribute to Mahout; it involves filing a JIRA ticket:
https://cwiki.apache.org/MAHOUT/how-to-contribute.html

Your patch uses Java 7, while Mahout is based on Java 6. Furthermore, you
can't simply throw UnsupportedOperationException from most methods.

--sebastian

On 02.10.2012 21:01, yamo93 wrote:
> You'll find attached a class that implements cosine distance as in
> Hadoop. I've only implemented the core method, itemSimilarity().
>
> On 10/02/2012 02:59 PM, yamo93 wrote:
>> Ok, I'll try this evening.
>>
>> On 10/02/2012 02:39 PM, Sebastian Schelter wrote:
>>> Would you like to create a patch for this?
>>>
>>> On 02.10.2012 14:36, yamo93 wrote:
>>>> +1 for the implementation over all entries.
>>>>
>>>> On 10/02/2012 11:50 AM, Sebastian Schelter wrote:
>>>>> I don't see why documents with only one word in common should have
>>>>> a similarity of 1.0 in RowSimilarityJob. consider() is only invoked
>>>>> if you specify a threshold for the similarity.
>>>>>
>>>>> UncenteredCosineSimilarity works on matching entries only, which is
>>>>> problematic for documents, as empty entries have a meaning (0 term
>>>>> occurrences), as opposed to collaborative filtering data.
>>>>>
>>>>> Maybe we should remove UncenteredCosine and create another
>>>>> similarity implementation that computes the cosine correctly over
>>>>> all entries.
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 02.10.2012 10:08, yamo93 wrote:
>>>>>> Hello Seb,
>>>>>>
>>>>>> As I understand it, the algorithm is the same (except for the
>>>>>> normalization part) as UncenteredCosine (with the drawback that
>>>>>> vectors with only one word in common have a similarity of 1.0),
>>>>>> but the results are quite different. Is this just an effect of the
>>>>>> consider() method, which removes irrelevant values?
>>>>>>
>>>>>> I looked at the code, but there is almost nothing in
>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity;
>>>>>> the code seems to be in SimilarityReducer, which is not so simple
>>>>>> to understand.
>>>>>>
>>>>>> Thanks for helping,
>>>>>>
>>>>>> On 10/01/2012 10:25 PM, Sebastian Schelter wrote:
>>>>>>> The cosine similarity as computed by RowSimilarityJob is the
>>>>>>> cosine similarity between the whole vectors. See
>>>>>>> org.apache.mahout.math.hadoop.similarity.cooccurrence.measures.CosineSimilarity
>>>>>>> for details.
>>>>>>>
>>>>>>> First, both vectors are scaled to unit length in normalize();
>>>>>>> after this, their dot product in similarity() (which can be
>>>>>>> computed from elements that exist in both vectors) gives the
>>>>>>> cosine between them.
>>>>>>>
>>>>>>> On 01.10.2012 21:52, bangbig wrote:
>>>>>>>> I think it's better to understand how RowSimilarityJob gets its
>>>>>>>> result. For two items
>>>>>>>>
>>>>>>>> itemA: 0, 0, a1, a2, a3, 0
>>>>>>>> itemB: 0, b1, b2, b3, 0, 0
>>>>>>>>
>>>>>>>> the computation uses only the overlapping non-zero entries
>>>>>>>> (a1, a2 from itemA and b2, b3 from itemB). The cosine similarity
>>>>>>>> is thus
>>>>>>>>
>>>>>>>> (a1*b2 + a2*b3) / (sqrt(a1*a1 + a2*a2) * sqrt(b2*b2 + b3*b3))
>>>>>>>>
>>>>>>>> 1) if itemA and itemB have just one common word, the result is 1;
>>>>>>>> 2) if the values of the vectors are almost the same, the value
>>>>>>>> would also be nearly 1.
>>>>>>>> For the two cases above, you could consider using association
>>>>>>>> rules to address the problem.
>>>>>>>>
>>>>>>>> At 2012-10-01 20:53:16, yamo93 <[email protected]> wrote:
>>>>>>>>> It seems that RowSimilarityJob does not have the same weakness,
>>>>>>>>> but I also use CosineSimilarity. Why?
>>>>>>>>>
>>>>>>>>> On 10/01/2012 12:37 PM, Sean Owen wrote:
>>>>>>>>>> Yes, this is one of the weaknesses of this particular flavor
>>>>>>>>>> of this particular similarity metric. The sparser the data,
>>>>>>>>>> the worse the problem, in general. There are some band-aid
>>>>>>>>>> solutions, like applying some kind of weight against
>>>>>>>>>> similarities based on small intersection size. Or you can
>>>>>>>>>> pretend that missing values are 0 (PreferenceInferrer), which
>>>>>>>>>> can introduce its own problems, or perhaps some mean value.
>>>>>>>>>>
>>>>>>>>>> On Mon, Oct 1, 2012 at 11:32 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>> Thanks for replying.
>>>>>>>>>>>
>>>>>>>>>>> So, documents with only one word in common have a better
>>>>>>>>>>> chance of being similar than documents with more words in
>>>>>>>>>>> common, right?
>>>>>>>>>>>
>>>>>>>>>>> On 10/01/2012 11:28 AM, Sean Owen wrote:
>>>>>>>>>>>> Similar items, right? You should look at the vectors that
>>>>>>>>>>>> have 1.0 similarity and see if they are in fact collinear.
>>>>>>>>>>>> This is still by far the most likely explanation. Remember
>>>>>>>>>>>> that the vector similarity is computed over elements that
>>>>>>>>>>>> exist in both vectors only. They just have to have 2
>>>>>>>>>>>> identical values for this to happen.
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:25 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>> For each item, I have 10 recommended items with a value of
>>>>>>>>>>>>> 1.0. It sounds like a bug somewhere.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 10/01/2012 11:06 AM, Sean Owen wrote:
>>>>>>>>>>>>>> It's possible this is correct. 1.0 is the maximum
>>>>>>>>>>>>>> similarity, and it occurs when two vectors are just scalar
>>>>>>>>>>>>>> multiples of each other (0 angle between them). It's
>>>>>>>>>>>>>> possible there are several of these, and so their 1.0
>>>>>>>>>>>>>> similarities dominate the result.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Oct 1, 2012 at 10:03 AM, yamo93 <[email protected]> wrote:
>>>>>>>>>>>>>>> I saw something strange: all recommended items, returned
>>>>>>>>>>>>>>> by mostSimilarItems(), have a value of 1.0.
>>>>>>>>>>>>>>> Is this normal?
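
To make the difference discussed in this thread concrete, here is a
minimal, self-contained Java sketch (plain Java, not Mahout's actual
classes; the class name, method names, and document vectors are made up
for illustration). It contrasts the cosine computed over matching
entries only, as UncenteredCosineSimilarity does, with the cosine
computed over all entries:

/** Illustrative sketch only -- not Mahout code. */
public class CosineSketch {

  /** Cosine over all entries; absent terms count as 0 occurrences. */
  static double cosineAllEntries(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Cosine over entries present in both vectors only (the behaviour
      of UncenteredCosineSimilarity criticized in this thread).
      Note: yields NaN if no terms are shared; real code would guard. */
  static double cosineMatchingEntries(double[] a, double[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != 0 && b[i] != 0) { // only co-occurring terms
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
      }
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) {
    // Two documents sharing exactly one term (index 2); weights made up.
    double[] docA = {5, 0, 2, 0, 0};
    double[] docB = {0, 7, 9, 0, 0};
    // Prints 1.0 regardless of the actual weights:
    System.out.println(cosineMatchingEntries(docA, docB));
    // Prints ~0.29, reflecting how little the documents overlap:
    System.out.println(cosineAllEntries(docA, docB));
  }
}

With a single shared term, the matching-entries variant returns 1.0 no
matter what the weights are, which is exactly the "one word in common"
pathology discussed above. One plausible form of the intersection-size
weighting Sean mentions (an illustrative choice, not something Mahout
prescribes) would be to multiply the matching-entries result by
min(1, overlap / k) for some minimum overlap k.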

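Sebastian's two-phase description of RowSimilarityJob's cosine (scale
each whole vector to unit length first, then compute pairwise dot
products over co-occurring entries) can be sketched as follows. The
class name and the map-based sparse representation are assumptions for
illustration; the actual code lives in CosineSimilarity and
SimilarityReducer:

import java.util.HashMap;
import java.util.Map;

/** Sketch of normalize-then-dot-product; illustrative only. */
public class TwoPhaseCosine {

  /** Phase 1 (per row): scale a sparse vector (termId -> weight) to
      unit length. The norm is taken over ALL entries, which is why the
      later dot product still yields the true whole-vector cosine. */
  static Map<Integer, Double> normalize(Map<Integer, Double> v) {
    double norm = 0;
    for (double x : v.values()) {
      norm += x * x;
    }
    norm = Math.sqrt(norm); // assumes a non-empty vector
    Map<Integer, Double> unit = new HashMap<Integer, Double>();
    for (Map.Entry<Integer, Double> e : v.entrySet()) {
      unit.put(e.getKey(), e.getValue() / norm);
    }
    return unit;
  }

  /** Phase 2 (per pair): the cosine of two unit vectors is their dot
      product, computable from entries present in both vectors alone,
      since every other entry contributes a zero term. */
  static double similarity(Map<Integer, Double> a, Map<Integer, Double> b) {
    double dot = 0;
    for (Map.Entry<Integer, Double> e : a.entrySet()) {
      Double other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * other;
      }
    }
    return dot;
  }
}

The point of the split is that the pairwise step only ever touches
co-occurring entries, yet the result equals the cosine over the whole
vectors, because the full norms were already folded in during
normalization.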