For what it's worth, the issue of the recommender recommending items that had already been "preferred" by a user seems to have gone away. I realise the platform has been rebooted a few times since I asked about it, but to the best of my knowledge nothing else has changed. I would feel silly if this was an error on my side, but I cannot find any other explanation. As long as we set the --maxPrefsPerUser parameter high enough, there are no more "duplicates". My apologies for muddying the waters earlier on.
Rafal

On 7 Aug 2013, at 17:19, Sebastian Schelter <[email protected]> wrote:

if you also set --maxPrefsPerUserInItemSimilarity to a number higher than
the max preferences per user, no sampling should occur. This might slow
down the job however.

2013/8/7 Rafal Lukawiecki <[email protected]>

> Is there a set of parameters which I could pass to RecommenderJob to avoid
> that random sampling, in order to create a test case for the issue I have
> experienced? Would setting --maxSimilaritiesPerItem and/or
> --maxPrefsPerUserInItemSimilarity help? Many thanks.
>
> On 7 Aug 2013, at 16:12, Sebastian Schelter <[email protected]>
> wrote:
>
> It could affect the results even in this case, as we also sample the
> preferences when computing similar items.
>
> On 07.08.2013 17:07, Rafal Lukawiecki wrote:
>> Thank you, Sebastian. Would the random sampling affect the results of
>> RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
>> actual, maximum number of preferences expressed by every user.
>>
>> Rafal
>>
>> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]>
>> wrote:
>>
>> The code in trunk allows you to specify a randomSeed, the older
>> versions don't unfortunately.
>>
>> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>>> Hi Sebastian,
>>>
>>> The quantity of returned "duplicates" is much too large to be caused
>>> just by sampling's randomness. I wonder if this could be related to
>>> something that is platform-specific, as in Windows vs. *nix representation
>>> of input files, data types etc.
>>>
>>> For argument's sake, is it possible to fix the seed of the random
>>> aspect of the sampling so I could feed the same input through two platforms
>>> and compare the results?
>>>
>>> Rafal
>>>
>>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]>
>>> wrote:
>>>
>>> Hi Rafal,
>>>
>>> this sounds really strange, the bug should not have anything to do with
>>> the version of Hadoop that you are running. You could sometimes not see
>>> it due to the random sampling of the preferences.
>>>
>>> --sebastian
>>>
>>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>>> Sebastian,
>>>>
>>>> I've been doing a little more digging regarding the issue of
>>>> preferences being calculated for already preferred items. I re-ran the jobs
>>>> using the same data and the same parameters on a different installation of
>>>> Hadoop, and the problem seems to have gone away. For now it looks like the
>>>> issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks
>>>> Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem has not
>>>> shown up yet in my tests under Hadoop 1.2.1 compiled for OS X. I will work
>>>> a little more to confirm my results, but if they stand up, should I still
>>>> report it as a Mahout issue?
>>>>
>>>> Rafal
>>>> --
>>>> Rafal Lukawiecki
>>>> Strategic Consultant and Director
>>>> Project Botticelli Ltd
>>>>
>>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>>
>>>> Setting it to the maximum number should be enough. Would be great if you
>>>> can share your dataset and tests.
>>>>
>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>
>>>>> Should I have set that parameter to a value much much larger than the
>>>>> maximum number of actually expressed preferences by a user?
>>>>>
>>>>> I'm working on an anonymised data set. If it works as an error test case,
>>>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>>>> error, not Mahout's.
>>>>>
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Pardon brevity, mobile device.
>>>>>
>>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>>
>>>>>> Ok, please file a bug report detailing what you've tested and what
>>>>>> results you got.
>>>>>>
>>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>>>> help? That surprises me.
>>>>>>
>>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>>
>>>>>>> Hi Sebastian,
>>>>>>>
>>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone
>>>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8 I have
>>>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user has
>>>>>>> more than 5000 prefs). I have also supplied the prefs file, without the
>>>>>>> preference value, that is as: user,item (one per line) as a --filterFile,
>>>>>>> with and without --maxPrefsPerUser, and I am afraid we are still seeing
>>>>>>> recommendations for items the user has expressed a prior preference for.
>>>>>>>
>>>>>>> I suppose I need to file a bug report.
>>>>>>>
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>
>>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Dear Sebastian,
>>>>>>>>
>>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in
>>>>>>>> our case. It seems that the most preferences a user had was just about 5000,
>>>>>>>> so I doubled it just in case, but when I operationalise this model, I will
>>>>>>>> make sure to calculate the actual max number of preferences and set the
>>>>>>>> parameter accordingly. I will double-check the resultset to make sure the
>>>>>>>> issue is really gone, as I have only checked the few cases where we have
>>>>>>>> spotted a recommendation of a previously preferred item.
>>>>>>>>
>>>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8
>>>>>>>> or another version? I am using 0.7.
>>>>>>>>
>>>>>>>> Thanks for your kind support.
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Strategic Consultant and Director
>>>>>>>> Project Botticelli Ltd
>>>>>>>>
>>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Rafal,
>>>>>>>>
>>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of
>>>>>>>> interactions per user and see if you still get the error?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>>> Thank you Sebastian. The data set is not that large, as we are running
>>>>>>>>> tests on a subset. It is about 24k users, 40k items, the preference file
>>>>>>>>> has 65k preferences as triples. This was using Similarity Cooccurrence.
>>>>>>>>>
>>>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>>>> helpful.
>>>>>>>>>
>>>>>>>>> Thanks for your kind help.
>>>>>>>>>
>>>>>>>>> Rafal
>>>>>>>>> --
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>>
>>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Rafal,
>>>>>>>>>>
>>>>>>>>>> can you issue a ticket for this problem at
>>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
>>>>>>>>>> check whether this happens and currently they work fine. I can only imagine
>>>>>>>>>> that the problem occurs in larger datasets where we sample the data in some
>>>>>>>>>> places. Can you describe a scenario/dataset where this happens?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Sebastian
>>>>>>>>>>
>>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>>
>>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on an
>>>>>>>>>>> amazing piece of software, thank you for building Mahout and for your
>>>>>>>>>>> support. My apologies if this is not the right place to ask the question. I
>>>>>>>>>>> have searched for the issue, and I can see this problem has been reported
>>>>>>>>>>> here:
>>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>>
>>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not found a
>>>>>>>>>>> way, yet, to get an answer from them, without asking you.
>>>>>>>>>>>
>>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I
>>>>>>>>>>> am finding that it is recommending items that the user has already
>>>>>>>>>>> expressed a preference for in their input file. I understand that this
>>>>>>>>>>> should not be happening, and I am not sure if there is a known fix or if I
>>>>>>>>>>> should be looking for a workaround (such as using the entire input as the
>>>>>>>>>>> filterFile).
>>>>>>>>>>>
>>>>>>>>>>> I will double-check that there is no error on my side, but so far it does
>>>>>>>>>>> not seem that way.
>>>>>>>>>>>
>>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>>> Project Botticelli Ltd
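
A minimal sketch of the two checks discussed in this thread: finding the real maximum number of preferences per user (the value to compare --maxPrefsPerUser and --maxPrefsPerUserInItemSimilarity against), and listing recommendations for items a user has already preferred. It is an illustration under assumptions, not Mahout code. The helper names are hypothetical, the input is assumed to be user,item[,value] rows as described above, and the recommendation output is assumed to be one line per user of the form user<TAB>[item:score,item:score,...]; adjust the parsing if your output differs.

```python
import csv
import io
from collections import defaultdict


def load_prefs(lines):
    """Read 'user,item[,value]' rows into {user: set(items)}."""
    prefs = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) >= 2:  # skip blank or malformed rows
            prefs[row[0].strip()].add(row[1].strip())
    return prefs


def max_prefs_per_user(prefs):
    """Largest number of distinct items any single user has preferred."""
    return max((len(items) for items in prefs.values()), default=0)


def parse_recommendations(lines):
    """Parse lines assumed to look like 'user<TAB>[item:score,item:score]'."""
    recs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, _, payload = line.partition("\t")
        payload = payload.strip().strip("[]")
        # Keep only the item id from each 'item:score' pair.
        recs[user] = [p.split(":", 1)[0] for p in payload.split(",") if p]
    return recs


def already_preferred(prefs, recs):
    """Return {user: [items]} for recommended items the user already preferred."""
    dupes = {}
    for user, items in recs.items():
        seen = prefs.get(user, set())
        hits = [item for item in items if item in seen]
        if hits:
            dupes[user] = hits
    return dupes


if __name__ == "__main__":
    # Tiny made-up sample, for illustration only.
    sample_prefs = io.StringIO("1,101,5.0\n1,102,3.0\n2,101,2.0\n")
    sample_recs = ["1\t[103:4.5,102:4.0]", "2\t[102:3.9]"]

    prefs = load_prefs(sample_prefs)
    recs = parse_recommendations(sample_recs)
    print("max prefs per user:", max_prefs_per_user(prefs))
    print("already preferred:", already_preferred(prefs, recs))
```

Running the same check on the output of two platforms, with the same input and parameters, would give a comparable count of "duplicates" for a bug report.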
