If you also set --maxPrefsPerUserInItemSimilarity to a number higher than the maximum number of preferences per user, no sampling should occur. This might slow down the job, however.
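
To pick values for those two options, one approach is to count the preferences per user in the input file first. The following is only a rough sketch in Python, not part of Mahout: the helper names `max_prefs_per_user` and `no_sampling_flags`, the `margin` parameter and the sample data are all made up for illustration, and it assumes the input is in the user,item[,preference] form discussed in this thread. It only builds the flag/value pairs; it does not run the job.

```python
import csv
import io
from collections import Counter


def max_prefs_per_user(lines):
    """Return the largest number of preference rows recorded for any single user."""
    counts = Counter()
    for row in csv.reader(lines):
        if row:  # skip blank lines
            counts[row[0].strip()] += 1
    return max(counts.values(), default=0)


def no_sampling_flags(lines, margin=1):
    """Build flag/value pairs that set both caps above the observed maximum.

    The flag names come from this thread; margin is a hypothetical safety
    offset so that the cap is strictly higher than the observed maximum.
    """
    cap = max_prefs_per_user(lines) + margin
    return [
        "--maxPrefsPerUser", str(cap),
        "--maxPrefsPerUserInItemSimilarity", str(cap),
    ]


# Tiny made-up sample in user,item,preference form.
sample = io.StringIO("1,101,5.0\n1,102,3.0\n2,101,4.0\n1,103,1.0\n")
flags = no_sampling_flags(sample)
print(" ".join(flags))
```

The resulting pairs would be appended to whatever RecommenderJob arguments are already in use. Whether this fully removes sampling in a given Mahout version would still need to be checked against the results.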
2013/8/7 Rafal Lukawiecki <[email protected]>
> Is there a set of parameters which I could pass to RecommenderJob to avoid that random sampling, in order to create a test case for the issue I have experienced? Would setting --maxSimilaritiesPerItem and/or --maxPrefsPerUserInItemSimilarity help? Many thanks.
>
> On 7 Aug 2013, at 16:12, Sebastian Schelter <[email protected]> wrote:
>
> It could affect the results even in this case, as we also sample the preferences when computing similar items.
>
> On 07.08.2013 17:07, Rafal Lukawiecki wrote:
> > Thank you, Sebastian. Would the random sampling affect the results of RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the actual, maximum number of preferences expressed by every user.
> >
> > Rafal
> >
> > On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]> wrote:
> >
> > The code in trunk allows you to specify a randomSeed, the older versions don't unfortunately.
> >
> > On 07.08.2013 16:35, Rafal Lukawiecki wrote:
> >> Hi Sebastian,
> >>
> >> The quantity of returned "duplicates" is much too large to be caused just by sampling's randomness. I wonder if this could be related to something that is platform-specific, as in Windows vs. *nix representation of input files, data types etc.
> >>
> >> For argument's sake, is it possible to fix the seed of the random aspect of the sampling so I could feed the same input through two platforms and compare the results?
> >>
> >> Rafal
> >>
> >> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]> wrote:
> >>
> >> Hi Rafal,
> >>
> >> this sounds really strange, the bug should not have anything to do with the version of Hadoop that you are running. You could sometimes not see it due to the random sampling of the preferences.
> >>
> >> --sebastian
> >>
> >> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
> >>> Sebastian,
> >>>
> >>> I've been doing a little more digging regarding the issue of preferences being calculated for already preferred items. I re-ran the jobs using the same data and the same parameters on a different installation of Hadoop, and the problem seems to have gone away. For now it looks like the issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show up, yet in my tests, under Hadoop 1.2.1 compiled for OS X. I will work a little more to ensure my results, but if they stood up, should I still report it as a Mahout issue?
> >>>
> >>> Rafal
> >>> --
> >>> Rafal Lukawiecki
> >>> Strategic Consultant and Director
> >>> Project Botticelli Ltd
> >>>
> >>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
> >>>
> >>> Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests.
> >>>
> >>> 2013/8/1 Rafal Lukawiecki <[email protected]>
> >>>
> >>>> Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user?
> >>>>
> >>>> I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.
> >>>>
> >>>> Rafal
> >>>> --
> >>>> Rafal Lukawiecki
> >>>> Pardon brevity, mobile device.
> >>>>
> >>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
> >>>>
> >>>>> Ok, please file a bug report detailing what you've tested and what results you got.
> >>>>>
> >>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me.
> >>>>>
> >>>>>
> >>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
> >>>>>
> >>>>>> Hi Sebastian,
> >>>>>>
> >>>>>> I've rechecked the results, and, I'm afraid that the issue has not gone away, contrary to my yesterday's enthusiastic response. Using 0.8 I have retested with and without --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as: user,item (one per line) as a --filterFile, with and without the --maxPrefsPerUser, and I am afraid we are also seeing recommendations for items the user has expressed a prior preference for.
> >>>>>>
> >>>>>> I suppose I need to file a bug report.
> >>>>>>
> >>>>>> Rafal
> >>>>>> --
> >>>>>> Rafal Lukawiecki
> >>>>>> Pardon my brevity, sent from a telephone.
> >>>>>>
> >>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
> >>>>>>
> >>>>>>> Dear Sebastian,
> >>>>>>>
> >>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our case; it seems that the most preferences a user had was just about 5000, so I doubled it just-in-case, but when I operationalise this model, I will make sure to calculate the actual max number of preferences and set the parameter accordingly. I will double-check the resultset to make sure the issue is really gone, as I have only checked the few cases where we have spotted a recommendation of a previously preferred item.
> >>>>>>>
> >>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8 or another version? I am using 0.7.
> >>>>>>>
> >>>>>>> Thanks for your kind support.
> >>>>>>> Rafal
> >>>>>>> --
> >>>>>>> Rafal Lukawiecki
> >>>>>>> Strategic Consultant and Director
> >>>>>>> Project Botticelli Ltd
> >>>>>>>
> >>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
> >>>>>>>
> >>>>>>> Hi Rafal,
> >>>>>>>
> >>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of interactions per user and see if you still get the error?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Sebastian
> >>>>>>>
> >>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
> >>>>>>>> Thank you Sebastian. The data set is not that large, as we are running tests on a subset. It is about 24k users, 40k items, the preference file has 65k preferences as triples. This was using Similarity Cooccurrence.
> >>>>>>>>
> >>>>>>>> I can see if I could anonymise the data set to share if that would be helpful.
> >>>>>>>>
> >>>>>>>> Thanks for your kind help.
> >>>>>>>>
> >>>>>>>> Rafal
> >>>>>>>> --
> >>>>>>>> Rafal Lukawiecki
> >>>>>>>> Pardon my brevity, sent from a telephone.
> >>>>>>>>
> >>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>>> Hi Rafal,
> >>>>>>>>>
> >>>>>>>>> can you issue a ticket for this problem at https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that check whether this happens and currently they work fine. I can only imagine that the problem occurs in larger datasets where we sample the data in some places. Can you describe a scenario/dataset where this happens?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Sebastian
> >>>>>>>>>
> >>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
> >>>>>>>>>
> >>>>>>>>>> I'm new here, just registered.
> >>>>>>>>>> Many thanks to everyone for working on an amazing piece of software, thank you for building Mahout and for your support. My apologies if this is not the right place to ask the question; I have searched for the issue, and I can see this problem has been reported here:
> >>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
> >>>>>>>>>>
> >>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not found a way, yet, to get an answer from them, without asking you.
> >>>>>>>>>>
> >>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I am finding that it is recommending items that the user has already expressed a preference for in their input file. I understand that this should not be happening, and I am not sure if there is a known fix or if I should be looking for a workaround (such as using the entire input as the filterFile).
> >>>>>>>>>>
> >>>>>>>>>> I will double-check that there is no error on my side, but so far it does not seem that way.
> >>>>>>>>>>
> >>>>>>>>>> Many thanks and my regards from Ireland,
> >>>>>>>>>> Rafal Lukawiecki
> >>>>>>>>>>
> >>>>>>>>>> --
> >>>>>>>>>>
> >>>>>>>>>> Rafal Lukawiecki
> >>>>>>>>>>
> >>>>>>>>>> Strategic Consultant and Director
> >>>>>>>>>>
> >>>>>>>>>> Project Botticelli Ltd
