The code in trunk allows you to specify a randomSeed; unfortunately, the older versions don't.

On 07.08.2013 16:35, Rafal Lukawiecki wrote:
> Hi Sebastian,
>
> The quantity of returned "duplicates" is much too large to be caused just by
> sampling's randomness. I wonder if this could be related to something that is
> platform-specific, as in Windows vs. *nix representation of input files, data
> types etc.
>
> For argument's sake, is it possible to fix the seed of the random aspect of
> the sampling so I could feed the same input through two platforms and compare
> the results?
>
> Rafal
>
> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]> wrote:
>
> Hi Rafal,
>
> this sounds really strange, the bug should not have anything to do with
> the version of Hadoop that you are running. You could sometimes not see
> it due to the random sampling of the preferences.
>
> --sebastian
>
> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>> Sebastian,
>>
>> I've been doing a little more digging regarding the issue of preferences
>> being calculated for already preferred items. I re-ran the jobs using the
>> same data and the same parameters on a different installation of Hadoop, and
>> the problem seems to have gone away. For now it looks like the issue arises
>> when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform)
>> for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show up, yet, in
>> my tests under Hadoop 1.2.1 compiled for OS X. I will work a little more to
>> ensure my results, but if they stood up, should I still report it as a
>> Mahout issue?
>>
>> Rafal
>> --
>> Rafal Lukawiecki
>> Strategic Consultant and Director
>> Project Botticelli Ltd
>>
>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>
>> Setting it to the maximum number should be enough. Would be great if you
>> can share your dataset and tests.
>>
>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>
>>> Should I have set that parameter to a value much much larger than the
>>> maximum number of actually expressed preferences by a user?
>>>
>>> I'm working on an anonymised data set. If it works as an error test case,
>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>> error, not Mahout's.
>>>
>>> Rafal
>>> --
>>> Rafal Lukawiecki
>>> Pardon brevity, mobile device.
>>>
>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>
>>>> Ok, please file a bug report detailing what you've tested and what
>>>> results you got.
>>>>
>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>> help? That surprises me.
>>>>
>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> I've rechecked the results, and, I'm afraid that the issue has not gone
>>>>> away, contrary to my yesterday's enthusiastic response. Using 0.8 I have
>>>>> retested with and without --maxPrefsPerUser 9000 parameter (no user has
>>>>> more than 5000 prefs). I have also supplied the prefs file, without the
>>>>> preference value, that is as: user,item (one per line) as a --filterFile,
>>>>> with and without the --maxPrefsPerUser, and I am afraid we are also
>>>>> seeing recommendations for items the user has expressed a prior
>>>>> preference for.
>>>>>
>>>>> I suppose I need to file a bug report.
>>>>>
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Pardon my brevity, sent from a telephone.
>>>>>
>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>
>>>>>> Dear Sebastian,
>>>>>>
>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue
>>>>>> in our case; it seems that the most preferences a user had was just
>>>>>> about 5000, so I doubled it just-in-case, but when I operationalise
>>>>>> this model, I will make sure to calculate the actual max number of
>>>>>> preferences and set the parameter accordingly. I will double-check the
>>>>>> resultset to make sure the issue is really gone, as I have only checked
>>>>>> the few cases where we have spotted a recommendation of a previously
>>>>>> preferred item.
>>>>>>
>>>>>> Would you like me to file a bug, and would you like me to test it on
>>>>>> 0.8 or another version? I am using 0.7.
>>>>>>
>>>>>> Thanks for your kind support.
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Strategic Consultant and Director
>>>>>> Project Botticelli Ltd
>>>>>>
>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
>>>>>>
>>>>>> Hi Rafal,
>>>>>>
>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number
>>>>>> of interactions per user and see if you still get the error?
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>> Thank you Sebastian. The data set is not that large, as we are running
>>>>>>> tests on a subset. It is about 24k users, 40k items, the preference
>>>>>>> file has 65k preferences as triples. This was using Similarity
>>>>>>> Cooccurrence.
>>>>>>>
>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>> helpful.
>>>>>>>
>>>>>>> Thanks for your kind help.
>>>>>>>
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>
>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Rafal,
>>>>>>>>
>>>>>>>> can you issue a ticket for this problem at
>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests
>>>>>>>> that check whether this happens and currently they work fine. I can
>>>>>>>> only imagine that the problem occurs in larger datasets where we
>>>>>>>> sample the data in some places. Can you describe a scenario/dataset
>>>>>>>> where this happens?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>
>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working
>>>>>>>>> on an amazing piece of software, thank you for building Mahout and
>>>>>>>>> for your support. My apologies if this is not the right place to ask
>>>>>>>>> the question; I have searched for the issue, and I can see this
>>>>>>>>> problem has been reported here:
>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>
>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not
>>>>>>>>> found a way, yet, to get an answer from them, without asking you.
>>>>>>>>>
>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
>>>>>>>>> and I am finding that it is recommending items that the user has
>>>>>>>>> already expressed a preference for in their input file. I understand
>>>>>>>>> that this should not be happening, and I am not sure if there is a
>>>>>>>>> known fix or if I should be looking for a workaround (such as using
>>>>>>>>> the entire input as the filterFile).
>>>>>>>>>
>>>>>>>>> I will double-check that there is no error on my side, but so far it
>>>>>>>>> does not seem that way.
>>>>>>>>>
>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>> Strategic Consultant and Director
>>>>>>>>> Project Botticelli Ltd
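
For anyone who wants to check their own run for the symptoms discussed in this thread, here is a minimal sketch in plain Python (not part of Mahout). It computes the largest number of preferences any single user has, which is the value the thread suggests passing as --maxPrefsPerUser, and lists recommended items that the user had already expressed a preference for. It assumes the preference input is comma-separated user,item[,value] lines as described above, and that each recommendation line has the form userID<TAB>[itemID:score,itemID:score,...]; that output layout is an assumption and should be verified against your own job output. All function names and the sample data are hypothetical.

```python
import csv
from collections import defaultdict


def load_preferences(lines):
    """Map each user ID to the set of item IDs they already have a preference for.

    Assumes comma-separated lines: user,item[,value].
    """
    prefs = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        prefs[row[0].strip()].add(row[1].strip())
    return prefs


def max_prefs_per_user(prefs):
    """Largest number of distinct preferred items for any single user."""
    return max((len(items) for items in prefs.values()), default=0)


def parse_recommendations(lines):
    """Parse lines assumed to look like: userID<TAB>[itemID:score,itemID:score,...]."""
    recs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, _, body = line.partition("\t")
        body = body.strip().strip("[]")
        recs[user.strip()] = [pair.split(":")[0].strip()
                              for pair in body.split(",") if pair]
    return recs


def find_duplicates(prefs, recs):
    """Return {user: [items]} for recommended items the user already preferred."""
    dupes = {}
    for user, items in recs.items():
        seen = prefs.get(user, set())
        overlap = [item for item in items if item in seen]
        if overlap:
            dupes[user] = overlap
    return dupes


if __name__ == "__main__":
    # Tiny made-up sample, only to show how the helpers fit together.
    sample_prefs = ["1,101,5.0", "1,102,3.0", "2,101,4.0"]
    sample_recs = ["1\t[102:4.5,103:4.0]", "2\t[104:3.9]"]

    prefs = load_preferences(sample_prefs)
    recs = parse_recommendations(sample_recs)
    print("max prefs per user:", max_prefs_per_user(prefs))
    print("already-preferred recommendations:", find_duplicates(prefs, recs))
```

Running the same check on the output from two platforms (for example the Windows and OS X installations mentioned above) would give a like-for-like count of the "duplicates", though without a fixed random seed the sampled results may still differ between runs.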
