It could affect the results even in this case, as we also sample the preferences when computing similar items.
On 07.08.2013 17:07, Rafal Lukawiecki wrote:
> Thank you, Sebastian. Would the random sampling affect the results of
> RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
> actual, maximum number of preferences expressed by every user.
>
> Rafal
>
> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]> wrote:
>
> The code in trunk allows you to specify a randomSeed; the older
> versions unfortunately don't.
>
> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>> Hi Sebastian,
>>
>> The quantity of returned "duplicates" is much too large to be caused just by
>> sampling's randomness. I wonder if this could be related to something that
>> is platform-specific, as in Windows vs. *nix representation of input files,
>> data types etc.
>>
>> For argument's sake, is it possible to fix the seed of the random aspect of
>> the sampling so I could feed the same input through two platforms and
>> compare the results?
>>
>> Rafal
>>
>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]> wrote:
>>
>> Hi Rafal,
>>
>> this sounds really strange, the bug should not have anything to do with
>> the version of Hadoop that you are running. You could sometimes not see
>> it due to the random sampling of the preferences.
>>
>> --sebastian
>>
>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>> Sebastian,
>>>
>>> I've been doing a little more digging regarding the issue of preferences
>>> being calculated for already preferred items. I re-ran the jobs using the
>>> same data and the same parameters on a different installation of Hadoop,
>>> and the problem seems to have gone away. For now it looks like the issue
>>> arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data
>>> Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show
>>> up, yet, in my tests under Hadoop 1.2.1 compiled for OS X. I will work a
>>> little more to confirm my results, but if they stand up, should I still
>>> report it as a Mahout issue?
>>>
>>> Rafal
>>> --
>>> Rafal Lukawiecki
>>> Strategic Consultant and Director
>>> Project Botticelli Ltd
>>>
>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>
>>> Setting it to the maximum number should be enough. Would be great if you
>>> can share your dataset and tests.
>>>
>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>
>>>> Should I have set that parameter to a value much much larger than the
>>>> maximum number of actually expressed preferences by a user?
>>>>
>>>> I'm working on an anonymised data set. If it works as an error test case,
>>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>>> error, not Mahout's.
>>>>
>>>> Rafal
>>>> --
>>>> Rafal Lukawiecki
>>>> Pardon brevity, mobile device.
>>>>
>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>
>>>>> Ok, please file a bug report detailing what you've tested and what
>>>>> results you got.
>>>>>
>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>>> help? That surprises me.
>>>>>
>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone
>>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8 I have
>>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user
>>>>>> has more than 5000 prefs). I have also supplied the prefs file, without
>>>>>> the preference value, that is as: user,item (one per line) as a
>>>>>> --filterFile, with and without --maxPrefsPerUser, and I am afraid we
>>>>>> are still seeing recommendations for items the user has expressed a
>>>>>> prior preference for.
>>>>>>
>>>>>> I suppose I need to file a bug report.
>>>>>>
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>
>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>>
>>>>>>> Dear Sebastian,
>>>>>>>
>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue
>>>>>>> in our case. It seems that the most preferences a user had was just
>>>>>>> about 5000, so I doubled it just in case, but when I operationalise
>>>>>>> this model, I will make sure to calculate the actual max number of
>>>>>>> preferences and set the parameter accordingly. I will double-check the
>>>>>>> resultset to make sure the issue is really gone, as I have only checked
>>>>>>> the few cases where we have spotted a recommendation of a previously
>>>>>>> preferred item.
>>>>>>>
>>>>>>> Would you like me to file a bug, and would you like me to test it on
>>>>>>> 0.8 or another version? I am using 0.7.
>>>>>>>
>>>>>>> Thanks for your kind support.
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Strategic Consultant and Director
>>>>>>> Project Botticelli Ltd
>>>>>>>
>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi Rafal,
>>>>>>>
>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number
>>>>>>> of interactions per user and see if you still get the error?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>> Thank you Sebastian. The data set is not that large, as we are running
>>>>>>>> tests on a subset. It is about 24k users, 40k items, the preference
>>>>>>>> file has 65k preferences as triples. This was using Similarity
>>>>>>>> Cooccurrence.
>>>>>>>>
>>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>>> helpful.
>>>>>>>>
>>>>>>>> Thanks for your kind help.
>>>>>>>>
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>
>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Rafal,
>>>>>>>>>
>>>>>>>>> can you issue a ticket for this problem at
>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests
>>>>>>>>> that check whether this happens and currently they work fine. I can
>>>>>>>>> only imagine that the problem occurs in larger datasets where we
>>>>>>>>> sample the data in some places. Can you describe a scenario/dataset
>>>>>>>>> where this happens?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>
>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working
>>>>>>>>>> on an amazing piece of software, thank you for building Mahout and
>>>>>>>>>> for your support. My apologies if this is not the right place to ask
>>>>>>>>>> the question; I have searched for the issue, and I can see this
>>>>>>>>>> problem has been reported here:
>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>
>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not
>>>>>>>>>> found a way, yet, to get an answer from them, without asking you.
>>>>>>>>>>
>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
>>>>>>>>>> and I am finding that it is recommending items that the user has
>>>>>>>>>> already expressed a preference for in their input file. I understand
>>>>>>>>>> that this should not be happening, and I am not sure if there is a
>>>>>>>>>> known fix or if I should be looking for a workaround (such as using
>>>>>>>>>> the entire input as the filterFile).
>>>>>>>>>>
>>>>>>>>>> I will double-check that there is no error on my side, but so far it
>>>>>>>>>> does not seem that way.
>>>>>>>>>>
>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>> Project Botticelli Ltd
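
For anyone repeating the checks discussed in this thread, here is a minimal Python sketch of the two manual steps mentioned: finding the largest number of preferences any user has (a candidate value for --maxPrefsPerUser) and listing recommendations for items a user has already rated. It is not part of Mahout. The function names are hypothetical, the input is assumed to be `user,item[,preference]` lines as described above, and the output format `user<TAB>[item:score,...]` is an assumption about what the job writes, so adjust the parsing if your files differ.

```python
from collections import defaultdict


def load_preferences(lines):
    """Map each user ID to the set of item IDs already rated.

    Assumes one `user,item[,preference]` record per line.
    """
    seen = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(",")[:2]
        seen[user].add(item)
    return seen


def max_prefs_per_user(pref_lines):
    """Largest number of distinct items rated by any single user."""
    seen = load_preferences(pref_lines)
    return max((len(items) for items in seen.values()), default=0)


def parse_recommendations(line):
    """Parse one output line, ASSUMED to look like `user<TAB>[item:score,...]`."""
    user, _, payload = line.strip().partition("\t")
    payload = payload.strip("[]")
    items = [pair.split(":")[0] for pair in payload.split(",") if pair]
    return user, items


def already_preferred(pref_lines, rec_lines):
    """Return {user: [items]} for recommended items the user already rated."""
    seen = load_preferences(pref_lines)
    overlaps = {}
    for line in rec_lines:
        if not line.strip():
            continue
        user, items = parse_recommendations(line)
        repeated = [item for item in items if item in seen.get(user, set())]
        if repeated:
            overlaps[user] = repeated
    return overlaps
```

Running `already_preferred` on the same input and the outputs from two platforms would give two dictionaries that can be compared directly, although, as noted above, sampling without a fixed seed can still make the runs differ.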
