Hi Rafal,

No need to apologize, this list exists for answering questions. You found a very important bug, btw. Glad that the job works for you now.

Best,
Sebastian

On 15.08.2013 19:55, Rafal Lukawiecki wrote:
> For what it's worth, the issue of the recommender recommending items that already had been "preferred" by a user seems to have gone away. I realise I am a few reboots of the platform later than I was when I asked about it, but to the best of my knowledge nothing else has changed. I would feel silly if this was an error on my side, but I cannot find any other explanation. As long as we set the --maxPrefsPerUser parameter high enough, there are no more "duplicates". My apologies for muddying the waters earlier on.
>
> Rafal
>
> On 7 Aug 2013, at 17:19, Sebastian Schelter <[email protected]> wrote:
>
> if you also set --maxPrefsPerUserInItemSimilarity to a number higher than the max preferences per user, no sampling should occur. This might slow down the job however.
>
> 2013/8/7 Rafal Lukawiecki <[email protected]>
>
>> Is there a set of parameters which I could pass to RecommenderJob to avoid that random sampling, in order to create a test case for the issue I have experienced? Would setting --maxSimilaritiesPerItem and/or --maxPrefsPerUserInItemSimilarity help? Many thanks.
>>
>> On 7 Aug 2013, at 16:12, Sebastian Schelter <[email protected]> wrote:
>>
>> It could affect the results even in this case, as we also sample the preferences when computing similar items.
>>
>> On 07.08.2013 17:07, Rafal Lukawiecki wrote:
>>> Thank you, Sebastian. Would the random sampling affect the results of RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the actual, maximum number of preferences expressed by every user.
>>>
>>> Rafal
>>>
>>> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]> wrote:
>>>
>>> The code in trunk allows you to specify a randomSeed, the older versions don't unfortunately.
>>>
>>> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>>>> Hi Sebastian,
>>>>
>>>> The quantity of returned "duplicates" is much too large to be caused just by sampling's randomness. I wonder if this could be related to something that is platform-specific, as in Windows vs. *nix representation of input files, data types etc.
>>>>
>>>> For argument's sake, is it possible to fix the seed of the random aspect of the sampling so I could feed the same input through two platforms and compare the results?
>>>>
>>>> Rafal
>>>>
>>>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]> wrote:
>>>>
>>>> Hi Rafal,
>>>>
>>>> this sounds really strange, the bug should not have anything to do with the version of Hadoop that you are running. You could sometimes not see it due to the random sampling of the preferences.
>>>>
>>>> --sebastian
>>>>
>>>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>>>> Sebastian,
>>>>>
>>>>> I've been doing a little more digging regarding the issue of preferences being calculated for already preferred items. I re-ran the jobs using the same data and the same parameters on a different installation of Hadoop, and the problem seems to have gone away. For now it looks like the issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not show up, yet, in my tests under Hadoop 1.2.1 compiled for OS X. I will work a little more to confirm my results, but if they stand up, should I still report it as a Mahout issue?
>>>>>
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Strategic Consultant and Director
>>>>> Project Botticelli Ltd
>>>>>
>>>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Setting it to the maximum number should be enough. Would be great if you can share your dataset and tests.
>>>>>
>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>
>>>>>> Should I have set that parameter to a value much much larger than the maximum number of actually expressed preferences by a user?
>>>>>>
>>>>>> I'm working on an anonymised data set. If it works as an error test case, I'd be happy to share it for your re-test. I am still hoping it is my error, not Mahout's.
>>>>>>
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Pardon brevity, mobile device.
>>>>>>
>>>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>
>>>>>>> Ok, please file a bug report detailing what you've tested and what results you got.
>>>>>>>
>>>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not help? That surprises me.
>>>>>>>
>>>>>>>
>>>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>>>
>>>>>>>> Hi Sebastian,
>>>>>>>>
>>>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone away, contrary to my enthusiastic response yesterday. Using 0.8 I have retested with and without the --maxPrefsPerUser 9000 parameter (no user has more than 5000 prefs). I have also supplied the prefs file, without the preference value, that is as: user,item (one per line) as a --filterFile, with and without the --maxPrefsPerUser, and I am afraid we are also seeing recommendations for items the user has expressed a prior preference for.
>>>>>>>>
>>>>>>>> I suppose I need to file a bug report.
>>>>>>>>
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>
>>>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Sebastian,
>>>>>>>>>
>>>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in our case. It seems that the most preferences a user had was just about 5000, so I doubled it just in case, but when I operationalise this model, I will make sure to calculate the actual max number of preferences and set the parameter accordingly. I will double-check the resultset to make sure the issue is really gone, as I have only checked the few cases where we have spotted a recommendation of a previously preferred item.
>>>>>>>>>
>>>>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8 or another version? I am using 0.7.
>>>>>>>>>
>>>>>>>>> Thanks for your kind support.
>>>>>>>>> Rafal
>>>>>>>>> --
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>> Strategic Consultant and Director
>>>>>>>>> Project Botticelli Ltd
>>>>>>>>>
>>>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Rafal,
>>>>>>>>>
>>>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of interactions per user and see if you still get the error?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>>>> Thank you Sebastian. The data set is not that large, as we are running tests on a subset. It is about 24k users and 40k items; the preference file has 65k preferences as triples. This was using Similarity Cooccurrence.
>>>>>>>>>>
>>>>>>>>>> I can see if I could anonymise the data set to share if that would be helpful.
>>>>>>>>>>
>>>>>>>>>> Thanks for your kind help.
>>>>>>>>>>
>>>>>>>>>> Rafal
>>>>>>>>>> --
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>>>
>>>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rafal,
>>>>>>>>>>>
>>>>>>>>>>> can you issue a ticket for this problem at https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that check whether this happens and currently they work fine. I can only imagine that the problem occurs in larger datasets where we sample the data in some places. Can you describe a scenario/dataset where this happens?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Sebastian
>>>>>>>>>>>
>>>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on an amazing piece of software, thank you for building Mahout and for your support. My apologies if this is not the right place to ask the question. I have searched for the issue, and I can see this problem has been reported here:
>>>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not found a way, yet, to get an answer from them, without asking you.
>>>>>>>>>>>>
>>>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I am finding that it is recommending items that the user has already expressed a preference for in their input file.
>>>>>>>>>>>> I understand that this should not be happening, and I am not sure if there is a known fix or if I should be looking for a workaround (such as using the entire input as the filterFile).
>>>>>>>>>>>>
>>>>>>>>>>>> I will double-check that there is no error on my side, but so far it does not seem that way.
>>>>>>>>>>>>
>>>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>>>> Project Botticelli Ltd
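
The thread mentions calculating the actual maximum number of preferences per user before choosing a value for --maxPrefsPerUser. A minimal sketch of that count follows, assuming the preference file is plain CSV with the user ID in the first column (user,item or user,item,preference). The function name max_prefs_per_user and the sample rows are hypothetical and not part of Mahout.

```python
import csv
import io
from collections import Counter


def max_prefs_per_user(lines):
    """Return the largest number of preference rows recorded for any one user.

    `lines` is any iterable of CSV lines (an open file works) whose first
    column is the user ID. This layout is an assumption for illustration.
    """
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        counts[row[0].strip()] += 1
    return max(counts.values(), default=0)


# Hypothetical sample data: user 1 has two preferences, user 2 has one.
sample = io.StringIO("1,101,5.0\n1,102,3.0\n2,101,4.0\n")
print(max_prefs_per_user(sample))
```

Per the replies above, a value at or above this count could then be passed as --maxPrefsPerUser, and a higher value as --maxPrefsPerUserInItemSimilarity, if sampling is to be avoided.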
