Hi Rafal,

No need to apologize, this list exists for answering questions. You found
a very important bug, btw. Glad that the job works for you now.

Best,
Sebastian

On 15.08.2013 19:55, Rafal Lukawiecki wrote:
> For what it's worth, the issue of the recommender recommending items that
> a user had already "preferred" seems to have gone away. I realise I am a
> few reboots of the platform later than I was when I asked about it, but to
> the best of my knowledge nothing else has changed.
>
> I would feel silly if this was an error on my side, but I cannot find any
> other explanation. As long as we set the --maxPrefsPerUser parameter high
> enough, there are no more "duplicates". My apologies for muddying the
> waters earlier on.
> 
> Rafal
> 
> On 7 Aug 2013, at 17:19, Sebastian Schelter <[email protected]> wrote:
> 
> if you also set --maxPrefsPerUserInItemSimilarity to a number higher than
> the max preferences per user, no sampling should occur. This might slow
> down the job however.
> 
> 2013/8/7 Rafal Lukawiecki <[email protected]>
> 
>> Is there a set of parameters which I could pass to RecommenderJob to avoid
>> that random sampling, in order to create a test case for the issue I have
>> experienced? Would setting --maxSimilaritiesPerItem and/or
>> --maxPrefsPerUserInItemSimilarity help? Many thanks.
>>
>> On 7 Aug 2013, at 16:12, Sebastian Schelter <[email protected]>
>> wrote:
>>
>> It could affect the results even in this case, as we also sample the
>> preferences when computing similar items.
>>
>> On 07.08.2013 17:07, Rafal Lukawiecki wrote:
>>> Thank you, Sebastian. Would the random sampling affect the results of
>>> RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
>>> actual, maximum number of preferences expressed by every user.
>>>
>>> Rafal
>>>
>>> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]>
>>> wrote:
>>>
>>> The code in trunk allows you to specify a randomSeed; the older
>>> versions don't, unfortunately.
>>>
>>> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>>>> Hi Sebastian,
>>>>
>>>> The quantity of returned "duplicates" is much too large to be caused
>>>> just by sampling's randomness. I wonder if this could be related to
>>>> something that is platform-specific, as in Windows vs. *nix representation
>>>> of input files, data types etc.
>>>>
>>>> For argument's sake, is it possible to fix the seed of the random
>>>> aspect of the sampling so I could feed the same input through two platforms
>>>> and compare the results?
>>>>
>>>> Rafal
>>>>
>>>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]>
>>>> wrote:
>>>>
>>>> Hi Rafal,
>>>>
>>>> this sounds really strange, the bug should not have anything to do with
>>>> the version of Hadoop that you are running. You could sometimes not see
>>>> it due to the random sampling of the preferences.
>>>>
>>>> --sebastian
>>>>
>>>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>>>> Sebastian,
>>>>>
>>>>> I've been doing a little more digging regarding the issue of
>>>>> preferences being calculated for already preferred items. I re-ran the jobs
>>>>> using the same data and the same parameters on a different installation of
>>>>> Hadoop, and the problem seems to have gone away. For now it looks like the
>>>>> issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks
>>>>> Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem has not
>>>>> shown up yet in my tests under Hadoop 1.2.1 compiled for OS X. I will work
>>>>> a little more to confirm my results, but if they stand up, should I still
>>>>> report it as a Mahout issue?
>>>>>
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Strategic Consultant and Director
>>>>> Project Botticelli Ltd
>>>>>
>>>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>>>
>>>>> Setting it to the maximum number should be enough. Would be great if you
>>>>> can share your dataset and tests.
>>>>>
>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>
>>>>>> Should I have set that parameter to a value much much larger than the
>>>>>> maximum number of actually expressed preferences by a user?
>>>>>>
>>>>>> I'm working on an anonymised data set. If it works as an error test case,
>>>>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>>>>> error, not Mahout's.
>>>>>>
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Pardon brevity, mobile device.
>>>>>>
>>>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>
>>>>>>> Ok, please file a bug report detailing what you've tested and what
>>>>>>> results you got.
>>>>>>>
>>>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does
>>>>>>> not help? That surprises me.
>>>>>>>
>>>>>>>
>>>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>>>
>>>>>>>> Hi Sebastian,
>>>>>>>>
>>>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone
>>>>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8, I have
>>>>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user
>>>>>>>> has more than 5000 prefs). I have also supplied the prefs file, without
>>>>>>>> the preference value, that is, as user,item (one per line), as a
>>>>>>>> --filterFile, with and without --maxPrefsPerUser, and I am afraid we
>>>>>>>> are also seeing recommendations for items the user has expressed a
>>>>>>>> prior preference for.
>>>>>>>>
>>>>>>>> I suppose I need to file a bug report.
>>>>>>>>
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>
>>>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Dear Sebastian,
>>>>>>>>>
>>>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue
>>>>>>>>> in our case. It seems that the most preferences a user had was just
>>>>>>>>> about 5000, so I doubled it just in case, but when I operationalise
>>>>>>>>> this model, I will make sure to calculate the actual max number of
>>>>>>>>> preferences and set the parameter accordingly. I will double-check the
>>>>>>>>> result set to make sure the issue is really gone, as I have only
>>>>>>>>> checked the few cases where we have spotted a recommendation of a
>>>>>>>>> previously preferred item.
>>>>>>>>>
>>>>>>>>> Would you like me to file a bug, and would you like me to test it on
>>>>>>>>> 0.8 or another version? I am using 0.7.
>>>>>>>>>
>>>>>>>>> Thanks for your kind support.
>>>>>>>>> Rafal
>>>>>>>>> --
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>> Strategic Consultant and Director
>>>>>>>>> Project Botticelli Ltd
>>>>>>>>>
>>>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi Rafal,
>>>>>>>>>
>>>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number
>>>>>>>>> of interactions per user and see if you still get the error?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>>>> Thank you Sebastian. The data set is not that large, as we are running
>>>>>>>>>> tests on a subset. It is about 24k users, 40k items, the preference
>>>>>>>>>> file has 65k preferences as triples. This was using Similarity
>>>>>>>>>> Cooccurrence.
>>>>>>>>>>
>>>>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>>>>> helpful.
>>>>>>>>>>
>>>>>>>>>> Thanks for your kind help.
>>>>>>>>>>
>>>>>>>>>> Rafal
>>>>>>>>>> --
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>>>
>>>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Rafal,
>>>>>>>>>>>
>>>>>>>>>>> can you issue a ticket for this problem at
>>>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
>>>>>>>>>>> check whether this happens and currently they work fine. I can only
>>>>>>>>>>> imagine that the problem occurs in larger datasets where we sample the
>>>>>>>>>>> data in some places. Can you describe a scenario/dataset where this
>>>>>>>>>>> happens?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Sebastian
>>>>>>>>>>>
>>>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>>>
>>>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on
>>>>>>>>>>>> an amazing piece of software, thank you for building Mahout and for
>>>>>>>>>>>> your support. My apologies if this is not the right place to ask the
>>>>>>>>>>>> question. I have searched for the issue, and I can see this problem
>>>>>>>>>>>> has been reported here:
>>>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>>>
>>>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not
>>>>>>>>>>>> found a way, yet, to get an answer from them, without asking you.
>>>>>>>>>>>>
>>>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
>>>>>>>>>>>> and I am finding that it is recommending items that the user has
>>>>>>>>>>>> already expressed a preference for in their input file. I understand
>>>>>>>>>>>> that this should not be happening, and I am not sure if there is a
>>>>>>>>>>>> known fix or if I should be looking for a workaround (such as using
>>>>>>>>>>>> the entire input as the filterFile).
>>>>>>>>>>>>
>>>>>>>>>>>> I will double-check that there is no error on my side, but so far it
>>>>>>>>>>>> does not seem that way.
>>>>>>>>>>>>
>>>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>>>
>>>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>>>>
>>>>>>>>>>>> Project Botticelli Ltd
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>>
>>
> 
> 
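
For anyone reproducing the checks discussed in this thread, here is a
minimal sketch (not part of Mahout) that computes the largest number of
preferences held by any one user, which is the figure --maxPrefsPerUser
would need to reach, and lists recommendations for items a user already
preferred. It assumes the input is "user,item[,value]" per line, as
described above, and that the job's text output is
"user<TAB>[item:score,item:score,...]" per line; the output layout is an
assumption, so adjust the parsing if your files differ. All function names
are hypothetical.

```python
from collections import defaultdict


def parse_prefs(lines):
    """Map each user to the set of items they expressed a preference for.

    Assumes one 'user,item[,value]' record per line.
    """
    prefs = defaultdict(set)
    for line in lines:
        fields = line.strip().split(",")
        if len(fields) >= 2:
            prefs[fields[0]].add(fields[1])
    return prefs


def max_prefs_per_user(prefs):
    """Largest number of distinct items preferred by any single user.

    Note: repeated (user, item) rows are counted once here.
    """
    return max((len(items) for items in prefs.values()), default=0)


def parse_recs(lines):
    """Map each user to the list of recommended item IDs.

    Assumes one 'user<TAB>[item:score,item:score,...]' record per line
    (an assumed layout for the job's text output).
    """
    recs = {}
    for line in lines:
        user, sep, payload = line.strip().partition("\t")
        if not sep:
            continue  # no tab separator: not a recommendation record
        entries = payload.strip("[]").split(",")
        recs[user] = [e.split(":")[0] for e in entries if e]
    return recs


def already_preferred(prefs, recs):
    """Return {user: [items]} for recommended items the user already preferred."""
    overlaps = {}
    for user, items in recs.items():
        seen = prefs.get(user, set())
        dup = [item for item in items if item in seen]
        if dup:
            overlaps[user] = dup
    return overlaps


if __name__ == "__main__":
    # Tiny made-up sample; replace the lists with open(...) file handles.
    prefs = parse_prefs(["1,101,5.0", "1,102,3.0", "2,101,4.0"])
    recs = parse_recs(["1\t[103:4.5,101:4.0]", "2\t[102:3.9]"])
    print(max_prefs_per_user(prefs))
    print(already_preferred(prefs, recs))
```

Both parsers accept any iterable of lines, so open file handles can be
passed directly. An empty result from already_preferred means no
"duplicates" were found for that run.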
