The code in trunk allows you to specify a randomSeed; unfortunately, the
older versions don't.
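
To illustrate why a fixed seed helps with the cross-platform comparison asked
about below, here is a minimal Python sketch. It is not Mahout's sampling code:
the function name sample_prefs and its parameters are made up for illustration.
It only shows that a sampler driven by a seeded generator returns the same
subset on every run, so any difference between two runs must come from
somewhere else.

```python
import random


def sample_prefs(prefs, max_prefs, seed):
    """Return at most max_prefs entries, chosen by a generator seeded with seed.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    prefs = list(prefs)
    if len(prefs) <= max_prefs:
        # Nothing has to be cut, so no randomness is involved.
        return prefs
    # A private generator keeps the result independent of global random state.
    rng = random.Random(seed)
    return rng.sample(prefs, max_prefs)


items = list(range(100))
first = sample_prefs(items, 10, seed=42)
second = sample_prefs(items, 10, seed=42)

# Both runs use the same seed, so the two samples are identical.
print(first == second)
```

Note that when a user has no more than max_prefs entries, nothing is sampled at
all, which is the reasoning behind setting the limit to the maximum number of
preferences per user.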

On 07.08.2013 16:35, Rafal Lukawiecki wrote:
> Hi Sebastian,
> 
> The quantity of returned "duplicates" is much too large to be caused just by 
> sampling's randomness. I wonder if this could be related to something that is 
> platform-specific, as in Windows vs. *nix representation of input files, data 
> types etc.
> 
> For argument's sake, is it possible to fix the seed of the random aspect of 
> the sampling so I could feed the same input through two platforms and compare 
> the results?
> 
> Rafal
> 
> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]> wrote:
> 
> Hi Rafal,
> 
> this sounds really strange; the bug should not have anything to do with
> the version of Hadoop that you are running. You might simply not see it
> on some runs because of the random sampling of the preferences.
> 
> --sebastian
> 
> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>> Sebastian,
>>
>> I've been doing a little more digging regarding the issue of preferences 
>> being calculated for already preferred items. I re-ran the jobs using the 
>> same data and the same parameters on a different installation of Hadoop, and 
>> the problem seems to have gone away. For now it looks like the issue arises 
>> when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data Platform) 
>> for Windows 1.1.0, with Hadoop 1.1.0. This problem has not shown up yet in 
>> my tests under Hadoop 1.2.1 compiled for OS X. I will work a little more to 
>> confirm my results, but if they stand up, should I still report it as a 
>> Mahout issue?
>>
>> Rafal  
>> --
>> Rafal Lukawiecki
>> Strategic Consultant and Director 
>> Project Botticelli Ltd
>>
>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>
>> Setting it to the maximum number should be enough. Would be great if you
>> can share your dataset and tests.
>>
>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>
>>> Should I have set that parameter to a value much much larger than the
>>> maximum number of actually expressed preferences by a user?
>>>
>>> I'm working on an anonymised data set. If it works as an error test case,
>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>> error, not Mahout's.
>>>
>>> Rafal
>>> --
>>> Rafal Lukawiecki
>>> Pardon brevity, mobile device.
>>>
>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>
>>>> Ok, please file a bug report detailing what you've tested and what
>>>> results you got.
>>>>
>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>> help? That surprises me.
>>>>
>>>>
>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>
>>>>> Hi Sebastian,
>>>>>
>>>>> I've rechecked the results and, I'm afraid, the issue has not gone
>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8, I have
>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user
>>>>> has more than 5000 prefs). I have also supplied the prefs file, without
>>>>> the preference value, that is, as user,item (one per line), as a
>>>>> --filterFile, with and without --maxPrefsPerUser, and I am afraid we are
>>>>> also seeing recommendations for items the user has expressed a prior
>>>>> preference for.
>>>>>
>>>>> I suppose I need to file a bug report.
>>>>>
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Pardon my brevity, sent from a telephone.
>>>>>
>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>
>>>>>> Dear Sebastian,
>>>>>>
>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in
>>>>>> our case. It seems that the most preferences a user had was just about
>>>>>> 5000, so I doubled it just in case, but when I operationalise this model,
>>>>>> I will make sure to calculate the actual max number of preferences and
>>>>>> set the parameter accordingly. I will double-check the result set to make
>>>>>> sure the issue is really gone, as I have only checked the few cases where
>>>>>> we have spotted a recommendation of a previously preferred item.
>>>>>>
>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8
>>>>>> or another version? I am using 0.7.
>>>>>>
>>>>>> Thanks for your kind support.
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Strategic Consultant and Director
>>>>>> Project Botticelli Ltd
>>>>>>
>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>> Hi Rafal,
>>>>>>
>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of
>>>>>> interactions per user and see if you still get the error?
>>>>>>
>>>>>> Best,
>>>>>> Sebastian
>>>>>>
>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>> Thank you Sebastian. The data set is not that large, as we are running
>>>>>>> tests on a subset. It is about 24k users and 40k items, and the
>>>>>>> preference file has 65k preferences as triples. This was using
>>>>>>> Similarity Cooccurrence.
>>>>>>>
>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>> helpful.
>>>>>>>
>>>>>>> Thanks for your kind help.
>>>>>>>
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>
>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Rafal,
>>>>>>>>
>>>>>>>> can you issue a ticket for this problem at
>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests that
>>>>>>>> check whether this happens and currently they work fine. I can only
>>>>>>>> imagine that the problem occurs in larger datasets where we sample the
>>>>>>>> data in some places. Can you describe a scenario/dataset where this
>>>>>>>> happens?
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>>
>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>
>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on
>>>>>>>>> an amazing piece of software, thank you for building Mahout and for
>>>>>>>>> your support. My apologies if this is not the right place to ask the
>>>>>>>>> question. I have searched for the issue, and I can see this problem
>>>>>>>>> has been reported here:
>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>
>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not found
>>>>>>>>> a way, yet, to get an answer from them, without asking you.
>>>>>>>>>
>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and
>>>>>>>>> I am finding that it is recommending items that the user has already
>>>>>>>>> expressed a preference for in their input file. I understand that this
>>>>>>>>> should not be happening, and I am not sure if there is a known fix or
>>>>>>>>> if I should be looking for a workaround (such as using the entire
>>>>>>>>> input as the filterFile).
>>>>>>>>>
>>>>>>>>> I will double-check that there is no error on my side, but so far it
>>>>>>>>> does not seem that way.
>>>>>>>>>
>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>
>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>
>>>>>>>>> Project Botticelli Ltd
>>>>>
>>>
>>
>>
> 
> 
> 
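
The two checks that come up repeatedly in this thread, finding the real maximum
number of preferences per user and spotting recommendations for items a user
already preferred, can be sketched in Python as follows. This is only a rough
sketch under assumptions: input lines are taken to look like user,item[,value]
and output lines like user<TAB>[item:score,item:score]; the helper names are
hypothetical and not part of Mahout, and the sample data is invented.

```python
from collections import defaultdict


def load_prefs(lines):
    """Map each user to the set of items found in 'user,item[,value]' lines."""
    prefs = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, item = line.split(",")[:2]
        prefs[user].add(item)
    return prefs


def max_prefs_per_user(prefs):
    """Largest number of distinct items any single user has a preference for."""
    return max((len(items) for items in prefs.values()), default=0)


def already_preferred(prefs, rec_lines):
    """Yield (user, item) pairs recommended despite a prior preference.

    Assumes each line looks like 'user<TAB>[item:score,item:score]'.
    """
    for line in rec_lines:
        line = line.strip()
        if not line:
            continue
        user, recs = line.split("\t", 1)
        for entry in recs.strip("[]").split(","):
            if not entry:
                continue
            item = entry.split(":")[0]
            if item in prefs.get(user, ()):
                yield user, item


# Invented sample data: user 1 already has a preference for item 102.
input_lines = ["1,101,5.0", "1,102,3.0", "2,101,4.0"]
output_lines = ["1\t[102:4.5,103:4.1]", "2\t[102:3.9]"]

prefs = load_prefs(input_lines)
print(max_prefs_per_user(prefs))
print(list(already_preferred(prefs, output_lines)))
```

Running the same check on the output from both platforms would show whether the
number of offending pairs really differs between them.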
