It could affect the results even in this case, as we also sample the
preferences when computing similar items.

On 07.08.2013 17:07, Rafal Lukawiecki wrote:
> Thank you, Sebastian. Would the random sampling affect the results of 
> RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the 
> actual, maximum number of preferences expressed by every user.
> 
> Rafal
>  
> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]>
>  wrote:
> 
> The code in trunk allows you to specify a randomSeed; unfortunately, the
> older versions don't.
> 
> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>> Hi Sebastian,
>>
>> The quantity of returned "duplicates" is much too large to be caused just by 
>> sampling's randomness. I wonder if this could be related to something that 
>> is platform-specific, as in Windows vs. *nix representation of input files, 
>> data types etc.
>>
>> For argument's sake, is it possible to fix the seed of the random aspect of 
>> the sampling so I could feed the same input through two platforms and 
>> compare the results?
>>
>> Rafal
>>
>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]>
>> wrote:
>>
>> Hi Rafal,
>>
>> this sounds really strange, the bug should not have anything to do with
>> the version of Hadoop that you are running. You could sometimes not see
>> it due to the random sampling of the preferences.
>>
>> --sebastian
>>
>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>> Sebastian,
>>>
>>> I've been doing a little more digging regarding the issue of preferences 
>>> being calculated for already preferred items. I re-ran the jobs using the 
>>> same data and the same parameters on a different installation of Hadoop, 
>>> and the problem seems to have gone away. For now it looks like the issue 
>>> arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks Data 
>>> Platform) for Windows 1.1.0, with Hadoop 1.1.0. So far, in my tests, the 
>>> problem has not shown up under Hadoop 1.2.1 compiled for OS X. I will work 
>>> a little more to confirm my results, but if they stand up, should I still 
>>> report it as a Mahout issue?
>>>
>>> Rafal  
>>> --
>>> Rafal Lukawiecki
>>> Strategic Consultant and Director 
>>> Project Botticelli Ltd
>>>
>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>
>>> Setting it to the maximum number should be enough. Would be great if you
>>> can share your dataset and tests.
>>>
>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>
>>>> Should I have set that parameter to a value much much larger than the
>>>> maximum number of actually expressed preferences by a user?
>>>>
>>>> I'm working on an anonymised data set. If it works as an error test case,
>>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>>> error, not Mahout's.
>>>>
>>>> Rafal
>>>> --
>>>> Rafal Lukawiecki
>>>> Pardon brevity, mobile device.
>>>>
>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>
>>>>> Ok, please file a bug report detailing what you've tested and what
>>>>> results you got.
>>>>>
>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>>> help? That surprises me.
>>>>>
>>>>>
>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone
>>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8, I have
>>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user
>>>>>> has more than 5000 prefs). I have also supplied the prefs file, without
>>>>>> the preference value, that is as user,item (one per line), as a
>>>>>> --filterFile, with and without --maxPrefsPerUser, and I am afraid we
>>>>>> are still seeing recommendations for items the user has expressed a
>>>>>> prior preference for.
>>>>>>
>>>>>> I suppose I need to file a bug report.
>>>>>>
>>>>>> Rafal
>>>>>> --
>>>>>> Rafal Lukawiecki
>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>
>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Dear Sebastian,
>>>>>>>
>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue in
>>>>>>> our case. It seems that the most preferences any user had was just
>>>>>>> about 5000, so I doubled it just in case, but when I operationalise this
>>>>>>> model, I will make sure to calculate the actual max number of
>>>>>>> preferences and set the parameter accordingly. I will double-check the
>>>>>>> result set to make sure the issue is really gone, as I have only checked
>>>>>>> the few cases where we had spotted a recommendation of a previously
>>>>>>> preferred item.
>>>>>>>
>>>>>>> Would you like me to file a bug, and would you like me to test it on 0.8
>>>>>>> or another version? I am using 0.7.
>>>>>>>
>>>>>>> Thanks for your kind support.
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Strategic Consultant and Director
>>>>>>> Project Botticelli Ltd
>>>>>>>
>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Rafal,
>>>>>>>
>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number of
>>>>>>> interactions per user and see if you still get the error?
>>>>>>>
>>>>>>> Best,
>>>>>>> Sebastian
>>>>>>>
>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>> Thank you, Sebastian. The data set is not that large, as we are running
>>>>>>>> tests on a subset. It has about 24k users and 40k items, and the
>>>>>>>> preference file has 65k preferences as triples. This was using
>>>>>>>> Similarity Cooccurrence.
>>>>>>>>
>>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>>> helpful.
>>>>>>>>
>>>>>>>> Thanks for your kind help.
>>>>>>>>
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>
>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Rafal,
>>>>>>>>>
>>>>>>>>> can you issue a ticket for this problem at
>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit tests that
>>>>>>>>> check whether this happens and currently they work fine. I can only
>>>>>>>>> imagine that the problem occurs in larger datasets where we sample the
>>>>>>>>> data in some places. Can you describe a scenario/dataset where this
>>>>>>>>> happens?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Sebastian
>>>>>>>>>
>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>
>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working on an
>>>>>>>>>> amazing piece of software, thank you for building Mahout and for your
>>>>>>>>>> support. My apologies if this is not the right place to ask the
>>>>>>>>>> question. I have searched for the issue, and I can see this problem has
>>>>>>>>>> been reported here:
>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>
>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not yet
>>>>>>>>>> found a way to get an answer from them without asking you.
>>>>>>>>>>
>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7, and I
>>>>>>>>>> am finding that it is recommending items that the user has already
>>>>>>>>>> expressed a preference for in their input file. I understand that this
>>>>>>>>>> should not be happening, and I am not sure if there is a known fix or if
>>>>>>>>>> I should be looking for a workaround (such as using the entire input as
>>>>>>>>>> the filterFile).
>>>>>>>>>>
>>>>>>>>>> I will double-check that there is no error on my side, but so far it
>>>>>>>>>> does not seem that way.
>>>>>>>>>>
>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>
>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>>
>>>>>>>>>> Project Botticelli Ltd
>>>>>>
>>>>
>>>
>>>
>>
>>
>>
> 
> 
> 

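Two checks come up repeatedly in this thread: finding the largest number of preferences held by any single user (so that --maxPrefsPerUser can be set to at least that value), and scanning the job output for recommendations of items the user already has a preference for. A minimal Python sketch of both follows. It rests on assumptions that the thread does not confirm: the preference file is taken to be CSV lines of the form user,item[,value], and each output line is taken to look like userID<TAB>[itemID:score,itemID:score,...]. The function names are hypothetical and are not part of Mahout.

```python
import csv
from collections import defaultdict


def load_prefs(lines):
    """Map each user ID to the set of item IDs it has a preference for.

    Assumes CSV lines of the form user,item[,value].
    """
    prefs = defaultdict(set)
    for row in csv.reader(lines):
        if len(row) >= 2:  # skip blank or malformed lines
            prefs[row[0].strip()].add(row[1].strip())
    return prefs


def max_prefs_per_user(prefs):
    """Largest number of distinct preferred items held by any one user."""
    return max((len(items) for items in prefs.values()), default=0)


def already_preferred(prefs, rec_lines):
    """Yield (user, item) pairs recommended despite a prior preference.

    Assumes output lines of the form userID<TAB>[item:score,item:score,...].
    """
    for line in rec_lines:
        line = line.strip()
        if not line:
            continue
        user, _, recs = line.partition("\t")
        known = prefs.get(user, ())
        for entry in recs.strip("[]").split(","):
            item = entry.split(":")[0].strip()
            if item and item in known:
                yield user, item


if __name__ == "__main__":
    # Tiny made-up sample, for illustration only.
    prefs = load_prefs(["1,101,5.0", "1,102,3.0", "2,101,4.0"])
    recs = ["1\t[103:4.5,102:4.1]", "2\t[102:3.9]"]
    print(max_prefs_per_user(prefs))            # 2
    print(list(already_preferred(prefs, recs)))  # [('1', '102')]
```

With real data, the lists above would be replaced by open file handles for the preference file and the job's output files. Running the same check against the output from two installations would also give a direct count of offending pairs per platform.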