For what it's worth, the issue of the recommender recommending items that 
had already been "preferred" by a user seems to have gone away. I realise I am 
a few reboots of the platform later than I was when I asked about it, but 
to the best of my knowledge nothing else has changed. I would feel silly if 
this was an error on my side, but I cannot find any other explanation. As long 
as we set the --maxPrefsPerUser parameter high enough, there are no more 
"duplicates".  My apologies for muddying the waters earlier on.

Rafal
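
A minimal sketch of how the "duplicates" discussed in this thread could be
checked for, written in Python. It assumes the preference file holds
user,item[,value] lines and that each RecommenderJob output line looks like
userID<TAB>[itemID:score,itemID:score,...]. The function names and the sample
data are hypothetical, made up purely for illustration.

```python
def load_preferences(lines):
    """Map each user ID to the set of item IDs already preferred by that user."""
    prefs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Only the first two fields matter; a preference value may follow.
        user, item = line.split(",")[:2]
        prefs.setdefault(user, set()).add(item)
    return prefs


def find_duplicates(rec_lines, prefs):
    """Yield (user, item) pairs where a recommended item was already preferred."""
    for line in rec_lines:
        line = line.strip()
        if not line:
            continue
        # Assumed output layout: userID<TAB>[item:score,item:score,...]
        user, _, rest = line.partition("\t")
        for entry in rest.strip("[]").split(","):
            if not entry:
                continue
            item = entry.split(":")[0]
            if item in prefs.get(user, set()):
                yield user, item


# Made-up sample data, not taken from the data set discussed in the thread.
prefs = load_preferences(["1,101,5.0", "1,102,3.0", "2,101,4.0"])
recs = ["1\t[103:4.5,102:4.1]", "2\t[102:3.9]"]
duplicates = list(find_duplicates(recs, prefs))
print(duplicates)
```

An empty list would mean that no user was recommended an item they had
already expressed a preference for.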

On 7 Aug 2013, at 17:19, Sebastian Schelter <[email protected]> wrote:

if you also set --maxPrefsPerUserInItemSimilarity to a number higher than
the max preferences per user, no sampling should occur. This might slow
down the job however.
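
As a rough illustration of the advice above, the following Python sketch
counts the largest number of preferences held by any single user and builds
an argument list that sets both --maxPrefsPerUser and
--maxPrefsPerUserInItemSimilarity above that count. The launcher name
"mahout recommenditembased", the --input, --output and --similarityClassname
options, the file paths, and the helper names are assumptions for
illustration only; check them against your own Mahout installation.

```python
from collections import Counter


def max_prefs_per_user(lines):
    """Return the largest number of preference lines belonging to one user."""
    counts = Counter(line.split(",")[0] for line in lines if line.strip())
    return max(counts.values()) if counts else 0


def recommender_job_args(input_path, output_path, max_prefs):
    """Build a command line that sets both sampling limits above max_prefs."""
    limit = max_prefs + 1  # strictly higher than the observed maximum
    return [
        "mahout", "recommenditembased",  # assumed launcher and job shortcut
        "--input", input_path,
        "--output", output_path,
        "--similarityClassname", "SIMILARITY_COOCCURRENCE",
        "--maxPrefsPerUser", str(limit),
        "--maxPrefsPerUserInItemSimilarity", str(limit),
    ]


# Made-up sample data and hypothetical paths.
sample = ["1,101,5.0", "1,102,3.0", "2,101,4.0"]
args = recommender_job_args("prefs.csv", "recs-out", max_prefs_per_user(sample))
print(" ".join(args))
```

The sketch only assembles the command and does not run it. As noted above,
raising these limits may slow the job down.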

2013/8/7 Rafal Lukawiecki <[email protected]>

> Is there a set of parameters which I could pass to RecommenderJob to avoid
> that random sampling, in order to create a test case for the issue I have
> experienced? Would setting --maxSimilaritiesPerItem and/or
> --maxPrefsPerUserInItemSimilarity help? Many thanks.
> 
> On 7 Aug 2013, at 16:12, Sebastian Schelter <[email protected]>
> wrote:
> 
> It could affect the results even in this case, as we also sample the
> preferences when computing similar items.
> 
> On 07.08.2013 17:07, Rafal Lukawiecki wrote:
>> Thank you, Sebastian. Would the random sampling affect the results of
>> RecommenderJob, in any case? I am setting --maxPrefsPerUser to exceed the
>> actual, maximum number of preferences expressed by every user.
>> 
>> Rafal
>> 
>> On 7 Aug 2013, at 15:48, Sebastian Schelter <[email protected]>
>> wrote:
>> 
>> The code in trunk allows you to specify a randomSeed, the older
>> versions don't unfortunately.
>> 
>> On 07.08.2013 16:35, Rafal Lukawiecki wrote:
>>> Hi Sebastian,
>>> 
>>> The quantity of returned "duplicates" is much too large to be caused
>>> just by sampling's randomness. I wonder if this could be related to
>>> something that is platform-specific, as in Windows vs. *nix representation
>>> of input files, data types etc.
>>> 
>>> For argument's sake, is it possible to fix the seed of the random
>>> aspect of the sampling so I could feed the same input through two platforms
>>> and compare the results?
>>> 
>>> Rafal
>>> 
>>> On 7 Aug 2013, at 15:20, Sebastian Schelter <[email protected]>
>>> wrote:
>>> 
>>> Hi Rafal,
>>> 
>>> this sounds really strange, the bug should not have anything to do with
>>> the version of Hadoop that you are running. You could sometimes not see
>>> it due to the random sampling of the preferences.
>>> 
>>> --sebastian
>>> 
>>> On 07.08.2013 13:53, Rafal Lukawiecki wrote:
>>>> Sebastian,
>>>> 
>>>> I've been doing a little more digging regarding the issue of
>>>> preferences being calculated for already preferred items. I re-ran the jobs
>>>> using the same data and the same parameters on a different installation of
>>>> Hadoop, and the problem seems to have gone away. For now it looks like the
>>>> issue arises when I run it under Mahout 0.7 and 0.8 using HDP (Hortonworks
>>>> Data Platform) for Windows 1.1.0, with Hadoop 1.1.0. This problem does not
>>>> show up, yet, in my tests under Hadoop 1.2.1 compiled for OS X. I will work
>>>> a little more to confirm my results, but if they stand up, should I still
>>>> report it as a Mahout issue?
>>>> 
>>>> Rafal
>>>> --
>>>> Rafal Lukawiecki
>>>> Strategic Consultant and Director
>>>> Project Botticelli Ltd
>>>> 
>>>> On 1 Aug 2013, at 17:31, Sebastian Schelter <[email protected]> wrote:
>>>> 
>>>> Setting it to the maximum number should be enough. Would be great if you
>>>> can share your dataset and tests.
>>>> 
>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>> 
>>>>> Should I have set that parameter to a value much much larger than the
>>>>> maximum number of actually expressed preferences by a user?
>>>>> 
>>>>> I'm working on an anonymised data set. If it works as an error test case,
>>>>> I'd be happy to share it for your re-test. I am still hoping it is my
>>>>> error, not Mahout's.
>>>>> 
>>>>> Rafal
>>>>> --
>>>>> Rafal Lukawiecki
>>>>> Pardon brevity, mobile device.
>>>>> 
>>>>> On 1 Aug 2013, at 17:19, "Sebastian Schelter" <[email protected]> wrote:
>>>>> 
>>>>>> Ok, please file a bug report detailing what you've tested and what
>>>>>> results you got.
>>>>>> 
>>>>>> Just to clarify, setting maxPrefsPerUser to a high number still does not
>>>>>> help? That surprises me.
>>>>>> 
>>>>>> 
>>>>>> 2013/8/1 Rafal Lukawiecki <[email protected]>
>>>>>> 
>>>>>>> Hi Sebastian,
>>>>>>> 
>>>>>>> I've rechecked the results, and I'm afraid that the issue has not gone
>>>>>>> away, contrary to my enthusiastic response yesterday. Using 0.8 I have
>>>>>>> retested with and without the --maxPrefsPerUser 9000 parameter (no user
>>>>>>> has more than 5000 prefs). I have also supplied the prefs file, without
>>>>>>> the preference value, that is as: user,item (one per line) as a
>>>>>>> --filterFile, with and without --maxPrefsPerUser, and I am afraid we are
>>>>>>> also seeing recommendations for items the user has expressed a prior
>>>>>>> preference for.
>>>>>>> 
>>>>>>> I suppose I need to file a bug report.
>>>>>>> 
>>>>>>> Rafal
>>>>>>> --
>>>>>>> Rafal Lukawiecki
>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>> 
>>>>>>> On 31 Jul 2013, at 22:35, "Rafal Lukawiecki" <[email protected]> wrote:
>>>>>>> 
>>>>>>>> Dear Sebastian,
>>>>>>>> 
>>>>>>>> It looks like setting --maxPrefsPerUser 10000 has resolved the issue
>>>>>>>> in our case. It seems that the most preferences a user had was just
>>>>>>>> about 5000, so I doubled it just in case, but when I operationalise
>>>>>>>> this model, I will make sure to calculate the actual max number of
>>>>>>>> preferences and set the parameter accordingly. I will double-check the
>>>>>>>> result set to make sure the issue is really gone, as I have only
>>>>>>>> checked the few cases where we have spotted a recommendation of a
>>>>>>>> previously preferred item.
>>>>>>>> 
>>>>>>>> Would you like me to file a bug, and would you like me to test it on
>>>>>>>> 0.8 or another version? I am using 0.7.
>>>>>>>> 
>>>>>>>> Thanks for your kind support.
>>>>>>>> Rafal
>>>>>>>> --
>>>>>>>> Rafal Lukawiecki
>>>>>>>> Strategic Consultant and Director
>>>>>>>> Project Botticelli Ltd
>>>>>>>> 
>>>>>>>> On 31 Jul 2013, at 06:22, Sebastian Schelter <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Hi Rafal,
>>>>>>>> 
>>>>>>>> can you try to set the option --maxPrefsPerUser to the maximum number
>>>>>>>> of interactions per user and see if you still get the error?
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Sebastian
>>>>>>>> 
>>>>>>>> On 30.07.2013 19:29, Rafal Lukawiecki wrote:
>>>>>>>>> Thank you Sebastian. The data set is not that large, as we are
>>>>>>>>> running tests on a subset. It is about 24k users, 40k items, the
>>>>>>>>> preference file has 65k preferences as triples. This was using
>>>>>>>>> Similarity Cooccurrence.
>>>>>>>>> 
>>>>>>>>> I can see if I could anonymise the data set to share if that would be
>>>>>>>>> helpful.
>>>>>>>>> 
>>>>>>>>> Thanks for your kind help.
>>>>>>>>> 
>>>>>>>>> Rafal
>>>>>>>>> --
>>>>>>>>> Rafal Lukawiecki
>>>>>>>>> Pardon my brevity, sent from a telephone.
>>>>>>>>> 
>>>>>>>>> On 30 Jul 2013, at 18:18, "Sebastian Schelter" <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>>> Hi Rafal,
>>>>>>>>>> 
>>>>>>>>>> can you issue a ticket for this problem at
>>>>>>>>>> https://issues.apache.org/jira/browse/MAHOUT ? We have unit-tests
>>>>>>>>>> that check whether this happens and currently they work fine. I can
>>>>>>>>>> only imagine that the problem occurs in larger datasets where we
>>>>>>>>>> sample the data in some places. Can you describe a scenario/dataset
>>>>>>>>>> where this happens?
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Sebastian
>>>>>>>>>> 
>>>>>>>>>> 2013/7/30 Rafal Lukawiecki <[email protected]>
>>>>>>>>>> 
>>>>>>>>>>> I'm new here, just registered. Many thanks to everyone for working
>>>>>>>>>>> on an amazing piece of software, thank you for building Mahout and
>>>>>>>>>>> for your support. My apologies if this is not the right place to
>>>>>>>>>>> ask the question. I have searched for the issue, and I can see this
>>>>>>>>>>> problem has been reported here:
>>>>>>>>>>> http://stackoverflow.com/questions/13822455/apache-mahout-distributed-recommender-recommends-already-rated-items
>>>>>>>>>>> 
>>>>>>>>>>> Unfortunately, the trail leads to the newsgroups, and I have not
>>>>>>> found a
>>>>>>>>>>> way, yet, to get an answer from them, without asking you.
>>>>>>>>>>> 
>>>>>>>>>>> Essentially, I am running a Hadoop RecommenderJob from Mahout 0.7,
>>>>>>>>>>> and I am finding that it is recommending items that the user has
>>>>>>>>>>> already expressed a preference for in their input file. I
>>>>>>>>>>> understand that this should not be happening, and I am not sure if
>>>>>>>>>>> there is a known fix or if I should be looking for a workaround
>>>>>>>>>>> (such as using the entire input as the filterFile).
>>>>>>>>>>> 
>>>>>>>>>>> I will double-check that there is no error on my side, but so far
>>>>>>>>>>> it does not seem that way.
>>>>>>>>>>> 
>>>>>>>>>>> Many thanks and my regards from Ireland,
>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> 
>>>>>>>>>>> Rafal Lukawiecki
>>>>>>>>>>> 
>>>>>>>>>>> Strategic Consultant and Director
>>>>>>>>>>> 
>>>>>>>>>>> Project Botticelli Ltd

