Re: Problem with ItemSimilarityJob, empty part-r-00000

Quentin-Gabriel Thurier Wed, 22 Jan 2014 02:55:16 -0800

Hi Sebastian,

Awesome! It works now (locally and on AWS). For my business case I just have
to set --maxPrefsPerUser to a value higher than the total number of
tracks.
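
A rough way to pick that value is to count the rows per ID in the first
column (the "user" column as Mahout reads it, feature_id in this input) and
take the maximum. The sketch below assumes a comma-separated
userID,itemID,value file; the function name max_prefs_per_user and the
in-memory sample are illustrative only and not part of Mahout.

```python
from collections import Counter
import csv
import io


def max_prefs_per_user(lines):
    """Return the largest number of rows that share the same first-column ID."""
    counts = Counter()
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        counts[row[0]] += 1
    return max(counts.values()) if counts else 0


# Sample rows taken from the thread: 6 feature IDs, 2 tracks.
sample = io.StringIO(
    "1,524572804,192.522\n"
    "2,524572804,0.0\n"
    "3,524572804,0.0\n"
    "4,524572804,-5.902\n"
    "5,524572804,-1.0\n"
    "6,524572804,123.756\n"
    "1,-1821170097,269.81833\n"
    "2,-1821170097,0.0\n"
    "3,-1821170097,0.0\n"
    "4,-1821170097,-13.496\n"
    "5,-1821170097,0.26586103\n"
    "6,-1821170097,86.643\n"
)

# Any --maxPrefsPerUser value above this number should avoid the downsampling.
print(max_prefs_per_user(sample))
```

With this layout every feature ID has one row per track, so the maximum
equals the number of tracks, which is why the value has to exceed it.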


Many thanks for your help!

Quentin


2014/1/21 Sebastian Schelter <[email protected]>

> Hi Quentin,
>
> I could reproduce what you report. There seems to be a bug in the
> downsampling code of version 0.7 (if a user has more than a given number of
> preferences, we downsample them).
>
> If you specify the additional parameter --maxPrefsPerUser and set it to a
> value higher than the maximum number of interactions per user in your data,
> you should not have a problem. Simply try a very large number.
>
> --sebastian
>
>
> On 01/21/2014 07:10 PM, Quentin-Gabriel Thurier wrote:
>
>> The point is that I'm using the Cloudera pseudo-distributed distribution
>> and I think mahout-core-0.7-cdh4.5.0-job.jar is the most up-to-date Mahout
>> version for cdh4.
>>
>>
>> 2014/1/21 Sebastian Schelter <[email protected]>
>>
>>> I ran your example file with the current trunk and got results. Can you
>>> try to upgrade or are you bound to 0.7? If the latter is the case, I can
>>> rerun the test with 0.7.
>>>
>>> --sebastian
>>>
>>>
>>>
>>> On 01/21/2014 05:35 PM, Quentin-Gabriel Thurier wrote:
>>>
>>>> I'm using mahout-examples-0.7-cdh4.5.0-job.jar locally. But I tried on
>>>> EMR (with mahout-examples-0.8-job.jar this time) on 3000 tracks and I
>>>> also had empty result files. Should I send you the dataset at your
>>>> Apache address (it is only 140 KB)?
>>>>
>>>> Quentin
>>>>
>>>>
>>>> 2014/1/21 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hmm, strange. Which version of Mahout are you using? Do you run the
>>>>> 1200 tracks job locally or on a cluster? Can you share your input file
>>>>> (in private)?
>>>>>
>>>>> --sebastian
>>>>>
>>>>>
>>>>>
>>>>> On 01/21/2014 02:34 PM, Quentin-Gabriel Thurier wrote:
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> I tested the job on a tiny example (50 tracks):
>>>>>>
>>>>>> mahout itemsimilarity --input input/msd_sample/mahout5 --output
>>>>>> output/mahout5 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE
>>>>>> --booleanData false --maxSimilaritiesPerItem 1
>>>>>>
>>>>>> *1st row of the output:
>>>>>>
>>>>>> -2135949055     -335737401      0.09939478338891584
>>>>>>
>>>>>> *related rows from the input:
>>>>>>
>>>>>> 1,-2135949055,230.42567
>>>>>> 2,-2135949055,0.0
>>>>>> 3,-2135949055,0.0
>>>>>> 4,-2135949055,-3.96
>>>>>> 5,-2135949055,-1.0
>>>>>> 6,-2135949055,96.897
>>>>>> 1,-335737401,222.35384
>>>>>> 2,-335737401,0.0
>>>>>> 3,-335737401,0.0
>>>>>> 4,-335737401,-5.232
>>>>>> 5,-335737401,-1.0
>>>>>> 6,-335737401,100.812
>>>>>>
>>>>>> This is correct:
>>>>>> 1/(1+sqrt((230.42567-222.35384)^2+(-3.96-(-5.232))^2+(96.897-100.812)^2))
>>>>>> = 0.09939483
>>>>>>
>>>>>> I don't get any exceptions, only the usual warning: WARN
>>>>>> mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
>>>>>> Applications should implement Tool for the same.
>>>>>>
>>>>>> Then I took 1200 tracks (the previous 50 are included in the 1200). The
>>>>>> job doesn't fail, but part-r-00000 is empty. As before, I only get the
>>>>>> warning, and the input looks like:
>>>>>>
>>>>>> 1,524572804,192.522
>>>>>> 2,524572804,0.0
>>>>>> 3,524572804,0.0
>>>>>> 4,524572804,-5.902
>>>>>> 5,524572804,-1.0
>>>>>> 6,524572804,123.756
>>>>>> 1,-1821170097,269.81833
>>>>>> 2,-1821170097,0.0
>>>>>> 3,-1821170097,0.0
>>>>>> 4,-1821170097,-13.496
>>>>>> 5,-1821170097,0.26586103
>>>>>> 6,-1821170097,86.643
>>>>>>
>>>>>> Quentin
>>>>>>
>>>>>>
>>>>>> 2014/1/21 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> Hi Quentin,
>>>>>>>
>>>>>>> Have you checked the log to ensure that you don't get any exceptions
>>>>>>> during the computation?
>>>>>>>
>>>>>>> Could you test the job with a tiny example where you can calculate
>>>>>>> the result by hand?
>>>>>>>
>>>>>>> Can you share an input file on which this job fails?
>>>>>>>
>>>>>>> --sebastian
>>>>>>>
>>>>>>>
>>>>>>> On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote:
>>>>>>>
>>>>>>>> I'm running into a few problems with Mahout that I can't sort out.
>>>>>>>>
>>>>>>>> The context is that I'm trying to calculate pairwise Euclidean
>>>>>>>> distances between music tracks based on 6 audio features per track.
>>>>>>>> My input for the Mahout job is a text file which looks like this:
>>>>>>>>
>>>>>>>> feature_id,track_id,feature_value
>>>>>>>> <integer>,<integer>,<double>
>>>>>>>>
>>>>>>>> This command works locally for less than 600 tracks (based on
>>>>>>>> mahout-core-0.7-cdh4.5.0-job.jar):
>>>>>>>>
>>>>>>>> mahout itemsimilarity --input input/msd_sample/mahout --output
>>>>>>>> output/mahout --similarityClassname
>>>>>>>> SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false
>>>>>>>> --maxSimilaritiesPerItem 1
>>>>>>>>
>>>>>>>> But for more tracks I get an empty part-r-00000 file. I tried to
>>>>>>>> decrease the --threshold parameter but I still don't get any results.
>>>>>>>>
>>>>>>>> I also tried to launch the job on AWS EMR with the equivalent input
>>>>>>>> for 3000 tracks (based on mahout-core-0.8-job.jar):
>>>>>>>>
>>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>>>>>> --input s3n://hadoop-filrouge/input/msd-sample/mahout
>>>>>>>> --output s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity
>>>>>>>> --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE
>>>>>>>> --booleanData false --maxSimilaritiesPerItem 1
>>>>>>>>
>>>>>>>> The job runs successfully but I get 17 empty part-r-000xx files.
>>>>>>>>
>>>>>>>> I'm totally stuck right now and I'm running out of ideas to fix this
>>>>>>>> issue. So if anybody has even a small idea of what is going on, that
>>>>>>>> could really help.
>>>>>>>>
>>>>>>>> Many thanks,
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
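
The first output row quoted above (about 0.0994 for tracks -2135949055 and
-335737401) is consistent with a similarity of 1 / (1 + Euclidean distance)
over the six feature values. The sketch below reproduces that number from
the quoted input rows; the function name euclidean_similarity is
illustrative, and the formula is inferred from the quoted output rather
than taken from the Mahout source.

```python
import math


def euclidean_similarity(a, b):
    """Return 1 / (1 + Euclidean distance) for two equal-length vectors."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


# Feature values 1..6 for the two tracks, copied from the quoted input rows.
track_a = [230.42567, 0.0, 0.0, -3.96, -1.0, 96.897]    # track -2135949055
track_b = [222.35384, 0.0, 0.0, -5.232, -1.0, 100.812]  # track -335737401

print(euclidean_similarity(track_a, track_b))
```

The printed value agrees with the job output to about six decimal places;
the remaining difference is the kind of gap floating-point precision would
explain.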
