Hi Sebastian,

Awesome! It works now (locally and on AWS). For my business case I just have to set --maxPrefsPerUser to a value higher than the total number of tracks.
Many thanks for your help!

Quentin


2014/1/21 Sebastian Schelter <[email protected]>

> Hi Quentin,
>
> I could reproduce what you report. There seems to be a bug in the
> downsampling code of version 0.7 (if a user has more than a given number
> of preferences, we downsample them).
>
> If you specify the additional parameter --maxPrefsPerUser and set it to a
> value higher than the maximum number of interactions per user in your
> data, you should not have a problem. Simply try a very large number.
>
> --sebastian
>
> On 01/21/2014 07:10 PM, Quentin-Gabriel Thurier wrote:
>
>> The point is that I'm using the Cloudera pseudo-distributed distribution
>> and I think mahout-core-0.7-cdh4.5.0-job.jar is the up-to-date mahout
>> version for cdh4.
>>
>> 2014/1/21 Sebastian Schelter <[email protected]>
>>
>>> I ran your example file with the current trunk and got results. Can you
>>> try to upgrade or are you bound to 0.7? If the latter is the case, I can
>>> rerun the test with 0.7.
>>>
>>> --sebastian
>>>
>>> On 01/21/2014 05:35 PM, Quentin-Gabriel Thurier wrote:
>>>
>>>> I'm using mahout-examples-0.7-cdh4.5.0-job.jar locally. But I tried on
>>>> EMR (with mahout-examples-0.8-job.jar this time) on 3000 tracks and I
>>>> also had empty result files. Should I send you the dataset on your
>>>> apache address (it is only 140 KB)?
>>>>
>>>> Quentin
>>>>
>>>> 2014/1/21 Sebastian Schelter <[email protected]>
>>>>
>>>>> Hmm, strange. Which version of mahout are you using? Do you run the
>>>>> 1200 tracks job locally or on a cluster? Can you share your input file
>>>>> (in private)?
>>>>>
>>>>> --sebastian
>>>>>
>>>>> On 01/21/2014 02:34 PM, Quentin-Gabriel Thurier wrote:
>>>>>
>>>>>> Hi Sebastian,
>>>>>>
>>>>>> I tested the job on a tiny example (50 tracks):
>>>>>>
>>>>>> mahout itemsimilarity --input input/msd_sample/mahout5 --output
>>>>>> output/mahout5 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE
>>>>>> --booleanData false --maxSimilaritiesPerItem 1
>>>>>>
>>>>>> * 1st row of the output:
>>>>>>
>>>>>> -2135949055 -335737401 0.09939478338891584
>>>>>>
>>>>>> * related rows from the input:
>>>>>>
>>>>>> 1,-2135949055,230.42567
>>>>>> 2,-2135949055,0.0
>>>>>> 3,-2135949055,0.0
>>>>>> 4,-2135949055,-3.96
>>>>>> 5,-2135949055,-1.0
>>>>>> 6,-2135949055,96.897
>>>>>> 1,-335737401,222.35384
>>>>>> 2,-335737401,0.0
>>>>>> 3,-335737401,0.0
>>>>>> 4,-335737401,-5.232
>>>>>> 5,-335737401,-1.0
>>>>>> 6,-335737401,100.812
>>>>>>
>>>>>> This is correct:
>>>>>> 1/(1+sqrt((230.42567-222.35384)^2+(-3.96--5.232)^2+(96.897-100.812)^2))
>>>>>> = 0.09939483
>>>>>>
>>>>>> I don't have any exception except the usual warning: WARN
>>>>>> mapred.JobClient: Use GenericOptionsParser for parsing the arguments.
>>>>>> Applications should implement Tool for the same.
>>>>>>
>>>>>> Then I take 1200 tracks (the previous 50 are included in the 1200): the
>>>>>> job doesn't fail, but part-r-00000 is empty.
>>>>>> As previously I only have a warning, and the input looks like:
>>>>>>
>>>>>> 1,524572804,192.522
>>>>>> 2,524572804,0.0
>>>>>> 3,524572804,0.0
>>>>>> 4,524572804,-5.902
>>>>>> 5,524572804,-1.0
>>>>>> 6,524572804,123.756
>>>>>> 1,-1821170097,269.81833
>>>>>> 2,-1821170097,0.0
>>>>>> 3,-1821170097,0.0
>>>>>> 4,-1821170097,-13.496
>>>>>> 5,-1821170097,0.26586103
>>>>>> 6,-1821170097,86.643
>>>>>>
>>>>>> Quentin
>>>>>>
>>>>>> 2014/1/21 Sebastian Schelter <[email protected]>
>>>>>>
>>>>>>> Hi Quentin,
>>>>>>>
>>>>>>> Have you checked the log to ensure that you don't get any exceptions
>>>>>>> during the computation?
>>>>>>>
>>>>>>> Could you test the job with a tiny example where you can calculate
>>>>>>> the result by hand?
>>>>>>>
>>>>>>> Can you share an input file on which this job fails?
>>>>>>>
>>>>>>> --sebastian
>>>>>>>
>>>>>>> On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote:
>>>>>>>
>>>>>>>> I've run into a few problems with Mahout that I can't sort out.
>>>>>>>>
>>>>>>>> The context is that I'm trying to calculate pairwise Euclidean
>>>>>>>> distances between music tracks based on 6 audio features per track.
>>>>>>>> My input for the mahout job is a text file which looks like this:
>>>>>>>>
>>>>>>>> feature_id,track_id,feature_value
>>>>>>>> <integer>,<integer>,<double>
>>>>>>>>
>>>>>>>> This command works locally for fewer than 600 tracks (based on
>>>>>>>> mahout-core-0.7-cdh4.5.0-job.jar):
>>>>>>>>
>>>>>>>> mahout itemsimilarity --input input/msd_sample/mahout --output
>>>>>>>> output/mahout --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE
>>>>>>>> --booleanData false --maxSimilaritiesPerItem 1
>>>>>>>>
>>>>>>>> But for more tracks I get an empty file part-r-00000. I tried to
>>>>>>>> decrease the --threshold parameter but I still don't have any result.
>>>>>>>>
>>>>>>>> I also tried to launch the job on AWS EMR with the equivalent input
>>>>>>>> for 3000 tracks (based on mahout-core-0.8-job.jar):
>>>>>>>>
>>>>>>>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>>>>>>>> --input s3n://hadoop-filrouge/input/msd-sample/mahout --output
>>>>>>>> s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity
>>>>>>>> --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData
>>>>>>>> false --maxSimilaritiesPerItem 1
>>>>>>>>
>>>>>>>> The job runs successfully but I get 17 empty part-r-000xx files.
>>>>>>>>
>>>>>>>> I'm totally stuck right now and I'm running out of ideas to fix this
>>>>>>>> issue. So if anybody has even a little idea of what is going on, that
>>>>>>>> could really help.
>>>>>>>>
>>>>>>>> Many thanks,
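For anyone who wants to repeat the hand check from this thread, here is a minimal Python sketch. It is not Mahout code. It assumes the similarity is 1 / (1 + d), where d is the Euclidean distance over the feature ids that both tracks have, which is what reproduces the 0.0993948 reported above (without the square root the same numbers give about 0.012). The second helper counts how many rows share one id in the first column, since that is the figure --maxPrefsPerUser has to exceed. The names parse, euclidean_similarity and max_prefs_per_user are made up for this illustration.

```python
import math
from collections import Counter, defaultdict

# Rows copied from the thread: feature_id,track_id,feature_value.
# Mahout reads the first column as the user id and the second as the item id.
ROWS = """\
1,-2135949055,230.42567
2,-2135949055,0.0
3,-2135949055,0.0
4,-2135949055,-3.96
5,-2135949055,-1.0
6,-2135949055,96.897
1,-335737401,222.35384
2,-335737401,0.0
3,-335737401,0.0
4,-335737401,-5.232
5,-335737401,-1.0
6,-335737401,100.812
"""


def parse(text):
    """Yield (user_id, item_id, value) triples from CSV text."""
    for line in text.splitlines():
        if line.strip():
            user, item, value = line.split(",")
            yield int(user), int(item), float(value)


def euclidean_similarity(triples, item_a, item_b):
    """Return 1 / (1 + d), d being the Euclidean distance over shared users.

    Assumption: only users present for both items contribute to d.
    """
    by_item = defaultdict(dict)
    for user, item, value in triples:
        by_item[item][user] = value
    shared = by_item[item_a].keys() & by_item[item_b].keys()
    squared = sum((by_item[item_a][u] - by_item[item_b][u]) ** 2 for u in shared)
    return 1.0 / (1.0 + math.sqrt(squared))


def max_prefs_per_user(triples):
    """Largest number of rows that share one user id (first column)."""
    counts = Counter(user for user, _, _ in triples)
    return max(counts.values())


triples = list(parse(ROWS))
sim = euclidean_similarity(triples, -2135949055, -335737401)
print(sim)                          # close to the 0.0993948 from the thread
print(max_prefs_per_user(triples))  # rows per feature id in this sample
```

With this input layout every feature id carries one row per track, so the count returned by max_prefs_per_user grows with the number of tracks. That is consistent with the observation that setting --maxPrefsPerUser above the total number of tracks avoids the empty output.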
