Hi Sebastian,

I tested the job on a tiny example (50 tracks):

>mahout itemsimilarity --input input/msd_sample/mahout5 --output output/mahout5 --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false --maxSimilaritiesPerItem 1

* 1st row of the output:

-2135949055 -335737401 0.09939478338891584

* Related rows from the input:

1,-2135949055,230.42567
2,-2135949055,0.0
3,-2135949055,0.0
4,-2135949055,-3.96
5,-2135949055,-1.0
6,-2135949055,96.897
1,-335737401,222.35384
2,-335737401,0.0
3,-335737401,0.0
4,-335737401,-5.232
5,-335737401,-1.0
6,-335737401,100.812

This is correct:

1/(1+sqrt((230.42567-222.35384)^2+(-3.96-(-5.232))^2+(96.897-100.812)^2)) = 0.09939483

I don't get any exceptions, only the usual warning:

WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.

Then I took 1200 tracks (the previous 50 are included in the 1200). The job doesn't fail, but part-r-00000 is empty. As before, I only get the warning, and the input looks like:

1,524572804,192.522
2,524572804,0.0
3,524572804,0.0
4,524572804,-5.902
5,524572804,-1.0
6,524572804,123.756
1,-1821170097,269.81833
2,-1821170097,0.0
3,-1821170097,0.0
4,-1821170097,-13.496
5,-1821170097,0.26586103
6,-1821170097,86.643

Quentin

2014/1/21 Sebastian Schelter <[email protected]>

> Hi Quentin,
>
> Have you checked the log to ensure that you don't get any exceptions
> during the computation?
>
> Could you test the job with a tiny example where you can calculate the
> result by hand?
>
> Can you share an input file on which this job fails?
>
> --sebastian
>
>
> On 01/21/2014 11:22 AM, Quentin-Gabriel Thurier wrote:
>
>> I've run into a few problems with Mahout that I can't sort out.
>>
>> The context is that I'm trying to calculate pairwise Euclidean distances
>> between music tracks based on 6 audio features per track.
>> My input for the mahout job is a text file which looks like this:
>>
>> feature_id,track_id,feature_value
>> <integer>,<integer>,<double>
>>
>> This command works locally for fewer than 600 tracks (based on
>> mahout-core-0.7-cdh4.5.0-job.jar):
>>
>> mahout itemsimilarity --input input/msd_sample/mahout --output
>> output/mahout --similarityClassname
>> SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false
>> --maxSimilaritiesPerItem 1
>>
>> But for more tracks I get an empty file part-r-00000. I tried to decrease
>> the --threshold parameter but I still don't get any results.
>>
>> I also tried to launch the job on AWS EMR with the equivalent input for
>> 3000 tracks (based on mahout-core-0.8-job.jar):
>>
>> org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob
>> --input
>> s3n://hadoop-filrouge/input/msd-sample/mahout --output
>> s3n://hadoop-filrouge/output/mahout/01202014-itemsimilarity
>> --similarityClassname SIMILARITY_EUCLIDEAN_DISTANCE --booleanData false
>> --maxSimilaritiesPerItem 1
>>
>> The job runs successfully but I get 17 empty part-r-000xx files.
>>
>> I'm totally stuck right now and I'm running out of ideas to fix this issue.
>> So if anybody has even a small idea of what is going on, that could
>> really help.
>>
>> Many thanks,
>>
>>
>
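
For anyone repeating the hand check on the 50-track sample, here is a minimal Python sketch. It assumes the similarity is 1/(1 + d), with d the Euclidean distance between the two tracks' six feature values, which is the form that reproduces the value in the first output row. The function name is made up for illustration and is not part of Mahout.

```python
import math

# Feature values (feature ids 1..6) copied from the input rows of the two tracks.
track_a = [230.42567, 0.0, 0.0, -3.96, -1.0, 96.897]     # track -2135949055
track_b = [222.35384, 0.0, 0.0, -5.232, -1.0, 100.812]   # track -335737401


def euclidean_similarity(a, b):
    """Assumed formula: 1 / (1 + Euclidean distance between a and b)."""
    distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return 1.0 / (1.0 + distance)


similarity = euclidean_similarity(track_a, track_b)
print(similarity)
```

The printed value can be compared against the third column of the matching row in part-r-00000; small differences in the last digits are expected if the job stores the feature values in single precision.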
