1) How many cores are in the cluster? The whole idea behind MapReduce is that you buy more CPUs and get a nearly linear decrease in runtime.
2) What is your Mahout command line, with options? Or how are you invoking Mahout? I have seen the Mahout MapReduce recommender take this long, so we should check what you are doing with downsampling.
3) Do you really need to RANK your IDs? That's a full sort. When using Pig I usually take the DISTINCT ones and assign each an incrementing integer as the corresponding Mahout ID.
4) Your #2, assigning different weights to different actions, usually does not work. I've done this before, compared offline metrics, and seen precision go down. I'd get this working using only your primary action first. What are you trying to get the user to do: view something, buy something? Use that action as the primary preference and start out with a weight of 1 using LLR. With LLR the weights are not used anyway (see the sketch below), so your data may not produce good results with mixed actions.
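To illustrate point 4, here is a rough Python sketch of the log-likelihood ratio over a 2x2 cooccurrence table (the same statistic Mahout's LLR similarity is based on, though this is not Mahout's code). Note that the inputs are counts of users, so whether a view was weighted 3 or 5 never enters the calculation:

import math

def x_log_x(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # k11: users who acted on both items A and B
    # k12: users who acted on A but not B
    # k21: users who acted on B but not A
    # k22: users who acted on neither
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

# only counts of users go in -- the preference values are invisible here
print(llr(100, 900, 900, 98100))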
A plug for the (admittedly pre-alpha) spark-itemsimilarity:
1) The triples from your step #2 can be ingested directly and it will create output from them.
2) Multiple actions can be used via cross-cooccurrence, not by guessing at weights.
3) The output has your application-specific IDs preserved.
4) It's about 10x faster than MapReduce and will do away with your ID translation steps.
One caveat is that your cluster machines will need lots of memory. I have 8-16g on mine.

On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:

1. I collect preferences for items using a 60-day sliding window (today - 60 days).
2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an item view, 5 for clicking the recommendation block; the idea is to give more value to recommendations which attract visitor attention). I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
3. I use the Apache Pig RANK function to rank all distinct user_ids.
4. I do the same for item_id.
5. I join the input dataset with the ranked datasets and provide input to Mahout as dense integer user_id, item_id.
6. I take the Mahout output and join the integer item_id back to get the natural key value.

Steps #1-2 take ~40 min.
Steps #3-5 take ~1 hour.
The Mahout calc takes ~3 hours.

2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:

> This really doesn't sound right. It should be possible to process almost a
> thousand times that much data every night without that much problem.
>
> How are you preparing the input data?
>
> How are you converting to Mahout IDs?
>
> Even using Python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
>
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]>
> wrote:
>
>> Hi, we are trying to calculate ItemSimilarity.
>> Right now we have 2*10^7 input lines. I provide the input data as raw text
>> each day to recalculate item similarities. We get +100..1000 new items
>> each day.
>> 1. It takes too much time to prepare the input data.
>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>>
>> Is there any possibility to provide data to the Mahout MapReduce
>> ItemSimilarity using some binary format with compression?
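To make the conversion Ted mentions concrete, here is a minimal Python sketch of a single-pass, dictionary-based ID translation (the file names, tab delimiter, and column order are assumptions for illustration; ~3M distinct IDs fit comfortably in memory):

import csv

user_ids, item_ids = {}, {}

def dense_id(table, key):
    # assign the next dense integer the first time a key is seen
    return table.setdefault(key, len(table))

# prefs.tsv is assumed to hold user_id<TAB>item_id<TAB>pref lines
with open('prefs.tsv') as src, open('prefs_mahout.tsv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='\t')
    for user, item, pref in csv.reader(src, delimiter='\t'):
        writer.writerow([dense_id(user_ids, user), dense_id(item_ids, item), pref])

# keep the reverse item map so step #6 can join Mahout IDs back to natural keys
with open('item_id_map.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    for item, idx in item_ids.items():
        writer.writerow([idx, item])

One pass, no sort, no join; something like this replaces steps #3-5 entirely.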
