1) How many cores are in the cluster? The whole idea behind MapReduce is that you buy more CPUs and get a nearly linear decrease in runtime.
2) What is your Mahout command line, with options? Or how are you invoking Mahout? I have seen the Mahout MapReduce recommender take this long, so we should check what you are doing with downsampling.
3) Do you really need to RANK your IDs? That's a full sort. When using Pig I usually take the DISTINCT ones and assign each an incrementing integer as the corresponding Mahout ID.
4) Your #2, assigning different weights to different actions, usually does not work. I've done this before, compared offline metrics, and seen precision go down. I'd get this working using only your primary action first. What are you trying to get the user to do: view something, buy something? Use that action as the primary preference and start out with a weight of 1 using LLR. With LLR the weights are not used anyway (see the sketch below), so your data may not produce good results with mixed actions.
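To illustrate point 4, here is a rough Python sketch of the log-likelihood ratio over a 2x2 cooccurrence table (the same statistic Mahout's LLR similarity is based on, though this is not Mahout's code). Note that the inputs are counts of users, so whether a view was weighted 3 or 5 never enters the calculation:

import math

def x_log_x(x):
    # x * log(x), with the convention 0 * log(0) = 0
    return x * math.log(x) if x > 0 else 0.0

def entropy(*counts):
    # unnormalized Shannon entropy of a list of counts
    return x_log_x(sum(counts)) - sum(x_log_x(k) for k in counts)

def llr(k11, k12, k21, k22):
    # k11: users who acted on both items A and B
    # k12: users who acted on A but not B
    # k21: users who acted on B but not A
    # k22: users who acted on neither
    row_entropy = entropy(k11 + k12, k21 + k22)
    col_entropy = entropy(k11 + k21, k12 + k22)
    matrix_entropy = entropy(k11, k12, k21, k22)
    return 2.0 * (row_entropy + col_entropy - matrix_entropy)

# only counts of users go in -- the preference values are invisible here
print(llr(100, 900, 900, 98100))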
A plug for the (admittedly pre-alpha) spark-itemsimilarity:
1) The triples from your step #2 can be ingested directly and it will create output from them.
2) Multiple actions can be used via cross-cooccurrence, not by guessing at weights.
3) The output has your application-specific IDs preserved.
4) It's about 10x faster than MapReduce and will do away with your ID translation steps.
One caveat is that your cluster machines will need lots of memory. I have 8-16g on mine.

On Aug 17, 2014, at 1:26 AM, Serega Sheypak <[email protected]> wrote:

1. I collect preferences for items using a 60-day sliding window (today - 60 days).
2. I prepare triples of user_id, item_id, discrete_pref_value (3 for an item view, 5 for clicking the recommendation block; the idea is to give more value to recommendations which attract visitor attention). I get ~20,000,000 lines with ~1,000,000 distinct items and ~2,000,000 distinct users.
3. I use the Apache Pig RANK function to rank all distinct user_ids.
4. I do the same for item_id.
5. I join the input dataset with the ranked datasets and provide input to Mahout as dense integer user_id, item_id.
6. I take the Mahout output and join the integer item_id back to get the natural key value.

Steps #1-2 take ~40 min.
Steps #3-5 take ~1 hour.
The Mahout calc takes ~3 hours.

2014-08-17 10:45 GMT+04:00 Ted Dunning <[email protected]>:

> This really doesn't sound right. It should be possible to process almost a
> thousand times that much data every night without that much problem.
>
> How are you preparing the input data?
>
> How are you converting to Mahout IDs?
>
> Even using Python, you should be able to do the conversion in just a few
> minutes without any parallelism whatsoever.
>
> On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]>
> wrote:
>
>> Hi, we are trying to calculate ItemSimilarity.
>> Right now we have 2*10^7 input lines. I provide the input data as raw text
>> each day to recalculate item similarities. We get +100..1000 new items
>> each day.
>> 1. It takes too much time to prepare the input data.
>> 2. It takes too much time to convert user_id, item_id to Mahout IDs.
>>
>> Is there any possibility to provide data to the Mahout MapReduce
>> ItemSimilarity using some binary format with compression?
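To make the conversion Ted mentions concrete, here is a minimal Python sketch of a single-pass, dictionary-based ID translation (the file names, tab delimiter, and column order are assumptions for illustration; ~3M distinct IDs fit comfortably in memory):

import csv

user_ids, item_ids = {}, {}

def dense_id(table, key):
    # assign the next dense integer the first time a key is seen
    return table.setdefault(key, len(table))

# prefs.tsv is assumed to hold user_id<TAB>item_id<TAB>pref lines
with open('prefs.tsv') as src, open('prefs_mahout.tsv', 'w', newline='') as dst:
    writer = csv.writer(dst, delimiter='\t')
    for user, item, pref in csv.reader(src, delimiter='\t'):
        writer.writerow([dense_id(user_ids, user), dense_id(item_ids, item), pref])

# keep the reverse item map so step #6 can join Mahout IDs back to natural keys
with open('item_id_map.tsv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter='\t')
    for item, idx in item_ids.items():
        writer.writerow([idx, item])

One pass, no sort, no join; something like this replaces steps #3-5 entirely.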
