Re: mapreduce ItemSimilarity input optimization

Ted Dunning Sat, 16 Aug 2014 23:47:27 -0700

This really doesn't sound right.  It should be possible to process almost a
thousand times that much data every night without that much problem.

How are you preparing the input data?

How are you converting to Mahout IDs?

Even using python, you should be able to do the conversion in just a few
minutes without any parallelism whatsoever.
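
For illustration, here is a minimal single-process sketch of that kind of
conversion. It assumes the raw input is comma-separated text of the form
user,item[,preference] with arbitrary string keys, and that the goal is
dense integer IDs. The function name convert_ids and the sample keys are
hypothetical, not part of Mahout.

```python
def convert_ids(lines, sep=","):
    """Map arbitrary user/item keys to dense integer IDs in one pass.

    Assumed input: one 'user<sep>item[<sep>preference]' record per line.
    Returns the rewritten lines plus both lookup dictionaries.
    """
    user_index = {}
    item_index = {}
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split(sep)
        user, item = fields[0], fields[1]
        # setdefault hands out the next free integer the first time a key is seen
        uid = user_index.setdefault(user, len(user_index))
        iid = item_index.setdefault(item, len(item_index))
        # keep any trailing fields (e.g. a preference value) unchanged
        out.append(sep.join([str(uid), str(iid)] + fields[2:]))
    return out, user_index, item_index


rows, users, items = convert_ids(["alice,book-1,1.0", "bob,book-2", "alice,book-2"])
```

If the two dictionaries are saved between nightly runs, only the new items
and users need fresh IDs, and the work per run stays a single pass over the
input with dictionary lookups.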

On Sat, Aug 16, 2014 at 5:10 AM, Serega Sheypak <[email protected]>
wrote:

> Hi, we are trying to calculate ItemSimilarity.
> Right now we have 2*10^7 input lines. I do provide input data as raw text
> each day to recalculate item similarities. We do get +100..1000 new items
> each day.
> 1. It takes too much time to prepare input data.
> 2. It takes too much time to convert user_id, item_id to mahout ids
>
> Is there any possibility to provide data to mahout mapreduce
> ItemSimilarity using some binary format with compression?
>
