If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
I think you're missing an "M", and underestimating the features by an
order of magnitude.
That's still 1 day on an 8-core machine by this rule of thumb.
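
As a sanity check on that arithmetic, here's a back-of-envelope sketch in
Java (the sizes are the ones above; the 1 Gflop/s-per-core figure is Ted's
rule of thumb from below):

    // Back-of-envelope: wall-clock time for U * M' at ~1 Gflop/s per core.
    public class FlopEstimate {
        public static void main(String[] args) {
            double users = 2.6e6, items = 2.8e6, features = 100;
            double flops = users * items * features;      // ~7.3e14 = 728 Tflop
            double flopsPerSec = 8 * 1e9;                 // 8 cores at 1 Gflop/s
            double hours = flops / flopsPerSec / 3600.0;  // ~25 hours, about a day
            System.out.printf("%.0f Tflop, %.0f hours on 8 cores%n",
                              flops / 1e12, hours);
        }
    }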

The 80 hours is the model building time too (right?), not just the time to
multiply U*M'. That time is dominated by iterations when building from
scratch, which I expect took 75% of that 80 hours. So if the multiply was
20 hours -- on 10 machines -- on Hadoop, then that's still slow but not out
of the question for Hadoop, given Hadoop is usually a 3-6x slowdown over a
parallel in-core implementation.

I'm pretty sure what exists in Mahout here can be optimized further at the
Hadoop level; I don't know that it's doing the multiply badly, though. In
fact I'm pretty sure it's caching columns in memory, which is a bit of
'cheating': it speeds things up by taking a lot of memory.
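
Roughly, that caching trick might look like the sketch below (hypothetical
code, not what Mahout actually does -- the names are mine): hold the whole
item-feature matrix in RAM on each worker and stream user rows past it, so
nothing is re-read from disk inside the inner loop.

    public class CachedMultiplySketch {
        // Hypothetical: itemFeatures (m x k) is loaded once and kept in
        // memory; each call scores one streamed user row against every
        // cached item. Fast, but costs O(m * k) memory per worker -- the
        // 'cheating' part.
        static double[] scoresForUser(double[] userFeatures,
                                      double[][] itemFeatures) {
            double[] scores = new double[itemFeatures.length];
            for (int i = 0; i < itemFeatures.length; i++) {
                double s = 0.0;
                for (int f = 0; f < userFeatures.length; f++) {
                    s += userFeatures[f] * itemFeatures[i][f]; // dot(u, m_i)
                }
                scores[i] = s;
            }
            return scores;
        }
    }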


On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <[email protected]> wrote:

> Hmm... each user's recommendations seem to be about 2.8 x 20M Flops = 60M
> Flops.  You should get about a Gflop per core in Java, so this should take
> about 60 ms.  You can make it faster with more cores or by using ATLAS.
>
> Are you expecting 3 million unique people every 80 hours?  If not, then it
> is probably more efficient to compute the recommendations on the fly.
>
> How many recommendations per second are you expecting?  If you have 1
> million uniques per day (just for grins) and we assume 20,000 s/day to
> allow for peak loading, you have to do 50 queries per second peak.  This
> seems to require 3 cores.  Use 16 to be safe.
>
> Regarding the 80 hours: 3 million x 60 ms = 180,000 seconds = 50 hours.  I
> think that your map-reduce is underperforming by about a factor of 10.
>  This is quite plausible with a bad arrangement of the inner loops.  I think
> you would get the highest performance computing the recommendations for a
> few thousand items by a few thousand users at a time.  It might be just
> about as fast to do all items against a few users at a time.  The reason
> for this is that dense matrix multiply requires c (n x k + m x k) memory
> ops, but n x k x m arithmetic ops.  If you can re-use data many times, you
> can balance memory channel bandwidth against CPU speed.  Typically you need
> 20 or more re-uses to really make this fly.
>
>
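
Spelling out Ted's capacity arithmetic above: 1M uniques / 20,000 s ~= 50
queries per second at peak, and 50 queries/s x 60 ms of core time per query
= 3 core-seconds of work per wall-clock second, i.e. 3 fully busy cores --
hence the 16-core suggestion for headroom.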
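
And the blocked arrangement he describes might look roughly like the sketch
below (illustrative only -- the tile size is a guess to be tuned against
cache size, and on a real 2.6M x 2.8M problem you'd keep top-N scores per
user rather than a dense score matrix):

    public class BlockedMultiplySketch {
        static final int BLOCK = 2048; // tile size: tune so tiles fit in cache

        // U is n x k (users), M is m x k (items); scores is n x m.
        // Tiling means every U row and M row loaded for a tile is re-used
        // ~BLOCK times, so the n x k x m arithmetic ops amortize the
        // c (n x k + m x k) memory ops Ted mentions.
        static void multiply(double[][] U, double[][] M, double[][] scores) {
            int n = U.length, m = M.length, k = U[0].length;
            for (int u0 = 0; u0 < n; u0 += BLOCK) {
                for (int i0 = 0; i0 < m; i0 += BLOCK) {
                    int u1 = Math.min(u0 + BLOCK, n);
                    int i1 = Math.min(i0 + BLOCK, m);
                    for (int u = u0; u < u1; u++) {
                        for (int i = i0; i < i1; i++) {
                            double s = 0.0;
                            for (int f = 0; f < k; f++) {
                                s += U[u][f] * M[i][f];
                            }
                            scores[u][i] = s;
                        }
                    }
                }
            }
        }
    }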
