Hi Josh,

The factorization should be quite a bit faster with the current trunk,
as we reworked the QR decomposition used for solving the least squares
problems of ALS.
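
For anyone following along, here is a minimal sketch of what "solving a least
squares problem via QR" means for a single user in ALS. It is plain Python for
illustration only, not the Mahout code: the names `qr_mgs` and
`solve_least_squares` are made up, the QR is a textbook modified Gram-Schmidt,
and regularization is left out to keep it short.

```python
# Illustrative sketch only; function names are hypothetical, not Mahout APIs.

def qr_mgs(a):
    """Thin QR of an m x n matrix (list of rows), assuming full column rank."""
    m, n = len(a), len(a[0])
    # Gram-Schmidt works column by column, so transpose into columns first.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]
    q_cols = []
    r = [[0.0] * n for _ in range(n)]
    for j in range(n):
        v = cols[j][:]
        # Remove the components along the already-computed orthonormal columns.
        for i, q in enumerate(q_cols):
            r[i][j] = sum(qk * vk for qk, vk in zip(q, v))
            v = [vk - r[i][j] * qk for qk, vk in zip(q, v)]
        r[j][j] = sum(vk * vk for vk in v) ** 0.5
        q_cols.append([vk / r[j][j] for vk in v])
    return q_cols, r


def solve_least_squares(a, b):
    """Minimize ||a x - b||_2 by solving R x = Q^T b with back substitution."""
    q_cols, r = qr_mgs(a)
    n = len(r)
    qtb = [sum(qk * bk for qk, bk in zip(q, b)) for q in q_cols]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(r[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (qtb[i] - s) / r[i][i]
    return x


# Rows of `a`: feature vectors of the items one user rated (2 features here).
# `b`: that user's ratings. The result is the user's feature vector.
user_features = solve_least_squares(
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [1.0, 2.0, 3.0])
```

In ALS one such small system is solved per user (and per item) in every
iteration, which is why the speed of this step matters so much.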

I think we can also remove a lot of object instantiations in
ParallelALSFactorizationJob.

/s

On 06.03.2013 11:25, Josh Devins wrote:
> So the 80 hour estimate is _only_ for the U*M', top-n calculation and not
> the factorization. Factorization is on the order of 2 hours. For the
> interested, here's the pertinent code from the ALS `RecommenderJob`:
> 
> http://grepcode.com/file/repo1.maven.org/maven2/org.apache.mahout/mahout-core/0.7/org/apache/mahout/cf/taste/hadoop/als/RecommenderJob.java?av=f#148
> 
> I'm sure this can be optimised, but by an order of magnitude? Something to
> try out, I'll report back if I find anything concrete.
> 
> 
> 
> On 6 March 2013 11:13, Ted Dunning <[email protected]> wrote:
> 
>> Well, it would definitely not be the first time I counted incorrectly.
>>  Anytime I do arithmetic the result should be considered suspect.  I do
>> think my numbers are correct, but then again, I always do.
>>
>> But the OP did say 20 dimensions which gives me back 5x.
>>
>> Inclusion of learning time is a good suspect.  On the other side of the
>> ledger, if the multiply is doing any column wise access it is a likely
>> performance bug.  The computation is AB'. Perhaps you refer to rows of B
>> which are the columns of B'.
>>
>> Sent from my sleepy thumbs set to typing on my iPhone.
>>
>> On Mar 6, 2013, at 4:16 AM, Sean Owen <[email protected]> wrote:
>>
>>> If there are 100 features, it's more like 2.6M * 2.8M * 100 = 728 Tflops --
>>> I think you're missing an "M", and the features by an order of magnitude.
>>> That's still 1 day on an 8-core machine by this rule of thumb.
>>>
>>> The 80 hours is the model building time too (right?), not the time to
>>> multiply U*M'. This is dominated by iterations when building from scratch,
>>> and I expect took 75% of that 80 hours. So if the multiply was 20 hours --
>>> on 10 machines -- on Hadoop, then that's still slow but not out of the
>>> question for Hadoop, given it's usually a 3-6x slowdown over a parallel
>>> in-core implementation.
>>>
>>> I'm pretty sure what exists in Mahout here can be optimized further at the
>>> Hadoop level; I don't know that it's doing the multiply badly though. In
>>> fact I'm pretty sure it's caching cols in memory, which is a bit of
>>> 'cheating' to speed up by taking a lot of memory.
>>>
>>>
>>> On Wed, Mar 6, 2013 at 3:47 AM, Ted Dunning <[email protected]> wrote:
>>>
>>>> Hmm... each user's recommendations seem to be about 2.8 x 20M Flops = 60M
>>>> Flops.  You should get about a Gflop per core in Java, so this should take
>>>> about 60 ms.  You can make this faster with more cores or by using ATLAS.
>>>>
>>>> Are you expecting 3 million unique people every 80 hours?  If no, then it
>>>> is probably more efficient to compute the recommendations on the fly.
>>>>
>>>> How many recommendations per second are you expecting?  If you have 1
>>>> million uniques per day (just for grins) and we assume 20,000 s/day to
>>>> allow for peak loading, you have to do 50 queries per second peak.  This
>>>> seems to require 3 cores.  Use 16 to be safe.
>>>>
>>>> Regarding the 80 hours, 3 million x 60ms = 180,000 seconds = 50 hours.  I
>>>> think that your map-reduce is underperforming by about a factor of 10.
>>>> This is quite plausible with bad arrangement of the inner loops.  I think
>>>> that you would have highest performance computing the recommendations for
>>>> a few thousand items by a few thousand users at a time.  It might be just
>>>> about as fast to do all items against a few users at a time.  The reason
>>>> for this is that dense matrix multiply requires n x k + m x k memory ops,
>>>> but n x k x m arithmetic ops.  If you can re-use data many times, you can
>>>> balance memory channel bandwidth against CPU speed.  Typically you need 20
>>>> or more re-uses to really make this fly.
>>>>
>>>>
>>
> 
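
The back-of-envelope numbers quoted above can be re-derived with a few lines
of Python. This is only a sanity check of the thread's own figures: the
function names are made up, and it takes the thread's rule of thumb of about
1 Gflop per core and its 20-feature model as given.

```python
# Sanity check of the estimates in the thread; names are hypothetical.

def estimate(users=3_000_000, items=2_800_000, features=20,
             flops_per_core=1e9):
    """Single-core cost of scoring every item for every user."""
    flops_per_user = items * features             # one dot product per item
    seconds_per_user = flops_per_user / flops_per_core
    total_hours = users * seconds_per_user / 3600
    return flops_per_user, seconds_per_user, total_hours


def peak_cores(uniques_per_day, busy_seconds_per_day, seconds_per_user):
    """Peak queries per second and the cores needed to serve them on the fly."""
    qps = uniques_per_day / busy_seconds_per_day
    return qps, qps * seconds_per_user


flops, secs, hours = estimate()
# flops is 56M and secs is 0.056, which the thread rounds to 60M and 60 ms;
# hours is about 46.7, which the thread rounds to 50.
qps, cores = peak_cores(1_000_000, 20_000, secs)
# qps is 50 and cores is about 2.8, i.e. the "3 cores" quoted above.
```

With 100 features instead of 20, the same formula gives 2.6M * 2.8M * 100 =
7.28e14 operations, which is the 728 Tflops figure.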

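
The blocking argument in the oldest message (score a tile of users against a
tile of items so that each loaded vector is re-used many times) can be
sketched as follows. This is an illustration in plain Python, not what
RecommenderJob does: `top_n_blocked` and `reuse_factor` are made-up names, the
block sizes are arbitrary, and pure Python will not show the cache effect
itself, only the loop arrangement.

```python
import heapq


def reuse_factor(n_users, n_items):
    """Arithmetic ops per memory op for one tile: (n*k*m) / (n*k + m*k)."""
    return n_users * n_items / (n_users + n_items)


def top_n_blocked(user_vecs, item_vecs, n, user_block=2, item_block=2):
    """Top-n (score, item index) pairs per user, computed tile by tile."""
    best = [[] for _ in user_vecs]  # one min-heap per user
    for u0 in range(0, len(user_vecs), user_block):
        users = range(u0, min(u0 + user_block, len(user_vecs)))
        for i0 in range(0, len(item_vecs), item_block):
            for i in range(i0, min(i0 + item_block, len(item_vecs))):
                item = item_vecs[i]
                # The item vector fetched above is re-used for every user
                # in the current block before moving on.
                for u in users:
                    score = sum(a * b for a, b in zip(user_vecs[u], item))
                    heap = best[u]
                    if len(heap) < n:
                        heapq.heappush(heap, (score, i))
                    elif score > heap[0][0]:
                        heapq.heapreplace(heap, (score, i))
    return [sorted(h, reverse=True) for h in best]


recs = top_n_blocked([[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], n=2)
```

Scoring one user at a time against all items gives a re-use factor just under
1, while a 40 x 40 tile already reaches the 20 re-uses mentioned above.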