On Wed, Dec 21, 2011 at 3:14 AM, Radim Rehurek <[email protected]> wrote:
> Also interesting: I'm not sure it's understood, but distributing SVD (the
> math) doesn't bring you anything here. The cost is dominated by the number of
> passes over input data, which is dominated by I/O (for such small values of
> `k`). This is an area where Mahout can truly shine, because of HDFS -- if the
> data is already pre-distributed to workers, the cost of IO can be shared. If,
> on the other hand, you'd need to read the data first, then distribute them
> further to nodes for processing, then a sequential algo will be faster.

Also, to that point: in my experiments the cost is not dominated by I/O when run on bare-metal clusters. (On Amazon EC2 or other kinds of VMs, YMMV, but not on a bare-metal rack.) In fact, in terms of performance, if anything, the splitting could be even more fine-grained than it is now for optimal horizontal scaling with extra-sparse inputs. I.e., we could easily run more tasks (though in many cases they end up collocated).

Performance is an aspect where we may actually make a significant difference with Hadoop. Improving horizontal scale is tricky, though, because Hadoop splits based on input size by default, while flops don't scale proportionally to input size -- they grow somewhat faster. Therefore, it would need a hack of the default Hadoop splitting mechanism, generating more clones of splits to run (a rough sketch of what I mean is below). If that's done, then indeed an almost constant running time can be achieved, provided the cluster has the capacity to take the extra tasks.

Hope this is useful for your inquiry.
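
The sketch, for concreteness only: the names OversplitInputFormat and FACTOR are made up, it assumes the new mapreduce API over SequenceFile inputs, and it is not something SSVD does today. The idea is just to take Hadoop's default byte-based splits and subdivide each one, so the number of map tasks grows faster than the raw input size:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Illustrative only: cut each default byte-based split into FACTOR
// smaller splits so more map tasks are generated for the same input.
public class OversplitInputFormat<K, V> extends SequenceFileInputFormat<K, V> {

  private static final int FACTOR = 4;  // sub-splits per default split (made-up knob)

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> finer = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      long chunk = Math.max(1L, fs.getLength() / FACTOR);
      long pos = fs.getStart();
      long end = fs.getStart() + fs.getLength();
      while (pos < end) {
        long len = Math.min(chunk, end - pos);
        // keep the original host hints so data locality is preserved
        finer.add(new FileSplit(fs.getPath(), pos, len, fs.getLocations()));
        pos += len;
      }
    }
    return finer;
  }
}

SequenceFile record readers sync forward to the next record boundary, so cutting at arbitrary byte offsets is safe; what remains is the capacity caveat above -- the cluster still needs free slots to actually run the extra tasks in parallel.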
