On Wed, Dec 21, 2011 at 3:14 AM, Radim Rehurek <[email protected]> wrote:
> Also interesting: I'm not sure it's understood, but distributing SVD (the
> math) doesn't bring you anything here. The cost is dominated by the number of
> passes over input data, which is dominated by I/O (for such small values of
> `k`). This is an area where Mahout can truly shine, because of HDFS -- if the
> data is already pre-distributed to workers, the cost of IO can be shared. If,
> on the other hand, you'd need to read the data first, then distribute them
> further to nodes for processing, then a sequential algo will be faster.

Also, to that point: in my experiments the cost is not dominated by I/O when run on bare-metal clusters. (On Amazon EC2 or other kinds of VMs, YMMV, but not on a bare-metal rack.) In fact, in terms of performance, if anything, the splitting could be even more fine-grained than it is now for optimal horizontal scaling with extra-sparse inputs. I.e., we could easily run more tasks (though in many cases they end up collocated).

Performance is an aspect where we may actually make a significant difference with Hadoop. Improving horizontal scale is tricky, though, because Hadoop splits based on input size by default, while flops don't scale proportionally to input size -- they grow somewhat faster. Therefore, it would need a hack of the default Hadoop splitting mechanism, generating more clones of splits to run (a rough sketch of what I mean is below). If that's done, then indeed an almost constant running time can be achieved, provided the cluster has the capacity to take the extra tasks.

Hope this is useful for your inquiry.
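
The sketch, for concreteness only: the names OversplitInputFormat and FACTOR are made up, it assumes the new mapreduce API over SequenceFile inputs, and it is not something SSVD does today. The idea is just to take Hadoop's default byte-based splits and subdivide each one, so the number of map tasks grows faster than the raw input size:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// Illustrative only: cut each default byte-based split into FACTOR
// smaller splits so more map tasks are generated for the same input.
public class OversplitInputFormat<K, V> extends SequenceFileInputFormat<K, V> {

  private static final int FACTOR = 4;  // sub-splits per default split (made-up knob)

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> finer = new ArrayList<InputSplit>();
    for (InputSplit split : super.getSplits(job)) {
      FileSplit fs = (FileSplit) split;
      long chunk = Math.max(1L, fs.getLength() / FACTOR);
      long pos = fs.getStart();
      long end = fs.getStart() + fs.getLength();
      while (pos < end) {
        long len = Math.min(chunk, end - pos);
        // keep the original host hints so data locality is preserved
        finer.add(new FileSplit(fs.getPath(), pos, len, fs.getLocations()));
        pos += len;
      }
    }
    return finer;
  }
}

SequenceFile record readers sync forward to the next record boundary, so cutting at arbitrary byte offsets is safe; what remains is the capacity caveat above -- the cluster still needs free slots to actually run the extra tasks in parallel.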
