Hi Aureliano,

I think Spark is actually well-suited to this task. Spark is a general-purpose framework for distributed computing, and while many users rely on it primarily for large-scale data processing, it is just as well suited to distributed CPU-intensive work - as long as your algorithm fits the iterative map/reduce computation model.
You can have an RDD composed of any serializable object in Scala. I don't know about your particular use case, but you could make an RDD[Array[Array[Double]]] or, if you want access to fast linear algebra primitives, an RDD[DoubleMatrix], where DoubleMatrix comes from the jblas library. You'll need to be a little careful when writing your algorithm to avoid the overhead that comes with shuffling/reducing data - often in these scenarios communication becomes the bottleneck, so writing your algorithm to minimize communication is important. There is a rough sketch of what I mean at the bottom of this message.

- Evan

On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia <[email protected]> wrote:

> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>
>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>> you can index windows of the rows all you want, but you're limited to 100
>> concurrent tasks. You could use a column decomposition and access subsets
>> of the columns with a PartitionPruningRDD. I have to say, though, if
>> you're doing dense matrix operations, they will be 100s of times faster on
>> a shared mem platform. This particular matrix, at 800 MB, could be a Breeze
>> on a single node.
>
> The computation for every submatrix is very expensive; it takes days on a
> single node. I was hoping this can be reduced to hours or minutes with
> Spark.
>
> Are you saying that Spark is not suitable for this type of job?
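Here is a minimal sketch of the kind of job I have in mind, assuming the Scala API with jblas on the classpath. The 1000x1000 random submatrices, the mmul/transpose placeholder work, and the final sum are illustrative assumptions rather than anything from your use case; the point is simply that the expensive work runs inside a map with no shuffle, and only a small summary value ever crosses the network.

import org.apache.spark.SparkContext
import org.jblas.DoubleMatrix

object SubmatrixSketch {
  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders; point this at your cluster.
    val sc = new SparkContext("local[4]", "SubmatrixSketch")

    // Stand-in input: 100 random 1000x1000 submatrices, one per partition.
    // In a real job you would construct each DoubleMatrix from your own data,
    // e.g. by loading one block of the large matrix per partition.
    val submatrices = sc.parallelize(1 to 100, 100).map { _ =>
      DoubleMatrix.rand(1000, 1000)
    }

    // The expensive per-submatrix work is an embarrassingly parallel map:
    // no shuffle, so no communication between workers. mmul/transpose here
    // are just placeholders for the real long-running computation.
    val perBlockResults = submatrices.map { m =>
      val product = m.mmul(m.transpose())
      product.sum()   // ship back only a small summary, not the whole matrix
    }

    // A single reduce over small doubles is the only communication step.
    val total = perBlockResults.reduce(_ + _)
    println("Sum over all submatrix results: " + total)

    sc.stop()
  }
}

The key design choice is that each submatrix lives entirely within one partition, so the heavy computation never crosses the network; only the small per-block results are reduced at the end, which keeps communication from becoming the bottleneck.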
