Hi Aureliano,

I think Spark is actually well-suited to this task. Spark is a
general-purpose framework for distributed computing, and while many users
rely on it primarily for large-scale data processing, it's just as well
suited to distributed CPU-intensive work, as long as your algorithm fits
well into the iterative map/reduce computation model.

You can have an RDD composed of any serializable object in Scala. I don't
know the details of your particular use case, but you could make an
RDD[Array[Array[Double]]] or, if you want access to fast linear algebra
primitives, an RDD[DoubleMatrix], where DoubleMatrix comes from the jblas
library.
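
As a rough sketch (the SparkContext setup, block sizes, and the random
data below are placeholders I've assumed, not your actual workload),
distributing submatrices as an RDD[DoubleMatrix] could look something
like this:

  import org.apache.spark.SparkContext
  import org.jblas.DoubleMatrix

  // Placeholder context; point the master at your cluster instead of local[*].
  val sc = new SparkContext("local[*]", "submatrix-sketch")

  // Assumed sizes for illustration: 100 row blocks of 100 x 1000 doubles each.
  val numBlocks = 100
  val rowsPerBlock = 100
  val cols = 1000

  // One DoubleMatrix per partition; in practice you'd load or generate each
  // block's real data here instead of using random values.
  val blocks = sc.parallelize(0 until numBlocks, numBlocks).map { blockId =>
    DoubleMatrix.rand(rowsPerBlock, cols)
  }

  // The expensive per-submatrix computation then runs independently on each block.
  val results = blocks.map(block => block.mmul(block.transpose()))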

You'll need to be a little careful when writing your algorithm to avoid
the overhead that comes with shuffling/reducing data. In these scenarios
communication often becomes the bottleneck, so structuring the computation
to minimize it is important.
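
For instance (again just a sketch, reusing the hypothetical blocks RDD from
the snippet above), reducing small per-block summaries is far cheaper than
shuffling whole matrices around:

  // Cheap, purely local per-task work: one scalar per block.
  val blockNorms = blocks.map(block => block.norm2())

  // Only the scalars cross the network during the reduce; together they give
  // the Frobenius norm of the full matrix, assuming the blocks partition it.
  val frobeniusNorm = math.sqrt(blockNorms.map(n => n * n).reduce(_ + _))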

- Evan




On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia <[email protected]> wrote:

>
>
>
> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>
>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>> you can index windows of the rows all you want, but you're limited to 100
>> concurrent tasks.  You could use a column decomposition and access subsets
>> of the columns with a PartitionPruningRDD.  I have to say, though, if
>> you're doing dense matrix operations, they will be 100s of times faster on
>> a shared mem platform.  This particular matrix, at 800 MB, could be a Breeze
>> on a single node.
>>
>
> The computation for every submatrix is very expensive; it takes days on a
> single node. I was hoping this could be reduced to hours or minutes with
> Spark.
>
> Are you saying that Spark is not suitable for this type of job?
>
