Oh, I see. I was thinking that there was a computational dependency from one window to the next. If the computations are independent, then I think Spark can help you out quite a bit.
I think you would want an RDD where each element is a window of your dense
matrix. I'm not aware of a way to distribute the windows of the big matrix
that doesn't involve broadcasting the whole thing. You might have to tweak
some config options, but I think it would work straight away. I would
initialize the data structure like this:

    val matB = sc.broadcast(myBigDenseMatrix)
    val distributedChunks = sc.parallelize(0 until numWindows)
      .mapPartitions(it => it.map(windowID => getWindow(matB.value, windowID)))

Then just apply your matrix ops as a map on the resulting RDD of windows. You
may have your own tool for dense matrix ops, but I would suggest Scala Breeze.
You'll have to use an old version of Breeze (current builds are for Scala
2.10), and Spark on Scala 2.10 is still a little way off.

On Fri, Dec 20, 2013 at 11:40 AM, Aureliano Buendia <[email protected]> wrote:

> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>
>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>> you can index windows of the rows all you want, but you're limited to 100
>> concurrent tasks. You could use a column decomposition and access subsets
>> of the columns with a PartitionPruningRDD. I have to say, though, if
>> you're doing dense matrix operations, they will be 100s of times faster
>> on a shared-mem platform. This particular matrix, at 800 MB, could be a
>> Breeze on a single node.
>
> The computation for every submatrix is very expensive; it takes days on a
> single node. I was hoping this can be reduced to hours or minutes with
> Spark.
>
> Are you saying that Spark is not suitable for this type of job?
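For what it's worth, here is a rough end-to-end sketch of the pipeline described above. It is written against a recent Spark and Breeze API rather than the 0.8-era versions discussed in this thread, and the matrix size, the contiguous-row-block windowing (windowSize, numWindows), and the per-window Gram-matrix computation are all made-up placeholders standing in for your actual data and matrix ops:

    // Hedged sketch: matrix size, windowing scheme, and the per-window op are
    // placeholders; only the broadcast + parallelize(window IDs) + map pattern
    // is the point.
    import org.apache.spark.{SparkConf, SparkContext}
    import breeze.linalg.{DenseMatrix, sum}

    object WindowedMatrixJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("windowed-matrix"))

        // Stand-in for the ~800 MB dense matrix, built on the driver.
        val n = 10000
        val myBigDenseMatrix = DenseMatrix.rand(n, n)

        // Ship one read-only copy of the matrix to every executor.
        val matB = sc.broadcast(myBigDenseMatrix)

        // Hypothetical windowing: contiguous blocks of windowSize rows.
        val windowSize = 500
        val numWindows = n / windowSize

        // Distribute only the window IDs; each task slices its own window out
        // of the broadcast matrix locally (this is the getWindow step above).
        val distributedChunks = sc.parallelize(0 until numWindows).map { windowID =>
          matB.value((windowID * windowSize) until ((windowID + 1) * windowSize), ::).copy
        }

        // Placeholder for the expensive per-window computation: a Gram matrix
        // reduced to a single Double per window.
        val results = distributedChunks.map(w => sum(w * w.t)).collect()

        results.foreach(println)
        sc.stop()
      }
    }

Only the per-window scalars come back to the driver, so the big matrix crosses the wire once per executor (via the broadcast) and never on the return trip.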
