Oh, I see. I was thinking that there was a computational dependency from one window to the next. If the computations are independent, then I think Spark can help you out quite a bit.
I think you would want an RDD where each element is a window of your dense
matrix. I'm not aware of a way to distribute the windows of the big matrix
that doesn't involve broadcasting the whole thing. You might have to tweak
some config options, but I think it would work straight away. I would
initialize the data structure like this:

    val matB = sc.broadcast(myBigDenseMatrix)
    val distributedChunks = sc.parallelize(0 until numWindows)
      .mapPartitions(it => it.map(windowID => getWindow(matB.value, windowID)))

Then just apply your matrix ops as a map on the resulting RDD of windows. You
may have your own tool for dense matrix ops, but I would suggest Scala Breeze.
You'll have to use an old version of Breeze (current builds are for Scala
2.10), and Spark on Scala 2.10 is still a little way off.

On Fri, Dec 20, 2013 at 11:40 AM, Aureliano Buendia <[email protected]> wrote:

> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>
>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>> you can index windows of the rows all you want, but you're limited to 100
>> concurrent tasks. You could use a column decomposition and access subsets
>> of the columns with a PartitionPruningRDD. I have to say, though, if
>> you're doing dense matrix operations, they will be 100s of times faster
>> on a shared-mem platform. This particular matrix, at 800 MB, could be a
>> Breeze on a single node.
>
> The computation for every submatrix is very expensive; it takes days on a
> single node. I was hoping this can be reduced to hours or minutes with
> Spark.
>
> Are you saying that Spark is not suitable for this type of job?
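For what it's worth, here is a rough end-to-end sketch of the pipeline described above. It is written against a recent Spark and Breeze API rather than the 0.8-era versions discussed in this thread, and the matrix size, the contiguous-row-block windowing (windowSize, numWindows), and the per-window Gram-matrix computation are all made-up placeholders standing in for your actual data and matrix ops:

    // Hedged sketch: matrix size, windowing scheme, and the per-window op are
    // placeholders; only the broadcast + parallelize(window IDs) + map pattern
    // is the point.
    import org.apache.spark.{SparkConf, SparkContext}
    import breeze.linalg.{DenseMatrix, sum}

    object WindowedMatrixJob {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("windowed-matrix"))

        // Stand-in for the ~800 MB dense matrix, built on the driver.
        val n = 10000
        val myBigDenseMatrix = DenseMatrix.rand(n, n)

        // Ship one read-only copy of the matrix to every executor.
        val matB = sc.broadcast(myBigDenseMatrix)

        // Hypothetical windowing: contiguous blocks of windowSize rows.
        val windowSize = 500
        val numWindows = n / windowSize

        // Distribute only the window IDs; each task slices its own window out
        // of the broadcast matrix locally (this is the getWindow step above).
        val distributedChunks = sc.parallelize(0 until numWindows).map { windowID =>
          matB.value((windowID * windowSize) until ((windowID + 1) * windowSize), ::).copy
        }

        // Placeholder for the expensive per-window computation: a Gram matrix
        // reduced to a single Double per window.
        val results = distributedChunks.map(w => sum(w * w.t)).collect()

        results.foreach(println)
        sc.stop()
      }
    }

Only the per-window scalars come back to the driver, so the big matrix crosses the wire once per executor (via the broadcast) and never on the return trip.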
