On Fri, Dec 20, 2013 at 6:00 PM, Tom Vacek <[email protected]> wrote:
> Oh, I see. I was thinking that there was a computational dependency from
> one window to the next. If the computations are independent, then I think
> Spark can help you out quite a bit.
>
> I think you would want an RDD where each element is a window of your dense
> matrix. I'm not aware of a way to distribute the windows of the big matrix
> in a way that doesn't involve broadcasting the whole thing. You might have
> to tweak some config options, but I think it would work straightaway. I
> would initialize the data structure like this:
>
> val matB = sc.broadcast(myBigDenseMatrix)
> val distributedChunks = sc.parallelize(0 until numWindows).mapPartitions(it =>
>   it.map(windowID => getWindow(matB.value, windowID)))

Here broadcast is used instead of calling parallelize on myBigDenseMatrix.
Is it okay to broadcast a huge amount of data? Does sharing big data mean a
big network I/O overhead compared to calling parallelize, or is this
overhead optimized due to partitioning? (I've pasted a fuller sketch of how
I read this approach below the quoted thread.)

> Then just apply your matrix ops as map on
>
> You may have your own tool for dense matrix ops, but I would suggest
> Scala Breeze. You'll have to use an old version of Breeze (current builds
> are for 2.10). Spark with Scala-2.10 is a little way off.
>
>
> On Fri, Dec 20, 2013 at 11:40 AM, Aureliano Buendia <[email protected]> wrote:
>
>> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>>
>>> If you use an RDD[Array[Double]] with a row decomposition of the matrix,
>>> you can index windows of the rows all you want, but you're limited to 100
>>> concurrent tasks. You could use a column decomposition and access subsets
>>> of the columns with a PartitionPruningRDD. I have to say, though, if
>>> you're doing dense matrix operations, they will be 100s of times faster on
>>> a shared-mem platform. This particular matrix, at 800 MB, could be a Breeze
>>> on a single node.
>>
>> The computation for every submatrix is very expensive; it takes days on a
>> single node. I was hoping this can be reduced to hours or minutes with
>> Spark.
>>
>> Are you saying that Spark is not suitable for this type of job?
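For reference, here is a minimal, self-contained sketch of how I read the
suggested broadcast approach. The getWindow implementation, window size,
matrix dimensions and the per-window computation are placeholders of my own,
not part of the suggestion above:

import org.apache.spark.SparkContext
import breeze.linalg.{DenseMatrix, sum}

object WindowedMatrixSketch {
  // Placeholder: extracts a contiguous block of columns as one "window".
  // The real window shape/stride would be whatever the problem needs.
  def getWindow(m: DenseMatrix[Double], windowID: Int, windowCols: Int): DenseMatrix[Double] =
    m(::, windowID * windowCols until (windowID + 1) * windowCols).copy

  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "windowed-matrix")

    // Placeholder data standing in for the ~800 MB dense matrix.
    val myBigDenseMatrix = DenseMatrix.rand(1000, 10000)
    val windowCols = 100
    val numWindows = myBigDenseMatrix.cols / windowCols

    // Ship the whole matrix to every worker once.
    val matB = sc.broadcast(myBigDenseMatrix)

    // Distribute only the window indices; each task slices its own window
    // out of the broadcast value locally.
    val distributedChunks = sc.parallelize(0 until numWindows)
      .mapPartitions(it => it.map(windowID => getWindow(matB.value, windowID, windowCols)))

    // Placeholder for the expensive per-window computation.
    val results = distributedChunks.map(w => sum(w)).collect()

    println(results.take(5).mkString(", "))
    sc.stop()
  }
}

The part I'm unsure about is the broadcast step: the whole matrix is copied
to every worker, whereas parallelize would split the data across partitions.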
