If you use an RDD[Array[Double]] with a row decomposition of the matrix, you can index windows of the rows all you want, but since the matrix has only 100 rows you're limited to 100 concurrent tasks. You could instead use a column decomposition and access subsets of the columns with a PartitionPruningRDD. I have to say, though, if you're doing dense matrix operations, they will be hundreds of times faster on a shared-memory platform. This particular matrix, at 800 MB, could be a Breeze on a single node.
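For the column decomposition, here's a rough sketch of what I have in mind. It assumes each element is a (columnIndex, Array[Double]) pair and that the RDD was built so that partition p holds a contiguous block of 50 columns; colsPerPartition and window are just illustrative names, not anything in the Spark API:

    import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

    val colsPerPartition = 50

    // Return columns [j, j + width) without launching tasks on partitions
    // that can't contain them. Assumes partition p holds exactly the
    // columns [p * colsPerPartition, (p + 1) * colsPerPartition).
    def window(cols: RDD[(Int, Array[Double])], j: Int, width: Int): RDD[(Int, Array[Double])] = {
      val first = j / colsPerPartition
      val last  = (j + width - 1) / colsPerPartition
      val pruned = PartitionPruningRDD.create(cols, p => p >= first && p <= last)
      pruned.filter { case (c, _) => c >= j && c < j + width }
    }

PartitionPruningRDD only schedules tasks on the partitions that pass the filter, so a 50-column window touches at most two partitions instead of scanning the whole RDD.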
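And for the single-node option, the whole matrix fits comfortably in memory, so a window is just a slice. A minimal sketch with Breeze (a contiguous column slice on a column-major DenseMatrix is a view of the underlying data, not a copy):

    import breeze.linalg._

    // 100 x 1,000,000 doubles is ~800 MB, fine on one node's heap.
    val m = DenseMatrix.zeros[Double](100, 1000000)

    // The 100 x 50 window starting at column j:
    val j = 0
    val win = m(::, j until j + 50)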
On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia <[email protected]> wrote:

> Hi,
>
> I have a 100 x 1,000,000 matrix of double values, and I want to perform
> distributed computing on a 'window' of 100 x 50, where the window starts
> at each column. That is, each task must have access to columns j to j+50.
>
> Spark examples only come with accessing a single row per task. Is it
> possible to have access to a small part of the matrix?
