On Fri, Dec 20, 2013 at 5:55 PM, Evan R. Sparks <[email protected]> wrote:

> Hi Aureliano,
>
> I think spark is actually well-suited to this task. Spark is a
> general-purpose framework for distributed computing, and while a lot of
> users use it primarily for large-scale data processing, it's just as well
> suited to distributed CPU-intensive processing - as long as your algorithm
> fits well into the iterative map/reduce computation model.
>
> You can have an RDD composed of any serializable object in scala. I don't
> know about your particular use case, but you could make an
> RDD[Array[Array[Double]]], or, if you want access to fast linear algebra
> primitives, an RDD[DoubleMatrix] where DoubleMatrix comes from the jblas
> library.
>

Each of these matrices is a window moving along the big matrix. I think
going with your method means a huge amount of data duplication, as each
window has a big overlap with the other windows. (Each window moves column
by column.)


> You'll need to be a little careful when writing your algorithm to avoid
> overhead that comes with shuffling/reducing data - often in these
> scenarios, communication becomes the bottleneck, so writing your algorithm
> to minimize communication is important.
>
> - Evan
>
>
> On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia <[email protected]> wrote:
>
>> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>>
>>> If you use an RDD[Array[Double]] with a row decomposition of the
>>> matrix, you can index windows of the rows all you want, but you're
>>> limited to 100 concurrent tasks. You could use a column decomposition
>>> and access subsets of the columns with a PartitionPruningRDD. I have to
>>> say, though, if you're doing dense matrix operations, they will be 100s
>>> of times faster on a shared mem platform. This particular matrix, at
>>> 800 MB, could be a Breeze on a single node.
>>
>> The computation for every submatrix is very expensive; it takes days on
>> a single node. I was hoping this could be reduced to hours or minutes
>> with spark.
>>
>> Are you saying that spark is not suitable for this type of job?
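
One way to avoid the duplication concern raised above is to never materialize
the overlapping windows as separate RDD elements: broadcast the full matrix
once and parallelize only the window start columns, letting each task slice
out its own jblas DoubleMatrix. The sketch below is only an illustration of
that idea; the matrix dimensions, the window width, and expensiveComputation
are placeholders, not values or code from the thread.

import org.apache.spark.SparkContext
import org.jblas.DoubleMatrix

object SlidingWindowSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "sliding-window-sketch")

    // Illustrative sizes only; the matrix in the thread is ~800 MB.
    val numRows = 1000
    val numCols = 10000
    val windowCols = 500

    // Build (or load) the full matrix once on the driver and broadcast it,
    // so workers share one read-only copy instead of receiving a duplicated
    // overlapping window per task.
    val big = DoubleMatrix.rand(numRows, numCols)
    val bigBroadcast = sc.broadcast(big)

    // Parallelize only the window start columns; each task extracts its own
    // window from the broadcast matrix and runs the heavy computation on it.
    val results = sc.parallelize(0 to numCols - windowCols)
      .map { start =>
        val window =
          bigBroadcast.value.getColumns((start until start + windowCols).toArray)
        (start, expensiveComputation(window))
      }
      .collect()

    sc.stop()
  }

  // Stand-in for the days-long per-window computation described in the thread.
  def expensiveComputation(m: DoubleMatrix): Double = m.sum()
}

Broadcasting a matrix of that size to every worker is feasible but not free;
whether it beats shipping pre-cut windows depends on cluster memory and how
much the windows overlap.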

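Tom's column-decomposition idea could be sketched roughly as below. It assumes
one Array[Double] per column, contiguous columns per partition, and a Spark
version exposing PartitionPruningRDD.create; the sizes and the window helper
are hypothetical, not code from the thread.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object ColumnDecompositionSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "column-decomposition-sketch")

    // Illustrative sizes only.
    val numRows = 1000
    val numCols = 10000
    val colsPerPartition = 500
    val numPartitions = numCols / colsPerPartition

    // One (columnIndex, columnData) pair per column; parallelize splits the
    // range into contiguous chunks, so partition p holds columns
    // [p * colsPerPartition, (p + 1) * colsPerPartition).
    val columns: RDD[(Int, Array[Double])] =
      sc.parallelize(0 until numCols, numPartitions)
        .map(c => (c, Array.fill(numRows)(math.random)))

    // Restrict work to one window of columns by pruning partitions that
    // cannot contain them, then filtering the survivors to the exact range.
    def window(start: Int, width: Int): RDD[(Int, Array[Double])] = {
      val firstPart = start / colsPerPartition
      val lastPart = (start + width - 1) / colsPerPartition
      val pruned = PartitionPruningRDD.create(columns,
        partId => partId >= firstPart && partId <= lastPart)
      pruned.filter { case (c, _) => c >= start && c < start + width }
    }

    println(window(2500, 500).count())
    sc.stop()
  }
}

Pruning only avoids scheduling tasks on irrelevant partitions; each call to
window defines a new RDD over the shared column data rather than copying it.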