On Fri, Dec 20, 2013 at 5:55 PM, Evan R. Sparks <[email protected]> wrote:

> Hi Aureliano,
>
> I think spark is actually well-suited to this task. Spark is a
> general-purpose framework for distributed computing, and while a lot of
> users use it primarily for large-scale data processing, it's just as well
> suited to distributed CPU-intensive processing - as long as your algorithm
> fits well into the iterative map/reduce computation model.
>
> You can have an RDD composed of any serializable object in scala. I don't
> know about your particular use case, but you could make an
> RDD[Array[Array[Double]]], or, if you want access to fast linear algebra
> primitives, an RDD[DoubleMatrix] where DoubleMatrix comes from the jblas
> library.
>

Each of these matrices is a window moving along the big matrix. I think
going with your method means a huge amount of data duplication, as each
window has a big overlap with the other windows. (Each window moves column
by column.)


> You'll need to be a little careful when writing your algorithm to avoid
> overhead that comes with shuffling/reducing data - often in these
> scenarios, communication becomes the bottleneck, so writing your algorithm
> to minimize communication is important.
>
> - Evan
>
>
> On Fri, Dec 20, 2013 at 9:40 AM, Aureliano Buendia <[email protected]> wrote:
>
>> On Fri, Dec 20, 2013 at 5:21 PM, Tom Vacek <[email protected]> wrote:
>>
>>> If you use an RDD[Array[Double]] with a row decomposition of the
>>> matrix, you can index windows of the rows all you want, but you're
>>> limited to 100 concurrent tasks. You could use a column decomposition
>>> and access subsets of the columns with a PartitionPruningRDD. I have to
>>> say, though, if you're doing dense matrix operations, they will be 100s
>>> of times faster on a shared mem platform. This particular matrix, at
>>> 800 MB, could be a Breeze on a single node.
>>
>> The computation for every submatrix is very expensive; it takes days on
>> a single node. I was hoping this could be reduced to hours or minutes
>> with spark.
>>
>> Are you saying that spark is not suitable for this type of job?
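
One way to avoid the duplication concern raised above is to never materialize
the overlapping windows as separate RDD elements: broadcast the full matrix
once and parallelize only the window start columns, letting each task slice
out its own jblas DoubleMatrix. The sketch below is only an illustration of
that idea; the matrix dimensions, the window width, and expensiveComputation
are placeholders, not values or code from the thread.

import org.apache.spark.SparkContext
import org.jblas.DoubleMatrix

object SlidingWindowSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "sliding-window-sketch")

    // Illustrative sizes only; the matrix in the thread is ~800 MB.
    val numRows = 1000
    val numCols = 10000
    val windowCols = 500

    // Build (or load) the full matrix once on the driver and broadcast it,
    // so workers share one read-only copy instead of receiving a duplicated
    // overlapping window per task.
    val big = DoubleMatrix.rand(numRows, numCols)
    val bigBroadcast = sc.broadcast(big)

    // Parallelize only the window start columns; each task extracts its own
    // window from the broadcast matrix and runs the heavy computation on it.
    val results = sc.parallelize(0 to numCols - windowCols)
      .map { start =>
        val window =
          bigBroadcast.value.getColumns((start until start + windowCols).toArray)
        (start, expensiveComputation(window))
      }
      .collect()

    sc.stop()
  }

  // Stand-in for the days-long per-window computation described in the thread.
  def expensiveComputation(m: DoubleMatrix): Double = m.sum()
}

Broadcasting a matrix of that size to every worker is feasible but not free;
whether it beats shipping pre-cut windows depends on cluster memory and how
much the windows overlap.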

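Tom's column-decomposition idea could be sketched roughly as below. It assumes
one Array[Double] per column, contiguous columns per partition, and a Spark
version exposing PartitionPruningRDD.create; the sizes and the window helper
are hypothetical, not code from the thread.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.{PartitionPruningRDD, RDD}

object ColumnDecompositionSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[4]", "column-decomposition-sketch")

    // Illustrative sizes only.
    val numRows = 1000
    val numCols = 10000
    val colsPerPartition = 500
    val numPartitions = numCols / colsPerPartition

    // One (columnIndex, columnData) pair per column; parallelize splits the
    // range into contiguous chunks, so partition p holds columns
    // [p * colsPerPartition, (p + 1) * colsPerPartition).
    val columns: RDD[(Int, Array[Double])] =
      sc.parallelize(0 until numCols, numPartitions)
        .map(c => (c, Array.fill(numRows)(math.random)))

    // Restrict work to one window of columns by pruning partitions that
    // cannot contain them, then filtering the survivors to the exact range.
    def window(start: Int, width: Int): RDD[(Int, Array[Double])] = {
      val firstPart = start / colsPerPartition
      val lastPart = (start + width - 1) / colsPerPartition
      val pruned = PartitionPruningRDD.create(columns,
        partId => partId >= firstPart && partId <= lastPart)
      pruned.filter { case (c, _) => c >= start && c < start + width }
    }

    println(window(2500, 500).count())
    sc.stop()
  }
}

Pruning only avoids scheduling tasks on irrelevant partitions; each call to
window defines a new RDD over the shared column data rather than copying it.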