I have a branch on GitHub that equips VectorWritable with a preprocessor via
a Configurable Hadoop interface and happily preprocesses input element by
element without creating any heap objects.
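
Roughly, the idea looks like the sketch below. This is illustrative only: the
names and the wire format are made up for the example, not the actual
identifiers in the branch or VectorWritable's real serialization format.

import java.io.DataInput;
import java.io.IOException;
import org.apache.hadoop.conf.Configurable;
import org.apache.hadoop.conf.Configuration;

// Hypothetical callback interface: elements are pushed into it as they are
// read off the wire, so the consumer never needs a materialized Vector.
interface VectorElementPreprocessor extends Configurable {
  void startVector(int cardinality);
  void element(int index, double value);
  void endVector();
}

// Hypothetical writable wrapper that streams elements straight into the
// preprocessor instead of building a Vector on the heap. The wire format
// below (cardinality, non-zero count, index/value pairs) is simplified and
// is NOT VectorWritable's real format.
class PreprocessingVectorWritable {
  private final VectorElementPreprocessor preprocessor;

  PreprocessingVectorWritable(VectorElementPreprocessor p, Configuration conf) {
    this.preprocessor = p;
    p.setConf(conf);   // the preprocessor picks up its settings from the job conf
  }

  public void readFields(DataInput in) throws IOException {
    int cardinality = in.readInt();
    int nonZeros = in.readInt();
    preprocessor.startVector(cardinality);
    for (int i = 0; i < nonZeros; i++) {
      preprocessor.element(in.readInt(), in.readDouble());   // push, don't buffer
    }
    preprocessor.endVector();
  }
}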

I proposed contributing that approach a year ago, but it was rejected,
afaik, on the grounds that a push-style preprocessor is a "bad" or
"confusing" pattern to have.

If you want, I can dig that patch out for judgement again.

The benefits of this patch are significant. For one, the width of the input
is no longer bounded by memory; it reduces garbage collector pressure; and it
doesn't require a lot of memory (actually, any extra heap memory) for wide
matrices... it makes sense any way you look at it, except for the "bad"
pattern.

One thing it is, though, without a doubt, is totally possible (and the
version of SSVD we were using actually ran on exactly that
projection-as-a-single-element-preprocessor pattern).
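
To make that concrete, the projection step expressed as such a preprocessor
would look roughly like the sketch below. Again, this is illustrative only:
it reuses the hypothetical VectorElementPreprocessor interface from the
sketch above, and the on-the-fly +/-1 Omega here is just a stand-in, not the
actual Omega generation used in SSVD.

import java.util.Arrays;
import java.util.Random;
import org.apache.hadoop.conf.Configuration;

// Hypothetical preprocessor that folds each incoming element of a row of A
// into y = a_row * Omega, so a row of any width gets reduced to k numbers
// without the row (or Omega) ever being held in memory.
class ProjectionPreprocessor implements VectorElementPreprocessor {
  private final double[] y;          // k-dimensional accumulator, k << row width
  private Configuration conf;

  ProjectionPreprocessor(int k) {
    this.y = new double[k];
  }

  @Override public void setConf(Configuration conf) { this.conf = conf; }
  @Override public Configuration getConf() { return conf; }

  @Override public void startVector(int cardinality) {
    Arrays.fill(y, 0.0);
  }

  @Override public void element(int index, double value) {
    // Row 'index' of Omega is regenerated on the fly from the index itself
    // (a +/-1 scheme picked purely for illustration).
    Random omegaRow = new Random(index);
    for (int c = 0; c < y.length; c++) {
      y[c] += value * (omegaRow.nextBoolean() ? 1.0 : -1.0);
    }
  }

  @Override public void endVector() {
    // y now holds the projected row; hand it off to whatever collects it.
  }

  double[] projectedRow() { return y; }
}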
> On Oct 1, 2011 10:42 AM, "Jake Mannix" <[email protected]> wrote:
>> Marc,
>>
>> If you want to do element-at-a-time multiplication, without putting both
>> row and
>> column in memory at a time, this is totally doable, but just not
>> implemented
>> in Mahout yet. The current implementation manages to do it in one
>> map-reduce
>> pass by doing a mapside join (the CompositeInputFormat thing), but in
>> general
>> if you don't do a map-side join, it's 2 passes. In which case, doing this
>> element at a time instead of row/column at a time is also 2 passes, and
>> has no restrictions on how much is in memory at a time.
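
For concreteness, the first of those two passes could look roughly like the
sketch below: a generic join-on-the-inner-index formulation (not Jake's code
fragments, and not anything currently in Mahout). The second pass is then
just a plain sum of the partial products per (i, j) cell. The input line
format assumed here ("A,i,k,value" / "B,k,j,value") is purely illustrative.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ElementWiseMultiplyPass1 {

  // Re-key every element of A and B by the shared inner index k.
  public static class JoinOnInnerIndexMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      if ("A".equals(f[0])) {           // A,i,k,a  -> key k, value "A,i,a"
        ctx.write(new Text(f[2]), new Text("A," + f[1] + "," + f[3]));
      } else {                          // B,k,j,b  -> key k, value "B,j,b"
        ctx.write(new Text(f[1]), new Text("B," + f[2] + "," + f[3]));
      }
    }
  }

  // Emit partial products a_ik * b_kj keyed by the output cell (i, j).
  // Only the non-zeros of column k of A and row k of B are buffered here,
  // never a full row or column of the result.
  public static class PartialProductReducer
      extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text k, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String[]> aElems = new ArrayList<>();
      List<String[]> bElems = new ArrayList<>();
      for (Text v : values) {
        String[] f = v.toString().split(",");
        (f[0].equals("A") ? aElems : bElems).add(f);
      }
      for (String[] a : aElems) {
        for (String[] b : bElems) {
          double p = Double.parseDouble(a[2]) * Double.parseDouble(b[2]);
          ctx.write(new Text(a[1] + "," + b[1]), new Text(Double.toString(p)));
        }
      }
    }
  }
}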
>>
>> I've had some code lying around which started on doing this, but never
>> had a need just yet. If you open up a JIRA ticket for this, I could post
>> my code fragments so far, and maybe you (or someone else) could help
>> finish it off.
>>
>> Can you describe a bit how big your matrices are? Dense matrix
>> multiplication is an O(N^3) operation, so if N is so large that even
>> one row or column cannot fit in memory, then N^3 is not going to finish
>> any time this year or next, from what I can tell.
>>
>> -jake
>>
>> On Sat, Oct 1, 2011 at 3:18 AM, Marc Sturlese <[email protected]> wrote:
>>
>>> Well, after digging into the code and doing some tests, I've seen that what
>>> I was asking for is not possible. Mahout will only let you do a distributed
>>> matrix multiplication of two sparse matrices, as the representation of a
>>> whole row or column has to fit in memory. Actually, a row and a column have
>>> to fit in memory at a time (as it uses the CompositeInputFormat).
>>> To do dense matrix multiplication with Hadoop I just found this:
>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html
>>> But the data generated by the maps will be extremely huge and the job will
>>> take ages (of course depending on the number of nodes).
>>> I've seen around that Hama and R are possible solutions too. Any advice,
>>> comment or experience?
>>>
>>>
