Although I sense the discussion is really about a bit more than just reading inputs one element at a time.
Yes, I guess multiplication is generally 2 passes unless it is a map-side join, which I think has more particular prerequisites for the input than a general DRM assumes. I thought map-side joins require the same sorting and partitioning, and a DRM doesn't assume that in the most general case? Although I have a pretty vague idea of how exactly that particular input format does what it does. It is not supported in the new API, and I felt I wanted to abstain from going back to the deprecated stuff just to have that. Alright, please never mind.

On Oct 1, 2011 3:45 PM, "Dmitriy Lyubimov" <[email protected]> wrote:
> I have a branch in github that equips VectorWritable with a preprocessor via
> a Configurable hadoop interface and happily preprocesses input element by
> element without creating any heap object in memory.
>
> I proposed to contribute that approach a year ago but it was rejected,
> afaik, on the grounds that a push-style preprocessor is a "bad" or
> "confusing" pattern to have.
>
> If you want, I can dig that patch out for judgement again.
>
> The benefits of this patch are significant. For one, unbounding the width of
> the input with respect to memory, reducing garbage collector pressure, not
> having to have a lot of memory (actually, any extra heap memory) for wide
> matrices... it makes sense all around anywhere you look at it. Except for
> the "bad" pattern.
>
> One thing it is, though, without a doubt: it is totally possible (and
> actually the version of SSVD we were using ran exactly on that
> projection-as-a-single-element-preprocessor pattern).
>
>> Sent from android tab
>> On Oct 1, 2011 10:42 AM, "Jake Mannix" <[email protected]> wrote:
>>> Marc,
>>>
>>> If you want to do element-at-a-time multiplication, without putting both
>>> row and column in memory at a time, this is totally doable, but just not
>>> implemented in Mahout yet. The current implementation manages to do it
>>> in one map-reduce pass by doing a map-side join (the CompositeInputFormat
>>> thing), but in general, if you don't do a map-side join, it's 2 passes.
>>> In which case, doing this element at a time instead of row/column at a
>>> time is also 2 passes, and has no restrictions on how much is in memory
>>> at a time.
>>>
>>> I've had some code lying around which started on doing this, but never
>>> had a need just yet. If you open up a JIRA ticket for this, I could post
>>> my code fragments so far, and maybe you (or someone else) could help
>>> finish it off.
>>>
>>> Can you describe a bit about how big your matrices are? Dense matrix
>>> multiplication is an O(N^3) operation, so if N is so large that even
>>> one row or column cannot fit in memory, then N^3 is not going to finish
>>> any time this year or next, from what I can tell.
>>> -jake
>>>
>>> On Sat, Oct 1, 2011 at 3:18 AM, Marc Sturlese <[email protected]> wrote:
>>>
>>>> Well, after digging into the code and doing some tests, I've seen that
>>>> what I was asking for is not possible. Mahout will only let you do a
>>>> distributed matrix multiplication of 2 sparse matrices, as the
>>>> representation of a whole row or column has to fit in memory. Actually,
>>>> a row and a column have to fit in memory at the same time (as it uses
>>>> the CompositeInputFormat).
>>>> To do dense matrix multiplication with hadoop I just found this:
>>>> http://homepage.mac.com/j.norstad/matrix-multiply/index.html
>>>> But the data generated by the maps will be extremely huge and the job
>>>> will take ages (depending, of course, on the number of nodes).
>>>> I've seen around that Hama and R are possible solutions too. Any advice,
>>>> comment or experience?
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://lucene.472066.n3.nabble.com/about-DistributedRowMatrix-implementation-tp3375372p3384669.html
>>>> Sent from the Mahout User List mailing list archive at Nabble.com.
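[Editor's note: the two-pass, element-at-a-time scheme Jake describes can be sketched outside of Hadoop. Below is a minimal Python simulation of the two shuffles; it is not Mahout code, and the function name `mr_multiply` and the coordinate-dict matrix representation are illustrative assumptions. Pass 1 joins column k of A with row k of B and emits partial products a_ik * b_kj; pass 2 sums the partials per output cell. Note that no reducer ever needs more than one column of A and one row of B at a time, matching the "no restrictions on how much is in memory" claim.]

```python
from collections import defaultdict

def mr_multiply(A, B):
    """Two-pass element-at-a-time matrix multiply, simulating the
    shuffles of two MapReduce jobs in memory.
    A and B are dicts mapping (row, col) -> value (sparse coordinates)."""
    # Pass 1 map: key both inputs by the shared dimension k.
    shuffle1 = defaultdict(lambda: ([], []))
    for (i, k), a in A.items():
        shuffle1[k][0].append((i, a))   # elements of column k of A
    for (k, j), b in B.items():
        shuffle1[k][1].append((j, b))   # elements of row k of B
    # Pass 1 reduce: emit partial products a_ik * b_kj keyed by (i, j).
    shuffle2 = defaultdict(list)
    for k, (a_col, b_row) in shuffle1.items():
        for i, a in a_col:
            for j, b in b_row:
                shuffle2[(i, j)].append(a * b)
    # Pass 2 reduce: sum partial products per output cell.
    return {ij: sum(parts) for ij, parts in shuffle2.items()}

# 2x2 example: C = A * B
A = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0}
B = {(0, 0): 5.0, (0, 1): 6.0, (1, 0): 7.0, (1, 1): 8.0}
C = mr_multiply(A, B)
# C[(0, 0)] == 19.0, C[(1, 1)] == 50.0
```

The cost Jake mentions shows up in pass 1's output: a fully dense k-th reducer emits N^2 partial products, giving the O(N^3) total work for dense inputs.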

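[Editor's note: the push-style, element-at-a-time preprocessing Dmitriy describes — a Writable that applies a callback to each element during deserialization instead of materializing the vector on the heap — can be sketched similarly. This is a schematic Python stand-in, not the actual VectorWritable patch; the wire format (big-endian length prefix followed by doubles) and the name `stream_vector` are assumptions for illustration.]

```python
import io
import struct

def stream_vector(stream, callback):
    """Read a length-prefixed vector of doubles from `stream`, pushing
    each (index, value) to `callback` without building the vector."""
    (n,) = struct.unpack(">i", stream.read(4))
    for idx in range(n):
        (val,) = struct.unpack(">d", stream.read(8))
        callback(idx, val)

# Writer side, for the demo only: serialize a small vector.
buf = io.BytesIO()
vals = [1.0, 2.0, 3.0]
buf.write(struct.pack(">i", len(vals)))
for v in vals:
    buf.write(struct.pack(">d", v))
buf.seek(0)

# Reader side: apply a per-element transform (here, squaring as a
# stand-in for a projection) as elements stream past.
acc = []
stream_vector(buf, lambda i, v: acc.append(v * v))
# acc == [1.0, 4.0, 9.0]
```

Since only one element is decoded at a time, peak heap usage is independent of the vector's width, which is the garbage-collection and wide-matrix benefit claimed in the thread.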