Well, my word selection wasn't great when I said "one map task produces only a single result". What I meant was that one map task produces only a single outer product (which consists of multiple column vectors, hence multiple mapper emits), but those are not the ones to combine in this case, right?
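To illustrate what I mean, here is a minimal NumPy sketch (the names `a_col`/`b_row` are just placeholders for this example, not Mahout identifiers): one map task holds one column of A and the matching row of B, and emits one (index, vector) pair per element of the row — those emitted vectors are the columns of a single outer product, concatenated rather than summed.

```python
import numpy as np

# Hypothetical data for one map task: column j of A and row j of B.
a_col = np.array([1.0, 2.0, 3.0])
b_row = np.array([4.0, 5.0])

# The mapper emits one (index, vector) pair per element of b_row:
# each emitted vector is a_col scaled by one entry of b_row.
emits = [(k, a_col * b_k) for k, b_k in enumerate(b_row)]

# Stacking the emitted vectors as columns reconstructs the outer
# product of a_col and b_row -- concatenation, not summation.
outer = np.column_stack([v for _, v in emits])
assert np.allclose(outer, np.outer(a_col, b_row))
```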
2012/9/26 Sigurd Spieckermann <[email protected]>

> Yes, but one int/vector pair corresponds to the respective column of A
> multiplied by an element of the respective row of B, correct? So the
> concatenation of the resulting columns would be the outer product of the
> column of A and the row of B. None of these vectors is summed up; rather,
> the outer products of multiple map tasks are summed up. So what is the
> job of the combiner here? It would be nice if the combiner could sum up
> all outer products computed on that datanode, but this is the part I
> can't see happening in Hadoop. Is the general statement correct that a
> combiner is only applied to the outputs of a *map task* and that a map
> task processes all key-value pairs of a split? In this case, there is
> only one key-value pair per split, right? The int/vector being the index
> and column/row of the matrix.
>
>
> 2012/9/26 Jake Mannix <[email protected]>
>
>> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann <
>> [email protected]> wrote:
>>
>> > Hi guys,
>> >
>> > I'm trying to understand the way the combiner in Mahout SVD works. (
>> > https://cwiki.apache.org/MAHOUT/dimensional-reduction.html) As far as
>> > I know from the Mahout math matrix-multiplication implementation,
>> > matrix A is represented by column vectors, matrix B is represented by
>> > row vectors, and an inner join executes an outer product of the
>> > columns of A with the rows of B. All outer products are summed by the
>> > combiners and reducers. What I am wondering about is how a combiner
>> > can actually combine multiple outer products on the same datanode,
>> > because the join package requires the data to be partitioned into
>> > unsplittable files. In this case, I understand that one file contains
>> > one column/row of its corresponding matrix. Hence, each map task
>> > receives a column-row tuple, computes the outer product and emits the
>> > result.
>>
>> This all sounds right, but not the following:
>>
>> > My understanding of Hadoop is that the combiner follows a map task
>> > immediately, but one map task produces only a single result, so there
>> > is nothing to combine.
>>
>> That part is not true - a mapper may emit more than one key-value pair
>> (and for matrix multiplication, this is true *a fortiori* - there is one
>> int/vector pair emitted per nonzero element of the row being mapped
>> over).
>>
>> > If the combiner could accumulate the results of multiple map tasks, I
>> > would understand the idea, but from my understanding and tests, it
>> > does not.
>> >
>> > Could anyone clarify the process please?
>> >
>> > Thanks a lot!
>> > Sigurd
>>
>> --
>>   -jake
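The summation the thread is discussing — combiners/reducers adding up the per-map-task outer products — can be sketched as follows. This is a hypothetical NumPy illustration of the math, not Mahout code; the matrices are made up: each "map task" j produces the outer product of column j of A with row j of B, and summing those partial products over j yields the full matrix product.

```python
import numpy as np

# Made-up small matrices for illustration.
A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[5.0, 6.0],
              [7.0, 8.0]])

# One outer product per "map task": column j of A times row j of B.
partials = [np.outer(A[:, j], B[j, :]) for j in range(A.shape[1])]

# What the combiner/reducer stage does: sum the partial outer products.
C = sum(partials)
assert np.allclose(C, A @ B)
```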
