Hm, I have the same understanding of what a map task is. My confusion is whether the combiner is applied only to the outputs of a single map task (potentially many, because a split usually contains multiple key-value pairs) or also across the outputs of multiple map tasks. As I understand the Mahout matrix multiplication built on the Hadoop join package, each stripe (column/row) is stored in its own file (I guess because even a single column/row is assumed to be potentially very large, up to the entire block size), so a single outer product is computed in *one* map task. If the combiner cannot combine the outputs of *multiple* map tasks, then there is nothing to combine.
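To make the question concrete, here is a toy sketch in plain Python. It is not Hadoop or Mahout code: the function names (`map_outer_product`, `combine`, `run_map_task`, `reduce_all`) are made up for illustration, and it assumes the combiner only ever sees the output of one map task, and that each emitted pair is keyed by the index of a nonzero element of the column of A.

```python
def map_outer_product(col_a, row_b):
    """Map function: emit one (index, vector) pair per nonzero entry of col_a."""
    for i, a in enumerate(col_a):
        if a != 0:
            yield i, [a * b for b in row_b]


def combine(pairs):
    """Sum the vectors that share a key (used here as both combiner and reducer)."""
    sums = {}
    for key, vec in pairs:
        if key in sums:
            sums[key] = [x + y for x, y in zip(sums[key], vec)]
        else:
            sums[key] = list(vec)
    return sorted(sums.items())


def run_map_task(split):
    """One map task: call the map function once per record of its split,
    then run the combiner over that task's output only (the assumption)."""
    emitted = [pair
               for col_a, row_b in split
               for pair in map_outer_product(col_a, row_b)]
    return emitted, combine(emitted)


def reduce_all(task_outputs):
    """Reducer: sum by key across the outputs of all map tasks."""
    return combine(pair for output in task_outputs for pair in output)


# A = [[1, 2], [3, 4]] stored as columns, B = [[5, 6], [7, 8]] stored as rows.
records = [([1, 3], [5, 6]), ([2, 4], [7, 8])]

# Case 1: one column/row pair per split, i.e. one outer product per map task.
one_per_split = [run_map_task([record]) for record in records]

# Case 2: both pairs in the same split, i.e. one map task sees both.
emitted_shared, combined_shared = run_map_task(records)

product_separate = reduce_all([combined for _, combined in one_per_split])
product_shared = reduce_all([combined_shared])
```

In case 1 every key inside a task is distinct, so the combiner hands on exactly what the mapper emitted and all the summing is left to the reducer. In case 2 the same combiner halves the number of pairs. Both cases end with the same product, which is what I would expect if the combiner is purely an optimization.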
2012/9/26 Sebastian Schelter <[email protected]>

> If I understand the discussion correctly, there is some confusion here.
>
> A map task is not the same as a single invocation of the function to map.
>
> A map task consumes a split and invokes the function to map for each
> key-value pair contained in the split. The function to combine is
> applied (usually several times, in some implementation specific way) to
> the output of all the invocations of that map task.
>
> --sebastian
>
> On 26.09.2012 15:40, Sigurd Spieckermann wrote:
> > Well, my word selection wasn't great when I said "one map task produces
> > only a single result". The way I meant this was that one map task only
> > produces a single outer product (that consist of multiple column vectors
> > hence multiple mapper emits), but those are not the ones to combine in
> > this case, right?
> >
> > 2012/9/26 Sigurd Spieckermann <[email protected]>
> >
> >> Yes, but one int/vector pair corresponds to the respective column of A
> >> multiplied by an element of the respective row of B, correct? So the
> >> concatenation of the resulting columns would be outer product of the
> >> column of A and the row of B. None of these vectors are summed up but
> >> rather the outer products of multiple map tasks are summed up. So what
> >> is the job of the combiner here? It would be nice if the combiner could
> >> sum up all outer products computed on that datanode, but this is the
> >> part I can't see happening in Hadoop. Is the general statement correct
> >> that a combiner is only applied to all outputs of a *map task* and that
> >> a map task processes all key-value pairs of a split? In this case,
> >> there is only one key-value pair per split, right? The int/vector being
> >> index and column/row of the matrix.
> >>
> >> 2012/9/26 Jake Mannix <[email protected]>
> >>
> >>> On Wed, Sep 26, 2012 at 4:49 AM, Sigurd Spieckermann
> >>> <[email protected]> wrote:
> >>>
> >>>> Hi guys,
> >>>>
> >>>> I'm trying to understand the way the combiner in Mahout SVD works.
> >>>> (https://cwiki.apache.org/MAHOUT/dimensional-reduction.html) As far
> >>>> as I know from the Mahout math matrix-multiplication implementation,
> >>>> matrix A is represented by column-vectors, matrix B is represented by
> >>>> row vectors and an inner join executes an outer product of the
> >>>> columns of A with the rows of B. All outer products are summed by the
> >>>> combiners and reducers. What I am wondering about is how a combiner
> >>>> can actually combine multiple outer products on the same datanode
> >>>> because the join-package requires the data to be partitioned into
> >>>> unsplittable files. In this case, I understand that one file contains
> >>>> one column/row of its corresponding matrix. Hence, each map task
> >>>> receives a column-row-tuple, computes the outer product and emits the
> >>>> result.
> >>>
> >>> This all sounds right, but not the following:
> >>>
> >>>> My understanding of Hadoop is that the combiner follows a map task
> >>>> immediately but one map task produces only a single result so there
> >>>> is nothing to combine.
> >>>
> >>> That part is not true - a mapper may emit more than one key-value pair
> >>> (and for matrix multiplication, this is true *a fortiori* - there is
> >>> one int/vector pair emitted per nonzero element of the row being
> >>> mapped over).
> >>>
> >>>> If the combiner could accumulate the results of multiple map task, I
> >>>> would understand the idea, but from my understanding and tests, it
> >>>> does not.
> >>>>
> >>>> Could anyone clarify the process please?
> >>>>
> >>>> Thanks a lot!
> >>>> Sigurd
> >>>
> >>> --
> >>>
> >>> -jake
