Re: Combiner applied on multiple map task outputs (like in Mahout SVD)

Sebastian Schelter Wed, 26 Sep 2012 07:07:12 -0700

Hi Sigurd,

I think that's the misconception then: "each stripe (column/row) is
stored in a single file".


Each split contains (IntWritable, VectorWritable) tuples. For the first
matrix, these represent the columns; for the second, they represent the
rows.

In order to compute the outer products, these two inputs are joined via
a map-side join conducted by Hadoop's composite input format. This is
very efficient because it exploits data locality: if two matching input
splits reside on the same machine, no network traffic is involved in
joining them.
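
To make the join step concrete, here is a minimal sketch in plain Java,
with no Hadoop classes involved. The class name OuterProductJoinSketch,
the method joinAndMultiply and the array-based inputs are my own
stand-ins for the (IntWritable, VectorWritable) tuples, not Mahout code.
It merge-joins two key-sorted inputs and sums the outer products of the
matching column/row pairs in memory. In a real job I would expect the
partial products to be emitted and summed downstream instead.

```java
import java.util.Arrays;

// Illustrative stand-in only: int keys and double[] vectors replace the
// (IntWritable, VectorWritable) tuples; this is not Mahout or Hadoop code.
class OuterProductJoinSketch {

    // Merge-joins two inputs that are both sorted by key and accumulates
    // the outer product of every matching (column, row) pair.
    static double[][] joinAndMultiply(int[] keysA, double[][] columnsOfA,
                                      int[] keysB, double[][] rowsOfB,
                                      int numRows, int numCols) {
        double[][] product = new double[numRows][numCols];
        int i = 0;
        int j = 0;
        while (i < keysA.length && j < keysB.length) {
            if (keysA[i] < keysB[j]) {
                i++;                      // key only present in the first input
            } else if (keysA[i] > keysB[j]) {
                j++;                      // key only present in the second input
            } else {
                double[] column = columnsOfA[i];
                double[] row = rowsOfB[j];
                for (int r = 0; r < numRows; r++) {
                    for (int c = 0; c < numCols; c++) {
                        product[r][c] += column[r] * row[c];
                    }
                }
                i++;
                j++;
            }
        }
        return product;
    }

    public static void main(String[] args) {
        // A = [[1, 2], [3, 4]] stored by column, B = [[5, 6], [7, 8]] stored by row.
        int[] keys = {0, 1};
        double[][] columnsOfA = {{1, 3}, {2, 4}};
        double[][] rowsOfB = {{5, 6}, {7, 8}};
        double[][] product = joinAndMultiply(keys, columnsOfA, keys, rowsOfB, 2, 2);
        System.out.println(Arrays.deepToString(product)); // [[19.0, 22.0], [43.0, 50.0]]
    }
}
```

The two-pointer loop only ever moves forward through both inputs, which
is why the join needs no shuffle when the inputs are sorted by the same
key.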

Note that this approach only works if both inputs are partitioned and
sorted in the same way, so that matching keys always end up in
corresponding splits.
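
A small sketch of why the partitioning has to match, again in plain
Java with made-up names (CoPartitioningSketch, partitionOf,
matchesFound). It assumes simple hash partitioning by key and counts how
many keys are found when partition i of one input is joined only with
partition i of the other.

```java
import java.util.HashSet;
import java.util.Set;

// Illustrative only: hash partitioning by key is an assumption made for
// this sketch, and the names are hypothetical.
class CoPartitioningSketch {

    // Assigns a key to a partition; the mask keeps the result non-negative.
    static int partitionOf(int key, int numPartitions) {
        return (key & Integer.MAX_VALUE) % numPartitions;
    }

    // Counts keys matched when partition p of the first input is joined
    // only with partition p of the second input.
    static int matchesFound(int[] keysA, int partitionsA, int[] keysB, int partitionsB) {
        int matches = 0;
        int shared = Math.min(partitionsA, partitionsB);
        for (int p = 0; p < shared; p++) {
            Set<Integer> inA = new HashSet<>();
            for (int key : keysA) {
                if (partitionOf(key, partitionsA) == p) {
                    inA.add(key);
                }
            }
            for (int key : keysB) {
                if (partitionOf(key, partitionsB) == p && inA.contains(key)) {
                    matches++;
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int[] keys = {0, 1, 2, 3, 4, 5};
        System.out.println(matchesFound(keys, 2, keys, 2)); // 6: every key is matched
        System.out.println(matchesFound(keys, 2, keys, 3)); // 2: only keys 0 and 1 line up
    }
}
```

With identical partitioning every key meets its partner locally; with
two partitions on one side and three on the other, most matches are
missed.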

--sebastian
