OK, so we were on the same page except for one point: a split stores multiple columns/rows, not just one. In that case, if there are N splits on each datanode, N (big, but probably sparse) matrices will still be transferred over the network from that node, because only the emits of a single split can be combined. Also, the HDFS block size and the number of nonzero elements per column/row put an upper bound on how much of a matrix fits into one split, right? So the bigger the matrix, the smaller the gain from the combiner, because fewer stripes fit into one split. Is this not a problem in practice? And regarding my particular application: is there any way to combine the outputs of multiple map tasks, rather than only the results of a single map task?
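
To make the concern concrete, here is a minimal sketch in plain Python. It is not Mahout's actual code: the function names and the dict-based sparse representation are made up for illustration, and I assume each split already holds its joined (column of A, row of B) pairs. The combiner sums the outer products within one split only, so one partial matrix per split still goes to the reducer.

```python
from collections import defaultdict

def outer_product(col, row):
    """Sparse outer product of a column {i: a_i} and a row {j: b_j}."""
    return {(i, j): a * b for i, a in col.items() for j, b in row.items()}

def add_into(acc, partial):
    """Add the entries of one sparse matrix into an accumulator."""
    for key, value in partial.items():
        acc[key] += value

def map_and_combine(split):
    """One map task: sum the outer products of all stripes in this split.

    The combiner only ever sees the emits of this one split.
    """
    combined = defaultdict(float)
    for col, row in split:
        add_into(combined, outer_product(col, row))
    return dict(combined)

def reduce_partials(partials):
    """Reducer side: sum the partial matrices received from all splits."""
    total = defaultdict(float)
    for partial in partials:
        add_into(total, partial)
    return dict(total)

# A = [[1, 2], [3, 4]] stored by columns, B = [[5, 6], [7, 8]] stored by rows.
# Hypothetical layout: two splits with one joined stripe each.
splits = [
    [({0: 1.0, 1: 3.0}, {0: 5.0, 1: 6.0})],  # column 0 of A, row 0 of B
    [({0: 2.0, 1: 4.0}, {0: 7.0, 1: 8.0})],  # column 1 of A, row 1 of B
]

partials = [map_and_combine(s) for s in splits]  # one partial matrix per split
product = reduce_partials(partials)              # A * B as a sparse dict
```

With two splits, `partials` holds two partial matrices, which is the "N matrices for N splits" situation described above; putting both stripes into one split would leave only one.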

2012/9/26 Sebastian Schelter <[email protected]>

> Hi Sigurd,
>
> I think that's the misconception then: "each stripe (column/row) is
> stored in a single file".
>
> Each split contains (IntWritable, VectorWritable)-tuples, for the first
> matrix, these represent the columns, for the second, these represent the
> rows.
>
> In order to compute the outer products, these two inputs are joined via
> a map-side join conducted by Hadoop's composite input format. This is a
> very effective way, because you can exploit data locality. If you have
> two matching input splits on the same machine, there is no network
> traffic involved in joining them.
>
> Note that this approach only works if both inputs are partitioned and
> sorted in the same way.
>
> --sebastian
>
