On Tue, Aug 30, 2011 at 2:24 PM, Dmitriy Ryaboy <[email protected]> wrote:

> If I understand correctly, you are objecting to the reshuffle of
> foo_grouped
> based on col_a when it was just shuffled based on col_a.
>
> I've considered this. We currently don't take existing partitioning into
> account; that would be handy. But we need to evaluate the cost/benefit
> here.
>
>
Yes.  Exactly.I think the cost/benefit would be clear.  You don't have to
read it on the source or transfer it over the network again.  However, you
do have to write it to disk on the remote end and there may be situations
where reduce doesn't need to be written to disk…


> This only works when parallelism is the same for both group operators.
>
>
I would assume this is the default.  But I agree.  All our jobs run with the
same parallelism.


> As Hadoop does not specify locality policies for reducers (afaik), the
> sidefiles would likely need to be read over the network. This might be
> faster than what happens now because you avoid some IO and the sort, but I
> am not totally convinced it materially affects total runtime for
> non-trivial
> examples.
>

If the side files are / were written to HDFS they would be replicated and
the locality would them become irrelevant.

Your mappers would always have the side files.

With only one elided group by I think this actually might be MORE data than
required.  For example, if you are on machines A, B, C and the first MR was
ran on A and the partition/reduced data was there and then during the second
MR job you ended up on B then you would need to read the data from A-B… This
would require network bandwidth.

If you had at least three operations, two of which re-used existing data,
then the network bandwidth wouldn't be wasted.  Another upside to writing to
HDFS is that you don't have to re-run the job if one of the nodes dies..


>
> Do you have any cheap ways to estimate how much savings that gets us in
> these kinds of scenarios?
>

I think it would be obvious.  Maybe I'm missing something.

For multiple GROUPs > 2/3 the savings would be significant.  All the file
IO, CPU, and network IO is removed and you just use one of the partitioned
relations.

In my app… this would yield a 5-10x speedup…

Kevin

-- 

Founder/CEO Spinn3r.com

Location: *San Francisco, CA*
Skype: *burtonator*

Skype-in: *(415) 871-0687*

Reply via email to