On Tue, Aug 30, 2011 at 2:24 PM, Dmitriy Ryaboy <[email protected]> wrote:
> If I understand correctly, you are objecting to the reshuffle of > foo_grouped > based on col_a when it was just shuffled based on col_a. > > I've considered this. We currently don't take existing partitioning into > account; that would be handy. But we need to evaluate the cost/benefit > here. > > Yes. Exactly.I think the cost/benefit would be clear. You don't have to read it on the source or transfer it over the network again. However, you do have to write it to disk on the remote end and there may be situations where reduce doesn't need to be written to disk… > This only works when parallelism is the same for both group operators. > > I would assume this is the default. But I agree. All our jobs run with the same parallelism. > As Hadoop does not specify locality policies for reducers (afaik), the > sidefiles would likely need to be read over the network. This might be > faster than what happens now because you avoid some IO and the sort, but I > am not totally convinced it materially affects total runtime for > non-trivial > examples. > If the side files are / were written to HDFS they would be replicated and the locality would them become irrelevant. Your mappers would always have the side files. With only one elided group by I think this actually might be MORE data than required. For example, if you are on machines A, B, C and the first MR was ran on A and the partition/reduced data was there and then during the second MR job you ended up on B then you would need to read the data from A-B… This would require network bandwidth. If you had at least three operations, two of which re-used existing data, then the network bandwidth wouldn't be wasted. Another upside to writing to HDFS is that you don't have to re-run the job if one of the nodes dies.. > > Do you have any cheap ways to estimate how much savings that gets us in > these kinds of scenarios? > I think it would be obvious. Maybe I'm missing something. For multiple GROUPs > 2/3 the savings would be significant. All the file IO, CPU, and network IO is removed and you just use one of the partitioned relations. In my app… this would yield a 5-10x speedup… Kevin -- Founder/CEO Spinn3r.com Location: *San Francisco, CA* Skype: *burtonator* Skype-in: *(415) 871-0687*
