Hi David,

A reduce task's progress percentage is a good indicator of the phase it is in: the first 33% is the COPY phase, 33-66% is the SORT phase, and 66-100% is the user-code (reduce function) phase.
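
To make the arithmetic concrete, here is a minimal sketch of that mapping. The class and method names are hypothetical (they are not part of Hadoop), and it assumes the three phases split the progress value into exact thirds as described above.

```java
// Hypothetical helper, not a Hadoop API: maps a reduce task's overall
// progress (0.0 to 1.0) onto the three phases described above.
class ReducePhaseSketch {

    /** Returns the phase a reduce task is in for a given overall progress. */
    static String phaseOf(double progress) {
        if (progress < 1.0 / 3) return "COPY";
        if (progress < 2.0 / 3) return "SORT";
        return "REDUCE"; // user code (the reduce function) is running
    }

    /** Returns how far through the user-code phase the task is (0.0 to 1.0). */
    static double userCodeFraction(double progress) {
        double fraction = (progress - 2.0 / 3) * 3;
        return Math.min(1.0, Math.max(0.0, fraction));
    }

    public static void main(String[] args) {
        double progress = 0.80;
        System.out.println(phaseOf(progress));
        System.out.println(userCodeFraction(progress));
    }
}
```

Under that assumption, a task reporting 80% is already about 40% of the way through the user-code phase, well past COPY and SORT.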

If your reduce task, at an individual level, is hanging at 80%, then the cause is not the COPY mechanism, as that was already completed long ago. Also, given the timeouts etc. built into the COPY phase (which isn't user code, btw), it'd be extremely surprising if the task hung and timed out in that phase rather than failing outright.

You could do (1) at a per-job level, or alternatively you can investigate what is causing the sudden hang (memory filling up? slowing I/O?) and try to address that. If you have an operation that may take over 10 minutes to return before proceeding to the next key/value iteration, then it's better to set a status update within that operation, or from a background daemon thread that reports a fresh status at intervals shorter than 10 minutes, so that the JT (JobTracker) is aware the task is still alive.

I'm not sure if (2), (2.5) and (3) are relevant here, but that's a yes to (2.5): the number of reduces is a purely per-job setting.

I guess (4) helps improve COPY phase speeds, but per your post I doubt you're seeing any perf problems with COPY.

On Mon, May 13, 2013 at 11:35 AM, David Parks <[email protected]> wrote:
> I have a job that’s getting 600s task timeouts during the copy phase of the
> reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and
> it’s taking longer than 10 min to do that copy.
>
> The process starts copying when the reduce step is 80% complete. This is a
> very IO bound task as I’m just joining 1.5TB of data via 2 map/reduce steps
> on 6 nodes (each node has 1x 4TB disk, and 24GB of ram).
>
> What should I be thinking in terms of fixing this?
>
> · Increase timeout? (seems odd that it would timeout on the internal copy)
> · Reduce # tasks? (I’ve got 8 reducers, 1-per-core, 25 io.sort.factor &
>   256 io.sort.mb)
>   o Can I do that per job??
> · Increase copy threads?
> · Don’t start the reducers until 100% complete on the mappers?

--
Harsh J
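
P.S. A minimal sketch of the background-thread status update mentioned above, in case it helps. It is written against a hypothetical `StatusReporter` interface so that it runs without Hadoop on the classpath; in a real job you would pass in the task's `Reporter` (old API) or `Context` (new API), both of which expose `progress()` and `setStatus(String)`. The class name, method names and status text are all made up for illustration, and whether it is safe to call the reporter from a second thread in your Hadoop version is something to verify rather than assume.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Hadoop code: keeps a task "alive" by reporting
// status from a daemon thread while a slow operation runs.
class HeartbeatSketch {

    /** Stand-in for the two calls used on Hadoop's Reporter / Context. */
    interface StatusReporter {
        void setStatus(String status);
        void progress();
    }

    /** Starts a daemon thread that reports a new status at a fixed period. */
    static ScheduledExecutorService startHeartbeat(
            StatusReporter reporter, long period, TimeUnit unit) {
        ScheduledExecutorService scheduler =
                Executors.newSingleThreadScheduledExecutor(r -> {
                    Thread t = new Thread(r, "task-heartbeat");
                    t.setDaemon(true); // must never keep the task JVM alive
                    return t;
                });
        AtomicInteger beats = new AtomicInteger();
        scheduler.scheduleAtFixedRate(() -> {
            // A different status string on every beat, as suggested above.
            reporter.setStatus("still working, heartbeat #" + beats.incrementAndGet());
            reporter.progress();
        }, period, period, unit);
        return scheduler;
    }

    /** Runs the slow operation with a heartbeat and always stops it afterwards. */
    static void runWithHeartbeat(
            StatusReporter reporter, long period, TimeUnit unit, Runnable slowOperation) {
        ScheduledExecutorService heartbeat = startHeartbeat(reporter, period, unit);
        try {
            slowOperation.run();
        } finally {
            heartbeat.shutdownNow();
        }
    }
}
```

Inside a reducer you would wrap only the slow call, with a period comfortably below the task timeout (a few minutes against a 10-minute limit, say). The caveat is that a heartbeat like this also hides a genuine hang, so it is worth finding out why the operation is slow before relying on it.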
