I have a job that's hitting the 600s task timeout during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and the copy is taking longer than 10 minutes.
The copying starts when the reduce step is 80% complete. This is a very IO-bound job: I'm just joining 1.5TB of data via 2 map/reduce steps on 6 nodes (each node has 1x 4TB disk and 24GB of RAM). What should I be thinking about to fix this?

- Increase the timeout? (It seems odd that it would time out on the internal copy.)
- Reduce the number of tasks? (I've got 8 reducers, 1 per core, with io.sort.factor=25 and io.sort.mb=256.)
  - Can I do that per job?
- Increase the number of copy threads?
- Don't start the reducers until the mappers are 100% complete?
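For reference, my understanding is that most of these knobs can be overridden per job on the command line if the job's driver uses ToolRunner/GenericOptionsParser. A sketch of what I mean, assuming Hadoop 1.x property names (the jar name, class name, and paths are placeholders; values are only examples, not recommendations):

```shell
# Per-job overrides via -D generic options (must come before job arguments).
# mapred.task.timeout                    - ms before a non-reporting task is killed (default 600000)
# mapred.reduce.parallel.copies          - parallel fetch threads per reducer (default 5)
# mapred.reduce.slowstart.completed.maps - fraction of maps done before reducers launch
# mapred.reduce.tasks                    - number of reducers for this job only
hadoop jar myjob.jar MyJoinJob \
  -D mapred.task.timeout=1200000 \
  -D mapred.reduce.parallel.copies=10 \
  -D mapred.reduce.slowstart.completed.maps=1.0 \
  -D mapred.reduce.tasks=4 \
  /input/path /output/path
```

This only works if the driver runs through ToolRunner; otherwise the equivalent calls would go on the JobConf in the driver code.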
