I have a job that's hitting the 600s task timeout during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and the copy is taking longer than 10 minutes.
The copying starts when the reduce step is 80% complete. This is a very IO-bound job: I'm just joining 1.5TB of data via 2 map/reduce steps on 6 nodes (each node has 1x 4TB disk and 24GB of RAM). What should I be thinking about to fix this?

- Increase the timeout? (It seems odd that it would time out on the internal copy.)
- Reduce the number of tasks? (I've got 8 reducers, 1 per core, with io.sort.factor=25 and io.sort.mb=256.)
  - Can I do that per job?
- Increase the number of copy threads?
- Don't start the reducers until the mappers are 100% complete?
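For reference, my understanding is that most of these knobs can be overridden per job on the command line if the job's driver uses ToolRunner/GenericOptionsParser. A sketch of what I mean, assuming Hadoop 1.x property names (the jar name, class name, and paths are placeholders; values are only examples, not recommendations):

```shell
# Per-job overrides via -D generic options (must come before job arguments).
# mapred.task.timeout                    - ms before a non-reporting task is killed (default 600000)
# mapred.reduce.parallel.copies          - parallel fetch threads per reducer (default 5)
# mapred.reduce.slowstart.completed.maps - fraction of maps done before reducers launch
# mapred.reduce.tasks                    - number of reducers for this job only
hadoop jar myjob.jar MyJoinJob \
  -D mapred.task.timeout=1200000 \
  -D mapred.reduce.parallel.copies=10 \
  -D mapred.reduce.slowstart.completed.maps=1.0 \
  -D mapred.reduce.tasks=4 \
  /input/path /output/path
```

This only works if the driver runs through ToolRunner; otherwise the equivalent calls would go on the JobConf in the driver code.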
