Hi,

We had a cluster hang using Spark master (standalone, as an EMR job) a few days ago. The last output in the master log is that Stage 5 was still running:
14:26 INFO scheduler.DAGScheduler: running: Set(Stage 5)

So, to see which task hadn't completed, I went looking for "Completed ShuffleMapTask(5, ...)" entries. To find which one was missing, I ran:

$ cat master.stderr | grep 'MapTask(5, ' | cut -d ' ' -f 7 | sort

Stage 5 had 58 tasks, and tasks 0-57 had all completed except task 14, so I assume that is what was hanging the job.

Looking for task 14, it did get started:

14:15 cluster.ClusterTaskSetManager: Starting task 5.0:14 as TID 523 on executor 8: ip-10-40-7-103.ec2.internal (PROCESS_LOCAL)

And it showed up on that slave:

14:15 executor.Executor: Running task ID 523

But later the master ignored the slave's task update:

14:16 cluster.ClusterScheduler: Ignoring update from TID 523 because its task set is gone

And then that's it. I don't know anything about the "task set is gone" scenario; what should have happened? Should the task have been retried?

I've uploaded the master log, slave log, and a slave jstack here:

https://gist.github.com/anonymous/ed6a57b0e3aff13d9f49

There are also a few suspicious-looking connection failures / "Could not get block" errors on the slave a minute or so after it started running TID 523, so I would not be surprised if the task actually failed on the slave; but whatever the result, the master seems to have ignored it and so didn't retry.

Are these logs enough for someone to piece together what happened? Any hints on where I could look? Is there something else I could grab off the cluster next time this happens that would help?

Thanks!

- Stephen
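P.S. In case it's useful to anyone poking at the gist: a variation of the grep above that prints only the task indices with no completion entry (assuming the same log format, i.e. that field 7 holds the task number, possibly with trailing punctuation) would be something like:

$ comm -23 <(seq 0 57 | sort) \
    <(grep 'MapTask(5, ' master.stderr | cut -d ' ' -f 7 | tr -d '),' | sort -u)

On this log it should print just 14.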
