We have a series of Spark jobs that run in succession over various cached datasets, perform small groups and transforms, and then call saveAsSequenceFile() on the results.
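For context, each step follows roughly the pattern below. This is only a sketch: the input path, key/value types, and transform are placeholders standing in for our real datasets, not our actual job.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SaveSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("save-sketch"))

    // Cached input (hypothetical path).
    val lines = sc.textFile("hdfs:///data/input").cache()

    // Small group + transform: count records per key.
    val counts = lines
      .map(line => (line.split(",")(0), 1L))
      .reduceByKey(_ + _)

    // Write the result out as a SequenceFile; this is the call that "finishes"
    // and is then followed by the long pause described below.
    counts.saveAsSequenceFile("hdfs:///data/output/counts")

    sc.stop()
  }
}
```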
Each call to saveAsSequenceFile appears to finish its work: the task reports that it completed in "xxx.xxxxx seconds", but then there is a pause, and the pauses are significant, sometimes up to 2 minutes. We are trying to figure out what is going on during this pause: whether the executors are really still writing the sequence files, or whether a race condition on an executor is causing timeouts. Any ideas? Has anyone else seen this?

We also tried running all the saveAsSequenceFile calls in separate futures (sketched below) to see whether the pauses would overlap and the total wait would drop to 1-2 minutes, but the waiting still takes roughly the sum of the individual pauses (several minutes). The job runs for about 35 minutes end to end, and we estimate at least 16 of those minutes are spent in this paused state.

What I haven't been able to figure out is how to trace what all the executors are doing during the pause. Is there a way to do this? The event logs in YARN don't seem to help much here.
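For reference, the concurrent-save attempt looked roughly like this. It is a sketch, and the helper name, RDD element types, and (rdd, outputPath) pairing are placeholders rather than our actual code:

```scala
import org.apache.spark.rdd.RDD

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, Future}

// Submit every saveAsSequenceFile from its own driver-side thread so the jobs
// run concurrently, then block until all of them (pauses included) complete.
def saveAllConcurrently(datasets: Seq[(RDD[(String, Long)], String)]): Unit = {
  val saves = datasets.map { case (rdd, outputPath) =>
    Future { rdd.saveAsSequenceFile(outputPath) }
  }
  Await.result(Future.sequence(saves), Duration.Inf)
}
```

Even with the saves submitted concurrently like this, the total time spent waiting did not shrink, which is what makes us suspect the pause is not simple write latency.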