Hi,

we've been seeing occasional issues in production with the FileOutputCommitter
getting into a deadlocked state.

We are writing our data to S3 and currently have speculation enabled. What
we see is that Spark gets a file-not-found error trying to access a
temporary part file that it wrote (it seems to be the part-#2 file every
time?), so the task fails. But the file actually exists in S3, so subsequent
speculative attempts and task retries all fail because the committer tells
them the file already exists. This persists until human intervention kills
the application. Rerunning the application usually succeeds on the next
try, so it is not deterministic with respect to the dataset or anything.
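For context, our job is essentially the following (a simplified sketch, not
our actual pipeline; the bucket names and paths are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  // Simplified sketch of our setup; bucket and paths are placeholders.
  val conf = new SparkConf()
    .setAppName("s3-write-job")
    // Speculation is on, so slow tasks get duplicate speculative attempts,
    // each of which writes its own temporary part file via the committer.
    .set("spark.speculation", "true")

  val sc = new SparkContext(conf)

  val data = sc.textFile("s3n://example-bucket/input/")

  // The failure shows up here: one task hits a FileNotFoundException on the
  // temporary part file it just wrote, and every subsequent retry or
  // speculative attempt is then rejected because the file does exist in S3.
  data.saveAsTextFile("s3n://example-bucket/output/")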

It seems like there isn't a good story yet for file writing with speculation
enabled (https://issues.apache.org/jira/browse/SPARK-4879), although our
failure seems worse than the reports in that issue, since I believe ours
deadlocks and those don't?

Has anyone else observed deadlocking like this?

Thanks,
Richard
