[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273830#comment-15273830 ]

Apache Spark commented on SPARK-14915:
--------------------------------------

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12950

> Tasks that fail due to CommitDeniedException (a side-effect of speculation)
> can cause job to never complete
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-14915
>                 URL: https://issues.apache.org/jira/browse/SPARK-14915
>             Project: Spark
>          Issue Type: Bug
>    Affects Versions: 1.5.3, 1.6.2, 2.0.0
>            Reporter: Jason Moore
>            Assignee: Jason Moore
>            Priority: Critical
>             Fix For: 1.6.2, 2.0.0
>
> In SPARK-14357, code was corrected towards the originally intended behavior:
> a CommitDeniedException should not count towards the failure count for a job.
> After having run with this fix for a few weeks, it's become apparent that
> this behavior has an unintended consequence: a speculative task will
> continuously receive a CDE from the driver, now causing it to fail and retry
> over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver into
> TaskState.FINISHED, or some other state to indicate that the task shouldn't
> be resubmitted by the TaskScheduler. I'd probably need some opinions on
> whether there are other consequences of doing something like this.
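To make the proposal above concrete, here is a minimal, self-contained Scala sketch of the idea: model the failure reason so that a commit-denied failure is terminal and never re-queued, while ordinary failures stay retryable. All names here (TaskFailureReason, CommitDenied, SchedulerSketch.handleFailedTask) are illustrative stand-ins, not Spark's actual scheduler code.

{code:scala}
import scala.collection.mutable

// Illustrative failure reasons; a denied commit marks the task as terminal.
sealed trait TaskFailureReason { def shouldResubmit: Boolean }

case object CommitDenied extends TaskFailureReason {
  // The driver denied the commit (typically because another attempt already
  // won the commit race), so re-running this task can never succeed.
  val shouldResubmit = false
}

final case class ExceptionFailure(error: Throwable) extends TaskFailureReason {
  // An ordinary failure: the task deserves another attempt.
  val shouldResubmit = true
}

object SchedulerSketch {
  // Simplified stand-in for the scheduler's re-queue decision: only failures
  // flagged as retryable go back into the pending queue.
  def handleFailedTask(index: Int, reason: TaskFailureReason,
                       pending: mutable.Queue[Int]): Unit = {
    if (reason.shouldResubmit) pending.enqueue(index)
    // else: treat the task as finished rather than handing it back for retry.
  }
}
{code}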
[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261607#comment-15261607 ]

Apache Spark commented on SPARK-14915:
--------------------------------------

User 'jasonmoore2k' has created a pull request for this issue:
https://github.com/apache/spark/pull/12751
[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261252#comment-15261252 ]

Jason Moore commented on SPARK-14915:
--------------------------------------

That's exactly my current thinking too. But even if we keep allowing some
tasks to be retried without limit in certain contexts (the two I'm currently
aware of are: commit denied on speculative tasks, and an executor lost because
of a YARN de-allocation), it does seem that the commit denial often happens
when another copy of the task has already succeeded. I'm about to test not
re-queuing the task in that scenario.
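For concreteness, the two contexts mentioned above can be pictured as failure reasons that are exempt from the task-failure limit. The sketch below is loosely modeled on Spark's TaskFailedReason hierarchy, but the names and fields are simplified assumptions for illustration, not the actual source:

{code:scala}
sealed trait FailureReason {
  // Whether this failure should count towards spark.task.maxFailures.
  def countTowardsTaskFailures: Boolean = true
}

// A speculative copy lost the output-commit race: retry freely, don't count it.
final case class TaskCommitDenied(jobId: Int, partitionId: Int,
                                  attemptNumber: Int) extends FailureReason {
  override def countTowardsTaskFailures: Boolean = false
}

// Executor lost for reasons outside the app (e.g. a YARN de-allocation):
// only count it against the task if the app itself caused the exit.
final case class ExecutorLostFailure(executorId: String,
                                     exitCausedByApp: Boolean) extends FailureReason {
  override def countTowardsTaskFailures: Boolean = exitCausedByApp
}
{code}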
[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260623#comment-15260623 ]

Andrew Or commented on SPARK-14915:
-----------------------------------

I haven't looked into the scheduler code in detail yet, but it seems to me the
bug is not caused by your fix to use `CausedBy`. Rather, the bug has always
existed, and your fix just uncovered it. It does seem like a problem in the
scheduler; under no circumstances should we retry a stage without limit.
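For readers unfamiliar with the `CausedBy` fix being referenced: SPARK-14357 used an extractor that unwraps nested exception causes, so a CommitDeniedException hidden inside a wrapper exception can still be recognized. A minimal sketch of that shape (simplified, not the exact Spark source; the CommitDeniedException and FailureClassifier definitions here are stand-ins):

{code:scala}
// Stand-in for Spark's CommitDeniedException.
class CommitDeniedException(msg: String) extends Exception(msg)

// Recursively unwraps getCause, yielding the innermost (root) cause.
object CausedBy {
  def unapply(e: Throwable): Option[Throwable] =
    Option(e.getCause).flatMap(unapply).orElse(Some(e))
}

object FailureClassifier {
  // The intended behavior: a denied commit should not count towards the
  // task-failure limit, no matter how many layers wrap the exception.
  def countsTowardsTaskFailures(e: Throwable): Boolean = e match {
    case CausedBy(_: CommitDeniedException) => false
    case _ => true
  }
}
{code}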
[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete
[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259920#comment-15259920 ]

Jason Moore commented on SPARK-14915:
--------------------------------------

Could I get thoughts on this: at
[TaskSetManager.scala#L723|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L723]
a call is made to addPendingTask after a task has failed. I can think of a
scenario where it would be a good idea not to add the task back into the
pending queue: when successful(index) == true, which implies that another copy
of the task has already succeeded. I'm soon going to test adding that
condition, as I think this is quite possibly what causes tasks to continually
re-queue after a CDE until the stage has completed (further lengthening the
stage, since those retries take up execution resources).
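The proposed condition, as a self-contained sketch. The field and method names mirror TaskSetManager's `successful` array and `addPendingTask`, but the class below is a simplified stand-in for illustration, not the actual patch:

{code:scala}
import scala.collection.mutable

class TaskSetManagerSketch(numTasks: Int) {
  // Mirrors TaskSetManager.successful: whether some attempt of each task
  // has already succeeded (simplified).
  private val successful = Array.fill(numTasks)(false)
  private val pendingTasks = mutable.Queue[Int]()

  private def addPendingTask(index: Int): Unit = pendingTasks.enqueue(index)

  // The guard proposed above: only re-queue a failed task if no other copy
  // of it has succeeded yet.
  def handleFailedTask(index: Int): Unit = {
    if (!successful(index)) {
      addPendingTask(index) // no attempt has won yet, so retry
    } else {
      // Another (speculative) copy already committed its output; a retry
      // would only be denied the commit again, so don't resubmit.
      println(s"Not re-queuing task $index: another attempt already succeeded.")
    }
  }
}
{code}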