[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-05-06 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15273830#comment-15273830 ]

Apache Spark commented on SPARK-14915:
--

User 'srowen' has created a pull request for this issue:
https://github.com/apache/spark/pull/12950

> Tasks that fail due to CommitDeniedException (a side-effect of speculation) 
> can cause job to never complete
> ---
>
> Key: SPARK-14915
> URL: https://issues.apache.org/jira/browse/SPARK-14915
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.3, 1.6.2, 2.0.0
>Reporter: Jason Moore
>Assignee: Jason Moore
>Priority: Critical
> Fix For: 1.6.2, 2.0.0
>
>
> In SPARK-14357, code was corrected towards the originally intended behavior 
> that a CommitDeniedException should not count towards the failure count for a 
> job.  After having run with this fix for a few weeks, it has become apparent 
> that this behavior has an unintended consequence: a speculative task can 
> continuously receive a CDE from the driver, causing it to fail and be retried 
> over and over without limit.
> I'm thinking we could put a task that receives a CDE from the driver into 
> TaskState.FINISHED or some other state to indicate that the task shouldn't 
> be resubmitted by the TaskScheduler. I'd probably need some opinions on 
> whether there are other consequences of doing something like this.
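
To make the retry loop concrete, here is a minimal, self-contained sketch of the behaviour described above (illustrative names only, not the actual Spark scheduler code; maxFailures stands in for spark.task.maxFailures):

{code:scala}
// Illustrative model only -- not the real Spark scheduler.
object CommitDeniedLoopSketch {
  val maxFailures = 4  // stands in for spark.task.maxFailures

  def main(args: Array[String]): Unit = {
    var countedFailures = 0
    var attempts = 0
    var resubmit = true

    while (resubmit && attempts < 10) {  // cap at 10 only so the demo terminates
      attempts += 1
      val commitDenied = true            // the driver keeps denying: another copy won the commit

      // Since SPARK-14357 a commit denial no longer counts towards the failure limit...
      if (!commitDenied) countedFailures += 1

      // ...so the abort threshold is never reached and the task is re-queued every time.
      resubmit = countedFailures < maxFailures
    }
    println(s"attempts = $attempts, countedFailures = $countedFailures")  // 10 attempts, 0 counted failures
  }
}
{code}

The proposal above amounts to adding a third outcome: on a commit denial, mark the task as finished (or otherwise non-resubmittable) instead of putting it back in the pending queue.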






[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261607#comment-15261607 ]

Apache Spark commented on SPARK-14915:
--

User 'jasonmoore2k' has created a pull request for this issue:
https://github.com/apache/spark/pull/12751







[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Jason Moore (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15261252#comment-15261252 ]

Jason Moore commented on SPARK-14915:
-

That's exactly my current thinking too.  But even if we keep allowing some tasks 
to be retried without limit in certain contexts (the two I'm currently aware of 
are a commit denied on a speculative task, and an executor lost because of a YARN 
de-allocation), it does seem that the commit denial often happens when another 
copy of the task has already succeeded.  I'm about to do some testing on this 
now, and on not re-queuing the task in this scenario.
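
As a rough illustration of the two "retry without limit" contexts mentioned here, a standalone sketch (these are illustrative stand-ins, not Spark's real TaskEndReason classes):

{code:scala}
// Illustrative stand-ins for the failure kinds discussed above.
sealed trait FailureKindSketch { def countsTowardsFailureLimit: Boolean }

case object CommitDeniedOnSpeculativeTask extends FailureKindSketch {
  val countsTowardsFailureLimit = false  // the losing speculative copy is not at fault
}
case object ExecutorLostToYarnDeallocation extends FailureKindSketch {
  val countsTowardsFailureLimit = false  // the executor was taken away; not a task bug
}
case object OrdinaryTaskFailure extends FailureKindSketch {
  val countsTowardsFailureLimit = true   // counts towards spark.task.maxFailures
}
{code}

In both exempt cases the failed attempt still goes back into the pending queue today; the observation above is that for the commit-denied case that re-queue is usually pointless, because another copy of the task has already succeeded.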







[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Andrew Or (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15260623#comment-15260623 ]

Andrew Or commented on SPARK-14915:
---

I haven't looked into the scheduler code in detail yet, but it seems to me that 
the bug is not caused by your fix to use `CausedBy`. Rather, the bug has always 
existed, and your fix just uncovered it. It does seem like a problem in the 
scheduler; under no circumstances should we retry a stage without limit.







[jira] [Commented] (SPARK-14915) Tasks that fail due to CommitDeniedException (a side-effect of speculation) can cause job to never complete

2016-04-27 Thread Jason Moore (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-14915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15259920#comment-15259920 ]

Jason Moore commented on SPARK-14915:
-

Could I get thoughts on this: at 
[TaskSetManager.scala#L723|https://github.com/apache/spark/blob/branch-1.6/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala#L723]
 a call is made to addPendingTask after a task has failed.  I can think of a 
scenario where it might be a good idea not to add the task back into the pending 
queue: when success(index) == true (which implies that another copy of the task 
has already succeeded).

I'm soon going to test it out with that condition added, as I think it's quite 
possibly what is causing tasks to continually re-queue after a CDE until the 
stage has completed (further lengthening the stage, as those retries take up 
execution resources).
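
A standalone sketch of the guard being proposed (an illustrative class, not a patch against TaskSetManager; the per-index succeeded flag here plays the role of the success flag referenced above):

{code:scala}
import scala.collection.mutable

// Standalone sketch of the proposed re-queue guard; illustrative only.
class PendingQueueSketch(numTasks: Int) {
  private val succeeded = new Array[Boolean](numTasks)  // per-index "another copy already won" flag
  private val pending = mutable.Queue[Int]()

  def markSucceeded(index: Int): Unit = succeeded(index) = true

  /** Called when a task attempt fails, e.g. with a CommitDeniedException. */
  def handleFailedTask(index: Int): Unit = {
    if (succeeded(index)) {
      // Another copy of this task has already committed; re-queuing it would only
      // burn executor slots until the stage finishes, so drop the attempt instead.
      println(s"Task $index failed but will not be re-executed: another copy already succeeded")
    } else {
      pending.enqueue(index)  // today's unconditional addPendingTask(index) path
    }
  }

  def pendingTaskIndexes: Seq[Int] = pending.toSeq
}
{code}

With speculation enabled, the failure handler is typically reached for the losing copy only after the winner has committed, which is exactly the case the guard skips.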



