[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user squito commented on the issue: https://github.com/apache/spark/pull/15213 made a minor comment b/c I saw it, but don't think this should hold up getting it merged. lgtm for merging as-is and following up later. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user zsxwing commented on the issue: https://github.com/apache/spark/pull/15213 LGTM. I'm going to merge this in favor of of another RC since the fix here is correct. If there are any further style suggestions, we can do it in a follow up PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user kayousterhout commented on the issue: https://github.com/apache/spark/pull/15213 LGTM. My understanding of @markhamstra's comment above is that he's ok with this as-is; @squito any last comments here? Thanks for fixing this @scwf! --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user rxin commented on the issue: https://github.com/apache/spark/pull/15213 What else is needed to merge this? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66036/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66036 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66036/consoleFull)** for PR 15213 at commit [`09077cb`](https://github.com/apache/spark/commit/09077cb855d961c27a63c26468f03aadaf75316a). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66036 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66036/consoleFull)** for PR 15213 at commit [`09077cb`](https://github.com/apache/spark/commit/09077cb855d961c27a63c26468f03aadaf75316a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/15213 retest this please --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Merged build finished. Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66031/ Test FAILed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66031 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66031/consoleFull)** for PR 15213 at commit [`09077cb`](https://github.com/apache/spark/commit/09077cb855d961c27a63c26468f03aadaf75316a). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/15213 @kayousterhout Thanks for your comment, i have updated based on all your comment. . --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66031 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66031/consoleFull)** for PR 15213 at commit [`09077cb`](https://github.com/apache/spark/commit/09077cb855d961c27a63c26468f03aadaf75316a). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66006/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66006 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66006/consoleFull)** for PR 15213 at commit [`f91d86f`](https://github.com/apache/spark/commit/f91d86f92a7070b0e8ed63773ecf1020975bc2fb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #66006 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66006/consoleFull)** for PR 15213 at commit [`f91d86f`](https://github.com/apache/spark/commit/f91d86f92a7070b0e8ed63773ecf1020975bc2fb). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/15213 > Honestly, I think just getting the fix in is important enough that I'm fine w/ putting in the minimally invasive thing now. That's fine, @squito -- go ahead and merge when you're happy with your requested changes, and then I'll follow up in short order with a separate refactoring PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user squito commented on the issue: https://github.com/apache/spark/pull/15213 gosh this is a serious bug, can't believe we haven't found it already. Thanks for reporting and working on a fix. Honestly, I think just getting the fix in is important enough that I'm fine w/ putting in the minimally invasive thing now. Perhaps also makes it easier to backport. Mark & I can keep discussing some cleanup independently. I'd like to see the test improved slightly, but otherwise I think this is fine to merge now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/15213 @scwf I understand that you were trying to make the least invasive fix possible to deal with the problem. That's usually a good thing to do, but even when that kind of fix is getting to the root of the problem it can still result in layers of patches that are hard to make sense of. That's not really the fault of any one patch; rather, the blame lies more with those of us who often didn't produce clear, maintainable code in the first place. When it's possible to see re-organizing principles that will make the code clearer, reduce duplication, make future maintenance less error prone, etc., then it's usually a good idea to do a little larger refactoring instead of just a minimally invasive fix. I think this is a small example of where that kind of refactoring makes sense, so that's why I made my code suggestion. If you can see ways to make things even clearer, then feel free to suggest them. I'm sure that Kay, Imran and others who also have been trying to make these kinds of clarifying changes in the DAGScheduler will also chime in if they have further suggestions. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user scwf commented on the issue: https://github.com/apache/spark/pull/15213 @markhamstra in my fix i just want to make the minor changes for the dagscheduer, and your fix is also ok to me, i can update this according your comment. Thanks:) /cc @zsxwing may also have comments on this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user markhamstra commented on the issue: https://github.com/apache/spark/pull/15213 The fix is logically correct; however, the prior code is needlessly complex and not as easy to understand as it should be, and the proposed fix doesn't improve on that. I'd like to take the opportunity to make the code easier to understand and maintain. Something like this: ```scala // It is likely that we receive multiple FetchFailed for a single stage (because we have // multiple tasks running concurrently on different executors). In that case, it is // possible the fetch failure has already been handled by the scheduler. if (runningStages.contains(failedStage)) { logInfo(s"Marking $failedStage (${failedStage.name}) as failed " + s"due to a fetch failure from $mapStage (${mapStage.name})") markStageAsFinished(failedStage, Some(failureMessage)) } else { logDebug(s"Received fetch failure from $task, but its from $failedStage which is no " + s"longer running") } val shouldAbortStage = failedStage.failedOnFetchAndShouldAbort(task.stageAttemptId) || disallowStageRetryForTest if (shouldAbortStage) { val abortMessage = if (disallowStageRetryForTest) { "Fetch failure will not retry stage due to testing config" } else { s"$failedStage (${failedStage.name}) " + s"has failed the maximum allowable number of " + s"times: ${Stage.MAX_CONSECUTIVE_FETCH_FAILURES}. " + s"Most recent failure reason: $failureMessage" } abortStage(failedStage, abortMessage, None) } else { // update failedStages and make sure a ResubmitFailedStages event is enqueued // TODO: Cancel running tasks in the failed stage -- cf. SPARK-17064 val noResubmitEnqueued = failedStages.isEmpty failedStages += failedStage failedStages += mapStage if (noResubmitEnqueued) { // If failedStages is not empty, then a previous FetchFailed already went through // this block of code and queued up a ResubmitFailedStages event that has not yet // run. We, therefore, only need to queue up a new ResubmitFailedStages event when // failedStages.isEmpty. logInfo( s"Resubmitting $mapStage (${mapStage.name}) and " + s"$failedStage (${failedStage.name}) due to fetch failure" ) messageScheduler.schedule( new Runnable { override def run(): Unit = eventProcessLoop.post(ResubmitFailedStages) }, DAGScheduler.RESUBMIT_TIMEOUT, TimeUnit.MILLISECONDS ) } } // Mark the map whose fetch failed as broken in the map stage ``` This should be equivalent to what you have, @scwf, with the exception that `fetchFailedAttemptIds.add(stageAttemptId)` is done even when `disallowStageRetryForTest` is `true` -- which seems like a better idea to me. Also available here: https://github.com/markhamstra/spark/commit/368f82d9789ec04565af835e7cb80d1cdb0ccf0c @squito --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/15213 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65853/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15213: [SPARK-17644] [CORE] Do not add failedStages when abortS...
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/15213 **[Test build #65853 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65853/consoleFull)** for PR 15213 at commit [`1127ca1`](https://github.com/apache/spark/commit/1127ca1538e9a9ded9e91ead65af8c710e99003d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org