Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Please add JIRA SPARK-20898 to the description, since this PR fixes it.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77582/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77582 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77582/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77582 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77582/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77575/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77575 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77575/testReport)**
for PR 17113 at commit
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17113
> then you get fetch failure again and iterate until job failure
At first I was thinking the node goes bad, but you first detect it via
fetch failures -- in that case, you wouldn't need
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
so unfortunately I haven't actually been seeing this. You can see with
external shuffle that if something happens to the NM, it does cause job failure.
NM crashes for OOM, something else kills it
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77446/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77446 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77446/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77446 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77446/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Sorry @tgravescs, I didn't test executor killing in a real cluster. There was
a bug in it, so I pushed a commit to fix it. Thanks for reviewing.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@squito just double checking, are you ok with this change and did you have
any comments?
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
I'm curious, did you test the killing part on an actual YARN job? I was
trying it on master and I don't think it works at all due to the way it's
passing the allocation client. It's a separate issue
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77367/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77367 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77367/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77367 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77367/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77346/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77346 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77346/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77346 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77346/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/77344/
Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #77344 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/77344/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Thanks @tgravescs, I will update the code soon.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@jerryshao sorry for my delay on this. We have a rough design of what we want to do
for future changes, but I think those are going to take a while, and in the
meantime I think this is a useful addition
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Thanks @tgravescs, no problem.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Sorry for the delay on this; we have been having some discussion about
scheduler changes and the fetch failure handling in the scheduler. Since this
is related, I have been holding off on this.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75152/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #75152 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75152/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, retest this please.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/75095/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #75095 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75095/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #75095 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/75095/testReport)**
for PR 17113 at commit
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17113
Sorry, I was vague -- I'm saying I'm OK with this as long as it's (a) off by
default and (b) experimental so we can change it around (which it is).
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
> Another thing I thought about as I was reviewing this -- spark currently
assumes that a fetchfailure is always the fault of the source, never the
destination. I almost wonder if we should count
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Sorry, I haven't had a chance to do a full review of this; hopefully
tomorrow.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/74234/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #74234 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74234/testReport)**
for PR 17113 at commit
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
@tgravescs, I just added a configuration that keeps this feature turned off by
default.
Do you have any further comments on it?
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #74234 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/74234/testReport)**
for PR 17113 at commit
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
I was not talking about designing this around the killing task part of
this, other than in reference to being able to count the # of fetch failures
before triggering the blacklisting, but I think
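
Editor's illustration: the idea of counting fetch failures before triggering blacklisting can be sketched roughly as below. This is a minimal, self-contained Python model, not Spark's implementation. The class name `FetchFailureTracker`, the `max_failures` threshold, and the executor IDs are all hypothetical names chosen for this sketch.

```python
from collections import defaultdict


class FetchFailureTracker:
    """Hypothetical model: blacklist an executor only after N fetch failures."""

    def __init__(self, max_failures=2):
        if max_failures < 1:
            raise ValueError("max_failures must be at least 1")
        self.max_failures = max_failures      # assumed threshold, not a Spark setting
        self._counts = defaultdict(int)       # executor id -> fetch failures seen
        self._blacklisted = set()

    def record_fetch_failure(self, executor_id):
        """Count one failure; return True if it newly blacklists the executor."""
        if executor_id in self._blacklisted:
            return False
        self._counts[executor_id] += 1
        if self._counts[executor_id] >= self.max_failures:
            self._blacklisted.add(executor_id)
            return True
        return False

    def is_blacklisted(self, executor_id):
        return executor_id in self._blacklisted
```

With `max_failures=1` the model blacklists on the first fetch failure; larger values tolerate transient failures at the cost of more retries.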
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/17113
@markhamstra Completely agree, I would love to see this enabled by default.
For example, I really hate to see speculative tasks continuing to run when the
taskset has completed (for example) - used
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/17113
@squito Correct, we really only try to kill running tasks currently on job
failure (and if the config setting allows it); but there is the long-standing
"TODO: Cancel running tasks in the
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/17113
@tgravescs At the config level, it is spark.job.interruptOnCancel or
SparkContext.SPARK_JOB_INTERRUPT_ON_CANCEL, which then gets passed around as a
boolean -- e.g. shouldInterruptThread.
---
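
Editor's illustration: the comment above describes a string-valued job property that ends up being passed around as a boolean. A rough Python model of that pattern follows. It is a sketch under stated assumptions, not Spark's source: only the property name `spark.job.interruptOnCancel` comes from the thread, while the dict standing in for job properties and both function names are hypothetical.

```python
# Illustrative model only; not Spark's source code.
INTERRUPT_ON_CANCEL_KEY = "spark.job.interruptOnCancel"  # name quoted in the thread


def should_interrupt_thread(job_properties):
    """Return True only if the job's properties explicitly enable interruption.

    `job_properties` is a plain dict standing in for a job's local properties
    (an assumption of this sketch); None means no properties were set.
    """
    if job_properties is None:
        return False
    raw = job_properties.get(INTERRUPT_ON_CANCEL_KEY, "false")
    return raw.strip().lower() == "true"


def cancel_running_tasks(task_ids, job_properties):
    """Build hypothetical kill requests, each carrying the interrupt flag."""
    interrupt = should_interrupt_thread(job_properties)
    return [(task_id, interrupt) for task_id in task_ids]
```

The point of the model is that the default is "do not interrupt": the flag is true only when the property is present and set to `true`.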
Github user squito commented on the issue:
https://github.com/apache/spark/pull/17113
I think killing tasks is only applicable in different scenarios, eg. if the
[*job*
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
> Whether running tasks are interrupted on stage abort or not depends on
the state of a config boolean -- and ideally we'd like to get to the point
where we can confidently set that config so
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/17113
@mridulm Correct, turning task interruption on by default is not so much a
matter of Spark itself handling it well as it is a possible (though not
completely known) issue with lower layer
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/17113
@markhamstra given the impact interruption has on lower layer libraries
which don't handle it well (IIRC HDFS?), we probably will not set it to true
even if Spark code is robust,
---
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/17113
> Spark does immediately abort the stage but it doesn't kill the running
tasks
Whether running tasks are interrupted on stage abort or not depends on the
state of a config boolean --
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
So I looked at this a little more. I'm more ok with this since Spark
doesn't actually invalidate the shuffle output. You are basically just trying
to stop new tasks from running on the executors
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
@tgravescs, thanks a lot for your comments.
Actually the issue here is a simulated one from my test cluster; I didn't
get an issue report from real customers.
Yes, in most of
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@jerryshao are you actually seeing issues with this on real
customer/production jobs? How often? NM failure for us is very rare. I'm not
familiar with how mesos would fail differently, the
Github user jerryshao commented on the issue:
https://github.com/apache/spark/pull/17113
@tgravescs, the main scenario is the external shuffle service being
unavailable; this could happen in a work-preserving restart + NM failure situation.
Also like Mesos + external standalone shuffle
Github user markhamstra commented on the issue:
https://github.com/apache/spark/pull/17113
"Current Spark's blacklist mechanism": please be more precise. The most
recent released version of Spark, 2.1.0, does not include a lot of recent
changes to blacklisting (mostly
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Can you clarify the situations in which you are seeing issues? What happened to the
NM in this case? If you have work-preserving restart I would think this would
actually cause you more problems. The NM
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/73668/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/17113
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #73668 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73668/testReport)**
for PR 17113 at commit
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/17113
**[Test build #73668 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/73668/testReport)**
for PR 17113 at commit