Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
Awesome nice work!! Exciting to see this in! Let me know when the other
component, which blacklists across different stages, is ready for review.
---
If your project is set up for it, you ca
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
merged to master, thanks everyone
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled an
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66822/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66822 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66822/consoleFull)**
for PR 15249 at commit
[`4501e6c`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66822 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66822/consoleFull)**
for PR 15249 at commit
[`4501e6c`](https://github.com/apache/spark/commit/4
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66820/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66820 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66820/consoleFull)**
for PR 15249 at commit
[`445cc97`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66820 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66820/consoleFull)**
for PR 15249 at commit
[`445cc97`](https://github.com/apache/spark/commit/4
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout good idea about running performance tests, I hadn't run them
on a recent rev. I confirmed that the issue in
https://github.com/apache/spark/pull/14871 was no longer present (just to b
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/3/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #3 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/consoleFull)**
for PR 15249 at commit
[`c805a0b`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #3 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/3/consoleFull)**
for PR 15249 at commit
[`c805a0b`](https://github.com/apache/spark/commit/c
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
As I mentioned before, this is definitely a huge step in the right
direction!
Having said that, I want to ensure we don't aggressively blacklist executors
and nodes - at scale, I have seen
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
@tgravescs @mridulm To avoid being stuck in analysis paralysis for this
feature, I'd propose the following:
(1) We merge this PR. I think we're mostly in agreement that the behavior
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
The question to me comes down to how many and how often do you expect
temporary resource issues. At some point if it's just from that much skew you
should probably fix your configs and it would be
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@mridulm we had considered that approach earlier on as well -- I don't
think it works because you can also have resources which are not totally
broken, but are flaky for a long period of time. Simpl
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
Thinking more, and based on what @squito mentioned, I was considering the
following:
Since we are primarily dealing with executors or nodes which are 'bad' as
opposed to recoverable fa
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
If I understood the change correctly, a node can get blacklisted for a
taskset if sufficient (even different) tasks fail on executors on it.
Which can potentially cause all nodes to be blackl
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
@mridulm re: job failures, can you elaborate on the job failure scenario
you're concerned about?
Jobs can only fail when some tasks are unschedulable, which can happen if a
task is pe
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66462/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66462 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66462/consoleFull)**
for PR 15249 at commit
[`34eff27`](https://github.com/apache/spark/commit/
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@squito I am hoping we _can_ remove the old code/functionality actually (it
is clunky and very specific to the single-executor resource contention/shutdown use case
- unfortunately common enough to warrant i
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
to be clear, when I proposed leaving the old feature in place, my intent
was *not* to make them interact nicely at all. you wouldn't even be able to
use the two features together. The idea was just
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
@tgravescs no decision here yet.
@mridulm the main question for (2), though, is are the consequences a
deal-breaker? It doesn't seem disastrous if a task needs to run on a non-local
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
For preemption Spark is not counting those as task failures anymore.
So I'm not sure if we decided on what to do. Are we leaving the old
functionality as is or adding a new config for tim
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@tgravescs re(1): It was typically observed when yarn is killing the
executor.
Usually when it runs over the memory limits (not sure if it was happening
during pre-emption also).
---
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout
Agree with (1) - permanent blacklist will effectively work the same way for
executor shutdown.
Re(2) - A task failure is not necessarily only due to resource restric
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66462 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66462/consoleFull)**
for PR 15249 at commit
[`34eff27`](https://github.com/apache/spark/commit/3
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@tgravescs yeah there is an interaction w/ locality, but I think it can
wait for a follow up. This was in the design doc in the follow up section,
though I didn't file a jira for it.
> Dela
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
@squito re: the config message, thanks for the long explanation. That
makes sense and I can't think of a better error message. The current one is
very clear in telling the user how to fix th
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
I agree with Kay's summary above, just one addition. For (2) Temporary
Resource Contention & the approach in this PR -- perhaps it's obvious, but
another consequence of this approach is that you lose
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
for 1) We definitely run multiple executors on a node on Yarn but I would
certainly hope this isn't a huge issue. Perhaps @mridulm can clarify on when
he was seeing this, but I would assume it
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
I thought about this a little more and had some offline discussion with
Imran. @mridulm, I re-read all of your comments and it sounds like there are
two issues that are addressed by the old b
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
Sorry, something weird seems to have happened yesterday where GitHub
published half of my review! Anyway, the rest is above.
---
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
Actually my comment about locality wait makes me wonder if that should be
taking blacklisting into account as well here, something I hadn't looked at
closely before. There is no reason to wait for
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
Sorry I haven't followed this PR since it was split off the main one.
Response might be a bit split as it's talking about various responses; if
something doesn't make sense, let me know.
>
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@squito I agree, you are right - even though the functionality is blacklisting in
both cases, the problems each is trying to tackle are quite different.
Transient issues (resource, shutdown, etc) ve
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout When an executor or node is shutting down it is actually at
driver level (not just taskset level) - since all tasks would fail on executors
when they are shutting down.
But if the
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66426/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66426 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66426/consoleFull)**
for PR 15249 at commit
[`354f36b`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66426 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66426/consoleFull)**
for PR 15249 at commit
[`354f36b`](https://github.com/apache/spark/commit/3
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout on the topic of that error msg in the config validation
(sorry, GitHub is being weird about letting me respond directly to your comment):
> I see -- I didn't realize executor-lev
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@mridulm on yarn's bad disk detection -- yes, you are right, it is a very
rudimentary check for bad disks. It really can't catch everything (and we've
seen that in practice). I was just pointing out
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
Re: executor blacklisting, one more reason I've heard for this (I think
from Imran) is that tasks can fail on an executor because of memory pressure --
in which case the task may succeed on ot
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@squito Re:(a)
The earlier behavior was used fairly extensively in various properties at
Yahoo (web search, groups, bigml, image search, etc.). It was added for a
specific failure mode
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
(a) right, this is a behavior change ... seemed fair since earlier behavior
was undocumented, and I don't see a strong reason to maintain the exact same
behavior as before. I think it's fair for us t
Github user mridulm commented on the issue:
https://github.com/apache/spark/pull/15249
@squito Thanks for clarifying, that makes some of the choices clearer!
A few points to better understand:
a) timeout does not seem to be enforced in this PR.
Which means it is
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66394/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66394 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66394/consoleFull)**
for PR 15249 at commit
[`9086106`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66394 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66394/consoleFull)**
for PR 15249 at commit
[`9086106`](https://github.com/apache/spark/commit/9
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
I forgot to add that I had turned off blacklisting by default, I agree with
your suggestion Kay. I pushed another commit which updated the docs as well.
There are some other small style things and
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66367/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66367 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66367/consoleFull)**
for PR 15249 at commit
[`a6c863f`](https://github.com/apache/spark/commit/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66358/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66358 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66358/consoleFull)**
for PR 15249 at commit
[`89d3c5e`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66367 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66367/consoleFull)**
for PR 15249 at commit
[`a6c863f`](https://github.com/apache/spark/commit/a
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout @mridulm thanks for the feedback. obviously still need to
figure out the timeout thing but otherwise think I've addressed things. will
do another pass in the morning.
---
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@mridulm on the questions about expiry from blacklists, you are not missing
anything -- this explicitly does not do any timeouts at the taskset level (this
is mentioned in the design doc). The timeou
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66358 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66358/consoleFull)**
for PR 15249 at commit
[`89d3c5e`](https://github.com/apache/spark/commit/8
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
This mostly looks good -- I made a bunch of cosmetic comments. Sorry for
the delay -- I'll be quicker on the next review so we can get this in!
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66175/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66175 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66175/consoleFull)**
for PR 15249 at commit
[`5568973`](https://github.com/apache/spark/commit/
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/66176/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66176 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66176/consoleFull)**
for PR 15249 at commit
[`9c9d816`](https://github.com/apache/spark/commit/
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
thanks for the reviews @markhamstra & @kayousterhout , just pushed an update
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66176 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66176/consoleFull)**
for PR 15249 at commit
[`9c9d816`](https://github.com/apache/spark/commit/9
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #66175 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/66175/consoleFull)**
for PR 15249 at commit
[`5568973`](https://github.com/apache/spark/commit/5
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65986/
Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65986 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65986/consoleFull)**
for PR 15249 at commit
[`21e6789`](https://github.com/apache/spark/commit/
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
Jenkins, retest this please
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65986 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65986/consoleFull)**
for PR 15249 at commit
[`21e6789`](https://github.com/apache/spark/commit/2
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65979/
Test FAILed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test FAILed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65979 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65979/consoleFull)**
for PR 15249 at commit
[`21e6789`](https://github.com/apache/spark/commit/
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65979 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65979/consoleFull)**
for PR 15249 at commit
[`21e6789`](https://github.com/apache/spark/commit/2
Github user kayousterhout commented on the issue:
https://github.com/apache/spark/pull/15249
This is awesome to separate this out. I should have time to review this
tomorrow and then hopefully we can (finally) merge this in the next few days!
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/65938/
Test PASSed.
---
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/15249
Merged build finished. Test PASSed.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65938 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65938/consoleFull)**
for PR 15249 at commit
[`882b385`](https://github.com/apache/spark/commit/
Github user djvulee commented on the issue:
https://github.com/apache/spark/pull/15249
I would say this is a very important PR. In our experience, we sometimes
just need to skip some nodes because of bad disks, and the existing blacklist
mechanism has little effect.
---
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/15249
**[Test build #65938 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/65938/consoleFull)**
for PR 15249 at commit
[`882b385`](https://github.com/apache/spark/commit/8
Github user squito commented on the issue:
https://github.com/apache/spark/pull/15249
@kayousterhout after your suggestion to pull out a helper for blacklisting
with a TaskSet, I thought it might make sense to actually pull out everything
related to TaskSets, so that we can make progr