Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2024-03-16 Thread via GitHub


github-actions[bot] commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-2002201527

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-15 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1813660864

   > What do you mean by this? Are you saying the Spark on YARN handling of preempted containers is not working properly - meaning a preempted container should not show up as an executor failure, yet you are seeing preempted containers show up as failed? Or are you saying that, yes, Spark on YARN doesn't mark preempted containers as failed?
   
   PREEMPTED is OK; preempted containers are not counted by the executor failure tracker. I was wrong about this, sorry for the noise.
   
   > If that is the case then Spark should allow users to turn spark.executor.maxNumFailures off, or I assume you could do the same thing by setting it to Int.MaxValue.
   
   There are pros and cons to this suggestion, I guess. Disabling the executor failure tracker certainly keeps the app alive, but it also defeats the fast-fail behavior.
   
   > As implemented this seems very arbitrary, and I would think it hard for a normal user to set and use this feature.
   
   Aren't most numeric configurations and their defaults in Spark somewhat arbitrary?
   
   
   > I don't understand why this isn't the same as the minimum number of executors, as that seems more in line - saying you need some minimum number of executors for this application to run, and by the way it's OK to keep running even if launching new executors is failing.
   
   The minimum number of executors can be 0.
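
   For context, a minimal SparkConf sketch of that point (values are illustrative): the dynamic-allocation floor defaults to 0, so it cannot double as a survival threshold unless it is raised explicitly.

   ```
   // Illustrative only: with dynamic allocation the executor floor defaults to 0, so the
   // "minimum number of executors" alone cannot express "enough executors to survive".
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .set("spark.dynamicAllocation.enabled", "true")
     .set("spark.dynamicAllocation.minExecutors", "0") // the default
     .set("spark.dynamicAllocation.maxExecutors", "10")
   ```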
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-15 Thread via GitHub


tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1812838085

   > > Preemption on YARN shouldn't count against the number of failed executors. If it does, then something has changed and we should fix that.
   
   > Yes, you are right
   
   What do you mean by this? Are you saying the Spark on YARN handling of preempted containers is not working properly - meaning a preempted container should not show up as an executor failure, yet you are seeing preempted containers show up as failed? Or are you saying that, yes, Spark on YARN doesn't mark preempted containers as failed?
   
   > What does 'this feature' point to?
   
   Sorry, I misunderstood your environment here; I thought you were running on k8s but it looks like you are running on YARN. By "this feature" I mean the spark.yarn.max.executor.failures/spark.executor.maxNumFailures config and its functionality.
   
   So unless YARN preemption handling is broken (please answer the question above), you gave one very specific use case where a user added a bad JAR, and in that case it seems like you just don't want spark.executor.maxNumFailures enabled at all. You said you don't want the app to fail so admins can come fix things up without it affecting other users. If that is the case then Spark should allow users to turn spark.executor.maxNumFailures off, or I assume you could do the same thing by setting it to Int.MaxValue.
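
   A minimal sketch of that workaround, assuming the `spark.executor.maxNumFailures` name discussed in this thread; pushing the threshold to `Int.MaxValue` makes it effectively unreachable:

   ```
   // Sketch of the suggested workaround: make the failure threshold practically unreachable.
   // Config name taken from this discussion; exact semantics may differ across versions.
   import org.apache.spark.SparkConf

   val conf = new SparkConf()
     .set("spark.executor.maxNumFailures", Int.MaxValue.toString)
   ```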
   
   As implemented this seems very arbitrary, and I would think it hard for a normal user to set and use this feature. You have it as a ratio, which normally I'm in favor of, but that really only works if you have max executors set, so it is really just a hardcoded number. That number seems arbitrary, as it just depends on whether you get lucky and happen to have that many executors before some user pushes a bad jar. I don't understand why this isn't the same as the minimum number of executors, as that seems more in line - saying you need some minimum number of executors for this application to run, and by the way it's OK to keep running even if launching new executors is failing.
   
   If there are other issues with Spark Connect and ADD JAR, maybe that is a different conversation about isolation (https://issues.apache.org/jira/browse/SPARK-44146). Or maybe it needs to better prevent users from adding jars with the same name.





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-15 Thread via GitHub


tgravescs commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1394411937


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2087,6 +2087,17 @@ package object config {
   .doubleConf
   .createOptional
 
+  private[spark] val SCHEDULER_MIN_RESOURCES_TO_SURVIVE_RATIO =
+ConfigBuilder("spark.scheduler.minResourcesToSurviveRatio")

Review Comment:
   I misread the title; sorry, ignore this.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-14 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1393541374


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2087,6 +2087,17 @@ package object config {
   .doubleConf
   .createOptional
 
+  private[spark] val SCHEDULER_MIN_RESOURCES_TO_SURVIVE_RATIO =
+ConfigBuilder("spark.scheduler.minResourcesToSurviveRatio")

Review Comment:
   Hi @tgravescs, can you take a look at this? Can you explain why this is related to blacklisting/excludeOnFailure? Did I misunderstand something here?






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-14 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1811680290

   > Preemption on YARN shouldn't count against the number of failed executors. If it does, then something has changed and we should fix that.
   
   Yes, you are right
   
   > This is a consequence of using a shared environment. Ideally Spark would isolate it from others so other users wouldn't be affected, but that unfortunately isn't the case. I'm not sure about your environment, but ideally users test things before running in some production environment and breaking things.
   
   Yeah, a test step before prod is necessary. But, as you said, only "ideally"; system robustness takes precedence over that.
   
   
   > If this feature doesn't really work or has issues on k8s then there should be a way to disable it, which seems more like what you want here, right? You are essentially saying you don't want it to fail the application, thus turn the feature off, and you should just do monitoring on your own to catch issues.
   
   
   Why do you always mention k8s when I give evidence on YARN? Well, for k8s, ExecutorFailureTracker works well during app initialization to fail fast on continuous pod failures. It does not work well for apps that already have sufficient pods and then hit some pod failures.
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-14 Thread via GitHub


tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1810429020

   > ERROR cluster.YarnScheduler: Lost executor x on x.163.org: Container 
container_x on host:x.163.org was preempted.
   
   Preemption on YARN shouldn't count against the number of failed executors. If it does, then something has changed and we should fix that.
   
   ```
   case ContainerExitStatus.PREEMPTED =>
     // Preemption is not the fault of the running tasks, since YARN preempts containers
     // merely to do resource sharing, and tasks that fail due to preempted executors could
     // just as easily finish on any other executor.
   ```
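
   For readers outside the YARN module, a simplified sketch (not the actual `YarnAllocator` code) of the idea the excerpt above describes: preempted exits are treated as non-failures and never feed the failure tracker.

   ```
   // Simplified illustration only; the real YarnAllocator logic handles many more cases.
   import org.apache.hadoop.yarn.api.records.ContainerExitStatus

   def countsAsExecutorFailure(exitStatus: Int): Boolean = exitStatus match {
     case ContainerExitStatus.SUCCESS   => false // clean exit
     case ContainerExitStatus.PREEMPTED => false // resource sharing, not the app's fault
     case _                             => true  // everything else counts toward the tracker
   }
   ```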
   
   > K8s environment with Horizontal Pod Scheduler case
   
   Can you be more specific here: why is this going to cause failures that aren't similar to YARN dynamic allocation getting more executors? Is it scaling down and the containers are marked as failed, vs. YARN marking them as preempted and not counting them against failures? Is there any way to know on k8s that this happened so we could avoid counting them? It seems like if this is really an issue, the feature should be off by default on k8s.
   
   
   > Let's take Spark Thrift Server and Spark Connect as examples: `ADD JAR` is an end-user operation and the artifacts can be changed by the users themselves, while starting and maintaining the server is the system admins' job. If the jar issue occurs here, shall we give admins enough time to detect the issue and then gracefully reboot the server, to reduce the impact on other concurrent users?
   
   This is a consequence of using a shared environment. Ideally Spark would isolate it from others so other users wouldn't be affected, but that unfortunately isn't the case. I'm not sure about your environment, but ideally users test things before running in some production environment and breaking things.
   
   If this feature doesn't really work or has issues on k8s then there should be a way to disable it, which seems more like what you want here, right? You are essentially saying you don't want it to fail the application, and you should just do your own monitoring to catch issues.
   
   Note, the documentation for this feature is missing; I made some comments: https://github.com/apache/spark/commit/40872e9a094f8459b0b6f626937ced48a8d98efb - can you please fix those?
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1392110060


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2087,6 +2087,17 @@ package object config {
   .doubleConf
   .createOptional
 
+  private[spark] val SCHEDULER_MIN_RESOURCES_TO_SURVIVE_RATIO =
+ConfigBuilder("spark.scheduler.minResourcesToSurviveRatio")

Review Comment:
   But `excludeOnFailure` is irrelevant here; we are applying this feature to the ExecutorFailureTracker, not the HealthTracker.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1809634875

   Hi @tgravescs 
   
   > Oh I just realized you added this config 
   
   I (or our customer, you mean) didn't set it; the 20-failure threshold is calculated as 10 max executors * 2.
   
   Hi @mridulm @tgravescs 
   > If a user changes a jar mid-application, this is really bad IMHO.
   
   > Spark is not tolerant of out-of-band changes to the artifacts (jars, etc.) while it is running - this needs to be fixed at the application end,
   
   
   Let's take Spark Thrift Server and Spark Connect as examples: `ADD JAR` is an end-user operation and the artifacts can be changed by the users themselves, while starting and maintaining the server is the system admins' job. If the jar issue occurs here, shall we give admins enough time to detect the issue and then gracefully reboot the server, to reduce the impact on other concurrent users?
   
   
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1392051387


##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -717,6 +719,15 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 
   def sufficientResourcesRegistered(): Boolean = true
 
+  // When the executor failure tracker collects enough failures, if the 
current resources are
+  // insufficient for keep the app running, it will fail the application 
directly; otherwise,
+  // it survives this check round.
+  def insufficientResourcesRetained(): Boolean = {
+totalRegisteredExecutors.get() < maxExecutors * minSurviveRatio

Review Comment:
   I guess setting maxExecutors to Int.MaxValue is rare; it's problematic in itself to use in prod.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


mridulm commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1809535163

   > Hi @mridulm. Thanks for the questions, but could you make them more specific? As far as I can tell, they are already addressed in the PR description, based on my understanding and findings. It would be great if I could understand your point more precisely.
   
   The PR description describes what the change is trying to do, along with details I am not sure are related to why executor failures were observed or why the application eventually failed - it would be great to understand what is going on that we are trying to mitigate with this.
   
   Having said that, looking at your response to @tgravescs query 
[here](https://github.com/apache/spark/pull/43746#issuecomment-1808450794) 
gives a lot more context (thanks !)
   
   It points to an issue with how users are running the Spark application - Spark is not tolerant of out-of-band changes to the artifacts (jars, etc.) while it is running. This needs to be fixed at the application end, not in Spark, and we should not try to work around this issue - I agree with @tgravescs on this.
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808766570

   > spark.executor.maxFailures
   
   Oh, I just realized you added this config - i.e. ported the YARN feature to k8s - and I think you mean spark.executor.maxNumFailures. I had missed this going by.
   
   > It failed because it hit the max executor failures, while the root cause was that one of the shared UDF jars was changed by a developer who turned out not to be the app owner. YARN failed to bring up new executors, so the 20 failures were collected within 10 secs.
   
   If a user changes a jar mid-application, this is really bad IMHO. How do you know your application doesn't get different results on different executors? Say that had actually worked but the logic changed in the UDF. This to me is a process issue, and Spark did the right thing in failing; it should have failed. Would you have known as quickly that someone pushed a bad jar if it hadn't failed? I assume maybe on the next application run sometime later, but it still would have caused some app to fail.
   
   
   > The probability of apps failing with max executor failures is low relative to the total number of apps. But it turns out to be a daily issue.
   
   I'm not sure I follow this statement: do you see this kind of issue daily, and is it because users push bad jars that often, or why do you see it daily? I'm trying to understand how much this is really a problem that Spark should be solving. Do you see failures where having the feature on actually helps you? I kind of assume so since you ported it to k8s, but if not, just turn it off.
   
   I can see a reliability aspect here: if you have a sufficient number of executors already allocated and running, then just keep running instead of killing the entire application. How you achieve that, though, vs. this proposal, I'm not sure I agree with. If the user sets a minimum number of executors, why isn't this just that number? As one of the other comments stated, this approach is useless for normal users with dynamic allocation, so why doesn't it apply to that case?





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


tgravescs commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1391420795


##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -717,6 +719,15 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 
   def sufficientResourcesRegistered(): Boolean = true
 
+  // When the executor failure tracker collects enough failures, if the 
current resources are
+  // insufficient for keep the app running, it will fail the application 
directly; otherwise,
+  // it survives this check round.
+  def insufficientResourcesRetained(): Boolean = {
+totalRegisteredExecutors.get() < maxExecutors * minSurviveRatio

Review Comment:
   Exactly, so this feature doesn't apply to the way many users normally run, and it could be confusing.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808588312

   
   
   
   > It is not clear to me what the issue is, why we are taking this proposed approach, and what the underlying cause is. Failing an application due to repeated failures is typically independent of how many resources it is currently holding.
   > 
   > I can speculate as to why this is happening, but can you update the JIRA (or PR description) with more details on what the issue is? And why this proposal should work?
   
   Hi @mridulm. Thanks for the questions, but could you make them more specific? As far as I can tell, they are already addressed in the PR description, based on my understanding and findings. It would be great if I could understand your point more precisely.
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808450794

   Hi @tgravescs, thank you for the detailed review.
   
   
   > 20 is a lot of failures. What is the real issue causing this? ie why are 
these executors failing? 
   
   The failures can be divided into two kinds. The first kind affects both existing and new executors, e.g. exit code 143 (killed by the resource manager), OOM, etc.; it is OK to fail the app for these with or without this PR. The second kind affects only new executors, e.g. external dependency files changed by expected or unexpected maintenance, or rejections from the resource manager; this is what the PR mainly focuses on, to reduce the risk of an app being killed all of a sudden. In the second case, 20 is a relatively small number, as the allocation requests and responses go very quickly.
   
   > How long was the app running? Is it some cloud environment they are going 
away, is it really an issue with the application or its configuration?
   
   The app I described above in the PR description ran for 1.5 hours. It failed because it hit the max executor failures, while the root cause was that one of the shared UDF jars was changed by a developer who turned out not to be the app owner. YARN failed to bring up new executors, so the 20 failures were collected within 10 secs.
   
   ```
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_27 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_27 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 26 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_27 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_29 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_29 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 26 
requested
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 26 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 26
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_28 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 28 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_29 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_28 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST BlockManagerMaster INFO - Removal of executor 28 
requested
   2023-11-06 23:39:43 CST BlockManagerMasterEndpoint INFO - Trying to remove 
executor 28 from BlockManagerMaster.
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnDriverEndpoint INFO - Asked 
to remove non-existent executor 28
   2023-11-06 23:39:43 CST YarnSchedulerBackend$YarnSchedulerEndpoint WARN - 
Requesting driver to remove executor 27 for reason Container from a bad node: 
container_e106_1694175944291_7158886_01_28 on host: x.163.org. Exit status: 
-1000. Diagnostics: [2023-11-06 23:39:40.308]java.io.IOException: Resource 
x.jar changed on src filesystem (expected 1698924864275, was 1699273405453
   .
   2023-11-06 23:39:43 CST YarnAllocator INFO - Completed container 
container_e106_1694175944291_7158886_01_26 on host: x.163.org (state: 
COMPLETE, exit status: -1000)
   2023-11-06 23:39:43 CST YarnAllocator WARN - Container from a bad node: 
container_e106_1694175944291_7158886_01_26 on host: x.163.org. Exit status: 
-1000. 
   ```

Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


mridulm commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808444864

   It is not clear to me what the issue is, why we are taking this proposed approach, and what the underlying cause is.
   Failing an application due to repeated failures is typically independent of how many resources it is currently holding.
   
   I can speculate as to why this is happening, but can you update the JIRA (or PR description) with more details on what the issue is? And why this proposal should work?





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1391233086


##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -717,6 +719,15 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 
   def sufficientResourcesRegistered(): Boolean = true
 
+  // When the executor failure tracker collects enough failures, if the 
current resources are
+  // insufficient for keep the app running, it will fail the application 
directly; otherwise,
+  // it survives this check round.
+  def insufficientResourcesRetained(): Boolean = {
+totalRegisteredExecutors.get() < maxExecutors * minSurviveRatio

Review Comment:
   When maxExecutors is Int.MaxValue, the max failures value is 2 * Int.MaxValue, or an explicit value is given via spark.executor.maxFailures. I guess we don't change the behavior here.
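
   For illustration, a hedged sketch of the threshold derivation being discussed (my paraphrase, not the exact `ExecutorFailureTracker` code): the default scales with the effective max executors, and the Int.MaxValue case is passed through rather than doubled.

   ```
   // Assumed derivation, for illustration only: twice the effective max executors, floored
   // at 3, with Int.MaxValue passed through so an uncapped app never trips the check.
   def defaultMaxNumExecutorFailures(effectiveMaxExecutors: Int): Int =
     if (effectiveMaxExecutors == Int.MaxValue) Int.MaxValue
     else math.max(3, 2 * effectiveMaxExecutors)

   defaultMaxNumExecutorFailures(10) // 20, matching the threshold reported earlier in this thread
   ```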






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


tgravescs commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1391209682


##
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##
@@ -2087,6 +2087,17 @@ package object config {
   .doubleConf
   .createOptional
 
+  private[spark] val SCHEDULER_MIN_RESOURCES_TO_SURVIVE_RATIO =
+ConfigBuilder("spark.scheduler.minResourcesToSurviveRatio")

Review Comment:
   I think this config should have excludeOnFailure in its name if it is applying to that feature, which is implied by the description of this. I also think this feature could be quite confusing to users and should be mentioned in that documentation.



##
core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala:
##
@@ -717,6 +719,15 @@ class CoarseGrainedSchedulerBackend(scheduler: 
TaskSchedulerImpl, val rpcEnv: Rp
 
   def sufficientResourcesRegistered(): Boolean = true
 
+  // When the executor failure tracker collects enough failures, if the 
current resources are
+  // insufficient for keep the app running, it will fail the application 
directly; otherwise,
+  // it survives this check round.
+  def insufficientResourcesRetained(): Boolean = {
+totalRegisteredExecutors.get() < maxExecutors * minSurviveRatio

Review Comment:
   With dynamic allocation, maxExecutors is Int.MaxValue by default, so how does that really work with it? I would basically say it doesn't.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


LuciferYang commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807707575

   > Thank you @LuciferYang for the suggestions. BTW,
   > 
   > > `Ratio>=1` ? `Ratio>=0.5`?
   > 
   > It is ratio >= 1, which makes `runnings < max * ratio` always true and only checks maxNumExecutorFailures, just like what we did before
   
   Got it.





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-13 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807629094

   
   Thank you @LuciferYang for the suggestions. BTW,
   > `Ratio>=1` ? `Ratio>=0.5`?
   
   It is ratio >= 1, which makes `runnings < max * ratio` always true and only checks maxNumExecutorFailures, just like what we did before.
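
   A tiny worked example of that condition, using the `registered < maxExecutors * ratio` check quoted in the review comments (numbers are illustrative):

   ```
   // Illustrative numbers only.
   def insufficient(registered: Int, maxExecutors: Int, ratio: Double): Boolean =
     registered < maxExecutors * ratio

   insufficient(registered = 9, maxExecutors = 10, ratio = 1.0) // true  -> fail as before
   insufficient(registered = 9, maxExecutors = 10, ratio = 0.5) // false -> survive this round
   ```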
   
   
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1390684419


##
core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala:
##
@@ -44,4 +44,12 @@ private[spark] object SchedulerBackendUtils {
   conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)
 }
   }
+
+  def getMaxTargetExecutorNumber(conf: SparkConf): Int = {

Review Comment:
   Also, there are a lot of places bypassing the minExecutor adjustment.






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


yaooqinn commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1390683182


##
core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala:
##
@@ -44,4 +44,12 @@ private[spark] object SchedulerBackendUtils {
   conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)
 }
   }
+
+  def getMaxTargetExecutorNumber(conf: SparkConf): Int = {

Review Comment:
   Hmm, the above code is unreliable. For instance, a non-streaming application configured with DYN_ALLOCATION_MAX_EXECUTORS by a user might run on a system configuration copy that sets STREAMING_DYN_ALLOCATION_MAX_EXECUTORS. This surprises users.
   
   Anyway, I will follow the code above.
   



##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##
@@ -142,7 +142,8 @@ class ExecutorPodsAllocator(
 }
 snapshotsStore.addSubscriber(podAllocationDelay) { executorPodsSnapshot =>
   onNewSnapshots(applicationId, schedulerBackend, executorPodsSnapshot)
-  if (failureTracker.numFailedExecutors > maxNumExecutorFailures) {
+  if (getNumExecutorsFailed > maxNumExecutorFailures &&
+  schedulerBackend.insufficientResourcesRetained()) {
  logError(s"Max number of executor failures ($maxNumExecutorFailures) reached")

Review Comment:
   OK






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


LuciferYang commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1390659516


##
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala:
##
@@ -142,7 +142,8 @@ class ExecutorPodsAllocator(
 }
 snapshotsStore.addSubscriber(podAllocationDelay) { executorPodsSnapshot =>
   onNewSnapshots(applicationId, schedulerBackend, executorPodsSnapshot)
-  if (failureTracker.numFailedExecutors > maxNumExecutorFailures) {
+  if (getNumExecutorsFailed > maxNumExecutorFailures &&
+  schedulerBackend.insufficientResourcesRetained()) {
  logError(s"Max number of executor failures ($maxNumExecutorFailures) reached")

Review Comment:
   Do the logs here need to be refined?
   
   






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


LuciferYang commented on code in PR #43746:
URL: https://github.com/apache/spark/pull/43746#discussion_r1390658366


##
core/src/main/scala/org/apache/spark/scheduler/cluster/SchedulerBackendUtils.scala:
##
@@ -44,4 +44,12 @@ private[spark] object SchedulerBackendUtils {
   conf.get(EXECUTOR_INSTANCES).getOrElse(numExecutors)
 }
   }
+
+  def getMaxTargetExecutorNumber(conf: SparkConf): Int = {

Review Comment:
   
https://github.com/apache/spark/blob/9a990b5a40de3c3a17801dd4dbd4f48e3f399815/core/src/main/scala/org/apache/spark/deploy/ExecutorFailureTracker.scala#L86-L95
   
   Is this different from the `effectNumExecutors` in the above code? Don't we 
need to consider scenarios where `isStreamingDynamicAllocationEnabled` is true?
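
   For context, a hedged sketch of the selection being referenced (my reading of the linked `effectNumExecutors` logic, not a verbatim copy): streaming dynamic allocation has its own max setting that would need the same treatment.

   ```
   // Assumed shape of the existing selection, for illustration only.
   import org.apache.spark.SparkConf

   def effectiveMaxExecutors(conf: SparkConf): Int =
     if (conf.getBoolean("spark.streaming.dynamicAllocation.enabled", false)) {
       conf.getInt("spark.streaming.dynamicAllocation.maxExecutors", Int.MaxValue)
     } else if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
       conf.getInt("spark.dynamicAllocation.maxExecutors", Int.MaxValue)
     } else {
       conf.getInt("spark.executor.instances", 2) // assumed fallback default
     }
   ```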
   
   






Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


LuciferYang commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807528689

   `Ratio>=1` ? `Ratio>=0.5`?





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807458331

   > In this scenario, will the Spark app now never fail, where it would have failed before?
   
   @LuciferYang It depends, but I would never say `never`.
   
   -  Ratio >= 1: the behavior is as-is.
   -  Ratio < 1: for this preemption scenario, it allows the app to survive if it has a sufficient number of resources, which changes depending on proactive DRA, passive eviction by resource managers, or runtime failures.





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


LuciferYang commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807431965

   Assume the Spark app runs in an environment with resource prioritization: because of its low priority on a given machine, its executors may be preempted at any time (killed by that machine's resource monitor), while the queue still has enough resources to quickly acquire replacements. In this scenario, will the Spark app now never fail, where it would have failed before?
   
   





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807392307

   Thank you for the check @dongjoon-hyun 





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-12 Thread via GitHub


dongjoon-hyun commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1807256183

   cc @mridulm for the YARN part, because this might introduce a behavior change where the job never fails again on executor failures. BTW, for the record, I believe this is mainly useful for the K8s environment with the Horizontal Pod Scheduler case, and I guess the failure situation is rare in a YARN environment. That's the reason why I didn't ask for a migration guide for this configuration and the behavior change.





Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-10 Thread via GitHub


yaooqinn commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1806702193

   cc @dongjoon-hyun @cloud-fan @HyukjinKwon @LuciferYang @tgravescs, thank you





[PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

2023-11-09 Thread via GitHub


yaooqinn opened a new pull request, #43746:
URL: https://github.com/apache/spark/pull/43746

   
   
   
   ### What changes were proposed in this pull request?
   
   
   This PR introduces a new configuration, `spark.scheduler.minResourcesToSurviveRatio`, which works as follows:
   
   When an application reaches the maximum number of executor failures, if the scheduler still has sufficient resources, determined by `live executors >= max number of executors * ratio`, it will not fail immediately and will give its best effort to finish what it started. The smaller the ratio, the more tolerant the application is to executor failures. A value >= 1 means no tolerance at all, i.e. the current behavior.
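
   For reference, a minimal sketch of the check this adds to `CoarseGrainedSchedulerBackend`, reconstructed from the diff quoted in the review comments (simplified here):

   ```
   // Reconstructed from the quoted diff and simplified for illustration.
   def insufficientResourcesRetained(
       totalRegisteredExecutors: Int,
       maxExecutors: Int,
       minSurviveRatio: Double): Boolean =
     totalRegisteredExecutors < maxExecutors * minSurviveRatio

   // The allocator then fails the app only when both conditions hold:
   //   numFailedExecutors > maxNumExecutorFailures && insufficientResourcesRetained(...)
   ```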
   
   ### Why are the changes needed?
   
   
   The ExecutorFailureTracker is overly sensitive to executor failures and tends to overreact by terminating the application immediately upon reaching the threshold, regardless of whether the currently obtained resources are sufficient. Since executor allocation depends on various factors, such as resource managers, host environments, and external dependencies, it may become unavailable for some time. During this period, the ExecutorFailureTracker may accumulate enough failures to mistakenly kill the application.
   
   Here is an example from our production environment: the application had been running for hours before it suddenly crashed with `Stage cancelled because SparkContext was shut down` and `Max number of executor failures (20) reached`. Meanwhile, it still had 90% of maxNumExecutors and was about to finish. In its final moments (<10 secs), it had only requested one more executor.
   
   
   The threshold of the ExecutorFailureTracker is inflexible: it is either pre-configured by `spark.executor.maxNumFailures` or calculated as `2 * max number of executors`, and it does not consider the actual number of live executors.
   
   
   Thus, `spark.scheduler.minResourcesToSurviveRatio` is introduced to decide whether to terminate the application or defer the decision to the next check round.
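
   A possible way to enable the proposed behavior (config name from this PR; the ratio value is an illustrative choice):

   ```
   import org.apache.spark.SparkConf

   // Tolerate executor failures while at least half of the max executors stay registered.
   val conf = new SparkConf()
     .set("spark.scheduler.minResourcesToSurviveRatio", "0.5")
   ```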
   
   ### Does this PR introduce _any_ user-facing change?
   
   
   Yes, a new configuration.
   
   ### How was this patch tested?
   
   
   Tested manually on YARN.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   

