[ https://issues.apache.org/jira/browse/SPARK-36509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17403554#comment-17403554 ]
Apache Spark commented on SPARK-36509:
--------------------------------------

User 'sarutak' has created a pull request for this issue:
https://github.com/apache/spark/pull/33818

> Executors don't get rescheduled in standalone mode when worker dies
> -------------------------------------------------------------------
>
>                 Key: SPARK-36509
>                 URL: https://issues.apache.org/jira/browse/SPARK-36509
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.1, 3.1.1, 3.1.2
>            Reporter: Peter Kaiser
>            Priority: Major
>
> This is reproducible with an application that uses fewer cores than are
> available on the workers.
> E.g. with 1 application with 1 executor, when the worker with the executor is
> killed, the application will not get another executor assigned even if there
> are enough resources in the cluster. This seems to be a regression, caused by
> [https://github.com/apache/spark/commit/51de86baed0776304c6184f2c04b6303ef48df90#diff-ca694acef669f50f9b45ca0d32ab6f5a516270bb26b33c4abb704e2dc00a1a03].
> That causes an assertion error on the master because it gets an
> executorStateChange from 'RUNNING' to 'RUNNING' instead of 'FAILED':
> {noformat}
> 2021-08-13 14:04:12,554 [dispatcher-event-loop-2] INFO : I have been elected leader! New state: ALIVE
> 2021-08-13 14:04:12,554 [dispatcher-event-loop-2] INFO : I have been elected leader! New state: ALIVE
> 2021-08-13 14:04:56,489 [dispatcher-event-loop-10] INFO : Registering worker 172.27.64.1:58636 with 12 cores, 30.7 GiB RAM
> 2021-08-13 14:04:59,949 [dispatcher-event-loop-6] INFO : Registering worker 172.27.64.1:58694 with 12 cores, 30.7 GiB RAM
> 2021-08-13 14:05:20,212 [dispatcher-event-loop-2] INFO : Registering app query-frontend-null-172.27.64.1
> 2021-08-13 14:05:20,212 [dispatcher-event-loop-2] INFO : Registered app query-frontend-null-172.27.64.1 with ID app-20210813140520-0000
> 2021-08-13 14:05:20,228 [dispatcher-event-loop-2] INFO : Launching executor app-20210813140520-0000/0 on worker worker-20210813140459-172.27.64.1-58694
> 2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error
> java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
>     at scala.Predef$.assert(Predef.scala:223)
>     at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:323)
>     at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
> {noformat}
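
To make the failure mode in the issue description concrete, here is a minimal sketch of a state-transition guard that would produce the logged message. It is a toy model and not Spark's actual Master code: the `ExecutorState` enum, the `apply_state_change` function, and the rule that RUNNING may only be entered from LAUNCHING are assumptions inferred from the assertion text in the log above.

```python
from enum import Enum


class ExecutorState(Enum):
    # Hypothetical subset of executor states, named after those in the report.
    LAUNCHING = "LAUNCHING"
    RUNNING = "RUNNING"
    FAILED = "FAILED"


def apply_state_change(old_state, new_state, exec_id=0):
    """Return the new state, rejecting any move to RUNNING that does not
    start from LAUNCHING (assumed rule, mirroring the logged assertion)."""
    if new_state is ExecutorState.RUNNING and old_state is not ExecutorState.LAUNCHING:
        raise AssertionError(
            f"executor {exec_id} state transfer from {old_state.name} "
            f"to {new_state.name} is illegal"
        )
    return new_state


if __name__ == "__main__":
    # Normal start-up: LAUNCHING -> RUNNING is accepted.
    state = apply_state_change(ExecutorState.LAUNCHING, ExecutorState.RUNNING)
    # The reported case: the master is told RUNNING again instead of FAILED.
    try:
        apply_state_change(state, ExecutorState.RUNNING)
    except AssertionError as err:
        print(f"Ignoring error: {err}")
```

Under this model, a dead worker's executor reported as RUNNING rather than FAILED trips the guard. If the error is then only logged and ignored, as the "Ignoring error" line suggests, the handling that would reschedule the executor is never reached.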