[ https://issues.apache.org/jira/browse/SPARK-37432?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17448235#comment-17448235 ]
Hoa Le edited comment on SPARK-37432 at 11/23/21, 8:43 PM:
-----------------------------------------------------------

FYI, I hit this issue with the decommission feature enabled. Could we add logic so that if the driver misses several consecutive heartbeats from an executor, it removes that executor from its list?

was (Author: vanhoale):
FYI, I enabled the decommission feature. Could we add logic so that if the driver misses several consecutive heartbeats from an executor, it removes that executor from its list?

> Driver keeps a record of a decommissioned executor
> --------------------------------------------------
>
>                 Key: SPARK-37432
>                 URL: https://issues.apache.org/jira/browse/SPARK-37432
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes
>    Affects Versions: 3.1.1
>            Reporter: Hoa Le
>            Priority: Minor
>        Attachments: master_ui_executor_tab.png
>
>
> Hi,
> We are running Spark 3.1.1 on Kubernetes. After the application has been running for a while, we get the exceptions below.
> On the driver:
> {code:java}
> 2021-11-21 18:25:21,859 ERROR Failed to send RPC RPC 6827167497981418905 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>     at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>     at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21 18:25:21,864 ERROR Failed to send RPC RPC 7618635518207296341 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>     at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>     at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21 18:25:21,868 ERROR Failed to send RPC RPC 5040314884474308699 to /10.1.201.113:58354: java.nio.channels.ClosedChannelException (org.apache.spark.network.client.TransportClient) [rpc-server-4-1]
> java.nio.channels.ClosedChannelException
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.newClosedChannelException(AbstractChannel.java:957)
>     at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:865)
>     at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
>     at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
>     at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071)
>     at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
>     at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
>     at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
>     at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
>     at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
>     at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>     at java.base/java.lang.Thread.run(Unknown Source)
> {code}
>
> The dead executor (we have its logs exported to persistent storage):
> {code:java}
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-1]"
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.745Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-2]"
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at org.apache.spark.storage.BlockManagerDecommissioner$ShuffleMigrationRunnable.run(BlockManagerDecommissioner.scala:83)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,at java.base/java.lang.Thread.sleep(Native Method)
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,java.lang.InterruptedException: sleep interrupted
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,632 ERROR Error while waiting for block to migrate (org.apache.spark.storage.BlockManagerDecommissioner) [migrate-shuffles-0]"
> 2021-11-21T14:25:47.744Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:25:47,618 ERROR Executor self-exiting due to : Finished decommissioning (org.apache.spark.executor.CoarseGrainedExecutorBackend) [wait-for-blocks-to-migrate]"
> 2021-11-21T14:23:52.722Z,i-060dc8622c9624dbd,spark,"2021-11-21 14:23:52,199 WARN NoSuchMethodException was thrown when disabling normalizeUri. This indicates you are using an old version (< 4.5.8) of Apache http client. It is recommended to use http client version >= 4.5.9 to avoid the breaking change introduced in apache client 4.5.7 and the latency in exception handling. See https://github.com/aws/aws-sdk-java/issues/1919 for more information (com.amazonaws.http.apache.utils.ApacheUtils) [dispatcher-Executor]"
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: All illegal access operations will be denied in a future release
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: Please consider reporting this to the maintainers of org.apache.spark.unsafe.Platform
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"WARNING: Illegal reflective access by org.apache.spark.unsafe.Platform (file:/opt/spark/jars/spark-unsafe_2.12-3.1.1.jar) to constructor java.nio.DirectByteBuffer(long,int)"
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,WARNING: An illegal reflective access operation has occurred
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,+ exec /usr/bin/tini -s -- /usr/local/openjdk-11/bin/java -Dlog4j.configuration=file:/opt/spark/log4j/log4j.properties -javaagent:/prometheus/jmx_prometheus_javaagent-0.16.1.jar=8090:/etc/metrics/conf/prometheus.yaml -Dspark.network.timeout=600s -Dspark.driver.port=7078 -Dspark.driver.blockManager.port=7079 -Xms14336m -Xmx14336m -cp '/opt/spark/conf::/opt/spark/jars/*:' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://coarsegrainedschedu...@firehose-ingestion-job-4ace777d3ec0ca06-driver-svc.atlas-spark-apps.svc:7078 --executor-id 699 --cores 3 --app-id spark-be1c315d0c2d49df926455f6d04a50eb --hostname 10.1.201.113 --resourceProfileId 0
> 2021-11-21T14:23:51.721Z,i-060dc8622c9624dbd,spark,"+ CMD=(${JAVA_HOME}/bin/java ""${SPARK_EXECUTOR_JAVA_OPTS[@]}"" -Xms$SPARK_EXECUTOR_MEMORY -Xmx$SPARK_EXECUTOR_MEMORY -cp ""$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH"" org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url $SPARK_DRIVER_URL --executor-id $SPARK_EXECUTOR_ID --cores $SPARK_EXECUTOR_CORES --app-id $SPARK_APPLICATION_ID --hostname $SPARK_EXECUTOR_POD_IP --resourceProfileId $SPARK_RESOURCE_PROFILE_ID)"
> {code}
>
> The executor pod was actually decommissioned, but the driver seems to keep sending tasks to that executor.
> The attached screenshot shows the Executors tab of the application's master UI:
> !image-2021-11-21-12-33-56-840.png!
>
> and no executor pod 699 is running:
> {code:java}
> $ kubectl get pods |grep firehose
> firehose-ingestion-job-driver                    1/1     Running   0          23h
> firehoseingestionjob-28d3d27d3ec15aaf-exec-869   1/1     Running   0          18m
> firehoseingestionjob-28d3d27d3ec15aaf-exec-874   1/1     Running   0          18m
> {code}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
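The logic proposed in the comment above (remove an executor from the driver's list after several consecutive missed heartbeats) could be sketched roughly as below. This is only an illustration of the idea, not Spark's actual `HeartbeatReceiver` implementation; `HeartbeatTracker`, `maxMisses`, and the method names are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the proposal: count consecutive missed heartbeats per executor
// and signal removal once the count reaches a configurable threshold.
// Hypothetical helper, not Spark's real driver-side heartbeat handling.
class HeartbeatTracker {
    private final int maxMisses;
    private final Map<String, Integer> misses = new HashMap<>();

    HeartbeatTracker(int maxMisses) {
        this.maxMisses = maxMisses;
    }

    // A heartbeat arrived from this executor: reset its miss counter.
    void recordHeartbeat(String executorId) {
        misses.put(executorId, 0);
    }

    // A heartbeat interval elapsed with no report from this executor.
    // Returns true once the executor should be dropped from the driver's list.
    boolean recordMiss(String executorId) {
        int n = misses.merge(executorId, 1, Integer::sum);
        return n >= maxMisses;
    }

    // Forget the executor's state after it has been removed.
    void forget(String executorId) {
        misses.remove(executorId);
    }
}
```

A caller on the driver side would invoke `recordMiss` for each silent executor on every heartbeat-check tick and remove the executor (here, decommissioned pod 699) once it returns true. Note that the driver does time out silent executors via `spark.network.timeout` (set to 600s in the launch command above), so the report suggests that path did not fire for this decommissioned executor.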