[ https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201207#comment-17201207 ]
Tibor Fasanga commented on SPARK-32975:
---------------------------------------

Note that the main problem is that the executor pod exits with an error while the Spark driver and the Spark operator still consider it running, so the executor is never restarted. This is an intermittent problem. Our testing shows that it happens frequently when both of the following are true:
# the driver pod has a sidecar container, and
# initializing and starting the sidecar container takes longer than usual (the delay is caused by the time required to pull the sidecar container's image)

In other words, the problem manifests itself when there is a delay between starting the driver *container* and the time the driver *pod* is fully started (the pod contains both the driver container and the sidecar container). In this case we see the following events in the description of the driver pod (note the "_Pulling image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"_" event that is present in this case):
{code:java}
Events:
  Type     Reason       Age        From               Message
  ----     ------       ----       ----               -------
  Normal   Scheduled    <unknown>  default-scheduler  Successfully assigned default/act-pipeline-app-driver to node5
  Warning  FailedMount  20m        kubelet, node5     MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "act-pipeline-app-1600699152173-driver-conf-map" not found
  Normal   Pulled       20m        kubelet, node5     Container image "registry.nspos.nokia.local/nspos-pki-container:20.9.0-rel.1" already present on machine
  Normal   Created      20m        kubelet, node5     Created container nspos-pki
  Normal   Started      20m        kubelet, node5     Started container nspos-pki
  Normal   Pulling      20m        kubelet, node5     Pulling image "registry.nspos.nokia.local/analytics-rtanalytics-pipeline-app:20.9.0-rel.48"
  Normal   Pulled       19m        kubelet, node5     Successfully pulled image "registry.nspos.nokia.local/analytics-rtanalytics-pipeline-app:20.9.0-rel.48"
  Normal   Created      19m        kubelet, node5     Created container spark-kubernetes-driver
  Normal   Started      19m        kubelet, node5     Started container spark-kubernetes-driver
  Normal   Pulling      19m        kubelet, node5     Pulling image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"
  Normal   Pulled       18m        kubelet, node5     Successfully pulled image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"
  Normal   Created      18m        kubelet, node5     Created container log-sidecar
  Normal   Started      18m        kubelet, node5     Started container log-sidecar
{code}
Note: The message "_MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "act-pipeline-app-1600699152173-driver-conf-map" not found_" seems to be unrelated and does not appear to cause any problems.

> [K8S] - executor fails to be restarted after it goes to ERROR/Failure state
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32975
>                 URL: https://issues.apache.org/jira/browse/SPARK-32975
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.4.4
>            Reporter: Shenson Joseph
>            Priority: Critical
>
> We are using the v1beta2-1.1.2-2.4.5 version of the operator with Spark 2.4.4.
> Spark executors keep getting killed with exit code 1, and we see the
> following exception in the executor that goes to the ERROR state. Once this
> error happens, the driver doesn't restart the executor.
>
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
>   at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
>   at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
>   at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
>   at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
>   at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>   at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>   at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
>   at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
>   at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
>   at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   ... 4 more
> Caused by: java.io.IOException: Failed to connect to act-pipeline-app-1600187491917-driver-svc.default.svc:7078
>   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
>   at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
>   at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
>   at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
>   at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: act-pipeline-app-1600187491917-driver-svc.default.svc
>   at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
>   at java.net.InetAddress.getAllByName(InetAddress.java:1193)
>   at java.net.InetAddress.getAllByName(InetAddress.java:1127)
>   at java.net.InetAddress.getByName(InetAddress.java:1077)
>   at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
>   at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
>   at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
>   at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
>   at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
>   at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
>   at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
>   at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
>   at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
>   at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
>   at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
>   at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
>   at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
>   at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
>   at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
>   at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
>   at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
>   at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
>   at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
>   at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
>   at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
>   at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
>   at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
>   ... 1 more
> CodeCache: size=245760Kb used=4762Kb max_used=4763Kb free=240997Kb
>   bounds [0x00007f49f5000000, 0x00007f49f54b0000, 0x00007f4a04000000]
>   total_blobs=1764 nmethods=1356 adapters=324
>   compilation: enabled
>
> *Additional information:*
> *The status of the Spark application shows it is RUNNING:*
> kubectl describe sparkapplications.sparkoperator.k8s.io act-pipeline-app
> ...
> ...
> Status:
>   Application State:
>     State:  RUNNING
>   Driver Info:
>     Pod Name:             act-pipeline-app-driver
>     Web UI Address:       10.233.57.201:40550
>     Web UI Port:          40550
>     Web UI Service Name:  act-pipeline-app-ui-svc
>   Execution Attempts:     1
>   Executor State:
>     act-pipeline-app-1600097064694-exec-1:  RUNNING
>   Last Submission Attempt Time:  2020-09-14T15:24:26Z
>   Spark Application Id:          spark-942bb2e500c54f92ac357b818c712558
>   Submission Attempts:           1
>   Submission ID:                 4ecdb6ca-d237-4524-b05e-c42cfcc73dc7
>   Termination Time:              <nil>
> Events:  <none>
>
> *The executor pod reports that it is Terminated:*
> kubectl describe pod -l sparkoperator.k8s.io/app-name=act-pipeline-app,spark-role=executor
> ...
> ...
> Containers:
>   executor:
>     Container ID:  docker://9aa5b585e8fb7390b87a4771f3ed1402cae41f0fe55905d0172ed6e90dde34e6
>     ...
>     Ports:         7079/TCP, 8090/TCP
>     Host Ports:    0/TCP, 0/TCP
>     Args:
>       executor
>     State:          Terminated
>       Reason:       Error
>       Exit Code:    1
>       Started:      Mon, 14 Sep 2020 11:25:35 -0400
>       Finished:     Mon, 14 Sep 2020 11:25:39 -0400
>     Ready:          False
>     Restart Count:  0
>     ...
> Conditions:
>   Type             Status
>   Initialized      True
>   Ready            False
>   ContainersReady  False
>   PodScheduled     True
>   ...
> QoS Class:       Burstable
> Node-Selectors:  <none>
> Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
>                  node.kubernetes.io/unreachable:NoExecute for 300s
> Events:          <none>
>
> In the early stage of the driver's life the failed executor is not detected (it
> is assumed to be running) and therefore it will not be restarted.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
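[Editor's note] The symptom reported above is a pod whose container has terminated with exit code 1 while the application status still reads RUNNING. The gap is that a pod's top-level phase can remain "Running" while an individual container inside it is already in a Terminated state, so any check that looks only at the phase misses the failure. A minimal, illustrative Python sketch of the distinction (the dict shapes mimic simplified Kubernetes pod-status fields, and `executor_failed` is a hypothetical helper, not Spark's or the operator's actual API):

```python
def executor_failed(pod_status):
    """Return True if any container in the pod has terminated with a
    non-zero exit code, even when the pod phase still reads 'Running'.
    The input mimics a simplified Kubernetes pod .status structure."""
    for cs in pod_status.get("containerStatuses", []):
        terminated = cs.get("state", {}).get("terminated")
        if terminated and terminated.get("exitCode", 0) != 0:
            return True
    return False


# Phase says Running, but the executor container exited with code 1 --
# the same combination shown in the 'kubectl describe pod' output above.
pod = {
    "phase": "Running",
    "containerStatuses": [
        {"name": "executor",
         "state": {"terminated": {"exitCode": 1, "reason": "Error"}}},
    ],
}

assert pod["phase"] == "Running"      # phase alone looks healthy
assert executor_failed(pod)           # container status reveals the failure
```

A check along these lines (container statuses rather than pod phase) is what would be needed to notice the failed executor early in the driver's life and trigger a restart.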