[ https://issues.apache.org/jira/browse/SPARK-32975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201207#comment-17201207 ]

Tibor Fasanga commented on SPARK-32975:
---------------------------------------

Note that the main problem is that the executor POD quits with an error while 
the Spark driver and the Spark operator still think it is running; therefore, 
the executor is never restarted.

This is an intermittent problem. Our testing shows that it happens frequently 
when both of the following are true: 
 # the driver POD has a sidecar container, and
 # the sidecar container takes a long time to initialize and start (the delay 
is caused by the time required to pull the sidecar container's image).

In other words, the problem manifests itself when there is a delay between the 
start of the driver *container* and the moment the driver *POD* is fully 
started (the POD contains both the driver container and the sidecar container).
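
To make the POD-versus-container distinction concrete, here is a minimal 
sketch (not Spark source) using the fabric8 Kubernetes client, the client 
library Spark's Kubernetes backend is built on. It prints the pod-level phase 
next to each container's own state: while the sidecar image is still being 
pulled, the driver container can already be running even though the POD as a 
whole is not. The namespace, pod name, and client version are assumptions 
taken from the output below.
{code:java}
// Illustrative only - not Spark code. Assumes fabric8 kubernetes-client 4.x
// (the library Spark's K8s backend uses) and the pod/namespace names from
// the events below.
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class PodVsContainerState {
  public static void main(String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      Pod driver = client.pods()
          .inNamespace("default")
          .withName("act-pipeline-app-driver")
          .get();
      // The phase summarizes the whole POD: it stays Pending until every
      // container (including the log sidecar) has started.
      System.out.println("pod phase: " + driver.getStatus().getPhase());
      for (ContainerStatus cs : driver.getStatus().getContainerStatuses()) {
        // Each container reports its own waiting/running/terminated state.
        System.out.printf("container %s: running=%b waiting=%b%n",
            cs.getName(),
            cs.getState().getRunning() != null,
            cs.getState().getWaiting() != null);
      }
    }
  }
}
{code}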

In this case we see the following events in the description of the driver POD 
(note the _Pulling image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"_ 
event, which is present in this case): 
{code:java}
Events:
  Type     Reason       Age        From               Message
  ----     ------       ----       ----               -------
  Normal   Scheduled    <unknown>  default-scheduler  Successfully assigned default/act-pipeline-app-driver to node5
  Warning  FailedMount  20m        kubelet, node5     MountVolume.SetUp failed for volume "spark-conf-volume" : configmap "act-pipeline-app-1600699152173-driver-conf-map" not found
  Normal   Pulled       20m        kubelet, node5     Container image "registry.nspos.nokia.local/nspos-pki-container:20.9.0-rel.1" already present on machine
  Normal   Created      20m        kubelet, node5     Created container nspos-pki
  Normal   Started      20m        kubelet, node5     Started container nspos-pki
  Normal   Pulling      20m        kubelet, node5     Pulling image "registry.nspos.nokia.local/analytics-rtanalytics-pipeline-app:20.9.0-rel.48"
  Normal   Pulled       19m        kubelet, node5     Successfully pulled image "registry.nspos.nokia.local/analytics-rtanalytics-pipeline-app:20.9.0-rel.48"
  Normal   Created      19m        kubelet, node5     Created container spark-kubernetes-driver
  Normal   Started      19m        kubelet, node5     Started container spark-kubernetes-driver
  Normal   Pulling      19m        kubelet, node5     Pulling image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"
  Normal   Pulled       18m        kubelet, node5     Successfully pulled image "registry.nspos.nokia.local/fluent/fluent-bit:1.5.5"
  Normal   Created      18m        kubelet, node5     Created container log-sidecar
  Normal   Started      18m        kubelet, node5     Started container log-sidecar
{code}
Note: the message _MountVolume.SetUp failed for volume "spark-conf-volume" : 
configmap "act-pipeline-app-1600699152173-driver-conf-map" not found_ appears 
to be unrelated and does not seem to cause any problems.
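
The mismatch itself can be checked from outside the driver. The following is a 
hedged sketch, again using the fabric8 client with the namespace and labels 
shown in this issue, that flags executor pods whose executor container has 
terminated with a non-zero exit code, i.e. the very pods the driver and 
operator still report as RUNNING in the description below.
{code:java}
// Illustrative only - an external consistency check, not a fix inside Spark.
// The label selector matches the kubectl command quoted below in the issue.
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;

public class FindDeadExecutors {
  public static void main(String[] args) {
    try (KubernetesClient client = new DefaultKubernetesClient()) {
      for (Pod pod : client.pods()
          .inNamespace("default")
          .withLabel("spark-role", "executor")
          .list().getItems()) {
        for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
          if (cs.getState().getTerminated() != null
              && cs.getState().getTerminated().getExitCode() != 0) {
            // This is exactly the state the driver never notices in this bug.
            System.out.printf("%s: container %s terminated with exit code %d%n",
                pod.getMetadata().getName(), cs.getName(),
                cs.getState().getTerminated().getExitCode());
          }
        }
      }
    }
  }
}
{code}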

> [K8S] - executor fails to be restarted after it goes to ERROR/Failure state
> ---------------------------------------------------------------------------
>
>                 Key: SPARK-32975
>                 URL: https://issues.apache.org/jira/browse/SPARK-32975
>             Project: Spark
>          Issue Type: Bug
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.4.4
>            Reporter: Shenson Joseph
>            Priority: Critical
>
> We are using version v1beta2-1.1.2-2.4.5 of the operator with Spark 2.4.4.
> Spark executors keep getting killed with exit code 1, and we see the 
> following exception in the executor that goes to the ERROR state. Once this 
> error happens, the driver does not restart the executor. 
>  
> Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1713)
> at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:64)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:281)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
> Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
> at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
> at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
> at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
> at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:201)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:65)
> at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:64)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
> ... 4 more
> Caused by: java.io.IOException: Failed to connect to act-pipeline-app-1600187491917-driver-svc.default.svc:7078
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:245)
> at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:187)
> at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:198)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:194)
> at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:190)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.UnknownHostException: act-pipeline-app-1600187491917-driver-svc.default.svc
> at java.net.InetAddress.getAllByName0(InetAddress.java:1281)
> at java.net.InetAddress.getAllByName(InetAddress.java:1193)
> at java.net.InetAddress.getAllByName(InetAddress.java:1127)
> at java.net.InetAddress.getByName(InetAddress.java:1077)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:146)
> at io.netty.util.internal.SocketUtils$8.run(SocketUtils.java:143)
> at java.security.AccessController.doPrivileged(Native Method)
> at io.netty.util.internal.SocketUtils.addressByName(SocketUtils.java:143)
> at io.netty.resolver.DefaultNameResolver.doResolve(DefaultNameResolver.java:43)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:63)
> at io.netty.resolver.SimpleNameResolver.resolve(SimpleNameResolver.java:55)
> at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:57)
> at io.netty.resolver.InetSocketAddressResolver.doResolve(InetSocketAddressResolver.java:32)
> at io.netty.resolver.AbstractAddressResolver.resolve(AbstractAddressResolver.java:108)
> at io.netty.bootstrap.Bootstrap.doResolveAndConnect0(Bootstrap.java:208)
> at io.netty.bootstrap.Bootstrap.access$000(Bootstrap.java:49)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:188)
> at io.netty.bootstrap.Bootstrap$1.operationComplete(Bootstrap.java:174)
> at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
> at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
> at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
> at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:104)
> at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:82)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetSuccess(AbstractChannel.java:978)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.register0(AbstractChannel.java:512)
> at io.netty.channel.AbstractChannel$AbstractUnsafe.access$200(AbstractChannel.java:423)
> at io.netty.channel.AbstractChannel$AbstractUnsafe$1.run(AbstractChannel.java:482)
> at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
> at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
> at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
> at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
> at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
> ... 1 more
> CodeCache: size=245760Kb used=4762Kb max_used=4763Kb free=240997Kb
> bounds [0x00007f49f5000000, 0x00007f49f54b0000, 0x00007f4a04000000]
> total_blobs=1764 nmethods=1356 adapters=324
> compilation: enabled
>  
>  
>  
> *Additional information:*
> *The status of the Spark application shows it is RUNNING:*
> kubectl describe sparkapplications.sparkoperator.k8s.io act-pipeline-app
> ...
> ...
> Status:
>   Application State:
>     State:  RUNNING
>   Driver Info:
>     Pod Name:             act-pipeline-app-driver
>     Web UI Address:       10.233.57.201:40550
>     Web UI Port:          40550
>     Web UI Service Name:  act-pipeline-app-ui-svc
>   Execution Attempts:     1
>   Executor State:
>     act-pipeline-app-1600097064694-exec-1:  RUNNING
>   Last Submission Attempt Time:             2020-09-14T15:24:26Z
>   Spark Application Id:                     spark-942bb2e500c54f92ac357b818c712558
>   Submission Attempts:                      1
>   Submission ID:                            4ecdb6ca-d237-4524-b05e-c42cfcc73dc7
>   Termination Time:                         <nil>
> Events:                                     <none>
>  
> *The executor pod is reporting that it is Terminated:*
> kubectl describe pod -l sparkoperator.k8s.io/app-name=act-pipeline-app,spark-role=executor
> ...
> ...
> Containers:
>   executor:
>     Container ID:  docker://9aa5b585e8fb7390b87a4771f3ed1402cae41f0fe55905d0172ed6e90dde34e6
> ...
>     Ports:         7079/TCP, 8090/TCP
>     Host Ports:    0/TCP, 0/TCP
>     Args:
>       executor
>     State:          Terminated
>       Reason:       Error
>       Exit Code:    1
>       Started:      Mon, 14 Sep 2020 11:25:35 -0400
>       Finished:     Mon, 14 Sep 2020 11:25:39 -0400
>     Ready:          False
>     Restart Count:  0
> ...
> Conditions:
>   Type              Status
>   Initialized       True
>   Ready             False
>   ContainersReady   False
>   PodScheduled      True
> ...
> QoS Class:       Burstable
> Node-Selectors:  <none>
> Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
>                  node.kubernetes.io/unreachable:NoExecute for 300s
> Events:          <none>
> In the early stage of the driver’s life, the failed executor is not detected 
> (it is assumed to be running) and therefore it is never restarted.
>  
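
As an illustration of the kind of check that would close this gap, the sketch 
below watches executor pods and treats one as failed as soon as any of its 
containers reports a non-zero terminated state, instead of assuming it is 
running. It is written against the fabric8 4.x Watcher API as an assumption; 
it is not the actual Spark implementation.
{code:java}
// Illustrative only - not Spark's lifecycle manager. Assumes fabric8 4.x,
// whose Watcher has eventReceived(Action, T) and onClose(KubernetesClientException).
import io.fabric8.kubernetes.api.model.ContainerStatus;
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.client.DefaultKubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClient;
import io.fabric8.kubernetes.client.KubernetesClientException;
import io.fabric8.kubernetes.client.Watcher;

public class ExecutorFailureWatch {
  public static void main(String[] args) throws InterruptedException {
    KubernetesClient client = new DefaultKubernetesClient();
    client.pods()
        .inNamespace("default")
        .withLabel("spark-role", "executor")
        .watch(new Watcher<Pod>() {
          @Override
          public void eventReceived(Action action, Pod pod) {
            // Inspect container statuses on every event instead of trusting
            // an assumed RUNNING state.
            for (ContainerStatus cs : pod.getStatus().getContainerStatuses()) {
              if (cs.getState().getTerminated() != null
                  && cs.getState().getTerminated().getExitCode() != 0) {
                System.out.printf("%s: executor pod %s failed (exit code %d)%n",
                    action, pod.getMetadata().getName(),
                    cs.getState().getTerminated().getExitCode());
                // A driver doing this could now request a replacement executor.
              }
            }
          }

          @Override
          public void onClose(KubernetesClientException cause) {
            // Watch closed; a real monitor would re-establish it.
          }
        });
    Thread.sleep(Long.MAX_VALUE);  // keep the demo process alive
  }
}
{code}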


