[jira] [Comment Edited] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2023-09-07 Thread Michael Negodaev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762581#comment-17762581
 ] 

Michael Negodaev edited comment on SPARK-34645 at 9/7/23 9:42 AM:
--

I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster. When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103){code}
I noticed that the last messages in the driver log are:
{code:java}
INFO  [org.apache.spark.SparkContext]: Successfully stopped SparkContext
Exception in thread "main" 
org.apache.spark.sql.streaming.StreamingQueryException: ...
Driver stacktrace:
=== Streaming Query ===
... {code}
while normally the ShutdownHookManager should be called and the last messages should be:
{code:java}
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Shutdown 
hook called
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Deleting 
directory /tmp/spark-1ba67a92-f374-4767-a233-54932195a2d5
...{code}
I can't see why ShutdownHookManager hasn't been called at the end of the job.
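For illustration, here is a minimal sketch of a defensive pattern that should force the driver JVM to exit even if a non-daemon thread would otherwise keep it alive. This is not the real application: the app name, the rate source, the console sink and the trigger interval are all placeholders, and the assumption that a non-daemon thread is pinning the JVM open after the StreamingQueryException is exactly that, an assumption:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.{StreamingQueryException, Trigger}

object StreamingMain {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("streaming-app")                      // placeholder name
      .getOrCreate()

    // Placeholder source and sink; the real job reads/writes elsewhere.
    val query = spark.readStream
      .format("rate")
      .load()
      .writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("10 seconds")) // placeholder interval
      .start()

    var exitCode = 0
    try {
      query.awaitTermination()                       // throws StreamingQueryException on failure
    } catch {
      case e: StreamingQueryException =>
        e.printStackTrace()
        exitCode = 1
    } finally {
      spark.stop()                                   // stops the context, as in the log above
    }
    // Force the JVM to shut down even if a non-daemon thread is still running,
    // so the shutdown hooks fire and the driver pod can reach a terminal state.
    sys.exit(exitCode)
  }
}
{code}
Calling sys.exit after spark.stop() makes the JVM run its shutdown hooks and terminate regardless of any remaining non-daemon threads, so the pod should leave the Running state.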


was (Author: mnegodaev):
I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster. When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103){code}
I noticed that the last messages in the driver log are:
{code:java}
INFO  [org.apache.spark.SparkContext]: Successfully stopped SparkContext
Exception in thread "main" 
org.apache.spark.sql.streaming.StreamingQueryException: ...
Driver stacktrace:
=== Streaming Query ===
... {code}
while normally the ShutdownHookManager should be called and the last messages should be:
{code:java}
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Shutdown 
hook called
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Deleting 
directory /tmp/spark-1ba67a92-f374-4767-a233-54932195a2d5
...{code}
I can't see why ShutdownHookManager hasn't been called at the end of the job.

 

 

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> ...
>  {code}

[jira] [Comment Edited] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2023-09-07 Thread Michael Negodaev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762581#comment-17762581
 ] 

Michael Negodaev edited comment on SPARK-34645 at 9/7/23 9:40 AM:
--

I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster. When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103){code}
I noticed that the last messages in the driver log are:
{code:java}
INFO  [org.apache.spark.SparkContext]: Successfully stopped SparkContext
Exception in thread "main" 
org.apache.spark.sql.streaming.StreamingQueryException: ...
Driver stacktrace:
=== Streaming Query ===
... {code}
while normally the ShutdownHookManager should be called and the last messages should be:
{code:java}
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Shutdown 
hook called
[shutdown-hook-0] INFO  [org.apache.spark.util.ShutdownHookManager]: Deleting 
directory /tmp/spark-1ba67a92-f374-4767-a233-54932195a2d5
...{code}
I can't see why ShutdownHookManager hasn't been called at the end of the job.
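One way to narrow this down (a hedged diagnostic idea, not something from the logs above): register a plain JVM shutdown hook next to Spark's and dump the live non-daemon threads when the query dies. If the extra hook never fires, the JVM never even began shutting down, which would point at a non-daemon thread keeping the driver alive rather than at ShutdownHookManager itself:
{code:scala}
// Diagnostic sketch only; nothing here is from the original application.
// Register a plain JVM shutdown hook at the top of main(): if this message
// never shows up after the failure, the JVM never started shutting down, so
// Spark's ShutdownHookManager was never given a chance to run.
Runtime.getRuntime.addShutdownHook(new Thread(() => {
  System.err.println("plain JVM shutdown hook fired")
}))

// After the StreamingQueryException is caught, list the non-daemon threads
// that are still alive and could be pinning the JVM (and the pod) open.
def dumpNonDaemonThreads(): Unit = {
  val it = Thread.getAllStackTraces.keySet.iterator()
  while (it.hasNext) {
    val t = it.next()
    if (t.isAlive && !t.isDaemon) {
      System.err.println(s"non-daemon thread still alive: ${t.getName}")
    }
  }
}
{code}
Whatever shows up in that list would be a good candidate for what is keeping the driver process alive.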

 

 


was (Author: mnegodaev):
I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster with
{code:java}
Trigger.ProcessingTime(duration){code}
When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103)
 {code}
I also noticed that changing the job trigger to
{code:java}
Trigger.Once() {code}
resolves the problem: after the job fails, k8s terminates the driver's pod with the Error status.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> ...
>  {code}

[jira] [Comment Edited] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2023-09-07 Thread Michael Negodaev (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17762581#comment-17762581
 ] 

Michael Negodaev edited comment on SPARK-34645 at 9/7/23 6:34 AM:
--

I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster with
{code:java}
Trigger.ProcessingTime(duration){code}
When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103)
 {code}
I also noticed that changing the job trigger to
{code:java}
Trigger.Once() {code}
resolves the problem: after the job fails, k8s terminates the driver's pod with the Error status.
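For reference, the only difference between the two behaviours is the trigger on the sink. A minimal sketch of the two variants, with a placeholder rate source and console sink rather than the real job:
{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark  = SparkSession.builder().appName("trigger-demo").getOrCreate() // placeholder name
val stream = spark.readStream.format("rate").load()                       // placeholder source

val query = stream.writeStream
  .format("console")                              // placeholder sink
  .trigger(Trigger.ProcessingTime("10 seconds"))  // long-running micro-batches: pod stays in Running after a failure
  // .trigger(Trigger.Once())                     // single batch: pod ends up in Error after a failure
  .start()

query.awaitTermination()
{code}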


was (Author: mnegodaev):
I faced a similar problem with Spark 3.3.2 on k8s and JDK 11.
I'm running a streaming application on a k8s cluster with
{code:java}
Trigger.ProcessingTime(duration){code}
When the job fails for some reason, the driver terminates all the executors and stops the query and the context, but the driver's pod gets stuck in the Running status.
A thread dump shows that we are waiting for the driver's pod to complete:
{code:java}
at 
org.apache.spark.deploy.k8s.submit.LoggingPodStatusWatcherImpl.watchOrStop(LoggingPodStatusWatcher.scala:103)
 {code}
I also noticed that changing the job trigger to
{code:java}
Trigger.Once() {code}
resolves the problem: after the job fails, k8s terminates the driver's pod with the Error status.

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> ...
>  {code}

[jira] [Comment Edited] (SPARK-34645) [K8S] Driver pod stuck in Running state after job completes

2021-05-16 Thread Arghya Saha (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17345731#comment-17345731
 ] 

Arghya Saha edited comment on SPARK-34645 at 5/16/21, 3:08 PM:
---

I am facing the same issue with Spark 3.1.1 + JDK 11. In which version is the fix expected?


was (Author: arghya18):
I am facing the same issue with Spark 3.1.1 + JDK 11. In which version is the fix expected?

> [K8S] Driver pod stuck in Running state after job completes
> ---
>
> Key: SPARK-34645
> URL: https://issues.apache.org/jira/browse/SPARK-34645
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 3.0.2
> Environment: Kubernetes:
> {code:java}
> Client Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.2", 
> GitCommit:"f5743093fd1c663cb0cbc89748f730662345d44d", GitTreeState:"clean", 
> BuildDate:"2020-09-16T13:41:02Z", GoVersion:"go1.15", Compiler:"gc", 
> Platform:"linux/amd64"}
> Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5", 
> GitCommit:"2166946f41b36dea2c4626f90a77706f426cdea2", GitTreeState:"clean", 
> BuildDate:"2019-03-25T15:19:22Z", GoVersion:"go1.11.5", Compiler:"gc", 
> Platform:"linux/amd64"}
>  {code}
>Reporter: Andy Grove
>Priority: Major
>
> I am running automated benchmarks in k8s, using spark-submit in cluster mode, 
> so the driver runs in a pod.
> When running with Spark 3.0.1 and 3.1.1 everything works as expected and I 
> see the Spark context being shut down after the job completes.
> However, when running with Spark 3.0.2 I do not see the context get shut down 
> and the driver pod is stuck in the Running state indefinitely.
> This is the output I see after job completion with 3.0.1 and 3.1.1 and this 
> output does not appear with 3.0.2. With 3.0.2 there is no output at all after 
> the job completes.
> {code:java}
> 2021-03-05 20:09:24,576 INFO spark.SparkContext: Invoking stop() from 
> shutdown hook
> 2021-03-05 20:09:24,592 INFO server.AbstractConnector: Stopped 
> Spark@784499d0{HTTP/1.1, (http/1.1)}{0.0.0.0:4040}
> 2021-03-05 20:09:24,594 INFO ui.SparkUI: Stopped Spark web UI at 
> http://benchmark-runner-3e8a38780400e0d1-driver-svc.default.svc:4040
> 2021-03-05 20:09:24,599 INFO k8s.KubernetesClusterSchedulerBackend: Shutting 
> down all executors
> 2021-03-05 20:09:24,600 INFO 
> k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each 
> executor to shut down
> 2021-03-05 20:09:24,609 WARN k8s.ExecutorPodsWatchSnapshotSource: Kubernetes 
> client has been closed (this is expected if the application is shutting down.)
> 2021-03-05 20:09:24,719 INFO spark.MapOutputTrackerMasterEndpoint: 
> MapOutputTrackerMasterEndpoint stopped!
> 2021-03-05 20:09:24,736 INFO memory.MemoryStore: MemoryStore cleared
> 2021-03-05 20:09:24,738 INFO storage.BlockManager: BlockManager stopped
> 2021-03-05 20:09:24,744 INFO storage.BlockManagerMaster: BlockManagerMaster 
> stopped
> 2021-03-05 20:09:24,752 INFO 
> scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: 
> OutputCommitCoordinator stopped!
> 2021-03-05 20:09:24,768 INFO spark.SparkContext: Successfully stopped 
> SparkContext
> 2021-03-05 20:09:24,768 INFO util.ShutdownHookManager: Shutdown hook called
> 2021-03-05 20:09:24,769 INFO util.ShutdownHookManager: Deleting directory 
> /var/data/spark-67fa44df-e86c-463a-a149-25d95817ff8e/spark-a5476c14-c103-4108-b733-961400485d8a
> 2021-03-05 20:09:24,772 INFO util.ShutdownHookManager: Deleting directory 
> /tmp/spark-9d6261f5-4394-472b-9c9a-e22bde877814
> 2021-03-05 20:09:24,778 INFO impl.MetricsSystemImpl: Stopping s3a-file-system 
> metrics system...
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system stopped.
> 2021-03-05 20:09:24,779 INFO impl.MetricsSystemImpl: s3a-file-system metrics 
> system shutdown complete.
>  {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org