[jira] [Assigned] (SPARK-23980) Resilient Spark driver on Kubernetes

2019-01-02 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23980:
--

Assignee: (was: Marcelo Vanzin)

> Resilient Spark driver on Kubernetes
> 
>
> Key: SPARK-23980
> URL: https://issues.apache.org/jira/browse/SPARK-23980
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.3.0
>Reporter: Sebastian Toader
>Priority: Major
>
> The current implementation of the Spark driver on Kubernetes is not resilient 
> to node failures because the driver runs as a bare `Pod`. When a node fails, 
> Kubernetes terminates the pods that were running on that node and does not 
> reschedule them onto any of the other nodes of the cluster.
> If the driver is instead run as a Kubernetes 
> [Job|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/],
>  then it will be rescheduled onto another node.
> When the driver is terminated, its executors (which may run on other nodes) are 
> terminated by Kubernetes with some delay via [Kubernetes garbage 
> collection|https://kubernetes.io/docs/concepts/workloads/controllers/garbage-collection/].
> This can lead to concurrency issues where the re-spawned driver tries 
> to create new executors with the same names as executors that are still in 
> the middle of being cleaned up by Kubernetes garbage collection.
> To solve this, the executor name must be made unique for each driver 
> *instance*.
> The PR linked to this JIRA implements the above: it creates the 
> Spark driver as a Job and ensures that executor pod names are unique per 
> driver instance.
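
The per-instance uniqueness idea can be sketched as follows. This is a minimal illustration, not Spark's actual naming code: the suffix scheme and function names are hypothetical. The key point is that the random suffix is generated once per driver instance, so a re-spawned driver never reuses the pod names of its predecessor's executors that garbage collection has not yet deleted.

```python
import uuid

# Generated once per driver process (hypothetical scheme): each driver
# instance, including a re-spawned one, gets a fresh suffix, so its executor
# pod names cannot collide with pods of the previous instance that
# Kubernetes garbage collection is still cleaning up.
DRIVER_INSTANCE_SUFFIX = uuid.uuid4().hex[:8]


def executor_pod_name(app_name: str, executor_id: int) -> str:
    """Build an executor pod name that is unique per driver *instance*,
    not just per application name and executor id."""
    return f"{app_name}-{DRIVER_INSTANCE_SUFFIX}-exec-{executor_id}"
```

Without the instance suffix, two driver instances of the same application would derive identical executor names from `(app_name, executor_id)` alone, which is exactly the collision described above.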



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-23980) Resilient Spark driver on Kubernetes

2019-01-02 Thread Marcelo Vanzin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin reassigned SPARK-23980:
--

Assignee: Marcelo Vanzin




[jira] [Assigned] (SPARK-23980) Resilient Spark driver on Kubernetes

2018-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23980:


Assignee: Apache Spark




[jira] [Assigned] (SPARK-23980) Resilient Spark driver on Kubernetes

2018-04-13 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-23980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-23980:


Assignee: (was: Apache Spark)
