[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462139#comment-16462139
 ] 

Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 9:01 AM:


It is increasingly common for people to write custom controllers and custom 
resources rather than use the built-in controllers, especially when the 
workloads have special characteristics. This is the whole reason people are 
working on tooling like the [operator 
framework|https://coreos.com/blog/introducing-operator-framework]. I don't 
think the future lies in shoehorning applications into the existing 
controllers. The existing controllers are a good starting point, but for any 
custom orchestration the recommendation from the k8s community at large would 
be to write an operator, which in some sense is what we've done. So I think 
moving towards the built-in controllers doesn't give us anything more. 

Also, replication controllers and deployments are not meant for applications 
with termination semantics; they're suited to long-running services, which is 
why they never give up after seeing failures. However, the "batch"-type 
built-in controller, the [job 
controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion],
 does implement a [backoff 
policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy]
 that covers both initialization and runtime errors in containers. As I see it, 
we should have safe limits so that every kind of failure eventually gives up. 
I'm ok with making this limit configurable, similar to the job controller; one 
might want to set it very high in your case to get near-infinite retries, but 
I'm not convinced that behavior is a safe choice in the general case. 
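
To make that concrete, here is a minimal sketch of what a Job-controller-style 
cap could look like on the Spark side. The config key 
spark.kubernetes.executor.maxInitFailures and the policy object are purely 
illustrative, not existing Spark settings or code:

{code:scala}
// Illustrative sketch only: a configurable cap on executor init failures,
// analogous to the Job controller's backoffLimit. The property name
// "spark.kubernetes.executor.maxInitFailures" is hypothetical.
object InitFailurePolicy {
  // Setting the limit to Int.MaxValue approximates the "near infinite
  // retries" behavior discussed above.
  def shouldGiveUp(initFailuresSeen: Int, maxInitFailures: Int): Boolean =
    initFailuresSeen >= maxInitFailures

  def main(args: Array[String]): Unit = {
    val maxInitFailures =
      sys.props.getOrElse("spark.kubernetes.executor.maxInitFailures", "10").toInt
    Seq(3, 10, 12).foreach { seen =>
      println(s"init failures seen=$seen giveUp=${shouldGiveUp(seen, maxInitFailures)}")
    }
  }
}
{code}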

Also, flakiness due to admission webhooks seems like it should be handled by 
retries in the init container, or by some other automation, since it's outside 
Spark land. That makes me apprehensive about handling such specific cases 
within Spark, instead of dealing with it as "framework error" and "app error".
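
If the flakiness is transient, one way to absorb it outside Spark is a bounded 
retry loop inside the init container itself. A rough sketch of that idea, 
where the flaky step is only a stand-in for whatever setup work the 
webhook-injected container actually does:

{code:scala}
import scala.util.{Failure, Random, Success, Try}

// Rough sketch of "retries in the init container": wrap the flaky setup step
// in a bounded retry with exponential backoff so transient failures don't
// surface to Spark as Init:Error. Not tied to any real init-container image.
object InitRetry {
  @annotation.tailrec
  def retry[T](attemptsLeft: Int, delayMs: Long)(step: => T): T =
    Try(step) match {
      case Success(value) => value
      case Failure(_) if attemptsLeft > 1 =>
        Thread.sleep(delayMs)
        retry(attemptsLeft - 1, delayMs * 2)(step)
      case Failure(error) => throw error
    }

  def main(args: Array[String]): Unit = {
    // Stand-in for the real setup work (dependency download, config rendering, ...).
    val result = retry(attemptsLeft = 5, delayMs = 500) {
      if (Random.nextDouble() < 0.7) sys.error("transient webhook hiccup") else "ready"
    }
    println(result)
  }
}
{code}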


was (Author: foxish):
It is increasingly common for people to write custom controllers and custom 
resources and not use the built-in controllers, especially when the workloads 
have special characteristics. This is the whole reason why people are working 
on tooling like the [operator 
framework|https://coreos.com/blog/introducing-operator-framework]. I don't 
think the future lies in shoehorning applications to use the existing 
controllers. The existing controllers are a good starting point but for any 
custom orchestration, the recommendation from the k8s community at large would 
be to write an operator which in some sense is what we've done. So, I think 
moving towards the built-in controllers doesn't give us anything more. 

Also, replication controllers and deployments are not used for applications 
with termination semantics. They're suitable for long running services. The 
only "batch" type built-in controller is the [job 
controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion],
 which does implement a [backoff 
policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy]
 that covers the initialization and runtime errors in containers. As I see it, 
we should have safe limits for all kinds of failures to eventually give up; 
it's more a question of whether this should be treated differently as a 
framework error.

Also, flakiness due to admission webhooks seems like it should be handled by 
retries in the init container, or by some other automation, since it's outside 
Spark land. That makes me apprehensive about handling such specific cases 
within Spark, instead of dealing with it as "framework error" and "app error".

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up 

[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Anirudh Ramanathan (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462139#comment-16462139
 ] 

Anirudh Ramanathan edited comment on SPARK-24135 at 5/3/18 8:58 AM:


It is increasingly common for people to write custom controllers and custom 
resources and not use the built-in controllers, especially when the workloads 
have special characteristics. This is the whole reason why people are working 
on tooling like the [operator 
framework|https://coreos.com/blog/introducing-operator-framework]. I don't 
think the future lies in shoehorning applications to use the existing 
controllers. The existing controllers are a good starting point but for any 
custom orchestration, the recommendation from the k8s community at large would 
be to write an operator which in some sense is what we've done. So, I think 
moving towards the built-in controllers doesn't give us anything more. 

Also, replication controllers and deployments are not used for applications 
with termination semantics. They're suitable for long running services. The 
only "batch" type built-in controller is the [job 
controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion],
 which does implement a [backoff 
policy|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#pod-backoff-failure-policy]
 that covers the initialization and runtime errors in containers. As I see it, 
we should have safe limits for all kinds of failures to eventually give up; 
it's more a question of whether this should be treated differently as a 
framework error.

Also, flakiness due to admission webhooks seems like it should be handled by 
retries in the init container, or by some other automation, since it's outside 
Spark land. That makes me apprehensive about handling such specific cases 
within Spark, instead of dealing with it as "framework error" and "app error".


was (Author: foxish):
It is increasingly common for people to write custom controllers and custom 
resources and not use the built-in controllers, especially when the workloads 
have special characteristics. This is the whole reason why people are working 
on tooling like the [operator 
framework|https://coreos.com/blog/introducing-operator-framework]. I don't 
think the future lies in shoehorning applications to use the existing 
controllers. The existing controllers are a good starting point but for any 
custom orchestration, the recommendation from the k8s community at large would 
be to write an operator which in some sense is what we've done. So, I think 
moving towards the built-in controllers doesn't give us anything more. 

Also, replication controllers and deployments are not used for applications 
with termination semantics. They're suitable for long running services. The 
only "batch" type built-in controller is the [job 
controller|https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion],
 which does implement a backoff policy that covers the initialization and 
runtime errors in containers. As I see it, we should have safe limits for all 
kinds of failures to eventually give up; it's more a question of whether this 
should be treated differently as a framework error.

Also, flakiness due to admission webhooks seems like it should be handled by 
retries in the init container, or by some other automation, since it's outside 
Spark land. That makes me apprehensive about handling such specific cases 
within Spark, instead of dealing with it as "framework error" and "app error".

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark 

[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-03 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462106#comment-16462106
 ] 

Matt Cheah edited comment on SPARK-24135 at 5/3/18 8:35 AM:


Not necessarily - if the pods fail to start up, we should retry them 
indefinitely, as a replication controller or a deployment would. There's an 
argument to be made that we should be using those higher-level primitives to 
run executors instead of raw pods anyway; it's just that Spark's scheduler 
code would need non-trivial changes to do so right now.


was (Author: mcheah):
Not necessarily - if the pods fail to start up, we should retry them 
indefinitely as a replication controller or a deployment would. There's an 
argument that can be made that we should be using those lower level primitives 
to run executors instead of raw pods anyways, just that Spark's scheduler code 
would need non-trivial changes to do so right now.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors that 
> successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.
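
For illustration, here is a simplified sketch of the kind of classification 
the allocator could apply to resolve such pods. The case classes are 
stand-ins for the real pod-status model; this is not the actual 
KubernetesClusterSchedulerBackend code:

{code:scala}
// Simplified sketch, not actual Spark code: treat a pod whose init container
// terminated with a non-zero exit code (Init:Error) or is crash-looping as a
// startup failure, so the allocator can resolve it instead of counting it as
// pending forever.
final case class InitContainerState(terminatedExitCode: Option[Int], waitingReason: Option[String])
final case class ExecutorPodStatus(phase: String, initContainers: Seq[InitContainerState])

object ExecutorPodClassifier {
  def failedInit(status: ExecutorPodStatus): Boolean =
    status.initContainers.exists { s =>
      s.terminatedExitCode.exists(_ != 0) ||          // Init:Error
        s.waitingReason.contains("CrashLoopBackOff")  // Init:CrashLoopBackOff
    }

  def main(args: Array[String]): Unit = {
    val stuck   = ExecutorPodStatus("Pending", Seq(InitContainerState(Some(1), None)))
    val healthy = ExecutorPodStatus("Running", Seq(InitContainerState(Some(0), None)))
    println(failedInit(stuck))   // true  -> count as failed and free the pending slot
    println(failedInit(healthy)) // false
  }
}
{code}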



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16461188#comment-16461188
 ] 

Matt Cheah edited comment on SPARK-24135 at 5/2/18 3:37 PM:


{quote}Restarting seems like it would eventually be limited by the job failure 
limit that Spark already has. If pod startup failures are deterministic the job 
failure count will hit this limit and job will be killed that way.
{quote}
In the case of an executor failing to start at all, this wouldn't be caught 
by Spark's task failure count logic, because tasks are never scheduled on 
executors that failed to start.
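
A toy illustration of why that is (this is not Spark's real accounting): only 
task failures advance the counter that the job-level limit looks at, so an 
executor that never starts is invisible to it:

{code:scala}
// Toy model, not Spark internals: only task failures advance the counter that
// the job-level failure limit (spark.task.maxFailures) consults, so an
// executor stuck in Init:Error never contributes to it and the job never
// fails fast on its account.
sealed trait FailureEvent
case object TaskFailed extends FailureEvent            // visible to the task-failure limit
case object ExecutorNeverStarted extends FailureEvent  // invisible to it

object FailureAccounting {
  def countedTaskFailures(events: Seq[FailureEvent]): Int =
    events.count(_ == TaskFailed)

  def main(args: Array[String]): Unit = {
    val events = Seq(ExecutorNeverStarted, ExecutorNeverStarted, TaskFailed)
    println(countedTaskFailures(events)) // 1: two of the three failures go uncounted
  }
}
{code}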


was (Author: mcheah):
> Restarting seems like it would eventually be limited by the job failure limit 
>that Spark already has. If pod startup failures are deterministic the job 
>failure count will hit this limit and job will be killed that way.

In the case of the executor failing to start at all, this wouldn't be caught by 
Spark's task failure count logic because you're never going to end up 
scheduling tasks on these executors that failed to start.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors that 
> successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.






[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-02 Thread Matt Cheah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460047#comment-16460047
 ] 

Matt Cheah edited comment on SPARK-24135 at 5/2/18 3:37 PM:


{quote}But I'm not sure how much this buys us because very likely the newly 
requested executors will fail to be initialized,
{quote}
That's entirely up to the behavior of the init container itself - there are 
many reasons to believe that a given init container's logic can be flaky. But 
it's not immediately obvious to me whether the init container's failure should 
count towards a job failure. Job failures shouldn't be caused by failures in 
the framework, and in this case the framework has added the init-container for 
these pods - in other words, the user's code didn't directly cause the job 
failure.


was (Author: mcheah):
_> But I'm not sure how much this buys us because very likely the newly 
requested executors will fail to be initialized,_

That's entirely up to the behavior of the init container itself - there's many 
reasons for one to believe that a given init container's logic can be flaky. 
But it's not immediately obvious to me whether or not the init container's 
failure should count towards a job failure. Job failures shouldn't be caused by 
failures in the framework, and in this case, the framework has added the 
init-container for these pods - in other words the user's code didn't directly 
cause the job failure.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors that 
> successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.






[jira] [Comment Edited] (SPARK-24135) [K8s] Executors that fail to start up because of init-container errors are not retried and limit the executor pool size

2018-05-01 Thread Yinan Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-24135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16460066#comment-16460066
 ] 

Yinan Li edited comment on SPARK-24135 at 5/1/18 7:53 PM:
--

I agree that we should add detection for initialization errors. But I'm not 
sure that requesting new executors to replace the ones that failed 
initialization is a good idea. External webhooks or initializers are typically 
installed by cluster admins, and there's always a risk of bugs in the webhooks 
or initializers that cause pods to fail initialization. In the case of 
initializers, things are worse: pods will not be able to get out of the 
pending status if, for whatever reason, the controller handling a particular 
initializer is down. For the reasons [~mcheah] mentioned above, it's not 
obvious whether initialization errors should count towards job failures. I 
think keeping track of how many initialization errors are seen and stopping 
requests for new executors after a certain threshold might be a good idea.
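
A minimal sketch of that threshold idea; the limit and the hook into the 
allocator are both hypothetical, not an existing Spark API:

{code:scala}
import java.util.concurrent.atomic.AtomicInteger

// Minimal sketch of the threshold idea above: count observed initialization
// errors and stop asking for replacement executors once a (hypothetical)
// limit is reached. Names and the hook point are illustrative only.
class InitErrorTracker(maxInitErrors: Int) {
  private val initErrors = new AtomicInteger(0)

  /** Record one initialization error; true while replacements are still allowed. */
  def recordAndCheck(): Boolean = initErrors.incrementAndGet() <= maxInitErrors
}

object InitErrorTrackerExample {
  def main(args: Array[String]): Unit = {
    val tracker = new InitErrorTracker(maxInitErrors = 3)
    (1 to 5).foreach { i =>
      println(s"init error #$i -> request replacement executor: ${tracker.recordAndCheck()}")
    }
  }
}
{code}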


was (Author: liyinan926):
I agree that we should add detection for initialization errors. But I'm not 
sure if requesting new executors to replace the ones that failed initialization 
is a good idea. External webhooks or initializers are typically installed by 
cluster admins and there's always risks of bugs in the webhooks or initializers 
that cause pods to fail initialization. In case of initializers, things are 
worse as pods will not be able to get out of pending status if for whatever 
reasons the controller that's handling a particular initializer is down. For 
the reasons [~mcheah] mentioned above, it's not obvious if initialization 
errors should count towards job failures. I think keeping track of how many 
initialization errors are seen and stopping requesting new executors might be a 
good idea.

> [K8s] Executors that fail to start up because of init-container errors are 
> not retried and limit the executor pool size
> ---
>
> Key: SPARK-24135
> URL: https://issues.apache.org/jira/browse/SPARK-24135
> Project: Spark
>  Issue Type: Bug
>  Components: Kubernetes
>Affects Versions: 2.3.0
>Reporter: Matt Cheah
>Priority: Major
>
> In KubernetesClusterSchedulerBackend, we detect if executors disconnect after 
> having been started or if executors hit the {{ERROR}} or {{DELETED}} states. 
> When executors fail in these ways, they are removed from the pending 
> executors pool and the driver should retry requesting these executors.
> However, the driver does not handle a different class of error: when the pod 
> enters the {{Init:Error}} state. This state comes up when the executor fails 
> to launch because one of its init-containers fails. Spark itself doesn't 
> attach any init-containers to the executors. However, custom web hooks can 
> run on the cluster and attach init-containers to the executor pods. 
> Additionally, pod presets can specify init containers to run on these pods. 
> Therefore Spark should be handling the {{Init:Error}} cases regardless of 
> whether Spark itself is aware of init-containers.
> This class of error is particularly bad because when we hit this state, the 
> failed executor will never start, but it's still seen as pending by the 
> executor allocator. The executor allocator won't request more rounds of 
> executors because its current batch hasn't been resolved to either running or 
> failed. Therefore we end up stuck with the number of executors that 
> successfully started before the faulty one failed to start, potentially 
> creating a fake resource bottleneck.


