[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2019-01-14 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16742591#comment-16742591
 ] 

Chun-Hung Hsiao commented on MESOS-9223:


For surfacing the error:
We could pass the failure to {{LocalResourceProviderManager}} through the 
interface I proposed above, then expose that in the 
{{LIST_RESOURCE_PROVIDER_CONFIGS}} API through MESOS-8745.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2019-01-11 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16740239#comment-16740239
 ] 

Benjamin Bannier commented on MESOS-9223:
-

Reviews:

 [r/69606/|https://reviews.apache.org/r/69606/]
 [r/69719/|https://reviews.apache.org/r/69719/]

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2019-01-09 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738638#comment-16738638
 ] 

Benjamin Bannier commented on MESOS-9223:
-

Capturing results from offline sync with [~chhsia0], [~jieyu], and [~jdef]:

 The agent should expose metrics reflecting resource provider-related state 
changes (e.g., a metric for resource provider (re)subscriptions, and other 
relevant events the RP manager currently exposes). If this could be implemented 
in the RP manager, we could reuse the code for ERPs where masters would hold RP 
managers. We'll want to expose metrics aggregated over all agents, but probably 
also metrics per RP to simplify triage.

We are not yet sure how RPs can surface reasons for e.g., disconnects as no 
message channel from RP up to RP manager exists. Right now this could be 
implemented by making use of e.g., out of band transport of plugin logs (e.g., 
via journald).

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-12-21 Thread Benjamin Bannier (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16726759#comment-16726759
 ] 

Benjamin Bannier commented on MESOS-9223:
-

Reviews:
https://reviews.apache.org/r/69606/
https://reviews.apache.org/r/69607/ 

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-12-03 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16707318#comment-16707318
 ] 

James DeFelice commented on MESOS-9223:
---

MESOS-8380 addresses UI changes. The UI should not be the only place to easily 
observe/troubleshoot errors. Ideally there'd be an API that exposes such.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-11-21 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16695138#comment-16695138
 ] 

Chun-Hung Hsiao commented on MESOS-9223:


There are to problems here:
1. How to make the agent more robust when handling SLRP failures.
2. How to surface the SLRP failures.

It seems to me 2 can be addressed by MESOS-8380.

Thought dumps for 1:
We can make the {{LocalResourceProviderDaemon}} act like systemd: retry 
launching the SLRP when there is a launch failure, potentially with an 
exponential backoff.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

2018-10-05 Thread James DeFelice (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16640285#comment-16640285
 ] 

James DeFelice commented on MESOS-9223:
---

Regardless of whether retries are implemented, it would be nice to have an API 
that exposed the reason for the error. e.g. the last log line, or Mesos error 
related to the container failure.

> Storage local provider does not sufficiently handle container launch failures 
> or errors
> ---
>
> Key: MESOS-9223
> URL: https://issues.apache.org/jira/browse/MESOS-9223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent, storage
>Reporter: Benjamin Bannier
>Assignee: Chun-Hung Hsiao
>Priority: Blocker
>
> The storage local resource provider as currently implemented does not handle 
> launch failures or task errors of its standalone containers well enough, If 
> e.g., a RP container fails to come up during node start a warning would be 
> logged, but an operator still needs to detect degraded functionality, 
> manually check the state of containers with {{GET_CONTAINERS}}, and decide 
> whether the agent needs restarting; I suspect they do not have always have 
> enough context for this decision. It would be better if the provider would 
> either enforce a restart by failing over the whole agent, or by retrying the 
> operation (optionally: up to some maximum amount of retries).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)