[jira] [Assigned] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2020-05-18 Thread Benjamin Bannier (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8400:
---

Assignee: (was: Benjamin Bannier)

> Handle plugin crashes gracefully in SLRP recovery.
> --
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Priority: Blocker
> Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its 
> corresponding {{csi::Client}} service future. However, if a CSI call races 
> with a plugin crash, the call may be issued before the service future is 
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses 
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP 
> recovery path, e.g., {{ListVolumes}}, {{GetCapacity}}, {{Probe}}, could make 
> the SLRP unrecoverable.
> There are two main issues:
>  1. For {{Probe}}, we should investigate whether a few retry attempts are 
> needed. Once the retries are exhausted, we should recover from the failed 
> attempts (e.g., by killing the plugin container) and have the container 
> daemon relaunch the plugin instead of failing the daemon.
>  2. For other calls in the recovery path, we should either retry the call or 
> enable the local resource provider daemon to restart the SLRP after it 
> fails.
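
The retry-then-relaunch idea in item 1 can be sketched as follows. This is a minimal, language-neutral illustration in Python, not Mesos code: `PluginUnavailable`, `call_with_retry`, `relaunch_plugin`, and `FakePlugin` are hypothetical names introduced here, and the attempt count and backoff are arbitrary assumptions.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class PluginUnavailable(Exception):
    """Hypothetical error: the call raced with a plugin crash or restart."""


def call_with_retry(
    call: Callable[[], T],
    relaunch_plugin: Callable[[], None],
    max_attempts: int = 3,
    backoff_seconds: float = 0.0,
) -> T:
    """Try `call` up to `max_attempts` times, then relaunch and try once more.

    `relaunch_plugin` stands in for "kill the plugin container and let the
    container daemon start it again" (an assumption, not a real Mesos hook).
    """
    for attempt in range(max_attempts):
        try:
            return call()
        except PluginUnavailable:
            # Exponential backoff between attempts; zero by default.
            time.sleep(backoff_seconds * (2 ** attempt))
    # Retries exhausted: recover by relaunching instead of failing outright.
    relaunch_plugin()
    return call()


class FakePlugin:
    """Stand-in plugin whose probe fails a fixed number of times."""

    def __init__(self, failures_before_ready: int) -> None:
        self.failures_left = failures_before_ready
        self.relaunches = 0

    def probe(self) -> str:
        if self.failures_left > 0:
            self.failures_left -= 1
            raise PluginUnavailable("plugin not ready")
        return "ready"

    def relaunch(self) -> None:
        # A relaunched plugin is assumed to come back healthy.
        self.relaunches += 1
        self.failures_left = 0


flaky = FakePlugin(failures_before_ready=2)
print(call_with_retry(flaky.probe, flaky.relaunch), flaky.relaunches)

crashed = FakePlugin(failures_before_ready=10)
print(call_with_retry(crashed.probe, crashed.relaunch), crashed.relaunches)
```

The first plugin recovers within the retry budget, so no relaunch happens; the second exhausts its retries and is relaunched once before the final call. The same wrapper shape could apply to the other recovery-path calls in item 2, though whether to retry or to restart the SLRP is exactly the open question the issue raises.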



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2019-08-21 Thread Benjamin Bannier (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8400:
---

  Sprint: Resource Mgmt: RI-17 Sprint 53
Assignee: Benjamin Bannier

> Handle plugin crashes gracefully in SLRP recovery.
> --
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Assignee: Benjamin Bannier
> Priority: Blocker
> Labels: mesosphere, mesosphere-dss-post-ga, storage


