[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2021-06-10 Thread Qian Zhang (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360928#comment-17360928
 ] 

Qian Zhang commented on MESOS-8400:
---

I see there are still two patches not merged yet:

[https://reviews.apache.org/r/71384]
[https://reviews.apache.org/r/71385]

[~bbannier] Can you please comment? Do we still need these two patches?

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Priority: Blocker
> Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP resets its
> corresponding {{csi::Client}} service future. However, if a CSI call races
> with a plugin crash, the call may be issued before the service future is
> reset, and that CSI call then fails. MESOS-9517 partly addresses
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP
> recovery path, e.g., {{ListVolumes}}, {{GetCapacity}}, and {{Probe}}, could
> make the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate whether a few retry attempts are
> needed. If those attempts still fail, we should recover (e.g., kill
> the plugin container) and make the container daemon relaunch the plugin
> instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call or
> make the local resource provider daemon able to restart the SLRP after it
> fails.
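
To make item 1 above concrete, here is a minimal sketch of "retry {{Probe}} a few times, then ask for a plugin relaunch instead of failing". It is not Mesos code: `ProbeOutcome`, `probeWithRetries`, and the `probe` callback are hypothetical names introduced for illustration, and the real SLRP works with asynchronous futures rather than a synchronous callback.

```cpp
#include <cassert>
#include <functional>

// Hypothetical result type for this sketch; not part of Mesos.
enum class ProbeOutcome
{
  READY,           // A probe attempt succeeded.
  RELAUNCH_PLUGIN  // All attempts failed; the caller should kill the
                   // plugin container and let the daemon relaunch it.
};

// Calls `probe` up to `maxAttempts` times and stops at the first success.
// If every attempt fails, the function reports that the plugin should be
// relaunched rather than treating the failure as fatal.
inline ProbeOutcome probeWithRetries(
    const std::function<bool()>& probe,
    int maxAttempts)
{
  for (int attempt = 0; attempt < maxAttempts; ++attempt) {
    if (probe()) {
      return ProbeOutcome::READY;
    }
  }

  return ProbeOutcome::RELAUNCH_PLUGIN;
}
```

The point of the design is that a failed probe maps to a recoverable action (relaunch) instead of a terminal failure of the daemon; how many attempts to allow is left open in the ticket.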



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2021-06-10 Thread Gregoire Seux (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360610#comment-17360610
 ] 

Gregoire Seux commented on MESOS-8400:
--

All related reviews seem to have been applied. Should we close this issue?



[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2019-09-04 Thread Benjamin Bannier (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922502#comment-16922502
 ] 

Benjamin Bannier commented on MESOS-8400:
-

{noformat}
commit d1b32cc3753001f7001dfa30fcea9000264001ef
Author: Benjamin Bannier 
Date:   Wed Sep 4 13:03:22 2019 +0200

    Added stringification for resource provider calls.

    Review: https://reviews.apache.org/r/71383/

commit 4676938dbff75ab0badd6dad35496285ddcff65c
Author: Benjamin Bannier 
Date:   Wed Sep 4 13:03:20 2019 +0200

    Removed unused and unimplemented method declaration.

    Review: https://reviews.apache.org/r/71382/
{noformat}

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
> Key: MESOS-8400
> URL: https://issues.apache.org/jira/browse/MESOS-8400
> Project: Mesos
> Issue Type: Improvement
> Reporter: Chun-Hung Hsiao
> Assignee: Benjamin Bannier
> Priority: Blocker
> Labels: mesosphere, mesosphere-dss-post-ga, storage


[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

2019-07-17 Thread Chun-Hung Hsiao (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887227#comment-16887227
 ] 

Chun-Hung Hsiao commented on MESOS-8400:


Blindly adding retry logic to all calls doesn't seem like a good strategy.

For this ticket, we can focus on making the LRP daemon restart a failed SLRP
with some backoff.
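
As an illustration of "restart with some backoff", the sketch below computes a capped exponential delay before each restart attempt. This is only one possible policy. The comment does not say which backoff scheme is intended, and `restartBackoff` and its parameters are hypothetical names, not Mesos APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

// Returns the delay to wait before restart attempt number `failures`
// (1-based): `initial` doubled once per previous failure, capped at `max`.
// Doubling stops as soon as the cap is reached, so large failure counts
// cannot overflow the duration.
inline std::chrono::milliseconds restartBackoff(
    int failures,
    std::chrono::milliseconds initial,
    std::chrono::milliseconds max)
{
  std::chrono::milliseconds delay = initial;

  for (int i = 1; i < failures && delay < max; ++i) {
    delay *= 2;
  }

  return std::min(delay, max);
}
```

A daemon loop would wait for `restartBackoff(n, ...)` after the n-th consecutive SLRP failure and reset `n` once the provider has recovered. The cap keeps a persistently crashing plugin from pushing the delay out indefinitely.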
