[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360928#comment-17360928 ]

Qian Zhang commented on MESOS-8400:
-----------------------------------

I see there are still two patches not merged yet:

[https://reviews.apache.org/r/71384]

[https://reviews.apache.org/r/71385]

[~bbannier] Can you please comment? Do we still need these two patches?

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its
> corresponding {{csi::Client}} service future. However, if a CSI call races
> with a plugin crash, the call may be issued before the service future is
> reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses
> this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP
> recovery path, e.g., {{ListVolume}}, {{GetCapacity}}, {{Probe}}, could make
> the SLRP unrecoverable.
> There are two main issues:
> 1. For {{Probe}}, we should investigate if it is needed to make a few retry
> attempts, then after that, we should recover from failed attempts (e.g., kill
> the plugin container), then make the container daemon relaunch the plugin
> instead of failing the daemon.
> 2. For other calls in the recovery path, we should either retry the call, or
> make the local resource provider daemon be able to restart the SLRP after it
> fails.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
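The first item in the quoted description (retry {{Probe}} a few times, then kill the plugin container and let the container daemon relaunch it instead of failing) can be sketched roughly as follows. This is a language-neutral illustration in Python, not Mesos code: `probe`, `kill_plugin`, `relaunch_plugin`, the attempt limits, and the use of `ConnectionError` to stand for a failed CSI call are all hypothetical names and assumptions made for the example.

```python
from typing import Callable


def probe_with_relaunch(
    probe: Callable[[], None],
    kill_plugin: Callable[[], None],
    relaunch_plugin: Callable[[], None],
    max_probe_attempts: int = 3,
    max_relaunches: int = 2,
) -> bool:
    """Return True once a probe succeeds, False if every launch is exhausted.

    All callables are hypothetical stand-ins: `probe` raises ConnectionError
    when the plugin is unreachable (e.g., it crashed mid-call).
    """
    for launch in range(max_relaunches + 1):
        # Retry the probe a bounded number of times against this launch.
        for _ in range(max_probe_attempts):
            try:
                probe()
                return True
            except ConnectionError:
                continue

        # All attempts failed: recover by killing and relaunching the
        # plugin rather than failing the whole daemon.
        if launch < max_relaunches:
            kill_plugin()
            relaunch_plugin()

    return False
```

Bounding both the probe attempts and the relaunches keeps the recovery path from spinning forever on a plugin that never comes up; what the caller does when this returns False is left open here.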
[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17360610#comment-17360610 ]

Gregoire Seux commented on MESOS-8400:
--------------------------------------

All related reviews seem to have been applied. Should we close this issue?
[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16922502#comment-16922502 ]

Benjamin Bannier commented on MESOS-8400:
-----------------------------------------

{noformat}
commit d1b32cc3753001f7001dfa30fcea9000264001ef
Author: Benjamin Bannier
Date:   Wed Sep 4 13:03:22 2019 +0200

    Added stringification for resource provider calls.

    Review: https://reviews.apache.org/r/71383/

commit 4676938dbff75ab0badd6dad35496285ddcff65c
Author: Benjamin Bannier
Date:   Wed Sep 4 13:03:20 2019 +0200

    Removed unused and unimplemented method declaration.

    Review: https://reviews.apache.org/r/71382/
{noformat}

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>              Labels: mesosphere, mesosphere-dss-post-ga, storage

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.
    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16887227#comment-16887227 ]

Chun-Hung Hsiao commented on MESOS-8400:
----------------------------------------

Blindly adding retry logic for all calls doesn't seem like a good strategy. For this ticket, we can focus on making the LRP daemon restart a failed SLRP with some backoff.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
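The "restart a failed SLRP with some backoff" approach from the comment above could look roughly like the sketch below. It is a generic supervisor loop in Python, not the Mesos implementation: `run_provider`, the use of `RuntimeError` to signal a failed provider, the exponential schedule, and all default values are assumptions chosen for illustration.

```python
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(
    initial: float = 1.0, factor: float = 2.0, maximum: float = 60.0
) -> Iterator[float]:
    """Yield exponentially growing delays, capped at `maximum` seconds."""
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


def supervise(
    run_provider: Callable[[], T],
    max_restarts: int,
    sleep: Callable[[float], None] = time.sleep,
    delays: Optional[Iterator[float]] = None,
) -> T:
    """Run a (hypothetical) provider, restarting it with backoff on failure.

    `run_provider` stands in for starting and recovering the SLRP; it is
    assumed to raise RuntimeError when recovery fails, e.g., because a CSI
    call raced with a plugin crash.
    """
    if delays is None:
        delays = backoff_delays()
    restarts = 0
    while True:
        try:
            return run_provider()
        except RuntimeError:
            if restarts >= max_restarts:
                raise  # Give up and surface the last failure.
            sleep(next(delays))  # Wait before the next restart.
            restarts += 1
```

Restarting the whole provider sidesteps per-call retry logic: any call in the recovery path that loses the race with a plugin crash simply fails the provider, and the supervisor tries again once the plugin has had time to come back.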