[ https://issues.apache.org/jira/browse/MESOS-10126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17119535#comment-17119535 ]
Qian Zhang commented on MESOS-10126:
------------------------------------

The fix was committed to the following branches. Author, date, and commit message are identical on every branch:

1.10.x: commit 97251a90d3336bd628c82becca00f545d95b01aa
1.9.x: commit dcce73d57b4d8866fedb3f287d978a135616afb3
1.8.x: commit cdd3e2924596eecf605eeb73e9c57f23f6643936
1.7.x: commit 819b9d8345e701321067f3b14ad2bb78b60d285c
1.6.x: commit b0a57116c6794f5d0036ed9c3668f27f29155bd7
1.5.x: commit 4c72e6098fdf2b38e79954c03385bb0ecacbb489
1.4.x: commit 8f0d3d0857b13924f379b97b9fe4b229f2d5d301

Author: Qian Zhang <zhq527...@gmail.com>
Date:   Fri May 15 10:23:51 2020 +0800

    Erased `Info` struct before unmounting volumes in Docker volume isolator.

    Currently, when `DockerVolumeIsolatorProcess::cleanup()` is called, we
    unmount the volume first, and if the unmount operation fails we do NOT
    erase the container's `Info` struct from `infos`. This is problematic
    because the remaining `Info` in `infos` keeps the reference count of the
    volume greater than 0 even though the volume is not being used by any
    container. That means we may never get a chance to unmount this volume
    on this agent. Furthermore, if it is an EBS volume, it cannot be used by
    any tasks launched on any other agent, since an EBS volume can only be
    attached to one node at a time. The only workaround is to unmount the
    volume manually.

    So in this patch `DockerVolumeIsolatorProcess::cleanup()` is updated to
    erase the container's `Info` struct before unmounting volumes.

    Review: [https://reviews.apache.org/r/72516]

> Docker volume isolator needs to clean up the `info` struct regardless of the
> result of the unmount operation
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-10126
>                 URL: https://issues.apache.org/jira/browse/MESOS-10126
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Qian Zhang
>            Assignee: Qian Zhang
>            Priority: Critical
>             Fix For: 1.4.4, 1.5.4, 1.6.3, 1.8.2, 1.9.1, 1.7.4, 1.11.0, 1.10.1
>
>
> Currently, when
> [DockerVolumeIsolatorProcess::cleanup()|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L610]
> is called, we unmount the volume first, but if the unmount operation fails
> we neither remove the container's checkpoint directory nor erase the
> container's `info` struct from `infos`. This is problematic because the
> remaining `info` in `infos` keeps the reference count of the volume greater
> than 0 even though the volume is not being used by any container. The next
> time another container using this volume is destroyed, we will NOT unmount
> the volume, since its reference count will be greater than 1 (it will be 2; see
> [here|https://github.com/apache/mesos/blob/1.9.0/src/slave/containerizer/mesos/isolators/docker/volume/isolator.cpp#L631:L651]
> for details), so we will never have a chance to unmount this volume.
> This issue has existed since the Mesos 1.0.0 release, when the Docker volume
> isolator was introduced.
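
The reference-count problem described above can be illustrated with a small standalone model. This is only a sketch and not the Mesos code: `IsolatorModel`, `cleanupUnmountFirst`, `cleanupEraseFirst`, the `unmount` callback, and `unmountAttempts` are hypothetical names introduced for illustration, and the `> 0` check after erasing is a modeling assumption rather than a statement about the actual patch. Only `Info` and `infos` come from the issue text.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>

// Simplified stand-in for the isolator's per-container state.
struct Info {
  std::set<std::string> volumes;  // volumes mounted for this container
};

// Hypothetical model of the isolator: `infos` maps container ID -> Info.
struct IsolatorModel {
  std::map<std::string, Info> infos;

  // Stand-in for the real unmount call; returns false on failure.
  std::function<bool(const std::string&)> unmount;

  // Number of containers in `infos` that reference `volume`.
  int references(const std::string& volume) const {
    int count = 0;
    for (const auto& entry : infos) {
      count += static_cast<int>(entry.second.volumes.count(volume));
    }
    return count;
  }

  // Old ordering: unmount first, erase `Info` only if every unmount succeeded.
  bool cleanupUnmountFirst(const std::string& containerId) {
    auto it = infos.find(containerId);
    if (it == infos.end()) return true;
    for (const std::string& volume : it->second.volumes) {
      if (references(volume) > 1) continue;  // looks shared, keep it mounted
      if (!unmount(volume)) return false;    // stale `Info` is left behind
    }
    infos.erase(it);
    return true;
  }

  // Patched ordering: erase `Info` first, then unmount unreferenced volumes.
  bool cleanupEraseFirst(const std::string& containerId) {
    auto it = infos.find(containerId);
    if (it == infos.end()) return true;
    const std::set<std::string> volumes = it->second.volumes;  // copy
    infos.erase(it);
    bool ok = true;
    for (const std::string& volume : volumes) {
      if (references(volume) > 0) continue;  // still used by another container
      ok = unmount(volume) && ok;
    }
    return ok;
  }
};

// Scenario from the issue: the unmount fails while container c1 is cleaned up,
// then c2 (same volume) is cleaned up with a working unmount. Returns the total
// number of unmount attempts.
int unmountAttempts(bool eraseFirst) {
  int attempts = 0;
  IsolatorModel model;
  model.unmount = [&attempts](const std::string&) {
    return ++attempts > 1;  // first attempt fails, later ones succeed
  };

  auto cleanup = [&](const std::string& id) {
    return eraseFirst ? model.cleanupEraseFirst(id)
                      : model.cleanupUnmountFirst(id);
  };

  model.infos["c1"] = Info{{"vol"}};
  cleanup("c1");
  model.infos["c2"] = Info{{"vol"}};
  cleanup("c2");
  return attempts;
}

void checkScenario() {
  // Old ordering: c1's stale entry makes the count 2, so c2 never retries.
  assert(unmountAttempts(false) == 1);
  // Patched ordering: the unmount is retried when c2 is cleaned up.
  assert(unmountAttempts(true) == 2);
}
```

In this model the old ordering leaves the stale `c1` entry in `infos`, so the volume looks shared when `c2` is destroyed and the unmount is never attempted again. With the entry erased first, the next cleanup sees no remaining references and retries the unmount.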