[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16466073#comment-16466073 ]

James Peach commented on MESOS-6575:

{noformat}
commit 081c3114fefa18c6acd1e884e6d6583232e30d5c
Author: Harold Dost
Date:   Mon May 7 08:39:29 2018 -0700

    Documented the `--xfs-kill-containers` flag.

    Added a description of the `--xfs-kill-containers` flag to the
    `disk/xfs` isolator page and listed it in the upgrade documentation.

    Review: https://reviews.apache.org/r/66975/
{noformat}

> Change `disk/xfs` isolator to terminate executor when it exceeds quota
> --
>
> Key: MESOS-6575
> URL: https://issues.apache.org/jira/browse/MESOS-6575
> Project: Mesos
> Issue Type: Task
> Components: agent, containerization
> Reporter: Santhosh Kumar Shanmugham
> Assignee: James Peach
> Priority: Major
> Fix For: 1.6.0
>
> Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf
> when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on
> XFS's internal quota enforcement, silently fails the {{write}} operation
> that causes the quota limit to be exceeded, without surfacing the quota
> breach information.
> This task is to change the `disk/xfs` isolator so that a
> {{ContainerLimitation}} message is triggered when the quota is exceeded.
> This feature will rely on the underlying filesystem being mounted with
> {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause
> an {{EDQUOT}} error on writes that cause the quota to be exceeded. The
> isolator can then track the disk quota via {{xfs_quota}}, very much like
> {{disk/du}} does using {{du}}, every {{container_disk_watch_interval}}, and
> surface the quota-exceeded event via a {{ContainerLimitation}} protobuf,
> causing the executor to be terminated. This feature can then be turned on/off
> via the existing {{enforce_container_disk_quota}} option.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
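The check described in the issue (compare accounted usage against the quota on every watch interval and raise a limitation on a breach) can be sketched roughly as follows. This is only an illustration in Python, not the isolator's actual C++ code: `Limitation` is a hypothetical stand-in for the {{ContainerLimitation}} protobuf, and the usage figure is assumed to have been read from XFS project quota accounting by some other means.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Limitation:
    """Hypothetical stand-in for the ContainerLimitation protobuf."""
    container_id: str
    message: str


def check_disk_usage(container_id: str,
                     used_bytes: int,
                     quota_bytes: int) -> Optional[Limitation]:
    """Return a Limitation if accounted usage exceeds the quota, else None.

    Assumes the filesystem is in accounting-only mode, so usage can
    actually grow past the quota and be observed on the next poll.
    """
    if used_bytes > quota_bytes:
        return Limitation(
            container_id,
            f"Disk usage ({used_bytes} B) exceeds quota ({quota_bytes} B)",
        )
    return None
```

A poller would call `check_disk_usage` once per watch interval for each container and terminate the executor whenever a `Limitation` comes back.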
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16459045#comment-16459045 ]

James Peach commented on MESOS-6575:

| [/r/66173|https://reviews.apache.org/r/66173/] | Added test for `disk/xfs` container limitation. |
| [/r/66001|https://reviews.apache.org/r/66001/] | Added soft limit and kill to `disk/xfs`. |
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393227#comment-16393227 ]

Harold Dost III commented on MESOS-6575:

Take a look at my review, and let me know what you think.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16393177#comment-16393177 ]

James Peach commented on MESOS-6575:

{quote}
I guess I don't understand the opposition to having the soft limit, as in the current implementation the soft limit is being set, but it happens to be set to the exact same amount as the hard limit. The advantage of the soft limit is that we don't have to keep track of how long something has been over the soft limit; we perform the system call, which provides us a time when the grace period is over, and once that occurs we can kill the application.
{quote}

My reasoning is that it doesn't matter how long the task has exceeded the allocated limit. The `disk/du` isolator doesn't wait for you to be over the quota for any length of time: the task is terminated as soon as the violation is detected. It's certainly possible to set a different soft limit, but I can't see how it helps. The isolator still needs to poll on an interval and verify the used space.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392696#comment-16392696 ]

Harold Dost III commented on MESOS-6575:

One thing to mention is that we are potentially looking at having a percentage slop/offset in addition to bytes. Bytes would override percentage, and both would be set as startup options.
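The proposed precedence (an explicit byte value wins over a percentage) might look something like the sketch below. The function and parameter names are hypothetical and do not correspond to any existing Mesos flag; the only behavior taken from the comment is that bytes override percentage.

```python
from typing import Optional


def resolve_slop_bytes(disk_bytes: int,
                       slop_bytes: Optional[int] = None,
                       slop_percent: Optional[float] = None) -> int:
    """Pick the slop to add on top of a task's disk resource.

    An explicit byte value overrides a percentage; with neither set,
    no slop is applied. Both options are hypothetical startup settings.
    """
    if slop_bytes is not None:
        return slop_bytes
    if slop_percent is not None:
        return int(disk_bytes * slop_percent / 100)
    return 0
```

For a 1000-byte disk resource, a 10 percent slop alone gives 100 bytes, while setting both 50 bytes and 10 percent gives 50 bytes.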
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16392538#comment-16392538 ]

Harold Dost III commented on MESOS-6575:

The issue with that is that an app isn't guaranteed to be able to fill the exact limit specified, leaving it hovering slightly short of the desired amount of space.

{quote}Thinking about this some more, I'm not sure that we need to do anything with soft limits at all. Let's assume that we implement this for task sandboxes by applying a hard limit that is "disk_resource + some_constant_slop". We still need to have the isolator periodically check the usage in order to raise the limitation, so it doesn't really matter whether we have a soft limit. All we really need to do is check the current usage against the resource limit.{quote}

I guess I don't understand the opposition to having the soft limit, as in the current implementation the soft limit is being set, but it happens to be set to the exact same amount as the hard limit. The advantage of the soft limit is that we don't have to keep track of how long something has been over the soft limit; we perform the system call, which provides us a time when the grace period is over, and once that occurs we can kill the application.
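The grace-period approach argued for here could be sketched as the decision below. It is a simplified model, not the XFS quota API: the grace deadline is represented as a plain timestamp that is assumed to come from the quota query, and every name is hypothetical.

```python
from typing import Optional


def should_kill(used_bytes: int,
                soft_limit_bytes: int,
                grace_deadline: Optional[float],
                now: float) -> bool:
    """Decide whether to kill under a soft limit with a grace period.

    grace_deadline is assumed to be the time at which the filesystem's
    soft-limit timer expires, or None if no timer is running. The
    isolator does not track how long the task has been over the limit;
    it only compares the reported deadline with the current time.
    """
    if used_bytes <= soft_limit_bytes:
        return False
    return grace_deadline is not None and now >= grace_deadline
```

Under the alternative argued in the reply (terminate as soon as the violation is detected, as `disk/du` does), the deadline comparison drops out and the check reduces to `used_bytes > limit`.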
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391804#comment-16391804 ]

James Peach commented on MESOS-6575:

> James Peach Would you be able to act as the shepherd for getting this patch
> in?

Yes, I can shepherd. However, I don't think that setting the soft limit is the right approach. I can't see a scenario where it is actually needed. If the isolator needs to poll (and it almost certainly does), then all it needs to do is compare the actual disk usage against the allocated disk resource.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391767#comment-16391767 ]

Harold Dost III commented on MESOS-6575:

[~jamespeach] Would you be able to act as the shepherd for getting this patch in?
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391765#comment-16391765 ]

Harold Dost III commented on MESOS-6575:

Design Doc: https://docs.google.com/document/d/17ElrKtBX7ek7ZHPzBndVIJqdlmsv8Mu1U1sVvfLX4gA/edit?ts=5aa17e84
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383144#comment-16383144 ]

Harold Dost III commented on MESOS-6575:

{quote}This is because the XFS isolator doesn't support path volumes so there's no need to track any paths. {quote}

That's a good point, but the part that is missing is how we would add the container limitation if we don't have a resource to bind it to.

{quote}Thinking about this some more, I'm not sure that we need to do anything with soft limits at all. Let's assume that we implement this for task sandboxes by applying a hard limit that is "disk_resource + some_constant_slop". {quote}

{{xfs_use_disk_reservation_as_soft_limit}} becomes useful because when you set a soft limit the isolator doesn't need to worry about raising the limit. The actual problem with hard limits is not when the capacity is actually met; it is when usage falls short by some amount that varies from task to task. The advantage would be that when a soft limit is violated, the project has the amount of time in the XFS project timer to come back into range, or it will get the container limitation and therefore be killed.

{quote}We still need to have the isolator periodically check the usage in order to raise the limitation, so it doesn't really matter whether we have a soft limit. All we really need to do is check the current usage against the resource limit.{quote}

The problem with having the isolator raise the limit itself is the potential for a runaway effect. To make it useful, it seems like you're also going to need additional tuning parameters such as a backoff, a percentage or number of blocks raised per increase, a limit on the number of increases, and possibly a mechanism to reduce the limit. To be honest, though, I don't know how much I am even behind the idea of diff_bytes as a concept, and I would much rather have apps be explicit.

The flag {{xfs_use_disk_reservation_as_soft_limit}}, plus having per-task soft limits available, should be enough without adding too much complexity.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382948#comment-16382948 ]

James Peach commented on MESOS-6575:

{quote}
When the resource is updated in the xfs handler they are not tracked, but instead are added up.
{quote}

This is because the XFS isolator doesn't support path volumes, so there's no need to track any paths. It might be interesting to refactor toward a unified way of tracking disk resources, as a prerequisite to any other XFS changes, but AFAICT that's not actually required here.

{quote}
The idea behind the "diff_bytes" would be that you'd take the hard limit of any given task and subtract that amount of bytes to create a soft_limit below the hard limit.
{quote}

Thinking about this some more, I'm not sure that we need to do anything with soft limits at all. Let's assume that we implement this for task sandboxes by applying a hard limit that is "disk_resource + some_constant_slop". We still need to have the isolator periodically check the usage in order to raise the limitation, so it doesn't really matter whether we have a soft limit. All we really need to do is check the current usage against the resource limit.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16382646#comment-16382646 ]

Harold Dost III commented on MESOS-6575:

One other thing I noticed while viewing the source for how {{disk/du}} and {{disk/xfs}} handle disk resources: when the resources are updated in the xfs handler, they are not tracked individually, but instead are added up. With this being the case, there's no way to set a limitation on a disk resource [because of this function|https://github.com/apache/mesos/blob/32f6d4eec2724414e217875f4f7d3b2538db5381/src/slave/containerizer/mesos/isolators/xfs/disk.cpp#L70]. The reasoning behind doing it this way may have made sense, but the logic is lost in translation. My thought would be to track it similarly to how {{disk/du}} does.
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16381694#comment-16381694 ]

Harold Dost III commented on MESOS-6575:

[~jamespeach] While looking at this ticket, I don't know if we'd want to break this down into multiple tickets, but here are my thoughts.

At the flag level, we could provide two settings:
- {{xfs_use_disk_reservation_as_soft_limit}} - true/false (default: false). This would simply turn the reserved space into a soft limit instead of a hard limit, which leads us to the next flag.
- {{xfs_kill_on_soft_limit_violation}} - true/false (default: false). This way it can be configured at a global level so that once the grace period is over (configured by sysadmins with {{xfs_quota}}) the task is killed.

With all of that being said, on a resource level, we could have two parameters:
- {{soft_disk_limit}} - This would override the flag {{xfs_use_disk_reservation_as_soft_limit}}, such that if a soft limit is specified it provides exactly whatever space is desired for both.
- {{kill_on_soft_limit_violation}} - This would override the global flag {{xfs_kill_on_soft_limit_violation}} on a per-task basis.

Optionally, I was thinking that we could introduce another flag (not to make it even more complicated) which would be a default offset for soft limits. Something like {{xfs_kill_soft_quota_diff_bytes}}, which would be used to provide a global soft limit. This would also be overridden by {{soft_disk_limit}}, and would be ignored if {{xfs_use_disk_reservation_as_soft_limit}} is set. The idea behind the "diff_bytes" would be that you'd take the hard limit of any given task and subtract that amount of bytes to create a soft_limit below the hard limit.
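The override order proposed in this comment can be written down as a small resolution function. This is a sketch of the proposal as worded, not of anything that was merged: the parameter names mirror the proposed flags and per-resource settings, and returning `None` for "no soft limit" is an assumption made here.

```python
from typing import Optional


def resolve_soft_limit(disk_bytes: int,
                       soft_disk_limit: Optional[int] = None,
                       use_reservation_as_soft_limit: bool = False,
                       diff_bytes: Optional[int] = None) -> Optional[int]:
    """Resolve the soft limit for one task under the proposed settings.

    Precedence: the per-resource soft_disk_limit, then the
    reservation-as-soft-limit flag, then the global diff_bytes offset
    below the hard limit. None means no soft limit is configured.
    """
    if soft_disk_limit is not None:
        return soft_disk_limit
    if use_reservation_as_soft_limit:
        return disk_bytes
    if diff_bytes is not None:
        return max(disk_bytes - diff_bytes, 0)
    return None


def resolve_kill(agent_default: bool, task_override: Optional[bool]) -> bool:
    """The per-task kill setting, if present, overrides the agent-wide flag."""
    return agent_default if task_override is None else task_override
```

For example, with a 1000-byte reservation and `diff_bytes=100` the soft limit resolves to 900 bytes, unless the task sets `soft_disk_limit` or the agent enables the reservation-as-soft-limit flag.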
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329711#comment-16329711 ]

James Peach commented on MESOS-6575:

Yeah, I think that using the soft limit is a pretty good idea. We can set the soft limit to the disk resource and the hard limit to the resource plus a fudge factor. We can kill applications based on either directly observing soft limit breaches, or the quota warnings (need to check whether XFS will reset them if the task goes back under the soft limit).
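As a rough illustration of this idea, the two limits could be derived from the disk resource as shown below. The names `QuotaLimits`, `limits_for`, and `fudge_bytes` are hypothetical, and the size of the fudge factor is left open in the comment.

```python
from typing import NamedTuple


class QuotaLimits(NamedTuple):
    soft_bytes: int
    hard_bytes: int


def limits_for(disk_bytes: int, fudge_bytes: int) -> QuotaLimits:
    """Soft limit equals the disk resource; hard limit adds a fudge factor."""
    return QuotaLimits(soft_bytes=disk_bytes,
                       hard_bytes=disk_bytes + fudge_bytes)


def over_soft_limit(used_bytes: int, limits: QuotaLimits) -> bool:
    """True once usage has passed the soft limit, i.e. the task is killable."""
    return used_bytes > limits.soft_bytes
```

The gap between the two limits gives the poller room to observe a soft-limit breach before writes start failing at the hard limit.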
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255758#comment-16255758 ] Pierre Cheynier commented on MESOS-6575: We may also be interested in this feature. XFS offers real enforcement, and that is what's nice about it (it prevents someone from using fallocate to claim the whole disk). But a lot of applications are not written to handle EDQUOT correctly (think of what happens in a non-containerized environment), or cannot react preventively because they are not directly aware of what's happening (a companion process is filling up the disk by writing logs, etc.). So it's better to actually kill the task, as the oom-killer does when using {{cgroups/memory}}. Our feeling is that we could leverage the XFS soft limit, and possibly the timer, to introduce more modularity:
* enforcement would have to be enabled at the agent level (probably by reusing {{enforce_container_disk_quota}} as suggested here)
* the soft limit would be customizable (e.g. soft limit = hard limit - 2%)
* a collector would watch for the container reaching the soft limit and then kill the container, as cgroups/mem does indirectly by relying on the Linux oom-killer (and as disk/du does for disk usage).
What do you think? > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: agent, containerization > Reporter: Santhosh Kumar Shanmugham > > Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation > that causes the quota limit to be exceeded, without surfacing the quota > breach information. 
> This task is to change the `disk/xfs` isolator so that a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. The > isolator can then track the disk quota via {{xfs_quota}}, much as > {{disk/du}} does using {{du}}, every {{container_disk_watch_interval}} and surface > the quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
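The collector proposed in the comment above could look roughly like the following sketch. Everything here is an assumption made for illustration: `soft_limit_from_hard`, `watch_container`, and the `read_usage` callback are hypothetical names, the 2% margin is just the example figure from the comment, and a real collector would sleep for a watch interval between polls and read usage from XFS quota accounting rather than from a callback.

```python
from typing import Callable, Optional


def soft_limit_from_hard(hard_bytes: int, margin_percent: int = 2) -> int:
    """Derive the soft limit as the hard limit minus a configurable margin."""
    return hard_bytes - hard_bytes * margin_percent // 100


def watch_container(read_usage: Callable[[], int],
                    hard_bytes: int,
                    max_polls: int) -> Optional[int]:
    """Poll usage up to max_polls times.

    Returns the index of the first poll at which usage has reached the
    soft limit (the point where the container would be killed), or None
    if the limit was never reached.
    """
    soft_bytes = soft_limit_from_hard(hard_bytes)
    for poll in range(max_polls):
        # A real collector would wait for the watch interval here.
        if read_usage() >= soft_bytes:
            return poll
    return None


if __name__ == "__main__":
    samples = iter([100, 500, 980, 990])
    print(watch_container(lambda: next(samples), hard_bytes=1000, max_polls=4))
```

Because the check happens only at poll time, a container can overshoot the soft limit between polls, which is why the hard limit still matters as a backstop.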
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15671596#comment-15671596 ] Santhosh Kumar Shanmugham commented on MESOS-6575: If the task inside the container is not able to make any progress because it has exhausted its disk quota, the user is probably going to kill it and restart it with a different configuration. We can also argue that not killing the task makes it harder for the user to detect tasks that become unhealthy after exhausting their disk quota, and potentially requires changes to metrics and alarms. We ran into a situation where the container exhausted its disk quota and went into an unhealthy state, where even log message writes were failing due to lack of quota. The {{disk/xfs}} isolator's current behavior would make more sense if the container were resizable. > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: isolation, slave > Reporter: Santhosh Kumar Shanmugham > > Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. 
The > isolator can then track the disk quota via {{xfs_quota}}, much as > {{disk/du}} does using {{du}}, every {{container_disk_watch_interval}} and surface > the quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6575) Change `disk/xfs` isolator to terminate executor when it exceeds quota
[ https://issues.apache.org/jira/browse/MESOS-6575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15668934#comment-15668934 ] James Peach commented on MESOS-6575: A significant benefit of the {{disk/xfs}} isolator is that it doesn't kill the task, so I'm not very supportive of this. I suppose that it could be implemented as an additional feature flag, but I'm not sure why you would want this. IMHO the behavior of the {{disk/du}} isolator is pretty undesirable. > Change `disk/xfs` isolator to terminate executor when it exceeds quota > -- > > Key: MESOS-6575 > URL: https://issues.apache.org/jira/browse/MESOS-6575 > Project: Mesos > Issue Type: Task > Components: isolation, slave > Reporter: Santhosh Kumar Shanmugham > > Unlike the {{disk/du}} isolator, which sends a {{ContainerLimitation}} protobuf > when the executor exceeds the quota, the {{disk/xfs}} isolator, which relies on > XFS's internal quota enforcement, silently fails the {{write}} operation > that causes the quota limit to be exceeded, without surfacing the quota > breach information. > This task is to change the `disk/xfs` isolator so that a > {{ContainerLimitation}} message is triggered when the quota is exceeded. > This feature will rely on the underlying filesystem being mounted with > {{pqnoenforce}} (accounting-only mode), so that XFS does not silently cause > an {{EDQUOT}} error on writes that cause the quota to be exceeded. The > isolator can then track the disk quota via {{xfs_quota}}, much as > {{disk/du}} does using {{du}}, every {{container_disk_watch_interval}} and surface > the quota-exceeded event via a {{ContainerLimitation}} protobuf, > causing the executor to be terminated. This feature can then be turned on/off > via the existing {{enforce_container_disk_quota}} option. -- This message was sent by Atlassian JIRA (v6.3.4#6332)