[jira] [Assigned] (MESOS-10118) Agent incorrectly handles draining when empty
[ https://issues.apache.org/jira/browse/MESOS-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann reassigned MESOS-10118:
---------------------------------

    Assignee: Greg Mann

> Agent incorrectly handles draining when empty
> ----------------------------------------------
>
>                 Key: MESOS-10118
>                 URL: https://issues.apache.org/jira/browse/MESOS-10118
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.9.0
>            Reporter: Greg Mann
>            Assignee: Greg Mann
>            Priority: Major
>
> When the agent receives a {{DrainSlaveMessage}} and does not have any tasks
> or operations, it writes the {{DrainConfig}} to disk and is then implicitly
> stuck in a "draining" state indefinitely. For example, if an agent
> reregistration is triggered at such a time, the master may think the agent is
> operating normally and send a task to it, at which point the task will fail
> because the agent thinks it's draining (see this test for an example:
> https://reviews.apache.org/r/72364/).
> If the agent receives a {{DrainSlaveMessage}} when it has no tasks or
> operations, it should avoid writing any {{DrainConfig}} to disk so that it
> immediately "transitions" into the already-drained state.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Created] (MESOS-10118) Agent incorrectly handles draining when empty
Greg Mann created MESOS-10118:
----------------------------------

             Summary: Agent incorrectly handles draining when empty
                 Key: MESOS-10118
                 URL: https://issues.apache.org/jira/browse/MESOS-10118
             Project: Mesos
          Issue Type: Bug
          Components: agent
    Affects Versions: 1.9.0
            Reporter: Greg Mann


When the agent receives a {{DrainSlaveMessage}} and does not have any tasks
or operations, it writes the {{DrainConfig}} to disk and is then implicitly
stuck in a "draining" state indefinitely. For example, if an agent
reregistration is triggered at such a time, the master may think the agent is
operating normally and send a task to it, at which point the task will fail
because the agent thinks it's draining (see this test for an example:
https://reviews.apache.org/r/72364/).

If the agent receives a {{DrainSlaveMessage}} when it has no tasks or
operations, it should avoid writing any {{DrainConfig}} to disk so that it
immediately "transitions" into the already-drained state.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
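To make the proposed behavior concrete, the handling could look roughly like the sketch below. This is only an illustration of the change described above, not the patch under review at https://reviews.apache.org/r/72364/; the helper names ({{hasTasksOrOperations()}}, {{checkpointDrainConfig()}}, {{markDrained()}}) and the message accessor are hypothetical stand-ins for the agent's actual bookkeeping.

{code:cpp}
// Hypothetical sketch of the desired DrainSlaveMessage handling; all helper
// names below are illustrative, not the real Mesos agent API.
void Slave::drain(const DrainSlaveMessage& message)
{
  // If the agent has no tasks or operations, there is nothing to drain:
  // skip checkpointing the DrainConfig entirely so the agent immediately
  // "transitions" into the already-drained state instead of being stuck
  // in a "draining" state indefinitely.
  if (!hasTasksOrOperations()) {
    markDrained();
    return;
  }

  // Otherwise, checkpoint the DrainConfig so draining survives an agent
  // restart, then begin killing tasks and draining operations.
  checkpointDrainConfig(message.config());  // accessor name assumed
  killAllTasksAndDrainOperations();
}
{code}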
[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084136#comment-17084136 ]

Andrei Budnik commented on MESOS-10107:
---------------------------------------

{code:java}
commit af3ca189aced5fbc537bfca571264142d4cd37b3
Author: Charles-Francois Natali
Date:   Wed Apr 1 13:40:16 2020 +0100

    Handled EBUSY when destroying a cgroup.

    It's a workaround for kernel bugs which can cause `rmdir` to fail with
    `EBUSY` even though the cgroup appears empty. See for example
    https://lkml.org/lkml/2020/1/15/1349

    This closes #355
{code}

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
>                 Key: MESOS-10107
>                 URL: https://issues.apache.org/jira/browse/MESOS-10107
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Charles N
>            Priority: Major
>              Labels: cgroups, containerization
>             Fix For: 1.10.0
>
>         Attachments: mesos-remove-cgroup-race.diff, reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container
> cgroup isn't properly destroyed after the OOM killer kicked in when the
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS:
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code,
> but an strace confirmed that the cgroup directory was only deleted once all
> tasks had exited (edited and commented strace below from a different
> instance of the same problem):
> {noformat}
> # get the list of processes
> 3431 23:01:32.293608 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY
> 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431 23:01:32.294220 read(18,
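For reference, the workaround in the commit quoted above boils down to retrying the removal when the kernel spuriously reports {{EBUSY}} for a cgroup that already looks empty. The snippet below is a minimal standalone sketch of that idea, not the actual Mesos cgroups code; the retry count and backoff are arbitrary values chosen for illustration.

{code:cpp}
#include <cerrno>
#include <chrono>
#include <cstring>
#include <iostream>
#include <string>
#include <thread>

#include <unistd.h>

// Illustrative retry loop: buggy kernels can transiently return EBUSY from
// rmdir(2) on a cgroup directory that already appears empty (see
// https://lkml.org/lkml/2020/1/15/1349), so retry a few times before
// giving up. Attempt count and backoff are arbitrary for this sketch.
bool removeCgroupDir(const std::string& path)
{
  constexpr int kMaxAttempts = 5;

  for (int attempt = 1; attempt <= kMaxAttempts; ++attempt) {
    if (::rmdir(path.c_str()) == 0) {
      return true;
    }

    if (errno != EBUSY) {
      std::cerr << "rmdir failed: " << ::strerror(errno) << std::endl;
      return false;
    }

    // EBUSY on a seemingly empty cgroup: back off briefly and retry.
    std::this_thread::sleep_for(std::chrono::milliseconds(10 * attempt));
  }

  return false;
}
{code}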
[jira] [Assigned] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrei Budnik reassigned MESOS-10107:
-------------------------------------

    Assignee: Charles Natali

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
>                 Key: MESOS-10107
>                 URL: https://issues.apache.org/jira/browse/MESOS-10107
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Charles N
>            Assignee: Charles Natali
>            Priority: Major
>              Labels: cgroups, containerization
>             Fix For: 1.10.0
>
>         Attachments: mesos-remove-cgroup-race.diff, reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container
> cgroup isn't properly destroyed after the OOM killer kicked in when the
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS:
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code,
> but an strace confirmed that the cgroup directory was only deleted once all
> tasks had exited (edited and commented strace below from a different
> instance of the same problem):
> {noformat}
> # get the list of processes
> 3431 23:01:32.293608 openat(AT_FDCWD, "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", O_RDONLY
> 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431 23:01:32.294220 read(18,
> 3431 23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) = 15 <0.33>
> 3431 23:01:32.294306 read(18, "", 4096) = 0 <0.13>
> 3431 23:01:32.294346 close(18
> 3431 23:01:32.294402 <... close resumed> ) = 0 <0.45>
> #kill them
> 3431 23:01:32.296266 kill(5878, SIGKILL) = 0 <0.19>
> 3431 23:01:32.296384 kill(6036, SIGKILL
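The attached reproduce-cgroup-rmdir-race.py presumably exercises the same kill-then-rmdir sequence that is visible in the strace above. As a purely hypothetical illustration of that sequence (this is not the attached script; it assumes a cgroup v1 freezer hierarchy mounted at the usual path and must be run as root), the idea might look like this:

{code:cpp}
// Hypothetical reproducer sketch for the rmdir/EBUSY behaviour described in
// this ticket: repeatedly create a freezer cgroup, move a child process into
// it, SIGKILL and reap the child, then rmdir the cgroup as soon as it looks
// empty. On affected kernels the rmdir can still fail with EBUSY.
#include <cerrno>
#include <cstdio>
#include <cstring>
#include <fstream>
#include <string>

#include <signal.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  const std::string cgroup = "/sys/fs/cgroup/freezer/ebusy-test";

  for (int i = 0; ; ++i) {
    if (::mkdir(cgroup.c_str(), 0755) != 0 && errno != EEXIST) {
      std::perror("mkdir");
      return 1;
    }

    pid_t pid = ::fork();
    if (pid == 0) {
      ::pause();    // Child: just wait to be killed.
      ::_exit(0);
    }

    {
      // Move the child into the cgroup, mirroring what the launcher does
      // via cgroup.procs in the strace above.
      std::ofstream procs(cgroup + "/cgroup.procs");
      procs << pid << std::endl;
    }

    ::kill(pid, SIGKILL);
    ::waitpid(pid, nullptr, 0);

    // The cgroup should now be empty, yet buggy kernels may return EBUSY.
    if (::rmdir(cgroup.c_str()) != 0) {
      const int error = errno;
      std::printf("iteration %d: rmdir failed: %s\n", i, std::strerror(error));
      return error == EBUSY ? 0 : 1;
    }
  }
}
{code}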
[jira] [Created] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics`
Qian Zhang created MESOS-10117:
----------------------------------

             Summary: Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics`
                 Key: MESOS-10117
                 URL: https://issues.apache.org/jira/browse/MESOS-10117
             Project: Mesos
          Issue Type: Task
          Components: containerization
            Reporter: Qian Zhang
            Assignee: Qian Zhang


In the `ResourceStatistics` protobuf message, there are a couple of issues:
# There are already `cpu_limit` and `mem_limit_bytes` fields, but they actually hold the CPU and memory requests when resource limits are specified for a task.
# There is already a `mem_soft_limit_bytes` field, but it does not seem to be set anywhere.

So we need to update this protobuf message and also the related containerizer code which sets the fields of this protobuf message.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
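As a rough illustration of the direction only: the ticket does not yet specify the final protobuf changes, so the sketch below simply populates the existing fields with the container's actual limits, the way `usage()` is asked to report them. The helper function, its parameters, and the include path are assumptions for this sketch, not the final Mesos API.

{code:cpp}
#include <cstdint>

#include <mesos/mesos.pb.h>  // ResourceStatistics (assumed include path)

// Sketch only: fill usage statistics with the container's *limits*, which is
// what this ticket asks `usage()` to report. The field layout may change
// once the protobuf update is finalized.
mesos::ResourceStatistics makeStatistics(
    double timestamp,           // sample time, in seconds since the epoch
    double cpuLimit,            // CPU limit for the container
    uint64_t memLimitBytes,     // hard memory limit, in bytes
    uint64_t memSoftLimitBytes) // soft memory limit, in bytes
{
  mesos::ResourceStatistics statistics;
  statistics.set_timestamp(timestamp);

  // Today these two fields effectively carry the *requests* when separate
  // limits are specified for a task; after the change they would carry the
  // actual limits (with requests reported elsewhere).
  statistics.set_cpu_limit(cpuLimit);
  statistics.set_mem_limit_bytes(memLimitBytes);

  // This field already exists in the message but is currently never set.
  statistics.set_mem_soft_limit_bytes(memSoftLimitBytes);

  return statistics;
}
{code}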