[jira] [Assigned] (MESOS-10118) Agent incorrectly handles draining when empty

2020-04-15 Thread Greg Mann (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-10118:
-

Assignee: Greg Mann

> Agent incorrectly handles draining when empty
> -
>
> Key: MESOS-10118
> URL: https://issues.apache.org/jira/browse/MESOS-10118
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>
> When the agent receives a {{DrainSlaveMessage}} while it has no tasks or 
> operations, it still writes the {{DrainConfig}} to disk and then remains 
> stuck in a "draining" state indefinitely. For example, if an agent 
> reregistration is triggered at that point, the master may believe the agent 
> is operating normally and send it a task, which will then fail because the 
> agent considers itself to be draining (see this test for an example: 
> https://reviews.apache.org/r/72364/).
> If the agent receives a {{DrainSlaveMessage}} when it has no tasks or 
> operations, it should avoid writing any {{DrainConfig}} to disk so that it 
> immediately "transitions" into the already-drained state.





[jira] [Created] (MESOS-10118) Agent incorrectly handles draining when empty

2020-04-15 Thread Greg Mann (Jira)
Greg Mann created MESOS-10118:
-

 Summary: Agent incorrectly handles draining when empty
 Key: MESOS-10118
 URL: https://issues.apache.org/jira/browse/MESOS-10118
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.9.0
Reporter: Greg Mann


When the agent receives a {{DrainSlaveMessage}} while it has no tasks or 
operations, it still writes the {{DrainConfig}} to disk and then remains stuck 
in a "draining" state indefinitely. For example, if an agent reregistration is 
triggered at that point, the master may believe the agent is operating 
normally and send it a task, which will then fail because the agent considers 
itself to be draining (see this test for an example: 
https://reviews.apache.org/r/72364/).

If the agent receives a {{DrainSlaveMessage}} when it has no tasks or 
operations, it should avoid writing any {{DrainConfig}} to disk so that it 
immediately "transitions" into the already-drained state.





[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-15 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17084136#comment-17084136
 ] 

Andrei Budnik commented on MESOS-10107:
---

{code:java}
commit af3ca189aced5fbc537bfca571264142d4cd37b3
Author: Charles-Francois Natali 
Date:   Wed Apr 1 13:40:16 2020 +0100

Handled EBUSY when destroying a cgroup.

It's a workaround for kernel bugs which can cause `rmdir` to fail with
`EBUSY` even though the cgroup - appears - empty.
See for example https://lkml.org/lkml/2020/1/15/1349

This closes #355
{code}
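
The general shape of such a workaround is to retry the removal when {{rmdir}} 
fails with {{EBUSY}}; the following is a minimal standalone sketch of that 
idea, not the code from the patch itself:

{code:cpp}
#include <cerrno>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

#include <unistd.h>

// Illustrative only: retry rmdir() a few times with a short backoff, because
// the kernel bug referenced above can report EBUSY transiently even though
// the cgroup appears empty. This is not the actual Mesos patch.
bool removeCgroupWithRetries(const std::string& path, int attempts = 5)
{
  for (int i = 0; i < attempts; ++i) {
    if (::rmdir(path.c_str()) == 0) {
      return true;  // Removed successfully.
    }

    if (errno != EBUSY) {
      std::perror("rmdir");
      return false;  // A different error; give up immediately.
    }

    // EBUSY: back off briefly and try again.
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
  }

  return false;  // Still busy after all attempts.
}

int main()
{
  // Hypothetical path; in Mesos this would be the container's freezer cgroup.
  removeCgroupWithRetries("/sys/fs/cgroup/freezer/mesos/example-container");
  return 0;
}
{code}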

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit is exceeded; see the analysis and patch below.
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  23:01:32.294220 read(18,  
> 

[jira] [Assigned] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-15 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-10107:
-

Assignee: Charles Natali

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Assignee: Charles Natali
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit is exceeded; see the analysis and patch below.
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  23:01:32.294220 read(18,  
> 3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
> 15 <0.33>
> 3431  23:01:32.294306 read(18, "", 4096) = 0 <0.13>
> 3431  23:01:32.294346 close(18 
> 3431  23:01:32.294402 <... close resumed> ) = 0 <0.45>
> #kill them
> 3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.19>
> 3431  23:01:32.296384 kill(6036, SIGKILL 
> 

[jira] [Created] (MESOS-10117) Update the `usage()` method of containerizer to set resource limits in `ResourceStatistics`

2020-04-15 Thread Qian Zhang (Jira)
Qian Zhang created MESOS-10117:
--

 Summary: Update the `usage()` method of containerizer to set 
resource limits in `ResourceStatistics`
 Key: MESOS-10117
 URL: https://issues.apache.org/jira/browse/MESOS-10117
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Qian Zhang
Assignee: Qian Zhang


In the `ResourceStatistics` protobuf message, there are a couple of issues:
 # There are already `cpu_limit` and `mem_limit_bytes` fields, but they 
actually carry the CPU and memory requests when resource limits are specified 
for a task.
 # There is already a `mem_soft_limit_bytes` field, but it does not appear to 
be set anywhere.

So we need to update this protobuf message and also the related containerizer 
code that sets its fields.
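
As a rough illustration of the kind of change involved, here is a sketch that 
uses a simplified stand-in struct for the protobuf message and hypothetical 
parameter names; it is not the real containerizer `usage()` signature:

{code:cpp}
#include <cstdint>
#include <iostream>

// Simplified stand-in for the ResourceStatistics fields named in this ticket;
// the real message is defined in the Mesos protobufs.
struct ResourceStatisticsSketch
{
  double cpu_limit = 0.0;             // Currently ends up holding the CPU request.
  uint64_t mem_limit_bytes = 0;       // Currently ends up holding the memory request.
  uint64_t mem_soft_limit_bytes = 0;  // Reportedly not set anywhere today.
};

// Hypothetical sketch of what a containerizer usage()-style method could do:
// report the configured hard/soft limits when the task specifies them, instead
// of only the requests. All names and parameters here are assumptions.
ResourceStatisticsSketch usage(
    double cpuRequest,
    uint64_t memRequestBytes,
    uint64_t memHardLimitBytes,   // 0 means "no hard limit specified".
    uint64_t memSoftLimitBytes)   // 0 means "no soft limit specified".
{
  ResourceStatisticsSketch stats;

  // Status quo described in the ticket: the "limit" fields carry the requests.
  stats.cpu_limit = cpuRequest;
  stats.mem_limit_bytes = memRequestBytes;

  // Sketch of the proposed direction: populate the actual limits when present.
  if (memHardLimitBytes > 0) {
    stats.mem_limit_bytes = memHardLimitBytes;
  }
  if (memSoftLimitBytes > 0) {
    stats.mem_soft_limit_bytes = memSoftLimitBytes;
  }

  return stats;
}

int main()
{
  ResourceStatisticsSketch stats =
    usage(2.0, 1ull << 30, 2ull << 30, 1ull << 30);

  std::cout << "mem_limit_bytes: " << stats.mem_limit_bytes << std::endl;
  std::cout << "mem_soft_limit_bytes: " << stats.mem_soft_limit_bytes << std::endl;
  return 0;
}
{code}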


