[jira] [Created] (MESOS-10158) Mesos Agent gets stuck in Draining due to pending unacknowledged status updates
Andrei Budnik created MESOS-10158: - Summary: Mesos Agent gets stuck in Draining due to pending unacknowledged status updates Key: MESOS-10158 URL: https://issues.apache.org/jira/browse/MESOS-10158 Project: Mesos Issue Type: Bug Components: master Reporter: Andrei Budnik A Mesos agent can get stuck in draining due to pending unacknowledged status updates. When a framework becomes disconnected, the agent keeps resending task status updates for that framework's terminated tasks. The agent then remains stuck in the DRAINING state, because the master transitions an agent from DRAINING to DRAINED only after all task status updates have been acknowledged. This can be worked around by sending the ["Teardown" operation|https://github.com/apache/mesos/blob/8ce5d30808f3744eeded09d530f226079d569a94/include/mesos/v1/master/master.proto#L299-L303] for every lost framework. However, it would be much better if the master handled this situation automatically. At the very least, we should make it easier for an operator to find out what is preventing the draining operation from completing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
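The workaround described above amounts to sending a TEARDOWN call to the master's v1 operator API for each lost framework. A minimal sketch (illustrative, not the author's tooling; the master address and framework ID are hypothetical placeholders):

```python
import json
import urllib.request

def teardown_call(framework_id):
    """Build the JSON body of a v1 master API TEARDOWN call."""
    return json.dumps({
        "type": "TEARDOWN",
        "teardown": {"framework_id": {"value": framework_id}},
    })

def send_teardown(master, framework_id):
    """POST the call to the master's /api/v1 endpoint.

    A real cluster will typically also require authentication credentials.
    """
    req = urllib.request.Request(
        "http://%s/api/v1" % master,
        data=teardown_call(framework_id).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)

# Hypothetical usage:
# send_teardown("master.example.com:5050", "4b08a5a2-04fa-4794-95eb-0001")
```

Once all lost frameworks are torn down, no unacknowledged updates remain and the agent can transition to DRAINED.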
[jira] [Assigned] (MESOS-7485) Add verbose logging for curl commands used in fetcher/puller
[ https://issues.apache.org/jira/browse/MESOS-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-7485: Assignee: Andrei Budnik > Add verbose logging for curl commands used in fetcher/puller > > > Key: MESOS-7485 > URL: https://issues.apache.org/jira/browse/MESOS-7485 > Project: Mesos > Issue Type: Bug >Reporter: Zhitao Li >Assignee: Andrei Budnik >Priority: Major > > Right now it's pretty hard to debug curl failures from the puller/fetcher: even > with verbose logging turned on, we only see that `curl` failed, with no additional > information. > We should at least log the URL we pass to curl. Ideally, we should also log > all other options except any auth headers (perhaps indicating which auth > header was used).
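The logging proposed in the ticket can be sketched as follows: print the full curl command line, but mask the values of auth-carrying headers so only the header name (i.e. which auth scheme was used) remains visible. The helper and the list of sensitive header names are illustrative, not Mesos code:

```python
def loggable_curl(argv):
    """Render a curl argv list as a log-safe string with auth values redacted."""
    out = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        out.append(arg)
        # `-H`/`--header` is followed by a "Name: value" argument;
        # redact the value of headers that may carry credentials.
        if arg in ("-H", "--header") and i + 1 < len(argv):
            name, _, _value = argv[i + 1].partition(":")
            if name.strip().lower() in ("authorization", "x-registry-auth"):
                out.append(name + ": <redacted>")
            else:
                out.append(argv[i + 1])
            i += 1
        i += 1
    return " ".join(out)
```

For example, a fetch with a bearer token would be logged with `Authorization: <redacted>` while the URL and all other options stay visible.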
[jira] [Comment Edited] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119576#comment-17119576 ] Andrei Budnik edited comment on MESOS-10131 at 5/29/20, 1:04 PM: - Please keep posting the error messages from agent crashes. Hopefully, we'll capture the part of `mountinfo` that contains the loop. I think it might be worth capturing the mount info right after the crash happens. We could check whether there are duplicate records, detect a loop, or find some other anomalies: `mount && cat /proc/1/mountinfo` && `cat /proc//mountinfo` was (Author: abudnik): Please keep posting error messages on agent crash. Hopefully, we'll capture a part of `mountinfo` containing the loop. I think it might be worth capturing mount info after the moment it happens. We could check then if there are duplicate records or even detect a loop or find some other anomalies. `mount && cat /proc/1/mountinfo` && `cat /proc//mountinfo` > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > Attachments: log.txt > > > Our mesos agent frequently dies with the following error in the slave logs: > > {code:java} > F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: > !visitedParents.contains(parentId) Cycle found in mount table hierarchy at > entry '1954': > 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs > rw,seclabel > 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw > 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs > rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755 > 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - > securityfs securityfs rw > 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs >
rw,seclabel > 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts > rw,seclabel,gid=5,mode=620,ptmxmode=000 > 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755 > 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs > ro,seclabel,mode=755 > 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 > - cgroup cgroup > rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd > 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - > pstore pstore rw > 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime > shared:21 - efivarfs efivarfs rw > 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime > shared:10 - cgroup cgroup rw,seclabel,perf_event > 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime > shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls > 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 > - cgroup cgroup rw,seclabel,cpuset > 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - > cgroup cgroup rw,seclabel,blkio > 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 > - cgroup cgroup rw,seclabel,freezer > 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 > - cgroup cgroup rw,seclabel,hugetlb > 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 > - cgroup cgroup rw,seclabel,devices > 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime > shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu > 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 > - cgroup cgroup rw,seclabel,memory > 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - > cgroup cgroup rw,seclabel,pids > 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw > 41 0 253:0 / / rw,relatime 
shared:1 - xfs /dev/mapper/vg_system-root > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw > 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs > systemd-1 > rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414 > 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw > 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel > 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs > rw,seclabel > 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119576#comment-17119576 ] Andrei Budnik commented on MESOS-10131: --- Please keep posting error messages on agent crash. Hopefully, we'll capture a part of `mountinfo` containing the loop. I think it might be worth capturing mount info after the moment it happens. We could check then if there are duplicate records or even detect a loop or find some other anomalies. `mount && cat /proc/1/mountinfo` && `cat /proc//mountinfo` > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > Attachments: log.txt
[jira] [Comment Edited] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118896#comment-17118896 ] Andrei Budnik edited comment on MESOS-10131 at 5/28/20, 5:21 PM: - I think the message containing the whole mount table is long enough (~30k bytes) to reach the limit of the logger buffer... [~tomplummer] Could you capture both the truncated log message and the output of "cat /proc//mountinfo" the next time it crashes? (and/or `mount && cat /proc/1/mountinfo` if the mesos agent can't start) was (Author: abudnik): I think the message containing the whole mount table is long enough (~30k bytes) to reach the limit of the logger buffer... [~tomplummer] Could you capture both truncated log message and the output of "cat /proc//mountinfo" next time it crashes? > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > Attachments: log.txt
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118896#comment-17118896 ] Andrei Budnik commented on MESOS-10131: --- I think the message containing the whole mount table is long enough (~30k bytes) to reach the limit of the logger buffer... [~tomplummer] Could you capture both truncated log message and the output of "cat /proc//mountinfo" next time it crashes? > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > Attachments: log.txt
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117903#comment-17117903 ] Andrei Budnik commented on MESOS-10131: --- [~tomplummer] It seems that the tail of the log message is missing. Could you please provide the whole log message containing the mount table? We will try to reproduce the problem by running a unit test to ensure that this is not a bug in the code. > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117892#comment-17117892 ] Andrei Budnik commented on MESOS-10131: --- Mount table without extra newlines: {code:java} 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs rw,seclabel 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - securityfs securityfs rw 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs rw,seclabel 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts rw,seclabel,gid=5,mode=620,ptmxmode=000 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs ro,seclabel,mode=755 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 - cgroup cgroup rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - pstore pstore rw 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime shared:21 - efivarfs efivarfs rw 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime shared:10 - cgroup cgroup rw,seclabel,perf_event 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 - cgroup cgroup rw,seclabel,cpuset 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - cgroup cgroup rw,seclabel,blkio 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 - cgroup cgroup rw,seclabel,freezer 34 25 0:30 / /sys/fs/cgroup/hugetlb 
rw,nosuid,nodev,noexec,relatime shared:15 - cgroup cgroup rw,seclabel,hugetlb 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - cgroup cgroup rw,seclabel,devices 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 - cgroup cgroup rw,seclabel,memory 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - cgroup cgroup rw,seclabel,pids 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs systemd-1 rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs rw,seclabel 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro 49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 50 41 253:5 / /home rw,nodev,relatime shared:32 - xfs /dev/mapper/vg_system-home rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 51 41 253:4 / /tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs /dev/mapper/vg_system-tmp rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 53 49 253:4 / /var/tmp rw,nosuid,nodev,noexec,relatime shared:33 
- xfs /dev/mapper/vg_system-tmp rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 52 49 253:3 / /var/log rw,relatime shared:34 - xfs /dev/mapper/vg_system-varlog rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 54 52 253:6 / /var/log/audit rw,relatime shared:35 - xfs /dev/mapper/vg_system-varlogaudit rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota 187 41 0:41 / /mnt/receipt rw,relatime shared:165 - nfs4 dtmetlnfsa01p.a.carfax.us:/ rw,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.18.154.117,local_lock=none,addr=172.18.138.237 188 41 0:42 / /mnt/receipt_web_dev rw,relatime shared:169 - nfs4 dtmetlnfsa01b.a.carfax.us:/ rw,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.18.154.117,local_lock=none,addr=172.18.137.248 192 41 0:41 / /mnt/receipt_web_prod
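The check that fails at fs.cpp:217 walks each mount entry's parent chain while tracking visited parents; revisiting one means the table encodes a cycle. The sketch below (illustrative Python, not the Mesos C++ implementation) shows both halves of that logic applied to mountinfo lines like the ones above: splitting each entry on the ' - ' separator, then detecting a loop in the id/parent-id hierarchy.

```python
def parse_mountinfo(text):
    """Map each entry's mount ID to its parent ID, as in /proc/[pid]/mountinfo."""
    entries = {}
    for line in text.strip().splitlines():
        # Fields before ' - ' include mount ID and parent ID; after it come
        # the filesystem type, source, and super options.
        head, sep, _tail = line.partition(" - ")
        if not sep:
            raise ValueError("Could not find separator ' - ' in: " + line)
        fields = head.split()
        entries[int(fields[0])] = int(fields[1])
    return entries

def find_cycle(entries):
    """Return an entry ID on a parent-chain cycle, or None if the table is a tree."""
    for start in entries:
        visited = set()
        current = start
        while current in entries:  # stop at a parent outside the table (the root)
            if current in visited:
                return current     # cycle found in mount table hierarchy
            visited.add(current)
            current = entries[current]
    return None
```

On a healthy table like the one quoted above, every parent chain terminates at the root entry, so `find_cycle` returns None; a duplicated or corrupted entry whose parent chain loops back on itself would trigger the CHECK failure seen in the crash.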
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117873#comment-17117873 ] Andrei Budnik commented on MESOS-10131: --- I've copy-pasted the mount table from the log excerpt into one of our unit tests (`FsTest.MountInfoTableReadSortedParentOfSelf`). It failed with the following error message: {code:java} ../../src/tests/containerizer/fs_tests.cpp:344: Failure table: Failed to parse entry 'docker/overlay2/l/LOG7DILAFLJBIQ7CKDQVFXJLP7:/var/lib/docker/overlay2/l/6JVIPP3XCCWKZPFAUWKXCDWYXL:/var/lib/docker/overlay2/l/L5VKHJHVOWG24VJPJCAKGTQX5G:/var/lib/docker/overlay2/l/ZIIS5MWCIF4C6KXI2LVKVU4TMF:/var/lib/docker/overlay2/l/4JXI': Could not find separator ' - ' {code} It seems that there was a memory corruption. I'm investigating what could be the cause. > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > > Our mesos agent frequently dies with the follow error in the slave logs: > > {code:java} > F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: > !visitedParents.contains(parentId) Cycle found in mount table hierarchy at > entry '1954': > 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs > rw,seclabel > 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw > 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs > rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755 > 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - > securityfs securityfs rw > 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs > rw,seclabel > 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts > rw,seclabel,gid=5,mode=620,ptmxmode=000 > 24 41 0:20 / /run 
rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755 > 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs > ro,seclabel,mode=755 > 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 > - cgroup cgroup > rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd > 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - > pstore pstore rw > 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime > shared:21 - efivarfs efivarfs rw > 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime > shared:10 - cgroup cgroup rw,seclabel,perf_event > 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime > shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls > 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 > - cgroup cgroup rw,seclabel,cpuset > 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - > cgroup cgroup rw,seclabel,blkio > 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 > - cgroup cgroup rw,seclabel,freezer > 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 > - cgroup cgroup rw,seclabel,hugetlb > 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 > - cgroup cgroup rw,seclabel,devices > 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime > shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu > 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 > - cgroup cgroup rw,seclabel,memory > 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - > cgroup cgroup rw,seclabel,pids > 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw > 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 42 18 0:16 / /sys/fs/selinux 
rw,relatime shared:23 - selinuxfs selinuxfs rw > 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs > systemd-1 > rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414 > 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw > 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel > 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs > rw,seclabel > 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 > rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro > 49 41 253:2 / /var
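The parse failure in the comment above comes down to the mountinfo format from proc(5): everything before the ' - ' separator is positional mount data, everything after it is filesystem type, source, and super options. As a rough illustration of that format (a hypothetical Python sketch, not Mesos' actual C++ parser in fs.cpp):

```python
def parse_mountinfo_entry(line):
    """Parse one /proc/self/mountinfo entry (see proc(5)).

    Fields before ' - ' are: mount ID, parent ID, major:minor device,
    root, mount point, mount options, and optional fields; the ' - '
    separator precedes the fs type, mount source, and super options.
    """
    head, sep, tail = line.partition(" - ")
    if not sep:
        # This is the condition the failing unit test reported.
        raise ValueError("Could not find separator ' - ' in: %s" % line)
    mount_id, parent_id, dev, root, mount_point, *optional = head.split()
    fs_type, source, *super_options = tail.split()
    return {
        "id": int(mount_id),
        "parent": int(parent_id),
        "mount_point": mount_point,
        "type": fs_type,
        "source": source,
    }
```

A truncated or corrupted entry, like the overlay2 fragment in the test failure, has no ' - ' and trips the ValueError branch.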
[jira] [Assigned] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-10131: - Assignee: Andrei Budnik > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > > Our Mesos agent frequently dies with the following error in the slave logs: > > {code:java} > F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: > !visitedParents.contains(parentId) Cycle found in mount table hierarchy at > entry '1954': > 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs > rw,seclabel > 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw > 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs > rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755 > 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - > securityfs securityfs rw > 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs > rw,seclabel > 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts > rw,seclabel,gid=5,mode=620,ptmxmode=000 > 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755 > 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs > ro,seclabel,mode=755 > 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 > - cgroup cgroup > rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd > 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - > pstore pstore rw > 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime > shared:21 - efivarfs efivarfs rw > 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime > shared:10 - cgroup cgroup 
rw,seclabel,perf_event > 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime > shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls > 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 > - cgroup cgroup rw,seclabel,cpuset > 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - > cgroup cgroup rw,seclabel,blkio > 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 > - cgroup cgroup rw,seclabel,freezer > 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 > - cgroup cgroup rw,seclabel,hugetlb > 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 > - cgroup cgroup rw,seclabel,devices > 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime > shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu > 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 > - cgroup cgroup rw,seclabel,memory > 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - > cgroup cgroup rw,seclabel,pids > 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw > 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw > 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs > systemd-1 > rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414 > 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw > 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel > 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs > rw,seclabel > 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 > 
rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro > 49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 50 41 253:5 / /home rw,nodev,relatime shared:32 - xfs > /dev/mapper/vg_system-home > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 51 41 253:4 / /tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs > /dev/mapper/vg_system-tmp > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 53 49 253:4 / /var/tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs > /dev/mapper/vg_system-tmp > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 52 49 253:3 / /var/log
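The `Check failed: !visitedParents.contains(parentId)` abort above comes from walking each mount entry's parent-ID chain and finding an entry that is (transitively) its own ancestor. A hypothetical Python sketch of that detection; the real logic lives in Mesos' fs.cpp and this only models the idea:

```python
def find_cycle(entries):
    """Return an entry ID that participates in a parent-ID cycle,
    or None if the mount table hierarchy is acyclic.

    `entries` is a list of dicts with "id" and "parent" keys, as a
    mountinfo parser might produce them.
    """
    parent = {e["id"]: e["parent"] for e in entries}
    for start in parent:
        visited = set()
        node = start
        # Walk up the hierarchy; IDs without an entry (e.g. parent 0)
        # terminate the walk.
        while node in parent:
            if node in visited:
                return node  # cycle found at this entry
            visited.add(node)
            node = parent[node]
    return None
```

A healthy table (every chain ends at the root, parent ID 0 or an absent ID) returns None; a corrupted table like the one in the crash would return the offending entry instead of aborting the whole process.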
[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101572#comment-17101572 ] Andrei Budnik commented on MESOS-10107: --- {code:java} commit 0cb1591b709e3c9f32093d943b8e2ddcdcf7999f Author: Charles-Francois Natali Date: Sat May 2 01:41:09 2020 +0100 Keep retrying to remove cgroup on EBUSY. This is a follow-up to MESOS-10107, which introduced retries when calling `rmdir` on a seemingly empty cgroup fails with `EBUSY` because of various kernel bugs. At the time, the fix introduced a bounded number of retries, using an exponential backoff summing up to slightly over 1s. This was done because it was similar to what Docker does, and worked during testing. However, after 1 month without seeing this error in our cluster at work, we finally experienced one case where the 1s timeout wasn't enough. It could be because the machine was busy at the time, or some other random factor. So instead of only trying for 1s, I think it might make sense to just keep retrying, until the top-level container destruction timeout - set at 1 minute - kicks in. This actually makes more sense, and avoids having a magical timeout in the cgroup code. We just need to ensure that when the destroyer is finalized, it discards the future in charge of doing the periodic remove. 
This closes #362 {code} > containeriser: failed to remove cgroup - EBUSY > -- > > Key: MESOS-10107 > URL: https://issues.apache.org/jira/browse/MESOS-10107 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Charles N >Assignee: Charles Natali >Priority: Major > Labels: cgroups, containerization > Fix For: 1.10.0 > > Attachments: mesos-remove-cgroup-race.diff, > reproduce-cgroup-rmdir-race.py > > > We've been seeing some random errors on our cluster, where the container > cgroup isn't properly destroyed after the OOM killer kicked in when memory > limit has been exceeded - see analysis and patch below: > Agent log: > {noformat} > I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: > 10272MB Maximum Used: 10518532KB > MEMORY STATISTICS: > cache 0 > rss 10502754304 > rss_huge 4001366016 > shmem 0 > mapped_file 270336 > dirty 0 > writeback 0 > swap 0 > pgpgin 1684617 > pgpgout 95480 > pgfault 1670328 > pgmajfault 957 > inactive_anon 0 > active_anon 10501189632 > inactive_file 4096 > active_file 0 > unevictable 0 > hierarchical_memory_limit 10770972672 > hierarchical_memsw_limit 10770972672 > total_cache 0 > total_rss 10502754304 > total_rss_huge 4001366016 > total_shmem 0 > total_mapped_file 270336 > total_dirty 0 > total_writeback 0 > total_swap 0 > total_pgpgin 1684617 > total_pgpgout 95480 > total_pgfault 1670328 > total_pgmajfault 957 > total_inactive_anon 0 > total_active_anon 10501070848 > total_inactive_file 4096 > total_active_file 0 > total_unevictable 0 > I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource > [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be > terminated > I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING 
state > I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state > of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING > after 4.285078272secs > I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy > container 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c' > I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 102.27072ms > I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 242944ns > I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for > executor(1)@127.0.1.1:46357 > I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited > E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor >
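The follow-up commit above replaces a bounded backoff with open-ended retries, relying on the container destroyer's ~1 minute timeout to discard the operation. A rough Python sketch of that idea; the function name, polling interval, and `should_continue` hook are illustrative, not Mesos' actual API:

```python
import errno
import os
import time


def remove_cgroup(path, interval=0.1, should_continue=lambda: True):
    """Keep retrying rmdir on EBUSY with no retry cap.

    The caller (modeling the container destruction timeout) stops the
    loop by making `should_continue` return False, analogous to
    discarding the future that drives the periodic remove.
    """
    while should_continue():
        try:
            os.rmdir(path)
            return True
        except OSError as e:
            if e.errno != errno.EBUSY:
                raise  # only the kernel-bug EBUSY case is retried
        time.sleep(interval)
    return False  # discarded before the cgroup became removable
```

This keeps the magical timeout out of the cgroup code itself: the retry loop is dumb, and the destruction timeout owns the deadline.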
[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU
[ https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088568#comment-17088568 ] Andrei Budnik commented on MESOS-10119: --- Could you reproduce the cgroups destruction problem consistently? What are the kernel and systemd versions installed on your agents? > failure to destroy container can cause the agent to "leak" a GPU > > > Key: MESOS-10119 > URL: https://issues.apache.org/jira/browse/MESOS-10119 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Charles Natali >Priority: Major > > At work we hit the following problem: > # cgroup for a task using the GPU isolation failed to be destroyed on OOM > # the agent continued advertising the GPU as available > # all subsequent attempts to start tasks using a GPU fail with "Requested 1 > gpus but only 0 available" > Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so can > be tackled separately, however the fact that the agent leaks the > GPU is pretty bad, because it basically turns into /dev/null, failing all > subsequent tasks requesting a GPU. 
> > See the logs: > > > {noformat} > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 > slave.cpp:6994] Termination of executor > 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an > isolator when destroying container: Failed to destroy cgroups: Failed to get > nested cgroups: Failed to determine canonical path of > '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such > file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 > containerizer.cpp:2567] Skipping status for container > 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 8ef00748-b640-4620-97dc-f719e9775e88 > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 > slave.cpp:6994] Termination of executor > 
'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all > processes in the container: Failed to remove cgroup > 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device > or resource busy > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 > containerizer.cpp:2567] Skipping status for container > 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 > slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor > 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus > but only 0 available > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 > memory.cpp:637] Listening on OOM events failed for container > 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 > containerizer.cpp:2421] Ignoring update for unknown container > 87253521-8d39-47ea-b4d1-febe527d230c > Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 > process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed > connect:
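The "leak" described above can be modeled with a toy allocator (this is not Mesos code): a GPU is returned to the pool only when container destruction succeeds, so a failed destroy strands it permanently and every later request sees one fewer GPU than the agent advertises.

```python
class GpuAllocator:
    """Toy model of the leak: resources are freed on successful
    destruction only, so a failed cleanup strands the GPU."""

    def __init__(self, total):
        self.available = set(range(total))
        self.allocated = {}  # container id -> gpu

    def allocate(self, container):
        if not self.available:
            raise RuntimeError("Requested 1 gpus but only 0 available")
        gpu = self.available.pop()
        self.allocated[container] = gpu
        return gpu

    def destroy(self, container, cleanup_ok=True):
        gpu = self.allocated.pop(container)
        if cleanup_ok:
            self.available.add(gpu)
        # On cleanup failure the gpu is never re-added: it is "leaked",
        # and the agent keeps failing GPU tasks until restart.
```

With one GPU, a single failed destroy turns the agent into the /dev/null the reporter describes.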
[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084136#comment-17084136 ] Andrei Budnik commented on MESOS-10107: --- {code:java} commit af3ca189aced5fbc537bfca571264142d4cd37b3 Author: Charles-Francois Natali Date: Wed Apr 1 13:40:16 2020 +0100 Handled EBUSY when destroying a cgroup. It's a workaround for kernel bugs which can cause `rmdir` to fail with `EBUSY` even though the cgroup - appears - empty. See for example https://lkml.org/lkml/2020/1/15/1349 This closes #355 {code} > containeriser: failed to remove cgroup - EBUSY > -- > > Key: MESOS-10107 > URL: https://issues.apache.org/jira/browse/MESOS-10107 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Charles N >Priority: Major > Labels: cgroups, containerization > Fix For: 1.10.0 > > Attachments: mesos-remove-cgroup-race.diff, > reproduce-cgroup-rmdir-race.py > > > We've been seeing some random errors on our cluster, where the container > cgroup isn't properly destroyed after the OOM killer kicked in when memory > limit has been exceeded - see analysis and patch below: > Agent log: > {noformat} > I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: > 10272MB Maximum Used: 10518532KB > MEMORY STATISTICS: > cache 0 > rss 10502754304 > rss_huge 4001366016 > shmem 0 > mapped_file 270336 > dirty 0 > writeback 0 > swap 0 > pgpgin 1684617 > pgpgout 95480 > pgfault 1670328 > pgmajfault 957 > inactive_anon 0 > active_anon 10501189632 > inactive_file 4096 > active_file 0 > unevictable 0 > hierarchical_memory_limit 10770972672 > hierarchical_memsw_limit 10770972672 > total_cache 0 > total_rss 10502754304 > total_rss_huge 4001366016 > total_shmem 0 > total_mapped_file 270336 > total_dirty 0 > total_writeback 0 > total_swap 0 > total_pgpgin 1684617 > total_pgpgout 95480 > total_pgfault 
1670328 > total_pgmajfault 957 > total_inactive_anon 0 > total_active_anon 10501070848 > total_inactive_file 4096 > total_active_file 0 > total_unevictable 0 > I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource > [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be > terminated > I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state > I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state > of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING > after 4.285078272secs > I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy > container 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c' > I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 102.27072ms > I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 242944ns > I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for > executor(1)@127.0.1.1:46357 > I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited > E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor > 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework > 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all > processes in the container: Failed to 
remove cgroup > 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device > or resource busy > {noformat} > Initially I thought it was a race condition in the cgroup destruction code, > but an strace confirmed that the cgroup directory was only deleted once all > tasks had exited (edited and commented strace below from a different instance > of the same problem): > {noformat} > # get the list of processes > 3431 23:01:32.293608 openat(AT_FDCWD, > "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", > O_RDONLY > 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.36> > 3431 23:01:32.294220 read(18, >
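The strace excerpt above shows the sequence: read the pids from `cgroup.procs`, SIGKILL each one, and only then remove the (now empty) cgroup directory. A simplified Python sketch of that kill step, for illustration only; Mesos does this in C++ and wraps the kills in a freeze/thaw of the freezer cgroup:

```python
import os
import signal


def kill_cgroup_tasks(cgroup_path):
    """Read the pids in a cgroup and SIGKILL each of them.

    Returns the list of pids that were read, mirroring the
    read(cgroup.procs) + kill() calls visible in the strace.
    """
    procs = os.path.join(cgroup_path, "cgroup.procs")
    with open(procs) as f:
        pids = [int(line) for line in f if line.strip()]
    for pid in pids:
        try:
            os.kill(pid, signal.SIGKILL)
        except ProcessLookupError:
            pass  # the task already exited between read and kill
    return pids
```

The bug report's point is that even after this sequence completes correctly (all tasks gone, directory read back empty), `rmdir` can still return EBUSY because of the kernel race.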
[jira] [Assigned] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-10107: - Assignee: Charles Natali > containeriser: failed to remove cgroup - EBUSY > -- > > Key: MESOS-10107 > URL: https://issues.apache.org/jira/browse/MESOS-10107 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Charles N >Assignee: Charles Natali >Priority: Major > Labels: cgroups, containerization > Fix For: 1.10.0 > > Attachments: mesos-remove-cgroup-race.diff, > reproduce-cgroup-rmdir-race.py > > > We've been seeing some random errors on our cluster, where the container > cgroup isn't properly destroyed after the OOM killer kicked in when memory > limit has been exceeded - see analysis and patch below: > Agent log: > {noformat} > I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: > 10272MB Maximum Used: 10518532KB > MEMORY STATISTICS: > cache 0 > rss 10502754304 > rss_huge 4001366016 > shmem 0 > mapped_file 270336 > dirty 0 > writeback 0 > swap 0 > pgpgin 1684617 > pgpgout 95480 > pgfault 1670328 > pgmajfault 957 > inactive_anon 0 > active_anon 10501189632 > inactive_file 4096 > active_file 0 > unevictable 0 > hierarchical_memory_limit 10770972672 > hierarchical_memsw_limit 10770972672 > total_cache 0 > total_rss 10502754304 > total_rss_huge 4001366016 > total_shmem 0 > total_mapped_file 270336 > total_dirty 0 > total_writeback 0 > total_swap 0 > total_pgpgin 1684617 > total_pgpgout 95480 > total_pgfault 1670328 > total_pgmajfault 957 > total_inactive_anon 0 > total_active_anon 10501070848 > total_inactive_file 4096 > total_active_file 0 > total_unevictable 0 > I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource > 
[{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be > terminated > I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state > I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state > of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING > after 4.285078272secs > I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy > container 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c' > I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 102.27072ms > I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 242944ns > I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for > executor(1)@127.0.1.1:46357 > I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited > E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor > 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework > 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all > processes in the container: Failed to remove cgroup > 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device > or resource busy > {noformat} > Initially I thought it was a race condition in the cgroup destruction code, > but an strace confirmed 
that the cgroup directory was only deleted once all > tasks had exited (edited and commented strace below from a different instance > of the same problem): > {noformat} > # get the list of processes > 3431 23:01:32.293608 openat(AT_FDCWD, > "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", > O_RDONLY > 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.36> > 3431 23:01:32.294220 read(18, > 3431 23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) = > 15 <0.33> > 3431 23:01:32.294306 read(18, "", 4096) = 0 <0.13> > 3431 23:01:32.294346 close(18 > 3431 23:01:32.294402 <... close resumed> ) = 0 <0.45> > #kill them > 3431 23:01:32.296266 kill(5878, SIGKILL) = 0 <0.19> > 3431 23:01:32.296384 kill(6036, SIGKILL >
[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY
[ https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072680#comment-17072680 ] Andrei Budnik commented on MESOS-10107: --- Thanks for the detailed explanations! Could you please submit your patch to [Apache Review Board|http://mesos.apache.org/documentation/latest/advanced-contribution/#submit-your-patch] or open a [PR on github|http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr/]? Does the workaround work reliably after changing the initial delay and retry count to the values taken from libcontainerd (10ms and 5)? Should we retry only if `::rmdir()` returns EBUSY errno error? > containeriser: failed to remove cgroup - EBUSY > -- > > Key: MESOS-10107 > URL: https://issues.apache.org/jira/browse/MESOS-10107 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Charles >Priority: Major > Attachments: mesos-remove-cgroup-race.diff, > reproduce-cgroup-rmdir-race.py > > > We've been seeing some random errors on our cluster, where the container > cgroup isn't properly destroyed after the OOM killer kicked in when memory > limit has been exceeded - see analysis and patch below: > Agent log: > {noformat} > I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: > 10272MB Maximum Used: 10518532KB > MEMORY STATISTICS: > cache 0 > rss 10502754304 > rss_huge 4001366016 > shmem 0 > mapped_file 270336 > dirty 0 > writeback 0 > swap 0 > pgpgin 1684617 > pgpgout 95480 > pgfault 1670328 > pgmajfault 957 > inactive_anon 0 > active_anon 10501189632 > inactive_file 4096 > active_file 0 > unevictable 0 > hierarchical_memory_limit 10770972672 > hierarchical_memsw_limit 10770972672 > total_cache 0 > total_rss 10502754304 > total_rss_huge 4001366016 > total_shmem 0 > total_mapped_file 270336 > total_dirty 0 > total_writeback 0 > 
total_swap 0 > total_pgpgin 1684617 > total_pgpgout 95480 > total_pgfault 1670328 > total_pgmajfault 957 > total_inactive_anon 0 > total_active_anon 10501070848 > total_inactive_file 4096 > total_active_file 0 > total_unevictable 0 > I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource > [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be > terminated > I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state > I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state > of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING > after 4.285078272secs > I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy > container 2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c' > I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 102.27072ms > I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c > I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after > 242944ns > I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for > executor(1)@127.0.1.1:46357 > I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container > 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited > E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor > 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework > 
0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all > processes in the container: Failed to remove cgroup > 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device > or resource busy > {noformat} > Initially I thought it was a race condition in the cgroup destruction code, > but an strace confirmed that the cgroup directory was only deleted once all > tasks had exited (edited and commented strace below from a different instance > of the same problem): > {noformat} > # get the list of processes > 3431 23:01:32.293608 openat(AT_FDCWD, > "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs", > O_RDONLY > 3431 23:01:32.293669 <... openat resumed> ) = 18 <0.36> > 3431
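The bounded variant discussed in the comment above, with the initial delay and retry count mentioned as libcontainerd's values (10 ms and 5 retries), could look like this hypothetical Python sketch; parameters and the function name are illustrative:

```python
import errno
import os
import time


def rmdir_with_backoff(path, initial_delay=0.01, retries=5):
    """Retry rmdir on EBUSY with exponential backoff.

    Returns True on success, False once the retry budget is spent;
    any error other than EBUSY propagates immediately.
    """
    delay = initial_delay
    for attempt in range(retries + 1):
        try:
            os.rmdir(path)
            return True
        except OSError as e:
            if e.errno != errno.EBUSY:
                raise
            if attempt == retries:
                return False
            time.sleep(delay)
            delay *= 2  # 10ms, 20ms, 40ms, ...
    return False
```

As the later follow-up commit in this thread notes, any fixed budget like this can still lose to a busy machine, which is what motivated retrying until the container destruction timeout instead.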
[jira] [Deleted] (MESOS-10078) Cgroups isolator: update cgroups subsystems to support nested cgroups
[ https://issues.apache.org/jira/browse/MESOS-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik deleted MESOS-10078: -- > Cgroups isolator: update cgroups subsystems to support nested cgroups > - > > Key: MESOS-10078 > URL: https://issues.apache.org/jira/browse/MESOS-10078 > Project: Mesos > Issue Type: Task >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: cgroups, containerization > > Update Cgroups Subsystems to support nested cgroups.
[jira] [Created] (MESOS-10098) Mesos agent fails to start on outdated systemd.
Andrei Budnik created MESOS-10098: - Summary: Mesos agent fails to start on outdated systemd. Key: MESOS-10098 URL: https://issues.apache.org/jira/browse/MESOS-10098 Project: Mesos Issue Type: Bug Components: agent Affects Versions: 1.10 Environment: CoreOS 2411.0.0 Reporter: Andrei Budnik Assignee: Andrei Budnik Fix For: 1.10 The Mesos agent refuses to start due to a failure in the systemd-specific code: {code:java} E0220 12:03:02.943467 22298 main.cpp:670] EXIT with status 1: Expected exactly one socket with name unknown, got 0 instead {code} It turns out that some versions of systemd do not set the environment variables `LISTEN_PID`, `LISTEN_FDS`, and `LISTEN_FDNAMES` for the Mesos agent process if its systemd unit is ill-formed. When this happens, `listenFdsWithName` returns an empty list, leading to the error above. After the problem with the systemd unit is fixed, systemd sets the value of `LISTEN_FDNAMES` from the `FileDescriptorName` field. In our case, the env variable is set to `systemd:dcos-mesos-slave`. Since the agent expects the value to equal "systemd:unknown" (for compatibility with older systemd versions), the values mismatch and we see the same error message. -- This message was sent by Atlassian Jira (v8.3.4#803005)
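The failure mode can be illustrated with a sketch of a systemd socket-activation lookup, loosely modeled on the agent's `listenFdsWithName`. Everything here is an assumption for illustration, not Mesos source: the signature is invented, and the exact name-matching convention (the `systemd:` prefix mentioned in the report) is elided in favor of plain colon-separated names. systemd hands the process fds 3 .. 3 + LISTEN_FDS - 1 and names them in the colon-separated LISTEN_FDNAMES variable; if an outdated or misconfigured systemd sets none of these variables, no fd can match and the caller fails with "Expected exactly one socket with name ..., got 0 instead".

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

static const int kListenFdsStart = 3;  // SD_LISTEN_FDS_START

// Hypothetical sketch: return the activated fds whose name matches `name`.
std::vector<int> listenFdsWithName(const char* listenFds,     // LISTEN_FDS
                                   const char* listenFdNames, // LISTEN_FDNAMES
                                   const std::string& name)
{
  std::vector<int> matches;
  if (listenFds == nullptr || listenFdNames == nullptr) {
    return matches;  // Variables absent: systemd passed us nothing.
  }

  const int count = std::atoi(listenFds);

  // LISTEN_FDNAMES holds one name per fd, colon-separated.
  std::vector<std::string> names;
  std::stringstream ss(listenFdNames);
  std::string token;
  while (std::getline(ss, token, ':')) {
    names.push_back(token);
  }

  for (int i = 0; i < count && i < static_cast<int>(names.size()); ++i) {
    if (names[i] == name) {
      matches.push_back(kListenFdsStart + i);
    }
  }
  return matches;
}
```

Note how a name mismatch (the fd named after `FileDescriptorName` vs. the name the agent expects) produces the same empty result as the variables being absent, which is why both problems surface as the same error message.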
[jira] [Commented] (MESOS-9853) Update Docker executor to allow kill policy overrides
[ https://issues.apache.org/jira/browse/MESOS-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030016#comment-17030016 ] Andrei Budnik commented on MESOS-9853: -- Backported /r/71033/ ("Moved the Docker executor declaration into a header.") to the previous versions as there is a bugfix (/r/72055) that depends on this patch. 1.5.x {code:java} commit 68e8655c8fb6dbc41de6afb66a569583b32f78d3 Author: Greg Mann Date: Thu Jul 25 12:17:41 2019 -0700 Moved the Docker executor declaration into a header. This moves the declaration of the Docker executor into the Docker executor header file and moves the code for the Docker executor binary into a new launcher implementation file. This change will enable the Mesos executor driver implementation to make use of the `DockerExecutor` symbol. Review: https://reviews.apache.org/r/71033/ {code} 1.6.x {code:java} commit 02eb0ceb87dadc0a5ac6f6cd9f141347e852fb80 Author: Greg Mann Date: Thu Jul 25 12:17:41 2019 -0700 Moved the Docker executor declaration into a header. This moves the declaration of the Docker executor into the Docker executor header file and moves the code for the Docker executor binary into a new launcher implementation file. This change will enable the Mesos executor driver implementation to make use of the `DockerExecutor` symbol. Review: https://reviews.apache.org/r/71033/ {code} 1.7.x {code:java} commit 0567b31212105821d0b37ad049228dab6e98ed63 Author: Greg Mann Date: Thu Jul 25 12:17:41 2019 -0700 Moved the Docker executor declaration into a header. This moves the declaration of the Docker executor into the Docker executor header file and moves the code for the Docker executor binary into a new launcher implementation file. This change will enable the Mesos executor driver implementation to make use of the `DockerExecutor` symbol. 
Review: https://reviews.apache.org/r/71033/ {code} 1.8.x {code:java} commit 1995f63352a5a8c2c8e73adefed708a8620a5d47 Author: Greg Mann Date: Thu Jul 25 12:17:41 2019 -0700 Moved the Docker executor declaration into a header. This moves the declaration of the Docker executor into the Docker executor header file and moves the code for the Docker executor binary into a new launcher implementation file. This change will enable the Mesos executor driver implementation to make use of the `DockerExecutor` symbol. Review: https://reviews.apache.org/r/71033/ {code} > Update Docker executor to allow kill policy overrides > - > > Key: MESOS-9853 > URL: https://issues.apache.org/jira/browse/MESOS-9853 > Project: Mesos > Issue Type: Task >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Fix For: 1.9.0 > > > In order for the agent to successfully override the task kill policy of > Docker tasks when the agent is being drained, the Docker executor must be > able to receive kill policy overrides and must be updated to honor them. > Since the Docker executor runs using the executor driver, this is currently > not possible. We could, for example, update the executor driver interface, or > move the Docker executor off of the executor driver. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-8537) Default executor doesn't wait for status updates to be ack'd before shutting down
[ https://issues.apache.org/jira/browse/MESOS-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029065#comment-17029065 ] Andrei Budnik commented on MESOS-8537: -- 1.5.x {code:java} commit 84b7af3409d8af343da0f0420e168a42de4b110f Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72029 {code} 1.6.x {code:java} commit 205525eb56a33e58bed1fc38e0b32189b19d3fbc Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch increases the timeout from one second to one minute for fail-safety. 
Review: https://reviews.apache.org/r/72029 {code} 1.7.x {code:java} commit 5b399080eee11ee03f4bc6c09b791c24670da6c1 Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72029 {code} 1.8.x {code:java} commit a2ca451aab4625e126b9e7b470eb9f7c232dd746 Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch increases the timeout from one second to one minute for fail-safety. 
Review: https://reviews.apache.org/r/72029 {code} 1.9.x {code:java} commit f37ae68a8f0d23a2e0f31812b8fe4494109769c6 Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch
[jira] [Comment Edited] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.
[ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029047#comment-17029047 ] Andrei Budnik edited comment on MESOS-9847 at 2/3/20 3:54 PM: -- {code:java} commit 457c38967bf9a53c1c5cd2743385937a26f413f6 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} was (Author: abudnik): {code:java} commit 683dfc1ffb0b1ca758a07d19ab3badd8cac62dc7 Author: Andrei Budnik Date: Wed Jan 29 19:07:50 2020 +0100 Changed termination logic of the default executor. Previously, the default executor terminated itself after all containers had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgements before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the default executor if all status updates have been acknowledged by the agent and no running containers left. Also, this patch increases the timeout from one second to one minute for fail-safety. 
Review: https://reviews.apache.org/r/72029 commit 457c38967bf9a53c1c5cd2743385937a26f413f6 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} > Docker executor doesn't wait for status updates to be ack'd before shutting > down. > - > > Key: MESOS-9847 > URL: https://issues.apache.org/jira/browse/MESOS-9847 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Meng Zhu >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1 > > > The docker executor doesn't wait for pending status updates to be > acknowledged before shutting down, instead it sleeps for one second and then > terminates: > {noformat} > void _stop() > { > // A hack for now ... but we need to wait until the status update > // is sent to the slave before we shut ourselves down. > // TODO(tnachen): Remove this hack and also the same hack in the > // command executor when we have the new HTTP APIs to wait until > // an ack. > os::sleep(Seconds(1)); > driver.get()->stop(); > } > {noformat} > This would result in racing between task status update (e.g. TASK_FINISHED) > and executor exit. 
The latter would lead to the agent generating a `TASK_FAILED` > status update by itself, resulting in the confusing case where the agent > handles two different terminal status updates. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.
[ https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029051#comment-17029051 ] Andrei Budnik commented on MESOS-9847: -- 1.5.x {code:java} commit ff98f12a50a56c13688b87068a116d1d08142f49 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} 1.6.x {code:java} commit f511f25be9d850ee9b65fc3ec5f54d149beb2f19 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. 
Review: https://reviews.apache.org/r/72055 {code} 1.7.x {code:java} commit 6a7da284d1b89f8a144ed2f896f005a5ee9d4aea Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} 1.8.x {code:java} commit 1bd0b37a7e522d63319db426dae7068b901eaea6 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} 1.9.x {code:java} commit 3d60cba39d0377a7dc19b4c47f3bb0807418fe50 Author: Andrei Budnik Date: Wed Jan 29 13:35:02 2020 +0100 Changed termination logic of the Docker executor. 
Previously, the Docker executor terminated itself after a task's container had terminated. This could lead to termination of the executor before processing of a terminal status update by the agent. In order to mitigate this issue, the executor slept for one second to give a chance to send all status updates and receive all status update acknowledgments before terminating itself. This might have led to various race conditions in some circumstances (e.g., on a slow host). This patch terminates the Docker executor after receiving a terminal status update acknowledgment. Also, this patch increases the timeout from one second to one minute for fail-safety. Review: https://reviews.apache.org/r/72055 {code} > Docker executor doesn't wait for status updates to be
[jira] [Assigned] (MESOS-8537) Default executor doesn't wait for status updates to be ack'd before shutting down
[ https://issues.apache.org/jira/browse/MESOS-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-8537: Assignee: Andrei Budnik > Default executor doesn't wait for status updates to be ack'd before shutting > down > - > > Key: MESOS-8537 > URL: https://issues.apache.org/jira/browse/MESOS-8537 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.4.1, 1.5.0 >Reporter: Gastón Kleiman >Assignee: Andrei Budnik >Priority: Major > Labels: containerization, default-executor, mesosphere > > The default executor doesn't wait for pending status updates to be > acknowledged before shutting down, instead it sleeps for one second and then > terminates: > {code} > void _shutdown() > { > const Duration duration = Seconds(1); > LOG(INFO) << "Terminating after " << duration; > // TODO(qianzhang): Remove this hack since the executor now receives > // acknowledgements for status updates. The executor can terminate > // after it receives an ACK for a terminal status update. > os::sleep(duration); > terminate(self()); > } > {code} > The event handler should exit if upon receiving a {{Event::ACKNOWLEDGED}} the > executor is shutting down, no tasks are running anymore, and all pending > status updates have been acknowledged. -- This message was sent by Atlassian Jira (v8.3.4#803005)
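The termination condition described in the last paragraph can be sketched as follows. The type and names below are hypothetical, not the Mesos implementation: instead of sleeping for a fixed second, the executor tracks outstanding work and terminates only when it is shutting down with nothing left to wait for (a separate long timeout, e.g. one minute, can remain as a fail-safe).

```cpp
#include <cassert>

// Illustrative state for ack-gated executor shutdown.
struct ExecutorState
{
  bool shuttingDown = false;
  int runningTasks = 0;
  int pendingAcks = 0;  // status updates sent but not yet acknowledged

  // Evaluated on every Event::ACKNOWLEDGED and on task termination:
  // terminate only when shutting down, no tasks remain, and every
  // status update has been acknowledged by the agent.
  bool readyToTerminate() const
  {
    return shuttingDown && runningTasks == 0 && pendingAcks == 0;
  }
};
```

This removes the race between the terminal status update and executor exit, because the executor cannot exit before the agent has acknowledged the update.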
[jira] [Created] (MESOS-10080) Cgroups isolator: update cleanup logic to support nested cgroups
Andrei Budnik created MESOS-10080: - Summary: Cgroups isolator: update cleanup logic to support nested cgroups Key: MESOS-10080 URL: https://issues.apache.org/jira/browse/MESOS-10080 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik Update the Cgroups isolator to clean up the nested cgroup of a nested container, taking the hierarchical layout of cgroups into account. The deepest nested cgroups should be destroyed first. -- This message was sent by Atlassian Jira (v8.3.4#803005)
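The cleanup order described above can be sketched as a simple depth-first ordering (the helper name is hypothetical, not Mesos source): nested cgroups must be removed bottom-up, since a cgroup directory cannot be removed while it still has child cgroups.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Return the given cgroup paths ordered for destruction: deepest first.
std::vector<std::string> destructionOrder(std::vector<std::string> cgroups)
{
  std::stable_sort(
      cgroups.begin(),
      cgroups.end(),
      [](const std::string& a, const std::string& b) {
        // Deeper paths (more '/' separators) come first.
        return std::count(a.begin(), a.end(), '/') >
               std::count(b.begin(), b.end(), '/');
      });
  return cgroups;
}
```

For example, `mesos/parent/child` would be destroyed before `mesos/parent`, which in turn is destroyed before `mesos`.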
[jira] [Created] (MESOS-10079) Cgroups isolator: recover nested cgroups
Andrei Budnik created MESOS-10079: - Summary: Cgroups isolator: recover nested cgroups Key: MESOS-10079 URL: https://issues.apache.org/jira/browse/MESOS-10079 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik Update the Cgroups isolator's recovery logic to recover nested cgroups for those nested containers that were launched in nested cgroups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10078) Cgroups isolator: update cgroups subsystems to support nested cgroups
Andrei Budnik created MESOS-10078: - Summary: Cgroups isolator: update cgroups subsystems to support nested cgroups Key: MESOS-10078 URL: https://issues.apache.org/jira/browse/MESOS-10078 Project: Mesos Issue Type: Task Reporter: Andrei Budnik Assignee: Andrei Budnik Update Cgroups Subsystems to support nested cgroups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10077) Cgroups isolator: allow updating and isolating resources for nested cgroups
Andrei Budnik created MESOS-10077: - Summary: Cgroups isolator: allow updating and isolating resources for nested cgroups Key: MESOS-10077 URL: https://issues.apache.org/jira/browse/MESOS-10077 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik Allow the Cgroups isolator to update and isolate resources for nested cgroups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (MESOS-10076) Cgroups isolator: create nested cgroups
Andrei Budnik created MESOS-10076: - Summary: Cgroups isolator: create nested cgroups Key: MESOS-10076 URL: https://issues.apache.org/jira/browse/MESOS-10076 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik Update the Cgroups isolator to create, during container launch preparation, a nested cgroup for each nested container that supports nested cgroups. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995737#comment-16995737 ] Andrei Budnik commented on MESOS-10066: --- cc [~qianzhang] > mesos-docker-executor process dies when agent stops. Recovery fails when > agent returns > -- > > Key: MESOS-10066 > URL: https://issues.apache.org/jira/browse/MESOS-10066 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker, executor >Affects Versions: 1.7.3 >Reporter: Dalton Matos Coelho Barreto >Priority: Critical > Attachments: logs-after.txt, logs-before.txt > > > Hello all, > The documentation about Agent Recovery shows two conditions for the recovery > to be possible: > - The agent must have recovery enabled (default true?); > - The scheduler must register itself saying that it has checkpointing > enabled. > In my tests I'm using Marathon as the scheduler and Mesos itself sees > Marathon as a checkpoint-enabled scheduler: > {noformat} > $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, > "id": .id, "checkpoint": .checkpoint, "active": .active}' > { > "name": "asgard-chronos", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001", > "checkpoint": true, > "active": true > } > { > "name": "marathon", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-", > "checkpoint": true, > "active": true > } > }} > {noformat} > Here is what I'm using: > # Mesos Master, 1.4.1 > # Mesos Agent 1.7.3 > # Using docker image {{mesos/mesos-centos:1.7.x}} > # Docker sock mounted from the host > # Docker binary also mounted from the host > # Marathon: 1.4.12 > # Docker > {noformat} > Client: Docker Engine - Community > Version: 19.03.5 > API version: 1.39 (downgraded from 1.40) > Go version:go1.12.12 > Git commit:633a0ea838 > Built: Wed Nov 13 07:22:05 2019 > OS/Arch: linux/amd64 > Experimental: false > > Server: Docker Engine - Community > Engine: > Version: 18.09.2 > API version: 1.39 (minimum version 1.12) > Go version: 
go1.10.6 > Git commit: 6247962 > Built:Sun Feb 10 03:42:13 2019 > OS/Arch: linux/amd64 > Experimental: false > {noformat} > h2. The problem > Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} > docker image. > {noformat} > { > "id": "/sleep", > "cmd": "sleep 99d", > "cpus": 0.1, > "mem": 128, > "disk": 0, > "instances": 1, > "constraints": [], > "acceptedResourceRoles": [ > "*" > ], > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "debian", > "network": "HOST", > "privileged": false, > "parameters": [], > "forcePullImage": true > } > }, > "labels": {}, > "portDefinitions": [] > } > {noformat} > This task runs fine and get scheduled on the right agent, which is running > mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}). > Here is a sample log: > {noformat} > mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching > executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with 
resources > [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] > in work directory >
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989880#comment-16989880 ] Andrei Budnik commented on MESOS-10066: --- So the Docker socket is mounted from the host FS into the Docker container? I'm not sure if Mesos supports such a configuration. Since mesos-docker-executor is launched in a separate Docker container, there is no way to establish a socket connection from one Docker container (where agent runs) to another (where executor runs). Is executor's port 10.234.172.56:9899 exposed by the Docker container? AFAIK, [Mesos mini|http://mesos.apache.org/blog/mesos-mini/] uses Docker-in-Docker technique instead. > mesos-docker-executor process dies when agent stops. Recovery fails when > agent returns > -- > > Key: MESOS-10066 > URL: https://issues.apache.org/jira/browse/MESOS-10066 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker, executor >Affects Versions: 1.7.3 >Reporter: Dalton Matos Coelho Barreto >Priority: Critical > Attachments: logs-after.txt, logs-before.txt > > > Hello all, > The documentation about Agent Recovery shows two conditions for the recovery > to be possible: > - The agent must have recovery enabled (default true?); > - The scheduler must register itself saying that it has checkpointing > enabled. 
> In my tests I'm using Marathon as the scheduler and Mesos itself sees > Marathon as e checkpoint-enabled scheduler: > {noformat} > $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, > "id": .id, "checkpoint": .checkpoint, "active": .active}' > { > "name": "asgard-chronos", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001", > "checkpoint": true, > "active": true > } > { > "name": "marathon", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-", > "checkpoint": true, > "active": true > } > }} > {noformat} > Here is what I'm using: > # Mesos Master, 1.4.1 > # Mesos Agent 1.7.3 > # Using docker image {{mesos/mesos-centos:1.7.x}} > # Docker sock mounted from the host > # Docker binary also mounted from the host > # Marathon: 1.4.12 > # Docker > {noformat} > Client: Docker Engine - Community > Version: 19.03.5 > API version: 1.39 (downgraded from 1.40) > Go version:go1.12.12 > Git commit:633a0ea838 > Built: Wed Nov 13 07:22:05 2019 > OS/Arch: linux/amd64 > Experimental: false > > Server: Docker Engine - Community > Engine: > Version: 18.09.2 > API version: 1.39 (minimum version 1.12) > Go version: go1.10.6 > Git commit: 6247962 > Built:Sun Feb 10 03:42:13 2019 > OS/Arch: linux/amd64 > Experimental: false > {noformat} > h2. The problem > Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} > docker image. > {noformat} > { > "id": "/sleep", > "cmd": "sleep 99d", > "cpus": 0.1, > "mem": 128, > "disk": 0, > "instances": 1, > "constraints": [], > "acceptedResourceRoles": [ > "*" > ], > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "debian", > "network": "HOST", > "privileged": false, > "parameters": [], > "forcePullImage": true > } > }, > "labels": {}, > "portDefinitions": [] > } > {noformat} > This task runs fine and get scheduled on the right agent, which is running > mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}). 
> Here is a sample log: > {noformat} > mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating > sandbox >
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989808#comment-16989808 ] Andrei Budnik commented on MESOS-10066: --- Did you try to specify --docker_mesos_image command-line option for the agent that runs inside the Docker container? > mesos-docker-executor process dies when agent stops. Recovery fails when > agent returns > -- > > Key: MESOS-10066 > URL: https://issues.apache.org/jira/browse/MESOS-10066 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker, executor >Affects Versions: 1.7.3 >Reporter: Dalton Matos Coelho Barreto >Priority: Critical > Attachments: logs-after.txt, logs-before.txt > > > Hello all, > The documentation about Agent Recovery shows two conditions for the recovery > to be possible: > - The agent must have recovery enabled (default true?); > - The scheduler must register itself saying that it has checkpointing > enabled. > In my tests I'm using Marathon as the scheduler and Mesos itself sees > Marathon as e checkpoint-enabled scheduler: > {noformat} > $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, > "id": .id, "checkpoint": .checkpoint, "active": .active}' > { > "name": "asgard-chronos", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001", > "checkpoint": true, > "active": true > } > { > "name": "marathon", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-", > "checkpoint": true, > "active": true > } > }} > {noformat} > Here is what I'm using: > # Mesos Master, 1.4.1 > # Mesos Agent 1.7.3 > # Using docker image {{mesos/mesos-centos:1.7.x}} > # Docker sock mounted from the host > # Docker binary also mounted from the host > # Marathon: 1.4.12 > # Docker > {noformat} > Client: Docker Engine - Community > Version: 19.03.5 > API version: 1.39 (downgraded from 1.40) > Go version:go1.12.12 > Git commit:633a0ea838 > Built: Wed Nov 13 07:22:05 2019 > OS/Arch: linux/amd64 > Experimental: false > > Server: Docker Engine - 
Community > Engine: > Version: 18.09.2 > API version: 1.39 (minimum version 1.12) > Go version: go1.10.6 > Git commit: 6247962 > Built:Sun Feb 10 03:42:13 2019 > OS/Arch: linux/amd64 > Experimental: false > {noformat} > h2. The problem > Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} > docker image. > {noformat} > { > "id": "/sleep", > "cmd": "sleep 99d", > "cpus": 0.1, > "mem": 128, > "disk": 0, > "instances": 1, > "constraints": [], > "acceptedResourceRoles": [ > "*" > ], > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "debian", > "network": "HOST", > "privileged": false, > "parameters": [], > "forcePullImage": true > } > }, > "labels": {}, > "portDefinitions": [] > } > {noformat} > This task runs fine and get scheduled on the right agent, which is running > mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}). > Here is a sample log: > {noformat} > mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching > executor 
'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources >
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989728#comment-16989728 ] Andrei Budnik commented on MESOS-10066: --- Could you please attach full agent logs? > mesos-docker-executor process dies when agent stops. Recovery fails when > agent returns > -- > > Key: MESOS-10066 > URL: https://issues.apache.org/jira/browse/MESOS-10066 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker, executor >Affects Versions: 1.7.3 >Reporter: Dalton Matos Coelho Barreto >Priority: Critical > > Hello all, > The documentation about Agent Recovery shows two conditions for the recovery > to be possible: > - The agent must have recovery enabled (default true?); > - The scheduler must register itself saying that it has checkpointing > enabled. > In my tests I'm using Marathon as the scheduler and Mesos itself sees > Marathon as e checkpoint-enabled scheduler: > {noformat} > $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, > "id": .id, "checkpoint": .checkpoint, "active": .active}' > { > "name": "asgard-chronos", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001", > "checkpoint": true, > "active": true > } > { > "name": "marathon", > "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-", > "checkpoint": true, > "active": true > } > }} > {noformat} > Here is what I'm using: > # Mesos Master, 1.4.1 > # Mesos Agent 1.7.3 > # Using docker image {{mesos/mesos-centos:1.7.x}} > # Docker sock mounted from the host > # Docker binary also mounted from the host > # Marathon: 1.4.12 > # Docker > {noformat} > Client: Docker Engine - Community > Version: 19.03.5 > API version: 1.39 (downgraded from 1.40) > Go version:go1.12.12 > Git commit:633a0ea838 > Built: Wed Nov 13 07:22:05 2019 > OS/Arch: linux/amd64 > Experimental: false > > Server: Docker Engine - Community > Engine: > Version: 18.09.2 > API version: 1.39 (minimum version 1.12) > Go version: go1.10.6 > Git commit: 6247962 
> Built:Sun Feb 10 03:42:13 2019 > OS/Arch: linux/amd64 > Experimental: false > {noformat} > h2. The problem > Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} > docker image. > {noformat} > { > "id": "/sleep", > "cmd": "sleep 99d", > "cpus": 0.1, > "mem": 128, > "disk": 0, > "instances": 1, > "constraints": [], > "acceptedResourceRoles": [ > "*" > ], > "container": { > "type": "DOCKER", > "volumes": [], > "docker": { > "image": "debian", > "network": "HOST", > "privileged": false, > "parameters": [], > "forcePullImage": true > } > }, > "labels": {}, > "portDefinitions": [] > } > {noformat} > This task runs fine and get scheduled on the right agent, which is running > mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}). > Here is a sample log: > {noformat} > mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching > task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- > mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating > sandbox > '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923' > mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching > executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework > 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources > 
[{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] > in work directory >
[jira] [Created] (MESOS-10014) `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.
Andrei Budnik created MESOS-10014: - Summary: `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`. Key: MESOS-10014 URL: https://issues.apache.org/jira/browse/MESOS-10014 Project: Mesos Issue Type: Bug Components: master, test Reporter: Andrei Budnik Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt `ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0` test failed: {code:java} F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: tryUntrackFrameworkUnderRole(framework, role) Framework: e6284079-cb6a-4a47-8f9a-ea9b84ff622a- role: default-role *** Check failure stack trace: *** @ 0x7f40fff0a1f6 google::LogMessage::Fail() @ 0x7f40fff0a14f google::LogMessage::SendToLog() @ 0x7f40fff09a91 google::LogMessage::Flush() @ 0x7f40fff0d12f google::LogMessageFatal::~LogMessageFatal() @ 0x7f410fd828ac mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework() @ 0x186b29f _ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_ @ 0x189c273 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_ @ 0x18990b7 _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_ @ 0x1896100 
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1clIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_ @ 0x1895174 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_ISB_St12_PlaceholderILi1EISQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_ @ 0x1894b2b _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_ @ 0x18943bc _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_ISF_St12_PlaceholderILi1EEclEOS3_ @ 0x7f41016deb22 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ @ 0x7f410169620c process::ProcessBase::consume() @ 0x7f41016c0696 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE @ 0x1822baa process::ProcessBase::serve() @ 0x7f4101692af1 process::ProcessManager::resume() @ 0x7f410168ed68 _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv @ 0x7f41016b81e2 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE @ 0x7f41016b7244 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv @ 0x7f41016b6088 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv @ 0x7f40fca44590 execute_native_thread_routine @ 0x7f40ffa77e25 start_thread @ 
0x7f40fa396bad __clone @ (nil) (unknown) {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos
[ https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942893#comment-16942893 ] Andrei Budnik commented on MESOS-6480: -- design doc: [https://docs.google.com/document/d/1JeLTr9L31S8eIg-6xpjedIUKvnfNake0kPTzxEwdUdI] > Support for docker live-restore option in Mesos > --- > > Key: MESOS-6480 > URL: https://issues.apache.org/jira/browse/MESOS-6480 > Project: Mesos > Issue Type: Task >Reporter: Milind Chawre >Priority: Major > > Docker 1.12 supports the live-restore option, which keeps containers alive during > docker daemon downtime: https://docs.docker.com/engine/admin/live-restore/ > I tried to use this option in my Mesos setup and observed this: > 1. On a mesos worker node, stop the docker daemon. > 2. After some time, start the docker daemon. All the containers running on > that node are still visible using "docker ps". This is the expected behaviour of > the live-restore option. > 3. When I check the mesos and marathon UIs, they show no active tasks running on > that node. The containers which are still running on that node are now > scheduled on different mesos nodes, which is not right, since I can still see the > containers in "docker ps" output because of the live-restore option. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9843) Implement tests for the `containerizer/debug` endpoint.
[ https://issues.apache.org/jira/browse/MESOS-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936734#comment-16936734 ] Andrei Budnik commented on MESOS-9843: -- {code:java} commit dee4b849c8179ea46947c8ea4dd031f6eb37b659 Author: Andrei Budnik abud...@apache.org Date: Fri Sep 6 17:01:56 2019 +0200 Added `futureTracker` to the `SlaveOptions` in tests. `PendingFutureTracker` is shared across both the Mesos containerizer and the agent, so we need to add an option to be able to start a slave in tests with an instance of the `futureTracker` as a parameter. Review: https://reviews.apache.org/r/71454 {code} {code:java} commit 1122674a5c03894e4552d46cfa26dca0557a8f68 Author: Andrei Budnik Date: Fri Sep 6 13:25:35 2019 +0200 Implemented an integration test for /containerizer/debug endpoint. This test starts an agent with the MockIsolator to intercept calls to its `prepare` method, then it launches a task, which gets stuck. We check that the /containerizer/debug endpoint returns a non-empty list of pending futures including `MockIsolator::prepare`. After setting the promise for the `prepare`, the task successfully starts and we expect the /containerizer/debug endpoint to return an empty list of pending operations. Review: https://reviews.apache.org/r/71455 {code} > Implement tests for the `containerizer/debug` endpoint. > --- > > Key: MESOS-9843 > URL: https://issues.apache.org/jira/browse/MESOS-9843 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > > Implement tests for container stuck issues and check that the agent's > `containerizer/debug` endpoint returns a JSON object containing information > about pending operations. -- This message was sent by Atlassian Jira (v8.3.4#803005)
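The test flow described in the second commit (intercept `prepare`, observe a pending future, then set the promise) can be sketched with plain standard-library futures; the names below are illustrative, not the actual Mesos test code.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Sketch of the test's core idea, assuming only the C++ standard
// library: a mocked "prepare" step is backed by a promise. Until the
// promise is set, the tracked future stays pending (the state the
// /containerizer/debug endpoint would report); once it is set, the
// stuck "launch" can proceed and the pending list drains.
bool isPending(const std::shared_future<void>& f)
{
  // A zero-length wait polls the future's state without blocking.
  return f.wait_for(std::chrono::seconds(0)) != std::future_status::ready;
}
```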
[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volume
[ https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931645#comment-16931645 ] Andrei Budnik commented on MESOS-9969: -- Could you please provide steps to reproduce this bug? > Agent crashes when trying to clean up volue > --- > > Key: MESOS-9969 > URL: https://issues.apache.org/jira/browse/MESOS-9969 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.2 >Reporter: Tomas Barton >Priority: Major > > {code} > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081748 21828 > linux_launcher.cpp:650] Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/370ed262-4041-4180-a7e1-9ea78070e3a6' > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081876 21832 > containerizer.cpp:2907] Checkpointing termination state to nested container's > runtime directory > '/var/run/mesos/containers/8e3997e7-c53a-4043-9a7e-26a2e436a041/containers/ae0bdc6d-c738-4352-b5d4-7572182671d5/termination' > Sep 17 13:49:26 w03 mesos-agent[21803]: mesos-agent: > /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:120: T& > Option::get() & [with T = std::basic_string]: Assertion `isSome()' > failed. 
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** Aborted at 1568728166 (unix time) > try "date -d @1568728166" if you are using GNU date *** > Sep 17 13:49:26 w03 mesos-agent[21803]: W0917 13:49:26.082281 21835 > disk.cpp:453] Ignoring cleanup for unknown container > a9ba6959-ea02-4543-b7d5-92a63940 > Sep 17 13:49:26 w03 mesos-agent[21803]: PC: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: *** SIGABRT (@0x552b) received by PID > 21803 (TID 0x7f169e47d700) from PID 21803; stack trace: *** > Sep 17 13:49:26 w03 mesos-agent[21803]: E0917 13:49:26.082608 21835 > memory.cpp:501] Listening on OOM events failed for container > a9ba6959-ea02-4543-b7d5-92a63940: Event listener is terminating > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3be50e0 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3867fff (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a386942a (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860e67 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.083741 21835 > linux.cpp:1074] Unmounting volume > '/var/lib/mesos/slave/slaves/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-S17/frameworks/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-0003/executors/es01__coordinator__8591ac8e-3d9d-45ac-bb68-bee379c8c4a4/runs/a9ba6959-ea02-4543-b7d5-92a63940/container-path' > for con > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860f12 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7654f13 > _ZNR6OptionISsE3getEv.part.152 > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7666b2f > mesos::internal::slave::MesosContainerizerProcess::__destroy() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a861cb41 > process::ProcessBase::consume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a8633c9c > process::ProcessManager::resume() > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a86398a6 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > Sep 17 13:49:26 
w03 mesos-agent[21803]: @ 0x7f16a43c6200 (unknown) > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3bdb4a4 start_thread > Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a391dd0f (unknown) > Sep 17 13:49:26 w03 systemd[1]: dcos-mesos-slave.service: Main process > exited, code=killed, status=6/ABRT > {code} -- This message was sent by Atlassian Jira (v8.3.2#803003)
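The abort in the log above comes from calling `Option<T>::get()` while the option holds no value (the `isSome()` assertion fires). A simplified stand-in using `std::optional` illustrates the defensive pattern; this is a sketch of the idea, not stout's actual implementation.

```cpp
#include <cassert>
#include <optional>

// Simplified analogue of stout's Option<T>::getOrElse: unlike a bare
// get(), this never trips the `isSome()` assertion, because the
// "no value" case is handled explicitly with a fallback.
template <typename T>
T getOrElse(const std::optional<T>& opt, const T& fallback)
{
  return opt.has_value() ? *opt : fallback;
}
```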
[jira] [Commented] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface
[ https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924139#comment-16924139 ] Andrei Budnik commented on MESOS-9914: -- {code:java} commit 6c2a94ca0eca90e6d3517e4400f4529ddce0b84c Author: Andrei Budnik abud...@apache.org Date: Mon Sep 2 17:15:52 2019 +0200 Added `SlaveOptions` for wrapping all parameters of `StartSlave`. This patch introduces a `SlaveOptions` struct which holds optional parameters accepted by `cluster::Slave::create`. Added an overload of `StartSlave` that accepts `SlaveOptions`. It's the preferred way of creating and starting an instance of `cluster::Slave` in tests, since the underlying `cluster::Slave::create` accepts a long list of optional arguments, which might be extended in the future. Review: https://reviews.apache.org/r/71424 {code} > Refactor `MesosTest::StartSlave` in favour of builder style interface > - > > Key: MESOS-9914 > URL: https://issues.apache.org/jira/browse/MESOS-9914 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > > Every overload of the `MesosTest::StartSlave` method depends on the > `cluster::Slave::create` method, which accepts several arguments. In fact, > each overload of `MesosTest::StartSlave` accepts a subset of the arguments > that `cluster::Slave::create` accepts. Given that the latter accepts > 11 arguments at the moment, and there are already 13 overloads of > `MesosTest::StartSlave`, introducing a builder-style interface is very > desirable. It'd allow adding more arguments to `cluster::Slave::create` > without having to update all existing overloads. It would be a local > change, as it won't affect existing tests. > See [this > comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
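The builder style described in the commit can be sketched as follows; the struct fields and setter names here are hypothetical illustrations, not the actual `SlaveOptions` members.

```cpp
#include <cassert>
#include <string>

// Hypothetical sketch of a builder-style options struct: every setter
// returns *this, so callers chain only the options they need instead
// of choosing among many positional overloads. Adding a new field
// later does not break any existing call site.
struct SlaveOptions
{
  std::string id;
  bool enableRecovery = true;

  SlaveOptions& withId(const std::string& value)
  {
    id = value;
    return *this;
  }

  SlaveOptions& withRecovery(bool value)
  {
    enableRecovery = value;
    return *this;
  }
};
```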
[jira] [Commented] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface
[ https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920964#comment-16920964 ] Andrei Budnik commented on MESOS-9914: -- [https://reviews.apache.org/r/71424/] > Refactor `MesosTest::StartSlave` in favour of builder style interface > - > > Key: MESOS-9914 > URL: https://issues.apache.org/jira/browse/MESOS-9914 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > > Every overload of the `MesosTest::StartSlave` method depends on the > `cluster::Slave::create` method, which accepts several arguments. In fact, > each overload of `MesosTest::StartSlave` accepts a subset of the arguments > that `cluster::Slave::create` accepts. Given that the latter accepts > 11 arguments at the moment, and there are already 13 overloads of > `MesosTest::StartSlave`, introducing a builder-style interface is very > desirable. It'd allow adding more arguments to `cluster::Slave::create` > without having to update all existing overloads. It would be a local > change, as it won't affect existing tests. > See [this > comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Assigned] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface
[ https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9914: Assignee: Andrei Budnik > Refactor `MesosTest::StartSlave` in favour of builder style interface > - > > Key: MESOS-9914 > URL: https://issues.apache.org/jira/browse/MESOS-9914 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > > Every overload of the `MesosTest::StartSlave` method depends on the > `cluster::Slave::create` method, which accepts several arguments. In fact, > each overload of `MesosTest::StartSlave` accepts a subset of the arguments > that `cluster::Slave::create` accepts. Given that the latter accepts > 11 arguments at the moment, and there are already 13 overloads of > `MesosTest::StartSlave`, introducing a builder-style interface is very > desirable. It'd allow adding more arguments to `cluster::Slave::create` > without having to update all existing overloads. It would be a local > change, as it won't affect existing tests. > See [this > comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177]. -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915755#comment-16915755 ] Andrei Budnik commented on MESOS-9887: -- {code:java} commit 8aae23ec7cd4bc50532df0b1d1ea6ec23ce078f8 Author: Andrei Budnik abud...@apache.org Date: Fri Aug 23 14:36:18 2019 +0200 Added missing `return` statement in `Slave::statusUpdate`. Previously, if `statusUpdate` was called for a pending task, it would forward the status update and then continue executing `statusUpdate`, which then checks if there is an executor that is aware of this task. Given that a pending task is not known to any executor, it would always handle it by forwarding status update one more time. This patch adds missing `return` statement, which fixes the issue. Review: https://reviews.apache.org/r/71361 {code} {code:java} commit f0be23765531b05661ed7f1b124faf96744aa80b Author: Andrei Budnik abud...@apache.org Date: Tue Aug 20 19:24:44 2019 +0200 Fixed out-of-order processing of terminal status updates in agent. Previously, Mesos agent could send TASK_FAILED status update on executor termination while processing of TASK_FINISHED status update was in progress. Processing of task status updates involves sending requests to the containerizer, which might finish processing of these requests out-of-order, e.g. `MesosContainerizer::status`. Also, the agent does not overwrite status of the terminal status update once it's stored in the `terminatedTasks`. Hence, there was a race condition between two terminal status updates. Note that V1 Executors are not affected by this problem because they wait for an acknowledgement of the terminal status update by the agent before terminating. This patch introduces a new data structure `pendingStatusUpdates`, which holds a list of status updates that are being processed. This data structure allows validating the order of processing of status updates by the agent. 
Review: https://reviews.apache.org/r/71343 {code} > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt > > > h2. Overview > Expected behavior: > Task successfully finishes and sends TASK_FINISHED status update. > Observed behavior: > Task successfully finishes, but the agent sends TASK_FAILED with the reason > "REASON_EXECUTOR_TERMINATED". > In normal circumstances, Docker executor > [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] > final status update TASK_FINISHED to the agent, which then [gets > processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] > by the agent before termination of the executor's process. > However, if the processing of the initial TASK_FINISHED gets delayed, then > there is a chance that Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED which will [be > handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] > prior to the TASK_FINISHED status update. > See attached logs which contain an example of the race condition. > h2. Reproducing bug > 1. Add the following code: > {code:java} > static int c = 0; > if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates. 
> ::sleep(2); > } > {code} > to the > [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] > and to the > [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. > 2. Recompile mesos > 3. Launch mesos master and agent locally > 4. Launch a simple Docker task via `mesos-execute`: > {code} > # cd build > ./src/mesos-execute --master="`hostname`:5050" --name="a" > --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" > --command="ls" > {code} > h2. Race condition - description > 1. Mesos agent receives TASK_FINISHED status update and then subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. > 2. `containerizer->status()` operation for TASK_FINISHED status update gets > delayed in the
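The fix described in the second commit of the comment above (a `pendingStatusUpdates` structure that validates the ordering of status updates) rests on a simple invariant: only the first terminal status update for a task should win. A minimal sketch of that invariant, with illustrative names rather than the actual slave.cpp code:

```cpp
#include <cassert>
#include <set>
#include <string>

// Illustrative sketch, not the actual Mesos agent code: record the
// first terminal status update per task and ignore any later
// conflicting one (e.g. a TASK_FAILED generated on executor exit
// racing with an in-flight TASK_FINISHED).
struct TerminalUpdateTracker
{
  std::set<std::string> terminatedTasks;

  // Returns true if this is the first terminal update for the task
  // and should be processed; false if one was already recorded.
  bool accept(const std::string& taskId)
  {
    return terminatedTasks.insert(taskId).second;
  }
};
```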
[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558 ] Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM: [https://reviews.apache.org/r/71361/ https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/] was (Author: abudnik): https://reviews.apache.org/r/71343/ > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt > > > h2. Overview > Expected behavior: > Task successfully finishes and sends TASK_FINISHED status update. > Observed behavior: > Task successfully finishes, but the agent sends TASK_FAILED with the reason > "REASON_EXECUTOR_TERMINATED". > In normal circumstances, Docker executor > [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] > final status update TASK_FINISHED to the agent, which then [gets > processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] > by the agent before termination of the executor's process. > However, if the processing of the initial TASK_FINISHED gets delayed, then > there is a chance that Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED which will [be > handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] > prior to the TASK_FINISHED status update. > See attached logs which contain an example of the race condition. > h2. Reproducing bug > 1. 
Add the following code: > {code:java} > static int c = 0; > if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates. > ::sleep(2); > } > {code} > to the > [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] > and to the > [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. > 2. Recompile mesos > 3. Launch mesos master and agent locally > 4. Launch a simple Docker task via `mesos-execute`: > {code} > # cd build > ./src/mesos-execute --master="`hostname`:5050" --name="a" > --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" > --command="ls" > {code} > h2. Race condition - description > 1. Mesos agent receives TASK_FINISHED status update and then subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. > 2. `containerizer->status()` operation for TASK_FINISHED status update gets > delayed in the composing containerizer (e.g. due to switch of the worker > thread that executes `status` method). > 3. Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED. > 4. Docker containerizer destroys the container. A registered callback for the > `containerizer->wait` call in the composing containerizer dispatches [lambda > function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] > that will clean up `containers_` map. > 5. 
Composing c'zer resumes and dispatches > `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]` > method to the Docker containerizer for TASK_FINISHED, which in turn hangs > for a few seconds. > 6. Corresponding `containerId` gets removed from the `containers_` map of the > composing c'zer. > 7. Mesos agent subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] > for the TASK_FAILED status update. > 8. Composing c'zer returns ["Container not > found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] > for TASK_FAILED. > 9. > `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]` > stores TASK_FAILED terminal status update in
[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558 ] Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM: [https://reviews.apache.org/r/71361/] [https://reviews.apache.org/r/71343/] was (Author: abudnik): [https://reviews.apache.org/r/71361/ https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/] > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt > > > h2. Overview > Expected behavior: > Task successfully finishes and sends TASK_FINISHED status update. > Observed behavior: > Task successfully finishes, but the agent sends TASK_FAILED with the reason > "REASON_EXECUTOR_TERMINATED". > In normal circumstances, Docker executor > [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] > final status update TASK_FINISHED to the agent, which then [gets > processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] > by the agent before termination of the executor's process. > However, if the processing of the initial TASK_FINISHED gets delayed, then > there is a chance that Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED which will [be > handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] > prior to the TASK_FINISHED status update. > See attached logs which contain an example of the race condition. > h2. Reproducing bug > 1. 
Add the following code: > {code:java} > static int c = 0; > if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates. > ::sleep(2); > } > {code} > to the > [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] > and to the > [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. > 2. Recompile mesos > 3. Launch mesos master and agent locally > 4. Launch a simple Docker task via `mesos-execute`: > {code} > # cd build > ./src/mesos-execute --master="`hostname`:5050" --name="a" > --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" > --command="ls" > {code} > h2. Race condition - description > 1. Mesos agent receives TASK_FINISHED status update and then subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. > 2. `containerizer->status()` operation for TASK_FINISHED status update gets > delayed in the composing containerizer (e.g. due to switch of the worker > thread that executes `status` method). > 3. Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED. > 4. Docker containerizer destroys the container. A registered callback for the > `containerizer->wait` call in the composing containerizer dispatches [lambda > function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] > that will clean up `containers_` map. > 5. 
Composing c'zer resumes and dispatches > `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]` > method to the Docker containerizer for TASK_FINISHED, which in turn hangs > for a few seconds. > 6. Corresponding `containerId` gets removed from the `containers_` map of the > composing c'zer. > 7. Mesos agent subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] > for the TASK_FAILED status update. > 8. Composing c'zer returns ["Container not > found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] > for TASK_FAILED. > 9. > `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]` > stores TASK_FAILED terminal status update in the executor's data structure. > 10. Docker containerizer resumes and finishes processing of `status()` method > for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` > continuation.
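The essence of steps 5-9 above — two `status()` requests completing in the opposite order from their dispatch — can be sketched with plain threads. This is an illustrative model only, not Mesos/libprocess code, and all names in it are hypothetical:

```cpp
// Models the race: the status() request for TASK_FINISHED is dispatched
// first but stalls, so the later TASK_FAILED request completes first and
// its (terminal) update is handled ahead of TASK_FINISHED.
#include <chrono>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

std::vector<std::string> completionOrder()
{
  std::vector<std::string> order;
  std::mutex m;

  auto request = [&](const std::string& update, int delayMs) {
    // Simulates the containerizer working on a status() request.
    std::this_thread::sleep_for(std::chrono::milliseconds(delayMs));
    std::lock_guard<std::mutex> lock(m);
    order.push_back(update);
  };

  // Step 5: TASK_FINISHED's status() hangs in the Docker containerizer.
  std::thread finished(request, std::string("TASK_FINISHED"), 200);
  // Step 8: TASK_FAILED's status() fails fast with "Container not found".
  std::thread failed(request, std::string("TASK_FAILED"), 10);

  finished.join();
  failed.join();
  return order;  // TASK_FAILED lands first, ahead of TASK_FINISHED.
}
```

Any fix therefore cannot rely on the completion order of requests sent to the containerizer.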
[jira] [Commented] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.
[ https://issues.apache.org/jira/browse/MESOS-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913441#comment-16913441 ] Andrei Budnik commented on MESOS-9844: -- http://mesos.apache.org/documentation/latest/endpoints/slave/containerizer/debug/ > Update documentation describing `containerizer/debug` endpoint. > --- > > Key: MESOS-9844 > URL: https://issues.apache.org/jira/browse/MESOS-9844 > Project: Mesos > Issue Type: Documentation > Components: containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > -- This message was sent by Atlassian Jira (v8.3.2#803003)
[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558 ] Andrei Budnik commented on MESOS-9887: -- https://reviews.apache.org/r/71343/ > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt
[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912400#comment-16912400 ] Andrei Budnik commented on MESOS-9887: -- Discarding these patches ^^ since multiple consecutive requests to the underlying containerizer might finish in a different order than they were sent. Hence, the agent should not rely on the order of completion of requests sent to the containerizer. > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt
[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.
[ https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911305#comment-16911305 ] Andrei Budnik commented on MESOS-9836: -- Shall we deprecate the option to run a custom executor in a Docker container? If no one responds to our proposal in dev@ & user@ mailing lists, then we can safely deprecate this feature. > Docker containerizer overwrites `/mesos/slave` cgroups. > --- > > Key: MESOS-9836 > URL: https://issues.apache.org/jira/browse/MESOS-9836 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Chun-Hung Hsiao >Priority: Critical > Labels: docker, mesosphere > > The following bug was observed on our internal testing cluster. > The docker containerizer launched a container on an agent: > {noformat} > I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container > 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task > 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor > 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework > 415284b7-2967-407d-b66f-f445e93f064e-0011 > I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to > '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid' > {noformat} > After the container was launched, the docker containerizer did a {{docker > inspect}} on the container and cached the pid: > > [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764] > The pid should be slightly greater than 13716. 
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes > later: > {noformat} > I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update > TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task > apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework > 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244 > {noformat} > After receiving the terminal status update, the agent asked the docker > containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and > {{memory.soft_limit_in_bytes}} of the container through the cached pid: > > [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696] > {noformat} > I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at > /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > I0523 06:16:17.889816 21815 docker.cpp:1937] Updated > 'memory.soft_limit_in_bytes' to 32MB for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > {noformat} > Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was > possibly because that over the 16 minutes the pid got reused: > {noformat} > # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz > ... > I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to > 'mesos_executors.slice' > I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to > 'mesos_executors.slice' > I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to > 'mesos_executors.slice' > ... 
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to > 'mesos_executors.slice' > I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to > 'mesos_executors.slice' > I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to > 'mesos_executors.slice' > I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to > 'mesos_executors.slice' > I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to > 'mesos_executors.slice' > I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to > 'mesos_executors.slice' > ... > I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to > 'mesos_executors.slice' > I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to > 'mesos_executors.slice' > I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to > 'mesos_executors.slice' > ... > {noformat} > It was highly likely that the container itself exited around 06:09:35, way > before the docker executor detected and reported the terminal status update, > and then its pid was reused by
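A common guard against this kind of pid recycling — purely an illustrative sketch, not what the Docker containerizer actually does — is to checkpoint the process start time alongside the pid and compare it before trusting the cached pid:

```cpp
// If the start time read now differs from the one recorded when the pid
// was checkpointed, the pid has been recycled by an unrelated process and
// must not be used for cgroup updates. (Hypothetical types and names.)
#include <cstdint>

struct CheckpointedPid
{
  int32_t pid;
  uint64_t startTicks;  // e.g. field 22 of /proc/<pid>/stat at checkpoint.
};

// Returns true only when the pid still refers to the original process.
bool isSamePid(const CheckpointedPid& checkpointed, uint64_t currentStartTicks)
{
  return checkpointed.startTicks == currentStartTicks;
}
```

With such a check, the `update` call above would have been skipped instead of rewriting the `/mesos/slave` cgroup through a recycled pid.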
[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.
[ https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908250#comment-16908250 ] Andrei Budnik commented on MESOS-9836: -- {quote} So what is the purpose of Docker containerizer's update method? {quote} As Mesos provides an option to run a Docker image as a (custom?) executor, it might make sense to update the Docker container's resources (the executor plus its tasks running in the Docker container) in cgroups. If this is the case, we should probably deprecate such an option. Ignoring `update` for the Docker c'zer sounds like a good idea. > Docker containerizer overwrites `/mesos/slave` cgroups. > --- > > Key: MESOS-9836 > URL: https://issues.apache.org/jira/browse/MESOS-9836 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Chun-Hung Hsiao >Priority: Critical > Labels: docker, mesosphere
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908065#comment-16908065 ] Andrei Budnik commented on MESOS-9936: -- How to reproduce the issue? Could you please share an app definition or provide steps to reproduce? Also, there must be more log lines between "Recovering provisioner" and "Finished recovering all containerizers". At least, "Provisioner recovery complete". Is there anything else between these 2 log lines? > Slave recovery is very slow with high local volume persistant ( marathon app ) > -- > > Key: MESOS-9936 > URL: https://issues.apache.org/jira/browse/MESOS-9936 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.8.1 >Reporter: Frédéric Comte >Priority: Major > > I run some local persistant applications.. > After an unplannified shutdown of nodes running this kind of applications, I > see that the recovery process of mesos is taking a lot of time (more than 8 > hours)... > This time depends of the amount of data in those volumes. > What does Mesos do in this process ? 
> {code:java} > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 > docker.cpp:890] Recovering Docker containers Jul 08 07:40:44 boss1 > mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] > Recovering Mesos containers > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 > linux_launcher.cpp:286] Recovering Linux launcher > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 > containerizer.cpp:1127] Recovering isolators > Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 > containerizer.cpp:1166] Recovering provisioner > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 > composing.cpp:339] Finished recovering all containerizers > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 > status_update_manager_process.hpp:314] Recovering operation status update > manager > Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 > slave.cpp:7729] Recovering executors > {code} -- This message was sent by Atlassian JIRA (v7.6.14#76016)
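For reference, the gap between the "Recovering provisioner" and "Finished recovering all containerizers" lines in the log above is just over seven hours. A throwaway helper (illustrative only, not Mesos code) to compute it:

```cpp
// Computes the wall-clock gap between the two log timestamps above:
// 07:40:44 ("Recovering provisioner") -> 14:42:10 ("Finished recovering
// all containerizers"), both on Jul 08.
#include <string>

// Parses "HH:MM:SS" into seconds since midnight.
int toSeconds(const std::string& hhmmss)
{
  return std::stoi(hhmmss.substr(0, 2)) * 3600 +
         std::stoi(hhmmss.substr(3, 2)) * 60 +
         std::stoi(hhmmss.substr(6, 2));
}

int recoveryGapSeconds()
{
  return toSeconds("14:42:10") - toSeconds("07:40:44");  // 25286s, ~7h 1m.
}
```

So the time is spent almost entirely between "Recovering provisioner" and the end of containerizer recovery, which is why the comment above asks for any log lines in that window.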
[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )
[ https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906165#comment-16906165 ] Andrei Budnik commented on MESOS-9936: -- [~Fcomte] what version of Mesos are you using? > Slave recovery is very slow with high local volume persistant ( marathon app ) > -- > > Key: MESOS-9936 > URL: https://issues.apache.org/jira/browse/MESOS-9936 > Project: Mesos > Issue Type: Bug > Components: agent >Reporter: Frédéric Comte >Priority: Major -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
[ https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9887: Assignee: Andrei Budnik > Race condition between two terminal task status updates for Docker executor. > > > Key: MESOS-9887 > URL: https://issues.apache.org/jira/browse/MESOS-9887 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Blocker > Labels: agent, containerization > Attachments: race_example.txt > > > h2. Overview > Expected behavior: > Task successfully finishes and sends TASK_FINISHED status update. > Observed behavior: > Task successfully finishes, but the agent sends TASK_FAILED with the reason > "REASON_EXECUTOR_TERMINATED". > In normal circumstances, Docker executor > [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] > final status update TASK_FINISHED to the agent, which then [gets > processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] > by the agent before termination of the executor's process. > However, if the processing of the initial TASK_FINISHED gets delayed, then > there is a chance that Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED which will [be > handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] > prior to the TASK_FINISHED status update. > See attached logs which contain an example of the race condition. > h2. Reproducing bug > 1. Add the following code: > {code:java} > static int c = 0; > if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates. 
> ::sleep(2); > } > {code} > to the > [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] > and to the > [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. > 2. Recompile mesos > 3. Launch mesos master and agent locally > 4. Launch a simple Docker task via `mesos-execute`: > {code} > # cd build > ./src/mesos-execute --master="`hostname`:5050" --name="a" > --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" > --command="ls" > {code} > h2. Race condition - description > 1. Mesos agent receives TASK_FINISHED status update and then subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. > 2. `containerizer->status()` operation for TASK_FINISHED status update gets > delayed in the composing containerizer (e.g. due to switch of the worker > thread that executes `status` method). > 3. Docker executor terminates and the agent > [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] > TASK_FAILED. > 4. Docker containerizer destroys the container. A registered callback for the > `containerizer->wait` call in the composing containerizer dispatches [lambda > function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] > that will clean up `containers_` map. > 5. Composing c'zer resumes and dispatches > `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]` > method to the Docker containerizer for TASK_FINISHED, which in turn hangs > for a few seconds. > 6. 
Corresponding `containerId` gets removed from the `containers_` map of the > composing c'zer. > 7. Mesos agent subscribes on > [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] > for the TASK_FAILED status update. > 8. Composing c'zer returns ["Container not > found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] > for TASK_FAILED. > 9. > `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]` > stores TASK_FAILED terminal status update in the executor's data structure. > 10. Docker containerizer resumes and finishes processing of `status()` method > for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` > continuation. This method > [discovers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5808-L5814] > that the executor has already been destroyed.
[jira] [Created] (MESOS-9926) Assertion failed in Master for `Slave::apply` while running `UnreserveVolumeResources` test.
Andrei Budnik created MESOS-9926: Summary: Assertion failed in Master for `Slave::apply` while running `UnreserveVolumeResources` test. Key: MESOS-9926 URL: https://issues.apache.org/jira/browse/MESOS-9926 Project: Mesos Issue Type: Bug Components: master, test Environment: Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose --disable-libtool-wrappers --disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh 2>&1 | tee build_71197"] Reporter: Andrei Budnik Attachments: UnreserveVolumeResources-badrun.txt `PersistentVolumeEndpointsTest.UnreserveVolumeResources` test failed: {code:java} F0806 02:52:55.479373 18920 master.cpp:13789] CHECK_SOME(resources): ports:[31000-32000]; cpus:24; mem:95641; disk(reservations: [(DYNAMIC,role1,test-principal)]):960; disk(reservations: [(DYNAMIC,role1,test-principal)])[id1:path1]:64 does not contain disk(reservations: [(DYNAMIC,role1,test-principal)]):1024 *** Check failure stack trace: *** @ 0x2b2180332cf6 google::LogMessage::Fail() @ 0x2b2180332c3e google::LogMessage::SendToLog() @ 0x2b21803325e8 google::LogMessage::Flush() @ 0x2b2180335a12 google::LogMessageFatal::~LogMessageFatal() @ 0x56408e20bafc _CheckFatal::~_CheckFatal() @ 0x2b217dc362b7 mesos::internal::master::Slave::apply() @ 0x2b217dc2c197 mesos::internal::master::Master::_apply() @ 0x2b217dcaa5ab _ZZN7process8dispatchIN5mesos8internal6master6MasterEPNS3_5SlaveEPNS3_9FrameworkERKNS1_15Offer_OperationES6_S8_SB_EEvRKNS_3PIDIT_EEMSD_FvT0_T1_T2_EOT3_OT4_OT5_ENKUlOS6_OS8_OS9_PNS_11ProcessBaseEE_clESS_ST_SU_SW_ @ 0x2b217dd556c5 _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterEPNS5_5SlaveEPNS5_9FrameworkERKNS3_15Offer_OperationES8_SA_SD_EEvRKNS1_3PIDIT_EEMSF_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS8_OSA_OSB_PNS1_11ProcessBaseEE_JS8_SA_SB_SY_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS10_ @ 0x2b217dd4e482 
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS6_5SlaveEPNS6_9FrameworkERKNS4_15Offer_OperationES9_SB_SE_EEvRKNS2_3PIDIT_EEMSG_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS9_OSB_OSC_PNS2_11ProcessBaseEE_JS9_SB_SC_St12_PlaceholderILi113invoke_expandIS10_St5tupleIJS9_SB_SC_S12_EES15_IJOSZ_EEJLm0ELm1ELm2ELm3DTcl6invokecl7forwardIT_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIT0_Efp0_EEcl7forwardIT1_Efp2_OS19_OS1A_N5cpp1416integer_sequenceImJXspT2_OS1B_ @ 0x2b217dd49853 _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS6_5SlaveEPNS6_9FrameworkERKNS4_15Offer_OperationES9_SB_SE_EEvRKNS2_3PIDIT_EEMSG_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS9_OSB_OSC_PNS2_11ProcessBaseEE_IS9_SB_SC_St12_PlaceholderILi1clIISZ_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1ELm2ELm3_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS18_ I0806 02:52:55.928766 18910 status_update_manager_process.hpp:528] Forwarding operation status update OPERATION_FINISHED (Status UUID: 679c9f27-3130-4188-8c9a-07eccc25ae78) for operation UUID 0b856527-bcaa-4595-aeab-47505dff5aa6 on agent ba6f270f-d8c7-4b59-b5ce-6b497fe89d7c-S0 @ 0x2b217dd46ac5 _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS8_5SlaveEPNS8_9FrameworkERKNS6_15Offer_OperationESB_SD_SG_EEvRKNS4_3PIDIT_EEMSI_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSB_OSD_OSE_PNS4_11ProcessBaseEE_ISB_SD_SE_St12_PlaceholderILi1EIS11_EEEDTclcl7forwardISI_Efp_Espcl7forwardIT0_Efp0_EEEOSI_DpOS16_ @ 0x2b217dd43fc1 _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS9_5SlaveEPNS9_9FrameworkERKNS7_15Offer_OperationESC_SE_SH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSC_OSE_OSF_PNS5_11ProcessBaseEE_JSC_SE_SF_St12_PlaceholderILi1EJS12_EEEvOSJ_DpOT0_ @ 0x2b217dd4144d 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterEPNSC_5SlaveEPNSC_9FrameworkERKNSA_15Offer_OperationESF_SH_SK_EEvRKNS1_3PIDIT_EEMSM_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSF_OSH_OSI_S3_E_JSF_SH_SI_St12_PlaceholderILi1EEclEOS3_ @ 0x2b218024eb51 _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_ @ 0x2b2180216927 process::ProcessBase::consume() @ 0x2b218023c5d2 _ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE @ 0x56408e20c7e8 process::ProcessBase::serve() @ 0x2b2180213539 process::ProcessManager::resume() @ 0x2b218020f886 _ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv @ 0x2b2180237086
[jira] [Created] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface
Andrei Budnik created MESOS-9914: Summary: Refactor `MesosTest::StartSlave` in favour of builder style interface Key: MESOS-9914 URL: https://issues.apache.org/jira/browse/MESOS-9914 Project: Mesos Issue Type: Improvement Components: test Reporter: Andrei Budnik Every overload of the `MesosTest::StartSlave` method depends on the `cluster::Slave::create` method, which accepts several arguments. In fact, each overload of `MesosTest::StartSlave` accepts a subset of the arguments that `cluster::Slave::create` accepts. Given that the latter currently accepts 11 arguments, and that there are already 13 overloads of `MesosTest::StartSlave`, introducing a builder-style interface is very desirable. It would allow adding more arguments to `cluster::Slave::create` without having to update all existing overloads. It would be a local change, as it won't affect existing tests. See [this comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177]. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.
[ https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892844#comment-16892844 ] Andrei Budnik commented on MESOS-9836: -- A typical cgroup for Docker containers looks like: {code:java} /system.slice/docker-3a91c29381522918a2f2cad05583b172f415da4010bad672c21a19356aec1d69.scope {code} Probably we should filter out all cgroups that don't contain the "docker" substring, instead of (or in addition to) filtering [the system root cgroup|https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1783-L1788]. It's ugly and hacky, and it introduces a dependency on Docker's runtime. > Docker containerizer overwrites `/mesos/slave` cgroups. > --- > > Key: MESOS-9836 > URL: https://issues.apache.org/jira/browse/MESOS-9836 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Chun-Hung Hsiao >Priority: Critical > Labels: docker, mesosphere > > The following bug was observed on our internal testing cluster. 
> The docker containerizer launched a container on an agent: > {noformat} > I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container > 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task > 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor > 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework > 415284b7-2967-407d-b66f-f445e93f064e-0011 > I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to > '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid' > {noformat} > After the container was launched, the docker containerizer did a {{docker > inspect}} on the container and cached the pid: > > [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764] > The pid should be slightly greater than 13716. 
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes > later: > {noformat} > I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update > TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task > apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework > 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244 > {noformat} > After receiving the terminal status update, the agent asked the docker > containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and > {{memory.soft_limit_in_bytes}} of the container through the cached pid: > > [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696] > {noformat} > I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at > /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to > 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > I0523 06:16:17.889816 21815 docker.cpp:1937] Updated > 'memory.soft_limit_in_bytes' to 32MB for container > f69c8a8c-eba4-4494-a305-0956a44a6ad2 > {noformat} > Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was > possibly because that over the 16 minutes the pid got reused: > {noformat} > # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz > ... > I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to > 'mesos_executors.slice' > I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to > 'mesos_executors.slice' > I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to > 'mesos_executors.slice' > ... 
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to > 'mesos_executors.slice' > I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to > 'mesos_executors.slice' > I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to > 'mesos_executors.slice' > I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to > 'mesos_executors.slice' > I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to > 'mesos_executors.slice' > I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to > 'mesos_executors.slice' > ... > I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to > 'mesos_executors.slice' > I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to > 'mesos_executors.slice' > I0523
[jira] [Created] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.
Andrei Budnik created MESOS-9887: Summary: Race condition between two terminal task status updates for Docker executor. Key: MESOS-9887 URL: https://issues.apache.org/jira/browse/MESOS-9887 Project: Mesos Issue Type: Bug Components: agent, containerization Reporter: Andrei Budnik Attachments: race_example.txt h2. Overview Expected behavior: Task successfully finishes and sends TASK_FINISHED status update. Observed behavior: Task successfully finishes, but the agent sends TASK_FAILED with the reason "REASON_EXECUTOR_TERMINATED". In normal circumstances, Docker executor [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758] final status update TASK_FINISHED to the agent, which then [gets processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543] by the agent before termination of the executor's process. However, if the processing of the initial TASK_FINISHED gets delayed, then there is a chance that Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED which will [be handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826] prior to the TASK_FINISHED status update. See attached logs which contain an example of the race condition. h2. Reproducing bug 1. Add the following code: {code:java} static int c = 0; if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates. ::sleep(2); } {code} to the [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578] and to the [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167]. 2. Recompile mesos 3. 
Launch mesos master and agent locally 4. Launch a simple Docker task via `mesos-execute`: {code} # cd build ./src/mesos-execute --master="`hostname`:5050" --name="a" --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" --command="ls" {code} h2. Race condition - description 1. Mesos agent receives TASK_FINISHED status update and then subscribes on [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]. 2. `containerizer->status()` operation for TASK_FINISHED status update gets delayed in the composing containerizer (e.g. due to switch of the worker thread that executes `status` method). 3. Docker executor terminates and the agent [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662] TASK_FAILED. 4. Docker containerizer destroys the container. A registered callback for the `containerizer->wait` call in the composing containerizer dispatches [lambda function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373] that will clean up `containers_` map. 5. Composing c'zer resumes and dispatches `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]` method to the Docker containerizer for TASK_FINISHED, which in turn hangs for a few seconds. 6. Corresponding `containerId` gets removed from the `containers_` map of the composing c'zer. 7. Mesos agent subscribes on [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761] for the TASK_FAILED status update. 8. Composing c'zer returns ["Container not found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576] for TASK_FAILED. 9. 
`[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]` stores TASK_FAILED terminal status update in the executor's data structure. 10. Docker containerizer resumes and finishes processing of `status()` method for TASK_FINISHED. Finally, it returns control to the `Slave::_statusUpdate` continuation. This method [discovers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5808-L5814] that the executor has already been destroyed. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.
Andrei Budnik created MESOS-9844: Summary: Update documentation describing `containerizer/debug` endpoint. Key: MESOS-9844 URL: https://issues.apache.org/jira/browse/MESOS-9844 Project: Mesos Issue Type: Documentation Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9843) Implement tests for the `containerizer/debug` endpoint.
Andrei Budnik created MESOS-9843: Summary: Implement tests for the `containerizer/debug` endpoint. Key: MESOS-9843 URL: https://issues.apache.org/jira/browse/MESOS-9843 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik Implement tests for container stuck issues and check that the agent's `containerizer/debug` endpoint returns a JSON object containing information about pending operations. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9842) Implement tests for the `FutureTracker` class and for its helper functions.
Andrei Budnik created MESOS-9842: Summary: Implement tests for the `FutureTracker` class and for its helper functions. Key: MESOS-9842 URL: https://issues.apache.org/jira/browse/MESOS-9842 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9841) Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer.
Andrei Budnik created MESOS-9841: Summary: Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer. Key: MESOS-9841 URL: https://issues.apache.org/jira/browse/MESOS-9841 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9840) Implement `LauncherTracker` class.
Andrei Budnik created MESOS-9840: Summary: Implement `LauncherTracker` class. Key: MESOS-9840 URL: https://issues.apache.org/jira/browse/MESOS-9840 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Assignee: Andrei Budnik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9839) Implement `IsolatorTracker` class.
[ https://issues.apache.org/jira/browse/MESOS-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9839: Assignee: Andrei Budnik Labels: containerization (was: ) Component/s: containerization Issue Type: Task (was: Bug) > Implement `IsolatorTracker` class. > -- > > Key: MESOS-9839 > URL: https://issues.apache.org/jira/browse/MESOS-9839 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9839) Implement `IsolatorTracker` class.
Andrei Budnik created MESOS-9839: Summary: Implement `IsolatorTracker` class. Key: MESOS-9839 URL: https://issues.apache.org/jira/browse/MESOS-9839 Project: Mesos Issue Type: Bug Reporter: Andrei Budnik -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9838) Leaked HTTP input connection between agent and IOSwitchboard when launched with TTY enabled.
Andrei Budnik created MESOS-9838: Summary: Leaked HTTP input connection between agent and IOSwitchboard when launched with TTY enabled. Key: MESOS-9838 URL: https://issues.apache.org/jira/browse/MESOS-9838 Project: Mesos Issue Type: Bug Components: agent Reporter: Andrei Budnik Steps to reproduce: 1) Launch a TTY container. 2) Send the `ATTACH_CONTAINER_INPUT` request to the agent via an HTTP connection. 3) Close the TCP socket used to send `ATTACH_CONTAINER_INPUT`. 4) Send another `ATTACH_CONTAINER_INPUT` request to the agent - the agent returns a `409 Conflict` HTTP error. For each incoming `ATTACH_CONTAINER_INPUT` request, the agent creates an HTTP connection to the IOSwitchboard via a unix socket. This connection is used to retransmit client requests to the IOSwitchboard. The IOSwitchboard closes this connection automatically once the client closes its HTTP connection to the agent: for more details, see the HTTP handlers in [the agent|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/http.cpp#L3105-L3116] and in the [IOSwitchboard|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/containerizer/mesos/io/switchboard.cpp#L1665-L1758]. The IOSwitchboard does not allow [multiple input connections|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/containerizer/mesos/io/switchboard.cpp#L1654-L1657]. Currently, the IOSwitchboard does not close the HTTP connection for `ATTACH_CONTAINER_INPUT` in the case described above. Hence, the IOSwitchboard returns an error for subsequent attempts to attach to the container input. The root cause needs to be found. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9837) Implement `FutureTracker` class along with helper functions.
[ https://issues.apache.org/jira/browse/MESOS-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9837: Assignee: Andrei Budnik > Implement `FutureTracker` class along with helper functions. > > > Key: MESOS-9837 > URL: https://issues.apache.org/jira/browse/MESOS-9837 > Project: Mesos > Issue Type: Task > Components: containerization >Reporter: Andrei Budnik >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > > Both `track()` and `pending_futures()` helper functions depend on the > `FutureTracker` actor. > `FutureTracker` actor must be available globally and there must be only one > instance of this actor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9837) Implement `FutureTracker` class along with helper functions.
Andrei Budnik created MESOS-9837: Summary: Implement `FutureTracker` class along with helper functions. Key: MESOS-9837 URL: https://issues.apache.org/jira/browse/MESOS-9837 Project: Mesos Issue Type: Improvement Components: containerization Reporter: Andrei Budnik Both the `track()` and `pending_futures()` helper functions depend on the `FutureTracker` actor. The `FutureTracker` actor must be available globally, and there must be only one instance of this actor. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9756) Introduce a container debug endpoint.
[ https://issues.apache.org/jira/browse/MESOS-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9756: Assignee: Andrei Budnik > Introduce a container debug endpoint. > - > > Key: MESOS-9756 > URL: https://issues.apache.org/jira/browse/MESOS-9756 > Project: Mesos > Issue Type: Epic > Components: containerization >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Major > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Deleted] (MESOS-9830) Implement the container debug endpoint on slave/http.cpp
[ https://issues.apache.org/jira/browse/MESOS-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik deleted MESOS-9830: - > Implement the container debug endpoint on slave/http.cpp > > > Key: MESOS-9830 > URL: https://issues.apache.org/jira/browse/MESOS-9830 > Project: Mesos > Issue Type: Task >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Major > Labels: containerization > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues
[ https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849924#comment-16849924 ] Andrei Budnik commented on MESOS-9800: -- Thanks for filing a detailed ticket! Hope [~kaysoky] might help you with this issue. > libarchive cannot extract tarfile due to UTF-8 encoding issues > -- > > Key: MESOS-9800 > URL: https://issues.apache.org/jira/browse/MESOS-9800 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.7.2 > Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu > 16.04. >Reporter: Felipe Alfaro Solana >Priority: Major > > Starting with Mesos 1.7, the following change has been introduced: > * [MESOS-8064] - Mesos now requires libarchive to programmatically decode > .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 > is bundled in Mesos. > However, this version of libarchive, which is used by the fetcher component in > Mesos, has problems dealing with archive files (.tar and .zip) which > contain UTF-8 characters. We run Marathon on top of Mesos, and one of our > Marathon applications relies on a .tar file which contains symlinks whose > target contains certain UTF-8 characters (Turkish) or the symlink name itself > contains UTF-8 characters. 
Mesos fetcher is unable to extract the archive and > fails with the following error: > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.791250 6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch > '/tmp/certificates.tar.gz': Failed to extract archive > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz' > to > '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': > Failed to read archive header: Linkname can't be converted from UTF-8 to > current locale.}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End > fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}} > {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 > 10:47:30.846695 4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to > fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited > with status 1}} > The same Marathon application works fine with Mesos 1.6 which does not use > libarchive. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup
[ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849042#comment-16849042 ] Andrei Budnik edited comment on MESOS-9306 at 5/27/19 4:10 PM: --- The patch `/r/70609/` was discarded. If `cgroups::destroy` hangs due to a blocking system call caused by a kernel bug, then there is no workaround available on Mesos side to fix the issue. In this case, we could only help an operator to detect the problem. This can be achieved by introducing a debug endpoint for the Mesos containerizer, see MESOS-9756. was (Author: abudnik): The patch `/r/70609/` was discarded. If `cgroups::destroy` hangs due to a blocking system call caused by a kernel bug, then there is no workaround available on Mesos side to fix the issue. In this case, we could only help an operator to detect the problem. This could be done by introducing a debug endpoint for the Mesos containerizer, see MESOS-9756. > Mesos containerizer can get stuck during cgroup cleanup > --- > > Key: MESOS-9306 > URL: https://issues.apache.org/jira/browse/MESOS-9306 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Greg Mann >Assignee: Andrei Budnik >Priority: Critical > Labels: containerizer, mesosphere > > I observed a task group's executor container which failed to be completely > destroyed after its associated tasks were killed. 
The following is an excerpt > from the agent log which is filtered to include only lines with the container > ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}: > {code} > 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963] > Container d463b9fe-970d-4077-bab9-558464889a9e has exited > 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457] > Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state > 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124] > Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e > from RUNNING to DESTROYING > 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580] > Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622] > Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 203008ns > 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 339200ns > 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258] > 
Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 83968ns > 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258] > Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 100.50816ms > 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654] > Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401]
[jira] [Commented] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup
[ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849042#comment-16849042 ] Andrei Budnik commented on MESOS-9306: -- The patch `/r/70609/` was discarded. If `cgroups::destroy` hangs due to a blocking system call caused by a kernel bug, then there is no workaround available on Mesos side to fix the issue. In this case, we could only help an operator to detect the problem. This could be done by introducing a debug endpoint for the Mesos containerizer, see MESOS-9756. > Mesos containerizer can get stuck during cgroup cleanup > --- > > Key: MESOS-9306 > URL: https://issues.apache.org/jira/browse/MESOS-9306 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0 >Reporter: Greg Mann >Assignee: Andrei Budnik >Priority: Critical > Labels: containerizer, mesosphere > > I observed a task group's executor container which failed to be completely > destroyed after its associated tasks were killed. 
The following is an excerpt > from the agent log which is filtered to include only lines with the container > ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}: > {code} > 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963] > Container d463b9fe-970d-4077-bab9-558464889a9e has exited > 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457] > Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state > 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124] > Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e > from RUNNING to DESTROYING > 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580] > Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622] > Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 203008ns > 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 339200ns > 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258] > 
Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 83968ns > 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258] > Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 100.50816ms > 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654] > Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:25:40: W1010 14:25:40.030784 6808 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:26:40: W1010 14:26:40.032526 6810 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:27:40: W1010 14:27:40.029932 6801 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e
[jira] [Commented] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup
[ https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835580#comment-16835580 ] Andrei Budnik commented on MESOS-9306: -- I've reproduced the timeout case for `cgroups::destroy` by adding the following code {code:java} Owned<Promise<Nothing>> promise(new Promise<Nothing>()); return promise->future(); {code} to the beginning of the [destroy()|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1548] function. It turns out that [`__destroy`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1590-L1602] is never invoked due to a missing `onDiscard` handler. We only subscribe the [`onAny`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1613] callback, which is never called after `future.discard()`. The reason `cgroups::destroy` hangs for the systemd hierarchy is unknown; it might be related to a kernel issue. > Mesos containerizer can get stuck during cgroup cleanup > --- > > Key: MESOS-9306 > URL: https://issues.apache.org/jira/browse/MESOS-9306 > Project: Mesos > Issue Type: Bug > Components: agent, containerization >Affects Versions: 1.7.0 >Reporter: Greg Mann >Assignee: Andrei Budnik >Priority: Critical > Labels: containerizer, mesosphere > > I observed a task group's executor container which failed to be completely > destroyed after its associated tasks were killed. 
The following is an excerpt > from the agent log which is filtered to include only lines with the container > ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}: > {code} > 2018-10-10 14:20:50: I1010 14:20:50.204756 6799 containerizer.cpp:2963] > Container d463b9fe-970d-4077-bab9-558464889a9e has exited > 2018-10-10 14:20:50: I1010 14:20:50.204839 6799 containerizer.cpp:2457] > Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state > 2018-10-10 14:20:50: I1010 14:20:50.204859 6799 containerizer.cpp:3124] > Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e > from RUNNING to DESTROYING > 2018-10-10 14:20:50: I1010 14:20:50.204960 6799 linux_launcher.cpp:580] > Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.204993 6799 linux_launcher.cpp:622] > Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:20:50: I1010 14:20:50.205417 6806 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.205477 6810 cgroups.cpp:2838] Freezing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.205708 6808 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 203008ns > 2018-10-10 14:20:50: I1010 14:20:50.205878 6800 cgroups.cpp:1229] > Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 339200ns > 2018-10-10 14:20:50: I1010 14:20:50.206185 6799 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos > 2018-10-10 14:20:50: I1010 14:20:50.206226 6808 cgroups.cpp:2856] Thawing > cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e > 2018-10-10 14:20:50: I1010 14:20:50.206455 6808 cgroups.cpp:1258] > 
Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after > 83968ns > 2018-10-10 14:20:50: I1010 14:20:50.306803 6810 cgroups.cpp:1258] > Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after > 100.50816ms > 2018-10-10 14:20:50: I1010 14:20:50.307531 6805 linux_launcher.cpp:654] > Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e' > 2018-10-10 14:21:40: W1010 14:21:40.032855 6809 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:22:40: W1010 14:22:40.031224 6800 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:23:40: W1010 14:23:40.031946 6799 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:24:40: W1010 14:24:40.032979 6804 containerizer.cpp:2401] > Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: > Container does not exist > 2018-10-10 14:25:40: W1010 14:25:40.030784 6808
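The missing-handler behavior described in the comment above can be modeled with a toy future. This is a sketch only: `ToyFuture`, `on_any`, and `on_discard` are hypothetical stand-ins for libprocess's `Future::onAny`/`Future::onDiscard`, not its real API.

```python
# Toy model of future discard semantics: callbacks registered via
# on_any fire only when the future completes, while a discard request
# is delivered only to on_discard handlers.
class ToyFuture:
    def __init__(self):
        self._on_any = []
        self._on_discard = []

    def on_any(self, callback):
        self._on_any.append(callback)

    def on_discard(self, callback):
        self._on_discard.append(callback)

    def complete(self):
        for callback in self._on_any:
            callback()

    def discard(self):
        for callback in self._on_discard:
            callback()


def destroy(subscribe_on_discard):
    # Record which continuations ran; in cgroups.cpp the continuation
    # in question would be __destroy.
    invoked = []
    future = ToyFuture()
    future.on_any(lambda: invoked.append("__destroy"))
    if subscribe_on_discard:
        future.on_discard(lambda: invoked.append("__destroy"))
    future.discard()  # e.g. a destroy timeout discards the in-flight future
    return invoked
```

With only `on_any` subscribed, `destroy(False)` returns an empty list: the continuation never runs after a discard, mirroring the hang. Subscribing `on_discard` as well lets the chain observe the discard.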
[jira] [Commented] (MESOS-9695) Remove the duplicate pid check in Docker containerizer
[ https://issues.apache.org/jira/browse/MESOS-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830263#comment-16830263 ] Andrei Budnik commented on MESOS-9695: -- {code:java} commit c8004ee8a0962d0e0f9147718853160bb708f5bc Author: Qian Zhang Date: Tue Apr 30 13:23:26 2019 +0200 Removed the duplicate pid check in Docker containerizer. Review: https://reviews.apache.org/r/70561/ {code} > Remove the duplicate pid check in Docker containerizer > -- > > Key: MESOS-9695 > URL: https://issues.apache.org/jira/browse/MESOS-9695 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Qian Zhang >Assignee: Qian Zhang >Priority: Major > Labels: containerization > > In `DockerContainerizerProcess::_recover`, we check whether two > executors use the same pid, and error out if we find a duplicate pid (see > [here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/docker.cpp#L1068:L1078] > for details). However, I do not see the value this check gives us, while it > can cause a serious issue (an agent crash loop on restart) in a rare case (a > new executor reusing the pid of an old executor), so I think we'd better remove > it from the Docker containerizer. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
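For reference, the removed check can be sketched as follows. The function name and data shape here are hypothetical; the real check lived in `DockerContainerizerProcess::_recover` in C++.

```python
def check_duplicate_pids(executors):
    """Return an error string if two recovered executors report the same
    pid (e.g. after pid reuse across an agent restart), else None.

    `executors` is a list of (executor_id, pid) pairs; this mirrors the
    idea of the removed check, not the actual Mesos implementation.
    """
    seen = {}
    for executor_id, pid in executors:
        if pid in seen:
            return ("Detected duplicate pid %d for executors %s and %s"
                    % (pid, seen[pid], executor_id))
        seen[pid] = executor_id
    return None
```

An agent hitting the pid-reuse case would fail recovery this way on every restart, which is the crash loop the reporter describes.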
[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode
[ https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825251#comment-16825251 ] Andrei Budnik commented on MESOS-9718: -- Hi [~QuellaZhang], Just verified your patch in our internal CI - LGTM! BTW, could these tests compile if you removed only the u8 prefix from the string literals? E.g., use "~~~\u00ff\u00ff\u00ff\u00ff" instead of u8"~~~\u00ff\u00ff\u00ff\u00ff" (or "~~~\xC3\xBF\xC3\xBF\xC3\xBF\xC3\xBF")? Would you like to send a PR for the patch on [https://github.com/apache/mesos]? [http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr] > Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode > -- > > Key: MESOS-9718 > URL: https://issues.apache.org/jira/browse/MESOS-9718 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > Attachments: mesos.patch.txt > > > Hi All, > We've stumbled across some build failures in Mesos after implementing support > for char8_t under /std:c++latest in the development version of Visual C++. > Could you help look at this? Thanks in advance! Note that this issue is > only found when compiling with an unreleased VC toolset; the next release of MSVC > will have this behavior. > *Repro steps:* > git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > open a VS 2017 x64 command prompt as admin and browse to D:\mesos > set _CL_=/std:c++latest > cd src > .\bootstrap.bat > cd .. 
> mkdir build_x64 && pushd build_x64 > cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > *Failures:* > base64_tests.i > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: > 'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: > 
'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: > 'Try base64::decode_url_safe(const std::string &)': cannot > convert argument 1 from 'const char8_t [16]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot > convert from 'const char8_t [16]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: > 'AssertSomeEq': no matching overloaded function found >
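The byte equivalence behind the suggestion in the comment above can be checked outside C++ as well: encoding "\u00ff" as UTF-8 yields exactly the `\xC3\xBF` byte pairs that the `u8` literal would produce, so spelling the bytes out explicitly is a drop-in replacement. A quick Python sketch:

```python
# The u8"~~~\u00ff\u00ff\u00ff\u00ff" literal encodes each U+00FF as the
# two UTF-8 bytes 0xC3 0xBF; spelling those bytes out explicitly gives
# an identical sequence (11 bytes, 12 counting the C string terminator,
# which matches the 'const char8_t [12]' in the compiler errors).
literal = "~~~\u00ff\u00ff\u00ff\u00ff"
utf8_bytes = literal.encode("utf-8")
explicit_bytes = b"~~~\xc3\xbf\xc3\xbf\xc3\xbf\xc3\xbf"
assert utf8_bytes == explicit_bytes
assert len(utf8_bytes) == 11
```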
[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky
[ https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819281#comment-16819281 ] Andrei Budnik commented on MESOS-8983: -- This test fails pretty often on ARM. > SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky > -- > > Key: MESOS-8983 > URL: https://issues.apache.org/jira/browse/MESOS-8983 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.0, 1.8.0 >Reporter: Alexander Rojas >Assignee: Joseph Wu >Priority: Major > Labels: flaky-test, foundations > > During an unrelated change in a PR, the apache build bot sent the following > error: > {noformat} > @ 7FF71117D888 > std::invoke<,process::Future > >,process::ProcessBase *> > @ 7FF71119257B > lambda::internal::Partial<,process::Future > >,std::_Ph<1> > >::invoke_expand<,std::tuple > >,std::_Ph<1> >,st > @ 7FF7110C08BA ) @ 7FF7110F058C > std::_Invoker_functor::_Call,process::Future > >,std::_Ph<1> >,process::ProcessBase *> > @ 7FF711183EBC > std::invoke,process::Future > >,std::_Ph<1> >,process::ProcessBase *> > @ 7FF7110C9F21 > ),process::Future > >,std::_Ph<1> >,process::ProcessBase * > @ 7FF711236416 process::ProcessBase > *)>::CallableFn,process::Future > >,std::_Ph<1> > >::operator( > @ 7FF712C1A25D process::ProcessBase *)>::operator( > @ 7FF712ACB2F9 process::ProcessBase::consume > @ 7FF712C738CA process::DispatchEvent::consume > @ 7FF70ECE7B07 process::ProcessBase::serve > @ 7FF712AD93B0 process::ProcessManager::resume > @ 7FF712C07371 ?? 
> @ 7FF712B2B130 > std::_Invoker_functor::_Call< > > @ 7FF712B8B8E0 > std::invoke< > > @ 7FF712B4076C > std::_LaunchPad > >,std::default_delete > > > > >::_Execute<0> > @ 7FF712C5A60A > std::_LaunchPad > >,std::default_delete > > > > >::_Run > @ 7FF712C45E78 > std::_LaunchPad > >,std::default_delete > > > > >::_Go > @ 7FF712C2C3CD std::_Pad::_Call_func > @ 7FFF9BE53428 _register_onexit_function > @ 7FFF9BE53071 _register_onexit_function > @ 7FFFB6391FE4 BaseThreadInitThunk > @ 7FFFB69FF061 RtlUserThreadStart > ll containerizers > I0606 10:25:26.680230 18356 slave.cpp:7158] Recovering executors > I0606 10:25:26.680230 18356 slave.cpp:7182] Sending reconnect request to > executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework > 62cf792a-dc69-4e3c-b54f-d83f98fb9451- at executor(1)@192.10.1.5:55652 > I0606 10:25:26.688225 22560 slave.cpp:4984] Received re-registration message > from executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework > 62cf792a-dc69-4e3c-b54f-d83f98fb9451- > I0606 10:25:26.691216 22888 slave.cpp:5901] No pings from master received > within 75secs > F0606 10:25:26.692219 22888 slave.cpp:1249] Check failed: state == > DISCONNECTED || state == RUNNING || state == TERMINATING RECOVERING > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
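The fatal line at the end of the log corresponds to a state check in slave.cpp: the ping-timeout handler only expects the agent to be in one of three states, so a timeout that fires while the agent is still RECOVERING trips the CHECK and aborts the process. A minimal sketch of the failing invariant (the function name is hypothetical; the state names are from the log):

```python
# States the ping-timeout handler expects; RECOVERING is not among them,
# so a ping timeout during recovery aborts the agent.
PING_TIMEOUT_STATES = {"DISCONNECTED", "RUNNING", "TERMINATING"}

def on_ping_timeout(state):
    if state not in PING_TIMEOUT_STATES:
        raise AssertionError(
            "Check failed: state == DISCONNECTED || state == RUNNING "
            "|| state == TERMINATING " + state)
    return state
```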
[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode
[ https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814730#comment-16814730 ] Andrei Budnik commented on MESOS-9718: -- [~QuellaZhang] If you have a possible fix in mind, we could discuss it via the dev mailing list [1] or in the Slack dev channel [2]. Joseph and I can help with committing your patches into Mesos. [1] [http://mesos.apache.org/community/#mailing-lists] [2] [http://mesos.apache.org/community/#slack] > Compile failures with char8_t by MSVC under /std:c++latest mode > --- > > Key: MESOS-9718 > URL: https://issues.apache.org/jira/browse/MESOS-9718 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > > Hi All, > We've stumbled across some build failures in Mesos after implementing support > for char8_t under /std:c++latest in the development version of Visual C++. > Could you help look at this? Thanks in advance! Note that this issue is > only found when compiling with an unreleased VC toolset; the next release of MSVC > will have this behavior. > *Repro steps:* > git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > open a VS 2017 x64 command prompt as admin and browse to D:\mesos > set _CL_=/std:c++latest > cd src > .\bootstrap.bat > cd .. 
> mkdir build_x64 && pushd build_x64 > cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > *Failures:* > base64_tests.i > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: > 'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: > 
'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: > 'Try base64::decode_url_safe(const std::string &)': cannot > convert argument 1 from 'const char8_t [16]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot > convert from 'const char8_t [16]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: > 'AssertSomeEq': no matching overloaded function found > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2780: > 'testing::AssertionResult AssertSomeEq(const char *,const char *,const T1 > &,const T2 &)': expects 4 arguments - 3 provided >
[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode
[ https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814322#comment-16814322 ] Andrei Budnik commented on MESOS-9718: -- [~kaysoky] What could be a possible fix or mitigation for this error? > Compile failures with char8_t by MSVC under /std:c++latest mode > --- > > Key: MESOS-9718 > URL: https://issues.apache.org/jira/browse/MESOS-9718 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > > Hi All, > We've stumbled across some build failures in Mesos after implementing support > for char8_t under /std:c++latest in the development version of Visual C++. > Could you help look at this? Thanks in advance! Note that this issue is > only found when compiling with an unreleased VC toolset; the next release of MSVC > will have this behavior. > *Repro steps:* > git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > open a VS 2017 x64 command prompt as admin and browse to D:\mesos > set _CL_=/std:c++latest > cd src > .\bootstrap.bat > cd .. 
> mkdir build_x64 && pushd build_x64 > cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > *Failures:* > base64_tests.i > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: > 'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: > 
'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: > 'Try base64::decode_url_safe(const std::string &)': cannot > convert argument 1 from 'const char8_t [16]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot > convert from 'const char8_t [16]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: > 'AssertSomeEq': no matching overloaded function found > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2780: > 'testing::AssertionResult AssertSomeEq(const char *,const char *,const T1 > &,const T2 &)': expects 4 arguments - 3 provided > D:\Mesos\src\3rdparty\stout\include\stout/gtest.hpp(79): note: see > declaration of 'AssertSomeEq' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2512: > 'testing::AssertionResult': no appropriate default constructor available >
[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode
[ https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814320#comment-16814320 ] Andrei Budnik commented on MESOS-9718: -- This error appeared after the following patch landed: {code:java} commit 703d0011d9049c6003f6d57026f5e764d1cb4435 Author: John Kordich Date: Thu Apr 13 18:07:25 2017 -0700 Windows: Fixed Base64Test.EncodeURLSafe. C++ encodes string literals in the compiling platform's encoding of choice, which means UTF8 for Posix, and ANSI for Windows. This has implications for this particular test, as the string literal "~~~\u00ff\u00ff\u00ff\u00ff" is translated into different bytes: Posix: { 126, 126, 126, 195, 191, 195, 191, 195, 191, 195, 191 } Windows: { 126, 126, 126, 255, 255, 255, 255 } Prepending `u8` to the string literal tells the compiler to encode the string as UTF8. This does not expose any underlying bug(s) on Windows because the test is only failing due to an incorrect input. Review: https://reviews.apache.org/r/58430/ {code} > Compile failures with char8_t by MSVC under /std:c++latest mode > --- > > Key: MESOS-9718 > URL: https://issues.apache.org/jira/browse/MESOS-9718 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: QuellaZhang >Priority: Major > Labels: windows > > Hi All, > We've stumbled across some build failures in Mesos after implementing support > for char8_t under /std:c++latest in the development version of Visual C++. > Could you help look at this? Thanks in advance! Note that this issue is > only found when compiling with an unreleased VC toolset; the next release of MSVC > will have this behavior. > *Repro steps:* > git clone -c core.autocrlf=true [https://github.com/apache/mesos] > D:\mesos\src > open a VS 2017 x64 command prompt as admin and browse to D:\mesos > set _CL_=/std:c++latest > cd src > .\bootstrap.bat > cd .. 
> mkdir build_x64 && pushd build_x64 > cmake ..\src -G "Visual Studio 15 2017 Win64" > -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 > -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64 > *Failures:* > base64_tests.i > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: > 'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: > 'std::string base64::encode_url_safe(const std::string &,bool)': cannot > convert argument 1 from 'const char8_t [12]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot > convert from 'const char8_t [12]' to 'const std::string' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor > could take the source type, or constructor overload resolution was ambiguous > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: > 
'testing::internal::EqHelper::Compare': function does not take 3 > arguments > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430): > note: see declaration of 'testing::internal::EqHelper::Compare' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: > 'testing::AssertionResult': no appropriate default constructor available > > D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256): > note: see declaration of 'testing::AssertionResult' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: > 'Try base64::decode_url_safe(const std::string &)': cannot > convert argument 1 from 'const char8_t [16]' to 'const std::string &' > D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot > convert from 'const
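The byte tables in the commit message above can be reproduced with a quick sketch. Latin-1 stands in here for the single-byte Windows ANSI code page; that substitution is an assumption made for illustration.

```python
# The same escaped string yields different bytes depending on the
# execution charset: UTF-8 (the Posix default) expands U+00FF to two
# bytes (0xC3 0xBF), while a single-byte ANSI code page keeps it as one.
s = "~~~\u00ff\u00ff\u00ff\u00ff"
posix_bytes = list(s.encode("utf-8"))
windows_bytes = list(s.encode("latin-1"))  # stand-in for the ANSI code page
```

This is exactly why the `u8` prefix was added in the first place: it forces the UTF-8 bytes regardless of platform.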
[jira] [Comment Edited] (MESOS-9709) Docker executor can become stuck terminating
[ https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335 ] Andrei Budnik edited comment on MESOS-9709 at 4/9/19 4:58 PM: -- This agent responds to polling of the `/state` endpoint, but hangs when polling `/containers` and `/__processes__`. GDB can't attach to the running agent - it hangs. top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state. Here is the stack trace of the agent's hanging thread: {code:java} [] copy_net_ns+0xa2/0x180 [] create_new_namespaces+0xf9/0x180 [] copy_namespaces+0x8e/0xd0 [] copy_process+0xb66/0x1a40 [] do_fork+0x91/0x320 [] SyS_clone+0x16/0x20 [] stub_clone+0x44/0x70 [] 0x{code} dmesg shows a repeating (every 10 seconds) message: {code:java} unregister_netdevice: waiting for tunl0 to become free. Usage count = 1{code} was (Author: abudnik): This agent responds to polling of the `/state` endpoint, but hangs when polling `/containers` and `/__processes__`. GDB can't attach to the running agent - it hangs. top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state. Here is the stack trace of the agent's hanging thread: {code:java} [] copy_net_ns+0xa2/0x180 [] create_new_namespaces+0xf9/0x180 [] copy_namespaces+0x8e/0xd0 [] copy_process+0xb66/0x1a40 [] do_fork+0x91/0x320 [] SyS_clone+0x16/0x20 [] stub_clone+0x44/0x70 [] 0x{code} > Docker executor can become stuck terminating > > > Key: MESOS-9709 > URL: https://issues.apache.org/jira/browse/MESOS-9709 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.8.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, mesosphere > Attachments: docker-executor-stuck.txt > > > See attached agent log; the executor container ID is > {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the > string {{819f7ef7-4f42-11e9-a566-72ec67496045}}. 
> After launching the executor, we see > {code} > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching > container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container > info found, skipping launch > {code} > I'm not sure why the container info was not set. Once the executor > reregistration timeout elapses, the agent attempts to terminate the executor > but it does not seem to be successful. The scheduler continues to try to kill > the task but we repeatedly see > {code} > Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill > task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 > because the executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (MESOS-9709) Docker executor can become stuck terminating
[ https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335 ] Andrei Budnik edited comment on MESOS-9709 at 4/9/19 1:24 PM: -- This agent responds to polling of the `/state` endpoint, but hangs when polling `/containers` and `/__processes__`. GDB can't attach to the running agent - it hangs. top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state. Here is the stack trace of the agent's hanging thread: {code:java} [] copy_net_ns+0xa2/0x180 [] create_new_namespaces+0xf9/0x180 [] copy_namespaces+0x8e/0xd0 [] copy_process+0xb66/0x1a40 [] do_fork+0x91/0x320 [] SyS_clone+0x16/0x20 [] stub_clone+0x44/0x70 [] 0x{code} was (Author: abudnik): This agent responds to polling of the `/state` endpoint, but hangs when polling `/containers` and `/__processes__`. GDB can't attach to the running agent - it hangs. Here is the stack trace of the agent's hanging thread: {code:java} [] copy_net_ns+0xa2/0x180 [] create_new_namespaces+0xf9/0x180 [] copy_namespaces+0x8e/0xd0 [] copy_process+0xb66/0x1a40 [] do_fork+0x91/0x320 [] SyS_clone+0x16/0x20 [] stub_clone+0x44/0x70 [] 0x{code} > Docker executor can become stuck terminating > > > Key: MESOS-9709 > URL: https://issues.apache.org/jira/browse/MESOS-9709 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.8.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, mesosphere > Attachments: docker-executor-stuck.txt > > > See attached agent log; the executor container ID is > {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the > string {{819f7ef7-4f42-11e9-a566-72ec67496045}}. 
> After launching the executor, we see > {code} > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching > container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container > info found, skipping launch > {code} > I'm not sure why the container info was not set. Once the executor > reregistration timeout elapses, the agent attempts to terminate the executor > but it does not seem to be successful. The scheduler continues to try to kill > the task but we repeatedly see > {code} > Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill > task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 > because the executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9709) Docker executor can become stuck terminating
[ https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813390#comment-16813390 ] Andrei Budnik commented on MESOS-9709: -- It's a Linux kernel bug: [https://github.com/lxc/lxc/issues/2141] > Docker executor can become stuck terminating > > > Key: MESOS-9709 > URL: https://issues.apache.org/jira/browse/MESOS-9709 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.8.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, mesosphere > Attachments: docker-executor-stuck.txt > > > See attached agent log; the executor container ID is > {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the > string {{819f7ef7-4f42-11e9-a566-72ec67496045}}. > After launching the executor, we see > {code} > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching > container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container > info found, skipping launch > {code} > I'm not sure why the container info was not set. Once the executor > reregistration timeout elapses, the agent attempts to terminate the executor > but it does not seem to be successful. 
The scheduler continues to try to kill > the task but we repeatedly see > {code} > Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill > task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 > because the executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9709) Docker executor can become stuck terminating
[ https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335 ] Andrei Budnik commented on MESOS-9709: -- This agent responds to polling of the `/state` endpoint, but hangs when polling `/containers` and `/__processes__`. GDB can't attach to the running agent - it hangs. Here is the stack trace of the agent's hanging thread: {code:java} [] copy_net_ns+0xa2/0x180 [] create_new_namespaces+0xf9/0x180 [] copy_namespaces+0x8e/0xd0 [] copy_process+0xb66/0x1a40 [] do_fork+0x91/0x320 [] SyS_clone+0x16/0x20 [] stub_clone+0x44/0x70 [] 0x{code} > Docker executor can become stuck terminating > > > Key: MESOS-9709 > URL: https://issues.apache.org/jira/browse/MESOS-9709 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.8.0 >Reporter: Greg Mann >Priority: Major > Labels: containerization, mesosphere > Attachments: docker-executor-stuck.txt > > > See attached agent log; the executor container ID is > {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the > string {{819f7ef7-4f42-11e9-a566-72ec67496045}}. > After launching the executor, we see > {code} > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching > container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- > Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container > info found, skipping launch > {code} > I'm not sure why the container info was not set. Once the executor > reregistration timeout elapses, the agent attempts to terminate the executor > but it does not seem to be successful. 
The scheduler continues to try to kill > the task but we repeatedly see > {code} > Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re > mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill > task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 > because the executor > 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of > framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9707) Calling link::lo() may cause runtime error
[ https://issues.apache.org/jira/browse/MESOS-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812625#comment-16812625 ] Andrei Budnik commented on MESOS-9707: -- Thanks for filing the ticket! Would you like to create a PR for the fix on [https://github.com/apache/mesos]? [http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr] > Calling link::lo() may cause runtime error > --- > > Key: MESOS-9707 > URL: https://issues.apache.org/jira/browse/MESOS-9707 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.2 >Reporter: Pavel >Priority: Major > > If mesos uses isolation="network/port_mapping" it calls link::lo() during the > PortMappingIsolatorProcess::create procedure: > {code:C++} > Try<set<string>> links = net::links(); > if (links.isError()) { > return Error("Failed to get all the links: " + links.error()); > } > foreach (const string& link, links.get()) { > Result<bool> test = link::internal::test(link, IFF_LOOPBACK); > if (test.isError()) { > return Error("Failed to check the flag on link: " + link); > } else if (test.get()) { > return link; > } > } > {code} > It iterates through net::links() and returns the first link with the IFF_LOOPBACK flag. > For some network configurations the test variable can be None, and test.get() causes a > runtime error. > In my case, a bridged interface caused link::internal::test(link, IFF_LOOPBACK) > to be None. > Changing the code to > {code:C++} > else if (test.isSome()) { > if (test.get()) { > return link; > } > } > {code} > solves the issue. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
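The fix above boils down to never dereferencing a possibly-empty result. A minimal sketch of the same pattern using std::optional in place of stout's Result; `isLoopback` and `findLoopback` are hypothetical stand-ins for illustration, not Mesos APIs:

```cpp
#include <optional>
#include <set>
#include <string>

// Hypothetical stand-in for link::internal::test(): returns std::nullopt
// when the flag cannot be determined (e.g. for a bridged interface),
// mirroring a Result<bool> that is None.
std::optional<bool> isLoopback(const std::string& link) {
  if (link == "lo") return true;
  if (link == "eth0") return false;
  return std::nullopt;  // e.g. a bridge device we cannot classify
}

// The fixed iteration: skip links whose flag test yields no value instead
// of dereferencing an empty result.
std::optional<std::string> findLoopback(const std::set<std::string>& links) {
  for (const std::string& link : links) {
    std::optional<bool> test = isLoopback(link);
    if (test.has_value() && *test) {  // dereference only when a value exists
      return link;
    }
  }
  return std::nullopt;
}
```

The unchecked `test.get()` in the original corresponds to dereferencing the optional without `has_value()`, which is exactly the assertion failure the reporter hit.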
[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors
[ https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812536#comment-16812536 ] Andrei Budnik commented on MESOS-6285: -- [~kaysoky] What is the relation between this ticket and MESOS-7947? Does MESOS-7947 provide only a partial solution? > Agents may OOM during recovery if there are too many tasks or executors > --- > > Key: MESOS-6285 > URL: https://issues.apache.org/jira/browse/MESOS-6285 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.0.1 >Reporter: Joseph Wu >Priority: Critical > Labels: mesosphere > > On a test cluster, we encountered a degenerate case where running the > example {{long-lived-framework}} for over a week would render the agent > unrecoverable. > The {{long-lived-framework}} creates one custom {{long-lived-executor}} and > launches a single task on that executor every time it receives an offer from > that agent. Over a week's worth of time, the framework manages to launch > some 400k tasks (short sleeps) on one executor. During runtime, this is not > problematic, as each completed task is quickly rotated out of the agent's > memory (and checkpointed to disk). > During recovery, however, the agent reads every single task into memory, > which leads to slow recovery and often results in the agent being OOM-killed > before it finishes recovering. > To repro this condition quickly: > 1) Apply this patch to the {{long-lived-framework}}: > {code} > diff --git a/src/examples/long_lived_framework.cpp > b/src/examples/long_lived_framework.cpp > index 7c57eb5..1263d82 100644 > --- a/src/examples/long_lived_framework.cpp > +++ b/src/examples/long_lived_framework.cpp > @@ -358,16 +358,6 @@ private: >// Helper to launch a task using an offer. 
>void launch(const Offer& offer) >{ > -int taskId = tasksLaunched++; > -++metrics.tasks_launched; > - > -TaskInfo task; > -task.set_name("Task " + stringify(taskId)); > -task.mutable_task_id()->set_value(stringify(taskId)); > -task.mutable_agent_id()->MergeFrom(offer.agent_id()); > -task.mutable_resources()->CopyFrom(taskResources); > -task.mutable_executor()->CopyFrom(executor); > - > Call call; > call.set_type(Call::ACCEPT); > > @@ -380,7 +370,23 @@ private: > Offer::Operation* operation = accept->add_operations(); > operation->set_type(Offer::Operation::LAUNCH); > > -operation->mutable_launch()->add_task_infos()->CopyFrom(task); > +// Launch as many tasks as possible in the given offer. > +Resources remaining = Resources(offer.resources()).flatten(); > +while (remaining.contains(taskResources)) { > + int taskId = tasksLaunched++; > + ++metrics.tasks_launched; > + > + TaskInfo task; > + task.set_name("Task " + stringify(taskId)); > + task.mutable_task_id()->set_value(stringify(taskId)); > + task.mutable_agent_id()->MergeFrom(offer.agent_id()); > + task.mutable_resources()->CopyFrom(taskResources); > + task.mutable_executor()->CopyFrom(executor); > + > + operation->mutable_launch()->add_task_infos()->CopyFrom(task); > + > + remaining -= taskResources; > +} > > mesos->send(call); >} > {code} > 2) Run a master, agent, and {{long-lived-framework}}. On a 1 CPU, 1 GB agent > + this patch, it should take about 10 minutes to build up sufficient task > launches. > 3) Restart the agent and watch it flail during recovery. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
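The patched launch loop above can be sketched with a simplified resource model. `Resources` here is a plain map standing in for the Mesos Resources class, and `packTasks` mirrors the `while (remaining.contains(taskResources))` loop that saturates the offer:

```cpp
#include <map>
#include <string>

using Resources = std::map<std::string, double>;

// Simplified analogue of Resources::contains(): every needed resource must
// be present in sufficient quantity.
bool contains(const Resources& remaining, const Resources& needed) {
  for (const auto& [name, amount] : needed) {
    auto it = remaining.find(name);
    if (it == remaining.end() || it->second < amount) {
      return false;
    }
  }
  return true;
}

// Launch as many tasks as fit in the offer, as in the patched
// long_lived_framework: keep subtracting taskResources until they no
// longer fit, counting one launched task per iteration.
int packTasks(Resources remaining, const Resources& taskResources) {
  int launched = 0;
  while (contains(remaining, taskResources)) {
    for (const auto& [name, amount] : taskResources) {
      remaining[name] -= amount;
    }
    ++launched;
  }
  return launched;
}
```

With a 1-CPU, 1 GB offer and small per-task resources, this loop is what lets the framework accumulate hundreds of thousands of completed tasks quickly.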
[jira] [Commented] (MESOS-8972) when choose docker image use user network all mesos agent crash
[ https://issues.apache.org/jira/browse/MESOS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1681#comment-1681 ] Andrei Budnik commented on MESOS-8972: -- [~saturnman], [~omegavveapon] Could you please provide the Marathon app definition (JSON) that causes this failure? Which version of Marathon are you running? > when choose docker image use user network all mesos agent crash > --- > > Key: MESOS-8972 > URL: https://issues.apache.org/jira/browse/MESOS-8972 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.7.0 > Environment: Ubuntu 14.04 & Ubuntu 16.04, both types crash mesos >Reporter: saturnman >Priority: Blocker > Labels: docker, network > > When a docker task is submitted from Marathon with the user network selected, the mesos process > crashes with the following backtrace: > mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& > Option<T>::get() const & [with T = std::__cxx11::basic_string<char>]: > Assertion `isSome()' failed. 
> *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are > using GNU date *** > PC: @ 0x7fc03d43f428 (unknown) > *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID > 17684; stack trace: *** > @ 0x7fc03dd7d390 (unknown) > @ 0x7fc03d43f428 (unknown) > @ 0x7fc03d44102a (unknown) > @ 0x7fc03d437bd7 (unknown) > @ 0x7fc03d437c82 (unknown) > @ 0x564f1ad8871d > _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv > @ 0x7fc048c43256 > mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON() > @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare() > @ 0x7fc0486e5c18 > _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_ -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9698) DroppedOperationStatusUpdate test is flaky
Andrei Budnik created MESOS-9698: Summary: DroppedOperationStatusUpdate test is flaky Key: MESOS-9698 URL: https://issues.apache.org/jira/browse/MESOS-9698 Project: Mesos Issue Type: Bug Environment: Debian 8 Reporter: Andrei Budnik Attachments: DroppedOperationStatusUpdate-badrun1.txt DroppedOperationStatusUpdate test failed with the following backtrace: {code:java} 06:50:21 mesos-tests: ../../3rdparty/stout/include/stout/option.hpp:120: T& Option::get() & [with T = mesos::FrameworkID]: Assertion `isSome()' failed. 06:50:21 *** Aborted at 1554360620 (unix time) try "date -d @1554360620" if you are using GNU date *** 06:50:21 I0404 06:50:20.663539 16308 scheduler.cpp:847] Enqueuing event OFFERS received from http://172.16.10.126:42550/master/api/v1/scheduler 06:50:21 I0404 06:50:20.663702 16308 scheduler.cpp:847] Enqueuing event UPDATE_OPERATION_STATUS received from http://172.16.10.126:42550/master/api/v1/scheduler 06:50:21 PC: @ 0x7fa726c66067 (unknown) 06:50:21 *** SIGABRT (@0x6fad) received by PID 28589 (TID 0x7fa71dfc9700) from PID 28589; stack trace: *** 06:50:21 @ 0x7fa726feb890 (unknown) 06:50:21 @ 0x7fa726c66067 (unknown) 06:50:21 @ 0x7fa726c67448 (unknown) 06:50:21 @ 0x7fa726c5f266 (unknown) 06:50:21 @ 0x7fa726c5f312 (unknown) 06:50:21 @ 0x7fa72a1be89a _ZNR6OptionIN5mesos11FrameworkIDEE3getEv.part.500 06:50:21 @ 0x7fa72a54002a mesos::internal::master::Master::updateOperationStatus() 06:50:21 @ 0x7fa72a5c583b ProtobufProcess<>::_handlerMutM<>() 06:50:21 @ 0x7fa72a58e680 ProtobufProcess<>::consume() 06:50:21 @ 0x7fa72a50cf04 mesos::internal::master::Master::_consume() 06:50:21 @ 0x7fa72a52975d mesos::internal::master::Master::consume() 06:50:21 @ 0x7fa72b60b1d3 process::ProcessManager::resume() 06:50:21 @ 0x7fa72b610ea6 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv 06:50:21 @ 0x7fa7277c6970 (unknown) 06:50:21 @ 0x7fa726fe4064 start_thread 06:50:21 @ 0x7fa726d1962d (unknown) {code} -- This message was sent 
by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9693) Add master validation for SeccompInfo.
[ https://issues.apache.org/jira/browse/MESOS-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805771#comment-16805771 ] Andrei Budnik commented on MESOS-9693: -- > 2. at most one field of profile_name and unconfined should be set. better to > validate in master We have such a validation in the `linux/seccomp` [isolator|https://github.com/apache/mesos/blob/9a6b3cb943fd1f8c9732cd5fb7d58a5b55c1460c/src/slave/containerizer/mesos/isolators/linux/seccomp.cpp#L102-L107]. > 1. if seccomp is not enabled, we should return failure if any fw specify > seccompInfo and return appropriate status update. There are two nuances to take into account. First, the Seccomp isolator might be disabled on particular agents, so whether Seccomp is enabled can only be determined at the agent level rather than cluster-wide. Second, we don't have similar validation for other "unused" fields in the ContainerInfo/LinuxInfo protos. E.g., a framework might specify the `NetworkInfo network_infos` field in the `ContainerInfo`, but an agent will ignore it if CNI and other `network_infos`-consuming plugins are not enabled. > Add master validation for SeccompInfo. > -- > > Key: MESOS-9693 > URL: https://issues.apache.org/jira/browse/MESOS-9693 > Project: Mesos > Issue Type: Task >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Major > > 1. if seccomp is not enabled, we should return failure if any fw specify > seccompInfo and return appropriate status update. > 2. at most one field of profile_name and unconfined should be set. better to > validate in master -- This message was sent by Atlassian JIRA (v7.6.3#76005)
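The "at most one of profile_name and unconfined" rule discussed above can be expressed as a small check. `SeccompInfo` here is a simplified stand-in for the protobuf message, not the actual Mesos type:

```cpp
#include <optional>
#include <string>

// Simplified stand-in for the SeccompInfo protobuf message: each optional
// models a field that may or may not be set.
struct SeccompInfo {
  std::optional<std::string> profile_name;
  std::optional<bool> unconfined;
};

// Returns an error message when both fields are set, std::nullopt when the
// info is valid -- mirroring the check in the linux/seccomp isolator.
std::optional<std::string> validate(const SeccompInfo& seccomp) {
  if (seccomp.profile_name.has_value() && seccomp.unconfined.has_value()) {
    return "At most one of 'profile_name' and 'unconfined' may be set";
  }
  return std::nullopt;
}
```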
[jira] [Created] (MESOS-9614) Implement filtering of Seccomp rules by kernel version.
Andrei Budnik created MESOS-9614: Summary: Implement filtering of Seccomp rules by kernel version. Key: MESOS-9614 URL: https://issues.apache.org/jira/browse/MESOS-9614 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik The most recent Docker profile allows specifying filtering by kernel version, e.g.: {code:java} { "names": [ "ptrace" ], "action": "SCMP_ACT_ALLOW", "args": null, "comment": "", "includes": { "minKernel": "4.8" }, "excludes": {} }, {code} We need to add support for the `minKernel` filter. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
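A `minKernel` filter needs a numeric, component-wise version comparison, since a plain string compare would order "4.10" before "4.8". A hypothetical sketch (`kernelAtLeast` is illustrative, not a Mesos or libseccomp API):

```cpp
#include <sstream>
#include <string>

// Hypothetical helper for a `minKernel` filter: compares dotted version
// strings component by component. Trailing components like "-generic" in
// "4.15.0-96-generic" parse as their leading integer via stoi.
bool kernelAtLeast(const std::string& kernel, const std::string& minKernel) {
  std::istringstream a(kernel), b(minKernel);
  std::string ap, bp;
  while (true) {
    bool hasA = static_cast<bool>(std::getline(a, ap, '.'));
    bool hasB = static_cast<bool>(std::getline(b, bp, '.'));
    if (!hasB) return true;           // requirement exhausted: satisfied
    int x = hasA ? std::stoi(ap) : 0; // missing components count as 0
    int y = std::stoi(bp);
    if (x != y) return x > y;
  }
}
```

A rule carrying `"includes": {"minKernel": "4.8"}` would then be kept only when `kernelAtLeast(runningKernel, "4.8")` holds.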
[jira] [Assigned] (MESOS-9564) Logrotate container logger lets tasks execute arbitrary commands in the Mesos agent's namespace
[ https://issues.apache.org/jira/browse/MESOS-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9564: Assignee: Andrei Budnik (was: Joseph Wu) > Logrotate container logger lets tasks execute arbitrary commands in the Mesos > agent's namespace > --- > > Key: MESOS-9564 > URL: https://issues.apache.org/jira/browse/MESOS-9564 > Project: Mesos > Issue Type: Bug > Components: agent, modules >Reporter: Joseph Wu >Assignee: Andrei Budnik >Priority: Critical > Labels: foundations, mesosphere > > The non-default {{LogrotateContainerLogger}} module allows tasks to configure > sandbox log rotation (See > http://mesos.apache.org/documentation/latest/logging/#Containers ). The > {{logrotate_stdout_options}} and {{logrotate_stderr_options}} in particular > let the task specify free-form text, which is written to a configuration file > located in the task's sandbox. The module does not sanitize or check this > configuration at all. > The logger itself will eventually run {{logrotate}} against the written > configuration file, but the logger is not isolated in the same way as the > task. For both the Mesos and Docker containerizers, the logger binary will > run in the same namespace as the Mesos agent. This makes it possible to > affect files outside of the task's mount namespace. > Two modes of attack are known to be problematic: > * Changing or adding entries to the configuration file. Normally, the > configuration file contains a single file to rotate: > {code} > /path/to/sandbox/stdout { > > } > {code} > It is trivial to add text to the {{logrotate_stdout_options}} to add a new > entry: > {code} > /path/to/sandbox/stdout { > > } > /path/to/other/file/on/disk { > > } > {code} > * Logrotate's {{postrotate}} option allows for execution of arbitrary > commands. This can again be supplied with the {{logrotate_stdout_options}} > variable. 
> {code} > /path/to/sandbox/stdout { > postrotate > rm -rf / > endscript > } > {code} > Some potential fixes to consider: > * Overwrite the .logrotate.conf files each time. This would give only > milliseconds between writing and calling logrotate for a third party to modify > the config files maliciously. This would not help if the task itself had > postrotate options in its environment variables. > * Sanitize the free-form options field in the environment variables to remove > postrotate or injection attempts like }\n/path/to/some/file\noptions{. > * Refactor parts of the Mesos isolation code path so that the logger and IO > switchboard binary live in the same namespaces as the container (instead of > the agent). This would also be nice in that the logger's CPU usage would then > be accounted for within the container's resources. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
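The second proposed fix (sanitizing the free-form options) could look roughly like the following blocklist check. This is only an illustration of the idea, not the module's actual code, and a real fix would likely pair it with the structural changes listed above:

```cpp
#include <string>

// Hypothetical sanitizer sketch: reject free-form logrotate options that
// could break out of the generated config block or execute commands.
bool isSafeLogrotateOptions(const std::string& options) {
  // Braces allow closing our block early or opening a new one for an
  // arbitrary path (the "}\n/path/to/some/file\noptions{" injection).
  if (options.find('{') != std::string::npos) return false;
  if (options.find('}') != std::string::npos) return false;

  // Script hooks run arbitrary shell commands as the logger's user.
  for (const char* directive :
       {"postrotate", "prerotate", "firstaction", "lastaction"}) {
    if (options.find(directive) != std::string::npos) return false;
  }
  return true;
}
```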
[jira] [Issue Comment Deleted] (MESOS-6632) ContainerLogger might leak FD if container launch fails.
[ https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik updated MESOS-6632: - Comment: was deleted (was: [~gilbert] Could you please fill out Fix Version/s?) > ContainerLogger might leak FD if container launch fails. > > > Key: MESOS-6632 > URL: https://issues.apache.org/jira/browse/MESOS-6632 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.2, 1.0.1, 1.1.0 >Reporter: Jie Yu >Assignee: Andrei Budnik >Priority: Critical > > In MesosContainerizer, if logger->prepare() succeeds but its continuation > fails, the pipe fd allocated in the logger will get leaked. We cannot add a > destructor in ContainerLogger::SubprocessInfo to close the fd because > subprocess might close the OWNED fd. > A FD abstraction might help here. In other words, subprocess will no longer > be responsible for closing external FDs, instead, the FD destructor will be > doing so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.
[ https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763730#comment-16763730 ] Andrei Budnik commented on MESOS-6632: -- [~gilbert] Could you please fill out Fix Version/s? > ContainerLogger might leak FD if container launch fails. > > > Key: MESOS-6632 > URL: https://issues.apache.org/jira/browse/MESOS-6632 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.2, 1.0.1, 1.1.0 >Reporter: Jie Yu >Assignee: Andrei Budnik >Priority: Critical > > In MesosContainerizer, if logger->prepare() succeeds but its continuation > fails, the pipe fd allocated in the logger will get leaked. We cannot add a > destructor in ContainerLogger::SubprocessInfo to close the fd because > subprocess might close the OWNED fd. > A FD abstraction might help here. In other words, subprocess will no longer > be responsible for closing external FDs, instead, the FD destructor will be > doing so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-6632) ContainerLogger might leak FD if container launch fails.
[ https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-6632: Assignee: Andrei Budnik > ContainerLogger might leak FD if container launch fails. > > > Key: MESOS-6632 > URL: https://issues.apache.org/jira/browse/MESOS-6632 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.2, 1.0.1, 1.1.0 >Reporter: Jie Yu >Assignee: Andrei Budnik >Priority: Critical > > In MesosContainerizer, if logger->prepare() succeeds but its continuation > fails, the pipe fd allocated in the logger will get leaked. We cannot add a > destructor in ContainerLogger::SubprocessInfo to close the fd because > subprocess might close the OWNED fd. > A FD abstraction might help here. In other words, subprocess will no longer > be responsible for closing external FDs, instead, the FD destructor will be > doing so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.
[ https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763727#comment-16763727 ] Andrei Budnik commented on MESOS-6632: -- https://reviews.apache.org/r/69684/ > ContainerLogger might leak FD if container launch fails. > > > Key: MESOS-6632 > URL: https://issues.apache.org/jira/browse/MESOS-6632 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 0.28.2, 1.0.1, 1.1.0 >Reporter: Jie Yu >Priority: Critical > > In MesosContainerizer, if logger->prepare() succeeds but its continuation > fails, the pipe fd allocated in the logger will get leaked. We cannot add a > destructor in ContainerLogger::SubprocessInfo to close the fd because > subprocess might close the OWNED fd. > A FD abstraction might help here. In other words, subprocess will no longer > be responsible for closing external FDs, instead, the FD destructor will be > doing so. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
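The FD abstraction proposed in the ticket is essentially an RAII owner for the descriptor: the wrapper, not the subprocess, is responsible for closing, so the pipe fd is released even when the launch continuation fails before handoff. A minimal sketch (not the eventual stout implementation):

```cpp
#include <fcntl.h>
#include <unistd.h>

// Owns a POSIX file descriptor and closes it on destruction, unless
// ownership was explicitly transferred via release().
class OwnedFd {
public:
  explicit OwnedFd(int fd) : fd_(fd) {}

  ~OwnedFd() {
    if (fd_ >= 0) {
      ::close(fd_);
    }
  }

  // Non-copyable: exactly one owner per descriptor.
  OwnedFd(const OwnedFd&) = delete;
  OwnedFd& operator=(const OwnedFd&) = delete;

  // Movable: ownership follows the moved-to object.
  OwnedFd(OwnedFd&& other) noexcept : fd_(other.fd_) { other.fd_ = -1; }

  int get() const { return fd_; }

  // Transfer ownership, e.g. to a subprocess that will close the fd itself.
  int release() {
    int fd = fd_;
    fd_ = -1;
    return fd;
  }

private:
  int fd_;
};
```

With this shape, the subprocess code would call release() only on descriptors it actually takes over, resolving the "subprocess might close the OWNED fd" conflict described above.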
[jira] [Assigned] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.
[ https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Andrei Budnik reassigned MESOS-9507: Assignee: Andrei Budnik (was: Gilbert Song) > Agent could not recover due to empty docker volume checkpointed files. > -- > > Key: MESOS-9507 > URL: https://issues.apache.org/jira/browse/MESOS-9507 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: Gilbert Song >Assignee: Andrei Budnik >Priority: Critical > Labels: containerizer > > Agent could not recover due to empty docker volume checkpointed files. Please > see logs: > {noformat} > Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 > slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect > failed: Collect failed: Failed to recover docker volumes for orphan container > e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line > 1 near: > Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: > Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f > /var/lib/mesos/slave/meta/slaves/latest > Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover > old live executors. > Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. > Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process > exited, code=exited, status=1/FAILURE > Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered > failed state. > Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed. > {noformat} > This is caused by agent recovery after the volume state file is created but > before checkpointing finishes. Basically the docker volume is not mounted > yet, so the docker volume isolator should skip recovering this volume. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
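The suggested behavior, skipping (rather than failing on) an incompletely checkpointed volume, can be sketched as follows. `parseCheckpoint` is a hypothetical helper, not the Mesos API; a real implementation would JSON-parse non-empty contents:

```cpp
#include <optional>
#include <string>

// An empty (or whitespace-only) checkpoint file means the agent crashed or
// restarted after creating the state file but before checkpointing
// finished, so the volume was never mounted. Skip it instead of failing
// the whole recovery with a "JSON parse failed" error.
std::optional<std::string> parseCheckpoint(const std::string& contents) {
  if (contents.find_first_not_of(" \t\r\n") == std::string::npos) {
    return std::nullopt;  // incomplete checkpoint: skip, don't error
  }
  return contents;  // real code would parse and validate JSON here
}
```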
[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739635#comment-16739635 ] Andrei Budnik edited comment on MESOS-7971 at 1/10/19 5:40 PM: --- This is something different from previous ones. {code:java} E0110 17:13:09.326659 13916 master.cpp:8586] Failed to find the operation '' (uuid: 825f65eb-3ba1-4dfa-bdfa-8eb29194ace3) for an operator API call on agent ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b-S0 {code} Full log: {code:java} [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove I0110 17:12:59.303460 13893 cluster.cpp:174] Creating default 'local' authorizer I0110 17:12:59.304430 13912 master.cpp:416] Master ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b (ip-172-16-10-92.ec2.internal) started on 172.16.10.92:42320 I0110 17:12:59.304451 13912 master.cpp:419] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="1000secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/PfFTwT/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" 
--registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --roles="role1" --root_submissions="true" --version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/PfFTwT/master" --zk_session_timeout="10secs" I0110 17:12:59.304585 13912 master.cpp:468] Master only allowing authenticated frameworks to register I0110 17:12:59.304595 13912 master.cpp:474] Master only allowing authenticated agents to register I0110 17:12:59.304603 13912 master.cpp:480] Master only allowing authenticated HTTP frameworks to register I0110 17:12:59.304615 13912 credentials.hpp:37] Loading credentials for authentication from '/tmp/PfFTwT/credentials' I0110 17:12:59.304684 13912 master.cpp:524] Using default 'crammd5' authenticator I0110 17:12:59.304744 13912 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0110 17:12:59.304831 13912 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0110 17:12:59.304889 13912 http.cpp:965] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0110 17:12:59.304941 13912 master.cpp:605] Authorization enabled W0110 17:12:59.304967 13912 master.cpp:668] The '--roles' flag is deprecated. This flag will be removed in the future. See the Mesos 0.27 upgrade notes for more information I0110 17:12:59.305047 13919 hierarchical.cpp:176] Initialized hierarchical allocator process I0110 17:12:59.305128 13918 whitelist_watcher.cpp:77] No whitelist given I0110 17:12:59.305600 13914 master.cpp:2085] Elected as the leading master! 
I0110 17:12:59.305622 13914 master.cpp:1640] Recovering from registrar I0110 17:12:59.305698 13913 registrar.cpp:339] Recovering registrar I0110 17:12:59.305853 13912 registrar.cpp:383] Successfully fetched the registry (0B) in 118016ns I0110 17:12:59.305899 13912 registrar.cpp:487] Applied 1 operations in 8238ns; attempting to update the registry I0110 17:12:59.306036 13912 registrar.cpp:544] Successfully updated the registry in 112128ns I0110 17:12:59.306092 13912 registrar.cpp:416] Successfully recovered registrar I0110 17:12:59.306217 13916 master.cpp:1754] Recovered 0 agents from the registry (172B); allowing 10mins for agents to reregister I0110 17:12:59.306258 13919 hierarchical.cpp:216] Skipping recovery of hierarchical allocator: nothing to recover W0110 17:12:59.307780 13893 process.cpp:2829] Attempted to spawn already running process files@172.16.10.92:42320 I0110 17:12:59.308149 13893 containerizer.cpp:305] Using isolation { environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni } I0110 17:12:59.310348 13893 linux_launcher.cpp:144] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux
[jira] [Comment Edited] (MESOS-9463) Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter
[ https://issues.apache.org/jira/browse/MESOS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725038#comment-16725038 ] Andrei Budnik edited comment on MESOS-9463 at 12/19/18 2:23 PM: Since GTEST filter [does not support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests] boolean AND operator and does not support composition (to emulate AND operator using De Morgan's laws), we should either: 1) Fix Mesos containerizer and Mesos tests to support launching ROOT tests in parallel 2) when GTEST_FILTER is specified, run all tests in sequential mode was (Author: abudnik): Since GTEST filter [does not support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests] boolean AND operator and does not support composition (to emulate AND operator using De Morgan's laws), we should either: 1) Fix mesos c'zer and mesos tests to support launching ROOT tests in parallel 2) when GTEST_FILTER is specified, run all tests in sequential mode > Parallel test runner gets confused if a GTEST_FILTER expression also matches > a sequential filter > > > Key: MESOS-9463 > URL: https://issues.apache.org/jira/browse/MESOS-9463 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Bannier >Priority: Major > Labels: parallel-tests, test > > Users expect to be able to select tests to run via {{make check}} with a > {{GTEST_FILTER}} environment variable. The parallel test runner on the other > hand programmatically also injects filter expressions to select tests to > execute sequentially. > This causes e.g., all {{*ROOT_*}} tests to be run in the sequential phase for > superusers, even if a {{GTEST_FILTER}} was set. > It seems that we need to handle a set {{GTEST_FILTER}} environment variable more > carefully. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
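For background on why the two filters cannot be combined, here is a sketch of GTEST_FILTER's matching semantics. This is an approximation (it uses POSIX fnmatch(3) for the '*' and '?' wildcards, which is close to but not identical to gtest's own matcher): a filter is a ':'-separated list of positive patterns, optionally followed by '-' and negative patterns, and a test runs iff it matches some positive pattern and no negative pattern. The grammar has no AND operator, so a user filter cannot be intersected with the runner's sequential filter (e.g. `*ROOT_*`) in one expression.

```cpp
#include <fnmatch.h>
#include <string>
#include <vector>

// Split a ':'-separated pattern list, dropping empty entries.
static std::vector<std::string> splitPatterns(const std::string& s)
{
  std::vector<std::string> out;
  size_t start = 0;
  while (start <= s.size()) {
    size_t end = s.find(':', start);
    if (end == std::string::npos) end = s.size();
    if (end > start) out.push_back(s.substr(start, end - start));
    start = end + 1;
  }
  return out;
}

// Approximates GTEST_FILTER matching: "POS1:POS2-NEG1:NEG2".
// An empty positive section means "match everything".
bool matchesFilter(const std::string& name, const std::string& filter)
{
  const size_t dash = filter.find('-');
  const std::string positive =
    dash == std::string::npos ? filter : filter.substr(0, dash);
  const std::string negative =
    dash == std::string::npos ? "" : filter.substr(dash + 1);

  bool matched = positive.empty();
  for (const std::string& p : splitPatterns(positive)) {
    if (fnmatch(p.c_str(), name.c_str(), 0) == 0) matched = true;
  }
  for (const std::string& n : splitPatterns(negative)) {
    if (fnmatch(n.c_str(), name.c_str(), 0) == 0) matched = false;
  }
  return matched;
}
```

Because a test either runs in the single expression's scope or not at all, option 2 above (fall back to sequential mode whenever GTEST_FILTER is set) is the conservative workaround.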
[jira] [Comment Edited] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.
[ https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715052#comment-16715052 ] Andrei Budnik edited comment on MESOS-9462 at 12/10/18 6:53 PM: [https://reviews.apache.org/r/69540/] [https://reviews.apache.org/r/69545/] was (Author: abudnik): [https://reviews.apache.org/r/69540/] > Devices in a container are inaccessible due to `nodev` on `/var/run`. > - > > Key: MESOS-9462 > URL: https://issues.apache.org/jira/browse/MESOS-9462 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.8.0 >Reporter: Jie Yu >Assignee: Andrei Budnik >Priority: Blocker > Labels: regression > > A recent [patch|https://reviews.apache.org/r/69086/] (commit > ede8155d1d043137e15007c48da36ac5fa0b5124) changes the behavior of how > standard device nodes (e.g., /dev/null, etc.) are set up. It now uses bind mounts > (from the host) instead of mknod. > The device nodes are created under > `/var/run/mesos/containers//devices`, and then bind mounted to > the container root filesystem. This is problematic for those Linux distros > that mount `/var/run` (or `/run`) as `nodev`. For instance, CentOS 7.4: > {noformat} > [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ " > > > 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755 > [jie@core-dev ~]$ cat /etc/redhat-release > CentOS Linux release 7.4.1708 (Core) > {noformat} > As a result, the `/dev/null` devices in the container will inherit the > `nodev` from `/run` on the host > {noformat} > 629 625 0:121 > /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null > rw,nosuid,nodev - tmpfs tmpfs rw,mode=755 > {noformat} > This will cause a "Permission Denied" error when a process in the container > tries to open the device node. 
> You can try to reproduce this issue using Mesos Mini > {noformat} > docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 > mesos/mesos-mini:master-2018-12-06 > {noformat} > And then, go to the Marathon UI (http://localhost:8080), and launch an app using > the following config > {code} > { > "id": "/test", > "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync", > "cpus": 1, > "mem": 128, > "disk": 128, > "instances": 1, > "container": { > "type": "MESOS", > "docker": { > "image": "ubuntu:18.04" > } > } > } > {code} > You'll see the task fail with "Permission Denied". > The task will run normally if you use `mesos/mesos-mini:master-2018-12-01` -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9461) `CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage` is flaky.
Andrei Budnik created MESOS-9461: Summary: `CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage` is flaky. Key: MESOS-9461 URL: https://issues.apache.org/jira/browse/MESOS-9461 Project: Mesos Issue Type: Bug Affects Versions: 1.8.0 Environment: Fedora 25 Reporter: Andrei Budnik Attachments: ROOT_CGROUPS_BlkioUsage-badrun.txt This test consistently fails on Fedora 25 (kernel 4.13.16-100.fc25.x86_64). {code:java} $ mount|grep blkio cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio) {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9456) Set `SCMP_FLTATR_CTL_LOG` attribute during initialization of Seccomp context
Andrei Budnik created MESOS-9456: Summary: Set `SCMP_FLTATR_CTL_LOG` attribute during initialization of Seccomp context Key: MESOS-9456 URL: https://issues.apache.org/jira/browse/MESOS-9456 Project: Mesos Issue Type: Task Components: containerization Reporter: Andrei Budnik Since version 4.14, the Linux kernel supports the SECCOMP_FILTER_FLAG_LOG flag, which enables logging for all Seccomp filter actions except SECCOMP_RET_ALLOW. If a Seccomp filter does not allow a system call, the kernel will print a message to dmesg when that system call is invoked. At the moment, libseccomp 2.3.3 does not expose this flag, but the latest master branch of libseccomp supports SECCOMP_FILTER_FLAG_LOG. So, we need to add {code:java} seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1);{code} into `SeccompFilter::create()` once a new version of libseccomp is released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
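Until such a libseccomp release, kernel support for the flag can be probed directly through the raw seccomp(2) syscall. The trick (also used internally by libseccomp) is that passing a NULL filter with a candidate flag yields EFAULT when the kernel recognizes the flag and EINVAL when it does not, without actually installing a filter. A sketch, not Mesos code:

```cpp
#include <cerrno>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SECCOMP_SET_MODE_FILTER
#define SECCOMP_SET_MODE_FILTER 1
#endif

// SECCOMP_FILTER_FLAG_LOG appeared in the uapi headers for kernel 4.14.
#ifndef SECCOMP_FILTER_FLAG_LOG
#define SECCOMP_FILTER_FLAG_LOG (1UL << 1)
#endif

// Probes whether the running kernel accepts SECCOMP_FILTER_FLAG_LOG.
// With a NULL filter, no filter is installed: the kernel validates the
// flags before dereferencing the filter, so EFAULT means the flag is
// known (kernel >= 4.14) and EINVAL means it is not.
// Returns 1 (supported), 0 (unsupported), or -1 (ENOSYS etc.).
int seccompLogFlagSupported()
{
#ifdef SYS_seccomp
  errno = 0;
  long rc = syscall(
      SYS_seccomp, SECCOMP_SET_MODE_FILTER, SECCOMP_FILTER_FLAG_LOG, nullptr);
  if (rc == -1 && errno == EFAULT) return 1;  // Flag accepted by the kernel.
  if (rc == -1 && errno == EINVAL) return 0;  // Flag unknown to the kernel.
#endif
  return -1;
}
```

The eventual change in `SeccompFilter::create()` would then call `seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1)` only when both the libseccomp version and a probe like this one indicate support.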
[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub
[ https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708647#comment-16708647 ] Andrei Budnik commented on MESOS-9157: -- [~MichaelBowie] feel free to reach out to me directly if you need any help on this ticket via [https://mesos.slack.com/] > cannot pull docker image from dockerhub > --- > > Key: MESOS-9157 > URL: https://issues.apache.org/jira/browse/MESOS-9157 > Project: Mesos > Issue Type: Bug > Components: fetcher >Affects Versions: 1.6.1 >Reporter: Michael Bowie >Priority: Blocker > Labels: containerization > > I am not able to pull docker images from docker hub through marathon/mesos. > I get one of two errors: > * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: > time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing > with pull after error: context canceled"` > * `Failed to run docker -H ... Error: No such object: > mesos-d2f333a8-fef2-48fb-8b99-28c52c327790` > However, I can manually ssh into one of the agents and successfully pull the > image from the command line. > Any pointers in the right direction? > Thank you! > Similar Issues: > https://github.com/mesosphere/marathon/issues/3869 -- This message was sent by Atlassian JIRA (v7.6.3#76005)