[jira] [Created] (MESOS-10158) Mesos Agent gets stuck in Draining due to pending unacknowledged status updates

2020-07-07 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10158:
-

 Summary: Mesos Agent gets stuck in Draining due to pending 
unacknowledged status updates
 Key: MESOS-10158
 URL: https://issues.apache.org/jira/browse/MESOS-10158
 Project: Mesos
  Issue Type: Bug
  Components: master
Reporter: Andrei Budnik


A Mesos agent can get stuck in Draining mode because of pending unacknowledged 
status updates. When a framework becomes disconnected, the agent keeps resending 
task status updates for the terminated tasks of that framework. The agent then 
remains stuck in the DRAINING state, because the master transitions the agent 
from DRAINING to DRAINED only after all task status updates have been 
acknowledged.

This problem can be resolved by sending a ["Teardown" 
operation|https://github.com/apache/mesos/blob/8ce5d30808f3744eeded09d530f226079d569a94/include/mesos/v1/master/master.proto#L299-L303]
 for all lost frameworks. However, it would be much better if this situation 
could be handled automatically by the master. At the very least, we should make 
it easier for an operator to find out what prevents the draining operation from 
completing.
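As a stopgap, the teardown workaround can be scripted against the master's v1 operator API. The sketch below is an illustration, not Mesos tooling: the framework ID is a made-up placeholder, and the real IDs of lost frameworks would first have to be looked up (e.g. via the master's GET_FRAMEWORKS call).

```python
import json

# Placeholder ID of a "lost" framework -- not a real framework ID.
framework_id = "11112222-3333-4444-5555-666677778888-0001"

# v1 master Call message for the Teardown operation linked above.
call = {
    "type": "TEARDOWN",
    "teardown": {"framework_id": {"value": framework_id}},
}

payload = json.dumps(call)
print(payload)

# The payload would then be POSTed to the master's operator endpoint, e.g.:
#   curl -X POST -H 'Content-Type: application/json' \
#        -d "$payload" http://<master>:5050/api/v1
```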



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-7485) Add verbose logging for curl commands used in fetcher/puller

2020-06-10 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-7485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-7485:


Assignee: Andrei Budnik

> Add verbose logging for curl commands used in fetcher/puller
> 
>
> Key: MESOS-7485
> URL: https://issues.apache.org/jira/browse/MESOS-7485
> Project: Mesos
>  Issue Type: Bug
>Reporter: Zhitao Li
>Assignee: Andrei Budnik
>Priority: Major
>
> Right now it's pretty hard to debug curl failures from the puller/fetcher: even 
> with verbose logging turned on, we only see that `curl` failed, but no additional 
> information.
> We should at least log the URL we pass to curl. Ideally, we should also log 
> all other options except any auth headers (perhaps indicating which auth 
> header was used).
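A rough sketch of what such logging could look like (a hypothetical helper, not Mesos code): log the full curl argument list, but redact the values of auth-related headers while keeping the header name, so the log still shows which auth scheme was in use.

```python
def loggable_curl_command(args):
    """Return a copy of a curl argument list that is safe to log: values of
    auth-related headers are redacted, but the header name is kept so the
    log still shows which auth header was used."""
    redacted = []
    i = 0
    while i < len(args):
        if args[i] == "-H" and i + 1 < len(args):
            name, sep, _value = args[i + 1].partition(":")
            if sep and name.strip().lower() in ("authorization", "x-auth-token"):
                redacted += ["-H", name + ": <redacted>"]
            else:
                redacted += ["-H", args[i + 1]]
            i += 2
        else:
            redacted.append(args[i])
            i += 1
    return redacted

# Example invocation with a made-up registry URL and token.
cmd = ["curl", "-s", "-H", "Authorization: Bearer s3cret",
       "-H", "Accept: application/json", "https://registry.example.com/v2/"]
logline = " ".join(loggable_curl_command(cmd))
print(logline)
```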





[jira] [Comment Edited] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-29 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119576#comment-17119576
 ] 

Andrei Budnik edited comment on MESOS-10131 at 5/29/20, 1:04 PM:
-

Please keep posting error messages on agent crash. Hopefully, we'll capture the 
part of `mountinfo` containing the loop.
I think it might be worth capturing the mount info right after the crash 
happens. We could check whether there are duplicate records, detect a loop, or 
find some other anomalies: `mount && cat /proc/1/mountinfo && cat /proc//mountinfo`


was (Author: abudnik):
Please keep posting error messages on agent crash. Hopefully, we'll capture a 
part of `mountinfo` containing the loop.
I think it might be worth capturing mount info after the moment it happens. We 
could check then if there are duplicate records or even detect a loop or find 
some other anomalies. `mount && cat /proc/1/mountinfo` && `cat /proc//mountinfo`

> Agent frequently dies with error "Cycle found in mount table hierarchy"
> ---
>
> Key: MESOS-10131
> URL: https://issues.apache.org/jira/browse/MESOS-10131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, framework
>Affects Versions: 1.9.0
>Reporter: Thomas Plummer
>Assignee: Andrei Budnik
>Priority: Major
> Attachments: log.txt
>
>
> Our Mesos agent frequently dies with the following error in the slave logs:
>  
> {code:java}
> F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: 
> !visitedParents.contains(parentId) Cycle found in mount table hierarchy at 
> entry '1954': 
> 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs 
> rw,seclabel
> 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs 
> rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755
> 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - 
> securityfs securityfs rw
> 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs 
> rw,seclabel
> 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts 
> rw,seclabel,gid=5,mode=620,ptmxmode=000
> 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755
> 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs 
> ro,seclabel,mode=755
> 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 
> - cgroup cgroup 
> rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - 
> pstore pstore rw
> 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime 
> shared:21 - efivarfs efivarfs rw
> 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> shared:10 - cgroup cgroup rw,seclabel,perf_event
> 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls
> 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 
> - cgroup cgroup rw,seclabel,cpuset
> 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - 
> cgroup cgroup rw,seclabel,blkio
> 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 
> - cgroup cgroup rw,seclabel,freezer
> 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 
> - cgroup cgroup rw,seclabel,hugetlb
> 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 
> - cgroup cgroup rw,seclabel,devices
> 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu
> 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 
> - cgroup cgroup rw,seclabel,memory
> 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - 
> cgroup cgroup rw,seclabel,pids
> 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw
> 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw
> 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs 
> systemd-1 
> rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414
> 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw
> 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel
> 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs 
> rw,seclabel
> 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota

[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-29 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17119576#comment-17119576
 ] 

Andrei Budnik commented on MESOS-10131:
---

Please keep posting error messages on agent crash. Hopefully, we'll capture the 
part of `mountinfo` containing the loop.
I think it might be worth capturing the mount info right after the crash 
happens. We could check whether there are duplicate records, detect a loop, or 
find some other anomalies: `mount && cat /proc/1/mountinfo && cat /proc//mountinfo`
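The duplicate/loop check suggested here could be sketched like this (a standalone illustration, not the Mesos fs code), using only the first two fields of each mountinfo line (mount ID and parent ID):

```python
def find_mount_anomalies(mountinfo_text):
    """Scan `cat /proc/<pid>/mountinfo` output for duplicate mount IDs and
    for entries whose parent-ID chain runs into a cycle."""
    parent = {}
    duplicates = []
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        mount_id, parent_id = fields[0], fields[1]
        if mount_id in parent:
            duplicates.append(mount_id)
        parent[mount_id] = parent_id
    loops = []
    for start in parent:
        seen = set()
        node = start
        while node in parent and node not in seen:
            seen.add(node)
            node = parent[node]
        if node in seen:  # the chain walked back onto itself: a cycle
            loops.append(start)
    return duplicates, loops

# Tiny synthetic table: one duplicated entry ("18") and a 99<->100 cycle.
sample = """\
18 41 0:18 / /sys rw shared:6 - sysfs sysfs rw
18 41 0:18 / /sys rw shared:6 - sysfs sysfs rw
41 0 253:0 / / rw shared:1 - xfs /dev/mapper/vg_system-root rw
99 100 0:99 / /a rw shared:33 - tmpfs tmpfs rw
100 99 0:99 / /b rw shared:33 - tmpfs tmpfs rw
"""
dups, loops = find_mount_anomalies(sample)
print(dups, sorted(loops))
```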


[jira] [Comment Edited] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-28 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118896#comment-17118896
 ] 

Andrei Budnik edited comment on MESOS-10131 at 5/28/20, 5:21 PM:
-

I think the message containing the whole mount table is long enough (~30k 
bytes) to reach the limit of the logger buffer...
 [~tomplummer] Could you capture both the truncated log message and the output 
of "cat /proc//mountinfo" the next time it crashes? (and/or `mount 
&& cat /proc/1/mountinfo` if the Mesos agent can't start)


was (Author: abudnik):
I think the message containing the whole mount table is long enough (~30k 
bytes) to reach the limit of the logger buffer...
[~tomplummer] Could you capture both truncated log message and the output of 
"cat /proc//mountinfo" next time it crashes?


[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-28 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17118896#comment-17118896
 ] 

Andrei Budnik commented on MESOS-10131:
---

I think the message containing the whole mount table is long enough (~30k 
bytes) to reach the limit of the logger buffer...
[~tomplummer] Could you capture both the truncated log message and the output of 
"cat /proc//mountinfo" the next time it crashes?


[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-27 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117903#comment-17117903
 ] 

Andrei Budnik commented on MESOS-10131:
---

[~tomplummer] It seems that the tail of the log message is missing. Could you 
please provide the whole log message containing the mount table? We will try to 
reproduce the problem by running a unit test to ensure that this is not a bug 
in the code.


[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-27 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117892#comment-17117892
 ] 

Andrei Budnik commented on MESOS-10131:
---

Mount table without extra newlines:
{code:java}
18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs 
rw,seclabel
19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs 
rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755
21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - 
securityfs securityfs rw
22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs rw,seclabel
23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts 
rw,seclabel,gid=5,mode=620,ptmxmode=000
24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755
25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs 
ro,seclabel,mode=755
26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 - 
cgroup cgroup 
rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - pstore 
pstore rw
28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime 
shared:21 - efivarfs efivarfs rw
29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
shared:10 - cgroup cgroup rw,seclabel,perf_event
30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls
31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 - 
cgroup cgroup rw,seclabel,cpuset
32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - 
cgroup cgroup rw,seclabel,blkio
33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 - 
cgroup cgroup rw,seclabel,freezer
34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 - 
cgroup cgroup rw,seclabel,hugetlb
35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 - 
cgroup cgroup rw,seclabel,devices
36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu
37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 - 
cgroup cgroup rw,seclabel,memory
38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - 
cgroup cgroup rw,seclabel,pids
39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw
41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw
43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs systemd-1 
rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414
44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw
45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel
46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs 
rw,seclabel
47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 
rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
50 41 253:5 / /home rw,nodev,relatime shared:32 - xfs 
/dev/mapper/vg_system-home 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
51 41 253:4 / /tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs 
/dev/mapper/vg_system-tmp 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
53 49 253:4 / /var/tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs 
/dev/mapper/vg_system-tmp 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
52 49 253:3 / /var/log rw,relatime shared:34 - xfs /dev/mapper/vg_system-varlog 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
54 52 253:6 / /var/log/audit rw,relatime shared:35 - xfs 
/dev/mapper/vg_system-varlogaudit 
rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
187 41 0:41 / /mnt/receipt rw,relatime shared:165 - nfs4 
dtmetlnfsa01p.a.carfax.us:/ 
rw,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.18.154.117,local_lock=none,addr=172.18.138.237
188 41 0:42 / /mnt/receipt_web_dev rw,relatime shared:169 - nfs4 
dtmetlnfsa01b.a.carfax.us:/ 
rw,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=172.18.154.117,local_lock=none,addr=172.18.137.248
192 41 0:41 / /mnt/receipt_web_prod 

[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-27 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17117873#comment-17117873
 ] 

Andrei Budnik commented on MESOS-10131:
---

I've copy-pasted the mount table from the log excerpt into one of our unit 
tests (`FsTest.MountInfoTableReadSortedParentOfSelf`). It failed with the 
following error message:

{code:java}
../../src/tests/containerizer/fs_tests.cpp:344: Failure
table: Failed to parse entry 
'docker/overlay2/l/LOG7DILAFLJBIQ7CKDQVFXJLP7:/var/lib/docker/overlay2/l/6JVIPP3XCCWKZPFAUWKXCDWYXL:/var/lib/docker/overlay2/l/L5VKHJHVOWG24VJPJCAKGTQX5G:/var/lib/docker/overlay2/l/ZIIS5MWCIF4C6KXI2LVKVU4TMF:/var/lib/docker/overlay2/l/4JXI':
 Could not find separator ' - '
{code}

It seems that there was memory corruption. I'm investigating what could be 
the cause.
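For context, here is a minimal Python sketch of what the mountinfo parsing and the cycle check in fs.cpp do (function names and return shapes are illustrative, not the actual Mesos code):

```python
def parse_mountinfo_entry(line):
    # Each /proc/<pid>/mountinfo entry has the form:
    #   <id> <parent-id> <major:minor> <root> <mount-point> <opts>... - <fstype> <source> <super-opts>
    # A variable number of optional fields precedes ' - ', so that separator
    # is the only reliable split point -- its absence is exactly the
    # "Could not find separator ' - '" parse error quoted above.
    if ' - ' not in line:
        raise ValueError("Could not find separator ' - ' in entry: %r" % line)
    mount_part, _fs_part = line.split(' - ', 1)
    fields = mount_part.split()
    return int(fields[0]), int(fields[1]), fields[3], fields[4]


def has_cycle(entries):
    # Walk each entry up its parent chain; revisiting a mount id mirrors the
    # `!visitedParents.contains(parentId)` CHECK that aborts the agent.
    parents = {mid: pid for mid, pid, _root, _mp in entries}
    for start in parents:
        visited = set()
        node = start
        while node in parents:
            if node in visited:
                return True
            visited.add(node)
            node = parents[node]
    return False
```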

> Agent frequently dies with error "Cycle found in mount table hierarchy"
> ---
>
> Key: MESOS-10131
> URL: https://issues.apache.org/jira/browse/MESOS-10131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, framework
>Affects Versions: 1.9.0
>Reporter: Thomas Plummer
>Assignee: Andrei Budnik
>Priority: Major
>
> Our Mesos agent frequently dies with the following error in the slave logs:
>  
> {code:java}
> F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: 
> !visitedParents.contains(parentId) Cycle found in mount table hierarchy at 
> entry '1954': 
> 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs 
> rw,seclabel
> 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs 
> rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755
> 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - 
> securityfs securityfs rw
> 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs 
> rw,seclabel
> 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts 
> rw,seclabel,gid=5,mode=620,ptmxmode=000
> 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755
> 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs 
> ro,seclabel,mode=755
> 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 
> - cgroup cgroup 
> rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - 
> pstore pstore rw
> 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime 
> shared:21 - efivarfs efivarfs rw
> 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> shared:10 - cgroup cgroup rw,seclabel,perf_event
> 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls
> 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 
> - cgroup cgroup rw,seclabel,cpuset
> 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - 
> cgroup cgroup rw,seclabel,blkio
> 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 
> - cgroup cgroup rw,seclabel,freezer
> 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 
> - cgroup cgroup rw,seclabel,hugetlb
> 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 
> - cgroup cgroup rw,seclabel,devices
> 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu
> 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 
> - cgroup cgroup rw,seclabel,memory
> 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - 
> cgroup cgroup rw,seclabel,pids
> 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw
> 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw
> 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs 
> systemd-1 
> rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414
> 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw
> 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel
> 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs 
> rw,seclabel
> 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 
> rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
> 49 41 253:2 / /var 

[jira] [Assigned] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2020-05-27 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-10131:
-

Assignee: Andrei Budnik

> Agent frequently dies with error "Cycle found in mount table hierarchy"
> ---
>
> Key: MESOS-10131
> URL: https://issues.apache.org/jira/browse/MESOS-10131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, framework
>Affects Versions: 1.9.0
>Reporter: Thomas Plummer
>Assignee: Andrei Budnik
>Priority: Major
>
> Our Mesos agent frequently dies with the following error in the slave logs:
>  
> {code:java}
> F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: 
> !visitedParents.contains(parentId) Cycle found in mount table hierarchy at 
> entry '1954': 
> 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs 
> rw,seclabel
> 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs 
> rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755
> 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - 
> securityfs securityfs rw
> 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs 
> rw,seclabel
> 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts 
> rw,seclabel,gid=5,mode=620,ptmxmode=000
> 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755
> 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs 
> ro,seclabel,mode=755
> 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 
> - cgroup cgroup 
> rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - 
> pstore pstore rw
> 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime 
> shared:21 - efivarfs efivarfs rw
> 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> shared:10 - cgroup cgroup rw,seclabel,perf_event
> 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls
> 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 
> - cgroup cgroup rw,seclabel,cpuset
> 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - 
> cgroup cgroup rw,seclabel,blkio
> 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 
> - cgroup cgroup rw,seclabel,freezer
> 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 
> - cgroup cgroup rw,seclabel,hugetlb
> 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 
> - cgroup cgroup rw,seclabel,devices
> 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu
> 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 
> - cgroup cgroup rw,seclabel,memory
> 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - 
> cgroup cgroup rw,seclabel,pids
> 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw
> 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw
> 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs 
> systemd-1 
> rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414
> 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw
> 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel
> 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs 
> rw,seclabel
> 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 
> rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
> 49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 50 41 253:5 / /home rw,nodev,relatime shared:32 - xfs 
> /dev/mapper/vg_system-home 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 51 41 253:4 / /tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs 
> /dev/mapper/vg_system-tmp 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 53 49 253:4 / /var/tmp rw,nosuid,nodev,noexec,relatime shared:33 - xfs 
> /dev/mapper/vg_system-tmp 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 52 49 253:3 / /var/log 

[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-05-07 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101572#comment-17101572
 ] 

Andrei Budnik commented on MESOS-10107:
---

{code:java}
commit 0cb1591b709e3c9f32093d943b8e2ddcdcf7999f
Author: Charles-Francois Natali 
Date:   Sat May 2 01:41:09 2020 +0100

Keep retrying to remove cgroup on EBUSY.

This is a follow-up to MESOS-10107, which introduced retries when
calling `rmdir` on a seemingly empty cgroup fails with `EBUSY`
because of various kernel bugs.
At the time, the fix introduced a bounded number of retries, using an
exponential backoff summing up to slightly over 1s. This was done
because it was similar to what Docker does, and worked during testing.
However, after 1 month without seeing this error in our cluster at work,
we finally experienced one case where the 1s timeout wasn't enough.
It could be because the machine was busy at the time, or some other
random factor.
So instead of only trying for 1s, I think it might make sense to just
keep retrying, until the top-level container destruction timeout - set
at 1 minute - kicks in.
This actually makes more sense, and avoids having a magical timeout in
the cgroup code.
We just need to ensure that when the destroyer is finalized, it discards
the future in charge of doing the periodic remove.

This closes #362
{code}
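The retry loop the commit describes can be sketched like this (Python for brevity; the function name is illustrative, `deadline` stands in for the one-minute container destruction timeout, and the retry interval is an assumption):

```python
import errno
import os
import time


def remove_cgroup(path, deadline=60.0, interval=0.1):
    # Keep retrying rmdir() for as long as it fails with EBUSY; the overall
    # bound comes from the caller's destruction timeout (60s, per the commit
    # above) rather than from a fixed backoff budget in the cgroup code.
    end = time.monotonic() + deadline
    while True:
        try:
            os.rmdir(path)
            return
        except FileNotFoundError:
            return  # already removed, nothing to do
        except OSError as e:
            # Any error other than EBUSY, or running past the deadline,
            # propagates to the caller.
            if e.errno != errno.EBUSY or time.monotonic() >= end:
                raise
            time.sleep(interval)
```

In the actual fix the periodic retry runs asynchronously and is discarded when the destroyer is finalized; the synchronous deadline here is just the simplest equivalent.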

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Assignee: Charles Natali
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 

[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-21 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088568#comment-17088568
 ] 

Andrei Budnik commented on MESOS-10119:
---

Could you reproduce the cgroup destruction problem consistently?
What are the kernel and systemd versions installed on your agents?

> failure to destroy container can cause the agent to "leak" a GPU
> 
>
> Key: MESOS-10119
> URL: https://issues.apache.org/jira/browse/MESOS-10119
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Charles Natali
>Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
>  # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950, so it 
> can be tackled separately. However, the fact that the agent leaks the GPU is 
> pretty bad, because the GPU basically turns into /dev/null, failing all 
> subsequent tasks that request a GPU.
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring update for unknown container 
> 87253521-8d39-47ea-b4d1-febe527d230c
> Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 
> process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed 
> connect: 

[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-15 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17084136#comment-17084136
 ] 

Andrei Budnik commented on MESOS-10107:
---

{code:java}
commit af3ca189aced5fbc537bfca571264142d4cd37b3
Author: Charles-Francois Natali 
Date:   Wed Apr 1 13:40:16 2020 +0100

Handled EBUSY when destroying a cgroup.

It's a workaround for kernel bugs which can cause `rmdir` to fail with
`EBUSY` even though the cgroup - appears - empty.
See for example https://lkml.org/lkml/2020/1/15/1349

This closes #355
{code}
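This first patch used a bounded number of retries with exponential backoff, later described as "summing up to slightly over 1s". A sketch of such a schedule (the constants are assumptions for illustration, not the exact values in the patch):

```python
def backoff_schedule(initial=0.010, factor=2, retries=7):
    # Delays (in seconds) between rmdir() attempts: each wait doubles, and
    # with these (assumed) defaults the total is 1.27s -- "slightly over 1s".
    return [initial * factor ** i for i in range(retries)]
```

The bound keeps the cgroup code from blocking destruction indefinitely, at the cost of a magic timeout that turned out to be too short on a busy machine.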

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  23:01:32.294220 read(18,  
> 

[jira] [Assigned] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-15 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-10107:
-

Assignee: Charles Natali

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles N
>Assignee: Charles Natali
>Priority: Major
>  Labels: cgroups, containerization
> Fix For: 1.10.0
>
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  23:01:32.294220 read(18,  
> 3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
> 15 <0.33>
> 3431  23:01:32.294306 read(18, "", 4096) = 0 <0.13>
> 3431  23:01:32.294346 close(18 
> 3431  23:01:32.294402 <... close resumed> ) = 0 <0.45>
> #kill them
> 3431  23:01:32.296266 kill(5878, SIGKILL) = 0 <0.19>
> 3431  23:01:32.296384 kill(6036, SIGKILL 
> 

[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-01 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17072680#comment-17072680
 ] 

Andrei Budnik commented on MESOS-10107:
---

Thanks for the detailed explanations!

Could you please submit your patch to [Apache Review 
Board|http://mesos.apache.org/documentation/latest/advanced-contribution/#submit-your-patch]
 or open a [PR on 
github|http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr/]?

Does the workaround work reliably after changing the initial delay and retry 
count to the values taken from libcontainerd (10ms and 5)?

Should we retry only if `::rmdir()` returns EBUSY errno error?

> containeriser: failed to remove cgroup - EBUSY
> --
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles
>Priority: Major
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit has been exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  

[jira] [Deleted] (MESOS-10078) Cgroups isolator: update cgroups subsystems to support nested cgroups

2020-02-27 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10078?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik deleted MESOS-10078:
--


> Cgroups isolator: update cgroups subsystems to support nested cgroups
> -
>
> Key: MESOS-10078
> URL: https://issues.apache.org/jira/browse/MESOS-10078
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: cgroups, containerization
>
> Update Cgroups Subsystems to support nested cgroups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10098) Mesos agent fails to start on outdated systemd.

2020-02-24 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10098:
-

 Summary: Mesos agent fails to start on outdated systemd.
 Key: MESOS-10098
 URL: https://issues.apache.org/jira/browse/MESOS-10098
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.10
 Environment: CoreOS 2411.0.0
Reporter: Andrei Budnik
Assignee: Andrei Budnik
 Fix For: 1.10


Mesos agent refuses to start due to a failure caused by the systemd-specific 
code:
{code:java}
E0220 12:03:02.943467 22298 main.cpp:670] EXIT with status 1: Expected exactly 
one socket with name unknown, got 0 instead
{code}

It turns out that some versions of systemd do not set the environment variables 
`LISTEN_PID`, `LISTEN_FDS` and `LISTEN_FDNAMES` for the Mesos agent process if 
its systemd unit is ill-formed. When this happens, `listenFdsWithName` returns 
an empty list, leading to the error above.

After the problem with the systemd unit is fixed, systemd sets the value of 
`LISTEN_FDNAMES` from the `FileDescriptorName` field; in our case, the env 
variable is set to `systemd:dcos-mesos-slave`. Since the value is expected to 
equal "systemd:unknown" (for compatibility with older systemd versions), the 
values mismatch and we see the same error message.
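As a rough illustration of how the socket-activation variable behaves (an assumed helper name, not Mesos's actual `listenFdsWithName` implementation): systemd passes socket names as a colon-separated list in `LISTEN_FDNAMES`, and outdated systemd versions may not set the variable at all.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Sketch: parse the colon-separated LISTEN_FDNAMES variable that systemd
// derives from the unit's `FileDescriptorName=` field. Outdated systemd
// versions (or an ill-formed unit) leave the variable unset, in which
// case an empty list is returned - the situation that triggered the
// "got 0 instead" error above.
std::vector<std::string> listenFdNames()
{
  std::vector<std::string> names;

  const char* value = std::getenv("LISTEN_FDNAMES");
  if (value == nullptr) {
    return names; // Variable absent: no named sockets were passed.
  }

  std::stringstream stream(value);
  std::string name;
  while (std::getline(stream, name, ':')) {
    names.push_back(name);
  }

  return names;
}
```

A robust consumer would need to accept both the absent-variable case and a name other than the expected default.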

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9853) Update Docker executor to allow kill policy overrides

2020-02-04 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9853?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17030016#comment-17030016
 ] 

Andrei Budnik commented on MESOS-9853:
--

Backported /r/71033/ ("Moved the Docker executor declaration into a header.") 
to the previous versions as there is a bugfix (/r/72055) that depends on this 
patch.

1.5.x
{code:java}
commit 68e8655c8fb6dbc41de6afb66a569583b32f78d3
Author: Greg Mann 
Date:   Thu Jul 25 12:17:41 2019 -0700

Moved the Docker executor declaration into a header.

This moves the declaration of the Docker executor into the
Docker executor header file and moves the code for the Docker
executor binary into a new launcher implementation file.

This change will enable the Mesos executor driver
implementation to make use of the `DockerExecutor` symbol.

Review: https://reviews.apache.org/r/71033/
{code}

1.6.x
{code:java}
commit 02eb0ceb87dadc0a5ac6f6cd9f141347e852fb80
Author: Greg Mann 
Date:   Thu Jul 25 12:17:41 2019 -0700

Moved the Docker executor declaration into a header.

This moves the declaration of the Docker executor into the
Docker executor header file and moves the code for the Docker
executor binary into a new launcher implementation file.

This change will enable the Mesos executor driver
implementation to make use of the `DockerExecutor` symbol.

Review: https://reviews.apache.org/r/71033/
{code}

1.7.x
{code:java}
commit 0567b31212105821d0b37ad049228dab6e98ed63
Author: Greg Mann 
Date:   Thu Jul 25 12:17:41 2019 -0700

Moved the Docker executor declaration into a header.

This moves the declaration of the Docker executor into the
Docker executor header file and moves the code for the Docker
executor binary into a new launcher implementation file.

This change will enable the Mesos executor driver
implementation to make use of the `DockerExecutor` symbol.

Review: https://reviews.apache.org/r/71033/
{code}

1.8.x
{code:java}
commit 1995f63352a5a8c2c8e73adefed708a8620a5d47
Author: Greg Mann 
Date:   Thu Jul 25 12:17:41 2019 -0700

Moved the Docker executor declaration into a header.

This moves the declaration of the Docker executor into the
Docker executor header file and moves the code for the Docker
executor binary into a new launcher implementation file.

This change will enable the Mesos executor driver
implementation to make use of the `DockerExecutor` symbol.

Review: https://reviews.apache.org/r/71033/
{code}

> Update Docker executor to allow kill policy overrides
> -
>
> Key: MESOS-9853
> URL: https://issues.apache.org/jira/browse/MESOS-9853
> Project: Mesos
>  Issue Type: Task
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: foundations, mesosphere
> Fix For: 1.9.0
>
>
> In order for the agent to successfully override the task kill policy of 
> Docker tasks when the agent is being drained, the Docker executor must be 
> able to receive kill policy overrides and must be updated to honor them. 
> Since the Docker executor runs using the executor driver, this is currently 
> not possible. We could, for example, update the executor driver interface, or 
> move the Docker executor off of the executor driver.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-8537) Default executor doesn't wait for status updates to be ack'd before shutting down

2020-02-03 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029065#comment-17029065
 ] 

Andrei Budnik commented on MESOS-8537:
--

1.5.x
{code:java}
commit 84b7af3409d8af343da0f0420e168a42de4b110f
Author: Andrei Budnik 
Date:   Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.

Review: https://reviews.apache.org/r/72029
{code}

1.6.x
{code:java}
commit 205525eb56a33e58bed1fc38e0b32189b19d3fbc
Author: Andrei Budnik 
Date:   Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.

Review: https://reviews.apache.org/r/72029
{code}

1.7.x
{code:java}
commit 5b399080eee11ee03f4bc6c09b791c24670da6c1
Author: Andrei Budnik 
Date:   Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.

Review: https://reviews.apache.org/r/72029
{code}

1.8.x
{code:java}
commit a2ca451aab4625e126b9e7b470eb9f7c232dd746
Author: Andrei Budnik 
Date:   Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.

Review: https://reviews.apache.org/r/72029
{code}

1.9.x
{code:java}
commit f37ae68a8f0d23a2e0f31812b8fe4494109769c6
Author: Andrei Budnik 
Date:   Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.

Review: https://reviews.apache.org/r/72029
{code}

[jira] [Comment Edited] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

2020-02-03 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029047#comment-17029047
 ] 

Andrei Budnik edited comment on MESOS-9847 at 2/3/20 3:54 PM:
--

{code:java}
commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}


was (Author: abudnik):

{code:java}
commit 683dfc1ffb0b1ca758a07d19ab3badd8cac62dc7
Author: Andrei Budnik 
Date: Wed Jan 29 19:07:50 2020 +0100

Changed termination logic of the default executor.

Previously, the default executor terminated itself after all containers
 had terminated. This could lead to termination of the executor before
 processing of a terminal status update by the agent. In order
 to mitigate this issue, the executor slept for one second to give a
 chance to send all status updates and receive all status update
 acknowledgements before terminating itself. This might have led to
 various race conditions in some circumstances (e.g., on a slow host).
 This patch terminates the default executor if all status updates have
 been acknowledged by the agent and no running containers left.
 Also, this patch increases the timeout from one second to one minute
 for fail-safety.

Review: https://reviews.apache.org/r/72029

commit 457c38967bf9a53c1c5cd2743385937a26f413f6
Author: Andrei Budnik 
Date: Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
 container had terminated. This could lead to termination of the
 executor before processing of a terminal status update by the agent.
 In order to mitigate this issue, the executor slept for one second to
 give a chance to send all status updates and receive all status update
 acknowledgments before terminating itself. This might have led to
 various race conditions in some circumstances (e.g., on a slow host).
 This patch terminates the Docker executor after receiving a terminal
 status update acknowledgment. Also, this patch increases the timeout
 from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

> Docker executor doesn't wait for status updates to be ack'd before shutting 
> down.
> -
>
> Key: MESOS-9847
> URL: https://issues.apache.org/jira/browse/MESOS-9847
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Reporter: Meng Zhu
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
> Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.2, 1.10, 1.9.1
>
>
> The docker executor doesn't wait for pending status updates to be 
> acknowledged before shutting down, instead it sleeps for one second and then 
> terminates:
> {noformat}
>   void _stop()
>   {
> // A hack for now ... but we need to wait until the status update
> // is sent to the slave before we shut ourselves down.
> // TODO(tnachen): Remove this hack and also the same hack in the
> // command executor when we have the new HTTP APIs to wait until
> // an ack.
> os::sleep(Seconds(1));
> driver.get()->stop();
>   }
> {noformat}
> This would result in racing between task status update (e.g. TASK_FINISHED) 
> and executor exit. The latter would lead agent generating a `TASK_FAILED` 
> status update by itself, leading to the confusing case where the agent 
> handles two different terminal status updates.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9847) Docker executor doesn't wait for status updates to be ack'd before shutting down.

2020-02-03 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17029051#comment-17029051
 ] 

Andrei Budnik commented on MESOS-9847:
--

1.5.x
{code:java}
commit ff98f12a50a56c13688b87068a116d1d08142f49
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

1.6.x
{code:java}
commit f511f25be9d850ee9b65fc3ec5f54d149beb2f19
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

1.7.x
{code:java}
commit 6a7da284d1b89f8a144ed2f896f005a5ee9d4aea
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

1.8.x
{code:java}
commit 1bd0b37a7e522d63319db426dae7068b901eaea6
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}


1.9.x
{code:java}
commit 3d60cba39d0377a7dc19b4c47f3bb0807418fe50
Author: Andrei Budnik 
Date:   Wed Jan 29 13:35:02 2020 +0100

Changed termination logic of the Docker executor.

Previously, the Docker executor terminated itself after a task's
container had terminated. This could lead to termination of the
executor before processing of a terminal status update by the agent.
In order to mitigate this issue, the executor slept for one second to
give a chance to send all status updates and receive all status update
acknowledgments before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the Docker executor after receiving a terminal
status update acknowledgment. Also, this patch increases the timeout
from one second to one minute for fail-safety.

Review: https://reviews.apache.org/r/72055
{code}

> Docker executor doesn't wait for status updates to be ack'd before shutting 
> down.

[jira] [Assigned] (MESOS-8537) Default executor doesn't wait for status updates to be ack'd before shutting down

2020-01-20 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-8537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-8537:


Assignee: Andrei Budnik

> Default executor doesn't wait for status updates to be ack'd before shutting 
> down
> -
>
> Key: MESOS-8537
> URL: https://issues.apache.org/jira/browse/MESOS-8537
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.4.1, 1.5.0
>Reporter: Gastón Kleiman
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization, default-executor, mesosphere
>
> The default executor doesn't wait for pending status updates to be 
> acknowledged before shutting down, instead it sleeps for one second and then 
> terminates:
> {code}
>   void _shutdown()
>   {
> const Duration duration = Seconds(1);
> LOG(INFO) << "Terminating after " << duration;
> // TODO(qianzhang): Remove this hack since the executor now receives
> // acknowledgements for status updates. The executor can terminate
> // after it receives an ACK for a terminal status update.
> os::sleep(duration);
> terminate(self());
>   }
> {code}
> The event handler should exit if upon receiving a {{Event::ACKNOWLEDGED}} the 
> executor is shutting down, no tasks are running anymore, and all pending 
> status updates have been acknowledged.
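The termination condition described above can be sketched as follows. The state struct and names are hypothetical, not the actual default executor code; the point is that all three conditions must hold before terminating.

```cpp
#include <cassert>

// Hypothetical executor state: terminate only once the executor is
// shutting down, no tasks remain running, and every pending status
// update has been acknowledged by the agent.
struct ExecutorState
{
  bool shuttingDown = false;
  int runningTasks = 0;
  int unacknowledgedUpdates = 0;
};

// Checked from the Event::ACKNOWLEDGED handler (and whenever a task
// terminates): replaces the fixed one-second sleep with an
// acknowledgement-driven exit.
bool shouldTerminate(const ExecutorState& state)
{
  return state.shuttingDown &&
         state.runningTasks == 0 &&
         state.unacknowledgedUpdates == 0;
}
```

Driving termination from acknowledgements removes the race in which the executor exits before the agent has processed the terminal update.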



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10080) Cgroups isolator: update cleanup logic to support nested cgroups

2019-12-23 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10080:
-

 Summary: Cgroups isolator: update cleanup logic to support nested 
cgroups
 Key: MESOS-10080
 URL: https://issues.apache.org/jira/browse/MESOS-10080
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Update the Cgroups isolator to clean up the nested cgroups of a nested 
container, taking into account the hierarchical layout of cgroups: the deepest 
nested cgroups must be destroyed first.
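The bottom-up destruction order could be sketched as follows (a hypothetical helper that orders cgroup paths by depth, not the isolator's actual code):

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Sketch: order cgroup paths so that deeper (more nested) cgroups come
// first. A cgroup directory cannot be removed while it still contains
// child cgroups, so destruction must proceed bottom-up.
std::vector<std::string> destructionOrder(std::vector<std::string> cgroups)
{
  std::sort(cgroups.begin(), cgroups.end(),
            [](const std::string& a, const std::string& b) {
              // More '/' separators means a deeper path.
              return std::count(a.begin(), a.end(), '/') >
                     std::count(b.begin(), b.end(), '/');
            });
  return cgroups;
}
```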



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10079) Cgroups isolator: recover nested cgroups

2019-12-23 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10079:
-

 Summary: Cgroups isolator: recover nested cgroups
 Key: MESOS-10079
 URL: https://issues.apache.org/jira/browse/MESOS-10079
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Update the recovery logic of the Cgroups isolator to recover nested cgroups for 
those nested containers that were launched in nested cgroups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10078) Cgroups isolator: update cgroups subsystems to support nested cgroups

2019-12-23 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10078:
-

 Summary: Cgroups isolator: update cgroups subsystems to support 
nested cgroups
 Key: MESOS-10078
 URL: https://issues.apache.org/jira/browse/MESOS-10078
 Project: Mesos
  Issue Type: Task
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Update Cgroups Subsystems to support nested cgroups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10077) Cgroups isolator: allow updating and isolating resources for nested cgroups

2019-12-23 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10077:
-

 Summary: Cgroups isolator: allow updating and isolating resources 
for nested cgroups
 Key: MESOS-10077
 URL: https://issues.apache.org/jira/browse/MESOS-10077
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Allow Cgroups isolator to update and isolate resources for nested cgroups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10076) Cgroups isolator: create nested cgroups

2019-12-23 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10076:
-

 Summary: Cgroups isolator: create nested cgroups
 Key: MESOS-10076
 URL: https://issues.apache.org/jira/browse/MESOS-10076
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Update the Cgroups isolator to create, during container launch preparation, a 
nested cgroup for each nested container that supports nested cgroups.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-13 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16995737#comment-16995737
 ] 

Andrei Budnik commented on MESOS-10066:
---

cc [~qianzhang]

> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.7.3
>Reporter: Dalton Matos Coelho Barreto
>Priority: Critical
> Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version:   19.03.5
>  API version:   1.39 (downgraded from 1.40)
>  Go version:go1.12.12
>  Git commit:633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch:   linux/amd64
>  Experimental:  false
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:  18.09.2
>   API version:  1.39 (minimum version 1.12)
>   Go version:   go1.10.6
>   Git commit:   6247962
>   Built:Sun Feb 10 03:42:13 2019
>   OS/Arch:  linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} 
> docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
> "*"
>   ],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "debian",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": true
> }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and get scheduled on the right agent, which is running 
> mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching 
> executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources 
> [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
>  in work directory 
> 

[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989880#comment-16989880
 ] 

Andrei Budnik commented on MESOS-10066:
---

So the Docker socket is mounted from the host FS into the Docker container? I'm 
not sure whether Mesos supports such a configuration. Since mesos-docker-executor 
is launched in a separate Docker container, there is no way to establish a socket 
connection from one Docker container (where the agent runs) to another (where the 
executor runs). Is the executor's port 10.234.172.56:9899 exposed by the Docker 
container?

AFAIK, [Mesos mini|http://mesos.apache.org/blog/mesos-mini/] uses 
Docker-in-Docker technique instead.

> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.7.3
>Reporter: Dalton Matos Coelho Barreto
>Priority: Critical
> Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
> In my tests I'm using Marathon as the scheduler and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version:   19.03.5
>  API version:   1.39 (downgraded from 1.40)
>  Go version:go1.12.12
>  Git commit:633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch:   linux/amd64
>  Experimental:  false
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:  18.09.2
>   API version:  1.39 (minimum version 1.12)
>   Go version:   go1.10.6
>   Git commit:   6247962
>   Built:Sun Feb 10 03:42:13 2019
>   OS/Arch:  linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} 
> docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
> "*"
>   ],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "debian",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": true
> }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running 
> mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> 

[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989808#comment-16989808
 ] 

Andrei Budnik commented on MESOS-10066:
---

Did you try to specify the --docker_mesos_image command-line option for the agent 
that runs inside the Docker container?

> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.7.3
>Reporter: Dalton Matos Coelho Barreto
>Priority: Critical
> Attachments: logs-after.txt, logs-before.txt
>
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
> In my tests I'm using Marathon as the scheduler and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version:   19.03.5
>  API version:   1.39 (downgraded from 1.40)
>  Go version:go1.12.12
>  Git commit:633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch:   linux/amd64
>  Experimental:  false
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:  18.09.2
>   API version:  1.39 (minimum version 1.12)
>   Go version:   go1.10.6
>   Git commit:   6247962
>   Built:Sun Feb 10 03:42:13 2019
>   OS/Arch:  linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} 
> docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
> "*"
>   ],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "debian",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": true
> }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running 
> mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching 
> executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources 
> 

[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies when agent stops. Recovery fails when agent returns

2019-12-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16989728#comment-16989728
 ] 

Andrei Budnik commented on MESOS-10066:
---

Could you please attach full agent logs?

> mesos-docker-executor process dies when agent stops. Recovery fails when 
> agent returns
> --
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker, executor
>Affects Versions: 1.7.3
>Reporter: Dalton Matos Coelho Barreto
>Priority: Critical
>
> Hello all,
> The documentation about Agent Recovery shows two conditions for the recovery 
> to be possible:
>  - The agent must have recovery enabled (default true?);
>  - The scheduler must register itself saying that it has checkpointing 
> enabled.
> In my tests I'm using Marathon as the scheduler and Mesos itself sees 
> Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, 
> "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
>   "name": "asgard-chronos",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
>   "checkpoint": true,
>   "active": true
> }
> {
>   "name": "marathon",
>   "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-",
>   "checkpoint": true,
>   "active": true
> }
> {noformat}
> Here is what I'm using:
>  # Mesos Master, 1.4.1
>  # Mesos Agent 1.7.3
>  # Using docker image {{mesos/mesos-centos:1.7.x}}
>  # Docker sock mounted from the host
>  # Docker binary also mounted from the host
>  # Marathon: 1.4.12
>  # Docker
> {noformat}
> Client: Docker Engine - Community
>  Version:   19.03.5
>  API version:   1.39 (downgraded from 1.40)
>  Go version:go1.12.12
>  Git commit:633a0ea838
>  Built: Wed Nov 13 07:22:05 2019
>  OS/Arch:   linux/amd64
>  Experimental:  false
> 
> Server: Docker Engine - Community
>  Engine:
>   Version:  18.09.2
>   API version:  1.39 (minimum version 1.12)
>   Go version:   go1.10.6
>   Git commit:   6247962
>   Built:Sun Feb 10 03:42:13 2019
>   OS/Arch:  linux/amd64
>   Experimental: false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on {{debian}} 
> docker image.
> {noformat}
> {
>   "id": "/sleep",
>   "cmd": "sleep 99d",
>   "cpus": 0.1,
>   "mem": 128,
>   "disk": 0,
>   "instances": 1,
>   "constraints": [],
>   "acceptedResourceRoles": [
> "*"
>   ],
>   "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
>   "image": "debian",
>   "network": "HOST",
>   "privileged": false,
>   "parameters": [],
>   "forcePullImage": true
> }
>   },
>   "labels": {},
>   "portDefinitions": []
> }
> {noformat}
> This task runs fine and gets scheduled on the right agent, which is running 
> mesos agent 1.7.3 (using the docker image, {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1  | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching 
> task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7-
> mesos-slave_1  | I1205 13:24:21.392895 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394399 19849 paths.cpp:748] Creating 
> sandbox 
> '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1  | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching 
> executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 
> 4783cf15-4fb1-4c75-90fe-44eeec5258a7- with resources 
> [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
>  in work directory 
> 

[jira] [Created] (MESOS-10014) `tryUntrackFrameworkUnderRole` check failed in `HierarchicalAllocatorProcess::removeFramework`.

2019-10-18 Thread Andrei Budnik (Jira)
Andrei Budnik created MESOS-10014:
-

 Summary: `tryUntrackFrameworkUnderRole` check failed in 
`HierarchicalAllocatorProcess::removeFramework`.
 Key: MESOS-10014
 URL: https://issues.apache.org/jira/browse/MESOS-10014
 Project: Mesos
  Issue Type: Bug
  Components: master, test
Reporter: Andrei Budnik
 Attachments: AgentPendingOperationAfterMasterFailover-badrun.txt

`ContentType/OperationReconciliationTest.AgentPendingOperationAfterMasterFailover/0`
 test failed:
{code:java}
F1018 09:05:14.310616 21391 hierarchical.cpp:745] Check failed: 
tryUntrackFrameworkUnderRole(framework, role)  Framework: 
e6284079-cb6a-4a47-8f9a-ea9b84ff622a- role: default-role
*** Check failure stack trace: ***
@ 0x7f40fff0a1f6  google::LogMessage::Fail()
@ 0x7f40fff0a14f  google::LogMessage::SendToLog()
@ 0x7f40fff09a91  google::LogMessage::Flush()
@ 0x7f40fff0d12f  google::LogMessageFatal::~LogMessageFatal()
@ 0x7f410fd828ac  
mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::removeFramework()
@  0x186b29f  
_ZZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS1_11FrameworkIDES8_EEvRKNS_3PIDIT_EEMSA_FvT0_EOT1_ENKUlOS6_PNS_11ProcessBaseEE_clESJ_SL_
@  0x189c273  
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS3_11FrameworkIDESA_EEvRKNS1_3PIDIT_EEMSC_FvT0_EOT1_EUlOS8_PNS1_11ProcessBaseEE_JS8_SN_EEEDTclcl7forwardISC_Efp_Espcl7forwardIT0_Efp0_EEEOSC_DpOSP_
@  0x18990b7  
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_JS9_St12_PlaceholderILi113invoke_expandISP_St5tupleIJS9_SR_EESU_IJOSO_EEJLm0ELm1DTcl6invokecl7forwardISD_Efp_Espcl6expandcl3getIXT2_EEcl7forwardISH_Efp0_EEcl7forwardISK_Efp2_OSD_OSH_N5cpp1416integer_sequenceImJXspT2_SL_
@  0x1896100  
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS4_11FrameworkIDESB_EEvRKNS2_3PIDIT_EEMSD_FvT0_EOT1_EUlOS9_PNS2_11ProcessBaseEE_IS9_St12_PlaceholderILi1clIISO_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOSX_
@  0x1895174  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS6_11FrameworkIDESD_EEvRKNS4_3PIDIT_EEMSF_FvT0_EOT1_EUlOSB_PNS4_11ProcessBaseEE_ISB_St12_PlaceholderILi1EISQ_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOSV_
@  0x1894b2b  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNS7_11FrameworkIDESE_EEvRKNS5_3PIDIT_EEMSG_FvT0_EOT1_EUlOSC_PNS5_11ProcessBaseEE_JSC_St12_PlaceholderILi1EJSR_EEEvOSG_DpOT0_
@  0x18943bc  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master9allocator21MesosAllocatorProcessERKNSA_11FrameworkIDESH_EEvRKNS1_3PIDIT_EEMSJ_FvT0_EOT1_EUlOSF_S3_E_ISF_St12_PlaceholderILi1EEclEOS3_
@ 0x7f41016deb22  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x7f410169620c  process::ProcessBase::consume()
@ 0x7f41016c0696  
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@  0x1822baa  process::ProcessBase::serve()
@ 0x7f4101692af1  process::ProcessManager::resume()
@ 0x7f410168ed68  
_ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x7f41016b81e2  
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x7f41016b7244  
_ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEclEv
@ 0x7f41016b6088  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
@ 0x7f40fca44590  execute_native_thread_routine
@ 0x7f40ffa77e25  start_thread
@ 0x7f40fa396bad  __clone
@  (nil)  (unknown)
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-6480) Support for docker live-restore option in Mesos

2019-10-02 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-6480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942893#comment-16942893
 ] 

Andrei Budnik commented on MESOS-6480:
--

design doc: 
[https://docs.google.com/document/d/1JeLTr9L31S8eIg-6xpjedIUKvnfNake0kPTzxEwdUdI]

> Support for docker live-restore option in Mesos
> ---
>
> Key: MESOS-6480
> URL: https://issues.apache.org/jira/browse/MESOS-6480
> Project: Mesos
>  Issue Type: Task
>Reporter: Milind Chawre
>Priority: Major
>
> Docker-1.12 supports live-restore option which keeps containers alive during 
> docker daemon downtime https://docs.docker.com/engine/admin/live-restore/
> I tried to use this option in my Mesos setup and observed this:
> 1. On a Mesos worker node, stop the Docker daemon.
> 2. After some time, start the Docker daemon. All the containers running on 
> that node are still visible using "docker ps". This is the expected behaviour 
> of the live-restore option.
> 3. When I check the Mesos and Marathon UIs, they show no active tasks running 
> on that node. The containers which are still running on that node are now 
> scheduled on different Mesos nodes, which is not right, since I can still see 
> the containers in the "docker ps" output because of the live-restore option.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9843) Implement tests for the `containerizer/debug` endpoint.

2019-09-24 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16936734#comment-16936734
 ] 

Andrei Budnik commented on MESOS-9843:
--

{code:java}
commit dee4b849c8179ea46947c8ea4dd031f6eb37b659
Author: Andrei Budnik abud...@apache.org
Date:   Fri Sep 6 17:01:56 2019 +0200
Added `futureTracker` to the `SlaveOptions` in tests.

`PendingFutureTracker` is shared across both Mesos containerizer and
the agent, so we need to add an option to be able to start a slave in
tests with an instance of the `futureTracker` as a parameter.

Review: https://reviews.apache.org/r/71454
{code}

{code:java}
 commit 1122674a5c03894e4552d46cfa26dca0557a8f68
Author: Andrei Budnik 
Date:   Fri Sep 6 13:25:35 2019 +0200

Implemented an integration test for /containerizer/debug endpoint.

This test starts an agent with the MockIsolator to intercept calls to
its `prepare` method, then it launches a task, which gets stuck.
We check that the /containerizer/debug endpoint returns a non-empty
list of pending futures including `MockIsolator::prepare`. After
setting the promise for the `prepare`, the task successfully starts
and we expect the /containerizer/debug endpoint to return an
empty list of pending operations.

Review: https://reviews.apache.org/r/71455
{code}
 

> Implement tests for the `containerizer/debug` endpoint.
> ---
>
> Key: MESOS-9843
> URL: https://issues.apache.org/jira/browse/MESOS-9843
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>
> Implement tests for container stuck issues and check that the agent's 
> `containerizer/debug` endpoint returns a JSON object containing information 
> about pending operations.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9969) Agent crashes when trying to clean up volume

2019-09-17 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16931645#comment-16931645
 ] 

Andrei Budnik commented on MESOS-9969:
--

Could you please provide steps to reproduce this bug?

> Agent crashes when trying to clean up volume
> ---
>
> Key: MESOS-9969
> URL: https://issues.apache.org/jira/browse/MESOS-9969
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.2
>Reporter: Tomas Barton
>Priority: Major
>
> {code}
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081748 21828 
> linux_launcher.cpp:650] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/370ed262-4041-4180-a7e1-9ea78070e3a6'
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.081876 21832 
> containerizer.cpp:2907] Checkpointing termination state to nested container's 
> runtime directory 
> '/var/run/mesos/containers/8e3997e7-c53a-4043-9a7e-26a2e436a041/containers/ae0bdc6d-c738-4352-b5d4-7572182671d5/termination'
> Sep 17 13:49:26 w03 mesos-agent[21803]: mesos-agent: 
> /pkg/src/mesos/3rdparty/stout/include/stout/option.hpp:120: T& 
> Option::get() & [with T = std::basic_string]: Assertion `isSome()' 
> failed.
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** Aborted at 1568728166 (unix time) 
> try "date -d @1568728166" if you are using GNU date ***
> Sep 17 13:49:26 w03 mesos-agent[21803]: W0917 13:49:26.082281 21835 
> disk.cpp:453] Ignoring cleanup for unknown container 
> a9ba6959-ea02-4543-b7d5-92a63940
> Sep 17 13:49:26 w03 mesos-agent[21803]: PC: @ 0x7f16a3867fff (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: *** SIGABRT (@0x552b) received by PID 
> 21803 (TID 0x7f169e47d700) from PID 21803; stack trace: ***
> Sep 17 13:49:26 w03 mesos-agent[21803]: E0917 13:49:26.082608 21835 
> memory.cpp:501] Listening on OOM events failed for container 
> a9ba6959-ea02-4543-b7d5-92a63940: Event listener is terminating
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3be50e0 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3867fff (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a386942a (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860e67 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: I0917 13:49:26.083741 21835 
> linux.cpp:1074] Unmounting volume 
> '/var/lib/mesos/slave/slaves/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-S17/frameworks/04e596b7-f03d-4cba-bbbc-fa9e0aebb5d2-0003/executors/es01__coordinator__8591ac8e-3d9d-45ac-bb68-bee379c8c4a4/runs/a9ba6959-ea02-4543-b7d5-92a63940/container-path'
>  for con
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3860f12 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7654f13 
> _ZNR6OptionISsE3getEv.part.152
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a7666b2f 
> mesos::internal::slave::MesosContainerizerProcess::__destroy()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a861cb41 
> process::ProcessBase::consume()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a8633c9c 
> process::ProcessManager::resume()
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a86398a6 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a43c6200 (unknown)
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a3bdb4a4 start_thread
> Sep 17 13:49:26 w03 mesos-agent[21803]: @ 0x7f16a391dd0f (unknown)
> Sep 17 13:49:26 w03 systemd[1]: dcos-mesos-slave.service: Main process 
> exited, code=killed, status=6/ABRT
> {code}



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface

2019-09-06 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16924139#comment-16924139
 ] 

Andrei Budnik commented on MESOS-9914:
--

 
{code:java}
commit 6c2a94ca0eca90e6d3517e4400f4529ddce0b84c
Author: Andrei Budnik abud...@apache.org
Date:   Mon Sep 2 17:15:52 2019 +0200
Added `SlaveOptions` for wrapping all parameters of `StartSlave`.

This patch introduces a `SlaveOptions` struct which holds optional
parameters accepted by `cluster::Slave::create`. Added an overload
of `StartSlave` that accepts `SlaveOptions`. It's a preferred way of
creating and starting an instance of `cluster::Slave` in tests, since
underlying `cluster::Slave::create` accepts a long list of optional
arguments, which might be extended in the future.

Review: https://reviews.apache.org/r/71424
{code}
 

> Refactor `MesosTest::StartSlave` in favour of builder style interface
> -
>
> Key: MESOS-9914
> URL: https://issues.apache.org/jira/browse/MESOS-9914
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>
> Every overload of the `MesosTest::StartSlave` method depends on the 
> `cluster::Slave::create` method, which accepts several arguments. In fact, 
> each overload of `MesosTest::StartSlave` accepts a subset of the 
> arguments that `cluster::Slave::create` accepts. Given that the latter 
> accepts 11 arguments at the moment, and there are already 13 overloads of 
> `MesosTest::StartSlave`, introducing a builder-style interface is very 
> desirable. It would allow adding more arguments to `cluster::Slave::create` 
> without the need to update all existing overloads. It would be a local 
> change, as it won't affect existing tests.
> See [this 
> comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface

2019-09-02 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16920964#comment-16920964
 ] 

Andrei Budnik commented on MESOS-9914:
--

[https://reviews.apache.org/r/71424/]

> Refactor `MesosTest::StartSlave` in favour of builder style interface
> -
>
> Key: MESOS-9914
> URL: https://issues.apache.org/jira/browse/MESOS-9914
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>
> Every overload of the `MesosTest::StartSlave` method depends on the 
> `cluster::Slave::create` method, which accepts several arguments. In fact, 
> each overload of `MesosTest::StartSlave` accepts a subset of the 
> arguments that `cluster::Slave::create` accepts. Given that the latter 
> accepts 11 arguments at the moment, and there are already 13 overloads of 
> `MesosTest::StartSlave`, introducing a builder-style interface is very 
> desirable. It would allow adding more arguments to `cluster::Slave::create` 
> without the need to update all existing overloads. It would be a local 
> change, as it won't affect existing tests.
> See [this 
> comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Assigned] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface

2019-08-29 Thread Andrei Budnik (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9914:


Assignee: Andrei Budnik

> Refactor `MesosTest::StartSlave` in favour of builder style interface
> -
>
> Key: MESOS-9914
> URL: https://issues.apache.org/jira/browse/MESOS-9914
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>
> Every overload of the `MesosTest::StartSlave` method depends on the 
> `cluster::Slave::create` method, which accepts several arguments. In fact, 
> each overload of `MesosTest::StartSlave` accepts a subset of the 
> arguments that `cluster::Slave::create` accepts. Given that the latter 
> accepts 11 arguments at the moment, and there are already 13 overloads of 
> `MesosTest::StartSlave`, introducing a builder-style interface is very 
> desirable. It would allow adding more arguments to `cluster::Slave::create` 
> without the need to update all existing overloads. It would be a local 
> change, as it won't affect existing tests.
> See [this 
> comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177].



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-26 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16915755#comment-16915755
 ] 

Andrei Budnik commented on MESOS-9887:
--

{code:java}
commit 8aae23ec7cd4bc50532df0b1d1ea6ec23ce078f8
Author: Andrei Budnik abud...@apache.org
Date:   Fri Aug 23 14:36:18 2019 +0200
Added missing `return` statement in `Slave::statusUpdate`.

Previously, if `statusUpdate` was called for a pending task, it would
forward the status update and then continue executing `statusUpdate`,
which then checks if there is an executor that is aware of this task.
Given that a pending task is not known to any executor, it would always
handle it by forwarding the status update one more time. This patch adds
the missing `return` statement, which fixes the issue.

Review: https://reviews.apache.org/r/71361
{code}
{code:java}
commit f0be23765531b05661ed7f1b124faf96744aa80b
Author: Andrei Budnik abud...@apache.org
Date:   Tue Aug 20 19:24:44 2019 +0200
Fixed out-of-order processing of terminal status updates in agent.

Previously, Mesos agent could send TASK_FAILED status update on
executor termination while processing of TASK_FINISHED status update
was in progress. Processing of task status updates involves sending
requests to the containerizer, which might finish processing of these
requests out-of-order, e.g. `MesosContainerizer::status`. Also,
the agent does not overwrite status of the terminal status update once
it's stored in the `terminatedTasks`. Hence, there was a race condition
between two terminal status updates.

Note that V1 Executors are not affected by this problem because they
wait for an acknowledgement of the terminal status update by the agent
before terminating.

This patch introduces a new data structure `pendingStatusUpdates`,
which holds a list of status updates that are being processed. This
data structure allows validating the order of processing of status
updates by the agent.

Review: https://reviews.apache.org/r/71343
{code}

> Race condition between two terminal task status updates for Docker executor.
> 
>
> Key: MESOS-9887
> URL: https://issues.apache.org/jira/browse/MESOS-9887
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: agent, containerization
> Attachments: race_example.txt
>
>
> h2. Overview
> Expected behavior:
>  Task successfully finishes and sends TASK_FINISHED status update.
> Observed behavior:
>  Task successfully finishes, but the agent sends TASK_FAILED with the reason 
> "REASON_EXECUTOR_TERMINATED".
> In normal circumstances, Docker executor 
> [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
>  final status update TASK_FINISHED to the agent, which then [gets 
> processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
>  by the agent before termination of the executor's process.
> However, if the processing of the initial TASK_FINISHED gets delayed, then 
> there is a chance that Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED which will [be 
> handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
>  prior to the TASK_FINISHED status update.
> See attached logs which contain an example of the race condition.
> h2. Reproducing bug
> 1. Add the following code:
> {code:java}
>   static int c = 0;
>   if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
> ::sleep(2);
>   }
> {code}
> to the 
> [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
>  and to the 
> [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile mesos
> 3. Launch mesos master and agent locally
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> #  cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
> --command="ls"
> {code}
> h2. Race condition - description
> 1. Mesos agent receives TASK_FINISHED status update and then subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. `containerizer->status()` operation for TASK_FINISHED status update gets 
> delayed in the 

[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-26 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558
 ] 

Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM:


[https://reviews.apache.org/r/71361/
https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/]


was (Author: abudnik):
https://reviews.apache.org/r/71343/


[jira] [Comment Edited] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-26 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558
 ] 

Andrei Budnik edited comment on MESOS-9887 at 8/26/19 12:22 PM:


[https://reviews.apache.org/r/71361/]
[https://reviews.apache.org/r/71343/]


was (Author: abudnik):
[https://reviews.apache.org/r/71361/
https://reviews.apache.org/r/71343/|https://reviews.apache.org/r/71343/]


[jira] [Commented] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.

2019-08-22 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16913441#comment-16913441
 ] 

Andrei Budnik commented on MESOS-9844:
--

http://mesos.apache.org/documentation/latest/endpoints/slave/containerizer/debug/

> Update documentation describing `containerizer/debug` endpoint.
> ---
>
> Key: MESOS-9844
> URL: https://issues.apache.org/jira/browse/MESOS-9844
> Project: Mesos
>  Issue Type: Documentation
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>




--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-21 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912558#comment-16912558
 ] 

Andrei Budnik commented on MESOS-9887:
--

https://reviews.apache.org/r/71343/


[jira] [Commented] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-21 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16912400#comment-16912400
 ] 

Andrei Budnik commented on MESOS-9887:
--

Discarding these patches ^^ since multiple consecutive requests to the 
underlying containerizer might finish in a different order than they were sent. 
Hence, the agent should not rely on the order of completion of requests sent to 
the containerizer.


[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-20 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16911305#comment-16911305
 ] 

Andrei Budnik commented on MESOS-9836:
--

Shall we deprecate the option to run a custom executor in a Docker container? 
If no one responds to our proposal in dev@ & user@ mailing lists, then we can 
safely deprecate this feature.

> Docker containerizer overwrites `/mesos/slave` cgroups.
> ---
>
> Key: MESOS-9836
> URL: https://issues.apache.org/jira/browse/MESOS-9836
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
> 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
> '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker 
> inspect}} on the container and cached the pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>  The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
> later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
> TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
> apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker 
> containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
> {{memory.soft_limit_in_bytes}} of the container through the cached pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
> {noformat}
> I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 
> 'memory.soft_limit_in_bytes' to 32MB for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> {noformat}
> Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was 
> possibly because that over the 16 minutes the pid got reused:
> {noformat}
> # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
> ...
> I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 
> 'mesos_executors.slice'
> I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 
> 'mesos_executors.slice'
> I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 
> 'mesos_executors.slice'
> ...
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 
> 'mesos_executors.slice'
> I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 
> 'mesos_executors.slice'
> I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 
> 'mesos_executors.slice'
> I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 
> 'mesos_executors.slice'
> I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 
> 'mesos_executors.slice'
> I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 
> 'mesos_executors.slice'
> ...
> I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 
> 'mesos_executors.slice'
> I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 
> 'mesos_executors.slice'
> I0523 06:09:35.073971 21817 systemd.cpp:98] Assigned child process '13754' to 
> 'mesos_executors.slice'
> ...
> {noformat}
> It was highly likely that the container itself exited around 06:09:35, way 
> before the docker executor detected and reported the terminal status update, 
> and then its pid was reused by 

[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-08-15 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908250#comment-16908250
 ] 

Andrei Budnik commented on MESOS-9836:
--

{quote}
So what is the purpose of Docker containerizer's update method?
{quote}

As Mesos provides an option to run a Docker image as a (custom) executor, it 
might make sense to update the Docker container's resources (the executor plus 
the tasks running in the Docker container) in cgroups. If this is the case, 
should we deprecate such an option? Ignoring `update` for the Docker c'zer 
sounds like a good idea.


[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistant ( marathon app )

2019-08-15 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16908065#comment-16908065
 ] 

Andrei Budnik commented on MESOS-9936:
--

How can this issue be reproduced? Could you please share an app definition or 
provide steps to reproduce?

Also, there should be more log lines between "Recovering provisioner" and 
"Finished recovering all containerizers", at least "Provisioner recovery 
complete". Is there anything else between these two log lines?

> Slave recovery is very slow with high local volume persistent (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.8.1
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some local persistent applications.
> After an unplanned shutdown of nodes running this kind of application, I 
> see that the recovery process of Mesos takes a lot of time (more than 8 
> hours).
> This time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 
> docker.cpp:890] Recovering Docker containers Jul 08 07:40:44 boss1 
> mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] 
> Recovering Mesos containers 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 
> linux_launcher.cpp:286] Recovering Linux launcher 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 
> containerizer.cpp:1127] Recovering isolators 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 
> containerizer.cpp:1166] Recovering provisioner 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 
> composing.cpp:339] Finished recovering all containerizers 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 
> status_update_manager_process.hpp:314] Recovering operation status update 
> manager 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 
> slave.cpp:7729] Recovering executors
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (MESOS-9936) Slave recovery is very slow with high local volume persistent (Marathon app)

2019-08-13 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16906165#comment-16906165
 ] 

Andrei Budnik commented on MESOS-9936:
--

[~Fcomte]
What version of Mesos are you using?

> Slave recovery is very slow with high local volume persistent (Marathon app)
> --
>
> Key: MESOS-9936
> URL: https://issues.apache.org/jira/browse/MESOS-9936
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: Frédéric Comte
>Priority: Major
>
> I run some local persistent applications.
> After an unplanned shutdown of nodes running this kind of application, I 
> see that the recovery process of Mesos takes a lot of time (more than 8 
> hours).
> This time depends on the amount of data in those volumes.
> What does Mesos do during this process?
> {code:java}
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.771447 13370 
> docker.cpp:890] Recovering Docker containers Jul 08 07:40:44 boss1 
> mesos-agent[13345]: I0708 07:40:44.783957 13375 containerizer.cpp:801] 
> Recovering Mesos containers 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.799252 13373 
> linux_launcher.cpp:286] Recovering Linux launcher 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.810429 13375 
> containerizer.cpp:1127] Recovering isolators 
> Jul 08 07:40:44 boss1 mesos-agent[13345]: I0708 07:40:44.817328 13389 
> containerizer.cpp:1166] Recovering provisioner 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.928683 13373 
> composing.cpp:339] Finished recovering all containerizers 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.950503 13354 
> status_update_manager_process.hpp:314] Recovering operation status update 
> manager 
> Jul 08 14:42:10 boss1 mesos-agent[13345]: I0708 14:42:10.957418 13399 
> slave.cpp:7729] Recovering executors
> {code}





[jira] [Assigned] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-08-08 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9887?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9887:


Assignee: Andrei Budnik

> Race condition between two terminal task status updates for Docker executor.
> 
>
> Key: MESOS-9887
> URL: https://issues.apache.org/jira/browse/MESOS-9887
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: agent, containerization
> Attachments: race_example.txt
>
>
> h2. Overview
> Expected behavior:
>  Task successfully finishes and sends TASK_FINISHED status update.
> Observed behavior:
>  Task successfully finishes, but the agent sends TASK_FAILED with the reason 
> "REASON_EXECUTOR_TERMINATED".
> In normal circumstances, the Docker executor 
> [sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
>  the final TASK_FINISHED status update to the agent, which then [gets 
> processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
>  by the agent before the executor's process terminates.
> However, if the processing of the initial TASK_FINISHED is delayed, then 
> there is a chance that the Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED, which will [be 
> handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
>  prior to the TASK_FINISHED status update.
> See attached logs which contain an example of the race condition.
> h2. Reproducing bug
> 1. Add the following code:
> {code:java}
>   static int c = 0;
>   if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
> ::sleep(2);
>   }
> {code}
> to the 
> [`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
>  and to the 
> [`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].
> 2. Recompile mesos
> 3. Launch mesos master and agent locally
> 4. Launch a simple Docker task via `mesos-execute`:
> {code}
> #  cd build
> ./src/mesos-execute --master="`hostname`:5050" --name="a" 
> --containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
> --command="ls"
> {code}
> h2. Race condition - description
> 1. Mesos agent receives TASK_FINISHED status update and then subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].
> 2. The `containerizer->status()` operation for the TASK_FINISHED status update 
> gets delayed in the composing containerizer (e.g. due to a switch of the 
> worker thread that executes the `status` method).
> 3. Docker executor terminates and the agent 
> [triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
>  TASK_FAILED.
> 4. Docker containerizer destroys the container. A registered callback for the 
> `containerizer->wait` call in the composing containerizer dispatches [lambda 
> function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373]
>  that will clean up `containers_` map.
> 5. Composing c'zer resumes and dispatches 
> `[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]`
>  method to the Docker containerizer for TASK_FINISHED, which in turn hangs 
> for a few seconds.
> 6. Corresponding `containerId` gets removed from the `containers_` map of the 
> composing c'zer.
> 7. Mesos agent subscribes on 
> [`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]
>  for the TASK_FAILED status update.
> 8. Composing c'zer returns ["Container not 
> found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576]
>  for TASK_FAILED.
> 9. 
> `[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]`
>  stores TASK_FAILED terminal status update in the executor's data structure.
> 10. Docker containerizer resumes and finishes processing of the `status()` 
> method for TASK_FINISHED. Finally, it returns control to the 
> `Slave::_statusUpdate` continuation. This method 
> [discovers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5808-L5814]
>  that the executor has already been destroyed.

[jira] [Created] (MESOS-9926) Assertion failed in Master for `Slave::apply` while running `UnreserveVolumeResources` test.

2019-08-06 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9926:


 Summary: Assertion failed in Master for `Slave::apply` while 
running `UnreserveVolumeResources` test.
 Key: MESOS-9926
 URL: https://issues.apache.org/jira/browse/MESOS-9926
 Project: Mesos
  Issue Type: Bug
  Components: master, test
 Environment: Failed command: ['bash', '-c', "set -o pipefail; export 
OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh 2>&1 | tee 
build_71197"]
Reporter: Andrei Budnik
 Attachments: UnreserveVolumeResources-badrun.txt

`PersistentVolumeEndpointsTest.UnreserveVolumeResources` test failed:
{code:java}
F0806 02:52:55.479373 18920 master.cpp:13789] CHECK_SOME(resources): 
ports:[31000-32000]; cpus:24; mem:95641; disk(reservations: 
[(DYNAMIC,role1,test-principal)]):960; disk(reservations: 
[(DYNAMIC,role1,test-principal)])[id1:path1]:64 does not contain 
disk(reservations: [(DYNAMIC,role1,test-principal)]):1024 
*** Check failure stack trace: ***
@ 0x2b2180332cf6  google::LogMessage::Fail()
@ 0x2b2180332c3e  google::LogMessage::SendToLog()
@ 0x2b21803325e8  google::LogMessage::Flush()
@ 0x2b2180335a12  google::LogMessageFatal::~LogMessageFatal()
@ 0x56408e20bafc  _CheckFatal::~_CheckFatal()
@ 0x2b217dc362b7  mesos::internal::master::Slave::apply()
@ 0x2b217dc2c197  mesos::internal::master::Master::_apply()
@ 0x2b217dcaa5ab  
_ZZN7process8dispatchIN5mesos8internal6master6MasterEPNS3_5SlaveEPNS3_9FrameworkERKNS1_15Offer_OperationES6_S8_SB_EEvRKNS_3PIDIT_EEMSD_FvT0_T1_T2_EOT3_OT4_OT5_ENKUlOS6_OS8_OS9_PNS_11ProcessBaseEE_clESS_ST_SU_SW_
@ 0x2b217dd556c5  
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal6master6MasterEPNS5_5SlaveEPNS5_9FrameworkERKNS3_15Offer_OperationES8_SA_SD_EEvRKNS1_3PIDIT_EEMSF_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS8_OSA_OSB_PNS1_11ProcessBaseEE_JS8_SA_SB_SY_EEEDTclcl7forwardISF_Efp_Espcl7forwardIT0_Efp0_EEEOSF_DpOS10_
@ 0x2b217dd4e482  
_ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS6_5SlaveEPNS6_9FrameworkERKNS4_15Offer_OperationES9_SB_SE_EEvRKNS2_3PIDIT_EEMSG_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS9_OSB_OSC_PNS2_11ProcessBaseEE_JS9_SB_SC_St12_PlaceholderILi113invoke_expandIS10_St5tupleIJS9_SB_SC_S12_EES15_IJOSZ_EEJLm0ELm1ELm2ELm3DTcl6invokecl7forwardIT_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIT0_Efp0_EEcl7forwardIT1_Efp2_OS19_OS1A_N5cpp1416integer_sequenceImJXspT2_OS1B_
@ 0x2b217dd49853  
_ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS6_5SlaveEPNS6_9FrameworkERKNS4_15Offer_OperationES9_SB_SE_EEvRKNS2_3PIDIT_EEMSG_FvT0_T1_T2_EOT3_OT4_OT5_EUlOS9_OSB_OSC_PNS2_11ProcessBaseEE_IS9_SB_SC_St12_PlaceholderILi1clIISZ_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImILm0ELm1ELm2ELm3_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS18_
I0806 02:52:55.928766 18910 status_update_manager_process.hpp:528] Forwarding 
operation status update OPERATION_FINISHED (Status UUID: 
679c9f27-3130-4188-8c9a-07eccc25ae78) for operation UUID 
0b856527-bcaa-4595-aeab-47505dff5aa6 on agent 
ba6f270f-d8c7-4b59-b5ce-6b497fe89d7c-S0
@ 0x2b217dd46ac5  
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS8_5SlaveEPNS8_9FrameworkERKNS6_15Offer_OperationESB_SD_SG_EEvRKNS4_3PIDIT_EEMSI_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSB_OSD_OSE_PNS4_11ProcessBaseEE_ISB_SD_SE_St12_PlaceholderILi1EIS11_EEEDTclcl7forwardISI_Efp_Espcl7forwardIT0_Efp0_EEEOSI_DpOS16_
@ 0x2b217dd43fc1  
_ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal6master6MasterEPNS9_5SlaveEPNS9_9FrameworkERKNS7_15Offer_OperationESC_SE_SH_EEvRKNS5_3PIDIT_EEMSJ_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSC_OSE_OSF_PNS5_11ProcessBaseEE_JSC_SE_SF_St12_PlaceholderILi1EJS12_EEEvOSJ_DpOT0_
@ 0x2b217dd4144d  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal6master6MasterEPNSC_5SlaveEPNSC_9FrameworkERKNSA_15Offer_OperationESF_SH_SK_EEvRKNS1_3PIDIT_EEMSM_FvT0_T1_T2_EOT3_OT4_OT5_EUlOSF_OSH_OSI_S3_E_JSF_SH_SI_St12_PlaceholderILi1EEclEOS3_
@ 0x2b218024eb51  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEEclES3_
@ 0x2b2180216927  process::ProcessBase::consume()
@ 0x2b218023c5d2  
_ZNO7process13DispatchEvent7consumeEPNS_13EventConsumerE
@ 0x56408e20c7e8  process::ProcessBase::serve()
@ 0x2b2180213539  process::ProcessManager::resume()
@ 0x2b218020f886  
_ZZN7process14ProcessManager12init_threadsEvENKUlvE_clEv
@ 0x2b2180237086  

[jira] [Created] (MESOS-9914) Refactor `MesosTest::StartSlave` in favour of builder style interface

2019-07-30 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9914:


 Summary: Refactor `MesosTest::StartSlave` in favour of builder 
style interface
 Key: MESOS-9914
 URL: https://issues.apache.org/jira/browse/MESOS-9914
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Andrei Budnik


Every overload of the `MesosTest::StartSlave` method depends on the 
`cluster::Slave::create` method, which accepts several arguments. In fact, each 
overload of `MesosTest::StartSlave` accepts a subset of the arguments that 
`cluster::Slave::create` accepts. Given that the latter currently accepts 11 
arguments, and there are already 13 overloads of `MesosTest::StartSlave`, 
introducing a builder-style interface is very desirable. It would allow adding 
more arguments to `cluster::Slave::create` without having to update all 
existing overloads, and it would be a local change since it won't affect 
existing tests.

See [this 
comment|https://github.com/apache/mesos/blob/00bb0b6d6abe7700a5adab0bdaf7e91767a2db19/src/tests/mesos.hpp#L160-L177].
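As an illustration, a builder-style interface along these lines would let each test set only the arguments it needs. Below is a minimal Python sketch with hypothetical names; the real interface would be C++ in `src/tests/mesos.hpp`:

```python
# Hypothetical sketch of a builder-style StartSlave interface. The real
# cluster::Slave::create is C++ and takes many more arguments; the point is
# that unspecified arguments keep their defaults, so adding a new argument
# does not require touching existing call sites.
class SlaveBuilder:
    def __init__(self, detector):
        # Only the truly mandatory argument is a constructor parameter.
        self._args = {"detector": detector}

    def with_flags(self, flags):
        self._args["flags"] = flags
        return self  # chainable

    def with_containerizer(self, containerizer):
        self._args["containerizer"] = containerizer
        return self

    def start(self):
        # Stands in for cluster::Slave::create(...).
        return dict(self._args)


slave = SlaveBuilder("master-detector").with_flags({"work_dir": "/tmp"}).start()
print(slave)
```

Each `with_*` setter returns the builder itself, so call sites stay readable even as the underlying `create` grows new parameters.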





[jira] [Commented] (MESOS-9836) Docker containerizer overwrites `/mesos/slave` cgroups.

2019-07-25 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16892844#comment-16892844
 ] 

Andrei Budnik commented on MESOS-9836:
--

A typical cgroup for Docker containers looks like:
{code:java}
/system.slice/docker-3a91c29381522918a2f2cad05583b172f415da4010bad672c21a19356aec1d69.scope
{code}
Probably we should leave out all cgroups that do not contain the "docker" 
substring, instead of (or in addition to) filtering [the system root 
cgroup|https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1783-L1788].
 This is ugly and hacky, and it introduces a dependency on Docker's runtime.
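The suggested filtering can be sketched as follows (a Python sketch; the `is_docker_cgroup` helper is illustrative, not a Mesos function):

```python
# Hypothetical sketch of the proposed filtering: keep only cgroups that look
# like Docker-managed scopes, rather than only excluding the system root.
def is_docker_cgroup(cgroup):
    # Docker container scopes look like
    # /system.slice/docker-<container-id>.scope
    return "docker" in cgroup


cgroups = [
    "/system.slice/docker-3a91c29381522918a2f2cad05583b172f415da4010bad672c21a19356aec1d69.scope",
    "/mesos/slave",            # the agent's own cgroup: must never be updated
    "/mesos_executors.slice",
]

# Only the Docker scope survives; a reused pid that landed in /mesos/slave
# would no longer cause the agent cgroup to be rewritten.
docker_only = [c for c in cgroups if is_docker_cgroup(c)]
print(docker_only)
```

This trades correctness of the update path for a substring match on Docker's cgroup naming, which is exactly the runtime dependency the comment warns about.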

> Docker containerizer overwrites `/mesos/slave` cgroups.
> ---
>
> Key: MESOS-9836
> URL: https://issues.apache.org/jira/browse/MESOS-9836
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Chun-Hung Hsiao
>Priority: Critical
>  Labels: docker, mesosphere
>
> The following bug was observed on our internal testing cluster.
> The docker containerizer launched a container on an agent:
> {noformat}
> I0523 06:00:53.888579 21815 docker.cpp:1195] Starting container 
> 'f69c8a8c-eba4-4494-a305-0956a44a6ad2' for task 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1' (and executor 
> 'apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1') of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011
> I0523 06:00:54.524171 21815 docker.cpp:783] Checkpointing pid 13716 to 
> '/var/lib/mesos/slave/meta/slaves/60c42ab7-eb1a-4cec-b03d-ea06bff00c3f-S2/frameworks/415284b7-2967-407d-b66f-f445e93f064e-0011/executors/apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1/runs/f69c8a8c-eba4-4494-a305-0956a44a6ad2/pids/forked.pid'
> {noformat}
> After the container was launched, the docker containerizer did a {{docker 
> inspect}} on the container and cached the pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1764]
>  The pid should be slightly greater than 13716.
> The docker executor sent a {{TASK_FINISHED}} status update around 16 minutes 
> later:
> {noformat}
> I0523 06:16:17.287595 21809 slave.cpp:5566] Handling status update 
> TASK_FINISHED (Status UUID: 4e00b786-b773-46cd-8327-c7deb08f1de9) for task 
> apps_docker-sleep-app.1fda5b8e-7d20-11e9-9717-7aa030269ee1 of framework 
> 415284b7-2967-407d-b66f-f445e93f064e-0011 from executor(1)@172.31.1.7:36244
> {noformat}
> After receiving the terminal status update, the agent asked the docker 
> containerizer to update {{cpu.cfs_period_us}}, {{cpu.cfs_quota_us}} and 
> {{memory.soft_limit_in_bytes}} of the container through the cached pid:
>  
> [https://github.com/apache/mesos/blob/0c431dd60ae39138cc7e8b099d41ad794c02c9a9/src/slave/containerizer/docker.cpp#L1696]
> {noformat}
> I0523 06:16:17.290447 21815 docker.cpp:1868] Updated 'cpu.shares' to 102 at 
> /sys/fs/cgroup/cpu,cpuacct/mesos/slave for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.290660 21815 docker.cpp:1895] Updated 'cpu.cfs_period_us' to 
> 100ms and 'cpu.cfs_quota_us' to 10ms (cpus 0.1) for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> I0523 06:16:17.889816 21815 docker.cpp:1937] Updated 
> 'memory.soft_limit_in_bytes' to 32MB for container 
> f69c8a8c-eba4-4494-a305-0956a44a6ad2
> {noformat}
> Note that the cgroup of {{cpu.shares}} was {{/mesos/slave}}. This was 
> possibly because that over the 16 minutes the pid got reused:
> {noformat}
> # zgrep 'systemd.cpp:98\]' /var/log/mesos/archive/mesos-agent.log.12.gz
> ...
> I0523 06:00:54.525178 21815 systemd.cpp:98] Assigned child process '13716' to 
> 'mesos_executors.slice'
> I0523 06:00:55.078546 21808 systemd.cpp:98] Assigned child process '13798' to 
> 'mesos_executors.slice'
> I0523 06:00:55.134096 21808 systemd.cpp:98] Assigned child process '13799' to 
> 'mesos_executors.slice'
> ...
> I0523 06:06:30.997439 21808 systemd.cpp:98] Assigned child process '32689' to 
> 'mesos_executors.slice'
> I0523 06:06:31.050976 21808 systemd.cpp:98] Assigned child process '32690' to 
> 'mesos_executors.slice'
> I0523 06:06:31.110514 21815 systemd.cpp:98] Assigned child process '32692' to 
> 'mesos_executors.slice'
> I0523 06:06:33.143726 21818 systemd.cpp:98] Assigned child process '446' to 
> 'mesos_executors.slice'
> I0523 06:06:33.196251 21818 systemd.cpp:98] Assigned child process '447' to 
> 'mesos_executors.slice'
> I0523 06:06:33.266332 21816 systemd.cpp:98] Assigned child process '449' to 
> 'mesos_executors.slice'
> ...
> I0523 06:09:34.870056 21808 systemd.cpp:98] Assigned child process '13717' to 
> 'mesos_executors.slice'
> I0523 06:09:34.937762 21813 systemd.cpp:98] Assigned child process '13744' to 
> 'mesos_executors.slice'
> I0523 

[jira] [Created] (MESOS-9887) Race condition between two terminal task status updates for Docker executor.

2019-07-10 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9887:


 Summary: Race condition between two terminal task status updates 
for Docker executor.
 Key: MESOS-9887
 URL: https://issues.apache.org/jira/browse/MESOS-9887
 Project: Mesos
  Issue Type: Bug
  Components: agent, containerization
Reporter: Andrei Budnik
 Attachments: race_example.txt

h2. Overview

Expected behavior:
 Task successfully finishes and sends TASK_FINISHED status update.

Observed behavior:
 Task successfully finishes, but the agent sends TASK_FAILED with the reason 
"REASON_EXECUTOR_TERMINATED".

In normal circumstances, the Docker executor 
[sends|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/docker/executor.cpp#L758]
 the final TASK_FINISHED status update to the agent, which then [gets 
processed|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5543]
 by the agent before the executor's process terminates.

However, if the processing of the initial TASK_FINISHED is delayed, then 
there is a chance that the Docker executor terminates and the agent 
[triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
 TASK_FAILED, which will [be 
handled|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5816-L5826]
 prior to the TASK_FINISHED status update.

See attached logs which contain an example of the race condition.
h2. Reproducing bug

1. Add the following code:
{code:java}
  static int c = 0;
  if (++c == 3) { // to skip TASK_STARTING and TASK_RUNNING status updates.
::sleep(2);
  }
{code}
to the 
[`ComposingContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L578]
 and to the 
[`DockerContainerizerProcess::status`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/docker.cpp#L2167].

2. Recompile mesos

3. Launch mesos master and agent locally

4. Launch a simple Docker task via `mesos-execute`:
{code}
#  cd build
./src/mesos-execute --master="`hostname`:5050" --name="a" 
--containerizer=docker --docker_image=alpine --resources="cpus:1;mem:32" 
--command="ls"
{code}
h2. Race condition - description

1. Mesos agent receives TASK_FINISHED status update and then subscribes on 
[`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761].

2. The `containerizer->status()` operation for the TASK_FINISHED status update 
gets delayed in the composing containerizer (e.g. due to a switch of the worker 
thread that executes the `status` method).

3. Docker executor terminates and the agent 
[triggers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L6662]
 TASK_FAILED.

4. Docker containerizer destroys the container. A registered callback for the 
`containerizer->wait` call in the composing containerizer dispatches [lambda 
function|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L368-L373]
 that will clean up `containers_` map.

5. Composing c'zer resumes and dispatches 
`[status()|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L579]`
 method to the Docker containerizer for TASK_FINISHED, which in turn hangs for 
a few seconds.

6. Corresponding `containerId` gets removed from the `containers_` map of the 
composing c'zer.

7. Mesos agent subscribes on 
[`containerizer->status()`|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5754-L5761]
 for the TASK_FAILED status update.

8. Composing c'zer returns ["Container not 
found"|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/containerizer/composing.cpp#L576]
 for TASK_FAILED.

9. 
`[Slave::_statusUpdate|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5826]`
 stores TASK_FAILED terminal status update in the executor's data structure.

10. Docker containerizer resumes and finishes processing of the `status()` 
method for TASK_FINISHED. Finally, it returns control to the 
`Slave::_statusUpdate` continuation. This method 
[discovers|https://github.com/apache/mesos/blob/0026ea46dc35cbba1f442b8e425c6cbaf81ee8f8/src/slave/slave.cpp#L5808-L5814]
 that the executor has already been destroyed.
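The essence of steps 1-10 is that the first terminal update to *finish* processing wins, regardless of which one was triggered first. A minimal Python sketch, where threads stand in for libprocess worker threads and all names are illustrative, not Mesos code:

```python
import threading
import time

# The first terminal update whose processing completes "wins", even if it was
# triggered later. Names are illustrative, not Mesos APIs.
terminal_status = None  # what the agent ends up reporting
lock = threading.Lock()


def handle_update(status, processing_delay):
    # Simulates Slave::_statusUpdate: the containerizer->status() step may be
    # delayed arbitrarily by worker-thread scheduling.
    global terminal_status
    time.sleep(processing_delay)
    with lock:
        if terminal_status is None:  # only the first terminal update sticks
            terminal_status = status


# TASK_FINISHED arrives first but its status() call hangs (step 5);
# TASK_FAILED is triggered later but completes immediately (steps 3 and 9).
finished = threading.Thread(target=handle_update, args=("TASK_FINISHED", 0.2))
failed = threading.Thread(target=handle_update, args=("TASK_FAILED", 0.0))
finished.start()
failed.start()
finished.join()
failed.join()

print(terminal_status)  # TASK_FAILED despite the task finishing successfully
```

The sketch makes the delay deterministic; in the real race it is an unlucky worker-thread switch inside the composing containerizer.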





[jira] [Created] (MESOS-9844) Update documentation describing `containerizer/debug` endpoint.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9844:


 Summary: Update documentation describing `containerizer/debug` 
endpoint.
 Key: MESOS-9844
 URL: https://issues.apache.org/jira/browse/MESOS-9844
 Project: Mesos
  Issue Type: Documentation
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik








[jira] [Created] (MESOS-9843) Implement tests for the `containerizer/debug` endpoint.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9843:


 Summary: Implement tests for the `containerizer/debug` endpoint.
 Key: MESOS-9843
 URL: https://issues.apache.org/jira/browse/MESOS-9843
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik


Implement tests for container-stuck issues and check that the agent's 
`containerizer/debug` endpoint returns a JSON object containing information 
about pending operations.
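A test along these lines might assert on the response body. The JSON shape below is purely hypothetical; the actual schema is defined by the endpoint implementation:

```python
import json

# Purely hypothetical response shape for the `containerizer/debug` endpoint;
# a real test would fetch this over HTTP from the agent.
response_body = json.dumps({
    "pending": [
        {"operation": "isolators::prepare", "container_id": "abc123"},
    ]
})

# The test decodes the body and checks that pending operations are reported.
debug = json.loads(response_body)
pending = debug["pending"]
print(len(pending), pending[0]["operation"])
```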





[jira] [Created] (MESOS-9842) Implement tests for the `FutureTracker` class and for its helper functions.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9842:


 Summary: Implement tests for the `FutureTracker` class and for its 
helper functions.
 Key: MESOS-9842
 URL: https://issues.apache.org/jira/browse/MESOS-9842
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik








[jira] [Created] (MESOS-9841) Integrate `IsolatorTracker` and `LinuxLauncher` with Mesos containerizer.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9841:


 Summary: Integrate `IsolatorTracker` and `LinuxLauncher` with 
Mesos containerizer.
 Key: MESOS-9841
 URL: https://issues.apache.org/jira/browse/MESOS-9841
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik








[jira] [Created] (MESOS-9840) Implement `LauncherTracker` class.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9840:


 Summary: Implement `LauncherTracker` class.
 Key: MESOS-9840
 URL: https://issues.apache.org/jira/browse/MESOS-9840
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik
Assignee: Andrei Budnik








[jira] [Assigned] (MESOS-9839) Implement `IsolatorTracker` class.

2019-06-12 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9839?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9839:


   Assignee: Andrei Budnik
 Labels: containerization  (was: )
Component/s: containerization
 Issue Type: Task  (was: Bug)

> Implement `IsolatorTracker` class.
> --
>
> Key: MESOS-9839
> URL: https://issues.apache.org/jira/browse/MESOS-9839
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>






[jira] [Created] (MESOS-9839) Implement `IsolatorTracker` class.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9839:


 Summary: Implement `IsolatorTracker` class.
 Key: MESOS-9839
 URL: https://issues.apache.org/jira/browse/MESOS-9839
 Project: Mesos
  Issue Type: Bug
Reporter: Andrei Budnik








[jira] [Created] (MESOS-9838) Leaked HTTP input connection between agent and IOSwitchboard when launched with TTY enabled.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9838:


 Summary: Leaked HTTP input connection between agent and 
IOSwitchboard when launched with TTY enabled.
 Key: MESOS-9838
 URL: https://issues.apache.org/jira/browse/MESOS-9838
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: Andrei Budnik


Steps to reproduce:
 1) Launch a TTY container.
 2) Send an `ATTACH_CONTAINER_INPUT` request to the agent via an HTTP 
connection.
 3) Close the TCP socket used to send `ATTACH_CONTAINER_INPUT`.
 4) Send another `ATTACH_CONTAINER_INPUT` request to the agent - it returns a 
`409 Conflict` HTTP error.

For each incoming `ATTACH_CONTAINER_INPUT` request, the agent creates an HTTP 
connection to the IOSwitchboard via a unix socket. This connection is used to 
retransmit client requests to the IOSwitchboard. The IOSwitchboard closes this 
connection automatically once the client closes its HTTP connection to the 
agent; for more details, see the HTTP handlers in [the 
agent|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/http.cpp#L3105-L3116]
 and in the 
[IOSwitchboard|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/containerizer/mesos/io/switchboard.cpp#L1665-L1758].
 The IOSwitchboard does not allow [multiple input 
connections|https://github.com/apache/mesos/blob/1961e41a61def2b7baca7563c0b7e1855880b55c/src/slave/containerizer/mesos/io/switchboard.cpp#L1654-L1657].
Currently, the IOSwitchboard does not close the HTTP connection for 
`ATTACH_CONTAINER_INPUT` in the case described above. Hence, the IOSwitchboard 
returns an error for subsequent attempts to attach to the container input. The 
root cause needs to be found.
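The single-input-connection rule and the observed leak can be sketched as follows (Python; `SwitchboardInput` and its methods are hypothetical stand-ins, not the IOSwitchboard API):

```python
# Hypothetical sketch of the IOSwitchboard's single-input-connection rule:
# a second ATTACH_CONTAINER_INPUT is rejected with 409 Conflict until the
# first connection is released. If the release is never observed (the bug
# described above), every later attach fails.
class SwitchboardInput:
    def __init__(self):
        self._attached = False

    def attach(self):
        if self._attached:
            return 409  # Conflict: an input connection is already active
        self._attached = True
        return 200

    def on_client_disconnect(self):
        # Should run whenever the client closes its HTTP connection to the
        # agent; the bug is that in the TTY case it sometimes never runs.
        self._attached = False


sb = SwitchboardInput()
codes = [sb.attach(), sb.attach()]  # the second attach conflicts
sb.on_client_disconnect()
codes.append(sb.attach())           # succeeds once the slot is released
print(codes)  # [200, 409, 200]
```

In the buggy scenario, `on_client_disconnect` is effectively skipped, so the sequence stays stuck at `409` for every retry.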





[jira] [Assigned] (MESOS-9837) Implement `FutureTracker` class along with helper functions.

2019-06-12 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9837:


Assignee: Andrei Budnik

> Implement `FutureTracker` class along with helper functions.
> 
>
> Key: MESOS-9837
> URL: https://issues.apache.org/jira/browse/MESOS-9837
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Andrei Budnik
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>
> Both the `track()` and `pending_futures()` helper functions depend on the 
> `FutureTracker` actor.
> The `FutureTracker` actor must be available globally, and there must be only 
> one instance of this actor.





[jira] [Created] (MESOS-9837) Implement `FutureTracker` class along with helper functions.

2019-06-12 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9837:


 Summary: Implement `FutureTracker` class along with helper 
functions.
 Key: MESOS-9837
 URL: https://issues.apache.org/jira/browse/MESOS-9837
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Andrei Budnik


Both the `track()` and `pending_futures()` helper functions depend on the 
`FutureTracker` actor.
The `FutureTracker` actor must be available globally, and there must be only 
one instance of this actor.
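The intended shape can be sketched in Python (a hypothetical stand-in; the real `FutureTracker` is a C++ libprocess actor, and these method names are illustrative):

```python
import threading

# Hypothetical sketch of a process-global FutureTracker: track() registers a
# pending operation, pending_futures() lists what has not completed yet.
class FutureTracker:
    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._pending = {}
        self._lock = threading.Lock()
        self._next_id = 0

    @classmethod
    def instance(cls):
        # Single, globally available instance, as the ticket requires.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def track(self, name):
        with self._lock:
            self._next_id += 1
            self._pending[self._next_id] = name
            return self._next_id

    def complete(self, future_id):
        with self._lock:
            self._pending.pop(future_id, None)

    def pending_futures(self):
        with self._lock:
            return sorted(self._pending.values())


tracker = FutureTracker.instance()
fid = tracker.track("isolator::prepare")
tracker.track("launcher::destroy")
tracker.complete(fid)
print(tracker.pending_futures())  # ['launcher::destroy']
```

Operations that never complete stay visible in `pending_futures()`, which is what would let a debug endpoint surface container-stuck issues.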





[jira] [Assigned] (MESOS-9756) Introduce a container debug endpoint.

2019-06-12 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9756?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9756:


Assignee: Andrei Budnik

> Introduce a container debug endpoint.
> -
>
> Key: MESOS-9756
> URL: https://issues.apache.org/jira/browse/MESOS-9756
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>






[jira] [Deleted] (MESOS-9830) Implement the container debug endpoint on slave/http.cpp

2019-06-12 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9830?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik deleted MESOS-9830:
-


> Implement the container debug endpoint on slave/http.cpp
> 
>
> Key: MESOS-9830
> URL: https://issues.apache.org/jira/browse/MESOS-9830
> Project: Mesos
>  Issue Type: Task
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: containerization
>






[jira] [Commented] (MESOS-9800) libarchive cannot extract tarfile due to UTF-8 encoding issues

2019-05-28 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849924#comment-16849924
 ] 

Andrei Budnik commented on MESOS-9800:
--

Thanks for filing a detailed ticket!
I hope [~kaysoky] can help you with this issue.
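Some background on the error quoted below: libarchive converts entry and link names using the process locale, and a daemon started without calling `setlocale()` runs in the "C" locale, whose codeset cannot represent UTF-8 names. The sketch below shows only that locale dependence (the locale name `C.UTF-8` is an assumption and must exist on the host); it does not use libarchive itself.

```cpp
#include <clocale>
#include <string>
#include <langinfo.h>

// Returns the character codeset implied by the current process locale --
// the property libarchive consults when converting entry and link names.
// In the default "C" locale this codeset cannot represent UTF-8 symlink
// names, which is what yields "Linkname can't be converted from UTF-8 to
// current locale".
inline std::string currentCodeset() {
  return nl_langinfo(CODESET);
}

// The usual mitigation: adopt a UTF-8 locale before touching the archive.
// The locale name "C.UTF-8" is an assumption; it must exist on the host.
inline bool adoptUtf8Locale() {
  return std::setlocale(LC_ALL, "C.UTF-8") != nullptr;
}
```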

> libarchive cannot extract tarfile due to UTF-8 encoding issues
> --
>
> Key: MESOS-9800
> URL: https://issues.apache.org/jira/browse/MESOS-9800
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.7.2
> Environment: Mesos 1.7.2 and Marathon 1.4.3 running on top of Ubuntu 
> 16.04.
>Reporter: Felipe Alfaro Solana
>Priority: Major
>
> Starting with Mesos 1.7, the following change has been introduced:
>  * [MESOS-8064] - Mesos now requires libarchive to programmatically decode 
> .zip, .tar, .gzip, and other common file compression schemes. Version 3.3.2 
> is bundled in Mesos.
> However, this version of libarchive, which is used by the fetcher component in 
> Mesos, has problems dealing with archive files (.tar and .zip) that contain 
> UTF-8 characters. We run Marathon on top of Mesos, and one of our Marathon 
> applications relies on a .tar file containing symlinks whose targets contain 
> certain UTF-8 (Turkish) characters, or whose names themselves contain UTF-8 
> characters. The Mesos fetcher is unable to extract the archive and fails with 
> the following error:
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.791250  6136 fetcher.cpp:613] EXIT with status 1: Failed to fetch 
> '/tmp/certificates.tar.gz': Failed to extract archive 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0/certificates.tar.gz'
>  to 
> '/var/mesos/slaves/10c35371-f690-4d40-8b9e-30ffd04405fb-S6/frameworks/ff2993eb-987f-47b0-b3af-fb8b49ab0470-/executors/test-nginx.fe01a0c0-8135-11e9-a160-02427a38aa03/runs/6a6e87e8-5eef-4e8e-8c00-3f081fa187b0':
>  Failed to read archive header: Linkname can't be converted from UTF-8 to 
> current locale.}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]:}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: End 
> fetcher log for container 6a6e87e8-5eef-4e8e-8c00-3f081fa187b0}}
> {{May 28 10:47:30 t01m01.node.t01.dns.teralytics.net mesos-slave[4319]: E0528 
> 10:47:30.846695  4343 fetcher.cpp:571] Failed to run mesos-fetcher: Failed to 
> fetch all URIs for container '6a6e87e8-5eef-4e8e-8c00-3f081fa187b0': exited 
> with status 1}}
> The same Marathon application works fine with Mesos 1.6, which does not use 
> libarchive.





[jira] [Comment Edited] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

2019-05-27 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849042#comment-16849042
 ] 

Andrei Budnik edited comment on MESOS-9306 at 5/27/19 4:10 PM:
---

The patch `/r/70609/` was discarded.

If `cgroups::destroy` hangs due to a blocking system call caused by a kernel 
bug, then there is no workaround available on the Mesos side. In this case, we 
can only help an operator detect the problem. This can be achieved by 
introducing a debug endpoint for the Mesos containerizer; see MESOS-9756.


was (Author: abudnik):
The patch `/r/70609/` was discarded.

If `cgroups::destroy` hangs due to a blocking system call caused by a kernel 
bug, then there is no workaround available on Mesos side to fix the issue. In 
this case, we could only help an operator to detect the problem. This could be 
done by introducing a debug endpoint for the Mesos containerizer, see 
MESOS-9756.

> Mesos containerizer can get stuck during cgroup cleanup
> ---
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 

[jira] [Commented] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

2019-05-27 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849042#comment-16849042
 ] 

Andrei Budnik commented on MESOS-9306:
--

The patch `/r/70609/` was discarded.

If `cgroups::destroy` hangs due to a blocking system call caused by a kernel 
bug, then there is no workaround available on the Mesos side. In this case, we 
can only help an operator detect the problem. This could be done by 
introducing a debug endpoint for the Mesos containerizer; see MESOS-9756.

> Mesos containerizer can get stuck during cgroup cleanup
> ---
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.4.0, 1.5.0, 1.6.0, 1.7.0, 1.8.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:26:40: W1010 14:26:40.032526  6810 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:27:40: W1010 14:27:40.029932  6801 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e 

[jira] [Commented] (MESOS-9306) Mesos containerizer can get stuck during cgroup cleanup

2019-05-08 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16835580#comment-16835580
 ] 

Andrei Budnik commented on MESOS-9306:
--

I've reproduced the timeout case for `cgroups::destroy` by adding the following 
code
{code:java}
Owned<Promise<Nothing>> promise(new Promise<Nothing>());
return promise->future();
{code}
to the beginning of 
[destroy()|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1548]
 function. It turns out that 
[`__destroy`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1590-L1602]
 is never invoked due to a missing `onDiscard` handler. We only subscribe 
[`onAny`|https://github.com/apache/mesos/blob/db7ce35dc155c2de7e66ec051ee0f6bcf784b4e1/src/linux/cgroups.cpp#L1613]
 callback, which is never called after calling `future.discard()`.

The reason `cgroups::destroy` hangs for the systemd hierarchy is unknown; it 
might be related to a kernel issue.
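The missing-handler behavior described above can be reproduced with a toy future type (a simplified, single-threaded stand-in for libprocess, not its real API): a continuation subscribed only via `onAny` never runs when the caller discards, because a discard request does not complete the future by itself.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy future: completion and discard are separate signals, mirroring the
// observation that an `onAny` callback does not fire on `discard()` alone.
struct ToyFuture {
  bool completed = false;
  bool discarded = false;
  std::vector<std::function<void()>> anyCallbacks;
  std::vector<std::function<void()>> discardCallbacks;

  void onAny(std::function<void()> cb) {
    anyCallbacks.push_back(std::move(cb));
  }

  void onDiscard(std::function<void()> cb) {
    discardCallbacks.push_back(std::move(cb));
  }

  void complete() {
    completed = true;
    for (auto& cb : anyCallbacks) cb();  // onAny fires on completion
  }

  void discard() {
    discarded = true;
    // onAny deliberately does NOT run: a discard is only a request, and
    // nothing completes the future unless a discard handler reacts.
    for (auto& cb : discardCallbacks) cb();
  }
};
```

In this model, the fix is for the producer to install an `onDiscard` handler that completes (or abandons) the operation so the `onAny` chain can run.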

> Mesos containerizer can get stuck during cgroup cleanup
> ---
>
> Key: MESOS-9306
> URL: https://issues.apache.org/jira/browse/MESOS-9306
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.7.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> I observed a task group's executor container which failed to be completely 
> destroyed after its associated tasks were killed. The following is an excerpt 
> from the agent log which is filtered to include only lines with the container 
> ID, {{d463b9fe-970d-4077-bab9-558464889a9e}}:
> {code}
> 2018-10-10 14:20:50: I1010 14:20:50.204756  6799 containerizer.cpp:2963] 
> Container d463b9fe-970d-4077-bab9-558464889a9e has exited
> 2018-10-10 14:20:50: I1010 14:20:50.204839  6799 containerizer.cpp:2457] 
> Destroying container d463b9fe-970d-4077-bab9-558464889a9e in RUNNING state
> 2018-10-10 14:20:50: I1010 14:20:50.204859  6799 containerizer.cpp:3124] 
> Transitioning the state of container d463b9fe-970d-4077-bab9-558464889a9e 
> from RUNNING to DESTROYING
> 2018-10-10 14:20:50: I1010 14:20:50.204960  6799 linux_launcher.cpp:580] 
> Asked to destroy container d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.204993  6799 linux_launcher.cpp:622] 
> Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:20:50: I1010 14:20:50.205417  6806 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.205477  6810 cgroups.cpp:2838] Freezing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.205708  6808 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 203008ns
> 2018-10-10 14:20:50: I1010 14:20:50.205878  6800 cgroups.cpp:1229] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 339200ns
> 2018-10-10 14:20:50: I1010 14:20:50.206185  6799 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos
> 2018-10-10 14:20:50: I1010 14:20:50.206226  6808 cgroups.cpp:2856] Thawing 
> cgroup /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e
> 2018-10-10 14:20:50: I1010 14:20:50.206455  6808 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e after 
> 83968ns
> 2018-10-10 14:20:50: I1010 14:20:50.306803  6810 cgroups.cpp:1258] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/d463b9fe-970d-4077-bab9-558464889a9e/mesos after 
> 100.50816ms
> 2018-10-10 14:20:50: I1010 14:20:50.307531  6805 linux_launcher.cpp:654] 
> Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/d463b9fe-970d-4077-bab9-558464889a9e'
> 2018-10-10 14:21:40: W1010 14:21:40.032855  6809 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:22:40: W1010 14:22:40.031224  6800 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:23:40: W1010 14:23:40.031946  6799 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:24:40: W1010 14:24:40.032979  6804 containerizer.cpp:2401] 
> Skipping status for container d463b9fe-970d-4077-bab9-558464889a9e because: 
> Container does not exist
> 2018-10-10 14:25:40: W1010 14:25:40.030784  6808 

[jira] [Commented] (MESOS-9695) Remove the duplicate pid check in Docker containerizer

2019-04-30 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16830263#comment-16830263
 ] 

Andrei Budnik commented on MESOS-9695:
--

{code:java}
commit c8004ee8a0962d0e0f9147718853160bb708f5bc
Author: Qian Zhang 
Date: Tue Apr 30 13:23:26 2019 +0200

Removed the duplicate pid check in Docker containerizer.

Review: https://reviews.apache.org/r/70561/
{code}

> Remove the duplicate pid check in Docker containerizer
> --
>
> Key: MESOS-9695
> URL: https://issues.apache.org/jira/browse/MESOS-9695
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>Priority: Major
>  Labels: containerization
>
> In `DockerContainerizerProcess::_recover`, we check whether two executors 
> use the same pid, and error out if we find a duplicate (see 
> [here|https://github.com/apache/mesos/blob/1.7.2/src/slave/containerizer/docker.cpp#L1068:L1078]
>  for details). However, I do not see the value this check gives us, and it 
> can cause a serious issue (an agent crash loop on restart) in the rare case 
> where a new executor reuses the pid of an old executor, so I think we'd 
> better remove it from the Docker containerizer.
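To illustrate why the check is unsafe (the names below are illustrative, not the actual Docker containerizer code): the kernel recycles pids, so after an agent restart a new executor can legitimately be handed the pid of a dead one, and a hard error on duplicates turns into a crash loop.

```cpp
#include <map>
#include <string>

using Pid = int;  // stand-in pid type for the sketch

// Returns false (a recovery error) if two executors claim the same pid --
// the check MESOS-9695 removes, because the kernel recycles pids and a
// legitimate reuse would make the agent refuse to recover on every restart.
bool checkNoDuplicatePids(const std::map<std::string, Pid>& executorPids) {
  std::map<Pid, std::string> seen;
  for (const auto& entry : executorPids) {
    // entry.first is the executor id, entry.second its checkpointed pid.
    if (!seen.emplace(entry.second, entry.first).second) {
      return false;  // duplicate pid => recovery error => crash loop
    }
  }
  return true;
}
```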





[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode

2019-04-24 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16825251#comment-16825251
 ] 

Andrei Budnik commented on MESOS-9718:
--

Hi [~QuellaZhang],

Just verified your patch in our internal CI - LGTM!

BTW, could these tests be compiled if you removed only the u8 prefix from the 
string literals? E.g., use
"~~~\u00ff\u00ff\u00ff\u00ff"
instead of
u8"~~~\u00ff\u00ff\u00ff\u00ff" (or "~~~\xC3\xBF\xC3\xBF\xC3\xBF\xC3\xBF")


Would you like to send a PR for the patch to [https://github.com/apache/mesos]?
[http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr]
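For reference, the suggestion above relies on a C++20 change: a `u8` literal has type `const char8_t[]` and no longer converts to `std::string`, whereas an ordinary narrow literal with explicit UTF-8 escape bytes carries the same contents in every standard mode. A minimal sketch:

```cpp
#include <string>

// Portable replacement for u8"~~~\u00ff": under C++20 the u8 literal is
// const char8_t[] and fails to initialize a std::string, while the explicit
// UTF-8 bytes below are valid everywhere (U+00FF encodes as 0xC3 0xBF).
inline std::string sample() {
  return "~~~\xC3\xBF";
}
```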

> Compile failures with char8_t by MSVC under /std:c++latest(C++20) mode
> --
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
> Attachments: mesos.patch.txt
>
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with an unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const char8_t [16]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: 
> 'AssertSomeEq': no matching overloaded function found
>  

[jira] [Commented] (MESOS-8983) SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky

2019-04-16 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819281#comment-16819281
 ] 

Andrei Budnik commented on MESOS-8983:
--

This test fails pretty often on ARM.

> SlaveRecoveryTest/0.PingTimeoutDuringRecovery is flaky
> --
>
> Key: MESOS-8983
> URL: https://issues.apache.org/jira/browse/MESOS-8983
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.0, 1.8.0
>Reporter: Alexander Rojas
>Assignee: Joseph Wu
>Priority: Major
>  Labels: flaky-test, foundations
>
> During an unrelated change in a PR, the apache build bot sent the following 
> error:
> {noformat}
> @   7FF71117D888  
> std::invoke<,process::Future
>  >,process::ProcessBase *>
> @   7FF71119257B  
> lambda::internal::Partial<,process::Future
>  >,std::_Ph<1> 
> >::invoke_expand<,std::tuple
>  >,std::_Ph<1> >,st
> @   7FF7110C08BA  ) @   7FF7110F058C  
> std::_Invoker_functor::_Call,process::Future
>  >,std::_Ph<1> >,process::ProcessBase *>
> @   7FF711183EBC  
> std::invoke,process::Future
>  >,std::_Ph<1> >,process::ProcessBase *>
> @   7FF7110C9F21  
> ),process::Future
>  >,std::_Ph<1> >,process::ProcessBase *
> @   7FF711236416  process::ProcessBase 
> *)>::CallableFn,process::Future
>  >,std::_Ph<1> > >::operator(
> @   7FF712C1A25D  process::ProcessBase *)>::operator(
> @   7FF712ACB2F9  process::ProcessBase::consume
> @   7FF712C738CA  process::DispatchEvent::consume
> @   7FF70ECE7B07  process::ProcessBase::serve
> @   7FF712AD93B0  process::ProcessManager::resume
> @   7FF712C07371   ?? 
> @   7FF712B2B130  
> std::_Invoker_functor::_Call< >
> @   7FF712B8B8E0  
> std::invoke< >
> @   7FF712B4076C  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Execute<0>
> @   7FF712C5A60A  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Run
> @   7FF712C45E78  
> std::_LaunchPad
>  >,std::default_delete > 
> > > >::_Go
> @   7FF712C2C3CD  std::_Pad::_Call_func
> @   7FFF9BE53428  _register_onexit_function
> @   7FFF9BE53071  _register_onexit_function
> @   7FFFB6391FE4  BaseThreadInitThunk
> @   7FFFB69FF061  RtlUserThreadStart
> ll containerizers
> I0606 10:25:26.680230 18356 slave.cpp:7158] Recovering executors
> I0606 10:25:26.680230 18356 slave.cpp:7182] Sending reconnect request to 
> executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework 
> 62cf792a-dc69-4e3c-b54f-d83f98fb9451- at executor(1)@192.10.1.5:55652
> I0606 10:25:26.688225 22560 slave.cpp:4984] Received re-registration message 
> from executor '3f11d255-bb7b-4e99-967b-055fef95b595' of framework 
> 62cf792a-dc69-4e3c-b54f-d83f98fb9451-
> I0606 10:25:26.691216 22888 slave.cpp:5901] No pings from master received 
> within 75secs
> F0606 10:25:26.692219 22888 slave.cpp:1249] Check failed: state == 
> DISCONNECTED || state == RUNNING || state == TERMINATING RECOVERING
> {noformat}





[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode

2019-04-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814730#comment-16814730
 ] 

Andrei Budnik commented on MESOS-9718:
--

[~QuellaZhang] If you have a possible fix in mind, we could discuss it via the 
dev mailing list [1] or in the Slack dev channel [2]. Joseph and I can help 
with committing your patches into Mesos.

[1] [http://mesos.apache.org/community/#mailing-lists]
[2] [http://mesos.apache.org/community/#slack]

> Compile failures with char8_t by MSVC under /std:c++latest mode
> ---
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue is only 
> found when compiling with an unreleased VC toolset; the next release of MSVC 
> will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'

[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode

2019-04-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814322#comment-16814322
 ] 

Andrei Budnik commented on MESOS-9718:
--

[~kaysoky] What could be a possible fix or mitigation for this error?

> Compile failures with char8_t by MSVC under /std:c++latest mode
> ---
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue only 
> shows up when compiling with the unreleased VC toolset; the next release of 
> MSVC will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try<std::string> base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const char8_t [16]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2672: 
> 'AssertSomeEq': no matching overloaded function found
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2780: 
> 'testing::AssertionResult AssertSomeEq(const char *,const char *,const T1 
> &,const T2 &)': expects 4 arguments - 3 provided
>  D:\Mesos\src\3rdparty\stout\include\stout/gtest.hpp(79): note: see 
> declaration of 'AssertSomeEq'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  

[jira] [Commented] (MESOS-9718) Compile failures with char8_t by MSVC under /std:c++latest mode

2019-04-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9718?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16814320#comment-16814320
 ] 

Andrei Budnik commented on MESOS-9718:
--

This error appeared after the following patch landed:
{code:java}
commit 703d0011d9049c6003f6d57026f5e764d1cb4435
Author: John Kordich 
Date: Thu Apr 13 18:07:25 2017 -0700

Windows: Fixed Base64Test.EncodeURLSafe.

C++ encodes string literals in the compiling platform's encoding
of choice, which means UTF8 for Posix, and ANSI for Windows.

This has implications for this particular test, as the string literal
"~~~\u00ff\u00ff\u00ff\u00ff" is translated into different bytes:
Posix: { 126, 126, 126, 195, 191, 195, 191, 195, 191, 195, 191 }
Windows: { 126, 126, 126, 255, 255, 255, 255 }

Prepending `u8` to the string literal tells the compiler to encode
the string as UTF8. This does not expose any underlying bug(s)
on Windows because the test is only failing due to an incorrect input.

Review: https://reviews.apache.org/r/58430/
{code}

> Compile failures with char8_t by MSVC under /std:c++latest mode
> ---
>
> Key: MESOS-9718
> URL: https://issues.apache.org/jira/browse/MESOS-9718
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: QuellaZhang
>Priority: Major
>  Labels: windows
>
> Hi All,
> We've stumbled across some build failures in Mesos after implementing support 
> for char8_t under /std:c++latest in the development version of Visual C++. 
> Could you help look at this? Thanks in advance! Note that this issue only 
> shows up when compiling with the unreleased VC toolset; the next release of 
> MSVC will have this behavior.
> *Repro steps:*
>  git clone -c core.autocrlf=true [https://github.com/apache/mesos] 
> D:\mesos\src
>  open a VS 2017 x64 command prompt as admin and browse to D:\mesos
>  set _CL_=/std:c++latest
>  cd src
>  .\bootstrap.bat
>  cd ..
>  mkdir build_x64 && pushd build_x64
>  cmake ..\src -G "Visual Studio 15 2017 Win64" 
> -DCMAKE_SYSTEM_VERSION=10.0.17134.0 -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin" -T host=x64
> *Failures:*
>  base64_tests.i
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(63): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2664: 
> 'std::string base64::encode_url_safe(const std::string &,bool)': cannot 
> convert argument 1 from 'const char8_t [12]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: Reason: cannot 
> convert from 'const char8_t [12]' to 'const std::string'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): note: No constructor 
> could take the source type, or constructor overload resolution was ambiguous
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2660: 
> 'testing::internal::EqHelper::Compare': function does not take 3 
> arguments
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(1430):
>  note: see declaration of 'testing::internal::EqHelper::Compare'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(67): error C2512: 
> 'testing::AssertionResult': no appropriate default constructor available
>  
> D:\Mesos\build_x64\3rdparty\googletest-1.8.0\src\googletest-1.8.0\googletest\include\gtest/gtest.h(256):
>  note: see declaration of 'testing::AssertionResult'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): error C2664: 
> 'Try<std::string> base64::decode_url_safe(const std::string &)': cannot 
> convert argument 1 from 'const char8_t [16]' to 'const std::string &'
>  D:\Mesos\src\3rdparty\stout\tests\base64_tests.cpp(83): note: Reason: cannot 
> convert from 'const 

[jira] [Comment Edited] (MESOS-9709) Docker executor can become stuck terminating

2019-04-09 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335
 ] 

Andrei Budnik edited comment on MESOS-9709 at 4/9/19 4:58 PM:
--

This agent responds to polling of the `/state` endpoint, but hangs when polling 
`/containers` and `/__processes__`.

GDB can't attach to the running agent - it hangs as well.

top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state.

Here is the stack trace of the agent's hanging thread:
{code:java}
[] copy_net_ns+0xa2/0x180
[] create_new_namespaces+0xf9/0x180
[] copy_namespaces+0x8e/0xd0
[] copy_process+0xb66/0x1a40
[] do_fork+0x91/0x320
[] SyS_clone+0x16/0x20
[] stub_clone+0x44/0x70
[] 0x{code}
dmesg shows a message repeating every 10 seconds:
{code:java}
unregister_netdevice: waiting for tunl0 to become free. Usage count = 1{code}


was (Author: abudnik):
This agent responds to polling of the `/state` endpoint, but hangs when polling 
`/containers` and `/__processes__`.

GDB can't attach to the running agent - it hangs as well.

top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state.

Here is the stack trace of the agent's hanging thread:
{code:java}
[] copy_net_ns+0xa2/0x180
[] create_new_namespaces+0xf9/0x180
[] copy_namespaces+0x8e/0xd0
[] copy_process+0xb66/0x1a40
[] do_fork+0x91/0x320
[] SyS_clone+0x16/0x20
[] stub_clone+0x44/0x70
[] 0x{code}

> Docker executor can become stuck terminating
> 
>
> Key: MESOS-9709
> URL: https://issues.apache.org/jira/browse/MESOS-9709
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, mesosphere
> Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is 
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the 
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching 
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container 
> info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor 
> reregistration timeout elapses, the agent attempts to terminate the executor 
> but it does not seem to be successful. The scheduler continues to try to kill 
> the task but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill 
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 
> because the executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9709) Docker executor can become stuck terminating

2019-04-09 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335
 ] 

Andrei Budnik edited comment on MESOS-9709 at 4/9/19 1:24 PM:
--

This agent responds to polling of the `/state` endpoint, but hangs when polling 
`/containers` and `/__processes__`.

GDB can't attach to the running agent - it hangs as well.

top -H -p `pidof mesos-agent` shows that one thread is stuck in the D state.

Here is the stack trace of the agent's hanging thread:
{code:java}
[] copy_net_ns+0xa2/0x180
[] create_new_namespaces+0xf9/0x180
[] copy_namespaces+0x8e/0xd0
[] copy_process+0xb66/0x1a40
[] do_fork+0x91/0x320
[] SyS_clone+0x16/0x20
[] stub_clone+0x44/0x70
[] 0x{code}


was (Author: abudnik):
This agent responds to polling of the `/state` endpoint, but hangs when polling 
`/containers` and `/__processes__`.

GDB can't attach to the running agent - it hangs as well.

Here is the stack trace of the agent's hanging thread:
{code:java}
[] copy_net_ns+0xa2/0x180
[] create_new_namespaces+0xf9/0x180
[] copy_namespaces+0x8e/0xd0
[] copy_process+0xb66/0x1a40
[] do_fork+0x91/0x320
[] SyS_clone+0x16/0x20
[] stub_clone+0x44/0x70
[] 0x{code}

> Docker executor can become stuck terminating
> 
>
> Key: MESOS-9709
> URL: https://issues.apache.org/jira/browse/MESOS-9709
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, mesosphere
> Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is 
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the 
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching 
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container 
> info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor 
> reregistration timeout elapses, the agent attempts to terminate the executor 
> but it does not seem to be successful. The scheduler continues to try to kill 
> the task but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill 
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 
> because the executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating
> {code}





[jira] [Commented] (MESOS-9709) Docker executor can become stuck terminating

2019-04-09 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813390#comment-16813390
 ] 

Andrei Budnik commented on MESOS-9709:
--

It's a Linux kernel bug: [https://github.com/lxc/lxc/issues/2141]

> Docker executor can become stuck terminating
> 
>
> Key: MESOS-9709
> URL: https://issues.apache.org/jira/browse/MESOS-9709
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, mesosphere
> Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is 
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the 
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching 
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container 
> info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor 
> reregistration timeout elapses, the agent attempts to terminate the executor 
> but it does not seem to be successful. The scheduler continues to try to kill 
> the task but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill 
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 
> because the executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating
> {code}





[jira] [Commented] (MESOS-9709) Docker executor can become stuck terminating

2019-04-09 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16813335#comment-16813335
 ] 

Andrei Budnik commented on MESOS-9709:
--

This agent responds to polling of the `/state` endpoint, but hangs when polling 
`/containers` and `/__processes__`.

GDB can't attach to the running agent - it hangs as well.

Here is the stack trace of the agent's hanging thread:
{code:java}
[] copy_net_ns+0xa2/0x180
[] create_new_namespaces+0xf9/0x180
[] copy_namespaces+0x8e/0xd0
[] copy_process+0xb66/0x1a40
[] do_fork+0x91/0x320
[] SyS_clone+0x16/0x20
[] stub_clone+0x44/0x70
[] 0x{code}

> Docker executor can become stuck terminating
> 
>
> Key: MESOS-9709
> URL: https://issues.apache.org/jira/browse/MESOS-9709
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Greg Mann
>Priority: Major
>  Labels: containerization, mesosphere
> Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is 
> {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the 
> string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching 
> container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container 
> info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor 
> reregistration timeout elapses, the agent attempts to terminate the executor 
> but it does not seem to be successful. The scheduler continues to try to kill 
> the task but we repeatedly see
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re 
> mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill 
> task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 
> because the executor 
> 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of 
> framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321- is terminating
> {code}





[jira] [Commented] (MESOS-9707) Calling link::lo() may cause runtime error

2019-04-08 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812625#comment-16812625
 ] 

Andrei Budnik commented on MESOS-9707:
--

Thanks for filing the ticket!
Would you like to create a PR for the fix on [https://github.com/apache/mesos]?
[http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr]

> Calling link::lo() may cause runtime error 
> ---
>
> Key: MESOS-9707
> URL: https://issues.apache.org/jira/browse/MESOS-9707
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.7.2
>Reporter: Pavel
>Priority: Major
>
> If Mesos uses isolation="network/port_mapping", it calls link::lo() during 
> the PortMappingIsolatorProcess::create procedure:
> {code:C++}
>   Try<std::set<std::string>> links = net::links();
>   if (links.isError()) {
> return Error("Failed to get all the links: " + links.error());
>   }
>   foreach (const string& link, links.get()) {
> Result<bool> test = link::internal::test(link, IFF_LOOPBACK);
> if (test.isError()) {
>   return Error("Failed to check the flag on link: " + link);
> } else if (test.get()) {
>   return link;
> }
> }
> {code}
> It iterates through net::links() and returns the first link with the 
> IFF_LOOPBACK flag set.
> For some network configurations the test variable can be None, and test.get() 
> then throws a runtime error.
> In my case a bridged interface caused link::internal::test(link, IFF_LOOPBACK) 
> to return None.
> Changing the code to 
> {code:C++}
> else if (test.isSome()) {
> if (test.get()) {
> return link;
> }
> }
> {code}
> solves the issue.





[jira] [Commented] (MESOS-6285) Agents may OOM during recovery if there are too many tasks or executors

2019-04-08 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6285?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812536#comment-16812536
 ] 

Andrei Budnik commented on MESOS-6285:
--

[~kaysoky] What is the relation between this ticket and MESOS-7947? Does 
MESOS-7947 provide only a partial solution?

> Agents may OOM during recovery if there are too many tasks or executors
> ---
>
> Key: MESOS-6285
> URL: https://issues.apache.org/jira/browse/MESOS-6285
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.0.1
>Reporter: Joseph Wu
>Priority: Critical
>  Labels: mesosphere
>
> On a test cluster, we encountered a degenerate case where running the 
> example {{long-lived-framework}} for over a week would render the agent 
> unrecoverable.  
> The {{long-lived-framework}} creates one custom {{long-lived-executor}} and 
> launches a single task on that executor every time it receives an offer from 
> that agent.  Over a week's worth of time, the framework manages to launch 
> some 400k tasks (short sleeps) on one executor.  During runtime, this is not 
> problematic, as each completed task is quickly rotated out of the agent's 
> memory (and checkpointed to disk).
> During recovery, however, the agent reads every single task into memory, 
> which leads to slow recovery; and often results in the agent being OOM-killed 
> before it finishes recovering.
> To repro this condition quickly:
> 1) Apply this patch to the {{long-lived-framework}}:
> {code}
> diff --git a/src/examples/long_lived_framework.cpp 
> b/src/examples/long_lived_framework.cpp
> index 7c57eb5..1263d82 100644
> --- a/src/examples/long_lived_framework.cpp
> +++ b/src/examples/long_lived_framework.cpp
> @@ -358,16 +358,6 @@ private:
>// Helper to launch a task using an offer.
>void launch(const Offer& offer)
>{
> -int taskId = tasksLaunched++;
> -++metrics.tasks_launched;
> -
> -TaskInfo task;
> -task.set_name("Task " + stringify(taskId));
> -task.mutable_task_id()->set_value(stringify(taskId));
> -task.mutable_agent_id()->MergeFrom(offer.agent_id());
> -task.mutable_resources()->CopyFrom(taskResources);
> -task.mutable_executor()->CopyFrom(executor);
> -
>  Call call;
>  call.set_type(Call::ACCEPT);
>  
> @@ -380,7 +370,23 @@ private:
>  Offer::Operation* operation = accept->add_operations();
>  operation->set_type(Offer::Operation::LAUNCH);
>  
> -operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +// Launch as many tasks as possible in the given offer.
> +Resources remaining = Resources(offer.resources()).flatten();
> +while (remaining.contains(taskResources)) {
> +  int taskId = tasksLaunched++;
> +  ++metrics.tasks_launched;
> +
> +  TaskInfo task;
> +  task.set_name("Task " + stringify(taskId));
> +  task.mutable_task_id()->set_value(stringify(taskId));
> +  task.mutable_agent_id()->MergeFrom(offer.agent_id());
> +  task.mutable_resources()->CopyFrom(taskResources);
> +  task.mutable_executor()->CopyFrom(executor);
> +
> +  operation->mutable_launch()->add_task_infos()->CopyFrom(task);
> +
> +  remaining -= taskResources;
> +}
>  
>  mesos->send(call);
>}
> {code}
> 2) Run a master, agent, and {{long-lived-framework}}.  On a 1 CPU, 1 GB agent 
> with this patch applied, it should take about 10 minutes to build up 
> sufficient task launches.
> 3) Restart the agent and watch it flail during recovery.





[jira] [Commented] (MESOS-8972) when choose docker image use user network all mesos agent crash

2019-04-04 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1681#comment-1681
 ] 

Andrei Budnik commented on MESOS-8972:
--

[~saturnman], [~omegavveapon] Could you please provide the Marathon app 
definition (JSON) that causes this failure? Which version of Marathon are you 
running?

> when choose docker image use user network all mesos agent crash
> ---
>
> Key: MESOS-8972
> URL: https://issues.apache.org/jira/browse/MESOS-8972
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.7.0
> Environment: Ubuntu 14.04 & Ubuntu 16.04, both type crashes mesos
>Reporter: saturnman
>Priority: Blocker
>  Labels: docker, network
>
> When a Docker task submitted from Marathon uses the user network, the Mesos 
> agent process crashes with the following backtrace:
> mesos-agent: .././../3rdparty/stout/include/stout/option.hpp:118: const T& 
> Option<T>::get() const & [with T = std::__cxx11::basic_string<char>]: 
> Assertion `isSome()' failed.
> *** Aborted at 1527797505 (unix time) try "date -d @1527797505" if you are 
> using GNU date ***
> PC: @ 0x7fc03d43f428 (unknown)
> *** SIGABRT (@0x4514) received by PID 17684 (TID 0x7fc033143700) from PID 
> 17684; stack trace: ***
>  @ 0x7fc03dd7d390 (unknown)
>  @ 0x7fc03d43f428 (unknown)
>  @ 0x7fc03d44102a (unknown)
>  @ 0x7fc03d437bd7 (unknown)
>  @ 0x7fc03d437c82 (unknown)
>  @ 0x564f1ad8871d 
> _ZNKR6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc3getEv
>  @ 0x7fc048c43256 
> mesos::internal::slave::NetworkCniIsolatorProcess::getNetworkConfigJSON()
>  @ 0x7fc048c368cb mesos::internal::slave::NetworkCniIsolatorProcess::prepare()
>  @ 0x7fc0486e5c18 
> _ZZN7process8dispatchI6OptionIN5mesos5slave19ContainerLaunchInfoEENS2_8internal5slave20MesosIsolatorProcessERKNS2_11ContainerIDERKNS3_15ContainerConfigESB_SE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSJ_FSH_T1_T2_EOT3_OT4_ENKUlSt10unique_ptrINS_7PromiseIS5_EESt14default_deleteISX_EEOS9_OSC_PNS_11ProcessBaseEE_clES10_S11_S12_S14_





[jira] [Created] (MESOS-9698) DroppedOperationStatusUpdate test is flaky

2019-04-04 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9698:


 Summary: DroppedOperationStatusUpdate test is flaky
 Key: MESOS-9698
 URL: https://issues.apache.org/jira/browse/MESOS-9698
 Project: Mesos
  Issue Type: Bug
 Environment: Debian 8
Reporter: Andrei Budnik
 Attachments: DroppedOperationStatusUpdate-badrun1.txt

DroppedOperationStatusUpdate test failed with the following backtrace:
{code:java}
06:50:21 mesos-tests: ../../3rdparty/stout/include/stout/option.hpp:120: T& 
Option<T>::get() & [with T = mesos::FrameworkID]: Assertion `isSome()' failed.
06:50:21 *** Aborted at 1554360620 (unix time) try "date -d @1554360620" if you 
are using GNU date ***
06:50:21 I0404 06:50:20.663539 16308 scheduler.cpp:847] Enqueuing event OFFERS 
received from http://172.16.10.126:42550/master/api/v1/scheduler
06:50:21 I0404 06:50:20.663702 16308 scheduler.cpp:847] Enqueuing event 
UPDATE_OPERATION_STATUS received from 
http://172.16.10.126:42550/master/api/v1/scheduler
06:50:21 PC: @ 0x7fa726c66067 (unknown)
06:50:21 *** SIGABRT (@0x6fad) received by PID 28589 (TID 0x7fa71dfc9700) from 
PID 28589; stack trace: ***
06:50:21 @ 0x7fa726feb890 (unknown)
06:50:21 @ 0x7fa726c66067 (unknown)
06:50:21 @ 0x7fa726c67448 (unknown)
06:50:21 @ 0x7fa726c5f266 (unknown)
06:50:21 @ 0x7fa726c5f312 (unknown)
06:50:21 @ 0x7fa72a1be89a 
_ZNR6OptionIN5mesos11FrameworkIDEE3getEv.part.500
06:50:21 @ 0x7fa72a54002a 
mesos::internal::master::Master::updateOperationStatus()
06:50:21 @ 0x7fa72a5c583b ProtobufProcess<>::_handlerMutM<>()
06:50:21 @ 0x7fa72a58e680 ProtobufProcess<>::consume()
06:50:21 @ 0x7fa72a50cf04 mesos::internal::master::Master::_consume()
06:50:21 @ 0x7fa72a52975d mesos::internal::master::Master::consume()
06:50:21 @ 0x7fa72b60b1d3 process::ProcessManager::resume()
06:50:21 @ 0x7fa72b610ea6 
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
06:50:21 @ 0x7fa7277c6970 (unknown)
06:50:21 @ 0x7fa726fe4064 start_thread
06:50:21 @ 0x7fa726d1962d (unknown)
{code}





[jira] [Commented] (MESOS-9693) Add master validation for SeccompInfo.

2019-03-30 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16805771#comment-16805771
 ] 

Andrei Budnik commented on MESOS-9693:
--

> 2. at most one field of profile_name and unconfined should be set. better to 
> validate in master

We have such a validation in `linux/seccomp` 
[isolator|https://github.com/apache/mesos/blob/9a6b3cb943fd1f8c9732cd5fb7d58a5b55c1460c/src/slave/containerizer/mesos/isolators/linux/seccomp.cpp#L102-L107].

> 1. if seccomp is not enabled, we should return failure if any fw specify 
> seccompInfo and return appropriate status update.

There are two nuances that need to be taken into account.
First, the Seccomp isolator might be disabled on some particular agents, so 
whether Seccomp is enabled can only be determined at the agent level rather 
than cluster-wide.
Second, we don't have similar validation for other "unused" fields in the 
ContainerInfo/LinuxInfo protos. E.g., a framework might specify the `NetworkInfo 
network_infos` field in `ContainerInfo`, but an agent will ignore it if CNI and 
other plugins that consume `network_infos` are not enabled.

> Add master validation for SeccompInfo.
> --
>
> Key: MESOS-9693
> URL: https://issues.apache.org/jira/browse/MESOS-9693
> Project: Mesos
>  Issue Type: Task
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Major
>
> 1. if seccomp is not enabled, we should return failure if any fw specify 
> seccompInfo and return appropriate status update.
> 2. at most one field of profile_name and unconfined should be set. better to 
> validate in master





[jira] [Created] (MESOS-9614) Implement filtering of Seccomp rules by kernel version.

2019-02-27 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9614:


 Summary: Implement filtering of Seccomp rules by kernel version.
 Key: MESOS-9614
 URL: https://issues.apache.org/jira/browse/MESOS-9614
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik


The most recent Docker seccomp profile allows specifying filtering by kernel version, e.g.:
{code:java}
{
    "names": [
        "ptrace"
    ],
    "action": "SCMP_ACT_ALLOW",
    "args": null,
    "comment": "",
    "includes": {
        "minKernel": "4.8"
    },
    "excludes": {}
},
{code}
We need to add support for the `minKernel` filter.
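For illustration, the filtering could work roughly as follows (a Python sketch with hypothetical helper names; the actual implementation would live in the C++ Mesos containerizer):

```python
def parse_kernel_version(version):
    """Parse a version string like '4.8' or '4.13.16-100.fc25.x86_64'
    into a tuple of ints, ignoring any non-numeric suffix."""
    parts = []
    for token in version.split("."):
        digits = ""
        for ch in token:
            if ch.isdigit():
                digits += ch
            else:
                break
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def rule_applies(rule, kernel_version):
    """Return True if a profile rule should be kept for the given kernel.
    Rules without a 'minKernel' filter always apply."""
    min_kernel = rule.get("includes", {}).get("minKernel")
    if min_kernel is None:
        return True
    return parse_kernel_version(kernel_version) >= parse_kernel_version(min_kernel)


# The 'ptrace' rule from the profile excerpt above requires kernel >= 4.8.
rule = {"names": ["ptrace"], "action": "SCMP_ACT_ALLOW",
        "includes": {"minKernel": "4.8"}}
```

Tuple comparison gives the usual lexicographic version ordering, so `(4, 13, 16) >= (4, 8)` holds while `(3, 10, 0)` does not.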





[jira] [Assigned] (MESOS-9564) Logrotate container logger lets tasks execute arbitrary commands in the Mesos agent's namespace

2019-02-14 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9564?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9564:


Assignee: Andrei Budnik  (was: Joseph Wu)

> Logrotate container logger lets tasks execute arbitrary commands in the Mesos 
> agent's namespace
> ---
>
> Key: MESOS-9564
> URL: https://issues.apache.org/jira/browse/MESOS-9564
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, modules
>Reporter: Joseph Wu
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: foundations, mesosphere
>
> The non-default {{LogrotateContainerLogger}} module allows tasks to configure 
> sandbox log rotation (See 
> http://mesos.apache.org/documentation/latest/logging/#Containers ).  The 
> {{logrotate_stdout_options}} and {{logrotate_stderr_options}} in particular 
> let the task specify free-form text, which is written to a configuration file 
> located in the task's sandbox.  The module does not sanitize or check this 
> configuration at all.
> The logger itself will eventually run {{logrotate}} against the written 
> configuration file, but the logger is not isolated in the same way as the 
> task.  For both the Mesos and Docker containerizers, the logger binary will 
> run in the same namespace as the Mesos agent.  This makes it possible to 
> affect files outside of the task's mount namespace.
> Two modes of attack are known to be problematic:
> * Changing or adding entries to the configuration file.  Normally, the 
> configuration file contains a single file to rotate:
> {code}
> /path/to/sandbox/stdout {
>   
> }
> {code}
> It is trivial to add text to the {{logrotate_stdout_options}} to add a new 
> entry:
> {code}
> /path/to/sandbox/stdout {
>   
> }
> /path/to/other/file/on/disk {
>   
> }
> {code}
> * Logrotate's {{postrotate}} option allows for execution of arbitrary 
> commands.  This can again be supplied with the {{logrotate_stdout_options}} 
> variable.
> {code}
> /path/to/sandbox/stdout {
>   postrotate
> rm -rf /
>   endscript
> }
> {code}
> Some potential fixes to consider:
> * Overwrite the .logrotate.conf files each time. This would leave a third 
> party only milliseconds between the write and the logrotate invocation to 
> modify the config files maliciously. It would not help if the task itself had 
> postrotate options in its environment variables.
> * Sanitize the free-form options field in the environment variables to remove 
> postrotate or injection attempts like }\n/path/to/some/file\noptions{.
> * Refactor parts of the Mesos isolation code path so that the logger and IO 
> switchboard binary live in the same namespaces as the container (instead of 
> the agent). This would also be nice in that the logger's CPU usage would then 
> be accounted for within the container's resources.
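The sanitization idea could be sketched as follows (Python for brevity; `validate_logrotate_options` is a hypothetical helper, not part of the actual module, which is written in C++):

```python
# Logrotate directives that execute arbitrary shell commands.
FORBIDDEN = ("postrotate", "prerotate", "firstaction", "lastaction")


def validate_logrotate_options(options):
    """Reject free-form logrotate options that could escape the intended
    single-file config entry or execute arbitrary commands."""
    for line in options.splitlines():
        stripped = line.strip()
        # Braces would let the task close the current entry and open a
        # new one for an arbitrary path on disk.
        if "{" in stripped or "}" in stripped:
            return False
        # Script hooks run shell commands as the logger's user.
        if any(stripped.startswith(word) for word in FORBIDDEN):
            return False
    return True
```

With this, benign options such as `rotate 5` pass, while both attack modes described above (entry injection via braces, and `postrotate` scripts) are rejected.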





[jira] [Issue Comment Deleted] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2019-02-08 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik updated MESOS-6632:
-
Comment: was deleted

(was: [~gilbert] Could you please fill out Fix Version/s?)

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> A FD abstraction might help here. In other words, subprocess will no longer 
> be responsible for closing external FDs, instead, the FD destructor will be 
> doing so.
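The proposed FD abstraction amounts to an owning wrapper that closes the descriptor unless ownership was explicitly transferred. A rough sketch (Python for brevity; the real fix would be a C++ RAII type):

```python
import os


class OwnedFD:
    """Takes ownership of a file descriptor and guarantees it is closed,
    even if the launch path that was supposed to consume it fails."""

    def __init__(self, fd):
        self._fd = fd

    def release(self):
        """Transfer ownership to the caller (e.g. to the subprocess),
        so this wrapper will no longer close the descriptor."""
        fd, self._fd = self._fd, None
        return fd

    def close(self):
        if self._fd is not None:
            os.close(self._fd)
            self._fd = None

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()


# If container launch fails before the pipe is handed to the subprocess,
# the wrappers still close both ends, avoiding the leak.
read_fd, write_fd = os.pipe()
with OwnedFD(read_fd), OwnedFD(write_fd):
    pass  # simulated launch failure; both ends are closed on exit
```

On the successful path, `release()` hands the raw descriptor to the subprocess, which then owns closing it.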





[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2019-02-08 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763730#comment-16763730
 ] 

Andrei Budnik commented on MESOS-6632:
--

[~gilbert] Could you please fill out Fix Version/s?

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> A FD abstraction might help here. In other words, subprocess will no longer 
> be responsible for closing external FDs, instead, the FD destructor will be 
> doing so.





[jira] [Assigned] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2019-02-08 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-6632:


Assignee: Andrei Budnik

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> A FD abstraction might help here. In other words, subprocess will no longer 
> be responsible for closing external FDs, instead, the FD destructor will be 
> doing so.





[jira] [Commented] (MESOS-6632) ContainerLogger might leak FD if container launch fails.

2019-02-08 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-6632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16763727#comment-16763727
 ] 

Andrei Budnik commented on MESOS-6632:
--

https://reviews.apache.org/r/69684/

> ContainerLogger might leak FD if container launch fails.
> 
>
> Key: MESOS-6632
> URL: https://issues.apache.org/jira/browse/MESOS-6632
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.2, 1.0.1, 1.1.0
>Reporter: Jie Yu
>Priority: Critical
>
> In MesosContainerizer, if logger->prepare() succeeds but its continuation 
> fails, the pipe fd allocated in the logger will get leaked. We cannot add a 
> destructor in ContainerLogger::SubprocessInfo to close the fd because 
> subprocess might close the OWNED fd.
> A FD abstraction might help here. In other words, subprocess will no longer 
> be responsible for closing external FDs, instead, the FD destructor will be 
> doing so.





[jira] [Assigned] (MESOS-9507) Agent could not recover due to empty docker volume checkpointed files.

2019-02-07 Thread Andrei Budnik (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-9507?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Budnik reassigned MESOS-9507:


Assignee: Andrei Budnik  (was: Gilbert Song)

> Agent could not recover due to empty docker volume checkpointed files.
> --
>
> Key: MESOS-9507
> URL: https://issues.apache.org/jira/browse/MESOS-9507
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Andrei Budnik
>Priority: Critical
>  Labels: containerizer
>
> Agent could not recover due to empty docker volume checkpointed files. Please 
> see logs:
> {noformat}
> Nov 12 17:12:00 guppy mesos-agent[38960]: E1112 17:12:00.978682 38969 
> slave.cpp:6279] EXIT with status 1: Failed to perform recovery: Collect 
> failed: Collect failed: Failed to recover docker volumes for orphan container 
> e1b04051-1e4a-47a9-b866-1d625cda1d22: JSON parse failed: syntax error at line 
> 1 near:
> Nov 12 17:12:00 guppy mesos-agent[38960]: To remedy this do as follows: 
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 1: rm -f 
> /var/lib/mesos/slave/meta/slaves/latest
> Nov 12 17:12:00 guppy mesos-agent[38960]: This ensures agent doesn't recover 
> old live executors.
> Nov 12 17:12:00 guppy mesos-agent[38960]: Step 2: Restart the agent. 
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Nov 12 17:12:00 guppy systemd[1]: Unit dcos-mesos-slave.service entered 
> failed state.
> Nov 12 17:12:00 guppy systemd[1]: dcos-mesos-slave.service failed.
> {noformat}
> This is caused by the agent recovering after the volume state file has been 
> created but before checkpointing finishes. At that point the docker volume is 
> not mounted yet, so the docker volume isolator should skip recovering it.
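The suggested fix boils down to treating an empty (or missing) checkpoint file as "volume never mounted" and skipping it instead of failing recovery. A rough sketch (Python for brevity; hypothetical helper, not the actual C++ isolator code):

```python
import json
import os


def recover_volume_state(path):
    """Return the parsed docker volume checkpoint, or None if the file is
    missing or empty -- i.e. the agent restarted after the state file was
    created but before checkpointing finished, so the volume was never
    mounted and recovery should skip it rather than fail."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path) as f:
        return json.load(f)
```

An empty checkpoint then no longer aborts agent recovery with a "JSON parse failed" error as in the log above.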





[jira] [Comment Edited] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2019-01-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739635#comment-16739635
 ] 

Andrei Budnik edited comment on MESOS-7971 at 1/10/19 5:40 PM:
---

This failure is different from the previous ones.
{code:java}
E0110 17:13:09.326659 13916 master.cpp:8586] Failed to find the operation '' 
(uuid: 825f65eb-3ba1-4dfa-bdfa-8eb29194ace3) for an operator API call on agent 
ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b-S0
{code}
Full log:
{code:java}
[ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
I0110 17:12:59.303460 13893 cluster.cpp:174] Creating default 'local' authorizer
I0110 17:12:59.304430 13912 master.cpp:416] Master 
ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b (ip-172-16-10-92.ec2.internal) started on 
172.16.10.92:42320
I0110 17:12:59.304451 13912 master.cpp:419] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/PfFTwT/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/PfFTwT/master" 
--zk_session_timeout="10secs"
I0110 17:12:59.304585 13912 master.cpp:468] Master only allowing authenticated 
frameworks to register
I0110 17:12:59.304595 13912 master.cpp:474] Master only allowing authenticated 
agents to register
I0110 17:12:59.304603 13912 master.cpp:480] Master only allowing authenticated 
HTTP frameworks to register
I0110 17:12:59.304615 13912 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/PfFTwT/credentials'
I0110 17:12:59.304684 13912 master.cpp:524] Using default 'crammd5' 
authenticator
I0110 17:12:59.304744 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0110 17:12:59.304831 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0110 17:12:59.304889 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0110 17:12:59.304941 13912 master.cpp:605] Authorization enabled
W0110 17:12:59.304967 13912 master.cpp:668] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I0110 17:12:59.305047 13919 hierarchical.cpp:176] Initialized hierarchical 
allocator process
I0110 17:12:59.305128 13918 whitelist_watcher.cpp:77] No whitelist given
I0110 17:12:59.305600 13914 master.cpp:2085] Elected as the leading master!
I0110 17:12:59.305622 13914 master.cpp:1640] Recovering from registrar
I0110 17:12:59.305698 13913 registrar.cpp:339] Recovering registrar
I0110 17:12:59.305853 13912 registrar.cpp:383] Successfully fetched the 
registry (0B) in 118016ns
I0110 17:12:59.305899 13912 registrar.cpp:487] Applied 1 operations in 8238ns; 
attempting to update the registry
I0110 17:12:59.306036 13912 registrar.cpp:544] Successfully updated the 
registry in 112128ns
I0110 17:12:59.306092 13912 registrar.cpp:416] Successfully recovered registrar
I0110 17:12:59.306217 13916 master.cpp:1754] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0110 17:12:59.306258 13919 hierarchical.cpp:216] Skipping recovery of 
hierarchical allocator: nothing to recover
W0110 17:12:59.307780 13893 process.cpp:2829] Attempted to spawn already 
running process files@172.16.10.92:42320
I0110 17:12:59.308149 13893 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0110 17:12:59.310348 13893 linux_launcher.cpp:144] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux 

[jira] [Commented] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky

2019-01-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16739635#comment-16739635
 ] 

Andrei Budnik commented on MESOS-7971:
--

This failure is different from the previous ones.
{code:java}
E0110 17:13:09.326659 13916 master.cpp:8586] Failed to find the operation '' 
(uuid: 825f65eb-3ba1-4dfa-bdfa-8eb29194ace3) for an operator API call on agent 
ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b-S0
{code}
Full log:
{code:java}
[ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove
I0110 17:12:59.303460 13893 cluster.cpp:174] Creating default 'local' authorizer
I0110 17:12:59.304430 13912 master.cpp:416] Master 
ae22a9c8-0ef6-4f1e-b1eb-7b55f6e4508b (ip-172-16-10-92.ec2.internal) started on 
172.16.10.92:42320
I0110 17:12:59.304451 13912 master.cpp:419] Flags at startup: --acls="" 
--agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" 
--allocation_interval="1000secs" --allocator="hierarchical" 
--authenticate_agents="true" --authenticate_frameworks="true" 
--authenticate_http_frameworks="true" --authenticate_http_readonly="true" 
--authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/PfFTwT/credentials" --filter_gpu_resources="true" 
--framework_sorter="drf" --help="false" --hostname_lookup="true" 
--http_authenticators="basic" --http_framework_authenticators="basic" 
--initialize_driver_logging="true" --log_auto_initialize="true" 
--logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_operator_event_stream_subscribers="1000" 
--max_unreachable_tasks_per_framework="1000" --memory_profiling="false" 
--min_allocatable_resources="cpus:0.01|mem:32" --port="5050" 
--publish_per_framework_metrics="true" --quiet="false" 
--recovery_agent_removal_limit="100%" --registry="in_memory" 
--registry_fetch_timeout="1mins" --registry_gc_interval="15mins" 
--registry_max_agent_age="2weeks" --registry_max_agent_count="102400" 
--registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --roles="role1" 
--root_submissions="true" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/PfFTwT/master" 
--zk_session_timeout="10secs"
I0110 17:12:59.304585 13912 master.cpp:468] Master only allowing authenticated 
frameworks to register
I0110 17:12:59.304595 13912 master.cpp:474] Master only allowing authenticated 
agents to register
I0110 17:12:59.304603 13912 master.cpp:480] Master only allowing authenticated 
HTTP frameworks to register
I0110 17:12:59.304615 13912 credentials.hpp:37] Loading credentials for 
authentication from '/tmp/PfFTwT/credentials'
I0110 17:12:59.304684 13912 master.cpp:524] Using default 'crammd5' 
authenticator
I0110 17:12:59.304744 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readonly'
I0110 17:12:59.304831 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-readwrite'
I0110 17:12:59.304889 13912 http.cpp:965] Creating default 'basic' HTTP 
authenticator for realm 'mesos-master-scheduler'
I0110 17:12:59.304941 13912 master.cpp:605] Authorization enabled
W0110 17:12:59.304967 13912 master.cpp:668] The '--roles' flag is deprecated. 
This flag will be removed in the future. See the Mesos 0.27 upgrade notes for 
more information
I0110 17:12:59.305047 13919 hierarchical.cpp:176] Initialized hierarchical 
allocator process
I0110 17:12:59.305128 13918 whitelist_watcher.cpp:77] No whitelist given
I0110 17:12:59.305600 13914 master.cpp:2085] Elected as the leading master!
I0110 17:12:59.305622 13914 master.cpp:1640] Recovering from registrar
I0110 17:12:59.305698 13913 registrar.cpp:339] Recovering registrar
I0110 17:12:59.305853 13912 registrar.cpp:383] Successfully fetched the 
registry (0B) in 118016ns
I0110 17:12:59.305899 13912 registrar.cpp:487] Applied 1 operations in 8238ns; 
attempting to update the registry
I0110 17:12:59.306036 13912 registrar.cpp:544] Successfully updated the 
registry in 112128ns
I0110 17:12:59.306092 13912 registrar.cpp:416] Successfully recovered registrar
I0110 17:12:59.306217 13916 master.cpp:1754] Recovered 0 agents from the 
registry (172B); allowing 10mins for agents to reregister
I0110 17:12:59.306258 13919 hierarchical.cpp:216] Skipping recovery of 
hierarchical allocator: nothing to recover
W0110 17:12:59.307780 13893 process.cpp:2829] Attempted to spawn already 
running process files@172.16.10.92:42320
I0110 17:12:59.308149 13893 containerizer.cpp:305] Using isolation { 
environment_secret, posix/cpu, posix/mem, filesystem/posix, network/cni }
I0110 17:12:59.310348 13893 linux_launcher.cpp:144] Using 
/sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0110 17:12:59.310752 13893 

[jira] [Comment Edited] (MESOS-9463) Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter

2018-12-19 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725038#comment-16725038
 ] 

Andrei Budnik edited comment on MESOS-9463 at 12/19/18 2:23 PM:


Since the GTEST filter [does not 
support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests]
 a boolean AND operator, nor composition (which could be used to emulate AND 
via De Morgan's laws), we should either:
 1) fix the Mesos containerizer and Mesos tests to support launching ROOT tests 
in parallel, or
 2) run all tests in sequential mode when GTEST_FILTER is specified.
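To illustrate the limitation: a GTEST_FILTER is a colon-separated list of positive glob patterns, optionally followed by `-` and negative patterns, and two such filters cannot in general be combined into one expression that matches exactly their intersection. A test runner can, however, evaluate both filters itself (a hypothetical Python sketch, not the actual runner code):

```python
from fnmatch import fnmatchcase


def gtest_filter_matches(filter_expr, test_name):
    """Evaluate a GTEST_FILTER expression (positive patterns, optional
    '-' separator, negative patterns) against a single test name."""
    positive, _, negative = filter_expr.partition("-")
    pos_patterns = positive.split(":") if positive else ["*"]
    neg_patterns = negative.split(":") if negative else []
    return (any(fnmatchcase(test_name, p) for p in pos_patterns)
            and not any(fnmatchcase(test_name, p) for p in neg_patterns))


def runs_sequentially(user_filter, test_name):
    """Intersection of the user's filter and the runner's sequential
    filter ('*ROOT_*' here), which GTEST_FILTER alone cannot express."""
    return (gtest_filter_matches(user_filter, test_name)
            and gtest_filter_matches("*ROOT_*", test_name))
```

Note that this simple sketch ignores the corner case of `-` appearing inside a pattern.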


was (Author: abudnik):
Since GTEST filter [does not 
support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests]
 boolean AND operator and does not support composition (to emulate AND operator 
using De Morgan's laws), we should either:
1) Fix mesos c'zer and mesos tests to support launching ROOT tests in parallel
2) when GTEST_FILTER is specified, run all tests in sequential mode

> Parallel test runner gets confused if a GTEST_FILTER expression also matches 
> a sequential filter
> 
>
> Key: MESOS-9463
> URL: https://issues.apache.org/jira/browse/MESOS-9463
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: parallel-tests, test
>
> Users expect to be able to select tests to run via {{make check}} with a 
> {{GTEST_FILTER}} environment variable. The parallel test runner on the other 
> hand programmatically also injects filter expressions to select tests to 
> execute sequentially.
> This causes e.g., all {{*ROOT_*}} tests to be run in the sequential phase for 
> superusers, even if a {{GTEST_FILTER}} was set.
> It seems that we need to handle an already-set {{GTEST_FILTER}} environment 
> variable more carefully.





[jira] [Commented] (MESOS-9463) Parallel test runner gets confused if a GTEST_FILTER expression also matches a sequential filter

2018-12-19 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16725038#comment-16725038
 ] 

Andrei Budnik commented on MESOS-9463:
--

Since the GTEST filter [does not 
support|https://github.com/google/googletest/blob/master/googletest/docs/advanced.md#running-a-subset-of-the-tests]
 a boolean AND operator, nor composition (which could be used to emulate AND 
via De Morgan's laws), we should either:
1) fix the Mesos containerizer and Mesos tests to support launching ROOT tests in parallel, or
2) run all tests in sequential mode when GTEST_FILTER is specified.

> Parallel test runner gets confused if a GTEST_FILTER expression also matches 
> a sequential filter
> 
>
> Key: MESOS-9463
> URL: https://issues.apache.org/jira/browse/MESOS-9463
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benjamin Bannier
>Priority: Major
>  Labels: parallel-tests, test
>
> Users expect to be able to select tests to run via {{make check}} with a 
> {{GTEST_FILTER}} environment variable. The parallel test runner on the other 
> hand programmatically also injects filter expressions to select tests to 
> execute sequentially.
> This causes e.g., all {{*ROOT_*}} tests to be run in the sequential phase for 
> superusers, even if a {{GTEST_FILTER}} was set.
> It seems that we need to handle an already-set {{GTEST_FILTER}} environment 
> variable more carefully.





[jira] [Comment Edited] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715052#comment-16715052
 ] 

Andrei Budnik edited comment on MESOS-9462 at 12/10/18 6:53 PM:


[https://reviews.apache.org/r/69540/]
[https://reviews.apache.org/r/69545/]


was (Author: abudnik):
[https://reviews.apache.org/r/69540/]

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changed how standard device nodes 
> (e.g., /dev/null) are set up: they are now bind-mounted from the host instead 
> of created with mknod.
> The device nodes are created under 
> `/var/run/mesos/containers//devices` and then bind-mounted into 
> the container root filesystem. This is problematic on Linux distros 
> that mount `/var/run` (or `/run`) with `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
>   
>
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` device in the container inherits `nodev` from 
> `/run` on the host:
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This causes a "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> Then go to the Marathon UI (http://localhost:8080) and launch an app using 
> the following config:
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task failed with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`
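As a diagnostic aid, the `nodev` mounts that cause this inheritance can be found by scanning `/proc/self/mountinfo` (a hypothetical Python sketch for troubleshooting only; the actual fix is in the linked review requests):

```python
def mounts_with_nodev(mountinfo_text):
    """Return mount points whose per-mount options include 'nodev'.
    Device nodes bind-mounted from below such a mount point inherit the
    flag, so opening them fails with 'Permission denied'."""
    result = []
    for line in mountinfo_text.splitlines():
        # /proc/[pid]/mountinfo fields: mount-id parent-id major:minor
        # root mount-point mount-options [optional...] - fstype ...
        fields = line.split()
        if len(fields) < 6:
            continue
        mount_point, options = fields[4], fields[5].split(",")
        if "nodev" in options:
            result.append(mount_point)
    return result


# The CentOS 7.4 line from the issue description above:
sample = ("24 62 0:19 / /run rw,nosuid,nodev shared:23 - "
          "tmpfs tmpfs rw,seclabel,mode=755")
```

Running this against a live system (`open('/proc/self/mountinfo').read()`) shows every mount from which a bind-mounted device node would inherit `nodev`.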





[jira] [Commented] (MESOS-9462) Devices in a container are inaccessible due to `nodev` on `/var/run`.

2018-12-10 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16715052#comment-16715052
 ] 

Andrei Budnik commented on MESOS-9462:
--

[https://reviews.apache.org/r/69540/]

> Devices in a container are inaccessible due to `nodev` on `/var/run`.
> -
>
> Key: MESOS-9462
> URL: https://issues.apache.org/jira/browse/MESOS-9462
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.8.0
>Reporter: Jie Yu
>Assignee: Andrei Budnik
>Priority: Blocker
>  Labels: regression
>
> A recent [patch|https://reviews.apache.org/r/69086/] (commit 
> ede8155d1d043137e15007c48da36ac5fa0b5124) changed how standard device nodes 
> (e.g., /dev/null) are set up: they are now bind-mounted from the host instead 
> of created with mknod.
> The device nodes are created under 
> `/var/run/mesos/containers//devices` and then bind-mounted into 
> the container root filesystem. This is problematic on Linux distros 
> that mount `/var/run` (or `/run`) with `nodev`. For instance, CentOS 7.4:
> {noformat}
> [jie@core-dev ~]$ cat /proc/self/mountinfo | grep "/run\ "
>   
>
> 24 62 0:19 / /run rw,nosuid,nodev shared:23 - tmpfs tmpfs rw,seclabel,mode=755
> [jie@core-dev ~]$ cat /etc/redhat-release 
> CentOS Linux release 7.4.1708 (Core) 
> {noformat}
> As a result, the `/dev/null` device in the container inherits `nodev` from 
> `/run` on the host:
> {noformat}
> 629 625 0:121 
> /mesos/containers/49f1da14-d741-4030-994c-0d8ed5093b13/devices/null /dev/null 
> rw,nosuid,nodev - tmpfs tmpfs rw,mode=755
> {noformat}
> This causes a "Permission Denied" error when a process in the container 
> tries to open the device node.
> You can try to reproduce this issue using Mesos Mini
> {noformat}
> docker run --rm --privileged -p 5050:5050 -p 5051:5051 -p 8080:8080 
> mesos/mesos-mini:master-2018-12-06
> {noformat}
> Then go to the Marathon UI (http://localhost:8080) and launch an app using 
> the following config:
> {code}
> {
>   "id": "/test",
>   "cmd": "dd if=/dev/zero of=file bs=1024 count=1 oflag=dsync",
>   "cpus": 1,
>   "mem": 128,
>   "disk": 128,
>   "instances": 1,
>   "container": {
> "type": "MESOS",
> "docker": {
>   "image": "ubuntu:18.04"
> }
>   }
> }
> {code}
> You'll see the task failed with "Permission Denied".
> The task will run normally if you use `mesos/mesos-mini:master-2018-12-01`





[jira] [Created] (MESOS-9461) `CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage` is flaky.

2018-12-07 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9461:


 Summary: `CgroupsIsolatorTest.ROOT_CGROUPS_BlkioUsage` is flaky.
 Key: MESOS-9461
 URL: https://issues.apache.org/jira/browse/MESOS-9461
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.8.0
 Environment: Fedora 25
Reporter: Andrei Budnik
 Attachments: ROOT_CGROUPS_BlkioUsage-badrun.txt

This test consistently fails on Fedora 25 (kernel 4.13.16-100.fc25.x86_64).
{code:java}
$ mount|grep blkio
cgroup on /sys/fs/cgroup/blkio type cgroup 
(rw,nosuid,nodev,noexec,relatime,blkio)
{code}





[jira] [Created] (MESOS-9456) Set `SCMP_FLTATR_CTL_LOG` attribute during initialization of Seccomp context

2018-12-05 Thread Andrei Budnik (JIRA)
Andrei Budnik created MESOS-9456:


 Summary: Set `SCMP_FLTATR_CTL_LOG` attribute during initialization 
of Seccomp context
 Key: MESOS-9456
 URL: https://issues.apache.org/jira/browse/MESOS-9456
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Andrei Budnik


Since version 4.14, the Linux kernel supports the SECCOMP_FILTER_FLAG_LOG flag, 
which enables logging for all Seccomp filter actions except SECCOMP_RET_ALLOW. 
If a Seccomp filter does not allow a system call, the kernel prints a message to 
dmesg when that system call is invoked.

At the moment libseccomp 2.3.3 does not expose this flag, but the latest master 
branch of libseccomp supports SECCOMP_FILTER_FLAG_LOG. So, once the next 
libseccomp release is out, we need to add
{code:java}
seccomp_attr_set(ctx, SCMP_FLTATR_CTL_LOG, 1);{code}
to `SeccompFilter::create()`.

 





[jira] [Commented] (MESOS-9157) cannot pull docker image from dockerhub

2018-12-04 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16708647#comment-16708647
 ] 

Andrei Budnik commented on MESOS-9157:
--

[~MichaelBowie] feel free to reach out to me directly via 
[https://mesos.slack.com/] if you need any help with this ticket.

> cannot pull docker image from dockerhub
> ---
>
> Key: MESOS-9157
> URL: https://issues.apache.org/jira/browse/MESOS-9157
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.6.1
>Reporter: Michael Bowie
>Priority: Blocker
>  Labels: containerization
>
> I am not able to pull docker images from docker hub through marathon/mesos. 
> I get one of two errors:
>  * `Aug 15 10:11:02 michael-b-dcos-agent-1 dockerd[5974]: 
> time="2018-08-15T10:11:02.770309104-04:00" level=error msg="Not continuing 
> with pull after error: context canceled"`
>  * `Failed to run docker -H ... Error: No such object: 
> mesos-d2f333a8-fef2-48fb-8b99-28c52c327790`
> However, I can manually ssh into one of the agents and successfully pull the 
> image from the command line. 
> Any pointers in the right direction?
> Thank you!
> Similar Issues:
> https://github.com/mesosphere/marathon/issues/3869




