[jira] [Commented] (MESOS-10093) Libprocess does not properly escape subprocess argument strings on Windows

2020-04-01 Thread Benjamin Mahler (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073265#comment-17073265
 ] 

Benjamin Mahler commented on MESOS-10093:
-----------------------------------------

Did some research and synced with [~greggomann] and [~asekretenko]. On Windows, 
processes can't actually receive an argument array; they are started with a 
single command-line string, and it's up to each program to parse that string 
into arguments. Most programs do this with {{CommandLineToArgvW}}, especially 
programs that are not Windows-only (e.g. libc-based, Java, or Python 
programs). However, cmd.exe *does not* use {{CommandLineToArgvW}}, so applying 
{{CommandLineToArgvW}}-style quoting to cmd.exe invocations is not correct.
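
For reference, here is a sketch (a hypothetical helper, not the stout code) of 
the quoting rule that {{CommandLineToArgvW}} expects; backslashes are only 
special when they precede a double quote:

{noformat}
#include <string>

// Hypothetical helper: quote one argument so that CommandLineToArgvW
// parses it back verbatim.
std::string quoteForCommandLineToArgvW(const std::string& arg)
{
  // Arguments without whitespace or quotes can be passed through as-is.
  if (!arg.empty() && arg.find_first_of(" \t\"") == std::string::npos) {
    return arg;
  }

  std::string quoted = "\"";
  size_t backslashes = 0;

  for (char c : arg) {
    if (c == '\\') {
      ++backslashes;
    } else if (c == '"') {
      // Double the pending backslashes, then escape the quote itself.
      quoted.append(2 * backslashes + 1, '\\');
      quoted += '"';
      backslashes = 0;
    } else {
      // Backslashes not followed by a double quote are literal.
      quoted.append(backslashes, '\\');
      quoted += c;
      backslashes = 0;
    }
  }

  // Trailing backslashes must be doubled so the closing quote survives.
  quoted.append(2 * backslashes, '\\');
  quoted += '"';
  return quoted;
}
{noformat}

This is exactly the quoting that confuses cmd.exe, which applies its own rules 
for quotes and metacharacters such as ^, & and |.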

The fix will be to surface direct command-line string utilities (Windows only) 
alongside the argv-style ones, and callers will have to choose when to use the 
command-line string versions (e.g. when invoking cmd.exe), as sketched below.
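
Roughly, the split could look like this (signatures are illustrative only; the 
real stout/libprocess names and return types may differ):

{noformat}
#include <string>
#include <vector>

// Existing argv-style path: each argument is quoted per
// CommandLineToArgvW rules and joined into one command line.
bool createProcess(const std::vector<std::string>& argv);

// New Windows-only path: the caller supplies the command line verbatim,
// e.g. when the target is cmd.exe, which does its own parsing.
bool createProcessFromCommandLine(const std::wstring& commandLine);
{noformat}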

We may need to do something special for docker-style commands (e.g. "docker run 
... "), but we're not sure yet.

Some cleanup patches to prepare for a fix:

https://reviews.apache.org/r/72273/
https://reviews.apache.org/r/72285/
https://reviews.apache.org/r/72286/
https://reviews.apache.org/r/72303/
https://reviews.apache.org/r/72304/

> Libprocess does not properly escape subprocess argument strings on Windows
> ---------------------------------------------------------------------------
>
> Key: MESOS-10093
> URL: https://issues.apache.org/jira/browse/MESOS-10093
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: Greg Mann
>Assignee: Benjamin Mahler
>Priority: Major
>  Labels: containerization, docker, mesosphere, windows
>
> When running some tests of Mesos on Windows, I discovered that the following 
> command would not execute successfully when passed to the Docker 
> containerizer in {{TaskInfo.command}}:
> {noformat}
> python -c "print('hello world')"
> {noformat}
> The following error is found in the task sandbox:
> {noformat}
>   File "<string>", line 1
> "print('hello
> ^
> SyntaxError: EOL while scanning string literal
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-01 Thread Charles (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072739#comment-17072739
 ] 

Charles commented on MESOS-10107:
---------------------------------

Hey!

Sure, I created [https://github.com/apache/mesos/pull/355]

I answered your questions there (I'm not sure what the best place for 
discussion is, but it feels nice to have it close to the code review).

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles
>Priority: Major
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py
>
>
> We've been seeing some random errors on our cluster, where the container 
> cgroup isn't properly destroyed after the OOM killer kicks in when the 
> memory limit is exceeded - see analysis and patch below:
> Agent log:
> {noformat}
> I0331 08:49:16.398592 12831 memory.cpp:515] OOM detected for container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.401342 12831 memory.cpp:555] Memory limit exceeded: Requested: 
> 10272MB Maximum Used: 10518532KB
> MEMORY STATISTICS: 
> cache 0
> rss 10502754304
> rss_huge 4001366016
> shmem 0
> mapped_file 270336
> dirty 0
> writeback 0
> swap 0
> pgpgin 1684617
> pgpgout 95480
> pgfault 1670328
> pgmajfault 957
> inactive_anon 0
> active_anon 10501189632
> inactive_file 4096
> active_file 0
> unevictable 0
> hierarchical_memory_limit 10770972672
> hierarchical_memsw_limit 10770972672
> total_cache 0
> total_rss 10502754304
> total_rss_huge 4001366016
> total_shmem 0
> total_mapped_file 270336
> total_dirty 0
> total_writeback 0
> total_swap 0
> total_pgpgin 1684617
> total_pgpgout 95480
> total_pgfault 1670328
> total_pgmajfault 957
> total_inactive_anon 0
> total_active_anon 10501070848
> total_inactive_file 4096
> total_active_file 0
> total_unevictable 0
> I0331 08:49:16.414501 12831 containerizer.cpp:3175] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has reached its limit for resource 
> [{"name":"mem","scalar":{"value":10272.0},"type":"SCALAR"}] and will be 
> terminated
> I0331 08:49:16.415262 12831 containerizer.cpp:2619] Destroying container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c in RUNNING state
> I0331 08:49:16.415323 12831 containerizer.cpp:3317] Transitioning the state 
> of container 2c2a31eb-bac5-4acd-82ee-593c4616a63c from RUNNING to DESTROYING 
> after 4.285078272secs
> I0331 08:49:16.416393 12830 linux_launcher.cpp:576] Asked to destroy 
> container 2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.416484 12830 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c'
> I0331 08:49:16.417093 12830 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.519397 12830 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 102.27072ms
> I0331 08:49:16.524307 12826 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c
> I0331 08:49:16.524654 12828 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c after 
> 242944ns
> I0331 08:49:16.531811 12829 slave.cpp:6539] Got exited event for 
> executor(1)@127.0.1.1:46357
> I0331 08:49:16.539868 12825 containerizer.cpp:3155] Container 
> 2c2a31eb-bac5-4acd-82ee-593c4616a63c has exited
> E0331 08:49:16.548131 12825 slave.cpp:6917] Termination of executor 
> 'task-0-e4e4f131-ee09-4eaa-8120-3797f71c0e16' of framework 
> 0ab2a2ad-d6ef-4ca2-b17a-33972f9e8af7-0001 failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/2c2a31eb-bac5-4acd-82ee-593c4616a63c': Device 
> or resource busy
> {noformat}
> Initially I thought it was a race condition in the cgroup destruction code, 
> but an strace confirmed that the cgroup directory was only deleted once all 
> tasks had exited (edited and commented strace below from a different instance 
> of the same problem):
> {noformat}
> # get the list of processes
> 3431  23:01:32.293608 openat(AT_FDCWD,
> "/sys/fs/cgroup/freezer/mesos/7eb1155b-ee0d-4233-8e49-cbe81f8b4deb/cgroup.procs",
> O_RDONLY 
> 3431  23:01:32.293669 <... openat resumed> ) = 18 <0.36>
> 3431  23:01:32.294220 read(18,  
> 3431  23:01:32.294268 <... read resumed> "5878\n6036\n6210\n", 8192) =
> 15 <0.33>
> 3431  23:01:32.294306 read(18, "", 4096) = 0 <0.13>
> 3431  23:01:32.294346 close(18 
> 3431  23:01:32.294402 <... close resumed> ) = 0 <0.45>
> #kill them
> 3431  

[jira] [Commented] (MESOS-10107) containeriser: failed to remove cgroup - EBUSY

2020-04-01 Thread Andrei Budnik (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072680#comment-17072680
 ] 

Andrei Budnik commented on MESOS-10107:
---------------------------------------

Thanks for the detailed explanations!

Could you please submit your patch to the [Apache Review 
Board|http://mesos.apache.org/documentation/latest/advanced-contribution/#submit-your-patch]
 or open a [PR on 
GitHub|http://mesos.apache.org/documentation/latest/beginner-contribution/#open-a-pr/]?

Does the workaround work reliably after changing the initial delay and retry 
count to the values taken from libcontainerd (10ms and 5)?

Should we retry only if {{::rmdir()}} fails with EBUSY?
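
For concreteness, the retry under discussion might look like the following 
sketch (10ms initial delay and 5 attempts as mentioned above; the doubling 
backoff is an assumption, not necessarily what libcontainerd does):

{noformat}
#include <cerrno>
#include <chrono>
#include <string>
#include <thread>

#include <unistd.h>

// Sketch only: retry cgroup removal, but only when the failure is the
// transient EBUSY (tasks still exiting); fail fast on anything else.
bool removeCgroupWithRetry(const std::string& path)
{
  std::chrono::milliseconds delay(10);

  for (int attempt = 0; attempt < 5; ++attempt) {
    if (::rmdir(path.c_str()) == 0) {
      return true;
    }

    if (errno != EBUSY) {
      return false;
    }

    std::this_thread::sleep_for(delay);
    delay *= 2;
  }

  return false;
}
{noformat}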

> containeriser: failed to remove cgroup - EBUSY
> ----------------------------------------------
>
> Key: MESOS-10107
> URL: https://issues.apache.org/jira/browse/MESOS-10107
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Charles
>Priority: Major
> Attachments: mesos-remove-cgroup-race.diff, 
> reproduce-cgroup-rmdir-race.py