[jira] [Comment Edited] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123 ]

Ian Downes edited comment on MESOS-1199 at 9/23/14 6:07 PM:

Understood. This race has existed in the codebase for a long time. We could consider looking at /proc/\{pid\}/exe to confirm that the pid at least corresponds to the expected executable - still not perfect though.

was (Author: idownes):
Understood. This race has existed in the codebase for a long time. We could consider looking at /proc/{pid}/exe to confirm that the pid at least corresponds to the expected executable - still not perfect though.

Subprocess is slow - gated by process::reap poll interval

Key: MESOS-1199
URL: https://issues.apache.org/jira/browse/MESOS-1199
Project: Mesos
Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Craig Hansen-Sturm
Attachments: wiatpid.pdf

Subprocess uses process::reap to wait on the subprocess pid and set the exit status. However, process::reap polls with a one second interval, resulting in a delay of up to the interval duration before the status future is set. This means that if you need to wait for the subprocess to complete, you get hit with E(delay) = 0.5 seconds, independent of the execution time.

For example, the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the executor during launch. At Twitter we fetch a local file, i.e., a very fast operation, but the launch is blocked until the mesos-fetcher pid is reaped - adding 0 to 1 seconds for every launch!

The problem is even worse with a chain of short Subprocesses because after the first Subprocess completes you'll be synchronized with the reap interval and you'll see nearly the full interval before notification, i.e., 10 Subprocesses each of 1 second duration will take ~10 seconds!

This has become particularly apparent in some new tests I'm working on, where test durations are now greatly extended, with each taking several seconds.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
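
The arithmetic in the description can be checked with a small model. This is a sketch, not Mesos code: it assumes an exit is noticed only at the next tick of a fixed poll interval and that each subprocess in a chain starts as soon as the previous exit is noticed. The function names and the millisecond units are illustrative.

```python
def chain_duration_ms(durations_ms, interval_ms=1000, start_ms=0):
    """Wall time for a chain of subprocesses under a polling reaper.

    Each subprocess starts when the previous exit is noticed, and an exit
    is noticed only at the next poll tick (a multiple of interval_ms).
    """
    now = start_ms
    for duration in durations_ms:
        exit_at = now + duration
        # Round up to the first tick at or after the exit.
        now = -(-exit_at // interval_ms) * interval_ms
    return now


def mean_delay_ms(interval_ms=1000):
    """Average notification delay over exits at every millisecond phase."""
    delays = [(-phase) % interval_ms for phase in range(interval_ms)]
    return sum(delays) / interval_ms
```

Under this model, ten subprocesses of 10 ms each take 10000 ms in total (`chain_duration_ms([10] * 10)`), and `mean_delay_ms()` gives 499.5 ms, in line with E(delay) = 0.5 seconds.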
[jira] [Comment Edited] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087322#comment-14087322 ]

Nikita Vetoshkin edited comment on MESOS-1199 at 8/6/14 6:30 AM:

Just a quick note: polling the pid of non-children is a racy deal. A process can die and a new, unrelated one with the same pid can spin up in between poll attempts. I wonder if we could extend the executor protocol - e.g. ask the executor to bind a specified Unix domain socket. This socket can be polled and reconnected, and the slave will receive a disconnect when the executor dies. Any thoughts?

was (Author: nekto0n):
Just a quick note: polling the pid of non-children is a racy deal. A process can die and a new, unrelated one with the same pid can spin up in between poll attempts. I wonder if we could extend the executor protocol - e.g. to bind specified Unix domain sockets. They can be polled and reconnected, and the slave will receive a disconnect when the executor dies. Any thoughts?

--
This message was sent by Atlassian JIRA (v6.2#6252)
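
The disconnect-on-death idea from this comment can be illustrated on POSIX with a socket pair. This is a simplified sketch under stated assumptions, not the proposed executor protocol: instead of the executor binding a named Unix domain socket, the child simply inherits one end of a `socket.socketpair()`, and the function names are hypothetical.

```python
import socket
import subprocess
import sys


def wait_for_peer_exit(sock):
    """Block until the other end of the socket is closed."""
    while True:
        try:
            data = sock.recv(4096)
        except ConnectionResetError:
            return
        if not data:  # recv returns b"" once every copy of the peer end is closed
            return


def run_and_watch(child_code="import time; time.sleep(0.2)"):
    """Start a child holding one end of a socket pair and wait for the
    disconnect instead of polling its pid. Returns the child's exit code."""
    ours, theirs = socket.socketpair()  # AF_UNIX pair on POSIX
    child = subprocess.Popen(
        [sys.executable, "-c", child_code],
        pass_fds=[theirs.fileno()],  # the child keeps this descriptor open
    )
    theirs.close()  # the parent must drop its copy or EOF never arrives
    wait_for_peer_exit(ours)
    ours.close()
    return child.wait()
```

The kernel closes the child's descriptor when the child exits, so the watcher is notified without waiting for a poll tick. A real design would still have to handle reconnects and an executor that closes the socket while staying alive.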
[jira] [Comment Edited] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085494#comment-14085494 ]

Yifan Gu edited comment on MESOS-1199 at 8/5/14 12:53 AM:

How about using inotify to watch /proc/pid? One concern is that inotify works only on Linux, but there might be equivalents on other platforms (to make Dropbox work, at least...).

Update: Thanks to BenM for the reminder that /proc/pid is not an actual file, so this may not work. Let me test it...

Result: inotify doesn't give any response when the process is killed.

was (Author: yifan):
How about using inotify to watch /proc/pid? One concern is that inotify works only on Linux, but there might be equivalents on other platforms (to make Dropbox work, at least...).

Update: Thanks to BenM for the reminder that /proc/pid is not an actual file, so this may not work. Let me test it...
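
Since inotify does not report the exit, polling the pid remains the fallback for non-children, and the pid-reuse race raised in this thread still applies. Below is a minimal, Linux-only sketch of the /proc/{pid}/exe check suggested in the thread. The function name is hypothetical, and the check is not perfect: a recycled pid may happen to run the same executable.

```python
import os


def exe_matches(pid, expected_exe):
    """Return True only if /proc/<pid>/exe resolves to expected_exe.

    Linux only. Returns False if the pid does not exist, the link cannot
    be read (permissions), or procfs is not available.
    """
    try:
        actual = os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return False
    # Compare canonical paths so symlinked install locations still match.
    return os.path.realpath(actual) == os.path.realpath(expected_exe)
```

A poller could call `exe_matches(pid, path)` on each tick and treat a mismatch as "the original process is gone", rather than trusting the pid alone.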