[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123
 ] 

Ian Downes commented on MESOS-1199:
---

Understood. This race has existed in the codebase for a long time. We could 
consider looking at /proc/{pid}/exe to confirm that the pid at least 
corresponds to the expected executable - still not perfect though.
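
A minimal sketch of the /proc/{pid}/exe check suggested above, written in Python for brevity rather than in the C++ that Mesos uses. It is Linux-only, the helper names are hypothetical, and as noted it only narrows the race: the pid could still be recycled by another process running the same executable.

```python
import os


def exe_path(pid):
    """Return the executable path behind /proc/<pid>/exe, or None if unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        # No such pid, no permission, or no /proc on this platform.
        return None


def pid_runs_executable(pid, expected_exe):
    """True only if pid exists and its executable path equals expected_exe.

    expected_exe is assumed to be an already-resolved absolute path.
    """
    actual = exe_path(pid)
    return actual is not None and actual == expected_exe
```

A caller would record the executable path when the process is launched and compare it again before trusting the pid.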

 Subprocess is slow - gated by process::reap poll interval
 

 Key: MESOS-1199
 URL: https://issues.apache.org/jira/browse/MESOS-1199
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Craig Hansen-Sturm
 Attachments: wiatpid.pdf


 Subprocess uses process::reap to wait on the subprocess pid and set the exit 
 status. However, process::reap polls with a one second interval resulting in 
 a delay up to the interval duration before the status future is set.
 This means if you need to wait for the subprocess to complete you get hit 
 with E(delay) = 0.5 seconds, independent of the execution time. For example, 
 the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the 
 executor during launch. At Twitter we fetch a local file, i.e., a very fast 
 operation, but the launch is blocked until the mesos-fetcher pid is reaped - 
 adding 0 to 1 seconds for every launch!
 The problem is even worse with a chain of short Subprocesses because after 
 the first Subprocess completes you'll be synchronized with the reap interval 
 and you'll see nearly the full interval before notification, i.e., 10 
 Subprocesses each of well under 1 second duration will take ~10 seconds!
 This has become particularly apparent in some new tests I'm working on where 
 test durations are now greatly extended with each taking several seconds.
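
 The arithmetic in the description can be checked with a small model. This is only a sketch under stated assumptions: poll ticks fall on a fixed grid at multiples of the interval, an exit is noticed at the first tick at or after it, and each Subprocess in a chain is launched the moment the previous one is noticed. The function names are hypothetical and do not come from libprocess.

```python
import math


def notify_time(exit_time, interval=1.0):
    """Time of the first poll tick at or after exit_time (ticks at k * interval)."""
    return math.ceil(exit_time / interval) * interval


def chain_total(durations, interval=1.0):
    """Wall time for subprocesses run back to back, where each one starts
    only once a poll tick has noticed the previous one's exit."""
    now = 0.0
    for duration in durations:
        now = notify_time(now + duration, interval)
    return now


# Ten subprocesses of 0.1 s each: every one starts right after a tick,
# so each waits almost the full interval before it is noticed.
print(chain_total([0.1] * 10))
```

 For exit times spread evenly over one interval, the mean of notify_time(t) - t comes out at about half the interval, which matches E(delay) = 0.5 seconds.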



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-09-23 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145147#comment-14145147
 ] 

Ian Downes commented on MESOS-1199:
---

New review with dynamic poll interval: https://reviews.apache.org/r/25947/


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-08-04 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085088#comment-14085088
 ] 

Bernd Mathiske commented on MESOS-1199:
---

This is an attempt to make the best of what we have so far, using two 
threads/actors.

1. A separate thread that uses *blocking* waitpid(0) to wait for all task 
processes that are in the slave's process tree. (Details to be worked out here, 
still drafting.)
2. Use all the techniques discussed above for all those task processes that are 
not in the slave's process tree because the slave was restarted.

Do both simultaneously! At first, there are no processes in category 2. If the 
slave ever restarts, category 1 is initially empty. Thereafter, as new tasks 
are started, we may have both for a while. Eventually, either no processes are 
left in category 2 or the remainder is really long-running, so a little delay 
does not matter for them.

How do we know if a task is a current child process or from before the restart? 
We can remember the tasks that were recovered in, say, a set or list and use 
that for iterating over category 2. We exclude every task in that set from 
category 1.

If this works, then relatively long delays are a temporary thing, not the norm.

I don't know whether waitpid(0) reports all terminating processes one by one 
without fail when called successively, or whether it skips some. Above, I 
assume it does. The man page seems silent on the issue.
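
A small sketch of the category 1 loop, in Python rather than C++ and with a hypothetical helper name. It forks a few children and reaps them with blocking waitpid(0, ...). As far as I understand POSIX, a terminated child stays a zombie until some wait call collects it and each successful call returns exactly one child, so successive calls should not skip any; the sketch is consistent with that but does not prove it.

```python
import os


def spawn_and_reap(exit_codes):
    """Fork one child per exit code, then reap them all with blocking waitpid."""
    expected = {}
    for code in exit_codes:
        pid = os.fork()
        if pid == 0:
            os._exit(code)  # child: terminate immediately with this code
        expected[pid] = code

    reaped = {}
    while len(reaped) < len(expected):
        # pid 0 means "any child in the caller's process group";
        # options 0 means block until one of them has terminated.
        pid, status = os.waitpid(0, 0)
        if pid in expected and os.WIFEXITED(status):
            reaped[pid] = os.WEXITSTATUS(status)
    return expected, reaped
```

In the two-category scheme, the set of recovered pids would simply never be passed to this loop, since waitpid cannot see processes that are not children.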



[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-08-04 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085109#comment-14085109
 ] 

Bernd Mathiske commented on MESOS-1199:
---

True, but we don't have to make these syscalls if we use information we already 
have.


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-08-04 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085101#comment-14085101
 ] 

Ian Downes commented on MESOS-1199:
---

It's easy to tell if a pid is a child of the current process:

getppid to get parent pid and compare to own pid.

or:

The wait3() and waitpid() calls will fail and return immediately if:

 [ECHILD]   The process specified by pid does not exist or is not a child 
            of the calling process, or the process group specified by pid 
            does not exist or does not have any member process that is a 
            child of the calling process.
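
The ECHILD behaviour quoted above can be sketched as follows, in Python with a hypothetical helper name. Python surfaces ECHILD as ChildProcessError, and WNOHANG keeps the call from blocking when the pid is a child that is still running. One caveat: if the child has already exited, this call also reaps it, so the status must be kept by the caller.

```python
import os


def try_reap(pid):
    """Non-blocking waitpid that separates the three possible outcomes.

    Returns ("not-a-child", None) on ECHILD, ("running", None) if the child
    has not exited yet, or ("exited", status) with the raw wait status.
    """
    try:
        reaped_pid, status = os.waitpid(pid, os.WNOHANG)
    except ChildProcessError:  # errno ECHILD
        return ("not-a-child", None)
    if reaped_pid == 0:
        return ("running", None)
    return ("exited", status)
```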



[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-08-04 Thread Yifan Gu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14085494#comment-14085494
 ] 

Yifan Gu commented on MESOS-1199:
-

How about using inotify to watch /proc/<pid>?
One concern is that inotify works only on Linux, but there might be 
equivalents on other platforms (enough to make Dropbox work, at least...).


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-08-01 Thread Craig Hansen-Sturm (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14082877#comment-14082877
 ] 

Craig Hansen-Sturm commented on MESOS-1199:
---

https://issues.apache.org/jira/browse/MESOS-1660 captures the immediate 
polling-interval patch, but not the long term fix.


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-07-29 Thread Craig Hansen-Sturm (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14078796#comment-14078796
 ] 

Craig Hansen-Sturm commented on MESOS-1199:
---

We could reuse parts of the implementation of pstree to enumerate the pid 
hierarchy.

An alternative idea would be to use epoll on the file descriptors created 
for Subprocess::IO, or possibly on bread-crumbs which get cleaned up on child 
exit. My understanding is that epoll is highly efficient and implements 
an O(1) select operation.

In any event, before doing anything, I believe it is best to let this be 
profiler-driven. I will be using Xcode/Instruments.
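
A rough sketch of the file-descriptor idea, in Python and with hypothetical helper names; it is not how Subprocess::IO is implemented. The child inherits the write end of a pipe, the parent watches the read end, and the kernel reports EOF/hang-up once every copy of the write end is closed, which happens when the child exits. selectors.DefaultSelector picks epoll on Linux and kqueue on BSD/macOS.

```python
import os
import selectors


def spawn_with_exit_pipe(exit_code):
    """Fork a child that holds the write end of a pipe until it exits."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os.close(read_fd)
        os._exit(exit_code)  # exiting closes the child's copy of write_fd
    os.close(write_fd)  # the parent must drop its copy, or EOF never arrives
    return pid, read_fd


def wait_for_exit(pid, read_fd, timeout=5.0):
    """Sleep until the pipe signals EOF/hang-up, then reap pid. No poll interval."""
    with selectors.DefaultSelector() as sel:
        sel.register(read_fd, selectors.EVENT_READ)
        ready = sel.select(timeout)
    os.close(read_fd)
    if not ready:
        return None  # child still running when the timeout expired
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

One limitation of this approach: any grandchild that inherits the write end keeps the pipe open, so the hang-up would only arrive after the last holder exits.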


[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval

2014-07-28 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14077114#comment-14077114
 ] 

Bernd Mathiske commented on MESOS-1199:
---

Idea: 

1. Iterate over the pids of interest, calling kill(pid, 0) on each pid. This 
returns immediately and reports if the process is alive. 
2. Wait for a small timeout (100ms?)
3. Repeat.

This way, the wait time is a small constant plus n times the overhead of kill().
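
The three steps above can be sketched as follows, in Python with hypothetical helper names. Signal 0 performs the existence and permission checks without delivering a signal. Two caveats: a child that has exited but has not been reaped is a zombie and still counts as existing for kill(pid, 0), and a recycled pid would also look alive.

```python
import os
import time


def is_alive(pid):
    """Probe pid with signal 0: nothing is delivered, only existence is checked."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:  # ESRCH: no such process
        return False
    except PermissionError:  # EPERM: the process exists but belongs to someone else
        return True
    return True


def poll_until_gone(pids, interval=0.1, max_rounds=50):
    """Repeat the probe over all pids, sleeping `interval` seconds between rounds.

    Returns the set of pids still alive when max_rounds is exhausted.
    """
    remaining = set(pids)
    for _ in range(max_rounds):
        remaining = {pid for pid in remaining if is_alive(pid)}
        if not remaining:
            break
        time.sleep(interval)
    return remaining
```

Each round costs one kill() call per pid, so the notification delay is bounded by the interval plus n times that overhead, as described.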
