[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14325794#comment-14325794 ]

James DeFelice commented on MESOS-1571:

In the kubernetes-mesos framework, the executor Shutdown() implementation currently force-stops the containers it's managing (which, to my understanding, sends SIGKILL). It manages Docker containers, which are normally given 10s to shut down gracefully before Docker sends a SIGKILL. That 10s timeout is not compatible with the default value (3s) of the Mesos slave flag `executor_shutdown_grace_timeout`. However, if I change that timeout to 20s to give the executor more time to kill things gracefully, there's no way for the executor to reason about it, because it has no idea how much time it actually has.

As a workaround I've considered looking up the slave PID from the environment, querying its state.json for the startup flags, and trying to make a decision based on that. That approach seems somewhat hackish and I'd much rather do something nicer.

It would be great to have an environment variable, `MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD` or something similar, provided by the slave containerizer, so that the executor can decide whether to send (via Docker) a TERM signal (and wait 10s) or a KILL signal.

Signal escalation timeout is not configurable
Key: MESOS-1571
URL: https://issues.apache.org/jira/browse/MESOS-1571
Project: Mesos
Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
Assignee: Alexander Rukletsov

Even though the executor shutdown grace period is set to a larger interval, the signal escalation timeout will still be 3 seconds. It should either be configurable or dependent on EXECUTOR_SHUTDOWN_GRACE_PERIOD. Thoughts?

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
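The decision described in the comment above could be sketched roughly as follows. This is only an illustration, not kubernetes-mesos code: `MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD` is the variable name proposed in the comment, its value is assumed here to be a plain number of seconds, and `choose_stop_signal` and the safety margin are hypothetical.

```python
import os

# Docker's usual wait between TERM and KILL, as stated in the comment.
DOCKER_STOP_TIMEOUT_SECS = 10.0


def choose_stop_signal(environ=os.environ, margin_secs=1.0):
    """Return "TERM" if the grace period leaves room for a graceful
    Docker stop, otherwise "KILL".

    Assumption: the (proposed, hypothetical) variable holds seconds.
    """
    raw = environ.get("MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD")
    if raw is None:
        # No information from the slave: keep the force-stop behaviour.
        return "KILL"
    try:
        grace_secs = float(raw)
    except ValueError:
        # Unparseable value: be conservative.
        return "KILL"
    if grace_secs >= DOCKER_STOP_TIMEOUT_SECS + margin_secs:
        return "TERM"
    return "KILL"
```

With a 20s grace period the sketch picks TERM; with the 3s default, or with no variable at all, it falls back to KILL.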
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14269345#comment-14269345 ]

Alexander Rukletsov commented on MESOS-1571:

Commit: c8a7aff24fbd2c6ee2e6daadf4ad78f79a5e9cf6 [c8a7aff]
Author: Alexander Rukletsov a...@mesosphere.io
Committer: Niklas Q. Nielsen nik...@mesosphere.io
Commit Date: 8 Jan 2015 14:29:31 GMT+1

Commit: aae5bfd07c0c9407453a7c38f27785e648b2724d [aae5bfd]
Author: Alexander Rukletsov a...@mesosphere.io
Committer: Niklas Q. Nielsen nik...@mesosphere.io
Commit Date: 8 Jan 2015 14:33:12 GMT+1

Commit: f2cf562900195455e4e7fb8a6163b33a6b8aa12d [f2cf562]
Author: Alexander Rukletsov a...@mesosphere.io
Committer: Niklas Q. Nielsen nik...@mesosphere.io
Commit Date: 8 Jan 2015 14:35:52 GMT+1
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218198#comment-14218198 ]

Alexander Rukletsov commented on MESOS-1571:

https://reviews.apache.org/r/28063/
https://reviews.apache.org/r/28065/
https://reviews.apache.org/r/28069/
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14169982#comment-14169982 ]

Alexander Rukletsov commented on MESOS-1571:

In the current review request we pass the timeout value via containerizers. However, in order to implement [https://issues.apache.org/jira/browse/MESOS-1773], a field in the {{CommandInfo}} protobuf is needed. I suggest using this field for the default value as well, and thereby avoiding changes to the containerizers' code.

This can work as follows: in {{Slave::runTask()}}, if the task doesn't have the field {{grace_period}} set, the slave sets it to the default; in the {{executorEnvironment()}} preparation function we extract the {{grace_period}} and set the corresponding environment variable. The review request will follow.
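The two steps proposed in the comment above (default the field in the slave, then export it to the executor's environment) might look like the following sketch. It is a Python stand-in rather than Mesos C++: the `CommandInfo` dataclass only mimics the protobuf, and the 3-second default, the variable name `MESOS_SHUTDOWN_GRACE_PERIOD` and the `"<n>secs"` encoding are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed slave-wide default, for illustration only.
DEFAULT_GRACE_PERIOD_SECS = 3.0


@dataclass
class CommandInfo:
    """Stand-in for the protobuf; only the field discussed here."""
    value: str
    grace_period_secs: Optional[float] = None


def run_task(command: CommandInfo,
             default_secs: float = DEFAULT_GRACE_PERIOD_SECS) -> CommandInfo:
    """Mimics the Slave::runTask() step: fill in the default if unset."""
    if command.grace_period_secs is None:
        command.grace_period_secs = default_secs
    return command


def executor_environment(command: CommandInfo) -> Dict[str, str]:
    """Mimics the executorEnvironment() step: export the grace period.

    The variable name and value format are hypothetical.
    """
    env: Dict[str, str] = {}
    if command.grace_period_secs is not None:
        env["MESOS_SHUTDOWN_GRACE_PERIOD"] = f"{command.grace_period_secs}secs"
    return env
```

Because the default is written into the message itself, the environment step never needs to know where the value came from, which is the point of avoiding containerizer changes.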
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14153209#comment-14153209 ]

Alexander Rukletsov commented on MESOS-1571:

https://reviews.apache.org/r/25434/
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14122858#comment-14122858 ]

Till Toenshoff commented on MESOS-1571:

Using the environment to pass that info seems to fit best when looking at the things we already pass (e.g. {{MESOS_RECOVERY_TIMEOUT}}), whereas the {{SlaveInfo}} protobuf is rather limited in additional execution-specific parameters.

However, to me this still raises the question of why we prefer using the environment instead of protos for such information. One obvious reason certainly is that we might need to supply information that is needed immediately before or after starting the {{ExecutorProcess}}, but definitely before it has successfully registered, which is when {{SlaveInfo}} finally becomes available to it.

Apart from my argument above that we already do it that way, are there better arguments for using the environment rather than the proto to pass the additional parameters?

[~benjaminhindman], [~idownes], [~tnachen] any input for this discussion?
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114119#comment-14114119 ]

Till Toenshoff commented on MESOS-1571:

[~nnielsen] Aye!
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14111816#comment-14111816 ]

Niklas Quarfot Nielsen commented on MESOS-1571:

[~tillt] Would you be up for shepherding this change?

How about having EXECUTOR_SHUTDOWN_TIMEOUT as an upper limit for the per-task configurable timeout?

I think we need to differentiate between two scenarios:

1) killTask() is called. In the command executor, this just calls its own shutdown() and _only_ the escalation in src/launcher/executor.cpp takes effect.

[Sequence diagram, participants Slave, Exec, CommandExecutor: killTask() is passed from Slave to Exec to CommandExecutor; CommandExecutor calls shutdown(); after EXECUTOR_SIGNAL_ESCALATION_TIMEOUT, escalated() fires.]

2) The executor is shut down due to frameworkShutdown. shutdown() is called in src/exec/exec.cpp, which in turn calls shutdown on the underlying executor implementation. That is where we have the nested timeout, including an escalation within the slave (executor_shutdown_grace_period) which calls containerizer->destroy().

[Sequence diagram, participants Slave, Exec, CommandExecutor: shutdown() is passed from Slave to Exec to CommandExecutor. Three nested timers run: flags.shutdown_grace_period in Slave, ending in shutdownExecutorTimeout(); SHUTDOWN_GRACE_PERIOD in Exec, ending in ShutdownProcess kill(); EXECUTOR_SIGNAL_ESCALATION_TIMEOUT in CommandExecutor, ending in escalated().]
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14109392#comment-14109392 ]

Alexander Rukletsov commented on MESOS-1571:

So we have a shutdown timeout on three levels: slave, basic executor (via ExecutorProcess) and, optionally, concrete executor (e.g. CommandExecutor). I would suggest we keep one configurable parameter, EXECUTOR_SHUTDOWN_TIMEOUT, on the basic executor level and calculate the other two using fixed deltas. This parameter can be set via slave command-line parameters and overridden via a protobuf message (TaskInfo?). Any other thoughts?
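A rough sketch of the "one parameter plus fixed deltas" idea from the comment above. The direction of the deltas (escalation shorter than the executor timeout, slave timeout longer) is an assumption based on how the timeouts nest, and `derive_timeouts`, `ShutdownTimeouts` and the 1-second delta are hypothetical names and values.

```python
from typing import NamedTuple, Optional


class ShutdownTimeouts(NamedTuple):
    slave_secs: float       # slave level, outermost
    executor_secs: float    # basic executor level, the configured value
    escalation_secs: float  # concrete executor level, innermost


def derive_timeouts(slave_flag_secs: float,
                    task_override_secs: Optional[float] = None,
                    delta_secs: float = 1.0) -> ShutdownTimeouts:
    """Derive all three timeouts from a single configurable value.

    The slave flag provides the default; a per-task value (for example
    carried in a protobuf message) overrides it when present.
    """
    executor_secs = (task_override_secs
                     if task_override_secs is not None
                     else slave_flag_secs)
    if executor_secs <= delta_secs:
        raise ValueError("timeout must exceed the fixed delta")
    return ShutdownTimeouts(
        slave_secs=executor_secs + delta_secs,
        executor_secs=executor_secs,
        escalation_secs=executor_secs - delta_secs,
    )
```

For example, `derive_timeouts(5.0, task_override_secs=20.0)` keeps the three levels ordered as 19s, 20s and 21s, so an inner timeout always fires before the level that encloses it.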
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14106695#comment-14106695 ]

Alexander Rukletsov commented on MESOS-1571:

Currently there are two parameters that control the graceful shutdown timeout: EXECUTOR_SHUTDOWN_GRACE_PERIOD and EXECUTOR_SIGNAL_ESCALATION_TIMEOUT. The simplified event chain looks like this:
1) Slave sends a ShutdownExecutorMessage to executor
2) Executor tries to finish by sending SIGTERM to the process
3) If the process did not terminate after EXECUTOR_SIGNAL_ESCALATION_TIMEOUT, executor sends SIGKILL to the process
4) If the executor did not terminate after EXECUTOR_SHUTDOWN_GRACE_PERIOD, slave destroys the appropriate containerizer.

My thoughts are:
* The timeouts correlate significantly, which means that setting them separately is error-prone. Currently EXECUTOR_SHUTDOWN_GRACE_PERIOD may be configured. I would propose setting one of them and calculating the other using some [hard-coded?] delta.
* Since we would like to control the timeout not per slave, but per task or framework, it looks like EXECUTOR_SIGNAL_ESCALATION_TIMEOUT should be configurable.
* Do we want to tie the timeout to each task? Or will passing it along with ExecutorInfo or FrameworkInfo suffice?
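Steps 2 and 3 of the event chain above (SIGTERM first, SIGKILL after the escalation timeout) can be illustrated with a small sketch using only the Python standard library. It is not the Mesos command executor, which is written in C++; `stop_with_escalation` and `spawn_sleeper` are hypothetical helpers, and the signal names in the comments apply to POSIX systems.

```python
import subprocess
import sys


def spawn_sleeper():
    """Start a child process that just sleeps, as a stand-in for a task."""
    return subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"])


def stop_with_escalation(proc, escalation_timeout_secs):
    """Ask the process to terminate, then force-kill it if it lingers.

    Returns True if escalation to a forced kill was needed.
    """
    proc.terminate()  # SIGTERM on POSIX
    try:
        proc.wait(timeout=escalation_timeout_secs)
        return False
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        proc.wait()  # reap the child so it does not linger as a zombie
        return True
```

A call such as `stop_with_escalation(spawn_sleeper(), 3.0)` mirrors the 3-second escalation timeout discussed in this issue; step 4, the slave-side grace period, would wrap the whole executor in the same way.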
[jira] [Commented] (MESOS-1571) Signal escalation timeout is not configurable
[ https://issues.apache.org/jira/browse/MESOS-1571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14107013#comment-14107013 ]

Niklas Quarfot Nielsen commented on MESOS-1571:

I can help you out. I think [~tstclair] could be great to shepherd this too.