[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628517#comment-14628517 ]

Yan Xu commented on MESOS-999:
------------------------------

I am looking at this in two ways:

1. The original {{--executor_registration_timeout}} was added because of the potentially long delay caused by fetching. However, with the new launch timeout + executor registration timeout split, the fetching delay and the provisioning delay are lumped into the launch delay, and the registration timeout becomes not very useful because registration should be fairly quick. In fact, a similar timeout, {{const Duration EXECUTOR_REREGISTER_TIMEOUT = Seconds(2);}}, is not even exposed by a flag. So instead of creating finer-grained timeouts, we are effectively replacing one with another.

2. End-to-end timeout vs. multiple fine-grained ones. Multiple timeouts add complexity in operation (they need to be configured separately) and in implementation (more states may be needed to implement them properly), but there is only one reason for them, which is, AFAIC, to prevent a task from being stuck for too long before it transitions into the RUNNING state (so a framework can reschedule it elsewhere). In this sense, one coarse end-to-end timeout is all we need.

Can you provide examples of when the operator would find it useful to specifically configure timeouts for different stages?

Slave should wait() and start executor registration timeout after launch
------------------------------------------------------------------------

Key: MESOS-999
URL: https://issues.apache.org/jira/browse/MESOS-999
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Priority: Minor

The current code will start launching a container and wait on it before the launch is complete. We should do this only after the container has successfully launched. Likewise for the executor registration timeout.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
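To make the comparison in the comment above concrete, here is a minimal sketch of the two policies: one coarse end-to-end timeout versus a launch timeout plus a separate registration timeout. It is not Mesos code. {{StageDurations}}, {{exceedsEndToEnd}} and {{exceedsSplit}} are hypothetical names, and the stage breakdown (fetch, provision, fork, registration) is an assumption made for illustration.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical per-stage durations for bringing up one executor.
// The stage names are illustrative and do not come from the Mesos code base.
struct StageDurations {
  Millis fetch;         // Fetching executor artifacts.
  Millis provision;     // Pulling and preparing a container image.
  Millis fork;          // Forking the executor process.
  Millis registration;  // Executor registering with the slave.
};

// Single coarse timeout: all time spent before registration completes counts.
inline bool exceedsEndToEnd(const StageDurations& d, Millis limit) {
  assert(limit.count() >= 0);
  return d.fetch + d.provision + d.fork + d.registration > limit;
}

// Split timeouts: fetching and provisioning are lumped into the launch
// stage, and registration is checked against its own limit.
inline bool exceedsSplit(const StageDurations& d,
                         Millis launchLimit,
                         Millis registrationLimit) {
  assert(launchLimit.count() >= 0 && registrationLimit.count() >= 0);
  return d.fetch + d.provision + d.fork > launchLimit ||
         d.registration > registrationLimit;
}
```

With a 60 s end-to-end limit and a 58 s + 2 s split, the two policies only disagree when registration itself is slow, which the comment argues should be rare. That is the sense in which the split mostly replaces one timeout with another.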
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600134#comment-14600134 ]

Yan Xu commented on MESOS-999:
------------------------------

So we ended up not taking this on.

Our motivation for addressing this was that, with Docker and other new containerization efforts such as MESOS-2386, a considerable amount of time can be spent on pulling and preparing the container images before the executor is launched, so we don't want the slave to kill it due to a small {{--executor_registration_timeout}}.

While implementing this we realized that adding a slave-level launch timeout flag doesn't solve the fundamental problem of the long image preparation having a noticeable impact on the slave and its tasks. We should instead work towards a solution that minimizes such impact. The {{--executor_registration_timeout}} flag was originally introduced to account for the time required to fetch the executor, so for now giving it a large value is the reasonable approach for tasks that use container images.

Ultimately only the task knows what the expected preparation time is, so such timeouts should probably go into ExecutorInfo or TaskInfo.

I feel like this ticket as it is currently phrased could be closed as {{Won't Fix}}. Do you agree, [~idownes]?

Key: MESOS-999
URL: https://issues.apache.org/jira/browse/MESOS-999
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Yan Xu
Priority: Minor
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588765#comment-14588765 ]

Yan Xu commented on MESOS-999:
------------------------------

[~idownes]: In [~nsuneja]'s reviews there is a new flag, {{--executor_launch_timeout}}, to guard against the launcher taking forever to prepare the executor. I think this is an appropriate approach: even though this timeout feels like something specific to each container / provisioner, a single upper bound that is configurable by the cluster operator seems sufficient.

[~nsuneja], would you like to revive your reviews? Otherwise I can take it over and push it forward.

Key: MESOS-999
URL: https://issues.apache.org/jira/browse/MESOS-999
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Yan Xu
Priority: Minor
Labels: twitter
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579561#comment-14579561 ]

Yan Xu commented on MESOS-999:
------------------------------

Wasn't aware of this. Thanks [~idownes].
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579541#comment-14579541 ]

Ian Downes commented on MESOS-999:
----------------------------------

Some work was already done on this: https://reviews.apache.org/r/29720/
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578018#comment-14578018 ]

Ian Downes commented on MESOS-999:
----------------------------------

cc [~xujyan]

Key: MESOS-999
URL: https://issues.apache.org/jira/browse/MESOS-999
Project: Mesos
Issue Type: Bug
Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Nishant Suneja
Priority: Minor
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258774#comment-14258774 ]

Nishant Suneja commented on MESOS-999:
--------------------------------------

Ok. So, if I understand the problem statement correctly, we want to start the executor registration timeout timer only after the executor process has forked off successfully from the slave.

So, the plan is to leverage the onReady() callback of the Future instance associated with the executor process, and start the timer ONLY on receiving this callback. This should ensure that the registration timer starts only after successful forking.
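The plan above can be sketched roughly as follows. This is a simplified model using only the C++ standard library, not the actual slave code or the libprocess Future API: {{ExecutorTimers}}, {{onLaunched}}, {{onRegistered}} and {{registrationExpired}} are hypothetical names, and time is passed in explicitly so the behaviour is deterministic. {{onLaunched}} stands in for whatever the launch future's ready callback would invoke.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Hypothetical bookkeeping for one executor. The registration deadline is
// left unset until the launch has completed, so time spent launching never
// counts against the registration timeout.
class ExecutorTimers {
public:
  explicit ExecutorTimers(Millis registrationTimeout)
    : registrationTimeout_(registrationTimeout) {
    assert(registrationTimeout.count() >= 0);
  }

  // Stand-in for the launch future's ready callback: start the timer now.
  void onLaunched(Millis now) {
    launched_ = true;
    deadline_ = now + registrationTimeout_;
  }

  // The executor registered; the timeout no longer applies.
  void onRegistered() { registered_ = true; }

  // True if the executor should be destroyed for not registering in time.
  bool registrationExpired(Millis now) const {
    return launched_ && !registered_ && now >= deadline_;
  }

private:
  Millis registrationTimeout_;
  Millis deadline_{0};      // Only meaningful once launched_ is true.
  bool launched_ = false;
  bool registered_ = false;
};
```

With a 1 s timeout and a launch that completes at t = 5 s, {{registrationExpired}} stays false throughout the launch and only turns true at t = 6 s if the executor has not registered by then.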
[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch
[ https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258777#comment-14258777 ]

Nishant Suneja commented on MESOS-999:
--------------------------------------

As for the test plan, I would ideally want to somehow delay the launch of the executor process by >= registration_timeout. With the current code, this would lead to destruction of the container. With this fix, we should not see the container getting destroyed, because our timer starts only after the successful launch of the container.

I have the fix ready. Have to write a test case now.
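The expected outcome of that test plan can be worked through with a small model. This is only a sketch under stated assumptions, not a Mesos test: {{destroyedBeforeRegistration}} is a hypothetical helper, time is simulated rather than measured, and the container is assumed to be destroyed when the timer reaches the timeout before registration completes.

```cpp
#include <cassert>
#include <chrono>

using Millis = std::chrono::milliseconds;

// Returns true if the container would be destroyed before the executor
// registers. All times are offsets from the moment the launch is requested.
//   launchDelay:           how long the launch takes to complete
//   registerDelay:         how long registration takes after the launch
//   timeout:               the executor registration timeout
//   startTimerAfterLaunch: false models the behaviour described in the
//                          ticket, true models the proposed fix
inline bool destroyedBeforeRegistration(Millis launchDelay,
                                        Millis registerDelay,
                                        Millis timeout,
                                        bool startTimerAfterLaunch) {
  assert(launchDelay.count() >= 0 && registerDelay.count() >= 0);
  const Millis timerStart = startTimerAfterLaunch ? launchDelay : Millis(0);
  const Millis registeredAt = launchDelay + registerDelay;
  return registeredAt - timerStart >= timeout;
}
```

For a launch delayed by 2 s, a 100 ms registration, and a 1 s timeout, the model reports the container destroyed when the timer starts at launch request, and not destroyed when the timer starts after the launch completes.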