[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-07-15 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14628517#comment-14628517
 ] 

Yan Xu commented on MESOS-999:
--

I am looking at this in two ways:
1. The original {{--executor_registration_timeout}} was added because of the 
potentially long delay caused by fetching. However, with the new launch timeout + 
executor registration timeout split, the fetching delay and the provisioning 
delay are lumped into the launch delay, and the registration timeout is no longer 
very useful because registration should be fairly quick. In fact, a similar 
timeout, {{const Duration EXECUTOR_REREGISTER_TIMEOUT = Seconds(2);}}, is not even 
exposed by a flag. So instead of creating finer-grained timeouts, we are 
effectively replacing one with another.

2. End-to-end timeout vs. multiple fine-grained ones. Multiple timeouts add 
complexity in operation (they need to be configured separately) and in 
implementation (more states may be needed to implement them properly), but there 
is only one reason for them, which is, AFAIC, to prevent a task from being stuck 
for too long before it transitions into the RUNNING state (so that a framework 
can reschedule it elsewhere). In this sense one coarse end-to-end timeout is all 
we need. Can you provide examples of when an operator would find it useful to 
configure timeouts for different stages specifically?
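
To make the comparison in point 2 concrete, here is a rough Python sketch. It is not Mesos code: the stage names, the helper functions and all of the numbers are made up for illustration. It only shows that per-stage limits need one knob per stage, while a single end-to-end limit bounds the total time before RUNNING.

```python
from typing import Dict, Optional

# Hypothetical stage names; the real slave has its own state machine.
STAGES = ("fetch", "provision", "launch", "register")


def first_timeout_per_stage(durations: Dict[str, float],
                            limits: Dict[str, float]) -> Optional[str]:
    """Return the first stage whose own limit is exceeded, else None."""
    for stage in STAGES:
        if durations[stage] > limits[stage]:
            return stage
    return None


def exceeds_end_to_end(durations: Dict[str, float], limit: float) -> bool:
    """Return True if the total time before RUNNING exceeds one coarse limit."""
    return sum(durations[stage] for stage in STAGES) > limit


# Illustrative numbers only (seconds): a slow image pull, quick registration.
durations = {"fetch": 20.0, "provision": 240.0, "launch": 5.0, "register": 1.0}
per_stage_limits = {"fetch": 60.0, "provision": 120.0,
                    "launch": 30.0, "register": 60.0}

print(first_timeout_per_stage(durations, per_stage_limits))
print(exceeds_end_to_end(durations, limit=300.0))
```

With these made-up numbers the per-stage configuration kills the task during provisioning, although the task would still have reached RUNNING within the single 300-second bound.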


 Slave should wait() and start executor registration timeout after launch 
 -

 Key: MESOS-999
 URL: https://issues.apache.org/jira/browse/MESOS-999
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Priority: Minor

 The current code will start launching a container and wait on it before the 
 launch is complete. We should do this only after the container has 
 successfully launched. Likewise for the executor registration timeout.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-06-24 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14600134#comment-14600134
 ] 

Yan Xu commented on MESOS-999:
--

So we ended up not taking this on.
Our motivation for addressing this was that, with docker and other new 
containerization efforts such as MESOS-2386, a considerable amount of time 
can be spent on pulling and preparing the container images before the executor 
is launched, so we don't want the slave to kill it because of a small 
{{--executor_registration_timeout}}.
While implementing this we realized that adding a slave-level launch timeout 
flag doesn't solve the fundamental problem of long image preparation having 
a noticeable impact on the slave and its tasks. We should instead work 
towards a solution that minimizes such impact.
The {{--executor_registration_timeout}} flag was originally introduced to 
account for the time required to fetch the executor, so for now giving it a 
large value is the reasonable approach for tasks that use container images.
Ultimately only the task knows what the expected preparation time is, so such 
timeouts should probably go into ExecutorInfo or TaskInfo.
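
As a rough illustration of the last point, the Python sketch below shows a per-task value overriding the slave-wide flag. Everything in it is an assumption: `ExecutorInfoSketch` and its `registration_timeout_secs` field are hypothetical stand-ins and do not exist in the real ExecutorInfo, and the 60-second flag value is only an example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExecutorInfoSketch:
    """Stand-in for ExecutorInfo; `registration_timeout_secs` is hypothetical."""
    name: str
    registration_timeout_secs: Optional[float] = None


def effective_timeout(info: ExecutorInfoSketch, slave_flag_secs: float) -> float:
    """Prefer the task-supplied value, else fall back to the slave-wide flag."""
    if info.registration_timeout_secs is not None:
        return info.registration_timeout_secs
    return slave_flag_secs


# Example value standing in for --executor_registration_timeout.
SLAVE_FLAG_SECS = 60.0

plain = ExecutorInfoSketch(name="plain-executor")
image = ExecutorInfoSketch(name="image-executor", registration_timeout_secs=600.0)

print(effective_timeout(plain, SLAVE_FLAG_SECS))
print(effective_timeout(image, SLAVE_FLAG_SECS))
```

The fallback keeps existing tasks on the operator's setting, while a task that knows its image takes a long time to prepare can ask for more.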

I feel like this ticket as it is currently phrased could be closed as {{Won't 
Fix}}. Do you agree [~idownes]?

 Slave should wait() and start executor registration timeout after launch 
 -

 Key: MESOS-999
 URL: https://issues.apache.org/jira/browse/MESOS-999
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Yan Xu
Priority: Minor



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-06-16 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588765#comment-14588765
 ] 

Yan Xu commented on MESOS-999:
--

[~idownes]: In [~nsuneja]'s reviews there is a new flag, 
{{--executor_launch_timeout}}, to guard against the launcher taking forever to 
prepare the executor. I think this is an appropriate approach: even though this 
timeout feels like something specific to each container / provisioner, a single 
upper bound that is configurable by the cluster operator seems sufficient.

[~nsuneja] would you like to revive your reviews? Otherwise I can take it over 
and push it forward.

 Slave should wait() and start executor registration timeout after launch 
 -

 Key: MESOS-999
 URL: https://issues.apache.org/jira/browse/MESOS-999
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Yan Xu
Priority: Minor
  Labels: twitter



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-06-09 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579561#comment-14579561
 ] 

Yan Xu commented on MESOS-999:
--

Wasn't aware of this. Thanks [~idownes].



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-06-09 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14579541#comment-14579541
 ] 

Ian Downes commented on MESOS-999:
--

Some work was already done on this: https://reviews.apache.org/r/29720/



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2015-06-08 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578018#comment-14578018
 ] 

Ian Downes commented on MESOS-999:
--

cc [~xujyan]

 Slave should wait() and start executor registration timeout after launch 
 -

 Key: MESOS-999
 URL: https://issues.apache.org/jira/browse/MESOS-999
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.18.0
Reporter: Ian Downes
Assignee: Nishant Suneja
Priority: Minor



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2014-12-25 Thread Nishant Suneja (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258774#comment-14258774
 ] 

Nishant Suneja commented on MESOS-999:
--

Ok. So, if I understand the problem statement correctly, we want to start the 
executor registration timeout timer only after the executor process has forked 
off successfully from the slave.

So, the plan is to leverage the onReady() callback of the Future instance 
associated with the executor process, and start the timer ONLY on receiving 
this callback. This should ensure that the registration timer starts only after 
successful forking.
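
The shape of that plan can be sketched in Python using the standard `concurrent.futures.Future` as a stand-in for the libprocess Future. This is an analogy, not the actual patch: `SlaveSketch`, its methods and the event strings are hypothetical, and Python's `add_done_callback` fires on failure too, so the callback has to check for success itself to behave like onReady().

```python
from concurrent.futures import Future
from typing import List


class SlaveSketch:
    """Hypothetical slave that records what it does instead of using real timers."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def start_registration_timer(self) -> None:
        self.events.append("registration timer started")

    def launch_executor(self) -> Future:
        launched: Future = Future()

        def on_done(f: Future) -> None:
            # Only a successful launch (the onReady() case) starts the timer.
            if f.cancelled() or f.exception() is not None:
                self.events.append("launch failed, no timer")
                return
            self.start_registration_timer()

        launched.add_done_callback(on_done)
        self.events.append("launch requested")
        return launched


slave = SlaveSketch()
launch = slave.launch_executor()
print(slave.events)      # the timer has not been started yet
launch.set_result(True)  # the launcher reports that the fork succeeded
print(slave.events)
```

Until the future completes, only the launch request is recorded; the timer entry appears after `set_result` is called.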



[jira] [Commented] (MESOS-999) Slave should wait() and start executor registration timeout after launch

2014-12-25 Thread Nishant Suneja (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258777#comment-14258777
 ] 

Nishant Suneja commented on MESOS-999:
--

As for the test plan, I would ideally want to somehow delay the launch of the 
executor process by >= registration_timeout. With the current code, this would 
lead to destruction of the container. 
But with this fix we should not see the container getting destroyed, because 
our timer starts only after the successful launch of the container.
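
The expectation can be written down as a small simulated-clock check in Python. This is only a model of the timing argument, not the actual Mesos test: the function name and all of the numbers are assumptions chosen for illustration.

```python
def destroyed_by_timeout(launch_delay: float, registration_delay: float,
                         timeout: float, timer_starts_after_launch: bool) -> bool:
    """Return True if the registration timer fires before the executor registers.

    All times are seconds on a simulated clock that starts when the slave
    requests the launch.
    """
    registered_at = launch_delay + registration_delay
    timer_started_at = launch_delay if timer_starts_after_launch else 0.0
    return timer_started_at + timeout < registered_at


TIMEOUT = 60.0       # assumed registration timeout
LAUNCH_DELAY = 60.0  # launch delayed by >= the registration timeout
REGISTRATION = 1.0   # registration itself is quick

# Current behaviour: the timer starts when the launch is requested.
print(destroyed_by_timeout(LAUNCH_DELAY, REGISTRATION, TIMEOUT, False))
# With the fix: the timer starts only after the launch has succeeded.
print(destroyed_by_timeout(LAUNCH_DELAY, REGISTRATION, TIMEOUT, True))
```

In this model the first call reports that the container is destroyed and the second reports that it survives, which is the difference the test case should observe.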

I have the fix ready. Have to write a test case now. 
