Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-16 Thread Niklas Nielsen
Okay - that only solves half of the problem for us: users will still see their frameworks as running even though they completed, but it is a first step. Let's continue the discussion in a JIRA ticket; I'll create one shortly. Thanks for helping out! Niklas On 15 September 2014 18:17, Benjamin

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-15 Thread Benjamin Mahler
To ensure that the architecture of Mesos remains a scalable one, we want to persist state in the leaves of the system as much as possible. This is why the master has never persisted tasks, task states, or status updates. Note that status updates can contain arbitrarily large amounts of data at the

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-15 Thread Niklas Nielsen
Thanks for your input, Ben! (Comments inlined) On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com wrote: To ensure that the architecture of Mesos remains a scalable one, we want to persist state in the leaves of the system as much as possible. This is why the master has

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-15 Thread Benjamin Mahler
On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen nik...@mesosphere.io wrote: Thanks for your input, Ben! (Comments inlined) On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com wrote: To ensure that the architecture of Mesos remains a scalable one, we want to persist

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-11 Thread Vinod Kone
The semantics of these changes would have an impact on the upcoming task reconciliation. @BenM: Can you chime in here on how this fits into the task reconciliation work that you've been leading? On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon a...@mesosphere.io wrote: I agree with Niklas that

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-11 Thread Adam Bordelon
Definitely relevant. If the master could be trusted to persist all the task status updates, then they could be queued up at the master instead of the slave once the master has acknowledged their receipt. Then the master could have the most up-to-date task state and could recover the resources as soon

Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Niklas Nielsen
Hi guys, We have run into a problem that causes tasks which complete while a framework is disconnected and has a fail-over time to remain in a running state even though the tasks have actually finished. Here is a test framework with which we have been able to reproduce the issue:

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Niklas Nielsen
Here is the log of a mesos-local instance where I reproduced it: https://gist.github.com/nqn/f7ee20601199d70787c0 (Here tasks 10 to 19 are stuck in the running state). There is a lot of output, so here is a filtered log for task 10: https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d At first glance, it

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Vinod Kone
What you observed is expected because of the way the slave (specifically, the status update manager) operates. The status update manager only sends the next update for a task if a previous update (if it exists) has been acked. In your case, since TASK_RUNNING was not acked by the framework,
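
The behaviour Vinod describes, one unacknowledged update per task at a time, can be illustrated with a small sketch. This is not the Mesos status update manager; the class `TaskUpdateStream` and its methods are hypothetical names used only to show why a terminal update stays hidden while an earlier update is still waiting for an ack.

```python
from collections import deque


class TaskUpdateStream:
    """Hypothetical per-task queue: at most one update is in flight."""

    def __init__(self):
        self.pending = deque()

    def enqueue(self, state):
        """Queue an update; return it if it should be sent now, else None."""
        self.pending.append(state)
        return state if len(self.pending) == 1 else None

    def acknowledge(self, state):
        """Drop the acked front update; return the next one to send, if any."""
        if not self.pending or self.pending[0] != state:
            return None  # ignore acks that do not match the in-flight update
        self.pending.popleft()
        return self.pending[0] if self.pending else None

    def in_flight(self):
        """The only update the master/framework can currently observe."""
        return self.pending[0] if self.pending else None


stream = TaskUpdateStream()
first = stream.enqueue("TASK_RUNNING")     # sent immediately
second = stream.enqueue("TASK_FINISHED")   # held back: TASK_RUNNING not acked
visible_before_ack = stream.in_flight()    # still TASK_RUNNING
next_update = stream.acknowledge("TASK_RUNNING")  # now TASK_FINISHED goes out
```

While the framework is disconnected no ack arrives, so in this model the queue never advances past `TASK_RUNNING`, which matches the symptom reported in the thread.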

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Vinod Kone
The main reason is to keep the status update manager simple. Also, it is very easy to enforce the order of updates to the master/framework in this model. If we allow multiple updates for a task to be in flight, it's really hard (impossible?) to ensure that we are not delivering out-of-order updates
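
The ordering concern can be shown with a toy simulation. The failure model here (a set of send attempts that are dropped in transit) and the helper `deliver` are assumptions made for illustration; they say nothing about how Mesos actually transports updates.

```python
def deliver(sends, lost):
    """Return the updates a receiver sees, in arrival order.

    `sends` is the sequence of send attempts; `lost` holds the indices of
    attempts dropped in transit (a made-up failure model).
    """
    return [update for i, update in enumerate(sends) if i not in lost]


# Multiple updates in flight: TASK_RUNNING is sent, TASK_FINISHED is sent
# without waiting for an ack, then TASK_RUNNING is retried because the
# first attempt was lost.
pipelined = deliver(["TASK_RUNNING", "TASK_FINISHED", "TASK_RUNNING"], lost={0})

# One update in flight: TASK_FINISHED is only sent once TASK_RUNNING has
# been delivered, so the retry of TASK_RUNNING comes first.
serialized = deliver(["TASK_RUNNING", "TASK_RUNNING", "TASK_FINISHED"], lost={0})
```

In the pipelined case the receiver sees the terminal update before the retried `TASK_RUNNING`; in the serialized case the order is preserved at the cost of holding back later updates.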

Re: Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-10 Thread Adam Bordelon
I agree with Niklas that if the executor has sent a terminal status update to the slave, then the task is done and the master should be able to recover those resources. Only sending the oldest status update to the master, especially in the case of framework failover, prevents these resources from