Okay - that only solves half of the problem for us: users will still see
their frameworks as running even though they completed, but it is a first
step.
Let's continue the discussion in a JIRA ticket; I'll create one shortly.
Thanks for helping out!
Niklas
On 15 September 2014 18:17, Benjamin
To ensure that the architecture of mesos remains a scalable one, we want to
persist state in the leaves of the system as much as possible. This is why
the master has never persisted tasks, task states, or status updates. Note
that status updates can contain arbitrarily large amounts of data at the
Thanks for your input Ben! (Comments inlined)
On 15 September 2014 12:35, Benjamin Mahler benjamin.mah...@gmail.com wrote:
On Mon, Sep 15, 2014 at 3:11 PM, Niklas Nielsen nik...@mesosphere.io
wrote:
The semantics of these changes would have an impact on the upcoming task
reconciliation.
@BenM: Can you chime in here on how this fits into the task reconciliation
work that you've been leading?
On Wed, Sep 10, 2014 at 6:12 PM, Adam Bordelon a...@mesosphere.io wrote:
I agree with Niklas that
Definitely relevant. If the master could be trusted to persist all the task
status updates, then they could be queued up at the master instead of the
slave once the master has acknowledged its receipt. Then the master could
have the most up-to-date task state and can recover the resources as soon
as a terminal update arrives.
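The handoff Ben describes could be sketched in a few lines of Python. This is purely illustrative: MasterUpdateStore and its methods are made-up names for this email, not the Mesos API, and the write-ahead log stands in for whatever replicated storage the master would actually use.

```python
import json

class MasterUpdateStore:
    """Toy model: the master persists each status update before
    acknowledging the slave, so the slave can drop it and the
    master takes over delivery to the framework."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pending = []  # persisted updates not yet acked by the framework

    def receive_from_slave(self, update):
        # Persist first, then ack: the update must survive a master
        # failover before the slave is allowed to forget it.
        with open(self.log_path, "a") as log:
            log.write(json.dumps(update) + "\n")
        self.pending.append(update)
        return "ACK"  # the slave may now drop this update

    def latest_state(self, task_id):
        # With the full stream at the master, it always knows the
        # most recent state of each task, even mid-failover.
        states = [u["state"] for u in self.pending if u["task_id"] == task_id]
        return states[-1] if states else None
```

The point of the persist-before-ack ordering is that the slave only forgets an update once the master is guaranteed not to lose it.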
Hi guys,
We have run into a problem that causes tasks which complete while a
framework is disconnected (and has a fail-over time) to remain in a running
state even though the tasks actually finished.
Here is a test framework we have been able to reproduce the issue with:
Here is the log of a mesos-local instance where I reproduced it:
https://gist.github.com/nqn/f7ee20601199d70787c0 (here tasks 10 to 19 are
stuck in the running state).
There is a lot of output, so here is a filtered log for task 10:
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d
At first glance, it
What you observed is expected because of the way the slave (specifically,
the status update manager) operates.
The status update manager only sends the next update for a task if a
previous update (if it exists) has been acked.
In your case, since TASK_RUNNING was not acked by the framework, the
terminal update (TASK_FINISHED) is held back by the slave.
The main reason is to keep status update manager simple. Also, it is very
easy to enforce the order of updates to the master/framework in this model.
If we allow multiple updates for a task to be in flight, it's really hard
(impossible?) to ensure that we are not delivering out-of-order updates.
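The ack-gated behavior described above can be sketched as a toy model (this is not the actual status update manager code; StatusUpdateStream is an illustrative name):

```python
from collections import deque

class StatusUpdateStream:
    """Per-task stream of status updates: only one update is in
    flight at a time, and the next update is sent only after the
    previous one has been acknowledged. This trivially preserves
    ordering, at the cost of stalling behind an unacked update."""

    def __init__(self):
        self.pending = deque()   # updates waiting to be sent
        self.in_flight = None    # update sent but not yet acked

    def enqueue(self, update):
        self.pending.append(update)
        return self._maybe_send()

    def acknowledge(self):
        # Framework acked the in-flight update; unblock the next one.
        self.in_flight = None
        return self._maybe_send()

    def _maybe_send(self):
        # Send the oldest pending update iff nothing is in flight.
        if self.in_flight is None and self.pending:
            self.in_flight = self.pending.popleft()
            return self.in_flight  # "sent" toward the master/framework
        return None

stream = StatusUpdateStream()
stream.enqueue("TASK_RUNNING")   # sent immediately
stream.enqueue("TASK_FINISHED")  # queued: TASK_RUNNING not yet acked
# Until acknowledge() is called, TASK_FINISHED never leaves the slave,
# which is exactly why a disconnected framework leaves the task
# looking RUNNING.
```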
I agree with Niklas that if the executor has sent a terminal status update
to the slave, then the task is done and the master should be able to
recover those resources. Only sending the oldest status update to the
master, especially in the case of framework failover, prevents these
resources from being recovered in a timely manner.
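As a toy model of Adam's point (hypothetical names, not Mesos code; the set of terminal states is my assumption), recovering resources on the terminal update itself, independent of framework acknowledgements, might look like:

```python
# Assumed terminal states for this sketch.
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}

class MasterAllocator:
    """Toy model: once a terminal update arrives from the executor
    via the slave, the task's resources are recovered immediately,
    even if the framework is disconnected and has acked nothing."""

    def __init__(self):
        self.task_state = {}  # task_id -> latest known state
        self.allocated = {}   # task_id -> cpus held by the task
        self.free_cpus = 0.0  # recovered, available for new offers

    def launch(self, task_id, cpus):
        self.task_state[task_id] = "TASK_STAGING"
        self.allocated[task_id] = cpus

    def on_status_update(self, task_id, state):
        self.task_state[task_id] = state
        if state in TERMINAL_STATES and task_id in self.allocated:
            # Recover resources on the terminal update itself,
            # without waiting for a framework acknowledgement.
            self.free_cpus += self.allocated.pop(task_id)
```

In this model a framework that fails over still sees the correct terminal state on reconnect, but its resources are not held hostage in the meantime.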