Hi,

As a Mesos operator, I am really surprised by this proposal.

The main advantage of the proposed design is that we can finally take nodes
down for maintenance with a configurable kill grace period and a proper
task status (with maintenance primitives, it was TASK_LOST I think) without
any specific cooperation from the frameworks.

I think that this could be just an evolution of the current primitives.

With the new proposal, SLA-aware maintenance is going to be as difficult as
before, because it will still need cooperation from the frameworks, and we
know this is rarely a priority for them. We will also lose the ability to
signal future maintenance in order to optimize allocations.

For example, I had the idea of improving the allocator (or writing a custom
one) so that it offers resources from agents with no planned maintenance
first, and then from the remaining agents sorted by maintenance date in
decreasing order. This would be a big improvement in preventing cluster
reboots from triggering too many task restarts. This will not be possible
with the new primitives. The same idea applies to frameworks too.
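
To make the ordering concrete, here is a minimal sketch in Python. It is
not Mesos allocator code: the Agent type, its maintenance_start field and
the offer_order function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Agent:
    # Hypothetical stand-in for an agent known to the allocator.
    agent_id: str
    # Start of the planned maintenance window, or None if nothing is planned.
    maintenance_start: Optional[datetime] = None


def offer_order(agents: List[Agent]) -> List[Agent]:
    """Return agents in the order their resources would be offered."""
    # Agents with no planned maintenance come first, in their original order.
    unplanned = [a for a in agents if a.maintenance_start is None]
    # The rest are sorted by maintenance date, furthest in the future first.
    planned = sorted(
        (a for a in agents if a.maintenance_start is not None),
        key=lambda a: a.maintenance_start,
        reverse=True,
    )
    return unplanned + planned
```

With this ordering, an agent due for maintenance tomorrow is the last one
to receive new tasks, so fewer tasks have to be restarted when it goes down.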

Maxime

On Thu, May 30, 2019 at 22:16, Joseph Wu <[email protected]> wrote:

> As far as I can tell, the document is public.
>
> On Thu, May 30, 2019 at 12:22 AM Marc Roos <[email protected]>
> wrote:
>
>>
>> Is the doc not public?
>>
>>
>> -----Original Message-----
>> From: Joseph Wu [mailto:[email protected]]
>> Sent: Thursday, May 30, 2019 2:07
>> To: dev; user
>> Subject: Design doc: Agent draining and deprecation of maintenance
>> primitives
>>
>> Hi all,
>>
>> A few years back, we added some constructs called maintenance primitives
>> to Mesos.  This feature was meant to allow operators and frameworks to
>> cooperate in draining tasks off nodes scheduled for maintenance.  As far
>> as we've observed since, this feature never achieved enough adoption to
>> be useful for operators.
>>
>> As such, we are proposing a more opinionated approach for draining
>> tasks.  The goal is to have Mesos perform draining in lieu of
>> frameworks, minimizing or eliminating the need to change frameworks to
>> account for draining.  We will also be simplifying the operator
>> workflow, which would only require a single call (holding an AgentID) to
>> start draining; and a single call to bring an agent back into the
>> cluster.
>>
>>
>> Due to how closely this proposed feature overlaps with maintenance
>> primitives, we will be deprecating maintenance primitives upon
>> implementation of agent draining.
>>
>>
>> If interested, please take a look at the design document:
>>
>>
>> https://docs.google.com/document/d/1w3O80NFE6m52XNMv7EdXSO-1NebEs8opA8VZPG1tW0Y/
>>
>>
>>
