Hi Andrei,

Thanks for the heads-up. A few questions:


> On Jun 21, 2019, at 06:50, Andrei Sekretenko <asekrete...@mesosphere.com> 
> wrote:
> 
> 
> Hi all,
> 
> we are intending to change the behavior of the suppressOffers() method of 
> MesosSchedulerDriver with regard to the transparent re-registration.
> 
> Currently, when the driver becomes disconnected from a master, it re-registers 
> on its own with an empty set of suppressed roles. This un-suppresses all of 
> the framework's suppressed roles.
> 
> The plan is to alter this behavior into preserving the suppression state on 
> this re-registration.
> 
> The required set of suppressed roles will be stored in the driver, which will 
> now re-register with this set (instead of an empty one) and update the stored 
> set whenever it performs a call that modifies the suppression state of the 
> roles in the allocator. 
> Currently, the driver has two methods that perform such calls: 
> suppressOffers() and reviveOffers().
> 
> Please feel free to raise any concerns or objections - especially if you are 
> aware of any V0 frameworks which (probably implicitly) depend on 
> un-suppression of the roles when this re-registration occurs.
> 
> 
> 
> Note that: 
>  - Frameworks which do not call suppressOffers() are, obviously, unaffected 
> by this change. 
> 
>  - Frameworks that reliably prevent transparent re-registration (for example, 
> by calling driver.abort() immediately from the disconnected() callback) 
> should also be unaffected.

I presume driver.stop(true) works as well? Marathon does this, and I believe 
the behavior is to crash so a new Marathon leader can establish a new 
connection to Mesos, which will set the appropriate suppress/revive state.

>  - Storing the suppressed roles list for re-registration and clearing it in 
> reviveOffers() do not change anything for the existing frameworks. It is 
> setting this list in suppressOffers() that might be a cause for concern. 
> 
>  - I'm using the word "un-suppression" because re-registering with roles 
> removed from the suppressed roles list is NOT equivalent to performing a 
> REVIVE call for these roles (unlike REVIVE, it does not clear offerFilters in 
> the allocator).
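
To make the distinction in the last note concrete, here is a toy sketch of how I read it. ToyAllocator and all of its methods are hypothetical names made up for illustration, not the Mesos allocator API, and real offer filters are more involved than one list per role. It only models the stated difference: re-registering with a smaller suppressed set un-suppresses roles but leaves their filters in place, while REVIVE also clears the filters.

```python
class ToyAllocator:
    """Hypothetical, heavily simplified allocator model (not Mesos code)."""

    def __init__(self):
        self.suppressed = set()   # roles currently suppressed
        self.offer_filters = {}   # role -> list of filter descriptions

    def decline_with_filter(self, role, description):
        # A decline may leave a filter behind for that role.
        self.offer_filters.setdefault(role, []).append(description)

    def suppress(self, roles):
        self.suppressed |= set(roles)

    def revive(self, roles):
        # REVIVE: un-suppress the roles AND clear their offer filters.
        for role in roles:
            self.suppressed.discard(role)
            self.offer_filters.pop(role, None)

    def reregister(self, suppressed_roles):
        # Re-registration: replace the suppressed set, keep the filters.
        self.suppressed = set(suppressed_roles)


allocator = ToyAllocator()
allocator.decline_with_filter("role1", "filter left by a decline")
allocator.suppress(["role1"])

allocator.reregister([])       # un-suppression via re-registration
filters_after_reregister = dict(allocator.offer_filters)

allocator.revive(["role1"])    # REVIVE clears the filter as well
filters_after_revive = dict(allocator.offer_filters)
```

In this model filters_after_reregister still has an entry for "role1", while filters_after_revive does not.
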
> 
> =====
> A bit of background on why this change is needed.
> 
> To properly support V0 frameworks with a large number of roles, it is 
> necessary for the driver not to change the suppression state of the roles on 
> its own. Therefore, because the driver re-registers transparently, we will 
> need to store the required suppression state in the driver and make it 
> re-register using this state.
> 
> We could possibly avoid the proposed change of suppressOffers() by adding a 
> new interface to the driver for changing the suppression state, leaving 
> suppressOffers() as it is, and marking it as deprecated.
> 
> However, this will leave the behaviour of suppressOffers() deeply 
> inconsistent with everything else.
> Compare the following two sequences of events.
> First one:
>  - The framework creates and starts a driver with roles "role1", "role2"... 
> "role500", the driver registers
>  - The framework calls a new method driver.suppressOffersForRoles({"role1", 
> ..., "role500"}), the driver performs a SUPPRESS call for these roles and 
> stores them in its suppressed roles set. 
>    (Alternative with the same result: the framework calls 
> driver.updateFramework(FrameworkInfo, suppressedRoles={"role1", ..., 
> "role500"}), the driver performs an UPDATE_FRAMEWORK call with those 
> parameters and stores the new suppressed roles set).

I'm unfamiliar with any driver mechanism for storing suppressed roles; does 
this simply mean that the framework knows, from its persistent state, which 
roles should be suppressed?

>  - The driver, for some reason, disconnects and re-registers with the same 
> master, providing the stored suppressed roles set. 
>  - All the roles are still suppressed
> Second one:
>  - The framework creates and starts a driver with roles "role1", "role2"... 
> "role500", the driver registers
>  - The framework calls driver.suppressOffers(), the driver performs a SUPPRESS 
> call for all roles, but doesn't modify the required suppression state.
>  - The driver, for some reason, disconnects and re-registers with the same 
> master, providing the stored suppressed roles set, which is empty. 
>  - Now none of the roles are suppressed, and the allocator generates offers 
> for 500 roles, which the framework will likely decline.
> 
> This is one of the examples that make us strongly consider altering the 
> interaction between suppressOffers() and the transparent re-registration when 
> we add storing the suppression state to the driver.
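
To check my understanding of the two sequences, here is a small simulation. ToyDriver, its methods, and the preserve_on_suppress flag are all hypothetical names invented for this sketch; it does not use the real MesosSchedulerDriver and models the allocator as a plain set. The flag switches between the current behavior (suppressOffers() leaves the stored set empty) and the proposed one (suppressOffers() records the roles).

```python
class ToyDriver:
    """Hypothetical stand-in for a scheduler driver; not the Mesos API."""

    def __init__(self, roles, preserve_on_suppress):
        self.roles = set(roles)
        # Set the driver would send to the master when it re-registers.
        self.stored_suppressed = set()
        # What the simulated allocator currently treats as suppressed.
        self.allocator_suppressed = set()
        self.preserve_on_suppress = preserve_on_suppress

    def suppress_offers(self):
        # SUPPRESS for all roles; only the proposed behavior records it.
        self.allocator_suppressed = set(self.roles)
        if self.preserve_on_suppress:
            self.stored_suppressed = set(self.roles)

    def revive_offers(self):
        # REVIVE for all roles; the stored set is cleared either way.
        self.allocator_suppressed.clear()
        self.stored_suppressed.clear()

    def reregister(self):
        # Transparent re-registration: the master adopts the stored set.
        self.allocator_suppressed = set(self.stored_suppressed)


def offered_roles_after_reconnect(preserve_on_suppress):
    """Roles that receive offers again after suppress + reconnect."""
    roles = [f"role{i}" for i in range(1, 501)]
    driver = ToyDriver(roles, preserve_on_suppress)
    driver.suppress_offers()
    driver.reregister()
    return driver.roles - driver.allocator_suppressed


current = offered_roles_after_reconnect(preserve_on_suppress=False)
proposed = offered_roles_after_reconnect(preserve_on_suppress=True)
```

With the flag off, all 500 roles are un-suppressed after the reconnect; with it on, none are. If that matches the intent, the change seems reasonable to me.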


