Hi Andrei,

Thanks for the heads-up. A few questions:

> On Jun 21, 2019, at 06:50, Andrei Sekretenko <asekrete...@mesosphere.com> wrote:
>
> Hi all,
>
> we are intending to change the behavior of the suppressOffers() method of
> MesosSchedulerDriver with regard to the transparent re-registration.
>
> Currently, when driver becomes disconnected from a master, it performs on its
> own a re-registration with an empty set of suppressed roles. This causes
> un-suppression of all the suppressed roles of the framework.
>
> The plan is to alter this behavior into preserving the suppression state on
> this re-registration.
>
> The required set of suppressed roles will be stored in the driver, which will
> be now performing re-registration with this set (instead of an empty one),
> and updating the stored set whenever a call modifying the suppression state
> of the roles in the allocator is performed.
> Currently, the driver has two methods which perform such calls:
> suppressOffers() and reviveOffers().
>
> Please feel free to raise any concerns or objections - especially if you are
> aware of any V0 frameworks which (probably implicitly) depend on
> un-suppression of the roles when this re-registration occurs.
>
> Note that:
> - Frameworks which do not call suppressOffers() are, obviously, unaffected
>   by this change.
>
> - Frameworks that reliably prevent transparent-re-registration (for example,
>   by calling driver.abort() immediately from the disconnected() callback),
>   should also be not affected.

I presume driver.stop(true) works as well? Marathon does this, and I believe the behavior is to crash so a new Marathon leader can establish a new connection to Mesos, which will set the appropriate suppress/revive state.

> - Storing the suppressed roles list for re-registration and clearing it in
>   reviveOffers() do not change anything for the existing frameworks. It is
>   setting this list in suppressOffers() which might be a cause of concerns.
>
> - I'm using the word "un-suppression" because re-registering with roles
>   removed from the suppressed roles list is NOT equivalent to performing REVIVE
>   call for these roles (unlike REVIVE, it does not clear offerFilters in the
>   allocator).
>
> =====
> A bit of background on why this change is needed.
>
> To properly support V0 frameworks with large number of roles, it is necessary
> for the driver not to change the suppression state of the roles on its own.
> Therefore, due to the existence of the transparent re-registration in the
> driver, we will need to store the required suppression state in the driver
> and make it re-register using this state.
>
> We could possibly avoid the proposed change of suppressOffers() by adding to
> the driver new interface for changing the suppression state, leaving
> suppressOffers() as it is, and marking it as deprecated.
>
> However, this will leave the behaviour of suppressOffers() deeply
> inconsistent with everything else.
> Compare the following two sequences of events.
> First one:
> - The framework creates and starts a driver with roles "role1", "role2"...
>   "role500", the driver registers
> - The framework calls a new method driver.suppressOffersForRoles({"role1",
>   ..., "role500"}), the driver performs SUPPRESS call for these roles and
>   stores them in its suppressed roles set.
>   (Alternative with the same result: the framework calls
>   driver.updateFramework(FrameworkInfo, suppressedRoles={"role1", ...,
>   "role500"}), the driver performs UPDATE_FRAMEWORK call with those parameters
>   and stores the new suppressed roles set).

I'm not familiar with any driver mechanism for storing suppressed roles; does this simply mean that the framework knows, from its persistent state, which roles should be suppressed?

> - The driver, due to some reason, disconnects and re-registers with the same
>   master, providing the stored suppressed roles set.
> - All the roles are still suppressed
> Second one:
> - The framework creates and starts a driver with roles "role1", "role2"...
>   "role500", the driver registers
> - The framework calls driver.suppressOffers(), the driver performs SUPPRESS
>   call for all roles, but doesn't modify required suppression state.
> - The driver, due to some reason, disconnects and re-registers with the same
>   master, providing the stored suppressed roles set, which is empty.
> - Now, none of the roles are suppressed, allocator generates offers for 500
>   roles which will likely be declined by the framework.
>
> This is one of the examples which makes us strongly consider altering the
> interaction between suppressOffers() and the transparent re-registration when
> we add storing the suppression state to the driver.
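
To check that I am reading the two sequences correctly, here is a toy model in Python. It is only a sketch under my own assumptions: ToyMaster, ToyDriver, and every method on them are made up for illustration and are not the Mesos API. The remember_suppress_offers flag stands in for the proposed change, and reconnect() stands in for the transparent re-registration.

```python
class ToyMaster:
    """Hypothetical stand-in for the master/allocator side."""

    def __init__(self):
        self.roles = set()
        self.suppressed = set()

    def register(self, roles, suppressed_roles):
        # (Re-)registration replaces the suppression state wholesale.
        self.roles = set(roles)
        self.suppressed = set(suppressed_roles)

    def suppress(self, roles):
        self.suppressed |= set(roles)

    def revive(self, roles):
        self.suppressed -= set(roles)

    def offered_roles(self):
        # Roles for which the allocator would still generate offers.
        return self.roles - self.suppressed


class ToyDriver:
    """Hypothetical driver that keeps a stored set of suppressed roles."""

    def __init__(self, master, roles, remember_suppress_offers):
        self.master = master
        self.roles = list(roles)
        self.remember = remember_suppress_offers
        self.stored_suppressed = set()
        master.register(self.roles, self.stored_suppressed)

    def suppress_offers(self):
        # Models suppressOffers(): SUPPRESS for all roles.
        self.master.suppress(self.roles)
        if self.remember:
            # Proposed behavior: also record the state for re-registration.
            self.stored_suppressed = set(self.roles)

    def suppress_offers_for_roles(self, roles):
        # Models the hypothetical new per-role method from the first sequence.
        self.master.suppress(roles)
        self.stored_suppressed |= set(roles)

    def revive_offers(self):
        # Models reviveOffers(): clears the stored set in both variants.
        self.master.revive(self.roles)
        self.stored_suppressed.clear()

    def reconnect(self):
        # Transparent re-registration with whatever set is stored.
        self.master.register(self.roles, self.stored_suppressed)


def roles_offered_after_reconnect(remember_suppress_offers, use_new_method=False):
    """Run one sequence and count roles that get offers after re-registration."""
    roles = [f"role{i}" for i in range(1, 501)]
    master = ToyMaster()
    driver = ToyDriver(master, roles, remember_suppress_offers)
    if use_new_method:
        driver.suppress_offers_for_roles(roles)
    else:
        driver.suppress_offers()
    driver.reconnect()
    return len(master.offered_roles())
```

If this matches the intent, the first sequence corresponds to roles_offered_after_reconnect(False, use_new_method=True), the second to roles_offered_after_reconnect(False), and the proposed change to roles_offered_after_reconnect(True).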