On Mon, Jun 24, 2019 at 7:09 AM Tim Harper <[email protected]> wrote:

> Hi Andrei,
>
> Thanks for this heads up. A few questions:
>
>
> > On Jun 21, 2019, at 06:50, Andrei Sekretenko <[email protected]>
> > wrote:
> >
> >
> > Hi all,
> >
> > we are intending to change the behavior of the suppressOffers() method
> > of MesosSchedulerDriver with regard to the transparent re-registration.
> >
> > Currently, when driver becomes disconnected from a master, it performs
> > on its own a re-registration with an empty set of suppressed roles. This
> > causes un-suppression of all the suppressed roles of the framework.
> >
> > The plan is to alter this behavior into preserving the suppression state
> > on this re-registration.
> >
> > The required set of suppressed roles will be stored in the driver, which
> > will be now performing re-registration with this set (instead of an empty
> > one), and updating the stored set whenever a call modifying the
> > suppression state of the roles in the allocator is performed.
> > Currently, the driver has two methods which perform such calls:
> > suppressOffers() and reviveOffers().
> >
> > Please feel free to raise any concerns or objections - especially if you
> > are aware of any V0 frameworks which (probably implicitly) depend on
> > un-suppression of the roles when this re-registration occurs.
> >
> >
> >
> > Note that:
> >  - Frameworks which do not call suppressOffers() are, obviously,
> >    unaffected by this change.
> >
> >  - Frameworks that reliably prevent transparent-re-registration (for
> >    example, by calling driver.abort() immediately from the disconnected()
> >    callback), should also be not affected.
>
> I presume driver.stop(true) works as well? Marathon does this, and I
> believe the behavior is to crash so a new Marathon leader can establish a
> new connection to Mesos, which will set the appropriate suppress/revive
> state.
>

Yes, driver.stop(), called from the disconnected() callback, also
guarantees that the driver will not initiate resubscription - regardless of
whether the `failover` argument is true or false.

Currently, it is important to call stop()/abort() IMMEDIATELY from the
callback; otherwise, the following (unlikely) scenario might happen:
- framework calls suppressOffers()
- something happens, driver becomes disconnected and calls
scheduler->disconnected()
- scheduler somehow stores the information that driver is disconnected and
returns from disconnected()
- after disconnected() returns, the driver authenticates and sends
SUBSCRIBE with an empty set of suppressed roles
- scheduler calls stop()/abort()
- the allocator unsuppresses all the roles
- when the framework is active, the allocator generates offers until the
framework calls suppressOffers().

Note that the proposed change of suppressOffers() makes this scenario
impossible.
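
To make the ordering concrete, here is a toy sketch of the scenario in
Python. It is not driver code: ToyAllocator, ToyDriver and run_scenario are
hypothetical names made up for illustration, and the only behaviour modelled
is that a (re-)subscription replaces the suppression state held by the
allocator.

```python
# Toy model only: ToyAllocator and ToyDriver are hypothetical names,
# not part of the Mesos API.

class ToyAllocator:
    def __init__(self):
        self.suppressed = set()

    def subscribe(self, suppressed_roles):
        # A (re-)subscription replaces the framework's suppression state.
        self.suppressed = set(suppressed_roles)

    def suppress(self, roles):
        self.suppressed |= set(roles)

    def offered_roles(self, roles):
        # Roles for which the allocator would still generate offers.
        return sorted(set(roles) - self.suppressed)


class ToyDriver:
    def __init__(self, allocator, roles, preserve_suppression):
        self.allocator = allocator
        self.roles = list(roles)
        # False models the current behaviour, True the proposed one.
        self.preserve_suppression = preserve_suppression
        self.stored_suppressed = set()
        allocator.subscribe(self.stored_suppressed)

    def suppress_offers(self):
        self.allocator.suppress(self.roles)
        if self.preserve_suppression:
            self.stored_suppressed = set(self.roles)

    def transparent_resubscribe(self):
        self.allocator.subscribe(self.stored_suppressed)


def run_scenario(preserve_suppression):
    allocator = ToyAllocator()
    driver = ToyDriver(allocator, ["role1", "role2"], preserve_suppression)
    driver.suppress_offers()
    # disconnected() has returned, and the driver re-subscribes before
    # the scheduler gets around to calling stop()/abort().
    driver.transparent_resubscribe()
    return allocator.offered_roles(driver.roles)
```

With preserve_suppression=False the re-subscription wipes the suppression
and both roles receive offers again; with True nothing is un-suppressed, so
the late stop()/abort() no longer matters.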


> >  - Storing the suppressed roles list for re-registration and clearing it
> >    in reviveOffers() do not change anything for the existing frameworks.
> >    It is setting this list in suppressOffers() which might be a cause of
> >    concerns.
> >
> >  - I'm using the word "un-suppression" because re-registering with roles
> >    removed from the suppressed roles list is NOT equivalent to performing
> >    REVIVE call for these roles (unlike REVIVE, it does not clear
> >    offerFilters in the allocator).
> >
> > =====
> > A bit of background on why this change is needed.
> >
> > To properly support V0 frameworks with large number of roles, it is
> > necessary for the driver not to change the suppression state of the roles
> > on its own.
> > Therefore, due to the existence of the transparent re-registration in
> > the driver, we will need to store the required suppression state in the
> > driver and make it re-register using this state.
> >
> > We could possibly avoid the proposed change of suppressOffers() by
> > adding to the driver new interface for changing the suppression state,
> > leaving suppressOffers() as it is, and marking it as deprecated.
> >
> > However, this will leave the behaviour of suppressOffers() deeply
> > inconsistent with everything else.
> > Compare the following two sequences of events.
> > First one:
> >  - The framework creates and starts a driver with roles "role1",
> >    "role2"... "role500", the driver registers
> >  - The framework calls a new method
> >    driver.suppressOffersForRoles({"role1", ..., "role500"}), the driver
> >    performs SUPPRESS call for these roles and stores them in its
> >    suppressed roles set.
> >    (Alternative with the same result: the framework calls
> >    driver.updateFramework(FrameworkInfo, suppressedRoles={"role1", ...,
> >    "role500"}), the driver performs UPDATE_FRAMEWORK call with those
> >    parameters and stores the new suppressed roles set).
>
> I'm unfamiliar with a driver storage mechanism for storing suppressed
> roles; does this mean to say simply that the Framework knows, from its
> persistent state, which roles should be suppressed?
>

The same as for the FrameworkInfo: the MesosSchedulerDriver object will
hold the desired set of suppressed roles and use it for the transparent
re-subscription (as it does with the FrameworkInfo).
It will be possible to initialize this set via the driver's constructor, and
the updateFramework(), reviveOffers() and suppressOffers() methods will have
to update it.
If we have to drop the proposal to modify the behaviour of
suppressOffers(), then the set will be updated not by suppressOffers() but
by some entirely new method, possibly named suppressOffersForRoles().

The need to update the set of suppressed roles stored by the driver arises
only due to the existence of the transparent re-subscription. If the driver
called SUBSCRIBE only once, we would have no need to store/update this set
in the driver.
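
As a rough illustration of the bookkeeping described above, here is a small
Python sketch. SuppressedRolesState and its method names are hypothetical
and only mirror the driver methods mentioned in this thread; the actual
driver code may look nothing like this.

```python
# Hypothetical sketch of the suppressed-roles set kept inside the driver.
# None of these names exist in Mesos; they only mirror the methods above.

class SuppressedRolesState:
    def __init__(self, roles, initial_suppressed=()):
        # The set can be initialized when the driver is constructed.
        self.roles = set(roles)
        self.suppressed = set(initial_suppressed)

    def suppress_offers(self):
        # suppressOffers() suppresses all roles of the framework.
        self.suppressed = set(self.roles)

    def revive_offers(self):
        # reviveOffers() clears the stored set.
        self.suppressed = set()

    def update_framework(self, roles, suppressed_roles):
        # updateFramework() replaces both the roles and the stored set.
        self.roles = set(roles)
        self.suppressed = set(suppressed_roles)

    def resubscribe_payload(self):
        # What a transparent re-subscription would send as suppressed roles.
        return sorted(self.suppressed)
```

For example, after suppress_offers() on a framework with "role1" and "role2",
resubscribe_payload() returns both roles, and after revive_offers() it
returns an empty list.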


> >  - The driver, due to some reason, disconnects and re-registers with the
> >    same master, providing the stored suppressed roles set.
> >  - All the roles are still suppressed
> > Second one:
> >  - The framework creates and starts a driver with roles "role1",
> >    "role2"... "role500", the driver registers
> >  - The framework calls driver.suppressOffers(), the driver performs
> >    SUPPRESS call for all roles, but doesn't modify required suppression
> >    state.
> >  - The driver, due to some reason, disconnects and re-registers with the
> >    same master, providing the stored suppressed roles set, which is empty.
> >  - Now, none of the roles are suppressed, allocator generates offers for
> >    500 roles which will likely be declined by the framework.
> >
> > This is one of the examples which makes us strongly consider altering
> > the interaction between suppressOffers() and the transparent
> > re-registration when we add storing the suppression state to the driver.
>
>
>
>

-- 
Andrei Sekretenko
