Re: Discussion: Scheduler API for Operation Reconciliation

2019-01-24 Thread Chun-Hung Hsiao
I chatted with Jie and Gaston, and here is a brief summary:

1. The ordering issue between the synchronous response and the event stream
would add extra complication for a framework, and the benefit doesn't seem
to be worth that complication.
2. However, we should consider not forwarding the reconciliation requests
to the agents. The status updates don't require a trigger, and if the
agent reported gone and unregistered RPs to the master, the master could
respond to the reconciliation request itself.
The only problem I see is that frameworks may see
`OPERATION_GONE_BY_OPERATOR` -> `OPERATION_UNREACHABLE` ->
`OPERATION_GONE_BY_OPERATOR`, since the master does not persist gone RPs.

To address the original problem of MESOS-9318, we could do the following
(a rough sketch of this mapping follows the list):
(1) Agent is gone => `OPERATION_GONE_BY_OPERATOR`
(2) Agent is unreachable => `OPERATION_UNREACHABLE`
(3) Agent is not registered => `OPERATION_RECOVERING`
(4) Agent is unknown => `OPERATION_UNKNOWN`
(5) Agent is registered, RP is gone => `OPERATION_GONE_BY_OPERATOR`
(6) Agent is registered, RP is not registered => `OPERATION_UNREACHABLE` or
`OPERATION_RECOVERING`
(7) Agent is registered, RP is unknown => `OPERATION_UNKNOWN`
(8) Agent is registered, RP is registered => maybe `OPERATION_UNKNOWN`?

So it seems that a number of people agree on going with asynchronous
responses through the event stream. Please reply if you have other opinions!

Re: Discussion: Scheduler API for Operation Reconciliation

2019-01-24 Thread James DeFelice
I've attempted to implement support for operation status reconciliation in
a framework that I've been building. Option (III) seems most convenient
from my perspective as well. A single source of updates:

(a) Leads to a cleaner framework design; I've had to poke a few holes in
the framework's initial design to deal with multiple event sources, leading
to increased complexity.

(b) Allows frameworks to consume events in the order they arrive (and
pushes the responsibility for event ordering back to Mesos). Multiple event
sources that the framework needs to (possibly) reorder based on a timestamp
would add further complexity that we should avoid pushing onto framework
writers; a sketch of that machinery follows just below.
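For illustration, here's roughly the kind of machinery a framework ends up
needing with multiple sources. Everything below is a hypothetical sketch,
not code from my framework:

#include <queue>
#include <vector>

// A framework event with just enough structure for ordering.
struct Event { long timestampNanos; /* payload elided */ };

// Min-heap ordered by timestamp: oldest buffered event comes out first.
struct ByTimestamp {
  bool operator()(const Event& a, const Event& b) const {
    return a.timestampNanos > b.timestampNanos;
  }
};

// With two sources (event stream + synchronous reconciliation responses),
// the framework has to buffer and reorder before acting on anything.
// With a single source (option III), none of this exists: you just
// process events in arrival order.
std::priority_queue<Event, std::vector<Event>, ByTimestamp> pending;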

Some other thoughts:

(c) I've implemented a background polling loop for exactly the reason that
Benno pointed out. An asynchronous API call for operation status
reconciliation would be fine with me (rough sketch at the end of this list).

(d) API consistency is important. Framework devs are used to the way that
the task status reconciliation API works, and have come up with solutions
for dealing with the lack of boundaries for streams of explicit
reconciliation events. The synchronous response defined for the currently
published operation status reconciliation call isn't consistent with the
rest of the v1 scheduler API, which generated a bit of extra work (for me)
in the low-level mesos v1 http client lib. Consistency should be a primary
goal when extending existing API sets.

(e) There are probably other ways to solve the problem of a "lack of
boundaries within the event stream" for explicit reconciliation requests.
If this is a problem that other framework devs need solved, then let's
address it as a separate issue - and aim to resolve it in a consistent way
for both task and operation status event streams.

(f) It sounds like option (III) would let Mesos send back smarter operation
statuses in agent/RP failover cases (UNREACHABLE vs. UNKNOWN). Anything to
limit the number of scenarios where UNKNOWN is returned to frameworks
sounds good to me.
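Here's a rough sketch of the polling loop from (c), assuming the
asynchronous flavor of option (III): the reconciliation call is
fire-and-forget, and the resulting operation status updates arrive on the
normal event stream. `SchedulerClient` is a made-up placeholder here, not
any real client library:

#include <atomic>
#include <chrono>
#include <thread>

// Placeholder for whatever sends scheduler calls over the v1 HTTP API.
struct SchedulerClient {
  void reconcileOperations() { /* POST a RECONCILE_OPERATIONS call */ }
};

void reconciliationLoop(SchedulerClient& client, std::atomic<bool>& running)
{
  while (running.load()) {
    // Fire-and-forget: the master answers with UPDATE_OPERATION_STATUS
    // events on the event stream, so there's no response to parse here.
    client.reconcileOperations();
    std::this_thread::sleep_for(std::chrono::minutes(5));
  }
}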

-James



On Wed, Jan 16, 2019 at 4:15 PM Benjamin Bannier <
benjamin.bann...@mesosphere.io> wrote:

> Hi,
>
> have we reached a conclusion here?
>
> From the Mesos side of things I would be strongly in favor of proposal
> (III). This is not only consistent with what we do with task status
> updates, but also would allow us to provide improved operation statuses
> (e.g., `OPERATION_UNREACHABLE` instead of just `OPERATION_UNKNOWN` to
> better distinguish non-terminal from terminal operation states). To
> accomplish that we wouldn’t need to introduce extra information leakage
> (e.g., explicitly keeping master up to date on local resource provider
> state and associated internal consistency complications).
>
> This approach should also simplify framework development as a framework
> would only need to watch a single channel to see operation status updates
> (no need to reconcile different information sources). The benefits of
> better status updates and simpler implementation IMO outweigh any benefits
> of the current approach (disclaimer: I filed the slightly inflammatory
> MESOS-9448).
>
> What is keeping us from moving forward with (III) at this point?
>
>
> Cheers,
>
> Benjamin
>
> > On Jan 3, 2019, at 11:30 PM, Benno Evers  wrote:
> >
> > Hi Chun-Hung,
> >
> > > imagine that there are 1k nodes and 10 active + 10 gone LRPs per
> > > node, then the master needs to maintain 20k entries for LRPs.
> >
> > How big would the required additional storage be in this scenario? Even
> > if it's 1KiB per LRP, using 20 MiB of extra memory doesn't sound too bad
> > for such a big cluster.
> >
> > In general, it seems hard to discuss the trade-offs between your
> > proposals without looking at the users of that API - do you know if
> > there are any frameworks out there that already use operation
> > reconciliation, and if so, what do they do based on the reconciliation
> > response?
> >
> > As far as I know, we don't have any formal guarantees on which
> > operation status changes the framework will receive without
> > reconciliation. So, putting on my framework-implementer hat, it seems
> > like I'd have no choice but to implement a continuously polling
> > background loop anyway if I care about knowing the latest operation
> > statuses. If this is indeed the case, having a synchronous
> > `RECONCILE_OPERATIONS` would seem to have little additional benefit.
> >
> > Best regards,
> > Benno
> >
> > On Wed, Dec 12, 2018 at 4:07 AM Chun-Hung Hsiao 
> > wrote:
> > Hi folks,
> >
> > Recently I've been discussing the problems of the current design of the
> > experimental `RECONCILE_OPERATIONS` scheduler API with a couple of
> > people. The discussion was started from MESOS-9318: when a framework
> > receives an `OPERATION_UNKNOWN`, it doesn't know if it should retry the
> > operation or not (further details described below). As the discussion
> > evolves, we realize there are more issues to consider, design-wise