Re: Non-checkpointing frameworks

2016-10-15 Thread Zameer Manji
+1 to A and B

Aurora has enabled checkpointing for years and requires operators to enable
checkpointing on the slaves.

On Sat, Oct 15, 2016 at 11:57 AM, Joris Van Remoortere wrote:

> I'm in favor of A & B. I find it provides a better "first experience" to
> users.
> From my experience you usually have to have an explicit reason to not want
> to checkpoint. Most people assume the semantics provided by the checkpoint
> behavior is default and it can be a frustrating experience for them to find
> out that is not the case.
>
> —
> *Joris Van Remoortere*
> Mesosphere
>
> On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway wrote:
>
>> Hi folks,
>>
>> I'd like input from individuals who currently use frameworks but do
>> not enable checkpointing.
>>
>> Background: "checkpointing" is a parameter that can be enabled in
>> FrameworkInfo; if enabled, the agent will write the framework pid,
>> executor PIDs, and status updates to disk for any tasks started by
>> that framework. This checkpointed information means that these tasks
>> can survive an agent crash: if the agent exits (whether due to
>> crashing or as part of an upgrade procedure), a restarted agent can
>> use this information to reconnect to executors started by the previous
>> instance of the agent. The downside is that checkpointing requires
>> some additional disk I/O at the agent.
>>
>> Checkpointing is not currently the default, but in my experience it is
>> often enabled for production frameworks. As part of the work on
>> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
>> considering:
>>
>> (a) requiring that partition-aware frameworks must also enable
>> checkpointing, and/or
>> (b) enabling checkpointing by default
>>
>> If you have intentionally decided to disable checkpointing for your
>> Mesos framework, I'd be curious to hear more about your use-case and
>> why you haven't enabled it.
>>
>> Thanks!
>>
>> Neil
>>
>> --
>> Zameer Manji
>>
>
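
For readers new to the thread, the behavior Neil describes (the agent writes
executor PIDs to disk so that a restarted agent can reconnect to them) can be
sketched as below. This is a toy model and not Mesos code: the ToyAgent class,
the helper functions, the JSON file layout and the IDs are all hypothetical,
and the `checkpoint` argument only stands in for the FrameworkInfo setting.

```python
import json
import tempfile
from pathlib import Path


def checkpoint_executor(work_dir: Path, framework_id: str,
                        executor_id: str, pid: int) -> None:
    """Persist one executor PID to disk (hypothetical layout, not Mesos's)."""
    path = work_dir / framework_id / f"{executor_id}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps({"executor_id": executor_id, "pid": pid}))


def recover_executors(work_dir: Path) -> dict:
    """Read back every checkpointed executor found under work_dir."""
    recovered = {}
    for path in sorted(work_dir.glob("*/*.json")):
        record = json.loads(path.read_text())
        recovered[(path.parent.name, record["executor_id"])] = record["pid"]
    return recovered


class ToyAgent:
    """Toy agent: in-memory state is lost on exit, checkpointed state is not."""

    def __init__(self, work_dir: Path) -> None:
        self.work_dir = work_dir
        # A (re)started agent only knows what was written to disk earlier.
        self.executors = recover_executors(work_dir)

    def launch(self, framework_id: str, executor_id: str,
               pid: int, checkpoint: bool) -> None:
        self.executors[(framework_id, executor_id)] = pid
        if checkpoint:
            # The extra disk I/O mentioned as the downside happens here.
            checkpoint_executor(self.work_dir, framework_id, executor_id, pid)


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        agent = ToyAgent(work_dir)
        agent.launch("fw-checkpointing", "exec-1", 4242, checkpoint=True)
        agent.launch("fw-plain", "exec-2", 4343, checkpoint=False)
        # Simulate an agent crash or upgrade: a new instance, same work dir.
        restarted = ToyAgent(work_dir)
        return restarted.executors
```

In this sketch, calling demo() returns only the executor of the framework
that enabled checkpointing; the other executor is unknown to the restarted
agent, which is the situation the proposal to checkpoint by default avoids.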


Re: Non-checkpointing frameworks

2016-10-15 Thread Joris Van Remoortere
I'm in favor of A & B. I find it provides a better "first experience" to
users.
From my experience you usually have to have an explicit reason to not want
to checkpoint. Most people assume the semantics provided by the checkpoint
behavior is default and it can be a frustrating experience for them to find
out that is not the case.

—
*Joris Van Remoortere*
Mesosphere

On Fri, Oct 14, 2016 at 3:11 PM, Neil Conway wrote:

> Hi folks,
>
> I'd like input from individuals who currently use frameworks but do
> not enable checkpointing.
>
> Background: "checkpointing" is a parameter that can be enabled in
> FrameworkInfo; if enabled, the agent will write the framework pid,
> executor PIDs, and status updates to disk for any tasks started by
> that framework. This checkpointed information means that these tasks
> can survive an agent crash: if the agent exits (whether due to
> crashing or as part of an upgrade procedure), a restarted agent can
> use this information to reconnect to executors started by the previous
> instance of the agent. The downside is that checkpointing requires
> some additional disk I/O at the agent.
>
> Checkpointing is not currently the default, but in my experience it is
> often enabled for production frameworks. As part of the work on
> supporting partition-aware Mesos frameworks (see MESOS-4049), we are
> considering:
>
> (a) requiring that partition-aware frameworks must also enable
> checkpointing, and/or
> (b) enabling checkpointing by default
>
> If you have intentionally decided to disable checkpointing for your
> Mesos framework, I'd be curious to hear more about your use-case and
> why you haven't enabled it.
>
> Thanks!
>
> Neil
>


Re: On Mesos versioning and deprecation policy

2016-10-15 Thread haosdent
Thanks for the great input, @yan! I agree with almost all of it.

> Also the API is not just what the machine reads but all the documentation
> associated with it, right? It depends on what the documentation says; what
> the user _should_ expect.

I think different users may have different expectations, and the person who
developed the APIs may have a different understanding from some users as
well. Our documentation should cover most cases.

But in cases where we didn't, or forgot to, write it explicitly in the
documentation, should we give up on updating the API? It is just like user
Alice saying this is a bug while user Bob says it is a feature. I think we
still need to raise it case by case to ensure most users are not affected by
breaking API changes.

On Sat, Oct 15, 2016 at 6:55 AM, Vinod Kone wrote:

> We will chat about this in the upcoming community sync (Thursday 3 PM).
> So, please make sure to attend if you are interested.
>
> On Fri, Oct 14, 2016 at 3:44 PM, Yan Xu wrote:
>
>>
>> On Fri, Oct 14, 2016 at 3:37 PM, Yan Xu wrote:
>>
>>> Thanks Alex for starting this!
>>>
>>> In addition to comments below, I think it'll be helpful to keep the
>>> existing versioning doc concise and user-friendly while having a dedicated
>>> doc for the "implementation details" where precise requirements and
>>> procedures go. Maybe some duplication/cross-referencing is needed but Mesos
>>> developers will find the latter much more helpful while users and framework
>>> developers will find the former easy to read.
>>>
>>> e.g., a similar split:
>>> https://github.com/kubernetes/kubernetes/blob/master/docs/api.md
>>> https://github.com/kubernetes/kubernetes/blob/master/docs/devel/api_changes.md
>>> (which has a lot of details on how the Kubernetes community is thinking
>>> about similar issues, which we can learn from)
>>>
>>> Jiang Yan Xu 
>>>
>>> On Wed, Oct 12, 2016 at 9:34 AM, Alex Rukletsov wrote:
>>>
>>>> Folks,
>>>>
>>>> There have been a bunch of online [1, 2] and offline discussions about our
>>>> deprecation and versioning policy. I found that people, including myself,
>>>> read the versioning doc [3] differently; moreover some aspects are not
>>>> captured there. I would like to start a discussion around this topic by
>>>> sharing my confusions and suggestions. This will hopefully help us stay on
>>>> the same page and have similar expectations. The second goal is to
>>>> eliminate ambiguities from the versioning doc (thanks Vinod for
>>>> volunteering to update it).

>>>
>>> +1 Let me know if there are things I can help with.
>>>
>>>

>>>> 1. API vs. semantic changes.
>>>> The current versioning guide treats features (e.g. flags, metrics,
>>>> endpoints) and the API differently: incompatible changes for the former
>>>> are allowed after a 6-month deprecation cycle, while for the latter they
>>>> require bumping the major version. I suggest we consolidate these policies.

>>>
>>> I feel that the distinction is not API vs. semantic changes. A backwards
>>> compatible API guarantee should imply backwards compatible semantics (of
>>> the API).
>>> i.e., if a change in the API doesn't cause the message to be dropped on the
>>> floor but leads to a behavior change that causes problems in the system, it
>>> still breaks compatibility.
>>>
>>> IMO the distinction is more between:
>>> - Compatibility between components that are impossible/very unpleasant
>>> to upgrade in lockstep - high priority for compatibility guarantee.
>>> - Compatibility between components that are generally bundled (modules)
>>> or things that usually aren't built into automated tooling (e.g., the
>>> /state endpoint) - more relaxed for now but we should explicitly exclude
>>> them from the guarantee.
>>>
>>>

>>>> We should also define and clearly explain what changes require bumping the
>>>> major version. I have no strong opinion here and would love to hear what
>>>> people think. The original motivation for maintaining backwards
>>>> compatibility is to make sure vN schedulers can correctly work with the vN
>>>> API without being updated. But what about semantic changes that do not
>>>> touch the API? For example, what if we decide to send fewer task health
>>>> updates to schedulers based on some health policy? It influences the flow
>>>> of task status updates; should such a change be considered compatible?
>>>> Taking it to an extreme, we may not even be able to fix some bugs because
>>>> someone may already rely on this behaviour!

>>>
>>> API changes should warrant a major version bump. Also the API is not
>>> just what the machine reads but all the documentation associated with it,
>>> right? It depends on what the documentation says; what the user _should_
>>> expect.
>>>
>>> That said, I feel that these things are hard to discuss in the
>>> abstract. Even with a guideline, we still need to make case-by-case
>>>