Responses inline

> On Sep 6, 2016, at 1:33 AM, haosdent <[email protected]> wrote:
> 
> Hi, Silas. Thanks a lot to help test the health check changes recently.
> 
> According to my understanding about your email, you mentioned two problems:
> 
> 1. The bug that broken exists HTTP/command health check caused by r50812 
> <https://reviews.apache.org/r/50812> and r50996 
> <https://reviews.apache.org/r/50996>
> 
> >It is now true that even with the proposed change (51560), we will still get 
> >tasks rejected with TASK_ERROR in 1.1.0, despite the same exact code working 
> >in 1.0.0.
> >Even in the case of the command health checks, which are once again 
> >supported in 51560, we now get deprecation warnings, suggesting that mesos 
> >will again break us in 1.4.
> 
> As you mentioned, this is a bug and we definitely should fix before release 
> 1.1.0.
> I have updated r51560 <https://reviews.apache.org/r/51560> yesterday and 
> verify it fix the problem via r51635. As you see in
> the r51560 <https://reviews.apache.org/r/51560>, we make sure the protobuf 
> compatible again and didn't lose any
> fields. Would you help to double check if it fixes your problem when you free?
> It would be highly appreciated that if you could help to verify it.
> 
> After this bug fix, we could ensure all tasks with HTTP/command health check 
> are not when upgrading to 1.1.0.
> 

I see those changes now (I’m very very bad at the review board UI, so I’m sorry 
if it was always there and I missed it somehow).

> 2. Should we make the `HealthCheck::type` required after v2 ?
> 
> To be honest, I think 6 months should be enough and it also should be changed 
> in
> v1 because it is a minor change and we didn't make it `required` in protobuf 
> message level. We still keeping it `option` in protobuf message definition and
> add a check about it in Mesos code.
> But your concerns make sense as well, so let's see what other 
> users/developers say to
> see if we could make an agreement on this. 
> 

This is an important point. It doesn’t make sense to me that the compatibility 
policy is talking about only whether a protobuf field is optional or required — 
it seems to me that any change that takes a protocol exchange that did not 
result in a TASK_ERROR before, and changes it to cause a TASK_ERROR now, *is* 
making that protobuf field semantically required, whether or not the protobuf 
def says so.

I’ll also point out that there is no definition of ‘minor’ change in the 
compatibility document, and therefore, whether or not a change appears to be 
‘minor’ under some rubric (and I agree that this change could seem minor), if 
it’s part of the v1 mesos.proto, it affects downstream users (such as me, a 
writer of schedulers).

> On Tue, Sep 6, 2016 at 1:18 PM, Silas Snider <[email protected] 
> <mailto:[email protected]>> wrote:
> There’s a little history to this:
> 
> In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/> 
> <https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>>, 
> on the 8th of August, the HTTP health check message was changed to be 
> entirely incompatible with the previous HTTP health check message. Not only 
> was its name changed (breaking compatibility with anyone using the feature 
> with libmesos), but the field tags were rearranged, making it truly 
> wire-format incompatible. This change also introduced a ‘type’ field to the 
> HealthCheck message as an optional enum.
> 
> Next, in https://reviews.apache.org/r/50996/ 
> <https://reviews.apache.org/r/50996/> <https://reviews.apache.org/r/50996/ 
> <https://reviews.apache.org/r/50996/>>, on the 13th of August, the health 
> checking code was changed to make the new ‘type’ field mandatory — if the 
> protobuf field is not present, the mesos master rejects your task with 
> TASK_ERROR.
> 
> A colleague of mine was testing our internal scheduler against HEAD of mesos, 
> and discovered that any task they submitted was being rejected as TASK_ERROR, 
> since we were setting health checks, but not sending type. I filed 
> MESOS-6110, on the 30th of August, and haosdent huang has kindly created 
> https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> 
> <https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/>> 
> to try to fix this.
> 
> In the course of reviewing that fix, I noticed that it only addresses the 
> case of a command health check, and does not continue to support HTTP health 
> checks in the way they were in 1.0.0. This is a problem for our scheduler, as 
> we have ~always (before mesos actually added support) passed our HTTP health 
> checks in the message, depending on our custom executor to actually perform 
> the check. It is now true that even with the proposed change (51560), we will 
> still get tasks rejected with TASK_ERROR in 1.1.0, despite the same exact 
> code working in 1.0.0.
> 
> Even in the case of the command health checks, which are once again supported 
> in 51560, we now get deprecation warnings, suggesting that mesos will again 
> break us in 1.4.
> 
> It is my team’s belief that the mesos compatibility guarantee, as documented 
> on this page: http://mesos.apache.org/documentation/latest/versioning/ 
> <http://mesos.apache.org/documentation/latest/versioning/><http://mesos.apache.org/documentation/latest/versioning/
>  <http://mesos.apache.org/documentation/latest/versioning/>> would prohibit 
> this sort of change from occurring. Specifically, the ‘API Versioning’ 
> section says "The API version is only bumped if we need to make a backwards 
> incompatible API change. We will strive to support a given API version for at 
> least a year.” and under the ‘API compatibility’ the change is considered to 
> be breaking if it would involve "Adding new required fields to existing 
> requests to “/scheduler”.”
> 
> The proposed change does indeed add a new required field — ‘type’ to the v1 
> api, in the case of command health checks in 6 months, in the case of http 
> health checks, immediately. Therefore, it seems clear that this constitutes a 
> new ‘v2’ api, and it’s very clear that 6 months is too short, especially as 
> another part of the 'API Versioning’ section says "The deprecation clock for 
> vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”
> 
> Please believe me, I understand the need to be able to change broken api and 
> implementation quickly, without spending years maintaining technical debt. 
> This is why I believe the mesos project decided to move to a model where the 
> internal protobufs are separate from the v1/v2/etc. protobufs, and 
> evolvers/devolvers are proposed. It seems clear that the right way of doing 
> this is to modify the internal protobuf to look the way you’d like (better 
> message name, clearer field order, etc.) and write an evolver from the v1 api 
> to the internal api.
> 
> Also, I think it’s important to note that the compatibility guarantees I’m 
> citing are exactly the things that make it possible at all to write a 
> scheduler against mesos and actually use it in production. Deciding that this 
> case is too insignificant to really bother with the compatibility guarantees 
> means that you’ve just pushed the tech debt issue one level higher to the 
> scheduler writers.
> 
> I’m sorry this email ended up so long, but thank you for taking some time to 
> read it — I believe that this issue is critical to the ongoing health of the 
> mesos project.
> 
> 
> > On Sep 5, 2016, at 11:14 AM, haosdent <[email protected] 
> > <mailto:[email protected]>> wrote:
> >
> > Hi, folks. As I mentioned in the previous email 
> > http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 
> > <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1> 
> > <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 
> > <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>>.
> > We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> > health checks without `type` specified will be deprecated since 1.1.0.
> >
> > For backwards compatibility, we still support the command health check if 
> > the
> > type is not specified for now. But we plan to make `type` become a required 
> > field
> > and return `TASK_ERROR` if the type is not specified after 6 months. The 
> > question
> > is if this meets the deprecated policy since 1.0 ? If 6 months is too short 
> > and
> > we have to deprecate it after 2.0 ?
> >
> > Looking forward the answers. Any concerns and questions are appreciated, 
> > thanks a lot!
> >
> > --
> > Best Regards,
> > Haosdent Huang
> 
> 
> 
> 
> -- 
> Best Regards,
> Haosdent Huang

Reply via email to