There’s a little history to this:

In https://reviews.apache.org/r/50812/ <https://reviews.apache.org/r/50812/>, 
on the 8th of August, the HTTP health check message was changed to be entirely 
incompatible with the previous HTTP health check message. Not only was its name 
changed (breaking compatibility with anyone using the feature with libmesos), 
but the field tags were rearranged, making it truly wire-format incompatible. 
This change also introduced a ‘type’ field to the HealthCheck message as an 
optional enum.

Next, in https://reviews.apache.org/r/50996/ 
<https://reviews.apache.org/r/50996/>, on the 13th of August, the health 
checking code was changed to make the new ‘type’ field mandatory — if the 
protobuf field is not present, the mesos master rejects your task with 
TASK_ERROR.

A colleague of mine was testing our internal scheduler against HEAD of mesos, 
and discovered that any task they submitted was being rejected as TASK_ERROR, 
since we were setting health checks, but not sending type. I filed MESOS-6110, 
on the 30th of August, and haosdent huang has kindly created 
https://reviews.apache.org/r/51560/ <https://reviews.apache.org/r/51560/> to 
try to fix this.

In the course of reviewing that fix, I noticed that it only addresses the case 
of a command health check, and does not continue to support HTTP health checks 
in the way they were in 1.0.0. This is a problem for our scheduler, as we have 
~always (before mesos actually added support) passed our HTTP health checks in 
the message, depending on our custom executor to actually perform the check. It 
is now true that even with the proposed change (51560), we will still get tasks 
rejected with TASK_ERROR in 1.1.0, despite the same exact code working in 1.0.0.

Even in the case of the command health checks, which are once again supported 
in 51560, we now get deprecation warnings, suggesting that mesos will again 
break us in 1.4.

It is my team’s belief that the mesos compatibility guarantee, as documented on 
this page: http://mesos.apache.org/documentation/latest/versioning/ 
<http://mesos.apache.org/documentation/latest/versioning/> would prohibit this 
sort of change from occurring. Specifically, the ‘API Versioning’ section says 
"The API version is only bumped if we need to make a backwards incompatible API 
change. We will strive to support a given API version for at least a year.” and 
under the ‘API compatibility’ the change is considered to be breaking if it 
would involve "Adding new required fields to existing requests to “/scheduler”.”

The proposed change does indeed add a new required field — ‘type’ to the v1 
api, in the case of command health checks in 6 months, in the case of http 
health checks, immediately. Therefore, it seems clear that this constitutes a 
new ‘v2’ api, and it’s very clear that 6 months is too short, especially as 
another part of the 'API Versioning’ section says "The deprecation clock for 
vN-1 API will start as soon as we release “N.0.0” version of Mesos. […]”

Please believe me, I understand the need to be able to change broken api and 
implementation quickly, without spending years maintaining technical debt. This 
is why I believe the mesos project decided to move to a model where the 
internal protobufs are separate from the v1/v2/etc. protobufs, and 
evolvers/devolvers are proposed. It seems clear that the right way of doing 
this is to modify the internal protobuf to look the way you’d like (better 
message name, clearer field order, etc.) and write an evolver from the v1 api 
to the internal api.

Also, I think it’s important to note that the compatibility guarantees I’m 
citing are exactly the things that make it possible at all to write a scheduler 
against mesos and actually use it in production. Deciding that this case is too 
insignificant to really bother with the compatibility guarantees means that 
you’ve just pushed the tech debt issue one level higher to the scheduler 
writers.

I’m sorry this email ended up so long, but thank you for taking some time to 
read it — I believe that this issue is critical to the ongoing health of the 
mesos project.


> On Sep 5, 2016, at 11:14 AM, haosdent <haosd...@gmail.com> wrote:
> 
> Hi, folks. As I mentioned in the previous email 
> http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1 
> <http://search-hadoop.com/m/0Vlr6Ma9DWqzG3M1>.
> We have added `type` in the `HealthCheck` protobuf definition in 1.1.0 and
> health checks without `type` specified will be deprecated since 1.1.0.
> 
> For backwards compatibility, we still support the command health check if the
> type is not specified for now. But we plan to make `type` become a required 
> field
> and return `TASK_ERROR` if the type is not specified after 6 months. The 
> question
> is if this meets the deprecated policy since 1.0 ? If 6 months is too short 
> and
> we have to deprecate it after 2.0 ?
> 
> Looking forward the answers. Any concerns and questions are appreciated, 
> thanks a lot!
> 
> -- 
> Best Regards,
> Haosdent Huang

Reply via email to