On Sun, Mar 13, 2016 at 3:46 PM, Michael Park <[email protected]> wrote:

> Hi Evan,
>
> If we wanted to backport those patches as well, we should cut a 0.25.2.
> I would first like to understand what kind of issues you're running into.
> Do you mind elaborating a little?
>


For MESOS-3738, we ran into the issue pretty much directly. We run Mesos
with the Docker executor, and Mesos healthchecks stopped reporting
correctly. Our automation (built on Marathon) waits for healthcheck results
to be available before considering a task "healthy" and killing old tasks,
so this bug manifested as deploys getting stuck and never killing old tasks.
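
To make the failure mode concrete, here is a minimal sketch of that kind
of deploy gate. It is not Marathon's API or our actual automation; the
function name and the use of None for "no healthcheck result reported
yet" are assumptions made for illustration.

```python
from typing import Optional, Sequence


def can_kill_old_tasks(new_task_health: Sequence[Optional[bool]]) -> bool:
    """Decide whether a deploy may proceed to killing the old tasks.

    Hypothetical helper: each entry is the latest healthcheck result for
    one new task. True = passing, False = failing, None = no result has
    been reported yet.
    """
    # With no new tasks there is nothing to replace the old ones with.
    if not new_task_health:
        return False
    # A missing result (None) blocks the deploy just like a failure does.
    return all(result is True for result in new_task_health)
```

If healthcheck results never arrive, every entry stays None, the gate
never opens, and the old tasks are never killed.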

There is a patch available in the ticket, and we've built our own copies of
Mesos, including that patch. However, it took us some time to figure out
that this was the issue, and we had to set up an internal build pipeline
for Mesos that includes that patch.


For MESOS-3560, we attempted to upgrade a cluster from 0.23.1 to 0.24.1 and
saw that our Mesos slaves were unable to connect to the master due to
authentication issues. We eventually figured out that we were hitting the
bug in MESOS-3560, and switched to the older newline-delimited credential
files.
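
For anyone needing the same workaround, the conversion can be sketched
roughly as below. This assumes the JSON file has a top-level
"credentials" list of objects with "principal" and "secret" keys, and
that the older format is one whitespace-separated "principal secret"
pair per line; check both assumptions against the documentation for
your Mesos version. The function name is hypothetical.

```python
import json


def json_credentials_to_text(json_text: str) -> str:
    """Convert JSON credentials to one 'principal secret' pair per line.

    Assumed input layout (verify for your Mesos version):
    {"credentials": [{"principal": "...", "secret": "..."}]}
    """
    data = json.loads(json_text)
    lines = []
    for cred in data["credentials"]:
        principal, secret = cred["principal"], cred["secret"]
        # Whitespace separates the two fields, so it cannot appear in them.
        if any(ch.isspace() for ch in principal + secret):
            raise ValueError("principal and secret must not contain whitespace")
        lines.append(f"{principal} {secret}")
    return "\n".join(lines) + "\n"
```

The returned string can then be written to the file that the
credentials option points at.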


Because the Mesos upgrade process dictates that you should never skip a
minor version, anybody using Docker, command healthchecks, and
authentication on <=0.23 will hit both of these bugs and need to patch and
work around them if they want to upgrade to anything above 0.24.
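
The "never skip a minor version" rule means every intermediate release
is a mandatory stop. A small sketch of that rule, assuming versions of
the form "major.minor[.patch]" within a single major version (the
function name is hypothetical):

```python
from typing import List


def upgrade_path(current: str, target: str) -> List[str]:
    """List the minor versions to step through, in order, without skipping."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major:
        raise ValueError("this sketch only handles a single major version")
    if tgt_minor < cur_minor:
        raise ValueError("target is older than the current version")
    # Every minor version after the current one, up to and including the target.
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]
```

For example, upgrade_path("0.23.1", "0.26.0") includes both "0.24" and
"0.25", so a cluster on 0.23 cannot avoid those releases on its way up.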

While we've worked around both of these issues, it's really frustrating
that these issues have both been fixed for several months, but neither fix
was released for 0.24 or 0.25. I'm pushing for these patches to be released
(along with any other unreleased bugfixes that might be scattered around
JIRA) so that anybody else doing this upgrade doesn't need to feel the same
pain that we did.


> Thanks,
>
> MPark
>
