On Sun, Mar 13, 2016 at 3:46 PM, Michael Park <[email protected]> wrote:
> Hi Evan,
>
> If we wanted to backport those patches as well, we should cut a 0.25.2.
> I would first like to understand what kind of issues you're running into.
> Do you mind elaborating a little?
>

For MESOS-3738, we ran into the issue pretty much directly. We run Mesos with the Docker executor, and mesos healthchecks stopped reporting correctly. Our automation (built on Marathon) waits for healthcheck results to be available before considering a task "healthy" and killing old tasks, so this bug manifested as deploys getting stuck and never killing old tasks. There is a patch available in the ticket, and we've built our own copies of Mesos, including that patch. However, it took us some time to figure out that this was the issue, and we had to set up an internal build pipeline for Mesos that includes that patch.

For MESOS-3560, we attempted to upgrade a cluster from 0.23.1 to 0.24.1 and saw that our Mesos slaves were unable to connect to the master due to authentication issues. We eventually figured out that we were hitting the bug in MESOS-3560, and switched to the older newline-delimited credential files.

Because the Mesos upgrade process dictates that you should never skip a minor version, anybody using Docker, command healthchecks, and authentication on <=0.23 will hit both of these bugs and need to patch and work around them if they want to upgrade to anything above 0.24.

While we've worked around both of these issues, it's really frustrating that these issues have both been fixed for several months, but neither fix was released for 0.24 or 0.25. I'm pushing for these patches to be released (along with any other unreleased bugfixes that might be scattered around JIRA) so that anybody else doing this upgrade doesn't need to feel the same pain that we did.

> Thanks,
>
> MPark
>

