Thanks Nikolay - I checked that the frameworkId in zookeeper
(/marathon/state/frameworkId) matched the one attached to the running tasks,
gave the old marathon leader a restart, and everything reconnected OK.

(We did have to disable our service discovery pieces to avoid getting empty
JSON back when marathon first booted, but other than that everything is
peachy.)

On 18 May 2015 at 15:31, Nikolay Borodachev <[email protected]> wrote:
> Have you tried to restart Marathon and Mesos processes? Once you restart them
> they should pick zookeepers, elect leaders, etc.
> If you're using Docker containers, they should reattach themselves to the
> respective slaves.
>
> Thanks
> Nikolay
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Dick Davies
> Sent: Monday, May 18, 2015 5:26 AM
> To: [email protected]
> Subject: cluster confusion after zookeeper blip
>
> We run a 3 node marathon cluster on top of 3 mesos masters + 6 slaves.
> (mesos 0.21.0, marathon 0.7.5)
>
> This morning we had a network outage long enough for everything to lose
> zookeeper.
> Now our marathon UI is empty (all 3 marathons think someone else is a master,
> and marathon's 'proxy to leader' feature means the REST API is toast).
>
> Odd thing is, at the mesos level, the mesos master UI shows no tasks running
> (logs mention orphaned tasks), but if I click into the 'slaves' tab and dig
> down, the slave view details tasks that are in fact active.
>
> Any way to bring order to this without needing to kill those tasks? We have
> no actual outage from a user point of view, but the cluster itself is pretty
> confused and our service discovery relies on the marathon API, which is timing
> out.
>
> Although mesos has checkpointing enabled, marathon isn't running with
> checkpointing on (it's the default now but doesn't apply to existing
> frameworks apparently, and we started this around marathon 0.4.x).
>
> Would enabling checkpointing help with this kind of issue? If so, how do I
> enable it for an existing framework?
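
For anyone hitting the same thing: the frameworkId comparison mentioned at the top of this reply can be sketched roughly as below. This is only an illustration in Python, not what was actually run. The function names are made up, and the JSON layout (frameworks -> executors -> tasks) is an assumption about what a slave's state output looks like, so check it against your own slave before relying on it. Reading the ID out of zookeeper is left out; it is passed in as a plain string.

```python
import json


def tasks_by_framework(slave_state):
    """Map each framework id in a slave state document to its task ids.

    Assumes a layout of frameworks -> executors -> tasks, each with an "id".
    """
    result = {}
    for framework in slave_state.get("frameworks", []):
        task_ids = []
        for executor in framework.get("executors", []):
            for task in executor.get("tasks", []):
                task_ids.append(task.get("id"))
        result[framework.get("id")] = task_ids
    return result


def mismatched_tasks(slave_state, expected_framework_id):
    """Return tasks that belong to a framework other than the expected one."""
    return {
        framework_id: task_ids
        for framework_id, task_ids in tasks_by_framework(slave_state).items()
        if framework_id != expected_framework_id and task_ids
    }


if __name__ == "__main__":
    # Hypothetical sample data standing in for a saved copy of the slave state.
    sample = json.loads(
        '{"frameworks": [{"id": "fw-A", "executors": '
        '[{"tasks": [{"id": "app.1"}, {"id": "app.2"}]}]}]}'
    )
    # "fw-A" stands in for the value read from /marathon/state/frameworkId.
    leftovers = mismatched_tasks(sample, "fw-A")
    print("all tasks match" if not leftovers else leftovers)
```

If the result is empty, every running task on that slave is attached to the framework ID marathon will re-register with, which is the situation where restarting the marathon leader let everything reconnect here.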

