Have you tried restarting the Marathon and Mesos processes? Once you restart
them they should reconnect to ZooKeeper, re-elect leaders, etc.
If you're using Docker containers, they should reattach themselves to their
respective slaves.
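
If you want to double-check the recovery afterwards, each master's state
endpoint should show who it believes is the elected leader and how many slaves
have re-registered. A rough sketch (hostnames, the default port 5050, the
/master/state.json endpoint and field names are all assumptions based on
Mesos 0.21-era output):

    # Sketch: poll each Mesos master's state endpoint after the restart.
    # Hostnames, port and field names ("leader", "activated_slaves") are
    # assumptions, not guaranteed to match your setup.
    import json
    import urllib.request

    MASTERS = ["master1", "master2", "master3"]  # placeholder hostnames

    for host in MASTERS:
        url = "http://{}:5050/master/state.json".format(host)
        try:
            state = json.load(urllib.request.urlopen(url, timeout=5))
        except OSError as exc:
            print("{}: unreachable ({})".format(host, exc))
            continue
        print("{}: leader={} activated_slaves={}".format(
            host, state.get("leader"), state.get("activated_slaves")))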

Thanks
Nikolay

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Dick Davies
Sent: Monday, May 18, 2015 5:26 AM
To: [email protected]
Subject: cluster confusion after zookeeper blip

We run a 3-node Marathon cluster on top of 3 Mesos masters + 6 slaves
(Mesos 0.21.0, Marathon 0.7.5).

This morning we had a network outage long enough for everything to lose
ZooKeeper.
Now our Marathon UI is empty (all 3 Marathons think someone else is the leader,
and Marathon's 'proxy to leader' feature means the REST API is toast).
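
For reference, asking each node who it thinks the leader is shows the
confusion directly. A rough sketch, assuming Marathon's HTTP API on port 8080
and a /v2/leader endpoint (which may not exist in 0.7.5; /v2/info also reports
leadership in some versions) - hostnames are placeholders:

    # Sketch: ask each Marathon node who it currently believes the leader is.
    # Port 8080 and the /v2/leader endpoint are assumptions; older Marathon
    # versions may only expose leadership via /v2/info.
    import json
    import urllib.request

    MARATHONS = ["marathon1", "marathon2", "marathon3"]  # placeholder hostnames

    for host in MARATHONS:
        url = "http://{}:8080/v2/leader".format(host)
        try:
            reply = json.load(urllib.request.urlopen(url, timeout=5))
            print("{} says the leader is {}".format(host, reply.get("leader")))
        except OSError as exc:
            # A node proxying to a dead or unknown leader tends to time out.
            print("{}: no answer ({})".format(host, exc))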

Odd thing is, at the Mesos level the master UI shows no tasks running (the
logs mention orphaned tasks), but if I click into the 'Slaves' tab and dig
down, the slave view lists tasks that are in fact active.

Is there any way to bring order to this without killing those tasks? We have
no actual outage from a user's point of view, but the cluster itself is pretty
confused, and our service discovery relies on the Marathon API, which is
timing out.

Although Mesos has checkpointing enabled, Marathon isn't running with
checkpointing on (it's the default now, but apparently that doesn't apply to
existing frameworks, and we started this cluster around Marathon 0.4.x).

Would enabling checkpointing help with this kind of issue? If so, how do I
enable it for an existing framework?
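
(For reference, whether the framework actually registered with checkpointing
should be visible in the master's state output. A rough sketch - the
per-framework "checkpoint" field is an assumption from Mesos 0.21-era
state.json, and "leading-master" is a placeholder:

    # Sketch: check whether each registered framework has checkpointing on.
    # Assumes /master/state.json exposes a per-framework "checkpoint" boolean.
    import json
    import urllib.request

    state = json.load(urllib.request.urlopen(
        "http://leading-master:5050/master/state.json", timeout=5))
    for fw in state.get("frameworks", []):
        print("{} (id {}): checkpoint={}".format(
            fw.get("name"), fw.get("id"), fw.get("checkpoint")))

My understanding is that Marathon has a --checkpoint command-line flag to
register with checkpointing, but whether Mesos honours a changed checkpoint
setting for an already-registered framework is exactly the question above.)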
