(I'm running Mesos 0.16.0 and Marathon 0.4.0.) Every day or two, the Mesos slaves lose touch with the master and disconnect, which causes all of the services running on all of the slaves to be redeployed and restarted. The only thing I see in the slave logs at these times is something like:

W0409 12:32:27.347270 22523 group.cpp:435] Timed out waiting to reconnect to ZooKeeper (sessionId=1446fc9b27d00b7)
F0409 12:32:42.366143 22523 zookeeper.cpp:195] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

I'm not sure where to begin troubleshooting this. I will be upgrading to Mesos 0.17.0 and Marathon 0.4.1, in case that matters.

Any pointers would be appreciated!

;ted

Ted M. Young
Guidewire Software - DevOps

