Hi Tom,

There's only one hostname right now and it's a static entry in the DNS, so unless there's some DNS weirdness going on (anything's possible), it's always resolving properly.

Also, I'm only getting the error once every day or three, so it could be something going on somewhere on the network, but I'm not sure where to look next.
Thanks,
;ted

From: Thomas Petr [mailto:[email protected]]
Sent: Wednesday, April 09, 2014 4:19 PM
To: [email protected]
Subject: Re: Mesos slaves disconnecting because of Zookeeper?

Hey Ted,

Could you check your zk connection string and ensure that all the hostnames resolve correctly? When I've hit that error in the past it was due to zookeeper failing to resolve a hostname (in my case, for an EC2 instance that was deleted).

Thanks,
Tom

On Wed, Apr 9, 2014 at 7:09 PM, Ted Young <[email protected]> wrote:

(I'm running mesos 0.16.0 and marathon 0.4.0)

Every day or two, I'm seeing the mesos slaves lose touch with the master and disconnect (causing all of the services running on all of the slaves to be redeployed and restarted).

The only thing I'm seeing in the logs at these times (on the slaves) is something like:

W0409 12:32:27.347270 22523 group.cpp:435] Timed out waiting to reconnect to ZooKeeper (sessionId=1446fc9b27d00b7)
F0409 12:32:42.366143 22523 zookeeper.cpp:195] Failed to create ZooKeeper, zookeeper_init: No such file or directory [2]

I'm not sure where to begin troubleshooting this. I will be upgrading to mesos 0.17.0 and marathon 0.4.1 in case that matters.

Any pointers would be appreciated!

;ted

__________________________________________________________
Ted M. Young
Guidewire Software - DevOps
Tel: +1 650 357 5291
[email protected] | www.guidewire.com
1001 E. Hillsdale Blvd, Suite 800, Foster City, CA 94404
Deliver insurance your way with flexible software products from Guidewire.
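
For anyone following the thread: the check Tom suggests (confirming that every hostname in the zk connection string resolves) could be scripted roughly as below. This is only a sketch, not something from the original messages. It assumes a connection string of the form `zk://host1:port1,host2:port2/path`, and the function names `parse_zk_hosts` and `unresolved_hosts` as well as the example hostname are made up for illustration. IPv6 literals are not handled.

```python
import socket
import sys

DEFAULT_ZK_PORT = 2181  # ZooKeeper's usual client port, used when none is given


def parse_zk_hosts(conn):
    """Return a list of (host, port) pairs from a zk connection string.

    Accepts 'zk://h1:2181,h2:2181/mesos' or a bare 'h1:2181,h2:2181/chroot'.
    Hypothetical helper; IPv6 literals are not handled.
    """
    if conn.startswith("zk://"):
        conn = conn[len("zk://"):]
    hosts_part = conn.split("/", 1)[0]      # drop the trailing znode path
    if "@" in hosts_part:                   # drop optional user:password@
        hosts_part = hosts_part.rsplit("@", 1)[1]
    pairs = []
    for entry in hosts_part.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:                         # no explicit port in this entry
            host, port = entry, str(DEFAULT_ZK_PORT)
        pairs.append((host, int(port)))
    return pairs


def unresolved_hosts(conn, resolver=socket.getaddrinfo):
    """Return (host, error message) for every host that fails to resolve."""
    failures = []
    for host, port in parse_zk_hosts(conn):
        try:
            resolver(host, port)
        except socket.gaierror as exc:
            failures.append((host, str(exc)))
    return failures


if __name__ == "__main__":
    # Example value only; pass the real connection string as the first argument.
    conn = sys.argv[1] if len(sys.argv) > 1 else "zk://zk1.example.com:2181/mesos"
    bad = unresolved_hosts(conn)
    for host, err in bad:
        print("cannot resolve %s: %s" % (host, err))
    sys.exit(1 if bad else 0)
```

Since the failure described here only shows up every day or so, a single run proves little. Running something like this from cron on one of the slaves and logging the failures with timestamps might show whether resolution hiccups line up with the disconnects.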

