Thanks for bringing this up! This is part of the ZK C library. We have seen failing slaves with sporadic DNS lookup failures in our clusters.
After speaking to a ZK expert, I believe one of the things going into 3.5.0 is the ability to only need to resolve one of the zk hosts correctly, as you said: https://issues.apache.org/jira/browse/ZOOKEEPER-107

But I'm unfamiliar with the details of that ticket or what they ended up going forward with after all of the discussion.

On Mon, Jul 28, 2014 at 11:07 PM, Itamar Ostricher <[email protected]> wrote:

> Hi,
>
> I experimented today running mesos masters & slaves with multiple masters
> using zookeeper, by editing the /etc/mesos/zk file on all nodes (masters
> and slaves) to something like:
> zk://master1:2181,master2:2181,master3:2181/mesos
>
> I noticed that if not all masters are up when a master or slave mesos
> service is started, I get an error of the form:
>
> F0729 05:45:55.244169 2019 zookeeper.cpp:103] Failed to create ZooKeeper,
> zookeeper_init: No such file or directory [2]
> Googling the error I found a previous related thread [1], in which Thomas
> says that this happens when zookeeper is unable to resolve one of the
> hostnames.
> Indeed, when I changed the zk string to contain only masters that are up,
> it worked fine.
>
> My question is, how can this be a requirement? (and why?)
> The whole point of zookeeper is to allow high-availability when some of
> the masters are down, so naturally in such cases their hostnames will not
> be resolved...
> Is this something that occurs in mesos itself, or something in zookeeper?
>
> [1]
> http://mail-archives.apache.org/mod_mbox/mesos-user/201404.mbox/%3ccajrb3thcjbhd1bqjb0oevkqpawmst9-yxaqwrqo9rgft45x...@mail.gmail.com%3E
>
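
Since the failure discussed in this thread comes down to one hostname in the zk string not resolving, a quick pre-flight check may help narrow down which host is the culprit before starting a master or slave. The following is only a rough sketch, not anything Mesos or ZooKeeper ships: the function names `parse_zk_url` and `unresolvable_hosts` are hypothetical, the parser assumes the plain `zk://host:port,host:port/path` form shown in the thread (no `user:pass@` credentials), and it falls back to ZooKeeper's default client port 2181 when a port is omitted.

```python
import socket


def parse_zk_url(zk_url):
    """Split 'zk://h1:p1,h2:p2/path' into ([(host, port), ...], path).

    Hypothetical helper: handles only the simple form used in this thread.
    """
    prefix = "zk://"
    if not zk_url.startswith(prefix):
        raise ValueError("expected a zk:// URL")
    hosts_part, _, path = zk_url[len(prefix):].partition("/")
    servers = []
    for entry in hosts_part.split(","):
        host, _, port = entry.partition(":")
        # Assumption: fall back to ZooKeeper's default client port.
        servers.append((host, int(port) if port else 2181))
    return servers, "/" + path


def unresolvable_hosts(zk_url, resolve=socket.getaddrinfo):
    """Return the hostnames in zk_url that fail DNS resolution.

    `resolve` is injectable so the check can be exercised without a network.
    """
    servers, _ = parse_zk_url(zk_url)
    failed = []
    for host, port in servers:
        try:
            resolve(host, port)
        except socket.gaierror:
            failed.append(host)
    return failed
```

Running `unresolvable_hosts` against the contents of /etc/mesos/zk on a node should list any hostnames that node cannot currently resolve. It only checks name resolution, not whether ZooKeeper is actually reachable on those hosts.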

