I figured out the reason. I had configured Mesos with internal IPs, but I see that ZooKeeper broadcasts the external (public) network IP to the masters/slaves. The firewall was blocking traffic on the public IP, so that's the problem.
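For anyone who runs into the same symptom, one way to confirm this kind of firewall problem is to probe the advertised address from another node. The following is only a minimal sketch under stated assumptions: the function names `can_connect` and `check_hosts` are made up for illustration, the addresses in the usage comment are placeholders, and 5050 is the default Mesos master port (adjust it if your setup uses a different one).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_hosts(hosts, port, timeout=3.0):
    """Probe each host on the given port and return {host: reachable}."""
    return {host: can_connect(host, port, timeout) for host in hosts}


# Hypothetical usage, run from one of the other nodes:
#   check_hosts(["10.0.0.11", "203.0.113.11"], 5050)
# where the first address stands in for a master's internal IP and the
# second for its public IP. If the internal address is reachable and the
# public one is not, a firewall on the public interface is a likely cause.
```

A plain TCP connect only shows whether the port is reachable; it does not tell you which address the master actually advertises.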
Is it correct for ZooKeeper to broadcast the public IPs? That would be understandable for cases where we extend the cluster beyond the local network.

On Thu, Oct 15, 2015 at 6:49 PM, Ahmet Emre Aladağ <[email protected]> wrote:

> Hi all,
>
> I'm trying to build a mesos cluster with mesosphere 0.25.
>
> When I run 3 mesos-master node with QUORUM=2, one is elected as the
> leader, 1 minute later the leader gives the error messages below, then
> restarts. Upon restart, they make another election. They keep electing one
> another in a loop, consistently failing, restarting and re-electing. If I
> set QUORUM=1, leader becomes stable. But slaves can't connect masters. What
> could be the reason for this connection problem?
>
> Marathon console thinks node 1 is the leader although mesos panel shows
> node 3 is the leader.
>
> I also tried running slaves on the same nodes as masters but they
> encountered the same error and slaves are not recognized by the masters.
>
> Thanks,
>
> MASTER ERRORS:
>
> E1015 11:50:35.539562 19150 socket.hpp:174] Shutdown failed on fd=25: Transport endpoint is not connected [107]
> E1015 11:50:35.539897 19150 socket.hpp:174] Shutdown failed on fd=24: Transport endpoint is not connected [107]
>
> SLAVE ERRORS:
>
> E1015 15:17:53.232672 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]
> E1015 15:18:01.424705 25191 socket.hpp:174] Shutdown failed on fd=11: Transport endpoint is not connected [107]
> E1015 15:19:09.392596 25191 socket.hpp:174] Shutdown failed on fd=12: Transport endpoint is not connected [107]
> W1015 15:19:09.392750 25185 slave.cpp:3187] Master disconnected! Waiting for a new master to be elected
> E1015 15:21:21.104575 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]
> E1015 15:23:31.664559 25191 socket.hpp:174] Shutdown failed on fd=10: Transport endpoint is not connected [107]

