Hi David,

Very interesting; this may explain some of the issues I've been having. I was having a lot of trouble with Hadoop jobs not finishing because (it turns out) the slaves couldn't talk to each other. I eventually solved this by configuring iptables on the slaves to accept all connections on ports 1024-65535 from the other slaves. I guess I'll have to do something similar between the masters, right? Do you know what the real port ranges should be? I'm not a huge fan of opening things up that much...
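A sketch of what I'd rather do, with made-up numbers: as I understand it, libprocess-based components honor the LIBPROCESS_PORT environment variable, so pinning marathon to one known port should let me open just that port instead of the whole ephemeral range. The subnet and port below are placeholders, not values from my cluster, and the rule is only echoed here rather than applied (applying it needs root):

```shell
# Sketch only: the subnet and port are placeholder assumptions.
# LIBPROCESS_PORT pins the port libprocess binds to, instead of it
# picking an ephemeral port at startup.
export LIBPROCESS_PORT=9090          # hypothetical fixed port for marathon
MASTER_NET="10.3.0.0/24"             # hypothetical subnet holding the masters

# Then a single rule per master suffices; echoed instead of applied,
# since applying it requires root:
RULE="iptables -A INPUT -p tcp -s $MASTER_NET --dport $LIBPROCESS_PORT -j ACCEPT"
echo "$RULE"
```

That would be a lot narrower than accepting the whole 1024-65535 range.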
Cheers,
Alex

On 02/04/2015 09:22 PM, David Kesler wrote:
> I've been playing around with marathon and mesos recently and I was
> encountering a bunch of weird, inconsistent behavior with marathon. It turns
> out that some overly-strict iptables rules were blocking traffic between the
> mesos master and the ephemeral port of the marathon framework leader (unless
> by chance they were on the same box).
>
> The net result is that mesos would constantly spam re-registration requests,
> think that they succeeded, then disconnect the framework since it couldn't
> connect. Mesos would mark the framework as active in the UI and successfully
> registered (although the re-registered time was continuously updated).
> During this time, the mesos leader's logs contained tons of entries of the
> form:
>
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611101
> 12534 master.cpp:1573] Re-registering framework
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at
> [email protected]:58021
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611127
> 12534 master.cpp:1602] Framework 20141111-001826-924320522-5050-26663-0000
> (marathon-0.7.6) at
> [email protected]:58021 failed over
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611335
> 12534 hierarchical_allocator_process.hpp:375] Activated framework
> 20141111-001826-924320522-5050-26663-0000
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611882
> 12534 master.cpp:3843] Sending 4 offers to framework
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at
> [email protected]:58021
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612428
> 12529 master.cpp:789] Framework 20141111-001826-924320522-5050-26663-0000
> (marathon-0.7.6) at
> [email protected]:58021 disconnected
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612452
> 12529 master.cpp:1752] Disconnecting framework
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at
> [email protected]:58021
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612463
> 12529 master.cpp:1768] Deactivating framework
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at
> [email protected]:58021
> Feb 4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612586
> 12530 hierarchical_allocator_process.hpp:405] Deactivated framework
> 20141111-001826-924320522-5050-26663-0000
>
> Where 10.3.0.57 was the box hosting the marathon leader.
>
> I've posted this as an issue in marathon's github
> (https://github.com/mesosphere/marathon/issues/1140), but I also wanted to
> post here because mesos doesn't seem to handle the case where it cannot
> successfully connect to a framework. (Obviously mesos handling this better
> wouldn't fix the issues that crop up in marathon, but it'd be nice if mesos
> gave some indication that it's not actually able to successfully
> communicate with a framework.)

