Hi David,

Very interesting; this may explain some of the problems I've been seeing.
I had a lot of trouble with Hadoop jobs not finishing because (it turned
out) the slaves couldn't talk to each other. I eventually solved that by
configuring iptables on the slaves to accept all connections on ports
1024-65535 from the other slaves. I guess I'll have to do something
similar between the masters, right? Do you know what the real port
ranges should be? I'm not a huge fan of opening things up that much...
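
For reference, what I put on each slave was roughly this (just a sketch;
SLAVE_IP is a placeholder for each of the other slaves' addresses):

    # accept anything from the other slaves on the unprivileged port range
    iptables -A INPUT -p tcp -s SLAVE_IP --dport 1024:65535 -j ACCEPT

I'm assuming the masters would need an equivalent rule, ideally with a
narrower range.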

Cheers,
Alex


On 02/04/2015 09:22 PM, David Kesler wrote:
> I've been playing around with marathon and mesos recently and I was 
> encountering a bunch of weird, inconsistent behavior with marathon.  It turns 
> out that some overly-strict iptables rules were blocking traffic between the 
> mesos master and the ephemeral port of the marathon framework leader (unless 
> by chance they were on the same box).
>
> The net result was a constant spam of re-registration requests: mesos would 
> think each one succeeded, then disconnect the framework since it couldn't 
> actually connect back.  Mesos would mark the framework as active and 
> successfully registered in the UI (although the re-registered time kept 
> getting updated).  During this time, the mesos leader's logs contained tons 
> of entries of the form:
>
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611101 
> 12534 master.cpp:1573] Re-registering framework 
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6)  at 
> [email protected]:58021
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611127 
> 12534 master.cpp:1602] Framework 20141111-001826-924320522-5050-26663-0000 
> (marathon-0.7.6) at 
> [email protected]:58021 failed over
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611335 
> 12534 hierarchical_allocator_process.hpp:375] Activated framework 
> 20141111-001826-924320522-5050-26663-0000
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.611882 
> 12534 master.cpp:3843] Sending 4 offers to framework 
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at 
> [email protected]:58021
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612428 
> 12529 master.cpp:789] Framework 20141111-001826-924320522-5050-26663-0000 
> (marathon-0.7.6) at 
> [email protected]:58021 disconnected
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612452 
> 12529 master.cpp:1752] Disconnecting framework 
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at 
> [email protected]:58021
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612463 
> 12529 master.cpp:1768] Deactivating framework 
> 20141111-001826-924320522-5050-26663-0000 (marathon-0.7.6) at 
> [email protected]:58021
> Feb  4 12:57:04 dev-mesos-master1 mesos-master[12510]: I0204 12:57:04.612586 
> 12530 hierarchical_allocator_process.hpp:405] Deactivated framework 
> 20141111-001826-924320522-5050-26663-0000
>
> Where 10.3.0.57 was the box hosting the marathon leader.
>
> I've posted this as an issue in marathon's github 
> (https://github.com/mesosphere/marathon/issues/1140), but I also wanted to 
> post here, since mesos doesn't seem to handle the case where it cannot 
> successfully connect to a framework.  (Obviously mesos handling this better 
> wouldn't fix the issues that crop up in marathon, but it'd be nice if mesos 
> gave some indication that it's not actually able to communicate with the 
> framework.)
>
>
