The earlier email about running a scheduler outside the Mesos cluster reminded 
me of an issue we encountered last week that I thought might bite other Mesos 
users (and I'm wondering if checks could be added to make it more readily 
apparent in the logs). The symptom was a driver node seeing its authentication 
attempt time out after sending the initial message. The root cause was a bad 
/etc/hosts entry.

We were testing enabling authentication for our Mesos cluster as part of 
deploying Mesos 0.19. One of our drivers wasn't able to connect to the master, 
and the failure seemed to occur during the handshake. Logs showed the client 
sending out the initial message and the master responding, but nothing past 
that. Some Wireshark-ing showed us that the initial message from the framework 
contained "libprocess/authenticatee(1)@127.0.0.1" (which we traced to a bad 
/etc/hosts entry on the driver node). So the Mesos master (which was running 
on a different host) dutifully replied to that address, and that is where the 
process went off the rails.
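For anyone who wants to check for the same problem: the pattern on our driver 
node was essentially the machine's own hostname sitting on the loopback line 
of /etc/hosts. The hostnames and addresses below are made up for illustration:

    # bad: the driver's hostname resolves to 127.0.0.1
    127.0.0.1   localhost driver-host01

    # good: loopback stays loopback, the real address gets its own line
    127.0.0.1   localhost
    10.0.0.15   driver-host01

Running "getent hosts $(hostname)" on the driver node is a quick way to see 
which address the hostname resolves to, which (at least in our case) is what 
ended up in the libprocess address.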

In hindsight, we realized that this Spark warning message spelled out the issue:
"WARN Utils: Your hostname, xxxx resolves to a loopback address: 127.0.0.1; 
using yyy.yyy.yyy.yyy instead (on interface eth0)”

I was wondering if it would be possible (in the libmesos library) to detect 
that a framework is advertising a loopback address and either try to use a 
different (more sensible) interface (akin to what Spark does) or log very 
prominently that it is sending out a loopback address. This all assumes that 
the master isn't also using the loopback address (I can see that being a valid 
setup for single-host use).
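To make the suggestion concrete, here is a rough sketch of the kind of check I 
have in mind. It uses plain POSIX calls rather than anything from the actual 
Mesos/libprocess code base, and all the names in it are my own: resolve the 
local hostname, and if it comes back as a loopback address, either pick the 
first non-loopback interface (roughly what Spark does) or at least log a loud 
warning.

// Sketch only: not libprocess code. Checks whether the local hostname
// resolves to a loopback address and, if so, looks for a non-loopback
// interface that could be advertised instead.
#include <arpa/inet.h>
#include <ifaddrs.h>
#include <netdb.h>
#include <netinet/in.h>
#include <unistd.h>

#include <cstring>
#include <iostream>
#include <string>

// Returns true if addr is in 127.0.0.0/8.
static bool isLoopback(const in_addr& addr) {
  return (ntohl(addr.s_addr) >> 24) == 127;
}

// Returns true if the machine's hostname resolves to a loopback address.
static bool hostnameResolvesToLoopback() {
  char hostname[256];
  if (gethostname(hostname, sizeof(hostname)) != 0) {
    return false;
  }

  struct addrinfo hints;
  memset(&hints, 0, sizeof(hints));
  hints.ai_family = AF_INET;

  struct addrinfo* result = nullptr;
  if (getaddrinfo(hostname, nullptr, &hints, &result) != 0) {
    return false;
  }

  bool loopback = false;
  for (struct addrinfo* ai = result; ai != nullptr; ai = ai->ai_next) {
    if (isLoopback(((struct sockaddr_in*) ai->ai_addr)->sin_addr)) {
      loopback = true;
    }
  }
  freeaddrinfo(result);
  return loopback;
}

// Returns the first non-loopback IPv4 address on any interface, or "".
static std::string firstNonLoopbackAddress() {
  struct ifaddrs* ifaddr = nullptr;
  if (getifaddrs(&ifaddr) != 0) {
    return "";
  }

  std::string address;
  for (struct ifaddrs* ifa = ifaddr; ifa != nullptr; ifa = ifa->ifa_next) {
    if (ifa->ifa_addr == nullptr || ifa->ifa_addr->sa_family != AF_INET) {
      continue;
    }
    const in_addr& addr = ((struct sockaddr_in*) ifa->ifa_addr)->sin_addr;
    if (isLoopback(addr)) {
      continue;  // skip lo, etc.
    }
    char buffer[INET_ADDRSTRLEN];
    inet_ntop(AF_INET, &addr, buffer, sizeof(buffer));
    address = buffer;
    break;
  }
  freeifaddrs(ifaddr);
  return address;
}

int main() {
  if (hostnameResolvesToLoopback()) {
    const std::string fallback = firstNonLoopbackAddress();
    if (!fallback.empty()) {
      std::cerr << "WARNING: hostname resolves to a loopback address; "
                << "advertising " << fallback << " instead" << std::endl;
    } else {
      std::cerr << "WARNING: hostname resolves to a loopback address and no "
                << "other interface was found; remote masters cannot reply"
                << std::endl;
    }
  }
  return 0;
}

The real implementation would need to account for deliberately loopback-only 
setups (the single-host case above), but even the warning alone would have 
pointed us at the problem much sooner.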

Best Regards,
-Joe Buck
