Terra Field created MESOS-10005:
-----------------------------------

             Summary: Optimize Framework Broadcasts
                 Key: MESOS-10005
                 URL: https://issues.apache.org/jira/browse/MESOS-10005
             Project: Mesos
          Issue Type: Improvement
          Components: master
    Affects Versions: 1.9.0
         Environment: Ubuntu Bionic 18.04, Mesos 1.9.0 on the master, Mesos 
1.4.1 on the agents. Spark 2.1.1 is the primary framework running.
            Reporter: Terra Field
         Attachments: mesos-master.log.gz, mesos-master.stacks - 2 - 1.9.0.gz, 
mesos-master.stacks - 3 - 1.9.0.gz, mesos-master.stacks - 5 - new healthy 
master.gz

We have ~100 frameworks connected to our Mesos master at any given time, with 
agents spread across anywhere from 6,000 to 11,000 EC2 instances. We've been 
encountering a crash (which I'll document separately), and when it happens, the 
new Mesos master will sometimes (but not always) struggle to catch up and 
eventually crash again. Usually the third or fourth crash ends with a stable 
master (not ideal, but at least we eventually get back to stable).

Looking over the logs, I'm seeing hundreds of attempts per second to contact 
dead agents (and presumably many more contacts with healthy agents that don't 
log an error):
{noformat}
W1003 21:39:39.299998  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.103.99:5051', connect: Failed to connect to 100.82.103.99:5051: Connection refused
W1003 21:39:39.300143  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.122.190:5051', connect: Failed to connect to 100.85.122.190:5051: Connection refused
W1003 21:39:39.300285  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.85.84.187:5051', connect: Failed to connect to 100.85.84.187:5051: Connection refused
W1003 21:39:39.302122  8618 process.cpp:1917] Failed to send 'mesos.internal.UpdateFrameworkMessage' to '100.82.163.228:5051', connect: Failed to connect to 100.82.163.228:5051: Connection refused{noformat}
I gave [~bmahler] a perf trace of the master on Slack at this point, and it 
looks like the master is spending a significant amount of time broadcasting 
framework updates. I'll attach the perf dump to this ticket, as well as the 
log of what the master did while it was alive.

It sounds like currently, every framework update (100 frameworks in our case) 
results in a broadcast to all 6,000-11,000 agents (depending on how busy the 
cluster is), so a failover where every framework re-registers can queue on the 
order of 100 x 11,000 = 1.1 million messages. Also, since our health checks 
currently rely on the UI, we usually end up killing the master because it 
fails health checks for long stretches while it is overwhelmed by these 
broadcasts.
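
To put rough numbers on the fan-out, here is a minimal, hypothetical C++ 
sketch of that behavior; the Agent/FrameworkUpdate types and send() are 
illustrative stand-ins, not the actual master code:
{noformat}
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Illustrative stand-ins; not actual Mesos master types.
struct Agent { std::string pid; };          // e.g. "100.82.103.99:5051"
struct FrameworkUpdate { std::string id; };

std::size_t sent = 0;

// Stand-in for libprocess's send(): count instead of writing to sockets.
void send(const Agent&, const FrameworkUpdate&) { ++sent; }

int main() {
  std::vector<Agent> agents(11000);           // upper end of our cluster
  std::vector<FrameworkUpdate> updates(100);  // ~100 frameworks re-register

  // Each framework update fans out to every registered agent, so a
  // failover queues frameworks * agents sends in quick succession.
  for (const FrameworkUpdate& update : updates) {
    for (const Agent& agent : agents) {
      send(agent, update);
    }
  }

  std::cout << sent << " UpdateFrameworkMessage sends queued\n";  // 1100000
}
{noformat}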

Could optimizations be made to either throttle these broadcasts or to target 
only the agents that actually need a given framework update?
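
For the targeted option, a sketch assuming the master can derive a 
framework-to-agents index from the tasks/executors it already tracks (again, 
the types and send() below are hypothetical stand-ins, not the master's real 
data structures):
{noformat}
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>
#include <unordered_set>

// Hypothetical stand-ins; not actual Mesos master types.
using FrameworkID = std::string;
using AgentPID = std::string;

std::size_t sent = 0;
void send(const AgentPID&, const FrameworkID&) { ++sent; }

int main() {
  // Assumed index from each framework to the agents currently running its
  // tasks or executors; something equivalent should be derivable from the
  // per-agent state the master already keeps.
  std::unordered_map<FrameworkID, std::unordered_set<AgentPID>> agentsFor = {
      {"spark-a", {"100.82.103.99:5051", "100.85.122.190:5051"}},
      {"spark-b", {"100.85.84.187:5051"}},
  };

  for (const auto& [framework, agents] : agentsFor) {
    for (const AgentPID& agent : agents) {
      send(agent, framework);  // the other ~11,000 idle agents are skipped
    }
  }

  std::cout << sent << " targeted sends instead of frameworks * agents\n";
}
{noformat}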


