We have been running Mesos on AWS since around Mesos 0.15. Running hundreds of agent nodes isn't an issue at all. We currently autoscale the agent cluster (from a few to several hundred nodes) based on usage, using a custom framework built on the Fenzo library. We run both batch and service-style workloads. I am glad to provide additional info if you have specific questions.

We use 3 Mesos masters, spread across 3 zones of an AWS region. Existing infrastructure provides a 5-node ZooKeeper cluster that we use for leader election.

We leverage existing monitoring tools at Netflix, mostly based on Atlas. A few alerts, such as no ZK leader for a while or no resource offers for too long, tie into PagerDuty. Other alerts are at a higher level, based on the expected behavior of our framework scheduler.

Since we deploy immutable AMIs, our Mesos master upgrades involve deploying a new ASG with upgraded Mesos masters and then destroying the old ASG. Agent upgrades also involve bringing up new ASGs, with coordinated drain-off or job migration. This strategy mostly works with ease, except when there is a breaking change across versions (e.g., a new master can't talk to an old agent, or vice versa). This has happened once so far, when the ZK node content changed from protobuf to JSON. Additional thought will be needed after Mesos goes 1.0 and defines long-term version compatibility/stability more formally. I understand this strategy may not appeal to environments with strict caps on the total number of instances.

Our Mesos agent command line contains several custom attributes that provide parameters such as the EC2 instance zone, instanceId, instance type, etc. These are useful for any constraints that jobs can put on task placement.

Our framework runs multiple instances with leader election. We register the framework with Mesos with a long (1 week) timeout for re-registration, to allow for any delays in re-registering.

On Sat, Jan 9, 2016 at 11:27 PM, lwq Adolph <kenan3...@gmail.com> wrote:
> Hi everyone:
> My future mesos cluster will be at least 100 nodes.So optimization of
> mesos is important.May you share your experience on using mesos in
> production environment.It can contain following topics:
> 1. monitor tools of mesos cluster
> 2. optimization of mesos parameters
>
> Thanks very much
>
> --
> Thanks & Best Regards
> 卢文泉 | Adolph Lu
> TEL:+86 15651006559
> Linker Networks(http://www.linkernetworks.com/)
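
To make the point about custom agent attributes a bit more concrete, here is a minimal sketch of how attributes could drive a placement constraint. It is not our scheduler code. It assumes the attributes are passed to the agent as semicolon-separated key:value text pairs, and the helper names (parse_attributes, satisfies) as well as the zone, instance ID and instance type values are made up for illustration. A real framework would normally read the attributes from the resource offer rather than parse a string, and Mesos attributes can also be scalars or ranges, which this sketch ignores.

```python
def parse_attributes(flag_value):
    """Parse 'key:value;key:value' text attributes into a dict (text values only)."""
    attributes = {}
    for pair in flag_value.split(";"):
        pair = pair.strip()
        if not pair:
            continue
        # Split on the first ':' only, so values may themselves contain ':'.
        key, _, value = pair.partition(":")
        attributes[key.strip()] = value.strip()
    return attributes


def satisfies(attributes, constraints):
    """Return True if every constrained attribute has one of the allowed values."""
    return all(attributes.get(key) in allowed for key, allowed in constraints.items())


# Hypothetical attribute string for one agent; names and values are invented.
agent = parse_attributes("zone:us-east-1a;instanceId:i-0abc123;instanceType:m4.xlarge")

# Hypothetical job constraint: only these zones, only this instance type.
job_constraints = {
    "zone": {"us-east-1a", "us-east-1b"},
    "instanceType": {"m4.xlarge"},
}

print(satisfies(agent, job_constraints))
```

A scheduler could run a check like this against each offer and decline offers from agents that do not match the job's constraints.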