We have been running Mesos on AWS since around Mesos 0.15. Running hundreds
of agent nodes isn't an issue at all. We currently autoscale the agent
cluster (from a few to several hundred nodes) based on usage, using a
custom framework that uses the Fenzo library. We run batch and service
style workloads. I am glad to provide additional info if you have specific
questions.

We use 3 Mesos masters (spread across 3 zones of an AWS region). Existing
infrastructure provides a 5-node ZooKeeper cluster that we use for leader
election.

We leverage existing monitoring tools at Netflix, mostly based on Atlas. We
have a few alerts, such as no ZK leader for a while or no resource offers
for too long, that tie into PagerDuty. Other alerts are at a higher level,
based on the expected behavior of our framework scheduler.
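
As a rough illustration of the "no resource offers for too long" style of
alert, here is a minimal Python sketch. It is not our actual Atlas or
PagerDuty integration; the function name and the 5 minute threshold are
assumptions made up for the example.

```python
def offers_stale(last_offer_ts, now, max_gap_seconds=300.0):
    """Return True when the gap since the last resource offer is too long.

    last_offer_ts and now are timestamps in seconds (e.g. from time.time()).
    max_gap_seconds is a hypothetical threshold; tune it to how often your
    scheduler normally expects to see offers.
    """
    return (now - last_offer_ts) > max_gap_seconds


# Example: the last offer arrived 10 minutes ago, so the check fires.
should_page = offers_stale(last_offer_ts=1000.0, now=1600.0)
```

The scheduler would record the timestamp of each offer callback, and a
periodic check like this would feed whatever alerting system is in place.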

Since we deploy immutable AMIs, our Mesos master upgrades involve deploying
a new ASG with upgraded Mesos masters and then destroying the old ASG. Agent
upgrades also involve bringing up new ASGs with coordinated drain-off or
job migration. This strategy mostly works with ease, except when there is a
breaking change across versions (e.g., a new master can't talk to an old
agent, or vice versa). This has happened once so far, when the ZK node
content changed from protobuf to JSON. More thought will be needed after
Mesos reaches 1.0 and defines long-term version compatibility/stability
more formally. I understand this strategy may not appeal to environments
with strict caps on the total number of instances.

Our Mesos agent command line contains several custom attributes that
provide parameters such as the EC2 instance zone, instanceId, instance
type, etc. These are useful for any task placement constraints that jobs
specify.
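
To make the attribute idea concrete, here is a small Python sketch that
builds a value for the agent's --attributes flag (semicolon-separated
key:value pairs) and checks a placement constraint against it. The
attribute names and values below are hypothetical, not the ones we
actually use, and the constraint check is a simplified stand-in for what a
scheduler library would do.

```python
def build_attributes(attrs):
    """Join a dict into the 'key:value;key:value' form used by --attributes."""
    parts = []
    for key, value in attrs.items():
        text = str(value)
        # ';' separates attributes and ':' separates key from value,
        # so reject them to keep the resulting string unambiguous.
        if any(c in ";:" for c in key) or any(c in ";:" for c in text):
            raise ValueError(f"invalid attribute: {key!r}={text!r}")
        parts.append(f"{key}:{text}")
    return ";".join(parts)


def satisfies(agent_attrs, required):
    """Return True if the agent has every required attribute value."""
    return all(agent_attrs.get(k) == v for k, v in required.items())


# Hypothetical values; in practice these would come from EC2 instance metadata.
agent_attrs = {"zone": "us-east-1a", "instance_type": "m3.xlarge"}
flag = "--attributes=" + build_attributes(agent_attrs)
ok = satisfies(agent_attrs, {"zone": "us-east-1a"})
```

A job that must run in a given zone would pass that zone as a required
attribute, and the scheduler would only accept offers from matching agents.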

Our framework runs multiple instances with leader election. We use Mesos
framework registration with a long (1 week) timeout for re-registration to
account for any delays in re-registering.
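
For reference, a sketch of what the 1 week value looks like in seconds,
assuming the timeout in question is the failover_timeout field of
FrameworkInfo, which is expressed in seconds. The dict below is only a
stand-in for the protobuf message, and the framework name is hypothetical.

```python
from datetime import timedelta

# One week expressed in seconds.
FAILOVER_TIMEOUT_SECONDS = timedelta(weeks=1).total_seconds()

# Plain dict standing in for a FrameworkInfo message (illustration only).
framework_info = {
    "name": "example-framework",  # hypothetical name
    "failover_timeout": FAILOVER_TIMEOUT_SECONDS,
}
```

With a long timeout like this, the master keeps the framework's tasks
running while a newly elected scheduler instance re-registers.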


On Sat, Jan 9, 2016 at 11:27 PM, lwq Adolph <kenan3...@gmail.com> wrote:

> Hi everyone:
>  My future mesos cluster will be at least 100 nodes.So optimization of
> mesos is important.May you share your experience on using mesos in
> production environment.It can contain following topics:
> 1. monitor tools of mesos cluster
> 2. optimization of mesos parameters
>
> Thanks very much
>
> --
> Thanks & Best Regards
> 卢文泉 | Adolph Lu
> TEL:+86 15651006559
> Linker Networks(http://www.linkernetworks.com/)
>
