Aurora Operations

Erb, Stephan Fri, 16 Dec 2016 08:34:34 -0800

Hi Aurorans,

I would like to start a discussion about Aurora operations and gather feedback 
on how Aurora is configured and operated at your site. The main goal is to come 
up with a set of guidelines that help new users get up to speed, and to find 
ways to improve our default configuration and documentation.


Of course, this is kind of a difficult endeavor. Still, I believe this can 
significantly help Aurora's positioning as one of the most scalable and 
battle-tested Mesos frameworks.

I will start with a small collection to get the discussion going:

# General Advice

* Aurora requires a ZK ensemble for leader election. This ensemble should not 
also be used for service discovery. Otherwise a service discovery error/outage 
can take down the entire cluster. The same applies to the Mesos ZK.
* For fast and consistent performance, transaction logs should be on distinct 
disks not used by anything else (e.g. not even logging). SSDs help as well. This 
applies to the ZK transaction log and the native/replicated log used by Aurora.
* If you have made an operator error in your cluster (e.g. agents do not come 
up anymore after a configuration change), stopping the Mesos masters is a safe 
step to limit the error propagation.

(Disclaimer: these are from this excellent talk 
https://www.youtube.com/watch?v=nNrh-gdu9m4)
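
The advice on distinct disks can be partially checked with a few lines of 
Python. This is only a rough sketch and not something from the talk: the 
helper name `shared_devices` is made up, and the log directories you pass in 
are whatever your own setup uses. It compares the device ID (`st_dev`) of each 
directory. Different IDs only mean different filesystems, which may still be 
partitions of the same physical disk, so a clean result is necessary but not 
sufficient.

```python
import os
from itertools import combinations


def shared_devices(paths):
    """Return all pairs of paths that report the same device ID (st_dev).

    Hypothetical helper: a non-empty result means at least two of the
    given directories live on the same filesystem.
    """
    devices = {path: os.stat(path).st_dev for path in paths}
    return [
        (a, b)
        for a, b in combinations(paths, 2)
        if devices[a] == devices[b]
    ]
```

Call it with the ZK transaction log directory, the Aurora replicated log 
directory and your general logging directory. Any pair it returns is worth a 
closer look.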

# Aurora Configuration
Just a small collection of what we are using internally or what I have seen 
elsewhere:

* Thermos resources: The current defaults for CPU and RAM usage are invasive. 
`-thermos_executor_cpu=0` and `-thermos_executor_ram=128MB` seem to work just 
as well, in particular since the Mesos egg got slimmer in recent releases.
* Session timeouts: The default timeout is pretty small (4sec) and can lead to 
unexpected failovers during long GC pauses. A default of 10-15sec seems to be 
more appropriate.
* JVM settings: Either `-XX:+UseG1GC -XX:+UseStringDeduplication` or 
`-XX:+UseConcMarkSweepGC` seems to be a sane default. The option 
`-Djava.net.preferIPv4Stack=true` seems to make sense in most cases as well.
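
To make the settings above easier to compare across sites, here is a minimal 
sketch that collects them in one place and renders them as command-line 
arguments. The function names are made up for illustration. The flag name 
`-zk_session_timeout` and its `15secs` value format are my assumption for the 
session timeout mentioned above, so please check them against the scheduler 
configuration reference of your Aurora version. How the arguments reach the 
scheduler process depends on your packaging and is left out.

```python
# Scheduler flags taken from the list above.
SCHEDULER_FLAGS = {
    "thermos_executor_cpu": "0",
    "thermos_executor_ram": "128MB",
    # Assumption: flag name and unit format for the ZK session timeout.
    "zk_session_timeout": "15secs",
}

# JVM options from the list above (G1 variant).
JVM_OPTIONS = [
    "-XX:+UseG1GC",
    "-XX:+UseStringDeduplication",
    "-Djava.net.preferIPv4Stack=true",
]


def render_flags(flags):
    """Turn a name -> value mapping into '-name=value' arguments."""
    return [f"-{name}={value}" for name, value in flags.items()]


def scheduler_args(small_cluster=False):
    """Return the scheduler arguments for the settings above.

    With small_cluster=True the offer filter is disabled as well.
    """
    flags = dict(SCHEDULER_FLAGS)
    if small_cluster:
        flags["offer_filter_duration"] = "0secs"
    return render_flags(flags)


if __name__ == "__main__":
    print(" ".join(JVM_OPTIONS))
    print(" ".join(scheduler_args()))
```

Keeping the values in a single mapping makes it simple to diff one site's 
settings against another's, which is what this thread is after.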

# Open Questions

* What is the best way to configure and use Aurora in a multi-framework setup?
* Are there options we recommend for smaller clusters (<100 nodes or <5000 
tasks)? For example, `-offer_filter_duration=0secs` improves scheduling 
performance on small clusters.
* Are there options we recommend for larger clusters (>1000 nodes)?

I am looking forward to your contributions.

Thanks,
Stephan
