Hi Aurorans,

I would like to start a discussion about Aurora operations and gather feedback on how Aurora is configured and operated at your site. The main goal is to come up with a set of guidelines that help new users get up to speed, and to find ways to improve our default configuration and documentation.
Of course, this is a somewhat difficult endeavor. Still, I believe it can significantly help Aurora's positioning as one of the most scalable and battle-tested Mesos frameworks. I will start with a small collection to get the discussion going:

# General Advice

* Aurora requires a ZK ensemble for leader election. This ensemble should not also be used for service discovery. Otherwise a service discovery error/outage can take down the entire cluster. The same applies to the Mesos ZK.
* For fast and consistent performance, transaction logs should be on distinct disks not used by anything else (e.g. not even logging). SSDs help as well. This applies to the ZK transaction log and the native/replicated log used by Aurora.
* If you have made an operator error in your cluster (e.g. agents do not come up anymore after a configuration change), stopping the Mesos masters is a safe step to limit error propagation.

(Disclaimer: these are from this excellent talk: https://www.youtube.com/watch?v=nNrh-gdu9m4)

# Aurora Configuration

Just a small collection from what we are using internally or what I have seen elsewhere:

* Thermos resources: The current defaults for CPU and RAM are more invasive than necessary. `-thermos_executor_cpu=0` and `-thermos_executor_ram=128MB` seem to work just as well, in particular since the Mesos egg got slimmer in recent releases.
* Session timeouts: The default timeout is pretty small (4sec) and can lead to unexpected failovers during long GC pauses. A default of 10-15sec seems to be more appropriate.
* JVM settings: Either `-XX:+UseG1GC -XX:+UseStringDeduplication` or `-XX:+UseConcMarkSweepGC` seem to be sane defaults. The option `-Djava.net.preferIPv4Stack=true` seems to make sense in most cases as well.

# Open Questions

* What is the best way to configure and use Aurora in a multi-framework setup?
* Are there options we recommend for smaller clusters (<100 nodes or <5000 tasks)? For example, `-offer_filter_duration=0secs` improves scheduling performance on small clusters.
* Are there options we recommend for larger clusters (>1000)?

I am looking forward to your contributions.

Thanks,
Stephan
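
P.S. To make the configuration suggestions more concrete, here is a rough sketch of how they might be combined in a scheduler start script. It is only an illustration, not a recommended default: the variable names `JAVA_OPTS`, `AURORA_FLAGS` and `SMALL_CLUSTER` are my own placeholders, I am assuming the session timeout is set via the scheduler's `-zk_session_timeout` flag, and the script only prints the resulting command line instead of starting anything.

```shell
#!/bin/sh
# Sketch only: collects the settings discussed in this mail and prints them.
# JAVA_OPTS, AURORA_FLAGS and SMALL_CLUSTER are illustrative names.

# JVM settings: G1 with string deduplication, IPv4 preferred.
# (-XX:+UseConcMarkSweepGC would be the alternative to the two G1 options.)
JAVA_OPTS="-XX:+UseG1GC -XX:+UseStringDeduplication -Djava.net.preferIPv4Stack=true"

# Slimmer Thermos executor reservation.
AURORA_FLAGS="-thermos_executor_cpu=0 -thermos_executor_ram=128MB"

# Longer ZK session timeout to ride out GC pauses
# (assumption: this is what -zk_session_timeout controls; default is 4secs).
AURORA_FLAGS="$AURORA_FLAGS -zk_session_timeout=15secs"

# Only for small clusters (<100 nodes or <5000 tasks).
SMALL_CLUSTER="${SMALL_CLUSTER:-false}"
if [ "$SMALL_CLUSTER" = "true" ]; then
  AURORA_FLAGS="$AURORA_FLAGS -offer_filter_duration=0secs"
fi

# Print instead of exec, so the sketch is safe to run anywhere.
printf 'JAVA_OPTS=%s\n' "$JAVA_OPTS"
printf 'aurora-scheduler %s\n' "$AURORA_FLAGS"
```

Running it with `SMALL_CLUSTER=true` in the environment appends the offer filter setting; all other required scheduler flags (cluster name, ZK endpoints, log paths, ...) are intentionally left out.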
