Hello! Off the top of my head: we increased the history_prune_threshold value from 2 days to a week, in case some debugging is required. We also increased transient_task_state_timeout, because some of our tasks need more time in the "killing" state, and we rely on that. Also, depending on your needs, the max_schedule_penalty default of 1 min could be too much; we reduced it to 20 seconds.
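As a rough sketch, the two changes above that have concrete values would look something like this among the scheduler's command-line flags. The value syntax (`7days`, `20secs`) is an assumption based on the `0secs` style used elsewhere in this thread, so check it against your Aurora version. transient_task_state_timeout is left out because no specific value is given here.

```shell
# Fragment only: append to your existing scheduler flags.
# Unit suffixes are assumed to follow the `-offer_filter_duration=0secs` style.
-history_prune_threshold=7days   # raised from 2 days to a week
-max_schedule_penalty=20secs     # lowered from the 1 min default
```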
On Tue, Dec 27, 2016 at 2:06 PM, Erb, Stephan <[email protected]> wrote:

> Does anyone else have input here? Any Aurora configuration option with a non-default value in your setup is worth sharing here.
>
> Questions are welcome as well, so that we can hopefully try to answer those in the upcoming operations guide.
>
> Best regards,
> Stephan
>
> *From: *Zameer Manji [email protected]
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Saturday, 17 December 2016 at 00:39
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Aurora Operations
>
> For larger clusters it might be necessary to increase `-db_max_active_connection_count` to increase API throughput. You will also need to ensure Xmx=Xms when starting the JVM. Setting those flags to be the same prevents heap resizing.
>
> We should consider changing the defaults to match your research. Increasing the session timeout and lowering the thermos/executor resources to 0 should be done.
>
> To make Aurora play nicely with other frameworks, I suggest lowering `min_offer_hold_time` to 1min or 30s. We can also lower the default here. A lower value means Aurora will have more latency when scheduling tasks (since it will have to wait for offers) but it will enable other frameworks to have resources. I also suggest using the Mesos operator tooling to dynamically reserve some minimum amount of resources to Aurora and other frameworks to ensure that they are not starved entirely.
>
> I'm surprised setting offer_filter_duration to 0s improves performance but that should be something to note.
>
> On Fri, Dec 16, 2016 at 8:33 AM, Erb, Stephan <[email protected]> wrote:
>
> Hi Aurorans,
>
> I would like to start a discussion about Aurora operations and gather feedback on how Aurora is configured and operated at your site. The main goal is to come up with a set of guidelines that help new users get up to speed, and to find ways to improve our default configuration and documentation.
>
> Of course, this is kind of a difficult endeavor. Still, I believe this can significantly help Aurora's positioning as one of the most scalable and battle-tested Mesos frameworks.
>
> I will start with a small collection to get the discussion going:
>
> # General Advice
>
> * Aurora requires a ZK ensemble for leader election. This ensemble should not also be used for service discovery. Otherwise a service discovery error/outage can take down the entire cluster. The same applies to the Mesos ZK.
>
> * For fast and consistent performance, transaction logs should be on distinct disks not used by anything else (e.g. not even logging). SSDs help as well. This applies to the ZK transaction log and the native/replicated log used by Aurora.
>
> * If you have made an operator error in your cluster, stopping the Mesos masters is a safe step to limit the error propagation (e.g. agents do not come up anymore after a configuration change).
>
> (Disclaimer: these are from this excellent talk https://www.youtube.com/watch?v=nNrh-gdu9m4)
>
> # Aurora Configuration
>
> Just a small collection from what we are using internally or what I have seen elsewhere
>
> * Thermos resources: The current defaults of CPU and RAM usage are invasive. `-thermos_executor_cpu=0` and `-thermos_executor_ram=128MB` seem to work just as well, in particular since the Mesos egg got slimmer in recent releases.
>
> * Session timeouts: The default timeout is pretty small (4sec) and can lead to unexpected failovers during long GC pauses. A default of 10-15sec seems to be more appropriate.
>
> * JVM settings: Either `-XX:+UseG1GC -XX:+UseStringDeduplication` or `-XX:+UseConcMarkSweepGC` seem to be sane defaults. The option `-Djava.net.preferIPv4Stack=true` seems to make sense in most cases as well.
>
> # Open Questions:
>
> * What is the best way to configure and use Aurora in a multi-framework setup?
>
> * Are there options we recommend for smaller clusters (<100 nodes or <5000 tasks)? For example, `-offer_filter_duration=0secs` improves scheduling performance on small clusters.
>
> * Are there options we recommend for larger clusters (>1000)?
>
> I am looking forward to your contributions.
>
> Thanks,
> Stephan
>
> --
> Zameer Manji
