Re: Aurora Operations

Mauricio Garavaglia Tue, 27 Dec 2016 09:29:29 -0800

Hello!
Off the top of my head: we increased the history_prune_threshold value from
2 days to a week in case some debugging is required. We also increased
transient_task_state_timeout, because some tasks need more time in the
"killing" state, and we count on that. Also, depending on your needs, the
max_schedule_penalty default of 1 min could be too much. We reduced that to
20 seconds.
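
As a rough illustration, the overrides above could be rendered into scheduler
arguments as sketched below. The helper `render_flags` is hypothetical, and
the `<amount><unit>` value spelling is an assumption modelled on the
`-offer_filter_duration=0secs` form quoted further down in this thread (the
`days` suffix in particular is assumed). No value is given above for
transient_task_state_timeout, so it is left out.

```python
# Hypothetical helper: turns a mapping of overrides into `-name=value` arguments.
# The value spelling ("7days", "20secs") is an assumption, see the note above.

def render_flags(overrides):
    """Return the overrides as a sorted list of `-name=value` strings."""
    return ["-{}={}".format(name, value) for name, value in sorted(overrides.items())]


overrides = {
    "history_prune_threshold": "7days",  # raised from the 2-day default
    "max_schedule_penalty": "20secs",    # lowered from the 1-minute default
    # transient_task_state_timeout was raised too, but no value is stated,
    # so it is intentionally not listed here.
}

flags = render_flags(overrides)
print(" ".join(flags))
```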


On Tue, Dec 27, 2016 at 2:06 PM, Erb, Stephan <[email protected]>
wrote:

> Does anyone else have input here? Any Aurora configuration option with a
> non-default value in your setup is worth sharing here.
>
>
>
> Questions are welcome as well, so that we can hopefully try to answer
> those in the upcoming operations guide.
>
>
>
> Best regards,
>
> Stephan
>
>
>
> *From: *Zameer Manji [email protected]
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Saturday, 17 December 2016 at 00:39
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Aurora Operations
>
>
>
> For larger clusters it might be necessary to increase
> `-db_max_active_connection_count` to increase API throughput. You will also
> need to ensure Xmx=Xms when starting the JVM. Setting those flags to be the
> same prevents heap resizing.
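
A minimal sketch of how the Xmx=Xms rule could be checked before starting the
scheduler. The function name `heap_is_fixed` and the example heap sizes are
illustrative assumptions. The check compares the spelling of the two values
only, so `-Xms2048m` and `-Xmx2g` would be reported as different.

```python
# Hypothetical pre-start check: are -Xms and -Xmx both set to the same value?

def heap_is_fixed(jvm_args):
    """Return True when -Xms and -Xmx are both present with identical values."""
    xms = [arg[4:] for arg in jvm_args if arg.startswith("-Xms")]
    xmx = [arg[4:] for arg in jvm_args if arg.startswith("-Xmx")]
    if not xms or not xmx:
        return False
    # Use the last occurrence of each flag; compare case-insensitively (2g vs 2G).
    return xms[-1].lower() == xmx[-1].lower()


print(heap_is_fixed(["-Xms2g", "-Xmx2g"]))  # sizes here are examples only
```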
>
>
>
> We should consider changing the defaults to match your research.
> Increasing the session timeout and lowering the thermos/executor resources
> to 0 should be done.
>
>
>
> To make Aurora play nicely with other frameworks, I suggest lowering
> `min_offer_hold_time` to 1min or 30s. We can also lower the default here. A
> lower value means Aurora will have more latency when scheduling tasks
> (since it will have to wait for offers) but it will enable other frameworks
> to have resources. I also suggest using the Mesos operator tooling to
> dynamically reserve some minimum amount of resources to Aurora and other
> frameworks to ensure that they are not starved entirely.
>
>
>
> I'm surprised setting offer_filter_duration to 0s improves performance but
> that should be something to note.
>
>
>
> On Fri, Dec 16, 2016 at 8:33 AM, Erb, Stephan <[email protected]>
> wrote:
>
> Hi Aurorans,
>
>
>
> I would like to start a discussion about Aurora operations and gather
> feedback on how Aurora is configured and operated at your site. The main
> goal is to come up with a set of guidelines that help new users get up to
> speed, and to find ways to improve our default configuration and
> documentation.
>
>
>
> Of course, this is kind of a difficult endeavor. Still, I believe this can
> significantly help Aurora's positioning as one of the most scalable and
> battle-tested Mesos frameworks.
>
>
>
> I will start with a small collection to get the discussion going:
>
>
>
> # General Advice
>
>
>
> * Aurora requires a ZK ensemble for leader election. This ensemble should
> not also be used for service discovery. Otherwise a service discovery
> error/outage can take down the entire cluster. The same applies for the
> Mesos ZK.
>
> * For fast and consistent performance, transaction logs should be on
> distinct disks not used by anything else (e.g. not even logging). SSDs help
> as well. This applies to the ZK transaction log and the native/replicated
> log used by Aurora.
>
> * If you have made an operator error in your cluster, stopping the Mesos
> masters is a safe step to limit the error propagation (e.g. agents do not
> come up anymore after a configuration change).
>
>
>
> (Disclaimer: these are from this excellent talk
> https://www.youtube.com/watch?v=nNrh-gdu9m4)
>
>
>
> # Aurora Configuration
>
> Just a small collection from what we are using internally or what I have
> seen elsewhere:
>
>
>
> * Thermos resources: The current defaults of CPU and RAM usage are
> invasive. `-thermos_executor_cpu=0` and `-thermos_executor_ram=128MB` seem
> to work just as well, in particular since the Mesos egg got slimmer in
> recent releases.
>
> * Session timeouts: The default timeout is pretty small (4sec) and can
> lead to unexpected failovers during long GC pauses. A default of 10-15sec
> seems to be more appropriate.
>
> * JVM settings: Either `-XX:+UseG1GC -XX:+UseStringDeduplication` or
> `-XX:+UseConcMarkSweepGC` seem to be sane defaults. The option
> `-Djava.net.preferIPv4Stack=true` seems to make sense in most cases as
> well.
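
The options listed above could be collected roughly as follows. This is only
a sketch: the names `JVM_GC_CHOICES`, `jvm_options` and `THERMOS_FLAGS` are
hypothetical, and the session timeout is omitted because the thread does not
name the flag that controls it.

```python
# Hypothetical grouping of the options from the bullets above.

JVM_GC_CHOICES = {
    "g1": ["-XX:+UseG1GC", "-XX:+UseStringDeduplication"],
    "cms": ["-XX:+UseConcMarkSweepGC"],
}

# Scheduler flags for the Thermos executor resources, as quoted above.
THERMOS_FLAGS = ["-thermos_executor_cpu=0", "-thermos_executor_ram=128MB"]


def jvm_options(gc="g1", prefer_ipv4=True):
    """Return the JVM options for one of the two suggested collectors."""
    options = list(JVM_GC_CHOICES[gc])  # copy so the table is never mutated
    if prefer_ipv4:
        options.append("-Djava.net.preferIPv4Stack=true")
    return options


print(" ".join(jvm_options("g1")))
print(" ".join(THERMOS_FLAGS))
```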
>
>
>
> # Open Questions:
>
>
>
> * What is the best way to configure and use Aurora in a multi-framework
> setup?
>
> * Are there options we recommend for smaller clusters (<100 nodes or <5000
> tasks)? For example, `-offer_filter_duration=0secs` improves scheduling
> performance on small clusters.
>
> * Are there options we recommend for larger clusters (>1000)?
>
>
>
> I am looking forward to your contributions.
>
>
>
> Thanks,
>
> Stephan
>
>
>
>
>
> --
>
> Zameer Manji
>
