I think more explicit signaling is better. Increased latency can be due to other conditions, like network issues. Right now our mitigation involves load-shedding, and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
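For illustration, here is a minimal sketch of what load-avoidance could look like on the upstream side if Aurora (or a thin front end) returned an explicit 503, optionally with a Retry-After header, while it is overloaded. The endpoint, the status-code contract, and the header are assumptions for the sake of the example, not an existing Aurora API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.OptionalLong;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical upstream caller that backs off when the scheduler signals overload
    // with a 503 (and, optionally, a Retry-After header). Illustrative only.
    public final class BackoffCaller {
      private static final HttpClient HTTP = HttpClient.newHttpClient();

      static HttpResponse<String> postWithBackoff(URI uri, String jsonBody) throws Exception {
        long delayMs = 500;  // initial backoff
        for (int attempt = 1; attempt <= 6; attempt++) {
          HttpRequest request = HttpRequest.newBuilder(uri)
              .timeout(Duration.ofSeconds(10))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
              .build();
          HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
          if (response.statusCode() != 503) {
            return response;  // success, or an error that retrying will not fix
          }
          // Prefer an explicit Retry-After hint; otherwise back off exponentially with jitter.
          OptionalLong retryAfterSec = response.headers().firstValueAsLong("Retry-After");
          long waitMs = retryAfterSec.isPresent() ? retryAfterSec.getAsLong() * 1000 : delayMs;
          Thread.sleep(waitMs + ThreadLocalRandom.current().nextLong(250));
          delayMs = Math.min(delayMs * 2, 30_000);
        }
        throw new IllegalStateException("scheduler still shedding load after retries");
      }
    }

The point of the explicit signal is that the client never sends the work the scheduler would otherwise have to shed, which is the load-avoidance behavior described above.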
On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <[email protected]> wrote:

> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>
> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>
> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>
>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:
>>
>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>
>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>
>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>
>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:
>>>>
>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>
>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>> 2. scheduler replicated log on slow disks
>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>
>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>
>>>>> commit f2755e1
>>>>> Author: Bill Farner <[email protected]>
>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>
>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>
>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
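Since log_storage_write_lock_wait_ns_total is a monotonically increasing counter of nanoseconds spent waiting, sampling it periodically and looking at the delta per interval tells you roughly how much of the wall-clock time writers spent blocked on the storage lock. A rough sketch, assuming the scheduler exports its counters as plain "name value" lines at /vars on port 8081 (adjust the URL and parsing to however your deployment exposes stats):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.net.URL;

    // Polls a monotonically increasing wait-time counter and prints how much waiting
    // accumulated per sampling interval. The /vars URL and stat name follow the thread;
    // treat both as assumptions about your deployment.
    public final class LockWaitWatcher {
      private static final String STAT = "log_storage_write_lock_wait_ns_total";

      static long scrape(String varsUrl) throws Exception {
        URL url = URI.create(varsUrl).toURL();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(STAT)) {
              return Long.parseLong(parts[1]);
            }
          }
        }
        throw new IllegalStateException(STAT + " not found at " + varsUrl);
      }

      public static void main(String[] args) throws Exception {
        String varsUrl = args.length > 0 ? args[0] : "http://localhost:8081/vars";
        long intervalMs = 10_000;
        long previous = scrape(varsUrl);
        while (true) {
          Thread.sleep(intervalMs);
          long current = scrape(varsUrl);
          double waitedMs = (current - previous) / 1_000_000.0;
          // e.g. 9500 ms of accumulated waiting inside a 10s window means writers were
          // blocked on the storage lock for most of the interval.
          System.out.printf("waited %.1f ms on the storage write lock in the last %d ms%n",
              waitedMs, intervalMs);
          previous = current;
        }
      }
    }

A sustained high delta is a symptom of contention; whether it is cause or effect of the slowness still depends on what is holding the lock.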
>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>
>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>
>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>
>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>
>>>>>>>> Thanks. I will check that one.
>>>>>>>>
>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>
>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>
>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to the many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>
>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>
>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both of these effects occur?
>>>>>>>>
>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>
>>>>>>>>>> version close to 0.18.0
>>>>>>>>>
>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>
>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>
>>>>>>>> I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>
>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
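If it is unclear which threads are holding or waiting on the storage write lock (or why only 8-10 of 24 cores are busy), a thread dump with lock-owner information is usually the quickest way to see it. A minimal in-process sketch using the standard ThreadMXBean; running jstack against the scheduler PID gives the same information without any code changes:

    import java.lang.management.LockInfo;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Dumps every thread along with the monitors/synchronizers it owns or is blocked on.
    // Useful for spotting a long-held write lock and the threads queued behind it.
    public final class LockDump {
      public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // lockedMonitors=true, lockedSynchronizers=true so ReentrantLock-style owners show up.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
          System.out.printf("%s state=%s%n", info.getThreadName(), info.getThreadState());
          if (info.getLockInfo() != null) {
            System.out.printf("  waiting on %s held by %s%n",
                info.getLockInfo(), info.getLockOwnerName());
          }
          for (LockInfo owned : info.getLockedSynchronizers()) {
            System.out.printf("  owns %s%n", owned);
          }
        }
      }
    }

Many threads blocked behind a single owner points at lock contention; mostly-runnable but few active threads points more toward thread-pool starvation.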
>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Folks,
>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed either due to locking or thread starvation.
>>>>>>>>>>>
>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>
>>>>>>>>>>> Mohit.
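On the "what should I be looking for" question for the enqueued-offers theory: one generic option is to wrap the executor that drains the suspect work queue so each submitted task records how long it sat in the queue before running. This is a sketch of the idea only, not Aurora's actual OfferManager wiring:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Wraps an executor so every task reports its queue-wait time. Long waits with idle
    // CPUs point at lock contention or a starved pool rather than slow task bodies.
    public final class QueueWaitExecutor {
      private final ExecutorService delegate;
      private final AtomicLong maxWaitNanos = new AtomicLong();

      public QueueWaitExecutor(ExecutorService delegate) {
        this.delegate = delegate;
      }

      public void submit(Runnable task) {
        long enqueuedAt = System.nanoTime();
        delegate.execute(() -> {
          long waited = System.nanoTime() - enqueuedAt;
          maxWaitNanos.accumulateAndGet(waited, Math::max);
          if (waited > TimeUnit.SECONDS.toNanos(5)) {
            System.err.printf("task waited %d ms in queue before running%n",
                TimeUnit.NANOSECONDS.toMillis(waited));
          }
          task.run();
        });
      }

      public long maxWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxWaitNanos.get());
      }

      public static void main(String[] args) throws InterruptedException {
        // Toy usage: a single-threaded pool with one slow task in front makes the wait visible.
        QueueWaitExecutor executor = new QueueWaitExecutor(Executors.newSingleThreadExecutor());
        executor.submit(() -> { try { Thread.sleep(6000); } catch (InterruptedException ignored) {} });
        executor.submit(() -> System.out.println("offer processed"));
        Thread.sleep(7000);
        System.out.println("max queue wait: " + executor.maxWaitMillis() + " ms");
        System.exit(0);
      }
    }

If offers show long queue waits while the API calls spike, that would support the contention theory described above.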
