Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
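To make that concrete, here is a minimal sketch of such a client-side reaction, assuming a Java 11+ HTTP client. The latency budget, retry count, and endpoint are illustrative assumptions, not anything Aurora prescribes:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch, not Aurora code: treat slow responses and 5xx the same way --
// as a signal to back off before retrying.
public class BackoffClient {
  private static final Duration LATENCY_BUDGET = Duration.ofSeconds(2); // assumed budget
  private static final int MAX_ATTEMPTS = 5;
  private final HttpClient http = HttpClient.newHttpClient();

  HttpResponse<String> sendWithBackoff(URI endpoint)
      throws IOException, InterruptedException {
    long backoffMs = 100;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      HttpRequest request = HttpRequest.newBuilder(endpoint).build();
      long start = System.nanoTime();
      HttpResponse<String> response =
          http.send(request, HttpResponse.BodyHandlers.ofString());
      Duration elapsed = Duration.ofNanos(System.nanoTime() - start);

      // A healthy, fast answer: take it and stop.
      if (response.statusCode() < 500 && elapsed.compareTo(LATENCY_BUDGET) <= 0) {
        return response;
      }
      // Otherwise the server is signaling load; wait exponentially longer.
      Thread.sleep(backoffMs);
      backoffMs *= 2;
    }
    throw new IOException("scheduler still signaling overload after "
        + MAX_ATTEMPTS + " attempts");
  }
}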
If you want to throw error codes, you can put a proxy in front of Aurora that enforces request timeouts and sends 503s to clients. The issue with that is that the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it (a sketch of that reconciliation follows the thread below).

On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Imagine something like Spinnaker using Aurora underneath to schedule
> services. That layer often "amplifies" human effort and may result in a
> lot of load on Aurora. Usually that is fine, but if Aurora slows down due
> to transient problems, it can signal that to upstream software in the
> same way that busy web servers do during Cyber Monday sales :-)
>
> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> I want to let upstream software "know" that Aurora is slowing down and
>>> that it should back off
>>
>> Can you offer more detail about how Aurora is being used in this regard?
>> I haven't seen use cases in the past that would be amenable to this
>> behavior, so I would like to understand better.
>>
>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com>
>> wrote:
>>
>>> Thanks Bill. We haven't been able to track down a specific root cause
>>> (the ZK node is known to have issues now and then, but we don't have
>>> logs for the specific outages we had). We plan to move to 0.19.x soon.
>>> In addition, I want to let upstream software "know" that Aurora is
>>> slowing down and that it should back off. To achieve this I want to
>>> send 5xx error codes back when update/rollback/kill etc. are called and
>>> certain metrics (like log write lock wait time) indicate heavy load.
>>> Perhaps this "defense" already exists?
>>>
>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>
>>>> The next level is to determine why the storage lock is being held.
>>>> Common causes include:
>>>>
>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>> 2. scheduler replicated log on slow disks
>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>> between zookeepers
>>>>
>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate
>>>> the use of SQL/mybatis in the scheduler. This helped Twitter improve
>>>> (1) and (1a).
>>>>
>>>> commit f2755e1
>>>> Author: Bill Farner <wfar...@apache.org>
>>>> Date: Tue Oct 24 23:34:09 2017 -0700
>>>>
>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>
>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>> wrote:
>>>>
>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>
>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>> wrote:
>>>>>
>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is
>>>>>> that cause or effect? :-)
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks Bill. Please see inline:
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I suspect they are getting enqueued
>>>>>>>>
>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>
>>>>>>> In one instance the offers did get through, but it took several
>>>>>>> minutes. In other instances we restarted the scheduler to let
>>>>>>> another one become the leader.
>>>>>>>
>>>>>>>> The most likely culprit is contention for the storage write lock,
>>>>>>>> observable via spikes in the stat
>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>
>>>>>>> Thanks. I will check that one.
>>>>>>>
>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>
>>>>>>>> This sounds like API activity. This shouldn't interfere with offer
>>>>>>>> processing directly, but could potentially slow down the scheduler
>>>>>>>> as a whole.
>>>>>>>
>>>>>>> So these won't contend for locks with the offer processing and task
>>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>>> machine. I also noticed a spike in mybatis active and bad
>>>>>>> connections. I can't say whether the spike in active connections was
>>>>>>> due to many bad connections, or vice versa, or whether a third
>>>>>>> source caused both. Are there any metrics or logs that might help
>>>>>>> here?
>>>>>>>
>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>> same time. Can this happen if a task is in PENDING state and does
>>>>>>>>> not reach ASSIGNED due to lack of offers?
>>>>>>>>
>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to
>>>>>>>> tasks in states where the scheduler is waiting for something else
>>>>>>>> to act and it does not hear back (via a status update).
>>>>>>>
>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>>> Mesos are being delayed or processed too slowly, would both these
>>>>>>> effects occur?
>>>>>>>
>>>>>>>> I suggest digging into the cause of delayed offer processing first;
>>>>>>>> I suspect it might be related to the task timeouts as well.
>>>>>>>>
>>>>>>>> version close to 0.18.0
>>>>>>>>
>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate
>>>>>>>> the last git SHA off aurora/master? Digging much deeper in
>>>>>>>> diagnosing this may prove tricky without knowing what code is in
>>>>>>>> play.
>>>>>>>
>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is
>>>>>>> mainly the dynamic reservation work done by Dmitri. We also have
>>>>>>> commits for the offer/rescind race issue and the setrootfs patch
>>>>>>> (which is not upstreamed yet). I have cherry-picked the fix for
>>>>>>> AURORA-1952 as well.
>>>>>>>
>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>>> same time. Can this happen if a task is in PENDING state and does
>>>>>>>>> not reach ASSIGNED due to lack of offers?
>>>>>>>>>
>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Folks,
>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>>> 0.18.0). Sometimes it shows no offers on the UI offers page, but
>>>>>>>>>> if I tail the logs I can see offers coming in. I suspect they are
>>>>>>>>>> getting enqueued for processing by the "executor" but stay there
>>>>>>>>>> for a long time and are not processed, either due to locking or
>>>>>>>>>> thread starvation.
>>>>>>>>>>
>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could
>>>>>>>>>> these calls starve the OfferManager (e.g. by contending for some
>>>>>>>>>> lock)? What should I be looking for to debug this condition
>>>>>>>>>> further?
>>>>>>>>>>
>>>>>>>>>> Mohit.
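On the reconciliation point from the first message: a 503 from a timing-out proxy means "the call timed out", not "the call did not happen", so for a non-idempotent request the client has to read state back before retrying. A minimal sketch of that logic, where SchedulerApi and both of its methods are hypothetical stand-ins for the real thrift client calls:

// Sketch of reconciliation for non-idempotent calls behind a proxy
// that converts timeouts into 503s.
public class ReconcilingClient {
  interface SchedulerApi {
    int killTasks(String jobKey, int instanceId);      // returns HTTP-style status
    boolean taskIsGone(String jobKey, int instanceId); // e.g. via a read-only call
  }

  private static final int MAX_ATTEMPTS = 3;
  private final SchedulerApi api;

  ReconcilingClient(SchedulerApi api) {
    this.api = api;
  }

  void killWithReconciliation(String jobKey, int instanceId)
      throws InterruptedException {
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      int status = api.killTasks(jobKey, instanceId);
      if (status != 503) {
        return; // the scheduler answered; nothing ambiguous to resolve
      }
      // The proxy timed out, but the scheduler may still have applied the
      // kill. Read state back before retrying the non-idempotent call.
      Thread.sleep(1_000L * attempt);
      if (api.taskIsGone(jobKey, instanceId)) {
        return; // the earlier attempt took effect after all
      }
    }
    throw new IllegalStateException(
        "kill not confirmed; needs manual reconciliation");
  }
}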
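For watching the log_storage_write_lock_wait_ns_total stat discussed above, a sketch that samples the counter and reports the delta between samples. It assumes the scheduler exports stats as plain "name value" lines on a /vars endpoint; adjust for however your deployment actually exposes metrics:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Sketch: a sustained, growing delta between samples points at storage
// write lock contention as the bottleneck.
public class WriteLockWatcher {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";
  private final HttpClient http = HttpClient.newHttpClient();
  private long previous = -1;

  // Returns nanoseconds spent waiting on the storage write lock since the
  // previous call; 0 on the first call.
  long sampleDelta(URI varsUrl) throws IOException, InterruptedException {
    HttpRequest request = HttpRequest.newBuilder(varsUrl).build();
    String body = http.send(request, HttpResponse.BodyHandlers.ofString()).body();
    for (String line : body.split("\n")) {
      if (line.startsWith(STAT + " ")) {
        long current = Long.parseLong(line.substring(STAT.length() + 1).trim());
        long delta = (previous < 0) ? 0 : current - previous;
        previous = current;
        return delta;
      }
    }
    throw new IOException(STAT + " not found at " + varsUrl);
  }
}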
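Finally, on the "locking or thread starvation" hypothesis in the original message: a thread dump shows which threads are blocked on a monitor and which thread holds it. Running jstack against the scheduler PID gives this with no code changes; the sketch below does the same programmatically, assuming it runs inside the JVM being inspected:

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Sketch: print every thread currently blocked on a monitor, together
// with the name of the lock and its current owner.
public class BlockedThreadReport {
  public static void main(String[] args) {
    ThreadMXBean threads = ManagementFactory.getThreadMXBean();
    for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
      if (info.getThreadState() == Thread.State.BLOCKED) {
        System.out.printf("%s is blocked on %s held by %s%n",
            info.getThreadName(), info.getLockName(), info.getLockOwnerName());
      }
    }
  }
}

If offer-processing threads repeatedly show up blocked on the same lock that API-serving threads hold, that would confirm the contention suspected in this thread.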