> > I want to let upstream software "know" that Aurora is slowing down and
> > that it should back off
Can you offer more detail about how Aurora is being used in this regard? I
haven't seen use cases in the past that would be amenable to this behavior,
so I would like to understand better.

On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. We haven't been able to track down a specific root cause
> (although the ZK node is known to have issues now and then, we don't have
> logs for the specific outages we had). We plan to move to 0.19.x soon. In
> addition, I want to let upstream software "know" that Aurora is slowing
> down and that it should back off. To achieve this, I want to send 5xx
> error codes back when update/rollback/kill etc. are called and certain
> metrics (like log write lock wait time) indicate heavy load. Perhaps this
> "defense" already exists?
>
> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> The next level is to determine why the storage lock is being held.
>> Common causes include:
>>
>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>> 1a. long GC pauses in the scheduler, often induced by (1)
>> 2. scheduler replicated log on slow disks
>> 3. network issues between schedulers, schedulers to zookeeper, or
>> between zookeepers
>>
>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the
>> use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and
>> (1a).
>>
>> commit f2755e1
>> Author: Bill Farner <wfar...@apache.org>
>> Date: Tue Oct 24 23:34:09 2017 -0700
>>
>>     Exclusively use Map-based in-memory stores for primary storage
>>
>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> and in log_storage_write_lock_wait_ns_per_event
>>>
>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>
>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>>> cause or effect? :-)
>>>>
>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Thanks Bill. Please see inline:
>>>>>
>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>>> I suspect they are getting enqueued
>>>>>>
>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>
>>>>> In one instance the offers did get through, but it took several
>>>>> minutes. In other instances we reloaded the scheduler to let another
>>>>> one become the leader.
>>>>>
>>>>>> The most likely culprit is contention for the storage write lock,
>>>>>> observable via spikes in the stat
>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>
>>>>> Thanks. I will check that one.
>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>
>>>>>> This sounds like API activity. This shouldn't interfere with offer
>>>>>> processing directly, but could potentially slow down the scheduler
>>>>>> as a whole.
>>>>>
>>>>> So these won't contend for locks with offer processing and task
>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>> machine. I also noticed a spike in mybatis active and bad
>>>>> connections. I can't say whether the spike in active connections was
>>>>> caused by the bad connections, or vice versa, or whether a third
>>>>> source caused both. Are there any metrics or logs that might help
>>>>> here?
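A minimal sketch of how the two lock-wait stats discussed above could be
polled from outside the scheduler, assuming the plain-text /vars endpoint is
reachable and emits one "name value" pair per line; the scheduler URL and the
warning threshold below are hypothetical placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Polls the scheduler's /vars endpoint and prints the storage write-lock
// wait stats mentioned in the thread.  URL and threshold are placeholders.
public class LockWaitPoller {
  private static final String VARS_URL =
      "http://aurora-scheduler.example.com:8081/vars";
  private static final long WAIT_NS_PER_EVENT_THRESHOLD = 50_000_000L;  // 50 ms, arbitrary

  public static void main(String[] args) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(VARS_URL).openConnection();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Assumes /vars emits one "name value" pair per line.
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
          continue;
        }
        if (parts[0].equals("log_storage_write_lock_wait_ns_total")
            || parts[0].equals("log_storage_write_lock_wait_ns_per_event")) {
          System.out.println(parts[0] + " = " + parts[1]);
          if (parts[0].equals("log_storage_write_lock_wait_ns_per_event")
              && Long.parseLong(parts[1]) > WAIT_NS_PER_EVENT_THRESHOLD) {
            System.out.println("WARNING: storage write-lock wait is high");
          }
        }
      }
    }
  }
}

Graphing the per-event value over time (or exporting it to an existing
monitoring system) makes it easier to tell whether the lock waits are a cause
or an effect of the API load described above.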
>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>> same time. Can this happen if a task is in PENDING state and does
>>>>>>> not reach ASSIGNED due to lack of offers?
>>>>>>
>>>>>> This is unusual. Pending tasks are not timed out; this applies to
>>>>>> tasks in states where the scheduler is waiting for something else to
>>>>>> act and it does not hear back (via a status update).
>>>>>
>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>> Mesos are being delayed or processed too slowly, would both these
>>>>> effects occur?
>>>>>
>>>>>> I suggest digging into the cause of delayed offer processing first;
>>>>>> I suspect it might be related to the task timeouts as well.
>>>>>>
>>>>>>> version close to 0.18.0
>>>>>>
>>>>>> Is the ambiguity due to custom patches? Can you at least indicate
>>>>>> the last git SHA off aurora/master? Digging much deeper in
>>>>>> diagnosing this may prove tricky without knowing what code is in
>>>>>> play.
>>>>>
>>>>> c85bffd10 is the commit from which we forked. The custom patch is
>>>>> mainly the dynamic reservation work done by Dmitri. We also have
>>>>> commits for the offer/rescind race issue and the setrootfs patch
>>>>> (which is not upstreamed yet). I have cherry-picked the fix for
>>>>> AURORA-1952 as well.
>>>>>
>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>> same time. Can this happen if task is in PENDING state and does not
>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>> 0.18.0). Sometimes it shows no offers on the UI offers page, but
>>>>>>>> if I tail the logs I can see offers coming in. I suspect they are
>>>>>>>> getting enqueued for processing by "executor" but stay there for a
>>>>>>>> long time and are not processed, either due to locking or thread
>>>>>>>> starvation.
>>>>>>>>
>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could
>>>>>>>> these calls starve the OfferManager (e.g. by contending for some
>>>>>>>> lock)? What should I be looking for to debug this condition
>>>>>>>> further?
>>>>>>>>
>>>>>>>> Mohit.
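On the back-pressure idea at the top of the thread (returning 5xx from
update/rollback/kill while the scheduler is overloaded): the thread does not
indicate that such a defense already exists, but a rough sketch of one
possible shape is below, assuming the API can be fronted by a standard
servlet filter and that a load gauge such as the write-lock wait per event is
available in-process; the class, its wiring, and the threshold are
hypothetical:

import java.io.IOException;
import java.util.function.LongSupplier;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Hypothetical filter: reject mutating API calls with 503 while a load
// signal (e.g. storage write-lock wait per event) exceeds a threshold.
public class BackPressureFilter implements Filter {
  // Supplied by whatever exposes the stat in-process; an assumption here.
  private final LongSupplier lockWaitNsPerEvent;
  private final long maxLockWaitNsPerEvent;

  public BackPressureFilter(LongSupplier lockWaitNsPerEvent, long maxLockWaitNsPerEvent) {
    this.lockWaitNsPerEvent = lockWaitNsPerEvent;
    this.maxLockWaitNsPerEvent = maxLockWaitNsPerEvent;
  }

  @Override
  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    if (lockWaitNsPerEvent.getAsLong() > maxLockWaitNsPerEvent) {
      // The scheduler is struggling; tell callers to retry later instead of
      // queueing more work behind the storage lock.
      HttpServletResponse httpResponse = (HttpServletResponse) response;
      httpResponse.setHeader("Retry-After", "30");
      httpResponse.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler is overloaded, please back off and retry");
      return;
    }
    chain.doFilter(request, response);
  }

  @Override
  public void init(FilterConfig filterConfig) { }

  @Override
  public void destroy() { }
}

Upstream clients that already retry on 503 and honor Retry-After would then
back off automatically instead of piling more RPCs onto a scheduler that is
struggling to keep up.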