I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
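To make that concrete, here is a rough sketch of the kind of defensive pre-processor I have in mind: a plain servlet filter that sheds load with a 503 and a Retry-After header once an internal load signal crosses a threshold. This is not Aurora code; the LoadSignal hook and the threshold are placeholders for illustration, and in practice the signal could be derived from something like the log write lock wait stats discussed below.

    // Sketch only: a generic servlet filter that sheds load with 503 + Retry-After
    // when an internal load signal crosses a threshold. LoadSignal and the
    // threshold are illustrative placeholders, not Aurora classes.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    public class BackpressureFilter implements Filter {

      /** Hypothetical hook reporting how loaded the scheduler currently is. */
      public interface LoadSignal {
        long currentAvgLockWaitMillis();
      }

      private final LoadSignal load;
      private final long maxLockWaitMillis;

      public BackpressureFilter(LoadSignal load, long maxLockWaitMillis) {
        this.load = load;
        this.maxLockWaitMillis = maxLockWaitMillis;
      }

      @Override
      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        if (load.currentAvgLockWaitMillis() > maxLockWaitMillis) {
          // Refuse new work and tell the caller when to retry.
          HttpServletResponse httpResp = (HttpServletResponse) resp;
          httpResp.setHeader("Retry-After", "5");
          httpResp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
              "Scheduler is shedding load, retry later");
          return;
        }
        chain.doFilter(req, resp);
      }

      @Override
      public void init(FilterConfig config) {
        // No initialization needed for this sketch.
      }

      @Override
      public void destroy() {
        // Nothing to clean up.
      }
    }

The 503 plus Retry-After pair is the same back-off signal a busy web server would emit, which is exactly what upstream layers like Spinnaker could react to.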
If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <[email protected]> wrote:

> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>
> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <[email protected]> wrote:
>
>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>
>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <[email protected]> wrote:
>>
>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>
>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>
>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slows down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>
>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:
>>>>
>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>
>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>
>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:
>>>>>>
>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>
>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>>>
>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>
>>>>>>> commit f2755e1
>>>>>>> Author: Bill Farner <[email protected]>
>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>
>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>
>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>
>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>
>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>
>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>
>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>
>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>
>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>
>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>
>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both of these effects occur?
>>>>>>>>>>
>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>
>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>
>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper into diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>
>>>>>>>>>> - c85bffd10 is the commit from which we forked.
>>>>>>>>>> - The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>
>>>>>>>>>> I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers on the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mohit.
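P.S. For anyone who wants to keep an eye on the lock-wait stats discussed above from outside the scheduler while we dig further, here is a rough sketch of a watcher. It assumes the stats are exposed as plain name/value pairs on a /vars-style monitoring endpoint; the URL, response format, polling interval, and alert threshold are placeholders to adapt to your setup, not anything Aurora ships as-is.

    // Sketch only: poll a scheduler stats endpoint and report the average storage
    // write lock wait per event, using the stat names discussed in this thread.
    // URL, response format ("name value" per line), interval and threshold are
    // assumptions for illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.HashMap;
    import java.util.Map;

    public class LockWaitWatcher {

      public static void main(String[] args) throws Exception {
        String statsUrl = "http://aurora-scheduler.example.com:8081/vars";  // placeholder
        HttpClient client = HttpClient.newHttpClient();

        while (true) {
          HttpRequest request = HttpRequest.newBuilder(URI.create(statsUrl)).GET().build();
          String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

          // Collect the two stats mentioned in this thread.
          Map<String, Long> stats = new HashMap<>();
          for (String line : body.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].startsWith("log_storage_write_lock_wait_ns")) {
              try {
                stats.put(parts[0], Long.parseLong(parts[1]));
              } catch (NumberFormatException ignored) {
                // Non-numeric value; skip it.
              }
            }
          }

          Long perEventNs = stats.get("log_storage_write_lock_wait_ns_per_event");
          Long totalNs = stats.get("log_storage_write_lock_wait_ns_total");
          if (perEventNs != null) {
            long perEventMs = perEventNs / 1_000_000;
            System.out.println("avg storage write lock wait: " + perEventMs + " ms/event"
                + (totalNs != null ? " (total " + totalNs + " ns)" : ""));
            if (perEventMs > 500) {  // arbitrary threshold for illustration
              System.out.println("WARNING: scheduler storage lock looks contended");
            }
          }

          Thread.sleep(10_000);
        }
      }
    }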
