> > is there a place I can inject a pre-processor for the API calls?
There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests, similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
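If you go the AuroraAdmin.Iface route, something along these lines could serve as a starting point. This is only an untested sketch: a JDK dynamic proxy that runs a hook before forwarding each call, which spares you from hand-writing a forwarding method for every thrift call (the class and method names here are mine, not Aurora's):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.InvocationTargetException;
    import java.lang.reflect.Proxy;

    /** Wraps an interface (e.g. AuroraAdmin.Iface) so a hook runs before every call. */
    final class PreProcessing {
      static <T> T wrap(Class<T> iface, T delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
          // Pre-processor hook: the method name and arguments are visible here,
          // so you can log, validate, reject, or delay a call before it reaches
          // the real implementation.
          System.out.println("API call: " + method.getName());
          try {
            return method.invoke(delegate, args);
          } catch (InvocationTargetException e) {
            throw e.getCause();  // surface the delegate's own exception
          }
        };
        return iface.cast(Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] {iface}, handler));
      }
    }

You would then bind AuroraAdmin.Iface to PreProcessing.wrap(AuroraAdmin.Iface.class, realImpl) in a module of your own, in place of the stock binding; treat that wiring as pseudocode, since the exact binding annotations vary across Aurora versions.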
On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>
> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>
> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>
>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> I think more explicit signaling is better. Increased latency can be due to other conditions, like network issues. Right now our mitigation involves load-shedding, and we would rather have load-avoidance. Indeed, a proxy is not a good option: only Aurora "knows" when it wants to back-pressure.
>>>
>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>
>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>
>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts, which would send 503s to clients. But the issue with that is that the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slows down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>
>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon.
>>>>>>>
>>>>>>> In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
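To make the defense described just above concrete: a minimal sketch as a servlet filter, in the spirit of the ApiModule filter linked earlier. Nothing like this ships with Aurora today; the load signal is abstracted behind a LongSupplier because how you would read log_storage_write_lock_wait_ns_per_event in-process is an assumption here, and the threshold is illustrative:

    import java.io.IOException;
    import java.util.function.LongSupplier;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    /** Sheds API load with a 503 while the storage write lock is congested. */
    final class BackPressureFilter implements Filter {
      private final LongSupplier writeLockWaitNsPerEvent;  // stand-in for the scheduler stat
      private final long maxWaitNs;                        // back-pressure threshold

      BackPressureFilter(LongSupplier writeLockWaitNsPerEvent, long maxWaitNs) {
        this.writeLockWaitNsPerEvent = writeLockWaitNsPerEvent;
        this.maxWaitNs = maxWaitNs;
      }

      @Override
      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        if (writeLockWaitNsPerEvent.getAsLong() > maxWaitNs) {
          // Fail fast rather than queue more work behind the storage lock.
          HttpServletResponse http = (HttpServletResponse) resp;
          http.setHeader("Retry-After", "5");  // seconds; an explicit back-off hint
          http.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
              "scheduler is under heavy load, retry later");
          return;
        }
        chain.doFilter(req, resp);
      }

      @Override public void init(FilterConfig filterConfig) {}
      @Override public void destroy() {}
    }

An upstream layer (e.g. something like Spinnaker, as in the scenario above) can treat the 503 plus Retry-After as the explicit back-off signal; David's caveat about non-idempotent requests still applies to whatever retry logic sits on top.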
>>>>>>>
>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>
>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>
>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>> 3. network issues between schedulers, from schedulers to zookeeper, or between zookeepers
>>>>>>>>
>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>
>>>>>>>> commit f2755e1
>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>
>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>
>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>
>>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>
>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>
>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>
>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was caused by the bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>
>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both these effects occur?
>>>>>>>>>>>
>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>
>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>
>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>
>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patching is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet). I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers on the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mohit.