Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.
One potential correlated outage we may have had is a NIC flap on a ZooKeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC flaps?
- If only 1 of 5 ZKs has this issue, could there still be a problem?

On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>> is there a place I can inject a pre-processor for the API calls
>
> There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
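As a rough illustration of the second option above (intercepting specific calls at the Iface layer), here is a minimal sketch using a dynamic proxy that rejects only state-changing calls while an overload signal is raised. The method names and the overload check are illustrative assumptions, not Aurora's actual wiring:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Sketch: wrap a thrift Iface so that state-changing calls are refused while
// the scheduler reports itself as overloaded; read-only calls pass through.
public final class BackpressureInterceptor {
  // Assumed set of mutating RPC names; adjust to the real Iface being wrapped.
  private static final Set<String> MUTATING_METHODS = new HashSet<>(Arrays.asList(
      "createJob", "killTasks", "startJobUpdate", "pauseJobUpdate",
      "resumeJobUpdate", "abortJobUpdate", "rollbackJobUpdate", "restartShards"));

  @SuppressWarnings("unchecked")
  public static <T> T wrap(Class<T> ifaceClass, T delegate, BooleanSupplier overloaded) {
    InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
      if (MUTATING_METHODS.contains(method.getName()) && overloaded.getAsBoolean()) {
        // The boundary layer would translate this into a 5xx / error response
        // so that callers get an explicit "back off and retry later" signal.
        throw new IllegalStateException("Scheduler overloaded; retry later: " + method.getName());
      }
      return method.invoke(delegate, args);
    };
    return (T) Proxy.newProxyInstance(
        ifaceClass.getClassLoader(), new Class<?>[] {ifaceClass}, handler);
  }
}

The overload signal itself could come from the write-lock wait metrics discussed further down the thread.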
> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>>
>> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>>
>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>>
>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>>
>>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it could signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>>
>>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (although the ZK node is known to have issues now and then, we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>>>
>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>>
>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>> 3. network issues between schedulers, schedulers to ZooKeeper, or between ZooKeepers
>>>>>>>>>
>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>>
>>>>>>>>> commit f2755e1
>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>
>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>
>>>>>>>>>>>> In one instance the offers did get through but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>>
>>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I will check that one.
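Since log_storage_write_lock_wait_ns_total is a cumulative counter, the actionable signal is its rate of increase rather than its absolute value. A rough sketch of sampling it, assuming the stat is exported as a "name value" line on the scheduler's /vars endpoint (default port 8081; both details may differ in your deployment):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sample the write-lock wait counter twice and report how much of the window
// was spent waiting on the storage write lock.
public final class WriteLockWaitSampler {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";

  static long readCounter(String schedulerUrl) throws Exception {
    URL url = new URL(schedulerUrl + "/vars");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(STAT + " ")) {
          return Long.parseLong(line.substring(STAT.length() + 1).trim());
        }
      }
    }
    throw new IllegalStateException(STAT + " not found in /vars output");
  }

  public static void main(String[] args) throws Exception {
    String scheduler = args.length > 0 ? args[0] : "http://localhost:8081";
    long windowMs = 10_000;
    long before = readCounter(scheduler);
    Thread.sleep(windowMs);
    long after = readCounter(scheduler);
    // Nanoseconds of accumulated wait divided by nanoseconds of wall clock.
    double waitFraction = (after - before) / (windowMs * 1_000_000d);
    System.out.printf("storage write lock wait: %.1f%% of the last %ds%n",
        waitFraction * 100, windowMs / 1000);
  }
}

A sustained rate near or above 100% (i.e. more than one waiter on average) would be the sort of condition under which the 5xx back-pressure described earlier could kick in.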
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>>
>>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. Can't say if the spike in active is due to many bad connections, or vice versa, or there was a 3rd source causing both of these. Are there any metrics or logs that might help here?
>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, both these effects will occur?
>>>>>>>>>>>>
>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>
>>>>>>>>>>>> c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>
>>>>>>>>>>>> The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>
>>>>>>>>>>>> I have cherrypicked the fix for AURORA-1952 as well.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mohit.
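To close the loop on the back-pressure idea from the top of the thread, here is a small sketch of the client-side half: an upstream orchestrator retrying a state-changing call with exponential backoff and jitter whenever the scheduler signals overload. OverloadedException is a stand-in for "the call came back with a 503 or transient error", not an Aurora class:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Retry a call with capped exponential backoff plus full jitter, so a burst of
// amplified requests does not retry in lockstep against a loaded scheduler.
public final class BackoffRetry {
  public static <T> T callWithBackoff(Supplier<T> call, int maxAttempts)
      throws InterruptedException {
    long delayCapMs = 500;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (OverloadedException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        // Sleep a random amount up to the current cap, then double the cap.
        Thread.sleep(ThreadLocalRandom.current().nextLong(delayCapMs));
        delayCapMs = Math.min(delayCapMs * 2, 60_000);
      }
    }
  }

  /** Stand-in for "the scheduler responded with a 503 / overload error". */
  public static final class OverloadedException extends RuntimeException {}
}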