Gotcha, just making sure ;) In your situation I would just use an interceptor. For example:
https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/aop/LoggingInterceptor.java
You could intercept every API method and reject it based on the value of the stat. I'm not sure how to solve your null problem, however.

On Thu, Nov 30, 2017 at 2:57 PM, Mohit Jaggi <[email protected]> wrote:

David,
I am not planning to upstream this. I can benefit from the experience of the community anyway :-) I will be happy to share the finished code if someone wants it.

BTW, I will read the stats I need using Stats.getVariable(). In some unit tests of the thrift interface (which now go to the BackPressuring version of the class), this will return "null". Is there an alternative way to retrieve stats?

Mohit.

On Thu, Nov 30, 2017 at 2:39 PM, David McLaughlin <[email protected]> wrote:

Are you asking for a review with a view to upstreaming the patch, or just looking for feedback on an approach for your own filter?

I'm -1 for upstreaming a change like this.

On Thu, Nov 30, 2017 at 12:55 PM, Mohit Jaggi <[email protected]> wrote:

*Why*: API clients don't get any indication of Aurora "slowness". They may keep retrying and make bad things worse. If they get a "busy signal" they can back off.

*What*: Use Aurora's stats to identify "slowness". Return 5xx to API clients when they make calls like create, update, kill etc. until the "slowness" is gone.

*How*: I am planning to extend SchedulerThriftInterface and change the injection bindings accordingly. The new class will delegate all calls to the old one. I would appreciate any comments/suggestions, especially on the metrics that make sense to trigger throttling. Right now I am thinking of using the log replication lock wait time, as that correlates nicely with the kind of load that slows progress in Aurora.

class BackPressuringSchedulerThriftInterface extends SchedulerThriftInterface {
  ..

  @Inject
  BackPressuringSchedulerThriftInterface(
      ConfigurationManager configurationManager,
      Thresholds thresholds,
      NonVolatileStorage storage,
      LockManager lockManager,
      StorageBackup backup,
      Recovery recovery,
      CronJobManager cronJobManager,
      MaintenanceController maintenance,
      QuotaManager quotaManager,
      StateManager stateManager,
      UUIDGenerator uuidGenerator,
      JobUpdateController jobUpdateController,
      ReadOnlyScheduler.Iface readOnlyScheduler,
      AuditMessages auditMessages,
      TaskReconciler taskReconciler,
      StatsProvider statsProvider) {

    super(configurationManager,
        thresholds,
        storage,
        lockManager,
        backup,
        recovery,
        cronJobManager,
        maintenance,
        quotaManager,
        stateManager,
        uuidGenerator,
        jobUpdateController,
        readOnlyScheduler,
        auditMessages,
        taskReconciler,
        statsProvider);
  }
}

On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <[email protected]> wrote:

Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.

One potential correlated outage we may have had is a NIC flap on a Zookeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC flaps?
- If only 1 of 5 ZKs has this issue, could there still be a problem?
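A minimal sketch of the interceptor approach suggested at the top of the thread, as an alternative to subclassing SchedulerThriftInterface. It assumes a Guice AOP MethodInterceptor bound over the thrift API methods, the way LoggingInterceptor is bound; the class name, the load-signal supplier, and the threshold below are illustrative, not Aurora's actual API.

import java.util.function.LongSupplier;
import org.aopalliance.intercept.MethodInterceptor;
import org.aopalliance.intercept.MethodInvocation;

// Hypothetical back-pressure interceptor, loosely modeled on the role
// LoggingInterceptor plays. All names here are made up for the sketch.
class BackPressureInterceptor implements MethodInterceptor {

  // Supplier of the load signal, e.g. log write lock wait time in ns.
  // In Aurora this would presumably be wired to the stats subsystem
  // (Stats.getVariable or StatsProvider); a plain supplier keeps the
  // sketch self-contained.
  private final LongSupplier lockWaitNs;
  private final long thresholdNs;

  BackPressureInterceptor(LongSupplier lockWaitNs, long thresholdNs) {
    this.lockWaitNs = lockWaitNs;
    this.thresholdNs = thresholdNs;
  }

  @Override
  public Object invoke(MethodInvocation invocation) throws Throwable {
    // Reject calls while the scheduler looks overloaded; read-only methods
    // could be excluded by the matcher used when binding the interceptor.
    if (lockWaitNs.getAsLong() > thresholdNs) {
      throw new IllegalStateException(
          "Scheduler is under heavy load; rejecting "
              + invocation.getMethod().getName()
              + " - please back off and retry later.");
    }
    return invocation.proceed();
  }
}

Binding would presumably go through Guice's bindInterceptor(..) in the same module that wires LoggingInterceptor, matched so that only state-changing methods are covered; turning the exception into a 5xx response is left to the layer that serves the API.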
On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <[email protected]> wrote:

> is there a place I can inject a pre-processor for the API calls

There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests, similar to the CORS filter
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface
<https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.

On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <[email protected]> wrote:

I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.

If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <[email protected]> wrote:

I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.

On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <[email protected]> wrote:

I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding, and we would rather have load-avoidance. Indeed, a proxy is not a good option; only Aurora "knows" when it wants to apply back-pressure.

On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <[email protected]> wrote:

Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.

If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
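For the HTTP-level option Bill describes (intercepting API calls as HTTP requests, similar to the CORS filter registered in ApiModule), a rough sketch of a servlet filter that answers with a 503 while a load signal is above a threshold. The class name, threshold, and registration are assumptions for illustration only.

import java.io.IOException;
import java.util.function.LongSupplier;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Hypothetical HTTP-level back-pressure filter. It would need to be
// registered on the API path; that wiring is not shown here.
class BackPressureFilter implements Filter {

  private final LongSupplier lockWaitNs;  // load signal, e.g. lock wait time
  private final long thresholdNs;

  BackPressureFilter(LongSupplier lockWaitNs, long thresholdNs) {
    this.lockWaitNs = lockWaitNs;
    this.thresholdNs = thresholdNs;
  }

  @Override
  public void init(FilterConfig config) { }

  @Override
  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    if (lockWaitNs.getAsLong() > thresholdNs) {
      // Tell the client to back off; Retry-After helps well-behaved
      // clients pick a delay before retrying.
      HttpServletResponse httpResponse = (HttpServletResponse) response;
      httpResponse.setHeader("Retry-After", "30");
      httpResponse.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler is under heavy load, please retry later");
      return;
    }
    chain.doFilter(request, response);
  }

  @Override
  public void destroy() { }
}

The obvious drawback is coarseness: a filter on the API path rejects read-only calls as well, whereas the thrift-layer interceptor sketched earlier can be matched to only the state-changing methods.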
On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]> wrote:

Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slows down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)

On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:

> I want to let upstream software "know" that Aurora is slowing down and that it should back off

Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.

On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]> wrote:

Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?

On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:

The next level is to determine why the storage lock is being held. Common causes include:

1. storage snapshot slowness, when scheduler state is very large, O(gb)
1a. long GC pauses in the scheduler, often induced by (1)
2. scheduler replicated log on slow disks
3. network issues between schedulers, schedulers to zookeeper, or between zookeepers

As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).

commit f2755e1
Author: Bill Farner <[email protected]>
Date: Tue Oct 24 23:34:09 2017 -0700

    Exclusively use Map-based in-memory stores for primary storage

On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]> wrote:

and in log_storage_write_lock_wait_ns_per_event

On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]> wrote:

Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
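Since Stats.getVariable() was said earlier in the thread to return null in some unit tests, the load signal feeding either sketch above could be wrapped defensively along these lines. The lookup is abstracted behind a Supplier because the exact stats API is not shown in this thread; the wiring is an assumption.

import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Hypothetical guard around a stat lookup such as
// log_storage_write_lock_wait_ns_total. The raw lookup may return null
// before the stat is registered (notably in unit tests), so fall back
// to zero, which means "no back-pressure".
final class SafeStatReader implements LongSupplier {

  private final Supplier<Number> rawLookup;  // e.g. a lambda over the stats lookup

  SafeStatReader(Supplier<Number> rawLookup) {
    this.rawLookup = rawLookup;
  }

  @Override
  public long getAsLong() {
    Number value = rawLookup.get();
    return value == null ? 0L : value.longValue();
  }
}

BackPressureInterceptor or BackPressureFilter could then take a SafeStatReader over the lock-wait stat as their LongSupplier.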
On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]> wrote:

Thanks Bill. Please see inline:

On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]> wrote:

>> I suspect they are getting enqueued
>
> Just to be sure - the offers do eventually get through though?

In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.

> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.

Thanks. I will check that one.

>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>
> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.

So these won't contend for locks with the offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to the many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?

>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>
> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).

Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both these effects occur?

> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>
>> version close to 0.18.0
>
> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
c85bffd10 is the commit from which we forked. The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet). I have cherry-picked the fix for AURORA-1952 as well.

On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]> wrote:

I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?

On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:

Folks,
I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers in the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by the "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.

I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?

Mohit.
