*Why*: API clients don't get any indication of Aurora "slowness". They may
keep retrying and make a bad situation worse. If they get a "busy signal"
instead, they can back off.
*What*: Use Aurora's stats to identify "slowness". Return 5xx to API
clients on state-changing calls (create, update, kill, etc.) until the
"slowness" is gone.
*How*:
I am planning to extend SchedulerThriftInterface and change the injection
bindings accordingly. The new class will delegate all calls to the old one.
Appreciate any comments/suggestions, especially on which metrics make sense
to trigger throttling. Right now I am thinking of using the log replication
lock wait time, as it correlates nicely with the kind of load that slows
progress in Aurora.
class BackPressuringSchedulerThriftInterface extends SchedulerThriftInterface {
  // ...

  @Inject
  BackPressuringSchedulerThriftInterface(
      ConfigurationManager configurationManager,
      Thresholds thresholds,
      NonVolatileStorage storage,
      LockManager lockManager,
      StorageBackup backup,
      Recovery recovery,
      CronJobManager cronJobManager,
      MaintenanceController maintenance,
      QuotaManager quotaManager,
      StateManager stateManager,
      UUIDGenerator uuidGenerator,
      JobUpdateController jobUpdateController,
      ReadOnlyScheduler.Iface readOnlyScheduler,
      AuditMessages auditMessages,
      TaskReconciler taskReconciler,
      StatsProvider statsProvider) {
    super(
        configurationManager,
        thresholds,
        storage,
        lockManager,
        backup,
        recovery,
        cronJobManager,
        maintenance,
        quotaManager,
        stateManager,
        uuidGenerator,
        jobUpdateController,
        readOnlyScheduler,
        auditMessages,
        taskReconciler,
        statsProvider);
  }

  // ...
}
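
Each overridden write call (create/kill/update and friends) would first
consult a small guard and return an error response instead of delegating to
super when the scheduler looks overloaded. A rough sketch of that guard, just
to make the idea concrete (the class name, the way the gauge is read, and the
threshold are placeholders, not a final design):

import java.util.function.LongSupplier;

// Sketch only: decides whether mutating RPCs should be rejected, based on a
// load gauge such as the replicated log write lock wait time per event. The
// exact stat and the threshold (e.g. a new scheduler command-line arg) are
// placeholders.
class BackPressureGuard {
  private final LongSupplier lockWaitNsPerEvent;  // reads the chosen stat
  private final long maxLockWaitNs;               // threshold for "slowness"

  BackPressureGuard(LongSupplier lockWaitNsPerEvent, long maxLockWaitNs) {
    this.lockWaitNsPerEvent = lockWaitNsPerEvent;
    this.maxLockWaitNs = maxLockWaitNs;
  }

  // True when write calls should get a "busy" error so clients back off.
  boolean shouldReject() {
    return lockWaitNsPerEvent.getAsLong() > maxLockWaitNs;
  }
}

When the guard trips, the overridden call would reply with an error Response
(something that maps to a 5xx at the HTTP layer) telling the client to back
off and retry later, and only delegate to super otherwise. The "change the
injection bindings" part would then amount to binding
BackPressuringSchedulerThriftInterface wherever SchedulerThriftInterface is
bound today (ThriftModule, per Bill's pointer below).
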
On Wed, Nov 29, 2017 at 1:26 PM, Mohit Jaggi <[email protected]> wrote:
> Thanks Bill. It would be the latter, as this back-pressure is only needed
> for calls that change state. Read-only calls should be quite quick to
> serve.
>
> One potentially correlated outage we may have had is a NIC flap on a
> ZooKeeper node. The following questions come to mind:
> - Does log replication "maintain" ZK connections and suffer when a NIC
> flaps?
> - If only 1 of 5 ZKs has this issue, could there still be a problem?
>
> On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <[email protected]> wrote:
>
>> is there a place I can inject a pre-processor for the API calls
>>
>>
>> There is no off-the-shelf way to do this. You could intercept API calls
>> as HTTP requests similar to the CORS filter
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>.
>> If you wanted to intercept specific calls and/or introspect arguments, you
>> would be better off binding a layer for AuroraAdmin.Iface
>> <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>
>> .
>>
>> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <[email protected]>
>> wrote:
>>
>>> I agree with that. I also believe that the scheduler should be resilient
>>> in the presence of external faults. Systems that export an API must take
>>> defensive steps to protect themselves.
>>>
>>> If I wanted to experiment with this change without modifying Aurora code
>>> "inline", is there a place I can inject a pre-processor for the API calls?
>>>
>>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <[email protected]> wrote:
>>>
>>>> I'd also suggest focusing on the source of the congestion. Aurora
>>>> should offer quite high scheduling throughput, and I would rather focus
>>>> energy on addressing bottlenecks.
>>>>
>>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <[email protected]>
>>>> wrote:
>>>>
>>>>> I think more explicit signaling is better. Increased latency can be
>>>>> due to other conditions like network issues, etc. Right now our mitigation
>>>>> involves load-shedding, and we would rather have load-avoidance. Indeed, a
>>>>> proxy is not a good option. Only Aurora "knows" when it wants to apply
>>>>> back-pressure.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Any log write latency will be reflected in the overall latency of the
>>>>>> request. Increased request latency is one of the main ways any server has
>>>>>> of telling a client that it's under load. It's then up to the client to
>>>>>> react to this.
>>>>>>
>>>>>> If you want to throw error codes, you can put a proxy in front of
>>>>>> Aurora that has request timeouts - which would send 503s to clients. But
>>>>>> the issue with that is that the requests are mostly non-idempotent, so
>>>>>> you'll need to build reconciliation logic into it.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule
>>>>>>> services. That layer often "amplifies" human effort and may result in a
>>>>>>> lot of load on Aurora. Usually that is fine, but if Aurora slows down
>>>>>>> due to transient problems, it can signal that to upstream software in
>>>>>>> the same way that busy web servers do during Cyber Monday sales :-)
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I want to let upstream software "know" that Aurora is slowing down
>>>>>>>>> and that it should back off
>>>>>>>>
>>>>>>>>
>>>>>>>> Can you offer more detail about how Aurora is being used in this
>>>>>>>> regard? I haven't seen use cases in the past that would be amenable to
>>>>>>>> this behavior, so I would like to understand better.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]
>>>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> Thanks Bill. We haven't been able to track down a specific root
>>>>>>>>> cause (although the ZK node is known to have issues now and then, we
>>>>>>>>> don't have logs for the specific outages we had). We plan to move to
>>>>>>>>> 0.19.x soon. In addition, I want to let upstream software "know" that
>>>>>>>>> Aurora is slowing down and that it should back off. To achieve this I
>>>>>>>>> want to send back 5xx error codes when update/rollback/kill etc. are
>>>>>>>>> called and certain metrics (like log write lock wait time) indicate
>>>>>>>>> heavy load. Perhaps this "defense" already exists?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> The next level is to determine why the storage lock is being
>>>>>>>>>> held. Common causes include:
>>>>>>>>>>
>>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large,
>>>>>>>>>> O(gb)
>>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>>>>>>>>> between zookeepers
>>>>>>>>>>
>>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to
>>>>>>>>>> eliminate the use of SQL/mybatis in the scheduler. This helped
>>>>>>>>>> Twitter improve (1) and (1a).
>>>>>>>>>>
>>>>>>>>>> commit f2755e1
>>>>>>>>>> Author: Bill Farner <[email protected]>
>>>>>>>>>> Date: Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>>
>>>>>>>>>> Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>> Is that cause or effect? :-)
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <
>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> In one instance the offers did get through but it took several
>>>>>>>>>>>>> minutes. In other instances we reloaded the scheduler to let
>>>>>>>>>>>>> another one
>>>>>>>>>>>>> become the leader.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> The most likely culprit is contention for the storage write
>>>>>>>>>>>>>> lock, observable via spikes in stat
>>>>>>>>>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with
>>>>>>>>>>>>>> offer processing directly, but could potentially slow down the
>>>>>>>>>>>>>> scheduler as
>>>>>>>>>>>>>> a whole.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> So these won't contend for locks with offer processing and
>>>>>>>>>>>>> task assignment threads? Only 8-10 out of 24 cores were being
>>>>>>>>>>>>> used on the machine. I also noticed a spike in mybatis active and
>>>>>>>>>>>>> bad connections. I can't say whether the spike in active
>>>>>>>>>>>>> connections is due to the many bad connections, or vice versa, or
>>>>>>>>>>>>> whether a third source caused both. Are there any metrics or logs
>>>>>>>>>>>>> that might help here?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and
>>>>>>>>>>>>>>> does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this
>>>>>>>>>>>>>> applies to tasks in states where the scheduler is waiting for
>>>>>>>>>>>>>> something
>>>>>>>>>>>>>> else to act and it does not hear back (via a status update).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates
>>>>>>>>>>>>> from Mesos are being delayed or processed too slowly, would both
>>>>>>>>>>>>> of these effects occur?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing
>>>>>>>>>>>>>> first; I suspect it might be related to the task timeouts as
>>>>>>>>>>>>>> well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least
>>>>>>>>>>>>>> indicate the last git SHA off aurora/master? Digging much deeper
>>>>>>>>>>>>>> into diagnosing this may prove tricky without knowing what code
>>>>>>>>>>>>>> is in play.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>> - c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>> - The custom patch is mainly the dynamic reservation work done by
>>>>>>>>>>>>> Dmitri. We also have commits for the offer/rescind race issue and
>>>>>>>>>>>>> the setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have cherrypicked the fix for Aurora-1952 as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <
>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around
>>>>>>>>>>>>>>> the same time. Can this happen if task is in PENDING state and
>>>>>>>>>>>>>>> does not
>>>>>>>>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <
>>>>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close
>>>>>>>>>>>>>>>> to 0.18.0). Sometimes, it shows no offers in the UI offers
>>>>>>>>>>>>>>>> page. But if I
>>>>>>>>>>>>>>>> tail the logs I can see offers are coming in. I suspect they
>>>>>>>>>>>>>>>> are getting
>>>>>>>>>>>>>>>> enqueued for processing by "executor" but stay there for a
>>>>>>>>>>>>>>>> long time and
>>>>>>>>>>>>>>>> are not processed either due to locking or thread starvation.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time.
>>>>>>>>>>>>>>>> Could these
>>>>>>>>>>>>>>>> calls starve the OfferManager (e.g. by contending for some
>>>>>>>>>>>>>>>> lock)? What
>>>>>>>>>>>>>>>> should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Mohit.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>