Thanks Bill. It would be the latter, as this back-pressure is only needed for calls that change state. Read-only calls should be quite quick to serve.
One potential correlated outage we may have had is a NIC flap on a ZooKeeper node. The following questions come to mind:
- Does log replication "maintain" ZK connections and suffer when a NIC flaps?
- If only 1 of 5 ZKs has this issue, could there still be a problem?

On Wed, Nov 29, 2017 at 11:08 AM, Bill Farner <wfar...@apache.org> wrote:
>> is there a place I can inject a pre-processor for the API calls
>
> There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
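As a rough illustration of the second option above (intercepting specific calls at the Iface layer), here is a minimal sketch using a dynamic proxy that rejects only state-changing calls while an overload signal is raised. The method names and the overload check are illustrative assumptions, not Aurora's actual wiring:

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.BooleanSupplier;

// Sketch: wrap a thrift Iface so that state-changing calls are refused while
// the scheduler reports itself as overloaded; read-only calls pass through.
public final class BackpressureInterceptor {
  // Assumed set of mutating RPC names; adjust to the real Iface being wrapped.
  private static final Set<String> MUTATING_METHODS = new HashSet<>(Arrays.asList(
      "createJob", "killTasks", "startJobUpdate", "pauseJobUpdate",
      "resumeJobUpdate", "abortJobUpdate", "rollbackJobUpdate", "restartShards"));

  @SuppressWarnings("unchecked")
  public static <T> T wrap(Class<T> ifaceClass, T delegate, BooleanSupplier overloaded) {
    InvocationHandler handler = (Object proxy, Method method, Object[] args) -> {
      if (MUTATING_METHODS.contains(method.getName()) && overloaded.getAsBoolean()) {
        // The boundary layer would translate this into a 5xx / error response
        // so that callers get an explicit "back off and retry later" signal.
        throw new IllegalStateException("Scheduler overloaded; retry later: " + method.getName());
      }
      return method.invoke(delegate, args);
    };
    return (T) Proxy.newProxyInstance(
        ifaceClass.getClassLoader(), new Class<?>[] {ifaceClass}, handler);
  }
}

The overload signal itself could come from the write-lock wait metrics discussed further down the thread.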
> On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>>
>> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>>
>> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>>
>>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>>
>>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it could signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>>
>>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>>
>>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (although the ZK node is known to have issues now and then, we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>>>
>>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>>
>>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>>> 3. network issues between schedulers, schedulers to ZooKeeper, or between ZooKeepers
>>>>>>>>>
>>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>>
>>>>>>>>> commit f2755e1
>>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>>
>>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>>
>>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>>
>>>>>>>>>>>> In one instance the offers did get through but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>>
>>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks. I will check that one.
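Since log_storage_write_lock_wait_ns_total is a cumulative counter, the actionable signal is its rate of increase rather than its absolute value. A rough sketch of sampling it, assuming the stat is exported as a "name value" line on the scheduler's /vars endpoint (default port 8081; both details may differ in your deployment):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Sample the write-lock wait counter twice and report how much of the window
// was spent waiting on the storage write lock.
public final class WriteLockWaitSampler {
  private static final String STAT = "log_storage_write_lock_wait_ns_total";

  static long readCounter(String schedulerUrl) throws Exception {
    URL url = new URL(schedulerUrl + "/vars");
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        if (line.startsWith(STAT + " ")) {
          return Long.parseLong(line.substring(STAT.length() + 1).trim());
        }
      }
    }
    throw new IllegalStateException(STAT + " not found in /vars output");
  }

  public static void main(String[] args) throws Exception {
    String scheduler = args.length > 0 ? args[0] : "http://localhost:8081";
    long windowMs = 10_000;
    long before = readCounter(scheduler);
    Thread.sleep(windowMs);
    long after = readCounter(scheduler);
    // Nanoseconds of accumulated wait divided by nanoseconds of wall clock.
    double waitFraction = (after - before) / (windowMs * 1_000_000d);
    System.out.printf("storage write lock wait: %.1f%% of the last %ds%n",
        waitFraction * 100, windowMs / 1000);
  }
}

A sustained rate near or above 100% (i.e. more than one waiter on average) would be the sort of condition under which the 5xx back-pressure described earlier could kick in.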
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>>
>>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>>
>>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. Can't say if the spike in active is due to many bad connections, or vice versa, or there was a 3rd source causing both of these. Are there any metrics or logs that might help here?
>>>>>>>>>>>>
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>>
>>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, both these effects will occur?
>>>>>>>>>>>>
>>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>>
>>>>>>>>>>>> c85bffd10 is the commit from which we forked.
>>>>>>>>>>>>
>>>>>>>>>>>> The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>>>
>>>>>>>>>>>> I have cherrypicked the fix for AURORA-1952 as well.
>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Mohit.
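To close the loop on the back-pressure idea from the top of the thread, here is a small sketch of the client-side half: an upstream orchestrator retrying a state-changing call with exponential backoff and jitter whenever the scheduler signals overload. OverloadedException is a stand-in for "the call came back with a 503 or transient error", not an Aurora class:

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Supplier;

// Retry a call with capped exponential backoff plus full jitter, so a burst of
// amplified requests does not retry in lockstep against a loaded scheduler.
public final class BackoffRetry {
  public static <T> T callWithBackoff(Supplier<T> call, int maxAttempts)
      throws InterruptedException {
    long delayCapMs = 500;
    for (int attempt = 1; ; attempt++) {
      try {
        return call.get();
      } catch (OverloadedException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        // Sleep a random amount up to the current cap, then double the cap.
        Thread.sleep(ThreadLocalRandom.current().nextLong(delayCapMs));
        delayCapMs = Math.min(delayCapMs * 2, 60_000);
      }
    }
  }

  /** Stand-in for "the scheduler responded with a 503 / overload error". */
  public static final class OverloadedException extends RuntimeException {}
}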