> > is there a place I can inject a pre-processor for the API calls?
There is no off-the-shelf way to do this. You could intercept API calls as HTTP requests, similar to the CORS filter <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/http/api/ApiModule.java#L73>. If you wanted to intercept specific calls and/or introspect arguments, you would be better off binding a layer for AuroraAdmin.Iface <https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/thrift/ThriftModule.java#L30>.
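If you go the AuroraAdmin.Iface route, something along these lines could serve as a starting point. This is only an untested sketch: a JDK dynamic proxy that runs a hook before forwarding each call, which spares you from hand-writing a forwarding method for every thrift call (the class and method names here are mine, not Aurora's):

    import java.lang.reflect.InvocationHandler;
    import java.lang.reflect.InvocationTargetException;
    import java.lang.reflect.Proxy;

    /** Wraps an interface (e.g. AuroraAdmin.Iface) so a hook runs before every call. */
    final class PreProcessing {
      static <T> T wrap(Class<T> iface, T delegate) {
        InvocationHandler handler = (proxy, method, args) -> {
          // Pre-processor hook: the method name and arguments are visible here,
          // so you can log, validate, reject, or delay a call before it reaches
          // the real implementation.
          System.out.println("API call: " + method.getName());
          try {
            return method.invoke(delegate, args);
          } catch (InvocationTargetException e) {
            throw e.getCause();  // surface the delegate's own exception
          }
        };
        return iface.cast(Proxy.newProxyInstance(
            iface.getClassLoader(), new Class<?>[] {iface}, handler));
      }
    }

You would then bind AuroraAdmin.Iface to PreProcessing.wrap(AuroraAdmin.Iface.class, realImpl) in a module of your own, in place of the stock binding; treat that wiring as pseudocode, since the exact binding annotations vary across Aurora versions.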
On Tue, Nov 28, 2017 at 11:46 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
>
> If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?
>
> On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <wfar...@apache.org> wrote:
>
>> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>>
>> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> I think more explicit signaling is better. Increased latency can be due to other conditions, like network issues. Right now our mitigation involves load-shedding, and we would rather have load-avoidance. Indeed, a proxy is not a good option: only Aurora "knows" when it wants to back-pressure.
>>>
>>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <dmclaugh...@apache.org> wrote:
>>>
>>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>>
>>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts, which would send 503s to clients. But the issue with that is that the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>>
>>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slows down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>>
>>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>>
>>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>>
>>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We will plan to move to 0.19.x soon.
>>>>>>>
>>>>>>> In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
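To make the defense described just above concrete: a minimal sketch as a servlet filter, in the spirit of the ApiModule filter linked earlier. Nothing like this ships with Aurora today; the load signal is abstracted behind a LongSupplier because how you would read log_storage_write_lock_wait_ns_per_event in-process is an assumption here, and the threshold is illustrative:

    import java.io.IOException;
    import java.util.function.LongSupplier;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    /** Sheds API load with a 503 while the storage write lock is congested. */
    final class BackPressureFilter implements Filter {
      private final LongSupplier writeLockWaitNsPerEvent;  // stand-in for the scheduler stat
      private final long maxWaitNs;                        // back-pressure threshold

      BackPressureFilter(LongSupplier writeLockWaitNsPerEvent, long maxWaitNs) {
        this.writeLockWaitNsPerEvent = writeLockWaitNsPerEvent;
        this.maxWaitNs = maxWaitNs;
      }

      @Override
      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        if (writeLockWaitNsPerEvent.getAsLong() > maxWaitNs) {
          // Fail fast rather than queue more work behind the storage lock.
          HttpServletResponse http = (HttpServletResponse) resp;
          http.setHeader("Retry-After", "5");  // seconds; an explicit back-off hint
          http.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
              "scheduler is under heavy load, retry later");
          return;
        }
        chain.doFilter(req, resp);
      }

      @Override public void init(FilterConfig filterConfig) {}
      @Override public void destroy() {}
    }

An upstream layer (e.g. something like Spinnaker, as in the scenario above) can treat the 503 plus Retry-After as the explicit back-off signal; David's caveat about non-idempotent requests still applies to whatever retry logic sits on top.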
>>>>>>>
>>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>
>>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>>
>>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>>> 3. network issues between schedulers, from schedulers to zookeeper, or between zookeepers
>>>>>>>>
>>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>>
>>>>>>>> commit f2755e1
>>>>>>>> Author: Bill Farner <wfar...@apache.org>
>>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>>
>>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>
>>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>>
>>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>>
>>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>>
>>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>>
>>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>>
>>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>>
>>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was caused by the bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>>
>>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both these effects occur?
>>>>>>>>>>>
>>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>>
>>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>>
>>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>>
>>>>>>>>>>> c85bffd10 is the commit from which we forked. The custom patching is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet). I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers on the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Mohit.