I think more explicit signaling is better. Increased latency can be due to other conditions, like network issues. Right now our mitigation involves load-shedding, and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
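For illustration, here is a minimal sketch of what load-avoidance could look like on the upstream side if Aurora (or a thin front end) returned an explicit 503, optionally with a Retry-After header, while it is overloaded. The endpoint, the status-code contract, and the header are assumptions for the sake of the example, not an existing Aurora API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.time.Duration;
    import java.util.OptionalLong;
    import java.util.concurrent.ThreadLocalRandom;

    // Hypothetical upstream caller that backs off when the scheduler signals overload
    // with a 503 (and, optionally, a Retry-After header). Illustrative only.
    public final class BackoffCaller {
      private static final HttpClient HTTP = HttpClient.newHttpClient();

      static HttpResponse<String> postWithBackoff(URI uri, String jsonBody) throws Exception {
        long delayMs = 500;  // initial backoff
        for (int attempt = 1; attempt <= 6; attempt++) {
          HttpRequest request = HttpRequest.newBuilder(uri)
              .timeout(Duration.ofSeconds(10))
              .header("Content-Type", "application/json")
              .POST(HttpRequest.BodyPublishers.ofString(jsonBody))
              .build();
          HttpResponse<String> response = HTTP.send(request, HttpResponse.BodyHandlers.ofString());
          if (response.statusCode() != 503) {
            return response;  // success, or an error that retrying will not fix
          }
          // Prefer an explicit Retry-After hint; otherwise back off exponentially with jitter.
          OptionalLong retryAfterSec = response.headers().firstValueAsLong("Retry-After");
          long waitMs = retryAfterSec.isPresent() ? retryAfterSec.getAsLong() * 1000 : delayMs;
          Thread.sleep(waitMs + ThreadLocalRandom.current().nextLong(250));
          delayMs = Math.min(delayMs * 2, 30_000);
        }
        throw new IllegalStateException("scheduler still shedding load after retries");
      }
    }

The point of the explicit signal is that the client never sends the work the scheduler would otherwise have to shed, which is the load-avoidance behavior described above.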
On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <[email protected]> wrote:

> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>
> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>
> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]> wrote:
>
>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slowed down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>
>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:
>>
>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>
>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>
>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>
>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:
>>>>
>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>
>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>> 2. scheduler replicated log on slow disks
>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>
>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>
>>>>> commit f2755e1
>>>>> Author: Bill Farner <[email protected]>
>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>
>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>
>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>
>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
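Since log_storage_write_lock_wait_ns_total is a monotonically increasing counter of nanoseconds spent waiting, sampling it periodically and looking at the delta per interval tells you roughly how much of the wall-clock time writers spent blocked on the storage lock. A rough sketch, assuming the scheduler exports its counters as plain "name value" lines at /vars on port 8081 (adjust the URL and parsing to however your deployment exposes stats):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URI;
    import java.net.URL;

    // Polls a monotonically increasing wait-time counter and prints how much waiting
    // accumulated per sampling interval. The /vars URL and stat name follow the thread;
    // treat both as assumptions about your deployment.
    public final class LockWaitWatcher {
      private static final String STAT = "log_storage_write_lock_wait_ns_total";

      static long scrape(String varsUrl) throws Exception {
        URL url = URI.create(varsUrl).toURL();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(url.openStream()))) {
          String line;
          while ((line = in.readLine()) != null) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].equals(STAT)) {
              return Long.parseLong(parts[1]);
            }
          }
        }
        throw new IllegalStateException(STAT + " not found at " + varsUrl);
      }

      public static void main(String[] args) throws Exception {
        String varsUrl = args.length > 0 ? args[0] : "http://localhost:8081/vars";
        long intervalMs = 10_000;
        long previous = scrape(varsUrl);
        while (true) {
          Thread.sleep(intervalMs);
          long current = scrape(varsUrl);
          double waitedMs = (current - previous) / 1_000_000.0;
          // e.g. 9500 ms of accumulated waiting inside a 10s window means writers were
          // blocked on the storage lock for most of the interval.
          System.out.printf("waited %.1f ms on the storage write lock in the last %d ms%n",
              waitedMs, intervalMs);
          previous = current;
        }
      }
    }

A sustained high delta is a symptom of contention; whether it is cause or effect of the slowness still depends on what is holding the lock.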
>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>
>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>
>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>
>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>
>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>>
>>>>>>>> Thanks. I will check that one.
>>>>>>>>
>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>
>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>
>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to the many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>
>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>
>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both of these effects occur?
>>>>>>>>
>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>
>>>>>>>>>> version close to 0.18.0
>>>>>>>>>
>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper in diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>
>>>>>>>> c85bffd10 is the commit from which we forked. The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>
>>>>>>>> I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>
>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if task is in PENDING state and does not reach ASSIGNED due to lack of offers?
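If it is unclear which threads are holding or waiting on the storage write lock (or why only 8-10 of 24 cores are busy), a thread dump with lock-owner information is usually the quickest way to see it. A minimal in-process sketch using the standard ThreadMXBean; running jstack against the scheduler PID gives the same information without any code changes:

    import java.lang.management.LockInfo;
    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    // Dumps every thread along with the monitors/synchronizers it owns or is blocked on.
    // Useful for spotting a long-held write lock and the threads queued behind it.
    public final class LockDump {
      public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        // lockedMonitors=true, lockedSynchronizers=true so ReentrantLock-style owners show up.
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
          System.out.printf("%s state=%s%n", info.getThreadName(), info.getThreadState());
          if (info.getLockInfo() != null) {
            System.out.printf("  waiting on %s held by %s%n",
                info.getLockInfo(), info.getLockOwnerName());
          }
          for (LockInfo owned : info.getLockedSynchronizers()) {
            System.out.printf("  owns %s%n", owned);
          }
        }
      }
    }

Many threads blocked behind a single owner points at lock contention; mostly-runnable but few active threads points more toward thread-pool starvation.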
>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Folks,
>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes, it shows no offers in the UI offers page. But if I tail the logs I can see offers are coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed either due to locking or thread starvation.
>>>>>>>>>>>
>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>
>>>>>>>>>>> Mohit.
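On the "what should I be looking for" question for the enqueued-offers theory: one generic option is to wrap the executor that drains the suspect work queue so each submitted task records how long it sat in the queue before running. This is a sketch of the idea only, not Aurora's actual OfferManager wiring:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Wraps an executor so every task reports its queue-wait time. Long waits with idle
    // CPUs point at lock contention or a starved pool rather than slow task bodies.
    public final class QueueWaitExecutor {
      private final ExecutorService delegate;
      private final AtomicLong maxWaitNanos = new AtomicLong();

      public QueueWaitExecutor(ExecutorService delegate) {
        this.delegate = delegate;
      }

      public void submit(Runnable task) {
        long enqueuedAt = System.nanoTime();
        delegate.execute(() -> {
          long waited = System.nanoTime() - enqueuedAt;
          maxWaitNanos.accumulateAndGet(waited, Math::max);
          if (waited > TimeUnit.SECONDS.toNanos(5)) {
            System.err.printf("task waited %d ms in queue before running%n",
                TimeUnit.NANOSECONDS.toMillis(waited));
          }
          task.run();
        });
      }

      public long maxWaitMillis() {
        return TimeUnit.NANOSECONDS.toMillis(maxWaitNanos.get());
      }

      public static void main(String[] args) throws InterruptedException {
        // Toy usage: a single-threaded pool with one slow task in front makes the wait visible.
        QueueWaitExecutor executor = new QueueWaitExecutor(Executors.newSingleThreadExecutor());
        executor.submit(() -> { try { Thread.sleep(6000); } catch (InterruptedException ignored) {} });
        executor.submit(() -> System.out.println("offer processed"));
        Thread.sleep(7000);
        System.out.println("max queue wait: " + executor.maxWaitMillis() + " ms");
        System.exit(0);
      }
    }

If offers show long queue waits while the API calls spike, that would support the contention theory described above.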
