I agree with that. I also believe that the scheduler should be resilient in the presence of external faults. Systems that export an API must take defensive steps to protect themselves.
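To make that concrete, here is a rough sketch of the kind of defensive pre-processor I have in mind: a plain servlet filter that sheds load with a 503 and a Retry-After header once an internal load signal crosses a threshold. This is not Aurora code; the LoadSignal hook and the threshold are placeholders for illustration, and in practice the signal could be derived from something like the log write lock wait stats discussed below.

    // Sketch only: a generic servlet filter that sheds load with 503 + Retry-After
    // when an internal load signal crosses a threshold. LoadSignal and the
    // threshold are illustrative placeholders, not Aurora classes.

    import java.io.IOException;
    import javax.servlet.Filter;
    import javax.servlet.FilterChain;
    import javax.servlet.FilterConfig;
    import javax.servlet.ServletException;
    import javax.servlet.ServletRequest;
    import javax.servlet.ServletResponse;
    import javax.servlet.http.HttpServletResponse;

    public class BackpressureFilter implements Filter {

      /** Hypothetical hook reporting how loaded the scheduler currently is. */
      public interface LoadSignal {
        long currentAvgLockWaitMillis();
      }

      private final LoadSignal load;
      private final long maxLockWaitMillis;

      public BackpressureFilter(LoadSignal load, long maxLockWaitMillis) {
        this.load = load;
        this.maxLockWaitMillis = maxLockWaitMillis;
      }

      @Override
      public void doFilter(ServletRequest req, ServletResponse resp, FilterChain chain)
          throws IOException, ServletException {
        if (load.currentAvgLockWaitMillis() > maxLockWaitMillis) {
          // Refuse new work and tell the caller when to retry.
          HttpServletResponse httpResp = (HttpServletResponse) resp;
          httpResp.setHeader("Retry-After", "5");
          httpResp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
              "Scheduler is shedding load, retry later");
          return;
        }
        chain.doFilter(req, resp);
      }

      @Override
      public void init(FilterConfig config) {
        // No initialization needed for this sketch.
      }

      @Override
      public void destroy() {
        // Nothing to clean up.
      }
    }

The 503 plus Retry-After pair is the same back-off signal a busy web server would emit, which is exactly what upstream layers like Spinnaker could react to.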
If I wanted to experiment with this change without modifying Aurora code "inline", is there a place I can inject a pre-processor for the API calls?

On Mon, Nov 27, 2017 at 4:59 PM, Bill Farner <[email protected]> wrote:

> I'd also suggest focusing on the source of the congestion. Aurora should offer quite high scheduling throughput, and I would rather focus energy on addressing bottlenecks.
>
> On Mon, Nov 27, 2017 at 1:05 PM, Mohit Jaggi <[email protected]> wrote:
>
>> I think more explicit signaling is better. Increased latency can be due to other conditions like network issues etc. Right now our mitigation involves load-shedding and we would rather have load-avoidance. Indeed, a proxy is not a good option. Only Aurora "knows" when it wants to back-pressure.
>>
>> On Mon, Nov 27, 2017 at 12:58 PM, David McLaughlin <[email protected]> wrote:
>>
>>> Any log write latency will be reflected in the overall latency of the request. Increased request latency is one of the main ways any server has of telling a client that it's under load. It's then up to the client to react to this.
>>>
>>> If you want to throw error codes, you can put a proxy in front of Aurora that has request timeouts - which would send 503s to clients. But the issue with that is the requests are mostly non-idempotent, so you'll need to build reconciliation logic into it.
>>>
>>> On Mon, Nov 27, 2017 at 12:13 PM, Mohit Jaggi <[email protected]> wrote:
>>>
>>>> Imagine something like Spinnaker using Aurora underneath to schedule services. That layer often "amplifies" human effort and may result in a lot of load on Aurora. Usually that is fine, but if Aurora slows down due to transient problems, it can signal that to upstream software in the same way that busy web servers do during Cyber Monday sales :-)
>>>>
>>>> On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:
>>>>
>>>>>> I want to let upstream software "know" that Aurora is slowing down and that it should back off
>>>>>
>>>>> Can you offer more detail about how Aurora is being used in this regard? I haven't seen use cases in the past that would be amenable to this behavior, so I would like to understand better.
>>>>>
>>>>> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]> wrote:
>>>>>
>>>>>> Thanks Bill. We haven't been able to track down a specific root cause (the ZK node is known to have issues now and then, but we don't have logs for the specific outages we had). We plan to move to 0.19.x soon. In addition, I want to let upstream software "know" that Aurora is slowing down and that it should back off. To achieve this I want to send 5xx error codes back when update/rollback/kill etc. are called and certain metrics (like log write lock wait time) indicate heavy load. Perhaps this "defense" already exists?
>>>>>>
>>>>>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:
>>>>>>
>>>>>>> The next level is to determine why the storage lock is being held. Common causes include:
>>>>>>>
>>>>>>> 1. storage snapshot slowness, when scheduler state is very large, O(GB)
>>>>>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>>>>>> 2. scheduler replicated log on slow disks
>>>>>>> 3. network issues between schedulers, schedulers to zookeeper, or between zookeepers
>>>>>>>
>>>>>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and (1a).
>>>>>>>
>>>>>>> commit f2755e1
>>>>>>> Author: Bill Farner <[email protected]>
>>>>>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>>>>>
>>>>>>>     Exclusively use Map-based in-memory stores for primary storage
>>>>>>>
>>>>>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>
>>>>>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>>>>>
>>>>>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that cause or effect? :-)
>>>>>>>>>
>>>>>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks Bill. Please see inline:
>>>>>>>>>>
>>>>>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>>> I suspect they are getting enqueued
>>>>>>>>>>>
>>>>>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>>>>
>>>>>>>>>> In one instance the offers did get through, but it took several minutes. In other instances we reloaded the scheduler to let another one become the leader.
>>>>>>>>>>
>>>>>>>>>>> The most likely culprit is contention for the storage write lock, observable via spikes in the stat log_storage_write_lock_wait_ns_total.
>>>>>>>>>>
>>>>>>>>>> Thanks. I will check that one.
>>>>>>>>>>
>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time
>>>>>>>>>>>
>>>>>>>>>>> This sounds like API activity. This shouldn't interfere with offer processing directly, but could potentially slow down the scheduler as a whole.
>>>>>>>>>>
>>>>>>>>>> So these won't contend for locks with offer processing and task assignment threads? Only 8-10 out of 24 cores were being used on the machine. I also noticed a spike in mybatis active and bad connections. I can't say whether the spike in active connections was due to many bad connections, or vice versa, or whether a third source caused both. Are there any metrics or logs that might help here?
>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>
>>>>>>>>>>> This is unusual. Pending tasks are not timed out; this applies to tasks in states where the scheduler is waiting for something else to act and it does not hear back (via a status update).
>>>>>>>>>>
>>>>>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from Mesos are being delayed or processed too slowly, would both of these effects occur?
>>>>>>>>>>
>>>>>>>>>>> I suggest digging into the cause of delayed offer processing first; I suspect it might be related to the task timeouts as well.
>>>>>>>>>>>
>>>>>>>>>>>> version close to 0.18.0
>>>>>>>>>>>
>>>>>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate the last git SHA off aurora/master? Digging much deeper into diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>>>>
>>>>>>>>>> - c85bffd10 is the commit from which we forked.
>>>>>>>>>> - The custom patch is mainly the dynamic reservation work done by Dmitri. We also have commits for the offer/rescind race issue and the setrootfs patch (which is not upstreamed yet).
>>>>>>>>>>
>>>>>>>>>> I have cherry-picked the fix for AURORA-1952 as well.
>>>>>>>>>>
>>>>>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same time. Can this happen if a task is in PENDING state and does not reach ASSIGNED due to lack of offers?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Folks,
>>>>>>>>>>>>> I have noticed some weird behavior in Aurora (version close to 0.18.0). Sometimes it shows no offers on the UI offers page, but if I tail the logs I can see offers coming in. I suspect they are getting enqueued for processing by "executor" but stay there for a long time and are not processed, either due to locking or thread starvation.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I see that a lot of getJobUpdateDetails() and getTasksWithoutConfigs() calls are being made at that time. Could these calls starve the OfferManager (e.g. by contending for some lock)? What should I be looking for to debug this condition further?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Mohit.
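P.S. For anyone who wants to keep an eye on the lock-wait stats discussed above from outside the scheduler while we dig further, here is a rough sketch of a watcher. It assumes the stats are exposed as plain name/value pairs on a /vars-style monitoring endpoint; the URL, response format, polling interval, and alert threshold are placeholders to adapt to your setup, not anything Aurora ships as-is.

    // Sketch only: poll a scheduler stats endpoint and report the average storage
    // write lock wait per event, using the stat names discussed in this thread.
    // URL, response format ("name value" per line), interval and threshold are
    // assumptions for illustration.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.util.HashMap;
    import java.util.Map;

    public class LockWaitWatcher {

      public static void main(String[] args) throws Exception {
        String statsUrl = "http://aurora-scheduler.example.com:8081/vars";  // placeholder
        HttpClient client = HttpClient.newHttpClient();

        while (true) {
          HttpRequest request = HttpRequest.newBuilder(URI.create(statsUrl)).GET().build();
          String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();

          // Collect the two stats mentioned in this thread.
          Map<String, Long> stats = new HashMap<>();
          for (String line : body.split("\n")) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2 && parts[0].startsWith("log_storage_write_lock_wait_ns")) {
              try {
                stats.put(parts[0], Long.parseLong(parts[1]));
              } catch (NumberFormatException ignored) {
                // Non-numeric value; skip it.
              }
            }
          }

          Long perEventNs = stats.get("log_storage_write_lock_wait_ns_per_event");
          Long totalNs = stats.get("log_storage_write_lock_wait_ns_total");
          if (perEventNs != null) {
            long perEventMs = perEventNs / 1_000_000;
            System.out.println("avg storage write lock wait: " + perEventMs + " ms/event"
                + (totalNs != null ? " (total " + totalNs + " ns)" : ""));
            if (perEventMs > 500) {  // arbitrary threshold for illustration
              System.out.println("WARNING: scheduler storage lock looks contended");
            }
          }

          Thread.sleep(10_000);
        }
      }
    }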
