> > I want to let upstream software "know" that Aurora is slowing down and
> > that it should back off
Can you offer more detail about how Aurora is being used in this regard? I
haven't seen use cases in the past that would be amenable to this behavior,
so I would like to understand better.

On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <mohit.ja...@uber.com> wrote:

> Thanks Bill. We haven't been able to track down a specific root cause
> (although the ZK node is known to have issues now and then, we don't have
> logs for the specific outages we had). We plan to move to 0.19.x soon. In
> addition, I want to let upstream software "know" that Aurora is slowing
> down and that it should back off. To achieve this, I want to send 5xx
> error codes back when update/rollback/kill etc. are called and certain
> metrics (like log write lock wait time) indicate heavy load. Perhaps this
> "defense" already exists?
>
> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <wfar...@apache.org> wrote:
>
>> The next level is to determine why the storage lock is being held.
>> Common causes include:
>>
>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>> 1a. long GC pauses in the scheduler, often induced by (1)
>> 2. scheduler replicated log on slow disks
>> 3. network issues between schedulers, schedulers to zookeeper, or
>> between zookeepers
>>
>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the
>> use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and
>> (1a).
>>
>> commit f2755e1
>> Author: Bill Farner <wfar...@apache.org>
>> Date: Tue Oct 24 23:34:09 2017 -0700
>>
>>     Exclusively use Map-based in-memory stores for primary storage
>>
>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>
>>> and in log_storage_write_lock_wait_ns_per_event
>>>
>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>
>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>>> cause or effect? :-)
>>>>
>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>
>>>>> Thanks Bill. Please see inline:
>>>>>
>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <wfar...@apache.org> wrote:
>>>>>
>>>>>>> I suspect they are getting enqueued
>>>>>>
>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>
>>>>> In one instance the offers did get through, but it took several
>>>>> minutes. In other instances we reloaded the scheduler to let another
>>>>> one become the leader.
>>>>>
>>>>>> The most likely culprit is contention for the storage write lock,
>>>>>> observable via spikes in the stat
>>>>>> log_storage_write_lock_wait_ns_total.
>>>>>
>>>>> Thanks. I will check that one.
>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>
>>>>>> This sounds like API activity. This shouldn't interfere with offer
>>>>>> processing directly, but could potentially slow down the scheduler
>>>>>> as a whole.
>>>>>
>>>>> So these won't contend for locks with offer processing and task
>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>> machine. I also noticed a spike in mybatis active and bad
>>>>> connections. I can't say whether the spike in active connections was
>>>>> caused by the bad connections, or vice versa, or whether a third
>>>>> source caused both. Are there any metrics or logs that might help
>>>>> here?
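A minimal sketch of how the two lock-wait stats discussed above could be
polled from outside the scheduler, assuming the plain-text /vars endpoint is
reachable and emits one "name value" pair per line; the scheduler URL and the
warning threshold below are hypothetical placeholders:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Polls the scheduler's /vars endpoint and prints the storage write-lock
// wait stats mentioned in the thread.  URL and threshold are placeholders.
public class LockWaitPoller {
  private static final String VARS_URL =
      "http://aurora-scheduler.example.com:8081/vars";
  private static final long WAIT_NS_PER_EVENT_THRESHOLD = 50_000_000L;  // 50 ms, arbitrary

  public static void main(String[] args) throws Exception {
    HttpURLConnection conn =
        (HttpURLConnection) new URL(VARS_URL).openConnection();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // Assumes /vars emits one "name value" pair per line.
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 2) {
          continue;
        }
        if (parts[0].equals("log_storage_write_lock_wait_ns_total")
            || parts[0].equals("log_storage_write_lock_wait_ns_per_event")) {
          System.out.println(parts[0] + " = " + parts[1]);
          if (parts[0].equals("log_storage_write_lock_wait_ns_per_event")
              && Long.parseLong(parts[1]) > WAIT_NS_PER_EVENT_THRESHOLD) {
            System.out.println("WARNING: storage write-lock wait is high");
          }
        }
      }
    }
  }
}

Graphing the per-event value over time (or exporting it to an existing
monitoring system) makes it easier to tell whether the lock waits are a cause
or an effect of the API load described above.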
>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>> same time. Can this happen if a task is in PENDING state and does
>>>>>>> not reach ASSIGNED due to lack of offers?
>>>>>>
>>>>>> This is unusual. Pending tasks are not timed out; this applies to
>>>>>> tasks in states where the scheduler is waiting for something else to
>>>>>> act and it does not hear back (via a status update).
>>>>>
>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>> Mesos are being delayed or processed too slowly, would both these
>>>>> effects occur?
>>>>>
>>>>>> I suggest digging into the cause of delayed offer processing first;
>>>>>> I suspect it might be related to the task timeouts as well.
>>>>>>
>>>>>>> version close to 0.18.0
>>>>>>
>>>>>> Is the ambiguity due to custom patches? Can you at least indicate
>>>>>> the last git SHA off aurora/master? Digging much deeper in
>>>>>> diagnosing this may prove tricky without knowing what code is in
>>>>>> play.
>>>>>
>>>>> c85bffd10 is the commit from which we forked. The custom patch is
>>>>> mainly the dynamic reservation work done by Dmitri. We also have
>>>>> commits for the offer/rescind race issue and the setrootfs patch
>>>>> (which is not upstreamed yet). I have cherry-picked the fix for
>>>>> AURORA-1952 as well.
>>>>>
>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>
>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>> same time. Can this happen if task is in PENDING state and does not
>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <mohit.ja...@uber.com> wrote:
>>>>>>>
>>>>>>>> Folks,
>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>> 0.18.0). Sometimes it shows no offers on the UI offers page, but
>>>>>>>> if I tail the logs I can see offers coming in. I suspect they are
>>>>>>>> getting enqueued for processing by "executor" but stay there for a
>>>>>>>> long time and are not processed, either due to locking or thread
>>>>>>>> starvation.
>>>>>>>>
>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could
>>>>>>>> these calls starve the OfferManager (e.g. by contending for some
>>>>>>>> lock)? What should I be looking for to debug this condition
>>>>>>>> further?
>>>>>>>>
>>>>>>>> Mohit.
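On the back-pressure idea at the top of the thread (returning 5xx from
update/rollback/kill while the scheduler is overloaded): the thread does not
indicate that such a defense already exists, but a rough sketch of one
possible shape is below, assuming the API can be fronted by a standard
servlet filter and that a load gauge such as the write-lock wait per event is
available in-process; the class, its wiring, and the threshold are
hypothetical:

import java.io.IOException;
import java.util.function.LongSupplier;
import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.http.HttpServletResponse;

// Hypothetical filter: reject mutating API calls with 503 while a load
// signal (e.g. storage write-lock wait per event) exceeds a threshold.
public class BackPressureFilter implements Filter {
  // Supplied by whatever exposes the stat in-process; an assumption here.
  private final LongSupplier lockWaitNsPerEvent;
  private final long maxLockWaitNsPerEvent;

  public BackPressureFilter(LongSupplier lockWaitNsPerEvent, long maxLockWaitNsPerEvent) {
    this.lockWaitNsPerEvent = lockWaitNsPerEvent;
    this.maxLockWaitNsPerEvent = maxLockWaitNsPerEvent;
  }

  @Override
  public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
      throws IOException, ServletException {
    if (lockWaitNsPerEvent.getAsLong() > maxLockWaitNsPerEvent) {
      // The scheduler is struggling; tell callers to retry later instead of
      // queueing more work behind the storage lock.
      HttpServletResponse httpResponse = (HttpServletResponse) response;
      httpResponse.setHeader("Retry-After", "30");
      httpResponse.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE,
          "Scheduler is overloaded, please back off and retry");
      return;
    }
    chain.doFilter(request, response);
  }

  @Override
  public void init(FilterConfig filterConfig) { }

  @Override
  public void destroy() { }
}

Upstream clients that already retry on 503 and honor Retry-After would then
back off automatically instead of piling more RPCs onto a scheduler that is
struggling to keep up.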