Imagine something like Spinnaker using Aurora underneath to schedule
services. That layer often "amplifies" human effort and can generate a lot
of load on Aurora. Usually that is fine, but if Aurora slows down due to
transient problems, it could signal that to upstream software in the same
way that busy web servers do during Cyber Monday sales :-)
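
To make "back off" concrete, here is a minimal sketch of the client side,
assuming the scheduler surfaces overload as HTTP 5xx. The endpoint and class
names are made up for illustration, not the real Aurora client API:

import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical upstream client that retries with exponential backoff
// whenever the scheduler answers with a 5xx status.
public class BackoffClient {
    public static void main(String[] args) throws Exception {
        long delayMs = 500;  // initial backoff
        for (int attempt = 1; attempt <= 6; attempt++) {
            HttpURLConnection conn = (HttpURLConnection)
                new URL("http://aurora-scheduler:8081/api").openConnection();
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.getOutputStream().close();  // placeholder empty body
            int code = conn.getResponseCode();
            conn.disconnect();
            if (code < 500) {
                System.out.println("accepted with status " + code);
                return;
            }
            // 5xx == "I am overloaded, please slow down"
            System.out.println("got " + code + ", backing off " + delayMs + " ms");
            Thread.sleep(delayMs);
            delayMs = Math.min(delayMs * 2, 30_000);  // exponential, capped
        }
        System.out.println("giving up after repeated 5xx responses");
    }
}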

On Mon, Nov 27, 2017 at 12:06 PM, Bill Farner <[email protected]> wrote:

> I want to let upstream software "know" that Aurora is slowing down and
>> that it should back off
>
>
> Can you offer more detail about how Aurora is being used in this regard?
> I haven't seen use cases in the past that would be amenable to this
> behavior, so I would like to understand better.
>
>
> On Mon, Nov 27, 2017 at 11:51 AM, Mohit Jaggi <[email protected]>
> wrote:
>
>> Thanks Bill. We haven't been able to track down a specific root cause
>> (the ZK node is known to have issues now and then, but we don't have logs
>> for the specific outages we had). We plan to move to 0.19.x soon. In
>> addition, I want to let upstream software "know" that Aurora is slowing
>> down and that it should back off. To achieve this I want to send 5xx
>> error codes back when update/rollback/kill etc. are called and certain
>> metrics (like log write lock wait time) indicate heavy load. Perhaps this
>> "defense" already exists?
>>
>>
>> On Mon, Nov 13, 2017 at 8:38 AM, Bill Farner <[email protected]> wrote:
>>
>>> The next level is to determine why the storage lock is being held.
>>> Common causes include:
>>>
>>> 1. storage snapshot slowness, when scheduler state is very large, O(gb)
>>> 1a. long GC pauses in the scheduler, often induced by (1)
>>> 2. scheduler replicated log on slow disks
>>> 3. network issues between schedulers, schedulers to zookeeper, or
>>> between zookeepers
>>>
>>> As an immediate (partial) remedy, I suggest you upgrade to eliminate the
>>> use of SQL/mybatis in the scheduler. This helped Twitter improve (1) and
>>> (1a).
>>>
>>> commit f2755e1
>>> Author: Bill Farner <[email protected]>
>>> Date:   Tue Oct 24 23:34:09 2017 -0700
>>>
>>>     Exclusively use Map-based in-memory stores for primary storage
>>>
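>>> If it helps while you dig in, the stats above can be watched from outside
>>> the scheduler by polling /vars -- a rough sketch, assuming your build
>>> still serves plain-text "name value" lines there (adjust host/port):
>>>
>>> import java.io.BufferedReader;
>>> import java.io.InputStreamReader;
>>> import java.net.URL;
>>>
>>> // Grep a few of the stats discussed in this thread out of the
>>> // scheduler's /vars page; JVM GC stats can be pulled the same way.
>>> public class VarsPoller {
>>>     public static void main(String[] args) throws Exception {
>>>         String[] interesting = {
>>>             "log_storage_write_lock_wait_ns_total",
>>>             "log_storage_write_lock_wait_ns_per_event",
>>>         };
>>>         URL vars = new URL("http://localhost:8081/vars");
>>>         try (BufferedReader in = new BufferedReader(
>>>                 new InputStreamReader(vars.openStream()))) {
>>>             String line;
>>>             while ((line = in.readLine()) != null) {
>>>                 for (String name : interesting) {
>>>                     if (line.startsWith(name + " ")) {
>>>                         System.out.println(line);
>>>                     }
>>>                 }
>>>             }
>>>         }
>>>     }
>>> }
>>>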
>>>
>>> On Fri, Nov 10, 2017 at 10:07 PM, Mohit Jaggi <[email protected]>
>>> wrote:
>>>
>>>> and in log_storage_write_lock_wait_ns_per_event
>>>>
>>>> On Fri, Nov 10, 2017 at 9:57 PM, Mohit Jaggi <[email protected]>
>>>> wrote:
>>>>
>>>>> Yes, I do see spikes in log_storage_write_lock_wait_ns_total. Is that
>>>>> cause or effect? :-)
>>>>>
>>>>> On Fri, Nov 10, 2017 at 9:34 PM, Mohit Jaggi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Thanks Bill. Please see inline:
>>>>>>
>>>>>> On Fri, Nov 10, 2017 at 8:06 PM, Bill Farner <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> I suspect they are getting enqueued
>>>>>>>
>>>>>>>
>>>>>>> Just to be sure - the offers do eventually get through though?
>>>>>>>
>>>>>>>
>>>>>> In one instance the offers did get through but it took several
>>>>>> minutes. In other instances we reloaded the scheduler to let another one
>>>>>> become the leader.
>>>>>>
>>>>>>
>>>>>>> The most likely culprit is contention for the storage write lock,  
>>>>>>> observable
>>>>>>> via spikes in stat log_storage_write_lock_wait_ns_total.
>>>>>>>
>>>>>>
>>>>>> Thanks. I will check that one.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>> getTasksWithoutConfigs() calls are being made at that time
>>>>>>>
>>>>>>>
>>>>>>> This sounds like API activity.  This shouldn't interfere with offer
>>>>>>> processing directly, but could potentially slow down the scheduler as a
>>>>>>> whole.
>>>>>>>
>>>>>>>
>>>>>> So these won't contend for locks with offer processing and task
>>>>>> assignment threads? Only 8-10 out of 24 cores were being used on the
>>>>>> machine. I also noticed a spike in mybatis active and bad connections.
>>>>>> I can't say whether the spike in active connections was caused by the
>>>>>> bad connections, or vice versa, or whether a third source caused both.
>>>>>> Are there any metrics or logs that might help here?
>>>>>>
>>>>>>
>>>>>>>> I also notice a lot of "Timeout reached for task..." around the same
>>>>>>>> time. Can this happen if a task is in PENDING state and does not
>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>
>>>>>>>
>>>>>>> This is unusual.  Pending tasks are not timed out; this applies to
>>>>>>> tasks in states where the scheduler is waiting for something else to act
>>>>>>> and it does not hear back (via a status update).
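>>>>>>>
>>>>>>> Conceptually it behaves like the sketch below (illustrative only, not
>>>>>>> the actual scheduler code): a deadline is armed when a task enters a
>>>>>>> transient state and cleared by the next status update; if the deadline
>>>>>>> fires first, you get the "Timeout reached for task..." log line.
>>>>>>>
>>>>>>> import java.util.Map;
>>>>>>> import java.util.concurrent.ConcurrentHashMap;
>>>>>>>
>>>>>>> // Illustrative sketch of transient-state timeouts.
>>>>>>> public class TransientTaskTimeouts {
>>>>>>>     private static final long TIMEOUT_MS = 5 * 60 * 1000;  // placeholder value
>>>>>>>     private final Map<String, Long> deadlines = new ConcurrentHashMap<>();
>>>>>>>
>>>>>>>     void onEnterTransientState(String taskId) {  // e.g. ASSIGNED, KILLING
>>>>>>>         deadlines.put(taskId, System.currentTimeMillis() + TIMEOUT_MS);
>>>>>>>     }
>>>>>>>
>>>>>>>     void onStatusUpdate(String taskId) {  // progress observed, disarm
>>>>>>>         deadlines.remove(taskId);
>>>>>>>     }
>>>>>>>
>>>>>>>     void tick() {  // called periodically
>>>>>>>         long now = System.currentTimeMillis();
>>>>>>>         deadlines.forEach((taskId, deadline) -> {
>>>>>>>             if (now > deadline) {
>>>>>>>                 System.out.println("Timeout reached for task " + taskId);
>>>>>>>                 deadlines.remove(taskId);
>>>>>>>             }
>>>>>>>         });
>>>>>>>     }
>>>>>>> }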
>>>>>>>
>>>>>>
>>>>>> Perhaps they were in ASSIGNED or some other state. If updates from
>>>>>> Mesos are being delayed or processed too slowly, could both of these
>>>>>> effects occur?
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> I suggest digging into the cause of delayed offer processing first;
>>>>>>> I suspect it might be related to the task timeouts as well.
>>>>>>>
>>>>>>> version close to 0.18.0
>>>>>>>
>>>>>>>
>>>>>>> Is the ambiguity due to custom patches? Can you at least indicate
>>>>>>> the last git SHA off aurora/master? Digging much deeper into
>>>>>>> diagnosing this may prove tricky without knowing what code is in play.
>>>>>>>
>>>>>>>
>>>>>> c85bffd10 is the commit from which we forked.
>>>>>>
>>>>>> The custom patch is mainly the dynamic reservation work done by Dmitri.
>>>>>> We also have commits for the offer/rescind race issue and the setrootfs
>>>>>> patch (which is not upstreamed yet). I have cherry-picked the fix for
>>>>>> AURORA-1952 as well.
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On Thu, Nov 9, 2017 at 9:49 PM, Mohit Jaggi <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I also notice a lot of "Timeout reached for task..." around the
>>>>>>>> same time. Can this happen if a task is in PENDING state and does not
>>>>>>>> reach ASSIGNED due to lack of offers?
>>>>>>>>
>>>>>>>> On Thu, Nov 9, 2017 at 4:33 PM, Mohit Jaggi <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>> I have noticed some weird behavior in Aurora (version close to
>>>>>>>>> 0.18.0). Sometimes it shows no offers on the UI offers page, but if
>>>>>>>>> I tail the logs I can see offers coming in. I suspect they are
>>>>>>>>> getting enqueued for processing by the "executor" but stay there for
>>>>>>>>> a long time and are not processed, either due to locking or thread
>>>>>>>>> starvation.
>>>>>>>>>
>>>>>>>>> I see that a lot of getJobUpdateDetails() and
>>>>>>>>> getTasksWithoutConfigs() calls are being made at that time. Could
>>>>>>>>> these calls starve the OfferManager (e.g., by contending for some
>>>>>>>>> lock)? What should I be looking for to debug this condition further?
>>>>>>>>>
>>>>>>>>> Mohit.
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
