Hi Bryan,

System in general is under light load. No other processes cause latency.
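If it helps, next time a spike hits I can leave it running with timestamps so the output can be lined up against the slow responses in our test log, roughly like this (the 5-second interval, the sample count and the log file name are just values I picked, nothing official):

# same plugins you listed, plus -t so every sample carries a timestamp;
# 5-second samples, 60 of them, copied to a file for later correlation
dstat -t -c --top-cpu -d --top-bio --top-latency -n 5 60 | tee /var/tmp/dstat-ats.log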
I've submitted issue https://github.com/apache/trafficserver/issues/3057

Veiko

2018-01-27 2:20 GMT+02:00 Bryan Call <[email protected]>:

> I came across this command and it has helped track down some latency
> issues caused by other processes (ss -s). Can you run it during the time
> you are seeing latency issues and post the results here?
>
> dstat -c --top-cpu -d --top-bio --top-latency -n
>
> -Bryan
>
> On Jan 26, 2018, at 5:48 AM, Veiko Kukk <[email protected]> wrote:
>
> Hi again,
>
> I'd really appreciate it if somebody could point me in the right direction
> on how to solve this.
> Whatever ATS does every ~50 minutes has a strong effect on response times.
> ATS is used in reverse proxy mode and we run regular tests against a test
> URL on the proxied server(s) (excluded from caching in the ATS config).
> This test GET is run as an HAproxy health check every ~15 seconds for two
> local HAproxy backends, which both pass requests to a single local ATS.
>
> It is a fairly complex setup, but the point is that the tests run
> frequently and give information about ATS response times over a long
> period.
>
> Total test runs today: 6364
> Tests that took over 7s today: 50
>
> Distribution of requests; the first column is the response time (ms), the
> second the number of requests under that value:
>
> 100   1292
> 300   4351
> 500   5194
> 700   5578
> 900   5794
> 1200  5985
> 1400  6058
> 1800  6143
>
> Here is the output of the test log that contains all the extremely slow
> responses. The test response size is only 609 bytes. Usually the response
> time fluctuates around
>
> 2018-01-26T01:13:32.150186+00:00 12412
> 2018-01-26T01:13:32.150188+00:00 20803
> 2018-01-26T02:05:04.536931+00:00 29764
> 2018-01-26T02:05:04.536936+00:00 27271
> 2018-01-26T02:05:04.536941+00:00 10233
> 2018-01-26T02:56:26.968987+00:00 9511
> 2018-01-26T02:56:26.968989+00:00 30084
> 2018-01-26T02:56:26.968991+00:00 27337
> 2018-01-26T04:39:21.947460+00:00 24171
> 2018-01-26T04:39:21.947462+00:00 12042
> 2018-01-26T04:39:21.947464+00:00 36979
> 2018-01-26T04:39:31.954116+00:00 7369
> 2018-01-26T04:39:31.954118+00:00 32305
> 2018-01-26T04:39:31.954120+00:00 19779
> 2018-01-26T04:47:42.349748+00:00 29177
> 2018-01-26T04:47:42.349754+00:00 26212
> 2018-01-26T04:47:42.349757+00:00 21645
> 2018-01-26T04:47:42.349759+00:00 24932
> 2018-01-26T05:39:04.925435+00:00 32361
> 2018-01-26T05:39:04.925438+00:00 33587
> 2018-01-26T05:39:04.925440+00:00 8173
> 2018-01-26T05:39:04.925443+00:00 28149
> 2018-01-26T05:39:04.925445+00:00 29115
> 2018-01-26T06:30:27.643170+00:00 7423
> 2018-01-26T06:30:27.643172+00:00 32271
> 2018-01-26T06:30:27.643174+00:00 18927
> 2018-01-26T06:30:27.643179+00:00 27849
> 2018-01-26T06:30:37.644023+00:00 15160
> 2018-01-26T07:21:50.231681+00:00 19208
> 2018-01-26T07:21:50.231684+00:00 14984
> 2018-01-26T08:13:12.874501+00:00 16876
> 2018-01-26T08:13:22.885389+00:00 14007
> 2018-01-26T09:04:35.509167+00:00 9016
> 2018-01-26T09:04:35.509172+00:00 9356
> 2018-01-26T09:55:58.052277+00:00 24137
> 2018-01-26T09:55:58.052280+00:00 23709
> 2018-01-26T09:55:58.052282+00:00 19901
> 2018-01-26T09:55:58.052284+00:00 19034
> 2018-01-26T10:47:10.614261+00:00 23419
> 2018-01-26T10:47:10.614263+00:00 18967
> 2018-01-26T11:38:32.984318+00:00 14425
> 2018-01-26T11:38:32.984324+00:00 9797
> 2018-01-26T11:38:32.984326+00:00 11161
> 2018-01-26T11:38:32.984329+00:00 16228
> 2018-01-26T12:29:45.511517+00:00 15580
> 2018-01-26T12:29:45.511520+00:00 11439
> 2018-01-26T13:20:58.023816+00:00 21360
> 2018-01-26T13:20:58.023818+00:00 19488
> 2018-01-26T13:20:58.023821+00:00 14737
> 2018-01-26T13:20:58.023823+00:00 17118
>
> The question is: why does ATS regularly slow down? Are there some internal
> management jobs that use the same single queue as incoming requests, so
> that requests just wait in the queue until the internal processes finish?
> It is getting worse as ATS uptime increases, triggering HAproxy health
> check timeouts. After a restart it is not that bad again for a while.
>
> How do I get rid of this regular slowness?
>
> Best regards,
> Veiko
>
>
> 2018-01-23 13:53 GMT+02:00 Veiko Kukk <[email protected]>:
>
>> Hi,
>>
>> I should have noted before that, during that timeframe, there is no
>> higher disk activity than on average. No higher load, no disk latency,
>> no CPU load. Nothing abnormal except slow ATS.
>> ATS is running on CentOS 7, directly on a dedicated hardware server.
>>
>> Dirty-page-related config that has always been there for that server:
>>
>> vm.dirty_background_ratio = 5
>> vm.dirty_ratio = 40
>> vm.swappiness = 0
>>
>> # free -m
>>               total        used        free      shared  buff/cache   available
>> Mem:         128831       27841        2436        4331       98554       95722
>> Swap:          4095           0        4095
>>
>> As you can see, there is a lot of available memory.
>>
>> I don't see how writing dirty pages could slow down ATS when there is no
>> indication of excessive load on any of the system resources.
>> And there is this strange regularity: it happens every ~50 minutes, as if
>> some regular (cronjob-like) task were being run inside ATS that delays
>> all other tasks.
>>
>> ATS is using a 9TB raw partition, if that information is relevant.
>>
>> Could you point me to the documentation dealing with the dir entry sync
>> periods of ATS?
>>
>> --
>> Veiko
>>
>>
>> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <[email protected]>:
>>
>>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <[email protected]> wrote:
>>>
>>> Hi again,
>>>
>>> That mysterious task that happens every ~50-51 minutes causes
>>> requests/responses to slow down very much, even time out.
>>> Requests that usually take a few hundred milliseconds are now taking
>>> over 30s and timing out. This happens only during the time when memory
>>> consumption suddenly drops in ATS. It happens for both bypassed URLs
>>> and for hits.
>>> The ATS version is 7.1.1 and this looks like a serious bug to me.
>>>
>>>
>>> That sounds suspiciously like kernel paging activity; maybe it's
>>> spending that time flushing dirty pages? Maybe transparent huge pages?
>>> Or tweak the sysctls for dirty page ratios?
>>>
>>> The other thing to possibly look at is the dir entry sync periods of
>>> ATS. Whenever we sync those to disk, we consume both more memory and
>>> more disk I/O, and maybe you are putting too much pressure on the VM
>>> (i.e. maybe you need to turn down the RAM cache or tweak the amount of
>>> directory entries you have).
>>>
>>> — Leif
>>>
>>> E.g.
>>> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
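PS: the ATS knobs Leif refers to seem to live in records.config. I have not
changed any of them yet, and the lines below are only my reading of the
documented defaults, not a tuning recommendation (please correct me if the
names or defaults are wrong):

# defaults as I understand them, not values to copy
# how often (in seconds) the cache directory metadata is synced to disk
CONFIG proxy.config.cache.dir.sync_frequency INT 60
# -1 lets ATS size the RAM cache automatically
CONFIG proxy.config.cache.ram_cache.size INT -1
# together with the 9TB raw partition, this controls how many directory entries get allocated
CONFIG proxy.config.cache.min_average_object_size INT 8000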

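The dirty-page tweak from the article Leif links comes down to switching from
ratios to absolute byte limits, something along these lines (the byte values
are only an illustration; writing the *_bytes variants makes the kernel ignore
the corresponding *_ratio settings):

# start background writeback at 128 MB of dirty data (example value only)
vm.dirty_background_bytes = 134217728
# throttle writers and flush synchronously at 512 MB (example value only)
vm.dirty_bytes = 536870912

On CentOS 7, sysctl --system reloads these after editing /etc/sysctl.conf or a
file under /etc/sysctl.d/.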