I came across this command and it has helped track down some latency issues caused by other processes (ss -s). Can you run it while you are seeing the latency issues and post the results here?

dstat -c --top-cpu -d --top-bio --top-latency -n
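
If it is easier to capture over a longer window, dstat can also log to a CSV file; something like the following should take a sample every 5 seconds (the interval and file name are only examples):

dstat -c --top-cpu -d --top-bio --top-latency -n 5 --output dstat-latency.csv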

-Bryan

> On Jan 26, 2018, at 5:48 AM, Veiko Kukk <[email protected]> wrote:
> 
> Hi again,
> 
> I'd really appreciate it if somebody could point me in the right direction on 
> how to solve this.
> Whatever ATS does every ~50 minutes has a strong effect on response times.
> ATS is used in reverse proxy mode, and we run regular tests against a test URL 
> on the proxied server(s) (excluded from caching in the ATS config).
> This test GET is run as an HAProxy health check every ~15 seconds for two local 
> HAProxy backends, both of which pass requests to a single local ATS.
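> 
> For illustration, this kind of exclusion can be done with a cache.config rule 
> along these lines (the URL pattern below is only a placeholder, not our real one):
> 
> url_regex=/haproxy-health-check action=never-cache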
> 
> It's a fairly complex setup, but the point is that the tests run frequently and 
> give information about ATS response times over a long period.
> 
> Total test runs today: 6364
> Tests that took over 7s today: 50
> 
> Distribution of requests: the first column is the response time (ms), the second 
> is the number of requests under that value.
> 100 1292
> 300 4351
> 500 5194
> 700 5578
> 900 5794
> 1200 5985
> 1400 6058
> 1800 6143
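> 
> For reference, a cumulative breakdown like the one above can be produced from the 
> raw test log (one "timestamp response-time" pair per line) with something roughly 
> like this; the file name is a placeholder:
> 
> for t in 100 300 500 700 900 1200 1400 1800; do
>     printf '%s ' "$t"
>     awk -v t="$t" '$2 < t' tests.log | wc -l
> done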
> 
> Here is the output of the test log containing all of the extremely slow 
> responses. The test response size is only 609 bytes. Usually the response time 
> fluctuates around 
> 
> 2018-01-26T01:13:32.150186+00:00 12412
> 2018-01-26T01:13:32.150188+00:00 20803
> 2018-01-26T02:05:04.536931+00:00 29764
> 2018-01-26T02:05:04.536936+00:00 27271
> 2018-01-26T02:05:04.536941+00:00 10233
> 2018-01-26T02:56:26.968987+00:00 9511
> 2018-01-26T02:56:26.968989+00:00 30084
> 2018-01-26T02:56:26.968991+00:00 27337
> 2018-01-26T04:39:21.947460+00:00 24171
> 2018-01-26T04:39:21.947462+00:00 12042
> 2018-01-26T04:39:21.947464+00:00 36979
> 2018-01-26T04:39:31.954116+00:00 7369
> 2018-01-26T04:39:31.954118+00:00 32305
> 2018-01-26T04:39:31.954120+00:00 19779
> 2018-01-26T04:47:42.349748+00:00 29177
> 2018-01-26T04:47:42.349754+00:00 26212
> 2018-01-26T04:47:42.349757+00:00 21645
> 2018-01-26T04:47:42.349759+00:00 24932
> 2018-01-26T05:39:04.925435+00:00 32361
> 2018-01-26T05:39:04.925438+00:00 33587
> 2018-01-26T05:39:04.925440+00:00 8173
> 2018-01-26T05:39:04.925443+00:00 28149
> 2018-01-26T05:39:04.925445+00:00 29115
> 2018-01-26T06:30:27.643170+00:00 7423
> 2018-01-26T06:30:27.643172+00:00 32271
> 2018-01-26T06:30:27.643174+00:00 18927
> 2018-01-26T06:30:27.643179+00:00 27849
> 2018-01-26T06:30:37.644023+00:00 15160
> 2018-01-26T07:21:50.231681+00:00 19208
> 2018-01-26T07:21:50.231684+00:00 14984
> 2018-01-26T08:13:12.874501+00:00 16876
> 2018-01-26T08:13:22.885389+00:00 14007
> 2018-01-26T09:04:35.509167+00:00 9016
> 2018-01-26T09:04:35.509172+00:00 9356
> 2018-01-26T09:55:58.052277+00:00 24137
> 2018-01-26T09:55:58.052280+00:00 23709
> 2018-01-26T09:55:58.052282+00:00 19901
> 2018-01-26T09:55:58.052284+00:00 19034
> 2018-01-26T10:47:10.614261+00:00 23419
> 2018-01-26T10:47:10.614263+00:00 18967
> 2018-01-26T11:38:32.984318+00:00 14425
> 2018-01-26T11:38:32.984324+00:00 9797
> 2018-01-26T11:38:32.984326+00:00 11161
> 2018-01-26T11:38:32.984329+00:00 16228
> 2018-01-26T12:29:45.511517+00:00 15580
> 2018-01-26T12:29:45.511520+00:00 11439
> 2018-01-26T13:20:58.023816+00:00 21360
> 2018-01-26T13:20:58.023818+00:00 19488
> 2018-01-26T13:20:58.023821+00:00 14737
> 2018-01-26T13:20:58.023823+00:00 17118
> 
> 
> The question is: why does ATS regularly slow down? Are there internal 
> management jobs that use the same queue as incoming requests, so that requests 
> simply wait in the queue until the internal processing finishes?
> It's getting worse as ATS uptime increases, triggering HAProxy health check 
> timeouts. After a restart it's not that bad again for a while.
> 
> How can I get rid of this regular slowness?
> 
> Best regards,
> Veiko
> 
> 
> 
> 2018-01-23 13:53 GMT+02:00 Veiko Kukk <[email protected]>:
> Hi,
> 
> I should have noted before that, during that timeframe, there is no higher 
> disk activity than on average. No higher load, no disk latency, no CPU load. 
> Nothing abnormal except slow ATS.
> ATS is running on CentOS 7, directly on a dedicated hardware server.
> 
> Dirty page related config that has always been there for this server:
> 
> vm.dirty_background_ratio = 5
> vm.dirty_ratio = 40
> vm.swappiness = 0
> 
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:         128831       27841        2436        4331       98554       95722
> Swap:          4095           0        4095
> 
> As you can see, there is a lot of available memory.
> 
> I don't see how writing dirty pages could slow down ATS when there is no 
> indication of excessive load on any of the system resources.
> And there is this strange regularity: it happens every ~50 minutes, as if some 
> regular (cron-like) task were being run inside ATS that delays all other tasks.
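> 
> To double-check the dirty page theory, I can also sample the dirty/writeback 
> counters around one of the slow periods, e.g. with something like:
> 
> watch -n 5 "grep -E 'Dirty|Writeback' /proc/meminfo"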
> 
> ATS is using a 9 TB raw partition, in case that information is relevant.
> 
> Could you point me to the documentation dealing with dir entry sync periods 
> of ATS?
> 
> 
> -- 
> Veiko
> 
> 
> 
> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <[email protected]>:
> 
> 
>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <[email protected]> wrote:
>> 
>> Hi again,
>> 
>> That mysterious task that runs every ~50-51 minutes causes requests/responses 
>> to slow down very badly, or even time out.
>> Requests that usually take a few hundred milliseconds are taking over 30 s and 
>> timing out. This happens only while ATS suddenly drops its memory consumption. 
>> It happens for both bypassed URLs and for cache hits.
>> The ATS version is 7.1.1, and this looks like a serious bug to me.
> 
> 
> That sounds suspiciously like kernel paging activity; maybe it's spending 
> that time flushing dirty pages? Maybe transparent huge pages? Or try tweaking 
> the sysctls for the dirty page ratios?
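> 
> For THP, something like this should show the current state on a stock CentOS 7 box:
> 
> cat /sys/kernel/mm/transparent_hugepage/enabled
> cat /sys/kernel/mm/transparent_hugepage/defrag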
> 
> The other thing to possibly look at is the dir entry sync periods in ATS. 
> Whenever we sync those to disk, we consume both more memory and more disk 
> I/O, and maybe you are putting too much pressure on the VM (i.e., maybe you 
> need to turn down the RAM cache or tweak the number of directory entries you 
> have).
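> 
> Off the top of my head, the knobs involved are records.config settings along 
> these lines (check the docs for your exact version):
> 
> traffic_ctl config get proxy.config.cache.dir.sync_frequency
> traffic_ctl config get proxy.config.cache.ram_cache.size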
> 
> — Leif
> 
> E.g.
> 
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
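> 
> As a rough example, on a machine with this much RAM the byte-based equivalents 
> (which override the ratios when set) can be easier to reason about; the values 
> below are purely illustrative:
> 
> vm.dirty_background_bytes = 268435456
> vm.dirty_bytes = 1073741824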
> 
> 
> 
> 
