I came across the command below, and it has helped track down some latency issues caused by other processes (ss -s). Can you run it while you are seeing the latency issues and post the results here?

dstat -c --top-cpu -d --top-bio --top-latency -n
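For example, sampling every 5 seconds for an hour should cover at least one full ~50-minute episode (the interval and count here are only suggestions):

    dstat -c --top-cpu -d --top-bio --top-latency -n 5 720

Note that --top-latency ships as a dstat plugin, so it may not be available on every install.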
-Bryan

> On Jan 26, 2018, at 5:48 AM, Veiko Kukk <[email protected]> wrote:
>
> Hi again,
>
> I'd really appreciate it if somebody could point me in the right direction on how to solve this.
> Whatever ATS does every ~50 minutes has a strong effect on response times.
> ATS is used in reverse proxy mode, and we run regular tests against a test URL on the proxied server(s) (excluded from caching in the ATS config).
> This test GET is run as an HAproxy health check every ~15 seconds for two local HAproxy backends, which both pass requests to a single local ATS.
>
> It is a fairly complex setup, but the point is that the tests run frequently and give information about ATS response times over a long period.
>
> Total test runs today: 6364
> Tests that took over 7s today: 50
>
> Distribution of requests; the first column is the response time in milliseconds, the second the number of requests under that value:
> 100 1292
> 300 4351
> 500 5194
> 700 5578
> 900 5794
> 1200 5985
> 1400 6058
> 1800 6143
>
> Here is the output of the tests log containing all of the extremely slow responses (the second column is the response time in milliseconds). The test response size is only 609 bytes. Usually the response time fluctuates around
>
> 2018-01-26T01:13:32.150186+00:00 12412
> 2018-01-26T01:13:32.150188+00:00 20803
> 2018-01-26T02:05:04.536931+00:00 29764
> 2018-01-26T02:05:04.536936+00:00 27271
> 2018-01-26T02:05:04.536941+00:00 10233
> 2018-01-26T02:56:26.968987+00:00 9511
> 2018-01-26T02:56:26.968989+00:00 30084
> 2018-01-26T02:56:26.968991+00:00 27337
> 2018-01-26T04:39:21.947460+00:00 24171
> 2018-01-26T04:39:21.947462+00:00 12042
> 2018-01-26T04:39:21.947464+00:00 36979
> 2018-01-26T04:39:31.954116+00:00 7369
> 2018-01-26T04:39:31.954118+00:00 32305
> 2018-01-26T04:39:31.954120+00:00 19779
> 2018-01-26T04:47:42.349748+00:00 29177
> 2018-01-26T04:47:42.349754+00:00 26212
> 2018-01-26T04:47:42.349757+00:00 21645
> 2018-01-26T04:47:42.349759+00:00 24932
> 2018-01-26T05:39:04.925435+00:00 32361
> 2018-01-26T05:39:04.925438+00:00 33587
> 2018-01-26T05:39:04.925440+00:00 8173
> 2018-01-26T05:39:04.925443+00:00 28149
> 2018-01-26T05:39:04.925445+00:00 29115
> 2018-01-26T06:30:27.643170+00:00 7423
> 2018-01-26T06:30:27.643172+00:00 32271
> 2018-01-26T06:30:27.643174+00:00 18927
> 2018-01-26T06:30:27.643179+00:00 27849
> 2018-01-26T06:30:37.644023+00:00 15160
> 2018-01-26T07:21:50.231681+00:00 19208
> 2018-01-26T07:21:50.231684+00:00 14984
> 2018-01-26T08:13:12.874501+00:00 16876
> 2018-01-26T08:13:22.885389+00:00 14007
> 2018-01-26T09:04:35.509167+00:00 9016
> 2018-01-26T09:04:35.509172+00:00 9356
> 2018-01-26T09:55:58.052277+00:00 24137
> 2018-01-26T09:55:58.052280+00:00 23709
> 2018-01-26T09:55:58.052282+00:00 19901
> 2018-01-26T09:55:58.052284+00:00 19034
> 2018-01-26T10:47:10.614261+00:00 23419
> 2018-01-26T10:47:10.614263+00:00 18967
> 2018-01-26T11:38:32.984318+00:00 14425
> 2018-01-26T11:38:32.984324+00:00 9797
> 2018-01-26T11:38:32.984326+00:00 11161
> 2018-01-26T11:38:32.984329+00:00 16228
> 2018-01-26T12:29:45.511517+00:00 15580
> 2018-01-26T12:29:45.511520+00:00 11439
> 2018-01-26T13:20:58.023816+00:00 21360
> 2018-01-26T13:20:58.023818+00:00 19488
> 2018-01-26T13:20:58.023821+00:00 14737
> 2018-01-26T13:20:58.023823+00:00 17118
>
> The question is: why does ATS regularly slow down? Are there internal management jobs that share the same single queue as incoming requests, so that requests simply wait in that queue until the internal work finishes?
> It gets worse as ATS uptime increases, eventually triggering HAproxy health check timeouts.
> After a restart it is not that bad again for a while.
>
> How do I get rid of this regular slowness?
>
> Best regards,
> Veiko
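As a quick sanity check on that period: assuming the slow-response log above is saved to a file (slow.log is a hypothetical name), something like this prints the gaps between consecutive stall groups in minutes:

    cut -d. -f1 slow.log | uniq |
      while read t; do date -u -d "$t" +%s; done |
      awk 'NR > 1 && $1 - prev > 300 { printf "%.1f min\n", ($1 - prev) / 60 }
           { prev = $1 }'

The first two gaps, for example, are 51m32s (01:13:32 to 02:05:04) and 51m22s (02:05:04 to 02:56:26), so the period really does look fixed rather than load-driven.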
> 2018-01-23 13:53 GMT+02:00 Veiko Kukk <[email protected]>:
>
> Hi,
>
> I should have noted before that, during that timeframe, there is no higher disk activity than on average. No higher load, no disk latency, no CPU load. Nothing abnormal except slow ATS.
> ATS is running on CentOS 7, directly on a dedicated hardware server.
>
> Dirty-page-related config that has always been set on that server:
>
> vm.dirty_background_ratio = 5
> vm.dirty_ratio = 40
> vm.swappiness = 0
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:         128831       27841        2436        4331       98554       95722
> Swap:          4095           0        4095
>
> As you can see, there is a lot of available memory.
>
> I don't see how writing dirty pages could slow down ATS when there is no indication of excessive load on any system resource.
> And there is this strange regularity: it happens every ~50 minutes, as if some regular (cron-like) job were being run inside ATS that delays all other tasks.
>
> ATS is using a 9 TB raw partition, if that information is relevant.
>
> Could you point me to the documentation dealing with the dir entry sync periods of ATS?
>
> --
> Veiko
>
> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <[email protected]>:
>
>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <[email protected]> wrote:
>>
>> Hi again,
>>
>> That mysterious task that happens every ~50-51 minutes causes requests/responses to slow down drastically, even time out.
>> Requests that usually take a few hundred milliseconds are taking over 30 s and timing out. This happens only while ATS's memory consumption suddenly drops, and it happens both for bypassed URLs and for cache hits.
>> The ATS version is 7.1.1, and this looks like a serious bug to me.
>
> That sounds suspiciously like kernel paging activity; maybe it's spending that time flushing dirty pages? Maybe transparent huge pages? Or tweak the sysctls for the dirty page ratios?
>
> The other thing to possibly look at is the dir entry sync periods of ATS. Whenever we sync those to disk, we consume both more memory and more disk I/O, and maybe you are putting too much pressure on the VM (i.e. maybe you need to turn down the RAM cache or tweak the number of directory entries you have).
>
> — Leif
>
> E.g.
>
> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
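For reference, the tuning that article describes amounts to switching from ratio-based thresholds to absolute byte-based ones, so that writeback happens in small, frequent bursts instead of one large periodic flush. A sketch, with illustrative values only:

    # start background writeback once 64 MB of pages are dirty
    sysctl -w vm.dirty_background_bytes=67108864
    # throttle writers and flush synchronously beyond 512 MB of dirty pages
    sysctl -w vm.dirty_bytes=536870912

(The kernel zeroes vm.dirty_background_ratio/vm.dirty_ratio automatically when the *_bytes variants are set.) The transparent-huge-pages state Leif asks about can be checked with:

    cat /sys/kernel/mm/transparent_hugepage/enabled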

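On the ATS side, the dir entry sync knobs Leif mentions live in records.config. A sketch of the relevant settings; the values are purely illustrative and would need tuning for a 9 TB cache:

    # how often, in seconds, the cache directory is synced to disk
    CONFIG proxy.config.cache.dir.sync_frequency INT 60
    # cap the RAM cache explicitly instead of relying on automatic sizing
    CONFIG proxy.config.cache.ram_cache.size INT 2G
    # a larger average object size means fewer directory entries for a
    # given cache size, and less state written out on each sync
    CONFIG proxy.config.cache.min_average_object_size INT 32768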