Hi again,

I'd really appreciate it if somebody could point me in the right direction on how to solve this. Whatever ATS does every ~50 minutes has a strong effect on response times. ATS is used in reverse-proxy mode, and we run regular tests against a test URL on the proxied server(s) (excluded from caching in the ATS config). This test GET is run as an HAproxy health check every ~15 seconds for two local HAproxy backends, which both pass requests to a single local ATS.
It's quite a complex setup, but the point is that the tests run frequently and give information about ATS response times over a long period.

Total test runs today: 6364
Tests that took over 7 s today: 50

Distribution of requests (first column: response time in ms; second column: number of requests under that value):

  100  1292
  300  4351
  500  5194
  700  5578
  900  5794
 1200  5985
 1400  6058
 1800  6143

Here is the output of the test log containing all the extremely slow responses (response times in ms). The test response size is only 609 bytes, and usually the response time fluctuates around a few hundred milliseconds.

2018-01-26T01:13:32.150186+00:00 12412
2018-01-26T01:13:32.150188+00:00 20803
2018-01-26T02:05:04.536931+00:00 29764
2018-01-26T02:05:04.536936+00:00 27271
2018-01-26T02:05:04.536941+00:00 10233
2018-01-26T02:56:26.968987+00:00 9511
2018-01-26T02:56:26.968989+00:00 30084
2018-01-26T02:56:26.968991+00:00 27337
2018-01-26T04:39:21.947460+00:00 24171
2018-01-26T04:39:21.947462+00:00 12042
2018-01-26T04:39:21.947464+00:00 36979
2018-01-26T04:39:31.954116+00:00 7369
2018-01-26T04:39:31.954118+00:00 32305
2018-01-26T04:39:31.954120+00:00 19779
2018-01-26T04:47:42.349748+00:00 29177
2018-01-26T04:47:42.349754+00:00 26212
2018-01-26T04:47:42.349757+00:00 21645
2018-01-26T04:47:42.349759+00:00 24932
2018-01-26T05:39:04.925435+00:00 32361
2018-01-26T05:39:04.925438+00:00 33587
2018-01-26T05:39:04.925440+00:00 8173
2018-01-26T05:39:04.925443+00:00 28149
2018-01-26T05:39:04.925445+00:00 29115
2018-01-26T06:30:27.643170+00:00 7423
2018-01-26T06:30:27.643172+00:00 32271
2018-01-26T06:30:27.643174+00:00 18927
2018-01-26T06:30:27.643179+00:00 27849
2018-01-26T06:30:37.644023+00:00 15160
2018-01-26T07:21:50.231681+00:00 19208
2018-01-26T07:21:50.231684+00:00 14984
2018-01-26T08:13:12.874501+00:00 16876
2018-01-26T08:13:22.885389+00:00 14007
2018-01-26T09:04:35.509167+00:00 9016
2018-01-26T09:04:35.509172+00:00 9356
2018-01-26T09:55:58.052277+00:00 24137
2018-01-26T09:55:58.052280+00:00 23709
2018-01-26T09:55:58.052282+00:00 19901
2018-01-26T09:55:58.052284+00:00 19034
2018-01-26T10:47:10.614261+00:00 23419
2018-01-26T10:47:10.614263+00:00 18967
2018-01-26T11:38:32.984318+00:00 14425
2018-01-26T11:38:32.984324+00:00 9797
2018-01-26T11:38:32.984326+00:00 11161
2018-01-26T11:38:32.984329+00:00 16228
2018-01-26T12:29:45.511517+00:00 15580
2018-01-26T12:29:45.511520+00:00 11439
2018-01-26T13:20:58.023816+00:00 21360
2018-01-26T13:20:58.023818+00:00 19488
2018-01-26T13:20:58.023821+00:00 14737
2018-01-26T13:20:58.023823+00:00 17118

The question is: why does ATS regularly slow down? Are there internal management jobs that share the same queue as incoming requests, so that requests simply sit in the queue until those internal tasks finish? It gets worse as ATS uptime increases, eventually triggering HAproxy health-check timeouts; after a restart it is not that bad again for a while. How can I get rid of this regular slowness?

Best regards,
Veiko

2018-01-23 13:53 GMT+02:00 Veiko Kukk <[email protected]>:
> Hi,
>
> I should have noted before that, during that timeframe, there is no
> higher disk activity than on average. No higher load, no disk latency,
> no CPU load. Nothing abnormal except slow ATS.
> ATS is running on CentOS 7 directly on a dedicated hardware server.
>
> The dirty-pages-related config that has always been there for that
> server:
>
> vm.dirty_background_ratio = 5
> vm.dirty_ratio = 40
> vm.swappiness = 0
>
> # free -m
>               total        used        free      shared  buff/cache   available
> Mem:         128831       27841        2436        4331       98554       95722
> Swap:          4095           0        4095
>
> As you see, there is a lot of available memory.
>
> I don't see how writing dirty pages could slow down ATS when there is
> no indication of excessive load on any of the system resources.
> And there is this strange regularity: it happens every ~50 minutes,
> as if some regular (cron-like) task were being run inside ATS that
> delays all other tasks.
>
> ATS is using a 9 TB raw partition, if that information is relevant.
>
> Could you point me to the documentation dealing with the dir entry
> sync periods of ATS?
>
> --
> Veiko
>
> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <[email protected]>:
>>
>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <[email protected]> wrote:
>>
>> Hi again,
>>
>> That mysterious task that runs every ~50-51 minutes causes
>> requests/responses to slow down drastically, or even time out.
>> Requests that usually take a few hundred milliseconds are taking over
>> 30 s and timing out. This happens only while ATS's memory consumption
>> suddenly drops, and it happens both for bypassed URLs and for hits.
>> The ATS version is 7.1.1, and this looks like a serious bug to me.
>>
>>
>> That sounds suspiciously like kernel paging activity; maybe it's
>> spending that time flushing dirty pages? Maybe transparent huge
>> pages? Or tweak the sysctls for the dirty page ratios?
>>
>> The other thing to possibly look at is the dir entry sync periods of
>> ATS. Whenever we sync those to disk, we consume both more memory and
>> more disk I/O, and maybe you are putting too much pressure on the VM
>> (i.e. maybe you need to turn down the RAM cache or tweak the number
>> of directory entries you have).
>>
>> — Leif
>>
>> E.g.
>>
>> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
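[Editor's note for later readers: the ATS knobs Leif alludes to live in records.config. A sketch of the settings involved, with what I believe are the 7.1.x defaults; the values are illustrative, not tuning advice, so double-check the names and defaults against your version's records.config documentation.]

```
# How often (in seconds) ATS syncs cache directory entries to disk; each
# sync temporarily consumes extra memory and disk I/O.
CONFIG proxy.config.cache.dir.sync_frequency INT 60

# RAM cache size in bytes; -1 means it is sized automatically from the
# disk cache. Lowering it reduces memory pressure during directory syncs.
CONFIG proxy.config.cache.ram_cache.size INT -1

# Assumed average object size, which determines how many directory
# entries are allocated for the cache partition; raising it means fewer
# directory entries to keep in memory and sync to disk.
CONFIG proxy.config.cache.min_average_object_size INT 8000
```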

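[Editor's note on the dirty-page sysctls discussed above: on a box this size the ratios translate into large absolute thresholds. A rough back-of-the-envelope sketch, taking the total from the `free -m` output quoted in the thread; the kernel actually measures these ratios against reclaimable memory, so treat the numbers as approximations.]

```python
# Approximate dirty-page writeback thresholds implied by the sysctls quoted
# in the thread, on the server whose `free -m` output is shown above.
total_mb = 128831            # free -m "total" column, in MB
dirty_background_ratio = 5   # vm.dirty_background_ratio (percent)
dirty_ratio = 40             # vm.dirty_ratio (percent)

background_mb = total_mb * dirty_background_ratio // 100  # async flush starts
blocking_mb = total_mb * dirty_ratio // 100               # writers stall above

print(f"background writeback starts above ~{background_mb} MB "
      f"(~{background_mb / 1024:.1f} GB) of dirty pages")
print(f"writing processes stall above ~{blocking_mb} MB "
      f"(~{blocking_mb / 1024:.1f} GB) of dirty pages")
```

With these settings roughly 50 GB of dirty data may accumulate before writers are forced to block, which is one way periodic flushing can produce multi-second stalls even when average disk load looks normal.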