Hi Bryan,

System in general is under light load. No other processes cause latency.
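If it helps, next time a spike hits I can leave it running with timestamps so the output can be lined up against the slow responses in our test log, roughly like this (the 5-second interval, the sample count and the log file name are just values I picked, nothing official):

# same plugins you listed, plus -t so every sample carries a timestamp;
# 5-second samples, 60 of them, copied to a file for later correlation
dstat -t -c --top-cpu -d --top-bio --top-latency -n 5 60 | tee /var/tmp/dstat-ats.log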
I've submitted issue https://github.com/apache/trafficserver/issues/3057

Veiko

2018-01-27 2:20 GMT+02:00 Bryan Call <[email protected]>:

> I came across this command and it has helped track down some latency
> issues caused by other processes (ss -s). Can you run it during the time
> you are seeing latency issues and post the results here?
>
> dstat -c --top-cpu -d --top-bio --top-latency -n
>
> -Bryan
>
> On Jan 26, 2018, at 5:48 AM, Veiko Kukk <[email protected]> wrote:
>
> Hi again,
>
> I'd really appreciate it if somebody could point me in the right direction
> on how to solve this.
> Whatever ATS does every ~50 minutes has a strong effect on response times.
> ATS is used in reverse proxy mode and we run regular tests against a test
> URL on the proxied server(s) (excluded from caching in the ATS config).
> This test GET is run as an HAproxy health check every ~15 seconds for two
> local HAproxy backends, which both pass requests to a single local ATS.
>
> It is a fairly complex setup, but the point is that the tests run
> frequently and give information about ATS response times over a long
> period.
>
> Total test runs today: 6364
> Tests that took over 7s today: 50
>
> Distribution of requests; the first column is the response time (ms), the
> second the number of requests under that value:
>
> 100   1292
> 300   4351
> 500   5194
> 700   5578
> 900   5794
> 1200  5985
> 1400  6058
> 1800  6143
>
> Here is the output of the test log that contains all the extremely slow
> responses. The test response size is only 609 bytes. Usually the response
> time fluctuates around
>
> 2018-01-26T01:13:32.150186+00:00 12412
> 2018-01-26T01:13:32.150188+00:00 20803
> 2018-01-26T02:05:04.536931+00:00 29764
> 2018-01-26T02:05:04.536936+00:00 27271
> 2018-01-26T02:05:04.536941+00:00 10233
> 2018-01-26T02:56:26.968987+00:00 9511
> 2018-01-26T02:56:26.968989+00:00 30084
> 2018-01-26T02:56:26.968991+00:00 27337
> 2018-01-26T04:39:21.947460+00:00 24171
> 2018-01-26T04:39:21.947462+00:00 12042
> 2018-01-26T04:39:21.947464+00:00 36979
> 2018-01-26T04:39:31.954116+00:00 7369
> 2018-01-26T04:39:31.954118+00:00 32305
> 2018-01-26T04:39:31.954120+00:00 19779
> 2018-01-26T04:47:42.349748+00:00 29177
> 2018-01-26T04:47:42.349754+00:00 26212
> 2018-01-26T04:47:42.349757+00:00 21645
> 2018-01-26T04:47:42.349759+00:00 24932
> 2018-01-26T05:39:04.925435+00:00 32361
> 2018-01-26T05:39:04.925438+00:00 33587
> 2018-01-26T05:39:04.925440+00:00 8173
> 2018-01-26T05:39:04.925443+00:00 28149
> 2018-01-26T05:39:04.925445+00:00 29115
> 2018-01-26T06:30:27.643170+00:00 7423
> 2018-01-26T06:30:27.643172+00:00 32271
> 2018-01-26T06:30:27.643174+00:00 18927
> 2018-01-26T06:30:27.643179+00:00 27849
> 2018-01-26T06:30:37.644023+00:00 15160
> 2018-01-26T07:21:50.231681+00:00 19208
> 2018-01-26T07:21:50.231684+00:00 14984
> 2018-01-26T08:13:12.874501+00:00 16876
> 2018-01-26T08:13:22.885389+00:00 14007
> 2018-01-26T09:04:35.509167+00:00 9016
> 2018-01-26T09:04:35.509172+00:00 9356
> 2018-01-26T09:55:58.052277+00:00 24137
> 2018-01-26T09:55:58.052280+00:00 23709
> 2018-01-26T09:55:58.052282+00:00 19901
> 2018-01-26T09:55:58.052284+00:00 19034
> 2018-01-26T10:47:10.614261+00:00 23419
> 2018-01-26T10:47:10.614263+00:00 18967
> 2018-01-26T11:38:32.984318+00:00 14425
> 2018-01-26T11:38:32.984324+00:00 9797
> 2018-01-26T11:38:32.984326+00:00 11161
> 2018-01-26T11:38:32.984329+00:00 16228
> 2018-01-26T12:29:45.511517+00:00 15580
> 2018-01-26T12:29:45.511520+00:00 11439
> 2018-01-26T13:20:58.023816+00:00 21360
> 2018-01-26T13:20:58.023818+00:00 19488
> 2018-01-26T13:20:58.023821+00:00 14737
> 2018-01-26T13:20:58.023823+00:00 17118
>
> The question is: why does ATS regularly slow down? Are there some internal
> management jobs that use the same single queue as incoming requests, so
> that requests just wait in the queue until the internal processes finish?
> It is getting worse as ATS uptime increases, triggering HAproxy health
> check timeouts. After a restart it is not that bad again for a while.
>
> How do I get rid of this regular slowness?
>
> Best regards,
> Veiko
>
>
> 2018-01-23 13:53 GMT+02:00 Veiko Kukk <[email protected]>:
>
>> Hi,
>>
>> I should have noted before that, during that timeframe, there is no
>> higher disk activity than on average. No higher load, no disk latency,
>> no CPU load. Nothing abnormal except slow ATS.
>> ATS is running on CentOS 7, directly on a dedicated hardware server.
>>
>> Dirty-page-related config that has always been there for that server:
>>
>> vm.dirty_background_ratio = 5
>> vm.dirty_ratio = 40
>> vm.swappiness = 0
>>
>> # free -m
>>               total        used        free      shared  buff/cache   available
>> Mem:         128831       27841        2436        4331       98554       95722
>> Swap:          4095           0        4095
>>
>> As you can see, there is a lot of available memory.
>>
>> I don't see how writing dirty pages could slow down ATS when there is no
>> indication of excessive load on any of the system resources.
>> And there is this strange regularity: it happens every ~50 minutes, as if
>> some regular (cronjob-like) task were being run inside ATS that delays
>> all other tasks.
>>
>> ATS is using a 9TB raw partition, if that information is relevant.
>>
>> Could you point me to the documentation dealing with the dir entry sync
>> periods of ATS?
>>
>> --
>> Veiko
>>
>>
>> 2018-01-23 12:12 GMT+02:00 Leif Hedstrom <[email protected]>:
>>
>>> On Jan 23, 2018, at 7:36 PM, Veiko Kukk <[email protected]> wrote:
>>>
>>> Hi again,
>>>
>>> That mysterious task that happens every ~50-51 minutes causes
>>> requests/responses to slow down very much, even time out.
>>> Requests that usually take a few hundred milliseconds are now taking
>>> over 30s and timing out. This happens only during the time when memory
>>> consumption suddenly drops in ATS. It happens for both bypassed URLs
>>> and for hits.
>>> The ATS version is 7.1.1 and this looks like a serious bug to me.
>>>
>>>
>>> That sounds suspiciously like kernel paging activity; maybe it's
>>> spending that time flushing dirty pages? Maybe transparent huge pages?
>>> Or tweak the sysctls for dirty page ratios?
>>>
>>> The other thing to possibly look at is the dir entry sync periods of
>>> ATS. Whenever we sync those to disk, we consume both more memory and
>>> more disk I/O, and maybe you are putting too much pressure on the VM
>>> (i.e. maybe you need to turn down the RAM cache or tweak the amount of
>>> directory entries you have).
>>>
>>> — Leif
>>>
>>> E.g.
>>> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
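PS: the ATS knobs Leif refers to seem to live in records.config. I have not
changed any of them yet, and the lines below are only my reading of the
documented defaults, not a tuning recommendation (please correct me if the
names or defaults are wrong):

# defaults as I understand them, not values to copy
# how often (in seconds) the cache directory metadata is synced to disk
CONFIG proxy.config.cache.dir.sync_frequency INT 60
# -1 lets ATS size the RAM cache automatically
CONFIG proxy.config.cache.ram_cache.size INT -1
# together with the 9TB raw partition, this controls how many directory entries get allocated
CONFIG proxy.config.cache.min_average_object_size INT 8000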

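The dirty-page tweak from the article Leif links comes down to switching from
ratios to absolute byte limits, something along these lines (the byte values
are only an illustration; writing the *_bytes variants makes the kernel ignore
the corresponding *_ratio settings):

# start background writeback at 128 MB of dirty data (example value only)
vm.dirty_background_bytes = 134217728
# throttle writers and flush synchronously at 512 MB (example value only)
vm.dirty_bytes = 536870912

On CentOS 7, sysctl --system reloads these after editing /etc/sysctl.conf or a
file under /etc/sysctl.d/.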