Are you seeing a CPU spike during this time too?  What does dstat show while 
it is happening?  It might also be helpful to run a sudo perf record while you 
are seeing the latency issues, to see which function calls are using the CPU.
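
A rough sketch of what I mean (the process name and durations below are just 
placeholders, adjust them for your setup):

$ dstat -cdngy 1 60
$ sudo perf record -F 99 -g -p $(pidof traffic_server) -- sleep 30
$ sudo perf report --stdio | head -50

That should show which functions traffic_server is spending CPU in while the 
slow requests are happening.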

Do you have a large RAM cache and multiple NUMA nodes?  Running numastat -v 
will show you if memory is allocated unevenly across the NUMA nodes.  Also, 
setting transparent_hugepage=never on boot has helped people with large RAM 
caches.  This may be the problem Steve is having.
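
Something like this, as a rough sketch (the grub path below is an assumption 
based on a typical Linux install):

$ numastat -v
$ cat /sys/kernel/mm/transparent_hugepage/enabled

To disable THP at boot, add transparent_hugepage=never to the kernel command 
line, e.g. in /etc/default/grub:

GRUB_CMDLINE_LINUX="... transparent_hugepage=never"

then run update-grub (or grub2-mkconfig, depending on your distro) and reboot.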

What is your request rate during these spikes?  It is possible you are running 
into lock contention in the cache if the request rate is high.  If you have 
only one storage device and no volumes, then you will only have 1 lock for the 
cache.  Configuring 4 volumes will reduce the amount of lock contention.  This 
is our default volume.config:

volume=1 scheme=http size=25%
volume=2 scheme=http size=25%
volume=3 scheme=http size=25%
volume=4 scheme=http size=25%
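
To apply that, something like the following should work (the paths and service 
name here are assumptions based on a typical install, and note that changing 
the volume layout will most likely clear the existing cache):

$ sudo vi /etc/trafficserver/volume.config
$ sudo systemctl restart trafficserver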

Hopefully, one of these pointers will fix the issue you are having.

-Bryan




> On Aug 21, 2016, at 9:00 PM, Hiroaki Nakamura <[email protected]> wrote:
> 
> I turned on debug log and got slow log and debug log.
> 
> slow log
> [Aug 22 11:32:48.103] Server {0x2af64b6c1700} ERROR: <HttpSM.cc:6820
> (update_stats)> [6072850] Slow Request: client_ip: 133.167.4.17:48228
> url: **masked**/7270.png status: 200 unique id:  redirection_tries: 0
> bytes: 43 fd: 0 client state: 0 server state: 9 ua_begin: 0.000
> ua_first_read: 0.000 ua_read_header_done: 0.000 cache_open_read_begin:
> 0.003 cache_open_read_end: 2.913 dns_lookup_begin: 2.913
> dns_lookup_end: 2.913 server_connect: 2.925 server_first_read: 2.950
> server_read_header_done: 2.950 server_close: 2.953 ua_close: 2.953
> sm_finish: 2.954 plugin_active: 0.000 plugin_total: 0.000
> 
> 
> debug log
> 
> [Aug 22 11:32:45.151] Server {0x2af64b6c1700} DEBUG: <HttpSM.cc:4403
> (do_cache_lookup_and_read)> (http_seq)
> [HttpSM::do_cache_lookup_and_read] [6072850] Issuing cache lookup for
> URL  **masked**/7270.png
> ...
> [Aug 22 11:32:47.942] Server {0x2af64bcc7700} DEBUG:
> <HttpTransactCache.cc:183 (SelectFromAlternates)> (http_match)
> [SelectFromAlternates] # alternates = 1
> ...
> [Aug 22 11:32:48.061] Server {0x2af64b6c1700} DEBUG:
> <HttpCacheSM.cc:102 (state_cache_open_read)> (http_cache) [6072850]
> [&HttpCacheSM::state_cache_open_read, CACHE_EVENT_OPEN_READ]
> 
> I noticed a gap between these debug log timestamps which is roughly equal
> to (cache_open_read_end - cache_open_read_begin) above.
> 
> 2016-08-22 9:25 GMT+09:00 Hiroaki Nakamura <[email protected]>:
>> Hi Bryan,
>> Thanks for your reply!
>> 
>> I checked my server and found that it is already set to a small value.
>> 
>> $ sudo sysctl -a | grep vm.dirty_writeback_centisecs
>> vm.dirty_writeback_centisecs = 500
>> 
>> Is there another config to check?
>> Thanks
>> 
>> 2016-08-20 1:36 GMT+09:00 Bryan Call <[email protected]>:
>>> I have seen issues with the write buffers filling up and being flushed, which
>>> stalls the process.  This happened to me when doing large file transfers on
>>> Linux.
>>> 
>>> I ended up changing the value of vm.dirty_writeback_centisecs to 1500.
>>> 
>>> $ sudo sysctl -a | grep writeback
>>> vm.dirty_writeback_centisecs = 1500
>>> 
>>> Here is an article on the issue:
>>> https://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/
>>> 
>>> -Bryan
>>> 
>>> 
>>> 
>>> 
>>> On Aug 18, 2016, at 10:56 PM, Hiroaki Nakamura <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> I forgot to write traffic server version.
>>> I'm using ATS 6.1.1 and using the raw partition for the cache storage.
>>> 
>>> I also noticed vmstat disk/writ values remain 0 for one minute,
>>> and then non-zero values follow and slow responses happen.
>>> 
>>> 
>>> 2016-08-19 14:27 GMT+09:00 Hiroaki Nakamura <[email protected]>:
>>> 
>>> Hi,
>>> 
>>> I'm having an issue where a request sometimes takes 3 to 5 seconds,
>>> approximately every 39 minutes.
>>> 
>>> I turned on the slow log and found out that cache_open_read_end is the most
>>> time-consuming step.
>>> 
>>> status: 200 unique id:  redirection_tries: 0 bytes: 43 fd: 0 client
>>> state: 0 server state: 9 ua_begin: 0.000 ua_first_read: 0.000
>>> ua_read_header_done: 0.000 cache_open_read_begin: 0.000
>>> cache_open_read_end: 2.536 dns_lookup_begin: 2.536 dns_lookup_end:
>>> 2.536 server_connect: 2.537 server_first_read: 2.589
>>> server_read_header_done: 2.589 server_close: 2.589 ua_close: 2.589
>>> sm_finish: 2.589 plugin_active: 0.000 plugin_total: 0.000
>>> 
>>> Could you give me advice on why cache_open_read_end takes so long,
>>> and how to fix the slow responses?
>>> 
>>> Thanks,
>>> Hiroaki
>>> 
>>> 
