[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316488#comment-14316488 ]

James Peach commented on TS-3386:

This sounds like something bad is happening to {{traffic_manager}}. Is there anything in {{manager.log}} or {{diags.log}} about that? Is it deadlocked or crashing?

Heartbeat failed with high load, trafficserver restarted

Key: TS-3386
URL: https://issues.apache.org/jira/browse/TS-3386
Project: Traffic Server
Issue Type: Bug
Components: Performance
Reporter: Luca Bruno

I've been evaluating ATS for some days. I'm using it with mostly default settings, except that I've lowered the number of connections to the backend, I have a raw storage of 500 GB, and I've disabled the RAM cache. It was working fine, so I wanted to stress it more. I increased the test to 1000 concurrent requests, at which point the ATS worker was restarted and thus lost the whole cache.

/var/log/syslog:
{noformat}
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server Starting ---
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: (last system error 32: Broken pipe)
Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal [32985 256]
Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, making sure traffic_server is dead
Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting ---
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:05:19)
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server Starting ---
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server Starting ---
Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 -
{noformat}
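The restart cycle in the syslog excerpt can be tallied mechanically. The following is a minimal sketch, not part of ATS: the regular expression and the `summarize` helper are illustrative assumptions about the syslog layout shown in this report, and the sample is the first five lines of the log.

```python
import re
from collections import Counter

# First five lines of the syslog excerpt from the issue description.
SAMPLE = """\
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
"""

# Assumed layout: "Mon DD HH:MM:SS host process[pid]: message".
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<time>\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"(?P<proc>\w+)\[(?P<pid>\d+)\]:\s+(?P<msg>.*)$"
)


def summarize(text: str) -> Counter:
    """Count heartbeat failures, non-200 probes and kills issued by traffic_cop."""
    counts: Counter = Counter()
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None or match.group("proc") != "traffic_cop":
            continue
        msg = match.group("msg")
        if "server heartbeat failed" in msg:
            counts["heartbeat_failed"] += 1
        elif "non-200 status" in msg:
            counts["non_200"] += 1
        elif msg == "killing server":
            counts["kills"] += 1
    return counts


if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Run over the full log, a tally like this makes the pattern easy to see: two failed probes ten seconds apart, then one kill, repeated every twenty seconds.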
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316482#comment-14316482 ]

Zhao Yongming commented on TS-3386:

Well, things get more interesting.

q1: Why do you lose the cached content when traffic_server restarts?
q1.1: Is that a cache issue?
q2: You want to protect the origin server, so why do you think limiting connections on the UA side is a better solution than limiting them on the origin side?
q2.1: Have you seen any connection (httpSM) hang-ups?
q2.2: What is a better way to handle the connection issue, for example a timeout?

When you handle tons of cache and tons of traffic, keeping it simple and robust is always better than anything intelligent. Yes, we have fixed many of the cache issues we have met, as well as HTTP SM issues, connection timeout issues, connection leaks ... I think most of the important changes are already in the official tree. This is how we track down root issues in ATS, and it may lead to just some very tiny fixes that only affect very high-traffic sites with very strict SLA requirements.
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517 ]

Luca Bruno commented on TS-3386:

James: I've pasted the manager.log above; diags had nothing related. And syslog clearly shows that the server returned 502 to cop, which decided to kill the server after the heartbeat failed 2 times:

{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}

To reproduce, set the connection throttle to 100 and perform 1000 concurrent requests to ATS with a minimal hit ratio (so that it also creates connections to the origin server).

Zhao:

q1: Whenever I restart ATS, the cache starts from zero. It's a 500 GB raw disk cache. Isn't it supposed to work like this? Perhaps I'm missing some option. Either way, cop would keep restarting the server, so fixing the cache issue wouldn't solve this problem.
q2: I'm just testing stuff. Of course I won't be limiting the number of connections like that, but I still consider this a bug.
q2.1: Not sure, sorry. What is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and immediately close the connection with an error (which is apparently what the server does, except that it also does so for the heartbeat). For example, nginx doesn't kill its worker children just because it reached the maximum number of connections.

Sure, I keep it simple, but robustness also means that the server must not be restarted just because it's busy. Being busy means it's working, and it must not be killed by the heartbeat. In my opinion this can be solved by using a dedicated connection for the heartbeat.
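The reproduction described in this comment (1000 concurrent requests with a minimal hit ratio) could be driven by a small load generator. This is a minimal sketch under stated assumptions: the base URL is a hypothetical placeholder for the ATS listener, the `object-N` paths are invented so that every request is a cache miss, and `fetch_status` and `run_load` are illustrative helpers, not part of any ATS tooling.

```python
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status for url, or 0 if no HTTP response was received."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 502 from an overloaded proxy
    except (urllib.error.URLError, OSError):
        return 0  # refused, reset, or timed out


def run_load(fetch: Callable[[str], int], base_url: str,
             total: int, concurrency: int) -> Counter:
    """Issue `total` requests with up to `concurrency` in flight; tally statuses."""
    # A distinct path per request keeps the hit ratio near zero, so the proxy
    # also has to open connections to the origin server.
    urls = [f"{base_url}/object-{i}" for i in range(total)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, urls))


if __name__ == "__main__":
    # Hypothetical address; replace with the actual ATS listener under test.
    print(run_load(fetch_status, "http://127.0.0.1:8080", total=1000, concurrency=1000))
```

The fetch function is passed in as a parameter so that the tallying logic can be exercised without a live server. Whether the throttle itself is what produces the 502 responses seen by cop is what the thread is trying to establish, so the resulting counts should be read as evidence rather than proof.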
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316560#comment-14316560 ]

Luca Bruno commented on TS-3386:

Sorry, when I restart trafficserver it does indeed re-read the cache (after adding the udev rule for the raw disk). But of course when it gets restarted by cop, it restarts often, and thus I see the cache empty.
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316683#comment-14316683 ]

Leif Hedstrom commented on TS-3386:

I think TS-2490 is close / similar in effect, where we can kill traffic_server because manager can't get a connection to it. Not quite the same, but a similar issue.

I cannot find the Jira that talked about running out of connections; maybe it was fixed? I thought there was a commit some time ago about excluding the back door requests from connection throttling?

To the OP: which version of ATS are you running?
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316671#comment-14316671 ]

Leif Hedstrom commented on TS-3386:

I'm pretty sure there was a Jira on exactly that suggestion [~jpe...@apache.org]
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316655#comment-14316655 ]
James Peach commented on TS-3386:
Ok, that's nasty ... we should exempt heartbeats from connection throttling IMHO.
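The exemption James suggests in this comment could look roughly like the sketch below. This is not ATS code: the class name `ConnectionGate`, its `limit` field, and the idea of recognizing the heartbeat by a flag plus its loopback source address are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConnectionGate:
    """Toy admission check; hypothetical, not the ATS implementation."""

    limit: int        # plays the role of a connection throttle
    active: int = 0   # admitted connections that count against the limit

    def admit(self, peer_ip: str, is_heartbeat: bool = False) -> bool:
        # Assumption: health checks come from loopback and are flagged as such.
        if is_heartbeat and peer_ip == "127.0.0.1":
            return True   # exempt: neither counted nor throttled
        if self.active >= self.limit:
            return False  # throttled
        self.active += 1
        return True

    def release(self) -> None:
        # Call when a counted connection closes.
        self.active = max(0, self.active - 1)
```

With a gate like this, a saturated server would still answer the health check, so the watchdog would see a busy server rather than a dead one.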
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316783#comment-14316783 ]
Luca Bruno commented on TS-3386:
I'm using the latest 5.2.0. Just to say, I've read some of the traffic_cop code and it's quite clear: if the heartbeat fails 2 times, the server is killed. I didn't pay attention to whether it's using a separate connection, though.
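The two-strike behaviour described here matches the syslog excerpt in the issue, where "server heartbeat failed [1]" and "[2]" are followed by "killing server". A minimal Python sketch of that pattern follows. It is not the traffic_cop source; the function name `cop_actions`, the `kill_after` parameter, and the assumption that a successful probe resets the counter are illustrative only.

```python
from typing import Iterable, List

KILL_AFTER = 2  # consecutive failures seen in the log before "killing server"


def cop_actions(statuses: Iterable[int], kill_after: int = KILL_AFTER) -> List[str]:
    """Return the actions a two-strike watchdog would take for each probe."""
    actions: List[str] = []
    failures = 0
    for status in statuses:
        if status == 200:
            failures = 0  # assumption: a healthy reply clears the strikes
            actions.append("ok")
            continue
        failures += 1
        actions.append(f"server heartbeat failed [{failures}]")
        if failures >= kill_after:
            actions.append("killing server")
            failures = 0  # counting starts over for the new process
    return actions
```

Feeding it two 502 responses in a row reproduces the sequence in the log: two failures, then the kill.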
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316648#comment-14316648 ]
Leif Hedstrom commented on TS-3386:
This is a known problem: if you starve the system of connections so that it starts throttling, it'll kill traffic_server.
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316279#comment-14316279 ]
Zhao Yongming commented on TS-3386:
Well, the remap matters. Please don't mix up 127.0.0.1 8080 with most of the services; that is not how ATS is meant to work as a proxy. Use something like map http://mydomain.com:8080/ . and do your testing using a modified /etc/hosts or -x 127.0.0.1:8080 in curl.
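For readers unfamiliar with what `-x 127.0.0.1:8080` changes on the wire: with a proxy setting, the client connects to the proxy address but asks for the full URL of the mapped host. The Python sketch below builds such a request. The function name `proxy_style_request` and the `/test` path used in the note after it are hypothetical; only `mydomain.com:8080` comes from the comment.

```python
from urllib.parse import urlsplit


def proxy_style_request(url: str) -> bytes:
    """Build the GET request a client sends when told to use an HTTP proxy."""
    parts = urlsplit(url)
    if parts.scheme != "http" or not parts.netloc:
        raise ValueError("expected an absolute http:// URL")
    lines = [
        f"GET {url} HTTP/1.1",    # absolute-form target, as sent to a proxy
        f"Host: {parts.netloc}",  # the mapped host, not 127.0.0.1
        "Connection: close",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")
```

Sending the bytes from `proxy_style_request("http://mydomain.com:8080/test")` to 127.0.0.1:8080 would exercise the remap rule for `mydomain.com` instead of a rule keyed on the loopback address.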
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316376#comment-14316376 ]
Zhao Yongming commented on TS-3386:
If you want to talk about the kill, I'd say more checks should be done before taking down the server, but how would you know that the connections are full and everything else works well? We have tried putting the heartbeat on a connection that is not affected by the connection limit, but that did not seem so good either. The heartbeat is a fake L7 service health checker, which is designed to find out when something is abnormal :D
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316343#comment-14316343 ]
Zhao Yongming commented on TS-3386:
Well, proxy.config.net.connections_throttle = 1000, are you kidding? ATS is neither squid nor httpd-1.x.
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316350#comment-14316350 ]
Luca Bruno commented on TS-3386:
Even with 100 connections, it should limit the connections, not restart and lose the cache: that's what I call kidding. So if I have 3 like the default config suggests, and it reaches the maximum, it would still restart. I think I will disable the kill signal to the child after a failed heartbeat. Would you accept that kind of patch, including a config option?
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316388#comment-14316388 ] Luca Bruno commented on TS-3386: We can find something abnormal by other means in our infrastructure. For example it suffices to alert: heartbeat failed because of 502, instead of killing the server. For us, it's better to wait than to lose terabytes of cache, which would then kill the origin servers afterwards. Also why isn't putting heartbeat in a dedicated connection not good? What wasn't good when you tried it? Do you have some related work I can read? Heartbeat failed with high load, trafficserver restarted Key: TS-3386 URL: https://issues.apache.org/jira/browse/TS-3386 Project: Traffic Server Issue Type: Bug Components: Performance Reporter: Luca Bruno I've been evaluating ATS for some days. I'm using it with mostly default settings, except I've lowered the number of connections to the backend, I have a raw storage of 500gb, and disabled ram cache. Working fine, then I wanted to stress it more. I've increased the test to 1000 concurrent requests, then the ATS worker has been restarted and thus lost the whole cache. 
/var/log/syslog:
{noformat}
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [Alarms::signalAlarm] Server Process was reset
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server Starting ---
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: [LocalManager::sendMgmtMsgToProcesses] Error writing message
Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: (last system error 32: Broken pipe)
Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status signal [32985 256]
Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, making sure traffic_server is dead
Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting ---
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:05:19)
Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server Starting ---
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on Feb 10 2015 at 13:04:42)
Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: RLIMIT_NOFILE(7):cur(736236),max(736236)
Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: Killed
Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR:
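To see at a glance how often traffic_cop gave up on the server, the syslog excerpt can be summarized with a short script. This is only a sketch: the regular expressions are written against the exact message wording in the excerpt, and the names `summarize` and `SAMPLE` are made up for illustration, not part of ATS.

```python
import re
from collections import Counter

# Patterns copied from the wording of the traffic_cop lines in the excerpt.
FAILED = re.compile(r"traffic_cop\[\d+\]: server heartbeat failed \[(\d+)\]")
KILLED = re.compile(r"traffic_cop\[\d+\]: killing server")
STATUS = re.compile(
    r"traffic_cop\[\d+\]: \(http test\) received non-200 status\((\d+)\)"
)


def summarize(lines):
    """Count heartbeat failures, kills and non-200 statuses in syslog lines."""
    summary = {"failures": 0, "kills": 0, "statuses": Counter()}
    for line in lines:
        if FAILED.search(line):
            summary["failures"] += 1
        elif KILLED.search(line):
            summary["kills"] += 1
        else:
            match = STATUS.search(line)
            if match:
                summary["statuses"][int(match.group(1))] += 1
    return summary


# First five lines of the excerpt, used as sample input.
SAMPLE = """\
Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 status(502)
Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
""".splitlines()

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Feeding it the whole of /var/log/syslog instead of `SAMPLE` would give the same counters for a full test run.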
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316238#comment-14316238 ]

Zhao Yongming commented on TS-3386:

Oh, please don't send any traffic, and enable debug on http.*|dns.*. I suspect this is an issue with a HostDB reverse lookup on 127.0.0.1 or a lookup on localhost. Let us dig it out.
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316316#comment-14316316 ]

Luca Bruno commented on TS-3386:

OK, my remap is now:

{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

where test-cache.local is a real entry in a DNS server on the local network. Things didn't change much.

By the way, I've found something valuable in error.log (it also happened before). This line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error internal error - server connection terminated/-1) for 'http://test-cache.service.farm:8081/storage/test29429?size=11023&age=46230&sleep=0'
{noformat}

Please note that sending the same concurrent requests directly to the backend server works just fine, without any error. Anyway, I believe an error in the backend should not result in ATS being restarted.

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) [HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) [HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update marking IP: 192.168.x.y:8081 as down
{noformat}
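Since the same error.log line repeats indefinitely, it may help to pull out the status, the reason and the origin it refers to, so the 502s can be grouped per origin. The following is a rough sketch that assumes every line has the same shape as the one quoted in this comment (and that the query separators in the URL are `&`); `parse_error_line` is a hypothetical helper, not an ATS tool.

```python
import re
from urllib.parse import urlsplit

# Shape assumed from the single error.log line quoted in the comment:
#   <timestamp> RESPONSE: sent <client> status <code> (<reason>) for '<url>'
ERROR_LINE = re.compile(
    r"^(?P<ts>\S+) RESPONSE: sent (?P<client>\S+) status (?P<status>\d+) "
    r"\((?P<reason>.*)\) for '(?P<url>[^']*)'$"
)


def parse_error_line(line):
    """Return the fields of one error.log line, or None if it does not match."""
    match = ERROR_LINE.match(line.strip())
    if match is None:
        return None
    url = urlsplit(match.group("url"))
    return {
        "timestamp": match.group("ts"),
        "client": match.group("client"),
        "status": int(match.group("status")),
        "reason": match.group("reason"),
        "origin": url.netloc,  # host:port the proxy tried to reach
        "path": url.path,
    }


LINE = (
    "20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 "
    "(Connect Error internal error - server connection terminated/-1) "
    "for 'http://test-cache.service.farm:8081/storage/test29429"
    "?size=11023&age=46230&sleep=0'"
)

if __name__ == "__main__":
    print(parse_error_line(LINE))
```

Grouping the parsed entries by `origin` and `reason` would show whether all the 502s point at the same backend address that the debug output reports as marked down.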
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316266#comment-14316266 ]

Luca Bruno commented on TS-3386:

Thanks a lot for your answer. I'm seeing this:

{noformat}
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) Next action SM_ACTION_DNS_LOOKUP; OSDNSLookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http) [9] State Transition: SM_ACTION_API_CACHE_LOOKUP_COMPLETE -> SM_ACTION_DNS_LOOKUP
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) [HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) [ink_cluster_time] local: 1423665061, highest_delta: 0, cluster: 1423665061
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) [HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) [HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) [OSDNSLookup] DNS lookup for O.S. successful IP: 127.0.0.1
{noformat}

Should I be looking for something else? This is my remap:

{noformat}
map http://127.0.0.1:8080/ http://127.0.0.1:8081/
{noformat}
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316134#comment-14316134 ]

Luca Bruno commented on TS-3386:

Additionally, proxy.config.cop.core_signal is 0, which should mean that no signal is sent to stop the server. Yet it seems to send signal 9, unless I'm misunderstanding the option description.
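The behaviour visible in the syslog of the issue description (failure [1], failure [2], then "killing server", after which the counter starts again at [1]) can be modelled as a small counter. This is a sketch inferred from those log lines only; it is not taken from the traffic_cop source, and the class name, the threshold field and the `kill_on_failure` switch are all hypothetical. It does not model proxy.config.cop.core_signal.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Toy model of the failure counting seen in the syslog excerpt."""

    max_failures: int = 2          # assumption: "[2]" precedes every kill in the log
    kill_on_failure: bool = True   # hypothetical switch: kill, or only alert
    failures: int = 0

    def observe(self, status: int) -> str:
        """Record one heartbeat result and return the resulting action."""
        if status == 200:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures < self.max_failures:
            return "failed"
        # Threshold reached: reset the counter, as the log restarts at [1].
        self.failures = 0
        return "kill" if self.kill_on_failure else "alert"


if __name__ == "__main__":
    watchdog = HeartbeatWatchdog()
    for status in (502, 502, 502, 502):
        print(status, watchdog.observe(status))
```

With `kill_on_failure=False` the same sequence of 502 responses yields "alert" instead of "kill", which is the alternative policy asked for elsewhere in this thread.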
[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted
[ https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316143#comment-14316143 ]

Luca Bruno commented on TS-3386:

So it looks like the server saturated its connections and replied 502 even to the heartbeat, without analyzing the response. Am I right? In that case, I'd like either to disable the heartbeat completely or to start a second TCP server just for the heartbeat, instead of mixing heartbeat and client connections within the same server.
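To make the "second TCP server just for the heartbeat" idea concrete, here is a generic sketch of a health listener that runs on its own port and thread, so it does not share a listener with client traffic. It illustrates the concept only and says nothing about how ATS or traffic_cop are implemented; `HeartbeatHandler`, `start_heartbeat_listener` and `probe` are made-up names, and port 0 simply asks the OS for any free port.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeartbeatHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200, independent of the client-facing port."""

    def do_GET(self):
        body = b"ok\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the sketch quiet; a real service would log somewhere.
        pass


def start_heartbeat_listener(host="127.0.0.1", port=0):
    """Start the listener on a background thread and return the server."""
    server = HTTPServer((host, port), HeartbeatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(port, host="127.0.0.1", timeout=2.0):
    """Send one heartbeat request and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_heartbeat_listener()
    print(probe(server.server_address[1]))
    server.shutdown()
    server.server_close()
```

The trade-off is that such a listener can keep answering 200 while the client-facing port is saturated, so it reports that the process is alive rather than that it is serving traffic.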