[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316488#comment-14316488
 ] 

James Peach commented on TS-3386:
-

This sounds like something bad is happening to {{traffic_manager}}. Is there 
anything in {{manager.log}} or {{diags.log}} about that? Is it deadlocked or 
crashing?

 Heartbeat failed with high load, trafficserver restarted
 

 Key: TS-3386
 URL: https://issues.apache.org/jira/browse/TS-3386
 Project: Traffic Server
  Issue Type: Bug
  Components: Performance
Reporter: Luca Bruno

 I've been evaluating ATS for some days. I'm using it with mostly default 
 settings, except I've lowered the number of connections to the backend, I 
 have a raw storage of 500gb, and disabled ram cache.
 Working fine, then I wanted to stress it more. I've increased the test to 
 1000 concurrent requests, then the ATS worker has been restarted and thus 
 lost the whole cache.
 /var/log/syslog:
 {noformat}
 Feb 11 10:05:52 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:05:52 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:02 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:02 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:02 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:04 test-cache traffic_server[59047]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} FATAL: 
 [LocalManager::pollMgmtProcessServer] Error in read (errno: 104)
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR: 
 [LocalManager::sendMgmtMsgToProcesses] Error writing message
 Feb 11 10:06:22 test-cache traffic_manager[32985]: {0x7f975c537720} ERROR:  
 (last system error 32: Broken pipe)
 Feb 11 10:06:22 test-cache traffic_cop[32984]: cop received child status 
 signal [32985 256]
 Feb 11 10:06:22 test-cache traffic_cop[32984]: traffic_manager not running, 
 making sure traffic_server is dead
 Feb 11 10:06:22 test-cache traffic_cop[32984]: spawning traffic_manager
 Feb 11 10:06:22 test-cache traffic_cop[32984]: binpath is bin
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: --- Manager Starting 
 ---
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: Manager Version: 
 Apache Traffic Server - traffic_manager - 5.2.0 - (build # 11013 on Feb 10 
 2015 at 13:05:19)
 Feb 11 10:06:22 test-cache traffic_manager[59057]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - (build # 11013 on 
 Feb 10 2015 at 13:04:42)
 Feb 11 10:06:24 test-cache traffic_server[59065]: NOTE: 
 RLIMIT_NOFILE(7):cur(736236),max(736236)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:32 test-cache traffic_cop[32984]: server heartbeat failed [1]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: (http test) received non-200 
 status(502)
 Feb 11 10:06:42 test-cache traffic_cop[32984]: server heartbeat failed [2]
 Feb 11 10:06:42 test-cache traffic_cop[32984]: killing server
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [LocalManager::pollMgmtProcessServer] Server Process terminated due to Sig 9: 
 Killed
 Feb 11 10:06:42 test-cache traffic_manager[59057]: {0x7f2c94ded720} ERROR: 
 [Alarms::signalAlarm] Server Process was reset
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: --- traffic_server 
 Starting ---
 Feb 11 10:06:44 test-cache traffic_server[59077]: NOTE: traffic_server 
 Version: Apache Traffic Server - traffic_server - 5.2.0 - 

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Zhao Yongming (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316482#comment-14316482
 ] 

Zhao Yongming commented on TS-3386:
---

Well, things get more interesting.
q1: Why do you lose the cached content when the traffic server restarts?
 q1.1: Is that a cache issue?

q2: You want to protect the origin server, so why do you think a limit on the 
UA-side connections is a better solution than a limit on the origin side?
 q2.1: Have you seen any connection (httpSM) hang-ups?
 q2.2: What is a better way to handle the connection issue, for example a 
timeout?

When you handle tons of cache and tons of traffic, keeping it simple and 
robust is always better than anything intelligent.

Yes, we have fixed many of the cache issues we met, as well as HTTP SM issues, 
connection timeout issues, connection leaks ... I think most of the important 
changes are already in the official tree. This is how we figure out the root 
issues in ATS, which may lead to just some very tiny fixes that only affect 
very high-traffic sites with very strict SLA requirements.


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316517#comment-14316517
 ] 

Luca Bruno commented on TS-3386:


James:

I've pasted the manager.log above; {{diags.log}} had nothing related. And 
syslog clearly states that the server returned 502 to cop, and that cop 
decided to kill the server after the heartbeat failed twice:
{noformat}
Feb 11 10:06:12 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:12 test-cache traffic_cop[32984]: server heartbeat failed [1]
Feb 11 10:06:22 test-cache traffic_cop[32984]: (http test) received non-200 
status(502)
Feb 11 10:06:22 test-cache traffic_cop[32984]: server heartbeat failed [2]
Feb 11 10:06:22 test-cache traffic_cop[32984]: killing server
{noformat}
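
The pattern in that excerpt can be modelled as a small counter. This is only a sketch of the behaviour visible in the log, not traffic_cop's actual code: the class name {{HeartbeatMonitor}} is made up for illustration, and the threshold of 2 is simply read off the "failed [2]" / "killing server" lines.

```python
from dataclasses import dataclass

# Hypothetical model of what the syslog excerpt shows. The threshold is an
# assumption taken from the log lines, not from traffic_cop's source.
KILL_AFTER_FAILURES = 2


@dataclass
class HeartbeatMonitor:
    failures: int = 0

    def observe(self, status: int) -> str:
        """Record one heartbeat result and return the action taken."""
        if status == 200:
            self.failures = 0
            return "ok"
        self.failures += 1
        if self.failures >= KILL_AFTER_FAILURES:
            # The counter starts over once the server has been killed.
            self.failures = 0
            return "kill"
        return f"failed [{self.failures}]"


monitor = HeartbeatMonitor()
actions = [monitor.observe(status) for status in (502, 502, 200, 502)]
print(actions)
```

Nothing in this model distinguishes a server that is busy from one that is dead: two throttled 502 responses in a row are enough to trigger the kill.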

To reproduce, set connection throttle to 100 and perform 1000 concurrent 
requests to ATS with minimal hit ratio (so that it also creates connections to 
the origin server).
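
A rough sketch of such a load run, in case it helps anyone reproduce this. It is not the tool I used: {{fetch_status}}, {{run_load}} and {{unique_urls}} are hypothetical helper names, and the host and port in the usage comment are assumptions. Distinct URLs are used so that the hit ratio stays minimal.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable
import http.client
import urllib.error
import urllib.request


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of one GET, or 0 if no response arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # Non-2xx answers (for example the 502 seen above) land here.
        return err.code
    except (OSError, http.client.HTTPException):
        # Connection refused, reset, timed out, or a malformed reply.
        return 0


def unique_urls(base: str, count: int) -> list[str]:
    """Build distinct URLs so that almost every request is a cache miss."""
    return [f"{base}/object-{i}" for i in range(count)]


def run_load(fetch: Callable[[str], int], urls: list[str], concurrency: int) -> Counter:
    """Issue all requests with `concurrency` worker threads and tally statuses."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(fetch, urls))


# Usage (assumed host and port, adjust to the proxy under test):
#   print(run_load(fetch_status, unique_urls("http://test-cache:8080", 1000), 1000))
```

Passing the fetch function in as a parameter keeps the tally logic testable without a running proxy.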

Zhao:

q1: Whenever I restart ATS, the cache starts from zero. It's a 500gb raw disk 
cache. Is it not supposed to work like this? Perhaps I'm missing some option. 
Either way, cop would keep restarting the server, so fixing the cache issue 
wouldn't solve this problem.

q2: I'm just testing stuff. Of course I won't be limiting the number of 
connections like that, but I still find this a bug.
q2.1: Not sure, sorry. What is httpSM?
q2.2: I think there are 2 options: either don't accept(), or accept and 
immediately close the connection with an error (which is apparently what the 
server does, except that it also does so for the heartbeat). For example, 
nginx does not kill its worker children just because it has reached the 
maximum number of connections.

Sure, I keep it simple, but robustness also means that the server must not be 
restarted just because it's busy. Being busy means it's working, and it must 
not be killed by the heartbeat. In my opinion this can be solved by using a 
dedicated connection for the heartbeat.




[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316560#comment-14316560
 ] 

Luca Bruno commented on TS-3386:


Sorry, when I restart trafficserver it does indeed re-read the cache (after 
adding the udev rule for the raw disk). But of course when it gets restarted 
by cop, it restarts often and thus I see the cache empty.


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Leif Hedstrom (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316683#comment-14316683
 ] 

Leif Hedstrom commented on TS-3386:
---

I think TS-2490 is close / similar in effect, where we can kill traffic_server 
because manager can't get a connection to it. Not quite the same, but a similar 
issue. I cannot find the Jira that talked about running out of connections; 
maybe it was fixed? I thought there was a commit some time ago about excluding 
the back door requests from connection throttling?

To the OP, which version of ATS are you running?


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Leif Hedstrom (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316671#comment-14316671
 ] 

Leif Hedstrom commented on TS-3386:
---

I'm pretty sure there was a Jira on exactly that suggestion [~jpe...@apache.org]


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316655#comment-14316655
 ] 

James Peach commented on TS-3386:
-

Ok, that's nasty ... we should exempt heartbeats from connection throttling 
IMHO.
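
As a toy illustration of that idea (not a patch, and not how traffic_server's throttling is implemented): the {{ConnectionThrottle}} class and the {{is_heartbeat}} flag below are hypothetical, and how the server would actually recognise a heartbeat connection is left open.

```python
class ConnectionThrottle:
    """Toy admission control that lets heartbeat probes bypass the limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.active = 0

    def admit(self, is_heartbeat: bool = False) -> bool:
        """Return True if the connection may proceed."""
        if is_heartbeat:
            # Assumption: heartbeats are neither throttled nor counted.
            return True
        if self.active >= self.limit:
            return False
        self.active += 1
        return True

    def release(self) -> None:
        """Mark one regular connection as finished."""
        self.active = max(0, self.active - 1)


throttle = ConnectionThrottle(limit=100)
admitted = sum(throttle.admit() for _ in range(1000))
heartbeat_ok = throttle.admit(is_heartbeat=True)
print(admitted, heartbeat_ok)
```

With the limit of 100 and 1000 attempts from the report, regular requests beyond the limit are refused while the heartbeat still gets through.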

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316783#comment-14316783
 ] 

Luca Bruno commented on TS-3386:


I'm using the latest release, 5.2.0. For what it's worth, I've read some of the
traffic_cop code and the logic is quite clear: if the heartbeat fails 2 times,
the server is killed. I didn't check whether it uses a separate connection,
though.
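
The kill rule described above can be sketched as a small counter. This is an
illustration only, not the traffic_cop source: the threshold of 2 comes from
this comment and the "[2]" in the syslog excerpt, while the class name and the
assumption that a 200 response resets the counter are mine.

```python
from dataclasses import dataclass

KILL_THRESHOLD = 2  # per the comment and the "heartbeat failed [2]" log line


@dataclass
class HeartbeatWatchdog:
    failures: int = 0

    def observe(self, status: int) -> str:
        """Return 'ok', 'failed' or 'kill' for one heartbeat probe."""
        if status == 200:
            self.failures = 0  # assumption: a good probe resets the count
            return "ok"
        self.failures += 1
        if self.failures >= KILL_THRESHOLD:
            self.failures = 0  # counting starts over for the new process
            return "kill"
        return "failed"


if __name__ == "__main__":
    dog = HeartbeatWatchdog()
    # Replay a sequence like the one in the syslog: repeated 502 responses.
    for status in (200, 502, 502, 502, 502):
        print(status, dog.observe(status))
```

Replaying repeated 502 responses yields a kill on every second failure, which
matches the 20-second restart cycle visible in the log.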

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Leif Hedstrom (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316648#comment-14316648
 ] 

Leif Hedstrom commented on TS-3386:
---

This is a known problem: if you starve the system of connections so that it
starts throttling, the heartbeat fails and traffic_server gets killed.

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Zhao Yongming (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316279#comment-14316279
 ] 

Zhao Yongming commented on TS-3386:
---

Well, the remap matters. Please don't mix up 127.0.0.1:8080 with the rest of
your services; that is not how ATS is meant to work as a proxy.

Use something like map http://mydomain.com:8080/ . and do your testing
using a modified /etc/hosts or -x 127.0.0.1:8080 in curl.
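
To illustrate why the -x approach works: the client connects to
127.0.0.1:8080, but the request itself names mydomain.com, so the map rule can
match on the hostname. A rough sketch of the request a client sends to a
forward proxy (the function name is hypothetical, and real clients such as
curl add further headers):

```python
def proxy_request(host: str, port: int, path: str = "/") -> bytes:
    """Build a minimal HTTP/1.1 request as sent to a forward proxy."""
    authority = f"{host}:{port}"
    lines = [
        # Absolute-form request target: the full URL, not just the path.
        f"GET http://{authority}{path} HTTP/1.1",
        f"Host: {authority}",
        "Connection: close",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


if __name__ == "__main__":
    print(proxy_request("mydomain.com", 8080).decode("ascii"))
```

The bytes would be written to a socket connected to 127.0.0.1:8080; the
hostname in the request line and Host header is what the proxy sees.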

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Zhao Yongming (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316376#comment-14316376
 ] 

Zhao Yongming commented on TS-3386:
---

If you want to talk about the kill: I agree there should be more checks
before taking down the server, but how would you know that the connection
limit is full and everything else works well?

We have tried putting the heartbeat on a connection that is not affected by
the connection limit, but that did not work out so well either.

The heartbeat is a fake L7 service health checker, which is designed to find
out when something is abnormal :D

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Zhao Yongming (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316343#comment-14316343
 ] 

Zhao Yongming commented on TS-3386:
---

Well, proxy.config.net.connections_throttle = 1000, are you kidding? ATS is
neither squid nor httpd-1.x.

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316350#comment-14316350
 ] 

Luca Bruno commented on TS-3386:


Even with 100 connections, it should limit the connections, not restart and
lose the cache: that's what I call kidding. So if I had 3 like the default
config suggests, and it reached the maximum, it would still restart.

I think I will disable the kill signal to the child after a failed heartbeat.
Would you accept such a patch, including a config option?

[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316388#comment-14316388
 ] 

Luca Bruno commented on TS-3386:


We can detect anomalies by other means in our infrastructure. For example, it 
would suffice to raise an alert ("heartbeat failed because of 502") instead of 
killing the server. For us, it's better to wait than to lose terabytes of 
cache, which would in turn kill the origin servers.

Also, why is putting the heartbeat on a dedicated connection not a good idea? 
What went wrong when you tried it? Do you have any related work I can read?
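
The difference between the two policies (kill on repeated failure versus alert only) can be sketched as a small replay function. This is only an illustration, not traffic_cop code: the function name, the action labels, and the threshold of two consecutive failures are assumptions, the last one inferred from the "server heartbeat failed [2]" / "killing server" sequence in the syslog excerpt.

```python
def run_heartbeat_policy(status_codes, kill_threshold=2, kill_on_failure=True):
    """Replay heartbeat results and return the action taken after each probe.

    kill_threshold=2 is an assumption read off the syslog excerpt, where the
    second consecutive failure is followed by "killing server".
    """
    actions = []
    failures = 0
    for status in status_codes:
        if status == 200:
            failures = 0  # a healthy probe resets the failure streak
            actions.append("ok")
            continue
        failures += 1
        if failures < kill_threshold:
            actions.append(f"failed [{failures}]")
        elif kill_on_failure:
            actions.append("kill")
            failures = 0  # the restarted server starts with a clean count
        else:
            actions.append("alert")  # keep the process (and its cache) alive
    return actions


if __name__ == "__main__":
    # Behaviour seen in the logs: every second 502 leads to a kill.
    print(run_heartbeat_policy([502, 502, 502, 502]))
    # Proposed behaviour: only alert, and recover once probes succeed again.
    print(run_heartbeat_policy([502, 502, 502, 200], kill_on_failure=False))
```

With `kill_on_failure=False` the server is never restarted, so the cache survives a period of overload and the alert simply clears once a probe returns 200 again.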


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Zhao Yongming (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316238#comment-14316238
 ] 

Zhao Yongming commented on TS-3386:
---

Please don't send any traffic, and enable debug on http.*|dns.*. I suspect 
this is a HostDB reverse lookup on 127.0.0.1, or a lookup on localhost issue. 
Let us dig it out.


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316316#comment-14316316
 ] 

Luca Bruno commented on TS-3386:


Ok my remap is now:
{noformat}
map http://test-cache.local:8080/ http://test-cache.local:8081/
{noformat}

Where test-cache.local is a real entry in a DNS server on the local network. 
Things didn't change much.

Btw, I've found something valuable in error.log (it also happened before). 
This line is repeated indefinitely:

{noformat}
20150211.15h53m56s RESPONSE: sent 192.168.199.31 status 502 (Connect Error 
internal error - server connection terminated/-1) for 
'http://test-cache.service.farm:8081/storage/test29429?size=11023age=46230sleep=0'
{noformat}

Please note that sending those concurrent requests to the backend server works 
just fine without any error. Anyway, I believe an error in the backend should 
not result in ATS being restarted.
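
To see how often that 502 line really repeats, the error.log entries can be counted per status and client. The following is a rough sketch only: the regular expression is derived from the single line quoted above (assumed to be one physical line in the real file), and the names `RESPONSE_RE` and `count_statuses` are made up for this example.

```python
import re
from collections import Counter

# Pattern guessed from the one error.log line quoted in this comment; other
# ATS versions or log formats may differ.
RESPONSE_RE = re.compile(
    r"^(?P<ts>\S+) RESPONSE: sent (?P<client>\S+) status (?P<status>\d{3}) "
    r"\((?P<reason>.*)\) for '(?P<url>[^']*)'$"
)


def count_statuses(lines):
    """Count RESPONSE entries keyed by (status, client IP); skip other lines."""
    counts = Counter()
    for line in lines:
        match = RESPONSE_RE.match(line.strip())
        if match:
            counts[(int(match["status"]), match["client"])] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python count_502.py < error.log
    for (status, client), n in count_statuses(sys.stdin).most_common():
        print(status, client, n)
```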

Debug:

{noformat}
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423666415, highest_delta: 0, cluster: 1423666415
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:53:35.041] Server {0x2b7ddc565e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 192.168.x.y
{noformat}

Then I noticed this:

{noformat}
[Feb 11 15:53:35.043] Server {0x2b7ddc565e00} DEBUG: (http) [293] hostdb update 
marking IP: 192.168.x.y:8081 as down
{noformat}


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316266#comment-14316266
 ] 

Luca Bruno commented on TS-3386:


Thanks a lot for your answer. I'm seeing this:

{noformat}
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) Next action 
SM_ACTION_DNS_LOOKUP; OSDNSLookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http) [9] State 
Transition: SM_ACTION_API_CACHE_LOOKUP_COMPLETE - SM_ACTION_DNS_LOOKUP
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) 
[HttpSM::do_hostdb_lookup] Doing DNS Lookup
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) 
[ink_cluster_time] local: 1423665061, highest_delta: 0, cluster: 1423665061
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) 
[HttpTransact::OSDNSLookup] This was attempt 1
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_seq) 
[HttpTransact::OSDNSLookup] DNS Lookup successful
[Feb 11 15:31:01.819] Server {0x2b9677700e00} DEBUG: (http_trans) [OSDNSLookup] 
DNS lookup for O.S. successful IP: 127.0.0.1
{noformat}

Should I be looking for something else? This is my remap:
{noformat}
map http://127.0.0.1:8080/ http://127.0.0.1:8081/
{noformat}
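
A quick way to check whether a remap rule points at a loopback origin is to parse the rule and test the target address. This is a minimal sketch that handles only simple `map <from> <to>` lines like the one above; `parse_map_rule` and `origin_is_loopback` are hypothetical helper names, not part of ATS.

```python
import ipaddress
from urllib.parse import urlsplit


def parse_map_rule(line):
    """Split a simple 'map <from-url> <to-url>' rule into two parsed URLs."""
    parts = line.split()
    if len(parts) < 3 or parts[0] != "map":
        raise ValueError(f"not a simple map rule: {line!r}")
    return urlsplit(parts[1]), urlsplit(parts[2])


def origin_is_loopback(line):
    """True if the rule's origin (target) host is a loopback IP literal."""
    _, origin = parse_map_rule(line)
    try:
        return ipaddress.ip_address(origin.hostname).is_loopback
    except ValueError:
        return False  # a hostname rather than an IP literal


if __name__ == "__main__":
    print(origin_is_loopback("map http://127.0.0.1:8080/ http://127.0.0.1:8081/"))
```

Note that a hostname which merely resolves to 127.0.0.1 is reported as non-loopback here, because the sketch deliberately performs no DNS lookup.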


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316134#comment-14316134
 ] 

Luca Bruno commented on TS-3386:


Additionally, proxy.config.cop.core_signal is 0, which should mean that no 
signal is sent to stop the server. Yet it seems to send signal 9, unless I'm 
misunderstanding the option description.


[jira] [Commented] (TS-3386) Heartbeat failed with high load, trafficserver restarted

2015-02-11 Thread Luca Bruno (JIRA)

[ 
https://issues.apache.org/jira/browse/TS-3386?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14316143#comment-14316143
 ] 

Luca Bruno commented on TS-3386:


So it's as if the server saturated its connections and replied 502 even to 
the heartbeat check, without analyzing the response. Am I right? In this case, 
I'd like to either disable the heartbeat completely or start a second TCP 
server just for the heartbeat, instead of mixing heartbeat and client 
connections within the same server.
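
The "second TCP server just for heartbeat" idea can be illustrated with a tiny standalone listener. This is a generic sketch, not how ATS or traffic_cop work: the handler, the helper names, and the choice of a plain HTTP 200 reply are all assumptions made for the example.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HeartbeatHandler(BaseHTTPRequestHandler):
    """Answers every GET with 200 and never touches the proxy/origin path."""

    def do_GET(self):
        body = b"alive\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the sketch quiet


def start_heartbeat_server(host="127.0.0.1", port=0):
    """Start the listener in a daemon thread; port=0 lets the OS pick a port."""
    server = ThreadingHTTPServer((host, port), HeartbeatHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def probe(port, host="127.0.0.1", timeout=2.0):
    """Send one heartbeat probe and return the HTTP status code."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", "/")
        return conn.getresponse().status
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_heartbeat_server()
    print(probe(server.server_address[1]))
    server.shutdown()
    server.server_close()
```

Because the probe never goes through the proxied request path, a saturated backend cannot turn it into a 502. The trade-off is that such a probe only shows the process is alive, not that it can still serve client traffic.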
