DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG-
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://issues.apache.org/bugzilla/show_bug.cgi?id=33374>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND
INSERTED IN THE BUG DATABASE.

http://issues.apache.org/bugzilla/show_bug.cgi?id=33374

           Summary: Connections are not reset under heavy load with lots of
                    closed remote connections
           Product: Tomcat 5
           Version: 5.0.28
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P2
         Component: Native:JK
        AssignedTo: tomcat-dev@jakarta.apache.org
        ReportedBy: [EMAIL PROTECTED]


Hi,

There is a problem with mod_jk and Tomcat under high load that causes slots to
be kept busy by Apache even though the backend Tomcat link has gone down. This
is caused by overly eager user aborts, i.e. web clients that abort the request
by simply closing the connection. You get a broken pipe on the Tomcat side, and
eventually that is cleared out, so under lighter traffic this is not an issue.
Now consider that you get hammered with requests and many of them are aborted,
e.g. out of 30 requests per second, 10 are aborted.
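Just to illustrate the mechanism with a stand-alone snippet (my own demo, not
Tomcat code): once the client side has closed its socket, continuing to write
the response on the server side fails with an IOException, typically "Broken
pipe" on Linux.

import java.io.IOException;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class BrokenPipeDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);
        Socket browser = new Socket("localhost", server.getLocalPort());
        Socket backend = server.accept();

        browser.close();                        // the "user" aborts by closing the connection

        OutputStream os = backend.getOutputStream();
        try {
            for (int i = 0; i < 100; i++) {     // keep writing the "response"
                os.write(new byte[1024]);
                os.flush();
                Thread.sleep(10);               // give the peer's RST time to come back
            }
            System.out.println("no error (platform dependent)");
        } catch (IOException ex) {
            System.out.println("write failed: " + ex.getMessage());
        } finally {
            backend.close();
            server.close();
        }
    }
}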

There are two issues with this, which I will file as separate reports:

1. Reporting errors to user code
2. Closing connections

We use Apache/2.0.52 (Unix) mod_ssl/2.0.52 OpenSSL/0.9.7d mod_jk/1.2.6 and
Tomcat 5.0.28

This report is for problem #2 above. 

For some reason, when we get those broken pipe exceptions as reported in bug
#33371, we run into big problems with Apache. What happens is that the Apache
scoreboard fills up with connections (see attached file); all slots show status
"W", which means the request is being worked on by the handler, in this case
mod_jk. This only happens once we go over a certain number of requests per
second; below that it does not cause any problems. But when we go over that
limit and the slots fill up, we see heaps of "Broken pipe" exceptions on the
Tomcat side, where we have 9-10 app servers all running Tomcat, load balanced
by mod_jk. All of these servers start to show these Broken pipe exceptions. It
is not that we never see those exceptions - of course, when a user aborts, we
get that exception in the logs - but at peak time, when the slots fill up in
Apache, we get heaps of them, i.e. a couple per second.

Those errors may be caused by the calling system closing the connection too
eagerly because we respond a bit too slowly, but shouldn't those connections
then be closed all the way through, i.e. also on the Apache side, in other
words the connections between Apache and the user?

That is the part that seems to fail. Furthermore, we get heaps of stray
connections on the Tomcat machines. If you do a "netstat -p -n" you see heaps
of connections to local port 8009 (AJP13) with no process attached to them. I
tried to fix this by applying the following patch to Tomcat 5.0.28, file
/jakarta-tomcat-5.0.28-src/jakarta-tomcat-connectors/jk/java/org/apache/jk/common/ChannelSocket.java:

267d266
<     final int error=8;
507,513c506
<         try {
<             os.write( buf, 0, len );
<         } catch (IOException ex) {
<             ep.setNote( error, Boolean.TRUE );
<             log.warn("Flagging endpoint as in error. " + ep.getNote( socketNote ) + "\nError was: " + ex.getMessage() );
<             throw ex;
<         }
---
>         os.write( buf, 0, len );
522,528c515
<             try {
<                 os.flush();
<             } catch (IOException ex) {
<                 ep.setNote( error, Boolean.TRUE );
<                 log.warn("Flagging endpoint as in error. " + ep.getNote( socketNote ) + "\nError was: " + ex.getMessage() );
<                 throw ex;
<             }
---
>             os.flush();
681c668
<                         log.warn("Closing ajp connection " + status + " - " + ep.getNote( socketNote ));
---
>                         log.warn("Closing ajp connection " + status );
691,695d677
<                     break;
<                 }
<                 if ( ep.getNote ( error ) == Boolean.TRUE ) {
<                     log.warn("processConnection: Error reported, closing connection. " + ep.getNote( socketNote ));
<                     ep.setNote( error, null );


What the patch does is remember when there was an error during IO, so that when
control comes back up through the call stack, the connection is eventually
closed.
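
In plain Java, the patched pieces look roughly like this (names such as error,
socketNote, ep, os and log are taken from the diff above; the surrounding
ChannelSocket code is omitted, so this is an excerpt for readability, not a
compilable unit, and the os.flush() call gets the same try/catch):

    // extra note slot used to remember that this endpoint hit an IO error
    final int error = 8;

    // when writing a response packet back to mod_jk over AJP13
    try {
        os.write(buf, 0, len);
    } catch (IOException ex) {
        ep.setNote(error, Boolean.TRUE);   // flag the endpoint as broken
        log.warn("Flagging endpoint as in error. " + ep.getNote(socketNote)
                + "\nError was: " + ex.getMessage());
        throw ex;                          // still propagate the exception
    }

    // later, in processConnection(), once control has come back up the stack
    if (ep.getNote(error) == Boolean.TRUE) {
        log.warn("processConnection: Error reported, closing connection. "
                + ep.getNote(socketNote));
        ep.setNote(error, null);
        break;                             // leave the keep-alive loop so the socket gets closed
    }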

This is a problem in itself: as reported in bug #33371, the user code does not
get the error, but even if it did, it could swallow the exception just like the
current code does, and then the low-level code would not handle the problem at
all. My patch makes sure that connections on which IO errors were thrown get
closed.

That took care of our stray connections, and things worked better for a while,
but now we are getting the problem again. The bottom line seems to be that my
patch is useful and should, in one way or another, become part of the Coyote
code, but under heavy load the problem is still not solved.

If you investigate the attached scoreboard dump, you will see how long some of
those connections have been sitting there, i.e. some of the values in the
"Seconds Since" (SS) column are greater than 800! And that is with all the
timeouts we could find set to lower values:

httpd.conf:           Timeout 60
worker.properties:    worker.tc1.reply_timeout=120000
worker.properties:    worker.tc1.socket_timeout=60
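
Depending on the mod_jk version there are further worker properties, such as
socket_keepalive and recycle_timeout, that might also matter here; whether they
actually help with this particular problem is only a guess on my part:

worker.properties:    worker.tc1.socket_keepalive=1
worker.properties:    worker.tc1.recycle_timeout=300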

Why do they get stuck? I am pretty sure they are all connections from those
many Broken pipe exceptions we get, but since applying my patch I do close them
all, so why does that not propagate back to mod_jk?

The biggest problem of all is that mod_jk creates a new connection to the app
servers but does not clear out the previous connection to the user. When those
1000 slots on 3 web servers are full and I take a thread dump on the 9-10 app
servers, all 150 configured TP-Processor threads are idle! They all show this
state:

"TP-Processor183" daemon prio=1 tid=0x5af00018 nid=0x626a runnable
[8bfff000..8bfffb58]
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:129)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:183)
    at java.io.BufferedInputStream.read1(BufferedInputStream.java:222)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:277)
    - locked <0x4e030028> (a java.io.BufferedInputStream)
    at org.apache.jk.common.ChannelSocket.read(ChannelSocket.java:611)
    at org.apache.jk.common.ChannelSocket.receive(ChannelSocket.java:548)
    at org.apache.jk.common.ChannelSocket.processConnection(ChannelSocket.java:676)
    at org.apache.jk.common.SocketConnection.runIt(ChannelSocket.java:890)
    at org.apache.tomcat.util.threads.ThreadPool$ControlRunnable.run(ThreadPool.java:683)
    at java.lang.Thread.run(Thread.java:534)

which means they are waiting for a new request to come in, but that never
happens since all Apache slots are full.
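
To make that concrete with a stand-alone snippet (my own demo, not Tomcat
code): on the JDKs of this era, a thread blocked reading from a socket shows
exactly the SocketInputStream.socketRead0 frame from the trace above, and it
stays there until the peer writes or closes the socket.

import java.io.InputStream;
import java.net.ServerSocket;
import java.net.Socket;

public class IdleReaderDemo {
    public static void main(String[] args) throws Exception {
        ServerSocket server = new ServerSocket(0);
        final Socket peer = new Socket("localhost", server.getLocalPort());
        final Socket accepted = server.accept();

        Thread reader = new Thread(new Runnable() {
            public void run() {
                try {
                    InputStream in = accepted.getInputStream();
                    in.read();   // blocks here; shows up as socketRead0 in a dump on these JDKs
                } catch (Exception ignored) {
                }
            }
        }, "TP-Processor-demo");
        reader.start();

        Thread.sleep(500);
        // a thread dump taken now shows the reader parked in the socket read,
        // just like the TP-Processor threads in the trace above
        for (StackTraceElement frame : reader.getStackTrace()) {
            System.out.println("    at " + frame);
        }
        peer.close();            // only closing (or writing to) the peer unblocks it
    }
}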

To recap:

- Connections (user <-> Apache) do not get reset or released (quickly enough)
under heavy load when they should (e.g. when the user closes the connection).
- Closing the connection in the Coyote code only partially helps.
- mod_jk creates a new connection to Tomcat, but the old connection to the user
is not dropped.

I am sorry that I cannot assist any further yet, but I will keep reading the
mod_jk and Tomcat code to see if I can find anything. Until then I wanted to
get the issue out there in the hope that someone else may have an idea.

Regards,

Lars George, CTO
WorldLingo

-- 
Configure bugmail: http://issues.apache.org/bugzilla/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
