Hi Mark,

I have updated the Tomcat version to 9.0.21 (the Docker image tag is tomcat:9.0.21-jdk8; sorry for my use of the word "official", the image is actually built by Docker). The Tomcat Native version is 1.2.21, built from the tomcat-native.tar.gz provided in the Tomcat 9.0.21 distribution.
I have tested again with the same configuration and the same steps, and the results are the same as I mentioned above: after a single "keep-alive" request was sent, the next "non-keep-alive" request did not receive a response for a long time (even though its TCP connection was established). I have read the 9.0.21 source code roughly and it seems the mechanism has not changed: the Poller still has a chance to poll events with a very large "nextPollerTime".
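To make the behaviour I mean easier to see, here is a small self-contained sketch. It is my own simplification, not the Tomcat source; the constants (a 2000 microsecond poll slice, a 1024-slot poll set, 150,000 idle iterations) are only illustrative:

// Simplified simulation of the "nextPollerTime" accumulation described above.
// This is NOT the AprEndpoint code, just an illustration of the effect.
public class NextPollerTimeSketch {
    public static void main(String[] args) {
        final long pollerTime = 2_000;       // poll slice in microseconds (illustrative)
        long nextPollerTime = pollerTime;
        final int actualPollerSize = 1024;   // slots per poll set (illustrative)
        int pollerSpace = actualPollerSize;  // free slots; == actualPollerSize means the set is empty

        // Phase 1: only "Connection: close" traffic. The Acceptor handles those
        // sockets directly, nothing is registered with the poller, so every pass
        // skips the wait and accumulates the timeout instead.
        for (int i = 0; i < 150_000; i++) {
            if (pollerSpace < actualPollerSize) {
                nextPollerTime = pollerTime;   // would be reset after a real poll
            } else {
                nextPollerTime += pollerTime;  // empty poll set: keep growing
            }
        }
        System.out.printf("Accumulated nextPollerTime: %,d us (~%d s)%n",
                nextPollerTime, nextPollerTime / 1_000_000);

        // Phase 2: a single keep-alive connection is registered with the poller.
        // The next real poll call would now use the huge timeout above, blocking
        // the Poller thread and the close handling it also performs.
        pollerSpace--;
        System.out.printf("The next poll could block for up to ~%d s%n",
                nextPollerTime / 1_000_000);
    }
}

Once the single keep-alive connection makes the poll set non-empty, the next native poll uses the accumulated timeout, which also delays the close handling that runs on the same Poller thread.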
Best Regards,
Chang Liu

Mark Thomas <ma...@apache.org> wrote on Wed, 3 Jul 2019 at 19:39:
> On 03/07/2019 10:59, Sanford Liu wrote:
> > Hi Team,
> > My team is facing a no-response issue in the below circumstances:
> >
> > 1. Env:
> > Apache Tomcat: 8.5.15, JDK: 1.8.0_121
>
> That Tomcat version is more than 2 years old.
>
> > 2. Tomcat configuration:
> > enable APR: protocol="org.apache.coyote.http11.Http11AprProtocol"
> > set maxThreads of the Executor: maxThreads="1200"
>
> Which version of Tomcat Native are you using?
>
> > 3. This web server was under a massive load.
> > All requests were HTTP 1.1 requests and were marked with a "Connection: close" HTTP header.
> > At this point, the web server showed some latency for the responses, but it was still running.
> >
> > 4. Some "keep-alive" requests were coming in.
> > Those requests were marked with a "Connection: keep-alive" HTTP header.
> >
> > 5. The following non-keep-alive requests were not responding for a long time.
> > We ran a thread dump with jstack at this point and saw that the Acceptor thread was WAITING:
>
> <snip/>
>
> > But the Poller thread was RUNNING:
> > "http-apr-8080-Poller@5809" daemon prio=5 tid=0x25 nid=NA runnable
> >   java.lang.Thread.State: RUNNABLE
> >   at org.apache.tomcat.jni.Poll.poll(Poll.java:-1)
>
> <snip/>
>
> > All the Executor threads were WAITING like this:
> > "http-apr-8080-exec-21@6078" daemon prio=5 tid=0x3f nid=NA waiting
> >   java.lang.Thread.State: WAITING
> >   at sun.misc.Unsafe.park(Unsafe.java:-1)
> >   at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
>
> <snip/>
>
> > We dived into the source code and found some clues:
> >
> > 1. The Acceptor thread is parked by the LimitLatch. This means the LimitLatch has not been released, which in turn means the previous connections do not perform the close operation in our circumstances. (AprEndpoint.java#L955)
>
> There have been some bugs fixed in this area since 8.5.15.
>
> > 2. The close logic takes place in the Poller thread. (AprEndpoint.java#L1624)
> >
> > 3. If the polling logic takes a lot of time, the Poller thread will be blocked (although its state is still RUNNABLE, it can be blocked inside the native method), and the destroySocket calls will be delayed. (AprEndpoint.java#L1680)
> >
> > 4. Because the Acceptor processes new connections directly (it does not register them with the poller, AprEndpoint.java#L2268), pollerSpace[i] always equals actualPollerSize in this case (AprEndpoint.java#L1679) and "nextPollerTime" grows very large. But when some "keep-alive" requests arrive, the Handler implementation processes those connections and registers each one back with the poller (AbstractProtocol.java#L933). pollerSpace then changes, and the Poller uses the large value of "nextPollerTime" to poll the socket, so the Poller thread can be blocked for a long time.
> >
> > To prove that, we set up a similar environment to reproduce this issue:
> >
> > 1. Use the official Tomcat docker image (tomcat:8.5.15) to run a simple HTTP server
>
> Note: There is no ASF provided official docker image. If 8.5.15 is the latest version provided by Docker I'd strongly recommend that you use a more up to date version of Tomcat.
>
> > 2. Change the config:
> > <Connector
> >     port="8080"
> >     protocol="org.apache.coyote.http11.Http11AprProtocol"
> >     connectionTimeout="20000"
> >     redirectPort="8443"
> >     maxConnections="20" />  // make it easy to reach maxConnections during testing
> >
> > 3. Keep sending lots of "non-keep-alive" requests for 5 min:
> > $ wrk2 -t8 -c32 -d10m -R128 -s ./closed.lua http://127.0.0.1:8080/hello?latency=10
> >
> > 4. Send a single "keep-alive" request and do not close this connection on the client side
> >
> > 5. After that, send another "non-keep-alive" request. We can see that no response is returned in a reasonable time (we waited 30 sec).
> >
> > A workaround:
> > By setting deferAccept="false" in the connector configuration, we can force the Acceptor to register the new connection with the poller (AprEndpoint.java#L2250), and "nextPollerTime" no longer grows out of control.
> >
> > So, is that a real issue for Tomcat?
>
> It does look like there is an edge case here that isn't handled correctly.
>
> The use of multiple pollers stems from a work-around for Windows platforms that are now obsolete. I thought we discussed removing that code. I need to find that discussion and remind myself of the conclusion.
>
> Mark
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
> For additional commands, e-mail: users-h...@tomcat.apache.org
>
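P.S. For completeness, the workaround described above is just the reproduction connector with deferAccept="false" added; roughly (the other attributes are unchanged):

<!-- Sketch of the workaround: deferAccept="false" makes the Acceptor register
     new connections with the Poller instead of processing them directly. -->
<Connector
    port="8080"
    protocol="org.apache.coyote.http11.Http11AprProtocol"
    connectionTimeout="20000"
    redirectPort="8443"
    maxConnections="20"
    deferAccept="false" />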