On 24.02.2015 at 00:29, Jesse Defer wrote:
-----Original Message-----
On 23.02.2015 at 22:53, Jesse Defer wrote:
-----Original Message-----
On 23.02.2015 at 19:03, Jesse Defer wrote:
...
How many Apache web servers sit in front of the Tomcats? Is each of the
Apaches configured with 512 processes?
All Tomcats are configured the same: maxThreads is 4096, APR, Tomcat 6.
16 Apaches, all configured identically with 512 processes, the same
worker.properties, etc.
OK. So fewer threads than 16*512. But the APR connector - like the NIO
one - can handle more connections than it has threads. Not more
in-flight requests, but for both connectors idle (keep-alive)
connections are cheap and do not need a pool thread.
Our busiest Tomcats peak at about 2000 connections to their AJP port from the
web farm. I'm not sure how closely connections relate to thread usage; I
haven't looked too closely at the JVM yet.
For BIO it's one thread per connection. For APR and NIO it's roughly one
thread per in-flight request. Connections waiting for the next request
to arrive are cheap and are handled by a few separate poller threads.
...
And only the lb is added to the worker.list? Or also the individual AJP
workers?
Just the LBs.
OK
If your Apache uses a threaded APR library, you could try using
JkWatchdogInterval. That moves the maintenance from the normal
request processing code to a separate thread per Apache process.
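A minimal httpd.conf sketch (the value is in seconds, 60 is just an example):

  # run the mod_jk global maintenance in a dedicated watchdog thread per
  # Apache child process instead of inside the request processing code
  JkWatchdogInterval 60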
I suspect that once your app (one or some of your apps) gets slow,
requests pile up in Apache and Apache can no longer connect to Tomcat,
because Tomcat doesn't have free threads. Some Apache children can still
connect, because the connect is handled by the OS underneath Tomcat, but
those connections do not get the awaited cping reply.
Shouldn't the ping timeout cause the processes to exit the wait state if
the cping reply hasn't been received? I have seen the SS column in
server-status grow into the thousands.
Yes, it would. But it could be a mixture: lots of Apache processes
waiting for hanging or slow Tomcat responses due to an application or GC
bottleneck, and some additional requests failing in cping.
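For reference, the cping/cpong probing is configured per worker in
workers.properties; a sketch with an assumed worker name "tcnode1" and
the usual 10 second timeout:

  # "A" probes on connect, before reusing a connection and at regular
  # intervals; ping_timeout is in milliseconds
  worker.tcnode1.ping_mode=A
  worker.tcnode1.ping_timeout=10000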
In addition to the thread dumps, you should also activate GC logs for your
Tomcat JVMs.
I have gc logging enabled already.
ACK
- using the worker or event MPM in combination with
JkWatchdogInterval would reduce the amount of shm locking due to
global maintenance calls, because it is done per process, not per
thread. You might have other reasons for using prefork though. Have
you?
Use of prefork is historical. We have been looking at going to event,
but haven't made the switch yet.
If you are using many Apaches, each with 512 children, you will most likely
need to use an NIO connector on the Tomcat processes and would also
benefit from the worker or event MPM for Apache. Each process with e.g.
50 or 100 threads and a relatively short connection_pool_timeout, e.g.
60 seconds.
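A rough sketch of that setup, assuming httpd 2.4 directive names; the 50
threads and the 60 second timeout are the numbers from above, everything
else (worker name, limits) is illustrative:

  # httpd.conf: event MPM with 50 threads per child
  <IfModule mpm_event_module>
      ServerLimit         16
      ThreadsPerChild     50
      MaxRequestWorkers  800
  </IfModule>

  # workers.properties: close idle backend connections after 60 seconds
  worker.tcnode1.connection_pool_timeout=60

  # server.xml: roughly matching timeout on the Tomcat side (milliseconds)
  <Connector port="8009" protocol="AJP/1.3" maxThreads="4096"
             connectionTimeout="60000" />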
I am going to bring up accelerating the switch to the event MPM with my
team. Will NIO give us more scalability than APR?
Not really, both are good at connection scalability. Maybe don't change
too much at the same time. I'd say NIO is a bit more mainstream than APR,
but your mileage may vary.
What was the reason that you started suspecting the fcntl() underneath
jk_shm_lock? Are we sure that the stack you provided is really the
typical one? And was the strace dumping all system calls or only the fcntl
calls?
I can't be sure it's the typical one; I only know that every time I sample the
stack of a stuck process, that's what I get, and the strace shows no other
system calls except time(). It seems like it's stuck in a fairly tight loop
somewhere. I haven't looked too deeply into the mod_jk code though and haven't
collected any system call statistics.
I don't necessarily suspect fcntl, but it's getting called a lot (the strace
is not filtered). System time jumps when the issue is happening, and fcntl
accounts for the vast majority of system calls. Since the issue got worse when
we added more processes, it seems plausible that it could be lock contention.
Hmm, but lock contention should result in the fcntl calls taking more
and more time. Could it be a problem with the system time? That somehow
the "don't run the maintenance for the next 60 seconds" rule doesn't take
effect because the clock jumps?
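For reference, the 60 seconds are the global maintenance interval, which can
also be set explicitly in workers.properties (60 is the default):

  # global property, value in seconds
  worker.maintain=60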
I suggest:
- check whether there's a performance problem on the webapp side when it
happens next time
- check thread dumps
- check GC log
- activate the Tomcat access log and add "%D" to the pattern, so that you
can analyze request durations on the Tomcat side post-mortem (valve
sketch after this list)
- using the request duration logs, you can also check whether a
potential performance problem happens on all nodes or only on some.
Depending on the result, checking the GC logs and creating and checking
thread dumps will be easier or harder (more work).
- run strace with time stamps plus system call duration next time the
problem occurs (example command after this list)
- also add %D to the LogFormat of your Apache access log, plus the
name of the load balancer member that handled the request. mod_jk
provides "notes" that can be logged in the access log (look for
"JK_LB_LAST_NAME" in
http://tomcat.apache.org/connectors-doc/reference/apache.html); see the
LogFormat sketch after this list
- prepare switching to a threaded MPM, be it worker or event. Use
JkWatchdogInterval there.
- put the JkShmFile on a local file system (combined sketch for this and
the previous item after this list)
- check whether the system clock on the Apache servers works correctly
and doesn't jump (a common virtualization problem)
- configure the JK status worker and check whether many of your workers
are in ERROR state. You can also retrieve info from the JK status worker
in a simple-to-parse text format and automate retrieval and extraction
of data, similar to what one can do with the Apache server-status (status
worker sketch after this list)
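The access log valve sketch mentioned above (server.xml; directory, prefix
and suffix are only examples, %D logs the request duration in milliseconds):

  <Valve className="org.apache.catalina.valves.AccessLogValve"
         directory="logs" prefix="ajp_access." suffix=".log"
         pattern="%h %l %u %t &quot;%r&quot; %s %b %D" />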
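An example strace invocation with timestamps and per-call duration (<pid>
stands for the id of one of the stuck Apache children):

  # -tt: wall clock timestamps with microseconds, -T: time spent in each call
  strace -tt -T -o /tmp/httpd-strace.out -p <pid>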
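A LogFormat sketch for the Apache side; %D here is the request duration in
microseconds, and %{JK_LB_LAST_NAME}n logs the mod_jk request note containing
the name of the balancer member that served the request (the format name
"jk_combined" is arbitrary):

  LogFormat "%h %l %u %t \"%r\" %>s %b %D %{JK_LB_LAST_NAME}n" jk_combined
  CustomLog logs/access_log jk_combined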
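A combined httpd.conf sketch for the threaded MPM and JkShmFile items (the
shm path is only an example; any local, non-NFS file system will do):

  # threaded MPM (worker or event) plus the watchdog thread
  JkWatchdogInterval 60
  # keep the mod_jk shared memory file on a local file system
  JkShmFile /var/run/mod_jk/jk-runtime-status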
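A status worker sketch; "lb" stands for your existing load balancer
workers, and the mount point is arbitrary (restrict access to it in
production):

  # workers.properties
  worker.list=lb,jkstatus
  worker.jkstatus.type=status

  # httpd.conf
  JkMount /jk-status jkstatus

  # machine readable output, e.g. for automated checks:
  #   curl "http://localhost/jk-status?mime=txt"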
Regards,
Rainer