> -----Original Message-----
> On 23.02.2015 at 19:03, Jesse Defer wrote:
> > I have a farm of Apache httpd servers proxying to Tomcat with mod_jk and
> > I am having issues with Apache processes getting stuck (as seen by the W
> > state in server-status). I am sending this to the list because the stack
> > traces show httpd gets stuck in mod_jk.
> >
> > httpd is configured for prefork with 512 servers at start and at
> > maximum. When the problem occurs we end up with nearly all 512 processes
> > in the W state until we restart httpd. The problem occurs more often when
> > load is high, but it is not restricted to high load. It started
> > occurring more often when we increased the servers from 384 to 512. The
> > hosts have enough memory and do not swap. The issue occurs intermittently
> > and is not tied to a particular host or Tomcat instance (there are ~150
> > Tomcat instances). The JkShmFile is on tmpfs.
>
> Why on tmpfs?
Not sure; it might be because of poor local IO performance (the hosts are VMs) or the historic use of network filesystems.

> > Environment: RHEL 5.11, Apache 2.4.10 (prefork), JK 1.2.40, APR 1.5.1,
> > APR-UTIL 1.5.4
> >
> > The stuck httpd processes all show the same stack and strace:
> >
> > pstack:
> > #0  0x00002b3439026bff in fcntl () from /lib64/libpthread.so.0
> > #1  0x00002b3440911656 in jk_shm_lock () from /usr/local/apache2/modules/mod_jk.so
> > #2  0x00002b3440917805 in ajp_maintain () from /usr/local/apache2/modules/mod_jk.so
> > #3  0x00002b3440906cac in maintain_workers () from /usr/local/apache2/modules/mod_jk.so
> > #4  0x00002b3440901140 in wc_maintain () from /usr/local/apache2/modules/mod_jk.so
> > #5  0x00002b34408f40b6 in jk_handler () from /usr/local/apache2/modules/mod_jk.so
> > #6  0x0000000000448eca in ap_run_handler ()
> > #7  0x000000000044cc92 in ap_invoke_handler ()
> > #8  0x000000000045e24f in ap_process_async_request ()
> > #9  0x000000000045e38f in ap_process_request ()
> > #10 0x000000000045ab65 in ap_process_http_connection ()
> > #11 0x00000000004530ba in ap_run_process_connection ()
> > #12 0x000000000046423a in child_main ()
> > #13 0x0000000000464544 in make_child ()
> > #14 0x00000000004649ae in prefork_run ()
> > #15 0x0000000000430634 in ap_run_mpm ()
> > #16 0x000000000042ad97 in main ()
>
> So mod_jk tries to get a lock on the shared memory before reading and
> updating some shared memory data. That, as one of many things mod_jk does,
> is normal. It would not be normal if most processes seemed to almost always
> sit in this stack.
> > strace:
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > time(NULL) = 1424711498
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
> > fcntl(19, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
> >
> > Any help tracking this issue down would be appreciated.
>
> This doesn't look like mod_jk is hanging in jk_shm_lock(); instead it
> looks like normal processing, locking and then unlocking.
> Locking and unlocking both succeed (return code 0).
>
> You didn't provide timestamps to get an idea whether this is normal
> behavior or not. What is your request throughput (requests per second
> forwarded by mod_jk while it is running well)?

Next time it happens I will capture timestamps, but it scrolls pretty quickly and doesn't seem to get hung up on any calls. We have very seasonal traffic, but normally around 30/sec. The high period peaks at around 200/sec.

> I suspect something else is wrong. Have you checked whether in fact
> mod_jk is waiting for the responses to requests it has sent to the
> backend and which do not return quickly, e.g. by looking at a Java
> thread dump (not: heap dump) of the backend during the times you have
> problems?

I haven't; next time it happens I'll look.

> What are the error or warn messages in your mod_jk log file?

Here is one instance of each type of warn/error in the logs of an affected host. All of these are fairly common, but we have Tomcats going up and down all the time and I think most are related to that. Log level is info.
[Mon Feb 23 03:05:37 2015] [25833:47348544151936] [error] ajp_service::jk_ajp_common.c (2693): (my27) connecting to tomcat failed.
[Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_connect_to_endpoint::jk_ajp_common.c (1050): (ead21) cping/cpong after connecting to the backend server failed (errno=110)
[Mon Feb 23 03:05:39 2015] [25944:47348544151936] [error] ajp_send_request::jk_ajp_common.c (1663): (ead21) connecting to backend failed. Tomcat is probably not started or is listening on the wrong port (errno=110)
[Mon Feb 23 12:59:18 2015] [25927:47348544151936] [error] ajp_get_reply::jk_ajp_common.c (2176): (dir16) Tomcat is down or refused connection. No response has been sent to the client (yet)
[Mon Feb 23 14:12:12 2015] [26170:47348544151936] [warn] ajp_handle_cping_cpong::jk_ajp_common.c (965): awaited reply cpong, received 3 instead. Closing connection

> Did you use "apachectl graceful" shortly before the problems start, or
> change configuration via the mod_jk status worker?

No.

> What is special here is the use of many processes plus many workers, so
> the lock is used quite a lot. Still, the "global maintain" functionality
> which uses jk_shm_lock() in the stack above should be called by each
> process only every worker.maintain seconds, by default every 60 seconds.
> And each process should quickly detect whether another process already
> did the global maintain. During the global maintain, for any ajp13
> worker there are really just a few simple local code statements. For any
> lb (load balancer) worker there is a little more to do, especially
> checking whether any members failed long ago and should now be
> recovered. But still, those operations are all local, not going out over
> the network.
>
> What is your worker structure? Are the 150 workers part of a few load
> balancers or are they all in the worker.list? Can you show significant
> parts of your workers.properties? Are you using the non-default
> "pessimistic" locking?
I misspoke; we have ~150 applications, so that would be the load balancer count. Total worker count across all the LBs would be ~700. We are using the default locking. Here is one LB:

# 109t
worker.109t1.port=62037
worker.109t1.host=java1
worker.109t1.type=ajp13
worker.109t1.lbfactor=1
worker.109t1.connection_pool_timeout=539
worker.109t1.socket_keepalive=true
worker.109t1.ping_mode=CP
worker.109t1.ping_timeout=5000
worker.109t1.redirect=109t2

worker.109t2.reference=worker.109t1
worker.109t2.host=java2
worker.109t2.redirect=109t3

worker.109t3.reference=worker.109t1
worker.109t3.host=java3
worker.109t3.redirect=109t4

worker.109t4.reference=worker.109t1
worker.109t4.host=java4
worker.109t4.redirect=109t1

worker.109tlb.type=lb
worker.109tlb.balance_workers=109t1,109t2,109t3,109t4
worker.109tlb.sticky_session=true

> If your Apache uses a threaded APR library, you could try using
> JkWatchdogInterval. That moves the maintenance from the normal request
> processing code to a separate thread per Apache process.
>
> Regards,
>
> Rainer
>
> - using the worker or prefork MPM in combination with JkWatchdogInterval
>   would reduce the amount of shm locking due to global maintenance
>   calls, because it is done per process, not per thread. You might have
>   other reasons for using prefork though. Have you?

Use of prefork is historical. We have been looking at moving to the event MPM, but haven't made the switch yet.

Respectfully,

Jesse

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
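[Editor's note] For reference, the JkWatchdogInterval suggestion above is a single directive in the Apache httpd configuration; it requires httpd/APR built with thread support. A sketch (the 60-second value is an assumption matching worker.maintain's default):

```apache
# httpd.conf (sketch): run mod_jk worker maintenance in a dedicated
# watchdog thread per httpd process, every 60 seconds, instead of
# piggybacking it on request processing
JkWatchdogInterval 60
```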