Hello,
using Apache 1.3.27 on Solaris 8 we see a problem with returning from a CGI
script. I'd like to ask if this is a known phenomenon, and if there is a
workaround or patch or a
page about it. I searched the Web, the Apache sources and the archives of this
mailing list but I could not find a hint about this particular problem. One
posting lead me to
the function in alloc.c mentioned below.
We are using Apache 1.3.27 with mod_cgi on Solaris 8. With both the version
provided by Oracle and the Sun Freeware version we have the following problem:
a CGI script builds
a page that contains references to additional files, like Javascript libraries
and CSS files. Loading of these files can be delayed by approximately 3 seconds
(seen with the
Firefox plugin Firebug on client side, and in both snoop and truss logs on
server side), which significantly decreases site performance. Not *every* call
is affected, but the
delays occur quite often.
The original script was written in Perl but perl ist not the problem - the
delay occurs with a shell script as well:
#!/usr/bin/tcsh
echo 'Content-Type: text/html; charset=ISO-8859-1'
echo ''
echo '<html>'
echo ' <head>'
echo '<title>Load Delay</title>'
echo '<script src="/xyz.js" type="text/javascript"></script>'
echo '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"
/>' echo '</head>' echo '<body>' echo 'Load delay.'
echo '</body>'
echo '</html>'
According to truss the additional files are not the problem. We can reproduce
the delay with a simple script that generates a page without supplementary file
requests:
#!/usr/local/bin/perl -w
print <<EOHTML;
Content-Type: text/html; charset=ISO-8859-1
<html>
<head>
<title>Load Delay</title>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
</head>
<body>
Load Delay
</body>
</html>
EOHTML
Using truss we get this typical trace for the delay case:
22173/1: 53.3753 fork1() = 29230
29230/1: 53.3753 fork1() (returning as child ...) = 22173
[... CGI script processing ...]
29230/1: 53.4027 write(1, " C o n t e n t - T y p e".., 239) = 239
22173/1: 53.4028 read(8, " C o n t e n t - T y p e".., 4096) = 239
22173/1: 53.4031 time()
= 1177084845 29230/1: 53.4031 llseek(0, 0,
SEEK_CUR) Err#29 ESPIPE 29230/1: 53.4032 _exit(0)
22173/1: 53.4033 fstat64(8, 0xFFBE55C0) = 0 22173/1:
53.4034 fcntl(8, F_SETFL, 0x00000080) = 0 22173/1: 53.4036
read(8,
0xFFBE57B0, 8192) = 0 22173/1: 53.4037 close(8)
= 0 22173/1: 53.4038 read(10, 0x000D0920, 4096)
= 0 22173/1: 53.4039 close(10) = 0
22173/1: 53.4040 poll(0xFFBEB888, 1, 0) = 0 22173/1:
53.4042
write(5, " H T T P / 1 . 1 2 0 0".., 421) = 421 22173/1: 53.4044 time()
= 1177084845 22173/1: 53.4047
write(17, " 1 3 7 . 1 6
7 . 2 2 0 .".., 103) = 103 22173/1: 53.4048 waitid(P_PID, 29230,
0xFFBEB868, WEXITED|WTRAPPED|WNOHANG) = 0 22173/1: 53.4049 getpid()
= 22173 [22169] 22173/1: 53.4049 kill(29230, SIGTERM)
= 0 22173/1: 53.4050 lwp_cond_signal(0xFF0ECD30) = 0
22173/4:
53.4050 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0x00000000) = 0 22173/1:
lwp_sema_wait(0x00067260) (sleeping...) 22173/2: signotifywait()
(sleeping...)
22173/3: lwp_sema_wait(0xFDF0DE60) (sleeping...)
22173/4: lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) (sleeping...)
22173/4: 56.4115 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) Err#62
ETIME 22173/4: 56.4116
lwp_sema_post(0x00067260) = 0 22173/1: 56.4118
lwp_sema_wait(0x00067260) = 0 22173/1: 56.4119
lwp_mutex_lock(0xFF0F3500)
= 0 22173/4: 56.4119 lwp_cond_signal(0x00067318)
= 0 22173/1: 56.4120 getpid() =
22173 [22169]
22173/1: 56.4121 kill(29230, SIGKILL) = 0 22173/1:
56.4121 waitid(P_PID, 29230, 0xFFBEB868, WEXITED|WTRAPPED) = 0
So, the child process running the CGI script calls exit(), but it seems it
cannot be cleaned up by the kernel, so waitpid() and SIGTERM fail. Finally,
after a delay of three
seconds, Apache sends SIGKILL to clean up the child process. Subsequent
requests are not served until then.
To me it seems the truss log matches the code in Apaches source file alloc.c,
in free_proc_chain() (I am not familiar with Apache sources so it is an
assumption this is the
code running):
static void free_proc_chain(struct process_chain *procs)
{
/* Dispose of the subprocesses we've spawned off in the course of
* whatever it was we're cleaning up now. This may involve killing * some
of them off... */
struct process_chain *p;
int need_timeout = 0;
int status;
if (procs == NULL)
return; /* No work. Whew! */
/* First, check to see if we need to do the SIGTERM, sleep, SIGKILL
* dance with any of the processes we're cleaning up. If we've got * any
kill-on-sight subprocesses, ditch them now as well, so they * don't waste any
more cycles doing
whatever it is that they shouldn't * be doing anymore.
*/
#ifdef WIN32
[... WIN32 code cut out as we are running Solaris ...]
#elif defined(NETWARE)
#else
#ifndef NEED_WAITPID
/* Pick up all defunct processes */
for (p = procs; p; p = p->next) {
if (waitpid(p->pid, (int *) 0, WNOHANG) > 0) {
p->kill_how = kill_never;
}
}
#endif
for (p = procs; p; p = p->next) {
if ((p->kill_how == kill_after_timeout)
|| (p->kill_how == kill_only_once)) {
/* Subprocess may be dead already. Only need the timeout if not. */ if
(ap_os_kill(p->pid, SIGTERM) != -1) need_timeout = 1;
}
else if (p->kill_how == kill_always) {
kill(p->pid, SIGKILL);
}
}
/* Sleep only if we have to... */
if (need_timeout)
sleep(3);
/* OK, the scripts we just timed out for have had a chance to clean up
* --- now, just get rid of them, and also clean up the system accounting *
goop... */
for (p = procs; p; p = p->next) {
if (p->kill_how == kill_after_timeout)
kill(p->pid, SIGKILL);
if (p->kill_how != kill_never)
waitpid(p->pid, &status, 0);
}
#endif /* WIN32 */
}
In this code there are a waitpid() call with WNOHANG, a sending of SIGTERM, a
three second delay, and a final SIGKILL, similar to the truss log.
The question is why the child process cannot be cleaned up after exit()
(sometimes it works, but very often it fails, with the same script). Is this a
known problem? Is there
a workaround? Is it probably related to the operating system version (Solaris
8), and if yes is there a patch? (The system is well maintained and patched but
possibly a
special patch was missed?)
Any help or hint would be appreciated.
Thank you in advance
Jochen
---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: [EMAIL PROTECTED]
" from the digest: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]