[users@httpd] Child process running CGI script does not terminate after exit() and is killed after a 3s delay, under Solaris 8

Jochen Stenzel Thu, 26 Apr 2007 03:00:45 -0700

Hello,

using Apache 1.3.27 on Solaris 8 we see a problem with returning from a CGI 
script. I'd like to ask if this is a known phenomenon, and if there is a 
workaround or patch or a
page about it. I searched the Web, the Apache sources and the archives of this 
mailing list but I could not find a hint about this particular problem. One 
posting lead me to
the function in alloc.c mentioned below.


We are using Apache 1.3.27 with mod_cgi on Solaris 8. With both the version 
provided by Oracle and the Sun Freeware version we have the following problem: 
a CGI script builds
a page that contains references to additional files, like Javascript libraries 
and CSS files. Loading of these files can be delayed by approximately 3 seconds 
(seen with the
Firefox plugin Firebug on client side, and in both snoop and truss logs on 
server side), which significantly decreases site performance. Not *every* call 
is affected, but the
delays occur quite often.

The original script was written in Perl but perl ist not the problem - the 
delay occurs with a shell script as well:

 #!/usr/bin/tcsh

 echo 'Content-Type: text/html; charset=ISO-8859-1'
 echo ''
 echo '<html>'
 echo ' <head>'
 echo '<title>Load Delay</title>'
 echo '<script src="/xyz.js" type="text/javascript"></script>'
 echo '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" 
/>' echo '</head>' echo '<body>' echo 'Load delay.'
 echo '</body>'
 echo '</html>'

According to truss the additional files are not the problem. We can reproduce 
the delay with a simple script that generates a page without supplementary file 
requests:

 #!/usr/local/bin/perl -w

 print <<EOHTML;
 Content-Type: text/html; charset=ISO-8859-1

 <html>
  <head>
   <title>Load Delay</title>
   <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
  </head>
  <body>
   Load Delay
  </body>
 </html>

 EOHTML

Using truss we get this typical trace for the delay case:

22173/1:  53.3753 fork1()                                         = 29230 
29230/1:  53.3753 fork1()         (returning as child ...)        = 22173

                [... CGI script processing ...]

29230/1:  53.4027 write(1, " C o n t e n t - T y p e".., 239)     = 239 
22173/1:  53.4028 read(8, " C o n t e n t - T y p e".., 4096)     = 239 
22173/1:  53.4031 time()
                                   = 1177084845 29230/1:  53.4031 llseek(0, 0, 
SEEK_CUR)                          Err#29 ESPIPE 29230/1:  53.4032 _exit(0)

22173/1:  53.4033 fstat64(8, 0xFFBE55C0)                          = 0 22173/1:  
53.4034 fcntl(8, F_SETFL, 0x00000080)                   = 0 22173/1:  53.4036 
read(8,
0xFFBE57B0, 8192)                       = 0 22173/1:  53.4037 close(8)          
                              = 0 22173/1:  53.4038 read(10, 0x000D0920, 4096)
     = 0 22173/1:  53.4039 close(10)                                       = 0 
22173/1:  53.4040 poll(0xFFBEB888, 1, 0)                          = 0 22173/1:  
53.4042
write(5, " H T T P / 1 . 1   2 0 0".., 421)     = 421 22173/1:  53.4044 time()  
                                        = 1177084845 22173/1:  53.4047 
write(17, " 1 3 7 . 1 6
7 . 2 2 0 .".., 103)    = 103 22173/1:  53.4048 waitid(P_PID, 29230, 
0xFFBEB868, WEXITED|WTRAPPED|WNOHANG) = 0 22173/1:  53.4049 getpid()
  = 22173 [22169] 22173/1:  53.4049 kill(29230, SIGTERM)                        
    = 0 22173/1:  53.4050 lwp_cond_signal(0xFF0ECD30)                     = 0 
22173/4:
53.4050 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0x00000000) = 0 22173/1:  
lwp_sema_wait(0x00067260)       (sleeping...) 22173/2:  signotifywait()         
        (sleeping...)
22173/3:  lwp_sema_wait(0xFDF0DE60)       (sleeping...)
22173/4:  lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) (sleeping...) 
22173/4:  56.4115 lwp_cond_wait(0xFF0ECD30, 0xFF0ECD18, 0xFDE0BD30) Err#62 
ETIME 22173/4:  56.4116
lwp_sema_post(0x00067260)                       = 0 22173/1:  56.4118 
lwp_sema_wait(0x00067260)                       = 0 22173/1:  56.4119 
lwp_mutex_lock(0xFF0F3500)
             = 0 22173/4:  56.4119 lwp_cond_signal(0x00067318)                  
   = 0 22173/1:  56.4120 getpid()                                        = 
22173 [22169]
22173/1:  56.4121 kill(29230, SIGKILL)                            = 0 22173/1:  
56.4121 waitid(P_PID, 29230, 0xFFBEB868, WEXITED|WTRAPPED) = 0

So, the child process running the CGI script calls exit(), but it seems it 
cannot be cleaned up by the kernel, so waitpid() and SIGTERM fail. Finally, 
after a delay of three
seconds, Apache sends SIGKILL to clean up the child process. Subsequent 
requests are not served until then.

To me it seems the truss log matches the code in Apaches source file alloc.c, 
in free_proc_chain() (I am not familiar with Apache sources so it is an 
assumption this is the
code running):

static void free_proc_chain(struct process_chain *procs)
{
    /* Dispose of the subprocesses we've spawned off in the course of
     * whatever it was we're cleaning up now.  This may involve killing * some 
of them off... */

    struct process_chain *p;
    int need_timeout = 0;
    int status;

    if (procs == NULL)
    return;            /* No work.  Whew! */

    /* First, check to see if we need to do the SIGTERM, sleep, SIGKILL
     * dance with any of the processes we're cleaning up.  If we've got * any 
kill-on-sight subprocesses, ditch them now as well, so they * don't waste any 
more cycles doing
whatever it is that they shouldn't * be doing anymore.
     */
#ifdef WIN32
   [... WIN32 code cut out as we are running Solaris ...]
#elif defined(NETWARE)
#else
#ifndef NEED_WAITPID
    /* Pick up all defunct processes */
    for (p = procs; p; p = p->next) {
    if (waitpid(p->pid, (int *) 0, WNOHANG) > 0) {
        p->kill_how = kill_never;
    }
    }
#endif

    for (p = procs; p; p = p->next) {
    if ((p->kill_how == kill_after_timeout)
        || (p->kill_how == kill_only_once)) {
        /* Subprocess may be dead already.  Only need the timeout if not. */ if 
(ap_os_kill(p->pid, SIGTERM) != -1) need_timeout = 1;
    }
    else if (p->kill_how == kill_always) {
        kill(p->pid, SIGKILL);
    }
    }

    /* Sleep only if we have to... */

    if (need_timeout)
    sleep(3);

    /* OK, the scripts we just timed out for have had a chance to clean up
     * --- now, just get rid of them, and also clean up the system accounting * 
goop... */

    for (p = procs; p; p = p->next) {

    if (p->kill_how == kill_after_timeout)
        kill(p->pid, SIGKILL);

    if (p->kill_how != kill_never)
        waitpid(p->pid, &status, 0);
    }
#endif /* WIN32 */
}


In this code there are a waitpid() call with WNOHANG, a sending of SIGTERM, a 
three second delay, and a final SIGKILL, similar to the truss log.

The question is why the child process cannot be cleaned up after exit() 
(sometimes it works, but very often it fails, with the same script). Is this a 
known problem? Is there
a workaround? Is it probably related to the operating system version (Solaris 
8), and if yes is there a patch? (The system is well maintained and patched but 
possibly a
special patch was missed?)

Any help or hint would be appreciated.

Thank you in advance

                  Jochen










---------------------------------------------------------------------
The official User-To-User support forum of the Apache HTTP Server Project.
See <URL:http://httpd.apache.org/userslist.html> for more info.
To unsubscribe, e-mail: [EMAIL PROTECTED]
   "   from the digest: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

[users@httpd] Child process running CGI script does not terminate after exit() and is killed after a 3s delay, under Solaris 8

Reply via email to