Hi, I think this is related to https://issues.apache.org/bugzilla/show_bug.cgi?id=48949

Environment: worker MPM, httpd 2.2.21 and mod_fcgid 3.3.6. Sometimes fcgi processes are left behind after a graceful restart. The error log shows:

    [Sun Sep 25 10:47:25 2011] [notice] SIGUSR1 received. Doing graceful restart
    [Sun Sep 25 10:47:25 2011] [emerg] mod_fcgid: server is restarted, pid 17962 must exit
    [Sun Sep 25 10:47:25 2011] [emerg] (22)Invalid argument: mod_fcgid: can't lock process table in PM, pid 17962

These messages come from:

    void proctable_pm_lock(server_rec *s)
    {
        apr_status_t rv;

        if (g_global_share->must_exit) {
            ap_log_error(APLOG_MARK, APLOG_EMERG, 0, s,
                         "mod_fcgid: server is restarted, pid %"
                         APR_PID_T_FMT " must exit", getpid());
            kill(getpid(), SIGTERM);
        }

        if ((rv = proctable_lock_internal()) != APR_SUCCESS) {
            ap_log_error(APLOG_MARK, APLOG_EMERG, rv, s,
                         "mod_fcgid: can't lock process table in PM, pid %"
                         APR_PID_T_FMT, getpid());
            exit(1);
        }
    }

    static apr_status_t proctable_lock_internal(void)
    {
        return apr_global_mutex_lock(g_sharelock);
    }

The main process manager loop:

    apr_status_t pm_main(server_rec * main_server, apr_pool_t * configpool)
    {
        while (1) {
            if (procmgr_must_exit())
                break;

            /* Wait for command */
            if (procmgr_peek_cmd(&command, main_server) == APR_SUCCESS) {
                if (is_spawn_allowed(main_server, &command))
                    fastcgi_spawn(&command, main_server, configpool);

                procmgr_finish_notify(main_server);
            }

            /* Move matched node to error list */
            scan_idlelist_zombie(main_server);
            scan_idlelist(main_server);
            scan_busylist(main_server);

            /* Kill() and wait() nodes in error list */
            scan_errorlist(main_server);
        }

        /* Stop all processes */
        kill_all_subprocess(main_server);

        return APR_SUCCESS;
    }

The scan_* functions check procmgr_must_exit() before calling proctable_pm_lock():

    static void scan_idlelist_zombie(server_rec * main_server)
    {
        ..
        /* Should I check zombie processes in idle list now? */
        if (procmgr_must_exit()
            || apr_time_sec(now) - apr_time_sec(lastzombiescan) <=
            sconf->zombie_scan_interval)
            return;
        lastzombiescan = now;

        /* Check the list */
        proc_table = proctable_get_table_array();
        previous_node = proctable_get_idle_list();
        check_list_header = &temp_header;

        proctable_pm_lock(main_server);
        ...

The must_exit flag used in proctable_pm_lock() and g_caughtSigTerm used by procmgr_must_exit() are both set in a signal handler, so they can change between the procmgr_must_exit() check and the proctable_pm_lock() call in a scan_* function in the main PM loop. If this happens and g_sharelock has already been destroyed (by the parent process reinitializing) before proctable_pm_lock() is called, the PM exits without calling kill_all_subprocess(main_server), leaving fcgi processes behind. I believe this is what happened in the error log above.

To make sure this condition is almost always met, I added this to the pm_main() main loop:

    sleep(10);
    proctable_pm_lock(main_server);
    proctable_pm_unlock(main_server);

With this change I get the errors above and fcgi processes are left behind on every restart. I have seen these errors on production systems, but they are quite rare: 1 in 20 graceful restarts at most.

The parent process signals all its children, waits a second or two, and starts reinitialization, which in turn clears the proctable, so the PM no longer has any information about its children. mod_fcgid discards all shared memory segments while initializing. Maybe it should take care of the old proctable if the previous PM failed to, either by sending SIGKILL to all processes listed in it before removing it, or by waiting some time for the old PM to exit before removing the old proctable.

Setting GracefulShutdownTimeout greater than the artificial sleep() in the PM main loop seems to fix this issue. The main process then waits longer for its children to finish, so the PM is able to kill all its children before the process table gets cleared.
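
To illustrate the check-then-lock window described above, here is a minimal, self-contained model. It is not mod_fcgid code: every sim_* name is a hypothetical stand-in, and it assumes the PM keeps a private record of its children that survives the parent's reinitialization, which the real module may not do. The only point it tries to show is one possible shape of a fix, where the lock function reports failure to its caller instead of calling exit(1), so the loop can still fall through to the cleanup step.

```c
#include <assert.h>
#include <signal.h>

#define SIM_MAX_CHILDREN 8

/* Hypothetical stand-ins for illustration only; none of these are mod_fcgid APIs. */
static volatile sig_atomic_t sim_must_exit;   /* a signal handler would set this */
static int sim_lock_valid;                    /* 0 once the parent destroyed the mutex */
static int sim_child_alive[SIM_MAX_CHILDREN]; /* assumption: PM-private child list */

typedef void (*sim_hook_fn)(void);

/* Put the model into a clean state with `nchildren` live children. */
void sim_reset(int nchildren)
{
    int i;
    assert(nchildren >= 0 && nchildren <= SIM_MAX_CHILDREN);
    sim_must_exit = 0;
    sim_lock_valid = 1;
    for (i = 0; i < SIM_MAX_CHILDREN; i++)
        sim_child_alive[i] = (i < nchildren);
}

/* Models the restart landing inside the window: flag set, mutex gone. */
void sim_restart_in_window(void)
{
    sim_must_exit = 1;
    sim_lock_valid = 0;
}

/* Reports failure (-1) to the caller instead of calling exit(1). */
int sim_pm_lock(void)
{
    return (sim_must_exit || !sim_lock_valid) ? -1 : 0;
}

void sim_pm_unlock(void) { }

/* One scan step; `between` runs in the gap between the check and the lock. */
int sim_scan(sim_hook_fn between)
{
    if (sim_must_exit)
        return 0;               /* the check */
    if (between)
        between();              /* the window */
    if (sim_pm_lock() != 0)
        return -1;              /* the lock attempt failed */
    /* ... list walking would happen here ... */
    sim_pm_unlock();
    return 0;
}

/* Number of children still marked alive. */
int sim_alive_count(void)
{
    int i, n = 0;
    for (i = 0; i < SIM_MAX_CHILDREN; i++)
        n += sim_child_alive[i];
    return n;
}

/* Marks every live child as killed and returns how many there were. */
int sim_kill_all(void)
{
    int i, killed = 0;
    for (i = 0; i < SIM_MAX_CHILDREN; i++) {
        if (sim_child_alive[i]) {
            sim_child_alive[i] = 0;
            killed++;
        }
    }
    return killed;
}

/* Main loop: a failed lock leaves the loop, so cleanup runs on every path. */
int sim_pm_main(int iterations, sim_hook_fn between)
{
    int i;
    for (i = 0; i < iterations; i++) {
        if (sim_must_exit)
            break;
        if (sim_scan(between) != 0)
            break;
    }
    return sim_kill_all();
}
```

In this model, sim_reset(3) followed by sim_pm_main(5, sim_restart_in_window) returns 3: the lock attempt fails inside the window, yet all three simulated children are still reaped because the loop breaks out to the cleanup call rather than exiting.
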
I think this workaround will also work in production, but it can't guarantee that all children are always killed. For example, when there is heavy OOM killing going on with a three-digit load average, the PM may not make it in time before GracefulShutdownTimeout expires.

What would be the best way to fix this?

Regards,
Michal Grzedzicki