mod_fcgid graceful restart with mpm worker

Lazy Mon, 26 Sep 2011 06:51:12 -0700

Hi,

I think this is related to
https://issues.apache.org/bugzilla/show_bug.cgi?id=48949


mpm worker, httpd 2.2.21 and mod_fcgid 2.3.6

Sometimes fcgi processes are left behind after a graceful restart. The error log shows:

[Sun Sep 25 10:47:25 2011] [notice] SIGUSR1 received.  Doing graceful restart
[Sun Sep 25 10:47:25 2011] [emerg] mod_fcgid: server is restarted, pid 17962 must exit
[Sun Sep 25 10:47:25 2011] [emerg] (22)Invalid argument: mod_fcgid: can't lock process table in PM, pid 17962

These messages come from:

void proctable_pm_lock(server_rec *s)
{
    apr_status_t rv;

    if (g_global_share->must_exit) {
        ap_log_error(APLOG_MARK, APLOG_EMERG, 0, s,
                     "mod_fcgid: server is restarted, pid %" APR_PID_T_FMT
                     " must exit",
                     getpid());
        kill(getpid(), SIGTERM);
    }
    if ((rv = proctable_lock_internal()) != APR_SUCCESS) {
        ap_log_error(APLOG_MARK, APLOG_EMERG, rv, s,
                     "mod_fcgid: can't lock process table in PM, pid %"
                     APR_PID_T_FMT, getpid());
        exit(1);
    }
}

static apr_status_t proctable_lock_internal(void)
{
    return apr_global_mutex_lock(g_sharelock);
}

Main process manager loop:

apr_status_t pm_main(server_rec * main_server, apr_pool_t * configpool)
{

    while (1) {
        if (procmgr_must_exit())
            break;

        /* Wait for command */
        if (procmgr_peek_cmd(&command, main_server) == APR_SUCCESS) {
            if (is_spawn_allowed(main_server, &command))
                fastcgi_spawn(&command, main_server, configpool);
            procmgr_finish_notify(main_server);
        }
        /* Move matched node to error list */
        scan_idlelist_zombie(main_server);
        scan_idlelist(main_server);
        scan_busylist(main_server);
        /* Kill() and wait() nodes in error list */
        scan_errorlist(main_server);
    }

    /* Stop all processes */
    kill_all_subprocess(main_server);

    return APR_SUCCESS;
}

The scan_* functions check whether the PM must exit before calling
proctable_pm_lock():

static void scan_idlelist_zombie(server_rec * main_server)
{..
    /* Should I check zombie processes in idle list now? */
    if (procmgr_must_exit()
        || apr_time_sec(now) - apr_time_sec(lastzombiescan) <=
        sconf->zombie_scan_interval)
        return;
    lastzombiescan = now;

    /*
       Check the list
     */
    proc_table = proctable_get_table_array();
    previous_node = proctable_get_idle_list();
    check_list_header = &temp_header;

    proctable_pm_lock(main_server);
...

The must_exit flag used in proctable_pm_lock() and g_caughtSigTerm used by
procmgr_must_exit() are both set in a signal handler, so they can change
between the procmgr_must_exit() check and the proctable_pm_lock() call in
the scan_* functions of the main PM loop.

If this happens and g_sharelock has already been destroyed (by the parent
process reinitializing) before proctable_pm_lock() is called, the PM exits
without calling kill_all_subprocess(main_server), leaving fcgi processes
behind. I believe this is what happened in the error log above.
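
One possible direction, sketched below as a toy model rather than a patch: have the lock-failure path clean up children before the PM exits, instead of calling exit(1) straight away. Everything in it (struct pm_state, try_lock, kill_all_children, pm_lock_or_cleanup, count_alive) is a hypothetical stand-in, not mod_fcgid code. It also assumes the PM can reach its children without the lock, for example through a private copy of their pids, which the real kill_all_subprocess() may not allow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the PM's view of its lock and children. */
#define MAX_PROCS 4

struct pm_state {
    bool lock_usable;             /* false once the parent destroyed the mutex */
    bool child_alive[MAX_PROCS];  /* PM-private record of spawned children */
};

/* Models proctable_lock_internal(): fails when the mutex is gone. */
static int try_lock(const struct pm_state *pm)
{
    return pm->lock_usable ? 0 : -1;
}

/* Models a cleanup pass that uses only PM-private state (assumption). */
static void kill_all_children(struct pm_state *pm)
{
    for (size_t i = 0; i < MAX_PROCS; i++)
        pm->child_alive[i] = false;
}

/* Returns 0 when the lock was taken. On failure it cleans up the
 * children first and returns -1 so the caller can exit afterwards. */
static int pm_lock_or_cleanup(struct pm_state *pm)
{
    if (try_lock(pm) == 0)
        return 0;
    kill_all_children(pm);
    return -1;
}

/* Helper for inspecting the model: number of children still alive. */
static size_t count_alive(const struct pm_state *pm)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_PROCS; i++)
        if (pm->child_alive[i])
            n++;
    return n;
}
```

With lock_usable set to false, pm_lock_or_cleanup() returns -1 and count_alive() drops to 0, which is the behaviour the current exit(1) path does not provide.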

To make sure this condition is almost always met, I added this to
the pm_main() main loop:

sleep(10);
proctable_pm_lock(main_server);
proctable_pm_unlock(main_server);

With this change I get the errors above, and fcgi processes are left behind on
every restart. I have seen these errors on production systems, but they are
quite rare: 1 in 20 graceful restarts at most. The parent process signals all
its children, waits a second or two, and then starts reinitialization,
which in turn clears the proctable, so the PM no longer has
information about its children.

mod_fcgid discards all shared memory segments while initializing. If the
previous PM failed to clean up, maybe initialization should take care of the
old proctable by sending SIGKILL to every process in it before removing it,
or wait some time for the old PM to exit before removing the old proctable.

Setting GracefulShutdownTimeout greater than the artificial sleep() in the PM
main loop seems to fix this issue. The main process waits longer for
its children to finish, so the PM is able to kill all its children before
the process table gets cleared. I think this will also work in production,
but it can't guarantee that all children are always killed: for example, under
heavy OOM killing with a 3-digit loadavg, the PM may not finish
before GracefulShutdownTimeout expires.

What would be the best way to fix this?


Regards,

Michal Grzedzicki
