Hi,

We recently started getting this famous error at random after migrating a mail server to new hardware running a Red Hat Enterprise Linux 3 clone. I saw a lot of references to this problem, but no real solution (use an SQL backend, upgrade the hardware, ...).

As we didn't have this problem before, we checked whether there was a disk problem, but the new server is a lot faster, with hardware RAID, hyperthreading, etc.

Looking at the open files with 'lsof', I saw that whenever it happened there was a qmail-pop3d process holding a write lock on open-smtp.lock, along with a bunch of vchkpw processes in "D" state, probably waiting for the lock. As qmail has nothing to do with open-smtp, I looked at the code and I think I found a locking problem in open_smtp_relay() in vpopmail.c.
Basically, if the (re)generation of the tcp.smtp.cdb file fails for some reason (in update_rules()), vchkpw launches qmail-pop3d (via execve) without releasing its lock on open-smtp.lock. The lock is then inherited by qmail-pop3d and stays in place until qmail-pop3d exits. That can take quite long on huge mailboxes (e.g. after a migration, when the MX relays have dumped all their queued mail into the mailboxes) or when a customer has a slow link.

As long as that qmail-pop3d hasn't terminated, all other POP3 connections fail with "Try again ..." after 30 seconds (and stupid Outlook then asks the customer to confirm the password...).

After applying the following patch (against 5.3.27, but the problem is still present in 5.4.10), the problem seems to be cured: no more errors for more than two days. I still have to figure out why update_rules() fails on RHEL3, but at least now only the faulty process has a problem, not all the others.

Hope this helps.

Gaetan

Here is the patch.

diff -ruN vpopmail-5.3.27.orig/vpopmail.c vpopmail-5.3.27/vpopmail.c
--- vpopmail-5.3.27.orig/vpopmail.c 2003-09-03 23:46:22.000000000 +0200
+++ vpopmail-5.3.27/vpopmail.c 2005-11-23 23:35:48.000000000 +0100
@@ -2519,10 +2519,13 @@
   if ( rebuild_cdb ) {
     if (update_rules() != 0) {
       printf("Error. update_rules() failed\n");
+      #ifdef FILE_LOCKING
+        unlock_lock(fileno(fs_lok_file), 0, SEEK_SET, 0);
+        fclose(fs_lok_file);
+      #endif /* FILE_LOCKING */
       return (-1);
     }
   }
-
 #ifdef FILE_LOCKING
   unlock_lock(fileno(fs_lok_file), 0, SEEK_SET, 0);
   fclose(fs_lok_file);
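
For anyone who wants to see the lock inheritance in isolation, here is a minimal sketch. It is not vpopmail code, and it assumes the lock is a POSIX fcntl() record lock, which I believe is what the FILE_LOCKING code uses. The names set_lock() and lock_held_after_exec(), the lock path, and the use of "sleep" as a stand-in for qmail-pop3d are all made up for the demo. A child takes a write lock, optionally releases it, then execs; the parent waits until the exec has happened and asks the kernel with F_GETLK whether the file is still locked.

```c
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Set or clear a whole-file fcntl() record lock (hypothetical helper). */
static int set_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;        /* F_WRLCK or F_UNLCK */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 means "to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/*
 * Fork a child that write-locks `path` and then execs "sleep" (standing in
 * for qmail-pop3d). If release_before_exec is non-zero, the child unlocks
 * and closes the file first, as the patched error path does.
 * Returns 1 if the file is still locked after the exec, 0 if it is free,
 * -1 on error.
 */
int lock_held_after_exec(const char *path, int release_before_exec)
{
    int pipefd[2];
    if (pipe(pipefd) != 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    if (pid == 0) {
        close(pipefd[0]);
        /* The write end closes on a successful exec, so the parent sees EOF. */
        fcntl(pipefd[1], F_SETFD, FD_CLOEXEC);

        int fd = open(path, O_RDWR | O_CREAT, 0600);
        if (fd < 0 || set_lock(fd, F_WRLCK) != 0) {
            if (write(pipefd[1], "x", 1) < 0) { /* nothing more to do */ }
            _exit(126);
        }
        if (release_before_exec) {
            set_lock(fd, F_UNLCK);
            close(fd);
        }
        execlp("sleep", "sleep", "30", (char *)NULL);
        /* Only reached if exec failed: tell the parent with one byte. */
        if (write(pipefd[1], "x", 1) < 0) { /* nothing more to do */ }
        _exit(127);
    }

    close(pipefd[1]);
    char c;
    ssize_t n = read(pipefd[0], &c, 1);   /* 0 = exec succeeded */
    close(pipefd[0]);

    int result = -1;
    if (n == 0) {
        int fd = open(path, O_RDWR);
        if (fd >= 0) {
            struct flock fl = {0};
            fl.l_type = F_WRLCK;          /* "could I take a write lock?" */
            fl.l_whence = SEEK_SET;
            if (fcntl(fd, F_GETLK, &fl) == 0)
                result = (fl.l_type != F_UNLCK);
            close(fd);
        }
    }

    kill(pid, SIGKILL);                   /* stop the stand-in process */
    waitpid(pid, NULL, 0);
    return result;
}

#ifdef LOCK_DEMO_MAIN
int main(void)
{
    const char *path = "/tmp/open-smtp-demo.lock";   /* hypothetical path */
    printf("no unlock before exec: held=%d\n", lock_held_after_exec(path, 0));
    printf("unlock before exec:    held=%d\n", lock_held_after_exec(path, 1));
    return 0;
}
#endif
```

Compile with -DLOCK_DEMO_MAIN to get a small test program. On Linux I would expect the first call to report the lock as still held and the second to report it free, which is the difference the patch makes on the error path.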