Hi

We recently started getting this famous error at random after migrating
a mail server to new hardware running a Red Hat Enterprise Linux 3
clone. I saw a lot of references to this problem, but no real solution
(use the SQL backend, upgrade the hardware, ...).
As we didn't have this problem before, we checked whether there was some
disk problem, but the new server is a lot faster, with hardware RAID,
hyperthreading, etc.
Looking at the open files with 'lsof', I saw that whenever it happened
we always had a qmail-pop3d process holding a write lock on
open-smtp.lock, with a bunch of vchkpw processes in "D" state, probably
waiting for the lock.
As qmail has nothing to do with open-smtp, I looked in the code and I
think I found a locking problem in open_smtp_relay() in vpopmail.c.

Basically, if the tcp.smtp.cdb file (re)generation (update_rules())
fails for some reason, vchkpw launches (execve) qmail-pop3d without
releasing its lock on open-smtp.lock. The lock (and the open file) is
then inherited by qmail-pop3d and thus stays locked until qmail-pop3d
exits. And that can take quite long on some huge mailboxes (e.g. after a
migration, when the MX relays have dumped all their queued mail into the
mailboxes) or when a customer has a slow link. As long as qmail-pop3d
hasn't terminated, all other pop3 connections will fail with
"Try again ..." after 30 seconds (and stupid Outlook will ask the
customer to confirm their password...).
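
To illustrate the mechanism described above, here is a minimal sketch.
It is not vpopmail code: relay_update(), set_lock() and lock_is_free()
are hypothetical names, and it assumes fcntl()-style locks (the actual
locking call depends on how vpopmail was built). The early return
without an unlock leaves the descriptor open and the lock held, and
such a lock would normally survive an execve() in the same process.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Take (F_WRLCK) or drop (F_UNLCK) a non-blocking lock on the whole file. */
static int set_lock(int fd, short type)
{
    struct flock fl = {0};
    fl.l_type = type;
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;                      /* 0 means "up to end of file" */
    return fcntl(fd, F_SETLK, &fl);
}

/* Hypothetical stand-in for open_smtp_relay(): lock, rebuild, unlock.
 * rebuild_fails simulates update_rules() != 0.
 * release_on_error == 0 mimics the unpatched early return. */
static int relay_update(const char *path, int rebuild_fails,
                        int release_on_error)
{
    assert(path != NULL);
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (set_lock(fd, F_WRLCK) != 0) {
        close(fd);
        return -1;
    }
    if (rebuild_fails) {
        if (release_on_error) {        /* patched behaviour */
            set_lock(fd, F_UNLCK);
            close(fd);
        }
        return -1;                     /* unpatched: fd and lock leak */
    }
    set_lock(fd, F_UNLCK);
    close(fd);
    return 0;
}

/* Ask a separate process whether it could lock the file right now.
 * Returns 1 if the lock is free, 0 if it is held, -1 on error. */
static int lock_is_free(const char *path)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {                    /* child: parent's locks are not inherited */
        int fd = open(path, O_RDWR);
        if (fd < 0)
            _exit(2);
        _exit(set_lock(fd, F_WRLCK) == 0 ? 0 : 1);
    }
    int status;
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 0)
        return 1;
    return WEXITSTATUS(status) == 1 ? 0 : -1;
}
```

Calling relay_update(path, 1, 0) and then lock_is_free(path) should
report the lock as still held, while relay_update(path, 1, 1) should
leave it free for the next process.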

After applying the following patch (against 5.3.27, but the problem is
still present in 5.4.10), the problem seems to be cured: no more errors
for more than two days.
Well, I still have to figure out why update_rules() fails on RHEL3, but
at least now only the faulty process has a problem, not all the others.

Hope this helps.

Gaetan

Here is the patch.


diff -ruN vpopmail-5.3.27.orig/vpopmail.c vpopmail-5.3.27/vpopmail.c
--- vpopmail-5.3.27.orig/vpopmail.c     2003-09-03 23:46:22.000000000 +0200
+++ vpopmail-5.3.27/vpopmail.c  2005-11-23 23:35:48.000000000 +0100
@@ -2519,10 +2519,13 @@
  if ( rebuild_cdb ) {
    if (update_rules() != 0) {
      printf("Error. update_rules() failed\n");
+       #ifdef FILE_LOCKING
+         unlock_lock(fileno(fs_lok_file), 0, SEEK_SET, 0);
+         fclose(fs_lok_file);
+       #endif /* FILE_LOCKING */
      return (-1);
    }
  }
-
#ifdef FILE_LOCKING
  unlock_lock(fileno(fs_lok_file), 0, SEEK_SET, 0);
  fclose(fs_lok_file);


