Re: Problems with qmail-remote hanging
> Setting an alarm is a nasty hack in my opinion, but I have to admit
> that it's something I considered.

Well, the qmail-remote connection is well and truly wedged once it's in
this state, and if the select() timed out as it's meant to, qmail-remote
would exit with a delivery failure indication, so it's not that bad a
hack. It's also very easy to code - just a single alarm() call at the
top of main().

> A slightly neater solution might be to use
> the SO_KEEPALIVE socket option - if it works (and there isn't a good reason
> not to use it) that is.

It'll be interesting to hear if this works.

> What would be better is finding out why this happens, of course.

Indeed. Does Linux offer tools/syscalls that would tell you why the
select() succeeded but the read() then blocked?

> P.S. If anyone is keeping track, Linux 2.2.19, concurrencyremote set to 200

I hesitate to say this, but Linux kernels seem to predominate in this
regard, though that may just be because qmail is running on more Linux
systems out there than on other Unixen.

Regards.
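For illustration, the alarm() approach discussed above might look like the following minimal sketch. It is not a patch against qmail-remote.c: the helper names watchdog_arm() and watchdog_cancel() and the one-hour WATCHDOG_SECS value are assumptions made up for this example, and it relies on SIGALRM keeping its default disposition (terminate the process), which would need checking against the real source.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical upper bound on one delivery attempt, in seconds. */
#define WATCHDOG_SECS 3600u

/*
 * Arm a process-wide watchdog. If SIGALRM is neither caught nor
 * ignored, its default action terminates the process when the timer
 * expires, even if the process is stuck inside a blocking read().
 * Returns the seconds that were left on any previously armed alarm
 * (0 if there was none).
 */
unsigned watchdog_arm(unsigned secs)
{
    assert(secs > 0);   /* alarm(0) would cancel instead of arming */
    return alarm(secs);
}

/*
 * Cancel the watchdog. Returns the seconds that were still left on
 * it, or 0 if no alarm was pending.
 */
unsigned watchdog_cancel(void)
{
    return alarm(0);
}
```

In this sketch, calling watchdog_arm(WATCHDOG_SECS) as the first statement of main() would bound the lifetime of the whole process. The value has to be comfortably larger than the longest legitimate delivery, since the alarm fires whether or not progress is being made.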
RE: Problems with qmail-remote hanging
> This problem's been reported before. If your OS says that an fd is
> readable via select(), then the read() should not block.
>
> As you observe though, the read is blocking so your OS is probably not
> telling the truth when it returns from the select().
>
> The archives have plenty of discussion on this and the simplest
> solution is to put a large-value alarm() handler in qmail-remote. No
> one as yet seems to be able to narrow down which OSes do this and
> under what circumstances.

Mark,

Thanks for the reply.

I only seem to experience the problem with large mail-outs. One
possibility is that because of the way qmail works, there's a
significant chance that we will be making a large number of
simultaneous connections to some servers. It's possible that this is
causing a connection to be blackholed somewhere ... that doesn't
explain why select/read are failing to agree, though. Perhaps select
thinks the connection is closed, but read doesn't.

Setting an alarm is a nasty hack in my opinion, but I have to admit
that it's something I considered. A slightly neater solution might be
to use the SO_KEEPALIVE socket option - if it works (and there isn't a
good reason not to use it) that is.

What would be better is finding out why this happens, of course.

Thanks,
Richard

P.S. If anyone is keeping track, Linux 2.2.19, concurrencyremote set to 200
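As a rough illustration of the SO_KEEPALIVE idea, the sketch below turns the option on for an already-created socket. The function name enable_keepalive() is an assumption made up for this example, and where such a call would go in qmail-remote is not something this thread establishes.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Ask the kernel to send TCP keepalive probes on an idle connection.
 * If the peer stops answering the probes, the kernel eventually
 * errors out the connection, which wakes up a blocked read().
 * Returns 0 on success, -1 on failure with errno set.
 */
int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);    /* caller must pass an open socket */
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

Two caveats apply to this sketch. On current Linux kernels the default idle time before the first probe is two hours, so a stuck process would linger for at least that long unless the system-wide setting is lowered. Keepalive also only detects a peer that has vanished or become unreachable; if the remote host still answers probes but never sends its greeting, the read() stays blocked.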
Re: Problems with qmail-remote hanging
> I've been running qmail on a number of platforms quite happily for a
> while - until now I've had no problems at all. However, I am now
> experiencing a problem with qmail-remote hanging.

> The problem I see is with qmail-remote failing to terminate when a
> connection times out. If left alone, the number of "stuck" processes will
> slowly climb, after about a month I had about 25 such processes. The network
> connections remain in the "ESTABLISHED" state.
>
> Looking at the process list right now, I have one stuck:
>
> # ps -ef | grep qmail-remote
> qmailr 12278 662 0 13:13 ? 00:00:00 qmail-remote xx.co.uk xx
> qmailr 19876 662 0 16:09 ? 00:00:00 qmail-remote xx.com
> root 19912 19489 0 16:10 pts/0 00:00:00 grep qmail-remote
>
> # strace -p 12278
> read(3,
>
> ... all socket read()s in qmail-remote should be protected by a
> select and therefore should not block as this one is doing now. After
> recompiling with debugging and symbols, I get ...

Exactly.

This problem's been reported before. If your OS says that an fd is
readable via select(), then the read() should not block.

As you observe though, the read is blocking so your OS is probably not
telling the truth when it returns from the select().

The archives have plenty of discussion on this and the simplest
solution is to put a large-value alarm() handler in qmail-remote. No
one as yet seems to be able to narrow down which OSes do this and
under what circumstances.

Regards.
Problems with qmail-remote hanging
Hi,

I've been running qmail on a number of platforms quite happily for a
while - until now I've had no problems at all. However, I am now
experiencing a problem with qmail-remote hanging.

I'm running qmail on this server for sending mails from websites and
bulk mail-outs (up to about 40,000 recipients). The server doesn't
receive mail itself to a great extent. It's a dual-CPU Dell running
Linux. I have another very similar installation which has absolutely
no problems. Qmail on this server is 100% standard qmail 1.03.

The problem I see is with qmail-remote failing to terminate when a
connection times out. If left alone, the number of "stuck" processes
will slowly climb, after about a month I had about 25 such processes.
The network connections remain in the "ESTABLISHED" state.

Looking at the process list right now, I have one stuck:

# ps -ef | grep qmail-remote
qmailr 12278 662 0 13:13 ? 00:00:00 qmail-remote xx.co.uk xx
qmailr 19876 662 0 16:09 ? 00:00:00 qmail-remote xx.com
root 19912 19489 0 16:10 pts/0 00:00:00 grep qmail-remote

# strace -p 12278
read(3,

... all socket read()s in qmail-remote should be protected by a
select and therefore should not block as this one is doing now. After
recompiling with debugging and symbols, I get ...

# gdb qmail-remote 12278
GNU gdb 5.0
Attaching to program: /home/qmail/bin/qmail-remote, Pid 12278
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libc.so.6...wdone.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...hdone.
Loaded symbols for /lib/ld-linux.so.2
0x40103424 in __libc_read () from /lib/libc.so.6
(gdb) where
#0  0x40103424 in __libc_read () from /lib/libc.so.6
#1  0x3b654f80 in ?? ()
#2  0x8048f05 in saferead (fd=-1, buf=0x8051180 "", len=128) at qmail-remote.c:113
#3  0x804d193 in oneread (op=0x8048ee8 , fd=-1, buf=0x8051180 "", len=128) at substdi.c:14
#4  0x804d25e in substdio_feed (s=0x804f3d0) at substdi.c:44
#5  0x804d3ab in substdio_get (s=0x804f3d0, buf=0xbdc7 "", len=1) at substdi.c:75
#6  0x8048f70 in get (ch=0xbdc7 "") at qmail-remote.c:137
#7  0x8048fda in smtpcode () at qmail-remote.c:150
#8  0x80492cb in smtp () at qmail-remote.c:225
#9  0x8049d31 in main (argc=4, argv=0xbe94) at qmail-remote.c:420
#10 0x4004bf31 in __libc_start_main (main=0x804987c , argc=4, ubp_av=0xbe94, init=0x804878c <_init>, fini=0x804dd10 <_fini>, rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbe8c) at ../sysdeps/generic/libc-start.c:129

... in smtp() ...

220 {
221   unsigned long code;
222   int flagbother;
223   int i;
224
225 =>if (smtpcode() != 220) quit("ZConnected to "," but greeting failed");
226
227   substdio_puts(&smtpto,"HELO ");
228   substdio_put(&smtpto,helohost.s,helohost.len);
229   substdio_puts(&smtpto,"\r\n");

saferead() calls timeoutread(), which calls select() and then read().
fd=-1 is a red herring; it's not used by saferead in qmail-remote.

Can anyone explain this, or has anyone experienced anything similar?

Thanks,
Richard
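The select()-then-read() pattern described above can be sketched as follows. This illustrates the general technique only and is not the qmail source: read_with_timeout() and its use of ETIMEDOUT to signal a timeout are choices made for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Wait up to timeout_secs for fd to become readable, then read from it.
 * Returns the byte count from read(), or -1 with errno set
 * (ETIMEDOUT if nothing became readable in time).
 */
ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_secs)
{
    fd_set rfds;
    struct timeval tv;
    int ready;

    assert(fd >= 0 && fd < FD_SETSIZE);

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_secs;
    tv.tv_usec = 0;

    ready = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (ready == -1)
        return -1;              /* select() itself failed */
    if (ready == 0) {
        errno = ETIMEDOUT;      /* the timeout path that should fire */
        return -1;
    }

    /*
     * The timeout only covers the select() above. If the descriptor
     * is in blocking mode and the data select() saw is no longer
     * available, this read() has no time limit of its own.
     */
    return read(fd, buf, len);
}
```

The comment before read() marks the gap the backtrace points at: select() reports readiness at one instant, read() runs at a later one, and on a blocking descriptor nothing bounds the read() itself.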