Re: qmail-remote (cry wolf?)
On Tue 19 Jun, 2001, Mark Jefferys [EMAIL PROTECTED] wrote:
> On Sun, Jun 17, 2001 at 08:56:13PM +0100, James R Grinter wrote:
> Go look at timeoutread(), which *is* in your path. The select is in the line right before where you wedge.

Sorry, yes. You're right.

> It doesn't. (Don't know about other people's.) It assumes that the fd_sets will be cleared on timeout. Setting the fd_sets each time is always necessary and doesn't protect against this issue, anyway.

I've now properly read the code, and I see what you're suggesting. I may be naive in believing manual pages, but in the absence of other evidence I do tend to go with what they say, and the Solaris page does explicitly mention zeroing the values upon timeout. I therefore wouldn't have expected to see this particular problem on Solaris 2.x. However, it wouldn't be too hard to modify the code to log the case where the timeout is reached and an fd_set is not zero.

> I also put a debugging version of qmail-remote on my system, so if it ever decides to hang again I can fling gdb at it.

Yes, that is what I should do too.

James.
Re: qmail-remote (cry wolf?)
Hi,

> > [Summary: Some systems leave the fd_sets alone when select times out.]
>
> I think it isn't relevant. qmail-remote doesn't seem to use select,

It does. timeoutread.c:

  int timeoutread(t,fd,buf,len) int t; int fd; char *buf; int len;
  {
    fd_set rfds;
    struct timeval tv;

    tv.tv_sec = t;
    tv.tv_usec = 0;

    FD_ZERO(&rfds);
    FD_SET(fd,&rfds);

    if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
    if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);

    errno = error_timeout;
    return -1;
  }

When select returns -1 (the error case) everything is fine. When select returns 0, i.e. in the timeout case, read is still called if select has not cleared the fd bit in rfds. So if there really are OSes which do not clear the bits, qmail can block in read on those OSes.

> As to different OS behaviour, Solaris 2.6 (and 7) both say:
>
> and errorfds arguments are not modified. If the timeout interval expires without the specified condition being true for any of the specified file descriptors, the objects pointed to by the readfs, writefs, and errorfds arguments have all bits set to 0.

On Solaris the above code would work without flaws.

> whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:
>
> descriptor sets. 0 indicates that the time limit referred to by timeout expired. On failure, select() returns -1, sets errno to indicate the error, and the descriptor sets are not changed.

The page doesn't say explicitly what happens when select returns 0, but since it mentions only for the error case that the bits are left unchanged, one supposes that they are cleared on timeout, and thus qmail will not have any problems.

claudio
--
Claudio Nieder, Kanalweg 1, CH-8610 Uster, Tel +41 79 357 6743
yahoo messenger: claudionieder aim: claudionieder icq:42315212
mailto:[EMAIL PROTECTED] http://www.claudio.ch
Re: qmail-remote (cry wolf?)
On Mon, Jun 18, 2001 at 11:05:34PM +0200, Claudio Nieder allegedly wrote:
> On Solaris the above code would work without flaws.
>
> > whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:
> >
> > descriptor sets. 0 indicates that the time limit referred to by timeout expired. On failure, select() returns -1, sets errno to indicate the error, and the descriptor sets are not changed.
>
> It doesn't tell explicitly what it does when it returns 0, but as it's mentioned only in the error case, that the bits are not cleared, one supposes that in timeout situations they are cleared, and thus qmail will not have any problems.

Same with FreeBSD 4.3 - by implication.

  ...
  On return, select() replaces the given descriptor sets with subsets consisting of those descriptors that are ready for the requested operation.
  ...
  RETURN VALUES
  ...
  If select() returns with an error, including one due to an interrupted call, the descriptor sets will be unmodified.

For those who are having significant recurrences of this problem, are you in a position to change timeoutread.c to check for a zero return from select? It sure would help isolate this problem if you can.

Regards.
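For anyone in a position to try the change asked for above, here is a minimal sketch of what "check for a zero return from select" could look like. It is not DJB's code: the name timeoutread_checked is hypothetical, ETIMEDOUT stands in for qmail's error_timeout, and the prototype is ANSI rather than the K&R style of the original. The only behavioural difference is that read() is never attempted when select() itself reports a timeout, whatever the fd_set happens to contain.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of timeoutread(): trust select()'s return value,
 * not the contents of the fd_set, to decide whether a timeout happened. */
ssize_t timeoutread_checked(int t, int fd, char *buf, size_t len)
{
  fd_set rfds;
  struct timeval tv;
  int res;

  tv.tv_sec = t;
  tv.tv_usec = 0;

  FD_ZERO(&rfds);
  FD_SET(fd, &rfds);

  res = select(fd + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  if (res == -1) return -1;   /* select() failed; errno is already set */
  if (res == 0) {             /* timed out: ignore whatever rfds holds */
    errno = ETIMEDOUT;        /* assumption: stands in for error_timeout */
    return -1;
  }
  if (FD_ISSET(fd, &rfds)) return read(fd, buf, len);

  errno = ETIMEDOUT;          /* res > 0 but our fd is not set: treat as timeout */
  return -1;
}
```

If the hangs stop with a change along these lines, that would point at select() leaving the bit set on timeout; if they continue, select() must be returning a positive count for a descriptor that then blocks in read(), which is a different problem.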
Re: qmail-remote (cry wolf?)
On Sun, Jun 17, 2001 at 08:56:13PM +0100, James R Grinter wrote:

% I think it isn't relevant. qmail-remote doesn't seem to use select,
% or at least it's nowhere in the path where my qmail-remote wedges.

Go look at timeoutread(), which *is* in your path. The select is in the line right before where you wedge.

% As to different OS behaviour, Solaris 2.6 (and 7) both say:

[Man page claims it doesn't do this.]

% whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:

[Man page unclear.]

% and I can tell you that I've not seen the problem happen with
% qmail-remote on SunOS 4.1.4.

Well, I don't necessarily trust man pages to tell the truth, especially if this was added accidentally (i.e. if it's a bug).

And I still haven't seen anything to really convince me that any OS actually does this. I've only seen that a few people think some do, that it could easily happen as a bug, and that it could explain the hung qmail-remotes. And it's easily fixed if it is the problem.

In other words, I'm not saying that this is the cause, only that it's possible.

% Indeed, I think DJB's code (and most
% other people's) compensates for both behaviours by setting the
% necessary FD's each time anyway.

It doesn't. (Don't know about other people's.) It assumes that the fd_sets will be cleared on timeout. Setting the fd_sets each time is always necessary and doesn't protect against this issue, anyway.

In any case, since I did see (one) stuck process recently I built myself a test to see if I could reproduce it. I couldn't. At least on a RedHat linux 2.2.19-6.2.1 or -6.2.1smp, it looks like select acts sanely on a timeout, at least some of the time.

I also put a debugging version of qmail-remote on my system, so if it ever decides to hang again I can fling gdb at it.

Mark
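The test mentioned above was not posted, so what follows is only a guess at its general shape: a probe that lets select() time out on a descriptor that can never become readable, then looks at whether the bit was cleared. The function name select_clears_on_timeout is made up for this sketch, and a pipe stands in for the SMTP socket.

```c
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

/* Returns 1 if select() cleared the read set on timeout, 0 if the bit
 * was left set, and -1 if the probe itself could not run as intended. */
int select_clears_on_timeout(void)
{
  int p[2];
  fd_set rfds;
  struct timeval tv;
  int res, cleared;

  if (pipe(p) == -1) return -1;   /* nothing is ever written to this pipe */

  FD_ZERO(&rfds);
  FD_SET(p[0], &rfds);
  tv.tv_sec = 1;
  tv.tv_usec = 0;

  res = select(p[0] + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  cleared = !FD_ISSET(p[0], &rfds);

  close(p[0]);
  close(p[1]);

  if (res != 0) return -1;        /* expected a timeout, got something else */
  return cleared;
}
```

A result of 1 only shows that select() behaves on an idle pipe on that machine. It says nothing about what happens on a TCP socket in whatever state the stuck connections were in, which is presumably why the result above is qualified with "at least some of the time".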
RE: qmail-remote (cry wolf?)
Mark,

How would I need to go about building a debug version of qmail-remote? Also, how to terminate the process so that I can 'fling' gdb at it?

With a little I can probably have output from gdb within a couple hours.

--
Troy Settle
Pulaski Networks
540.994.4254

** -Original Message-
** From: Mark Jefferys [mailto:[EMAIL PROTECTED]]
** Sent: Monday, June 18, 2001 9:27 PM
** To: James R Grinter
** Cc: [EMAIL PROTECTED]
** Subject: Re: qmail-remote (cry wolf?)
**
** I also put a debugging version of qmail-remote on my system, so if it
** ever decides to hang again I can fling gdb at it.
**
** Mark
Re: qmail-remote (cry wolf?)
On Mon, Jun 18, 2001 at 11:20:36PM -0400, Troy Settle wrote:

% How would I need to go about building a debug version of qmail-remote?

I set conf-cc and conf-ld to 'gcc -g', edited timeoutread.c slightly to save the return value of the select in a variable, then built qmail-remote and put it in place of the live one. I'll attach a patch matching what I did to timeoutread.c.

% Also, how to terminate the process so that I can 'fling' gdb at it?

I wasn't planning on terminating it. Rather I was thinking of using gdb's attach command to take over the process, and then start examining variables. Mostly, I was going to wing it.

I expect the full attachment sequence to look something like this:

  (gdb) attach pid-of-stuck-qmail-remote
  (gdb) symbol-file /var/qmail/bin/qmail-remote
  (gdb) directory path-to-qmail-source-with-modified-timeoutread.c
  (gdb) bt
  (gdb) up     -- repeat until at timeoutread() stack frame
  (gdb) p res
  (gdb) p fd
  (gdb) p rfds

-- or something like that

% With a little I can probably have output from gdb within a couple hours.

Good luck, then.

Mark

--- timeoutread.c Mon Jun 15 03:53:16 1998
+++ timeoutread.c Mon Jun 18 22:23:24 2001
@@ -7,6 +7,7 @@
 {
   fd_set rfds;
   struct timeval tv;
+  int res;
 
   tv.tv_sec = t;
   tv.tv_usec = 0;
@@ -14,7 +15,8 @@
   FD_ZERO(&rfds);
   FD_SET(fd,&rfds);
 
-  if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
+  res = select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv);
+  if (res == -1) return -1;
   if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);
 
   errno = error_timeout;
Re: qmail-remote (cry wolf?)
Dave Sill [EMAIL PROTECTED] writes:
> Three of the four are running Red Hat 6.2. That could simply be because 75% of qmail systems are running RH 6.2, though. :-)

I see this problem occasionally, with mail being sent from a Solaris 2.6 system. It frequently happens for mail to one particular ISP (freeserve.co.uk, aka Planet Online/Energis Squared), who run Exim on (I believe) Linux systems.

I suspect they're using something to load balance the TCP sessions, as repeatedly connecting to the two A records for their one MX record shows up several different system names in the 220 banners. This could be the cause of the TCP session never closing down, but it's clear that because we're in a read() we never try to send anything that might elicit a TCP reset.

> No word on which qmail patches, if any, were installed on these

Mine is stock qmail 1.03.

I kept meaning to get around to posting the evidence I collected here, so here (finally) it is:

Here's my example stuck qmail-remote, with a backtrace from gdb and also lsof output. Unfortunately I didn't keep truss output for this one. (I should point out that this output was collected on Jan 11th...)

  qmailr 4322 211 0 Nov 03 ? 0:00 qmail-remote oglaroon.freeserve.co.uk mark-thomas-owner-mt=oglaroon.freeserve.c

  # gdb /var/qmail/bin/qmail-remote 4322
  GNU gdb 4.18
  Copyright 1998 Free Software Foundation, Inc.
  GDB is free software, covered by the GNU General Public License, and you are
  welcome to change it and/or distribute copies of it under certain conditions.
  Type "show copying" to see the conditions.
  There is absolutely no warranty for GDB. Type "show warranty" for details.
  This GDB was configured as "sparc-sun-solaris2.6"...
  (no debugging symbols found)...
  Attaching to program `/var/qmail/bin/qmail-remote', process 4322
  Reading symbols from /usr/lib/libresolv.so.2...(no debugging symbols found)...done.
  Reading symbols from /usr/lib/libsocket.so.1...(no debugging symbols found)...done.
  Reading symbols from /usr/lib/libnsl.so.1...(no debugging symbols found)...done.
  Reading symbols from /usr/lib/libc.so.1...(no debugging symbols found)...done.
  Reading symbols from /usr/lib/libdl.so.1...(no debugging symbols found)...done.
  Reading symbols from /usr/lib/libmp.so.2...(no debugging symbols found)...done.
  Symbols already loaded for /usr/lib/libresolv.so.2
  Symbols already loaded for /usr/lib/libsocket.so.1
  Symbols already loaded for /usr/lib/libnsl.so.1
  Symbols already loaded for /usr/lib/libc.so.1
  Symbols already loaded for /usr/lib/libdl.so.1
  Symbols already loaded for /usr/lib/libmp.so.2
  0xef6386b8 in _read () from /usr/lib/libc.so.1
  (gdb) bt
  #0 0xef6386b8 in _read () from /usr/lib/libc.so.1
  #1 0x13c7c in timeoutread ()
  #2 0x12524 in saferead ()
  #3 0x160e0 in oneread ()
  #4 0x161a0 in substdio_feed ()
  #5 0x16290 in substdio_get ()
  #6 0x12594 in get ()
  #7 0x1261c in smtpcode ()
  #8 0x12938 in smtp ()
  #9 0x133b0 in main ()
  (gdb)

  # lsof -p 4322
  COMMAND    PID   USER   FD   TYPE DEVICE     SIZE/OFF NODE    NAME
  qmail-rem  4322  qmailr cwd  VDIR 85,14      512      328536  /var/qmail
  qmail-rem  4322  qmailr txt  VREG 85,14      63804    361948  /var/qmail/bin/qmail-remote
  qmail-rem  4322  qmailr txt  VREG 85,0       19304    30060   /usr/lib/libmp.so.2
  qmail-rem  4322  qmailr txt  VREG 85,0       1014088  30137   /usr/lib/libc.so.1
  qmail-rem  4322  qmailr txt  VREG 85,0       721916   32170   /usr/lib/libnsl.so.1
  qmail-rem  4322  qmailr txt  VREG 85,0       53656    30072   /usr/lib/libsocket.so.1
  qmail-rem  4322  qmailr txt  VREG 85,0       92952    30061   /usr/lib/libresolv.so.2
  qmail-rem  4322  qmailr txt  VREG 85,0       4280     30124   /usr/lib/libdl.so.1
  qmail-rem  4322  qmailr txt  VREG 85,0       166196   30030   /usr/lib/ld.so.1
  qmail-rem  4322  qmailr 0r   VREG 85,14      5021     150345  /var/qmail/queue/mess/17/150345
  qmail-rem  4322  qmailr 1u   FIFO 0xf7e0e144 0t0      1091718 PIPE->0xf7e0e0c0
  qmail-rem  4322  qmailr 2u   FIFO 0xf7e0e144 0t0      1091718 PIPE->0xf7e0e0c0
  qmail-rem  4322  qmailr 3u   inet 0xf77ec040 0t0      TCP     agent57.gbnet.net:59889->slb-mail-inG1.svr.pol.co.uk:smtp (ESTABLISHED)
Re: qmail-remote (cry wolf?)
I came across the following, which *might* explain some of these deadlocking problems: http://kt.zork.net/kernel-traffic/kt20010611_121.html#6 [Summary: Some systems leave the fd_sets alone when select times out.] If I read this right, timeoutconn/read/write (and anything else that uses select) have to check for a result of 0 explicitly to be completely portable. Even if an OS doesn't do this intentionally, it's quite easy to see someone forgetting to clear the fd_sets on a timeout by accident, so some defensive coding against the problem (explicitly checking for a result of 0) may be worthwhile. Or this may just be a red herring... Mark N.B. Although someone claimed to have seen a BSD man page reporting that it wouldn't clear the fd_sets on a timeout, I was unable to find any evidence of such a thing with Google. And at least one standard (Single UNIX Specification v2) has forbidden this kind of weirdness. P.S. And I just found one of these bloody hung qmail-remotes on one of my systems!@#$! Stuck in read of fd 3; directed at email.com (who clearly have no clue how to set up DNS records for email, and are down anyway). Redhat Linux kernel 2.2.19-6.2.1smp.
Re: qmail-remote (cry wolf?)
Mark Jefferys [EMAIL PROTECTED] writes:
> [Summary: Some systems leave the fd_sets alone when select times out.]
>
> Even if an OS doesn't do this intentionally, it's quite easy to see someone forgetting to clear the fd_sets on a timeout by accident, so some defensive coding against the problem (explicitly checking for a result of 0) may be worthwhile. Or this may just be a red herring...

I think it isn't relevant. qmail-remote doesn't seem to use select, or at least it's nowhere in the path where my qmail-remote wedges.

As to different OS behaviour, Solaris 2.6 (and 7) both say:

  C Library Functions                                   select(3C)

  On failure, the objects pointed to by the readfs, writefs, and errorfds arguments are not modified. If the timeout interval expires without the specified condition being true for any of the specified file descriptors, the objects pointed to by the readfs, writefs, and errorfds arguments have all bits set to 0.

whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:

  SELECT(2)                  SYSTEM CALLS                  SELECT(2)

  select() returns a non-negative value on success. A positive value indicates the number of ready descriptors in the descriptor sets. 0 indicates that the time limit referred to by timeout expired. On failure, select() returns -1, sets errno to indicate the error, and the descriptor sets are not changed.

and I can tell you that I've not seen the problem happen with qmail-remote on SunOS 4.1.4. Indeed, I think DJB's code (and most other people's) compensates for both behaviours by setting the necessary FD's each time anyway.

> N.B. Although someone claimed to have seen a BSD man page reporting that it wouldn't clear the fd_sets on a timeout, I was unable to find

See above!

James.
Re: qmail-remote (cry wolf?)
I wanted to let the list know something about this topic. Sorry if this has been covered, but I just started following the list again because I'm having this same problem.

I'm running qmail on Redhat 6.x with a 2.2.12-20smp kernel. It has been running for well over a year (I don't remember the exact date I installed everything). But, over the last couple of months, I started getting a few complaints of emails not being sent out in a timely manner. Every time I received the complaint, I checked my processes with 'ps ax | grep qmail' and lo and behold there would be 20 or so copies of qmail-remote that were sitting doing nothing. In fact, they had been sitting doing nothing for so long, they were swapped out.

The first few times, I just killed the qmail-remote processes and watched as qmail again started sending messages across the Internet. After a while though, I noticed a pattern. Every time I killed all the stuck qmail-remote processes, almost all the mail going out via the newly revived qmail-remote processes was directed to [EMAIL PROTECTED] (XX being a number in the range of 01 to about 37 or so).

After some further investigation, I found that all of the messages going to edirectnetwork.net were bounce messages stating that the user they were trying to send mail to no longer existed. So, I sent an email to [EMAIL PROTECTED] to try to explain the situation to her/him and offered to help solve the problem, as she/he was needlessly wasting both their and my bandwidth. After a few days and no response whatsoever, I simply found the IP address for every optXX.edirectnetwork.net server and added it to my tcpserver rules to be blocked so that they could no longer send mail to my server.

Here's the good part. I set up the block over 48 hours ago, and have since not had a single qmail-remote process lock up. Before I made the change, I would normally have at least two to three stuck qmail-remote processes in that same amount of time.
I don't know how and/or if the two are even related. I hope this helps. Again, I apologize if this has already been covered. Eric Calvert Caveland Connection
RE: qmail-remote (cry wolf?)
** -Original Message-
** From: Mark [mailto:[EMAIL PROTECTED]]
** Sent: Friday, June 08, 2001 11:14 AM
** To: [EMAIL PROTECTED]
** Subject: Re: qmail-remote (cry wolf?)
**
** > processed those 1500 messages in less than 30 minutes. However, it left
** > behind another handful of stuck qmail-remote processes. Other messages
** > were undeliverable and left in the queue, and still others were sent back to
** > sender with permanent errors.
**
** What do you mean by stuck? Do you mean they *never* go away - even
** after a day or two? As others have pointed out, a slow delivery can
** take a long, long time. That's not necessarily a problem, that's just
** the way it is.

Yes, I've had qmail-remote processes sit there for weeks. I think that instead of killing them off wholesale, I'll pick one or two processes and see just how long they'll hang around. I'll post weekly updates if there's any interest.

I keep hearing that it might be a very slow delivery. How is this possible when there isn't any network connection open to the remote host in question, let alone a connection to its SMTP port?

As far as I can tell, this is a problem between qmail-remote and the kernel. This is happening on multiple operating systems, so that leads me to believe that this is not an OS bug.

** To find out a bit more about what a stuck qmail-remote is doing, you
** may want to ktrace it and show us the output. Find the process id of the
** stuck qmail-remote and then as root go: ktrace -p thepid
**
** Leave that running for at least an hour and show us the output. Yes, I
** mean at least an hour.

Ok, I meant to come back in an hour and stop the trace, but after running ktrace for 9 hours (while I slept), the resulting ktrace.out file is exactly 0 bytes in length. Would you like me to send a copy? <g>

I did verify the behavior of ktrace, and a ktrace on qmail-send generated tons of data within seconds. ktrace is working.

Anything else y'all would like me to look at?

--
Troy Settle
Pulaski Networks
540.994.4254
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 06:32:55AM -0400, Troy Settle wrote:
> Yes, I've had qmail-remote processes sit there for weeks. I think that instead of killing them off wholesale, I'll pick one or two processes and see just how long they'll hang around. I'll post weekly updates if there's any interest.

Here is what I have on one of my mail servers (ps -waux|grep qmail-remote, real email addresses removed, domain names are left. I only left user, pids, state, date, and prog name on the output for readability purposes):

  qmailr  7365 S May19 qmail-remote iname.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 14602 S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 25415 S May19 qmail-remote careful.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 25875 S May19 qmail-remote programmer.net [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 25902 S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr   852 S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 20283 S May25 qmail-remote ziplip.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 29814 S May18 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 25877 S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 25145 S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 27208 S Jun08 qmail-remote hp.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 27070 S Jun08 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 11525 S Jun08 qmail-remote best-service.com [EMAIL PROTECTED] [EMAIL PROTECTED]
  qmailr 13766 S Jun08 qmail-remote mad.scientist.com [EMAIL PROTECTED] [EMAIL PROTECTED]

As you can see, processes running since May 19th cannot possibly be explained by slow delivery -- 20 days is just too much.

The following domains go through outblaze.com mail servers:

  iname.com
  mail.com
  careful.com
  programmer.net
  best-service.com
  mad.scientist.com

The following domains do not go through outblaze:

  ziplip.com
  hp.com

Unfortunately, I cannot explain this situation by blaming everything on outblaze.

--
Eugene Miretskiy [EMAIL PROTECTED]
InVision.com, INC. (631) 543-1000
www.invision.net / www.longisland.com
Re: qmail-remote (cry wolf?)
I think we may have red-herringed on the OS thing -- if RH6.2, as deployed, had this sort of problem, I think we would have run across it before this, no? The inclusion of a FreeBSD-4.2-STABLE in the mix seems to nix a RH-specific bug as well (although it obviously does not rule it out entirely*).

Perhaps we're overlooking some other, more subtle commonality between these four setups? Could at least two of the OPs please detail (for me, if not for the list, at least) the devices that sit between the NIC of the host in question and the Big Bad Internet? Routers, hubs, transparent firewalls, everything?

*I highly recommend that the FreeBSD-4.2-STABLE user at least upgrade to 4.3R -- I'm not sure at which point in 4.2-STABLE you froze your local tree, but a whole bunch of fixes made it into 4.3, and it's been running great for me.

--
Greg White
Re: qmail-remote (cry wolf?)
> > Is it possible that some external devices s.a. switch/router/firewall/anything could be causing this problem?
>
> Yes, very possible. Some firewalls do transparent SMTP or POP proxying, and there have been many bugs in such implementations.

No. Regardless of what the other end does, a conforming OS should not wedge qmail-remote forever. Why do people keep suggesting this?

You have three choices:

1. Show the bug in the code containing the select() and read()

2. Show an interpretation error regarding the semantics of read() and select()

If you can do either of those, we can conclude that qmail-remote is coded incorrectly and needs fixing. If you can do neither of these, then this leaves you with the inescapable conclusion that qmail-remote *is* playing by the rules, in which case you are left with the only alternative:

3. The other side of the C code is not playing by the rules: i.e. a bug in the compiler, libraries or OS.

I will note that no one has done 1 or 2 yet, so that leaves 3.

Regards.
Re: qmail-remote (cry wolf?)
On Fri, Jun 08, 2001 at 08:11:21PM -0400, Yevgeniy Miretskiy allegedly wrote:
> On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote:
> > Then it's an OS bug. qmail-remote only gets to the read() if the OS (via select() ) says that the read will not block. Ergo, the OS is lying.
>
> If it's OS bug, anybody heard/knows of such severe network related bug in RedHat 6.2? What about FreeBSD 4.2 (I believe somebody reported problem with FreeBSD as well)??? What are the chances of _such_ bug in _both_ OSes?
>
> I'd like to mention, that I ran qmail of FreeBSD (starting from 3.x all the way to latest) for couple years and _never_ observed this behaviour on FreeBSD.

I ran it on Solaris 2.5/2.6 for years and did experience this sort of behaviour. It went away on 2.8. So what? No one has shown that qmail-remote is doing anything wrong. If it's not doing anything wrong, then maybe the problem is somewhere else?

Conversely, every reading of the code in question suggests that qmail-remote is doing everything right. The fact that this problem occurs on at least two OSes simply suggests to me that the TCP/IP interaction is a boundary condition, perhaps triggered by distant connections and perhaps also by an uncommon remote TCP/IP stack.

Regardless of which, if an OS reneges on the fd-will-not-block promise, then it can *only* be an OS bug.

Regards.
Re: qmail-remote (cry wolf?)
> As far as I can tell, this is a problem between qmail-remote and the kernel.

Correct.

> This is happening on multiple operating systems, so that leads me to believe that this is not an OS bug.

But many OSes share TCP/IP implementations or mis-interpretations of the protocol. Many coders of TCP/IP stacks look at other implementations to work out what to do. There is a *lot* of commonality between OSes in this regard. E.g., the Linux crowd and the FreeBSD crowd regularly refer to each other's implementations to decide how to do something (or not do something, as the case may be).

> ** To find out a bit more about what a stuck qmail-remote is doing, you
> ** may want to ktrace it and show us the output. Find the process id of the
> ** stuck qmail-remote and then as root go: ktrace -p thepid
> **
> ** Leave that running for at least an hour and show us the output. Yes, I
> ** mean at least an hour.
>
> Ok, I meant to come back in an hour and stop the trace, but after running ktrace for 9 hours (while I slept), the resulting ktrace.out file is exactly 0 bytes in length. Would you like me to send a copy? <g>

It's a bummer that ktrace is like that on FreeBSD. It doesn't show the *current* system call that the process is sitting on. Conversely, truss on Solaris does this nicely...

You can conclude though that qmail-remote wasn't sitting on the select(), as that has a timeout and the trace should then show the system calls associated with the reading loop. If it's not sitting on the select(), what is it sitting on? If it's the read(), well, how could that be if select() said the read would not block?

Regards.
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 09:05:00AM -0700, Greg White allegedly wrote:
> I think we may have red-herringed on the OS thing -- if RH6.2, as deployed, had this sort of problem, I think we would have run across it before this, no? The inclusion of a FreeBSD-4.2-STABLE in the mix seems to nix a RH specific bug as well (althought it obviously does not rule it out entirely*). Perhaps we're overlooking some other, more subtle commonality between these four setups?

Indeed. Using commonality to solve a problem is a fine technique. However the underlying assumption is that it is a single problem that is being solved here. We have no certainty of that; all we do have is a single *symptom* - qmail-remote wedges on some systems, on some occasions.

If it is a single problem, here are some commonalities that might be explored:

1. Bug in qmail-remote
2. Common compiler (think optimization error)
3. Common clib error (think semantic error or bug)
4. Common OS (think semantic error or bug)
5. Common TCP/IP stack
6. Common network interface code (perhaps all derived from a vendor reference implementation)

All of which *may* only be triggered by a certain set of TCP/IP events initiated from the peer end. Indeed the peer may be an uncommon OS/TCP/IP combo, which reduces the occurrence of this problem to isolated situations.

And you can be very certain that this is a very, very rare event. Just consider how many invocations of qmail-remote have successfully completed in the last 3 years on many, many thousands of OSes in many thousands of locations around the world.

What does that mean? It's probably a tough problem to nail down without access to the interaction history between all of the above components.

Regards.
Re: qmail-remote (cry wolf?)
Greg White writes: I think we may have red-herringed on the OS thing -- if RH6.2, as deployed, had this sort of problem, I think we would have run across it before this, no? Hmmm I wonder. I could do a denial of service attack on qmail-remote by receiving email very, very slowly, and by sending email to a server which is guaranteed to be received and guaranteed to bounce. qmail doesn't keep track of very slow hosts, but only hosts that time out. -- -russ nelson [EMAIL PROTECTED] http://russnelson.com Crynwr sells support for free software | PGPok | 521 Pleasant Valley Rd. | +1 315 268 1925 voice | John Hartford, RIP Potsdam, NY 13676-3213 | +1 315 268 9201 FAX |
Re: qmail-remote (cry wolf?)
Russell Nelson [EMAIL PROTECTED] wrote: Hmmm I wonder. I could do a denial of service attack on qmail-remote by receiving email very, very slowly, and by sending email to a server which is guaranteed to be received and guaranteed to bounce. qmail doesn't keep track of very slow hosts, but only hosts that time out. I've been thinking along the same lines. qmail-smtpd would seem to also be vulnerable to this (not that this is djb's fault). Lowering timeoutremote and timeoutsmtpd from their defaults of 1200 would help against this problem cropping up due to genuinely slow servers, but not against a deliberate attack (send one byte every ten minutes, or two minutes, or whatever, tying up a qmail-smtpd process for an indefinite period). Perhaps something like a maxlifetime control file for qmail-remote and qmail-smtpd? At process startup, set an alarm for X seconds -- if the ALRM is received, abort the connection as gracefully as possible (i.e. try to send RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit regardless of whether these attempts to quit gracefully are successful or not. It doesn't sound too complicated. Anybody see any major issues with this? Charles -- --- Charles Cazabon[EMAIL PROTECTED] GPL'ed software available at: http://www.qcc.sk.ca/~charlesc/software/ Any opinions expressed are just that -- my opinions. ---
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 03:11:59PM -0400, Russell Nelson allegedly wrote:
> Greg White writes:
> > I think we may have red-herringed on the OS thing -- if RH6.2, as deployed, had this sort of problem, I think we would have run across it before this, no?
>
> Hmmm I wonder. I could do a denial of service attack on qmail-remote by receiving email very, very slowly,

You'd also have to set the TCP/IP receive window size down, otherwise you may think you're only receiving at one byte every, say, 20 minutes, but in fact your TCP/IP stack got a window full of data at one time.

But yes, you could slow it down considerably, and if you got to the extreme limit of 20 minutes per byte, a 1M email would take close to 40 years (about a million bytes at 20 minutes each)...

It sure is the case that some sort of gross delivery timer makes sense, and it would work around the problem that initiated this thread...

> and by sending email to a server which is guaranteed to be received and guaranteed to bounce. qmail doesn't keep track of very slow hosts, but only hosts that time out.

Of course it has to be your server that accepts the traffic slowly, so it's a DOS on yourself at the same time. Not the typical MO for a successful DOS.

This is only a proof of concept, but to implement a gross timer, you might use this program as a wrapper to qmail-remote (which of course you move to qmail-remote.real):

  #include <unistd.h>

  int main(int argc, char **argv)
  {
    alarm(5*60*60); /* Max of five hours for a remote delivery */
    execv("/var/qmail/bin/qmail-remote.real", argv);
    _exit(1); /* reached only if execv() failed */
  }

This wrapper gives qmail-remote an arbitrary 5 hours to make a delivery, at which point qmail-remote gets a SIGALRM. The pending alarm survives the execv(), and since qmail-remote happens not to have registered a handler for SIGALRM, the OS takes the default action, which is to terminate the process.

Regards.
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote: It's a bummer that ktrace is like that on FreeBSD. It doesn't show the *current* system call that the process is sitting on. Conversely, truss on Solaris does this nicely... But FreeBSD does have a (procfs-based) truss. -- Jos Backus [EMAIL PROTECTED] Modularity is not a hack. -- D. J. Bernstein use Std::Disclaimer;
Re: qmail-remote (cry wolf?)
Perhaps something like a maxlifetime control file for qmail-remote and (Serendipity strikes again - I just posted sample code for this). qmail-smtpd? At process startup, set an alarm for X seconds -- if the ALRM is received, abort the connection as gracefully as possible (i.e. try to send RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit regardless of whether these attempts to quit gracefully are successful or not. Why bother with a graceful exit? You'd have to set yet another alarm for the (likely) case that your graceful exit fails. That seems like unnecessary complexity for a connection that is almost certainly dead. Regards.
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote: On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote: It's a bummer that ktrace is like that on FreeBSD. It doesn't show the *current* system call that the process is sitting on. Conversely, truss on Solaris does this nicely... But FreeBSD does have a (procfs-based) truss. Right. But it suffers from the same problem that ktrace does in that it starts with the next system call, not the current one. Leastwise it does on a 4.3 I have access to, do you get something different? Regards.
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 08:12:03PM +, Mark wrote: On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote: On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote: It's a bummer that ktrace is like that on FreeBSD. It doesn't show the *current* system call that the process is sitting on. Conversely, truss on Solaris does this nicely... But FreeBSD does have a (procfs-based) truss. Right. But it suffers from the same problem that ktrace does in that it starts with the next system call, not the current one. Leastwise it does on a 4.3 I have access to, do you get something different? I get what you get :) Greetz, Peter -- Against Free Sex! http://www.dataloss.nl/Megahard_en.html
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 01:54:37PM -0600, Charles Cazabon wrote: Perhaps something like a maxlifetime control file for qmail-remote and qmail-smtpd? At process startup, set an alarm for X seconds -- if the ALRM is received, abort the connection as gracefully as possible (i.e. try to send RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit regardless of whether these attempts to quit gracefully are successful or not. It doesn't sound too complicated. Anybody see any major issues with this?

No, but this may be more complicated than needed. I've been using the attached program on one of my machines for years (the machine was behind a dialup line and it was definitely not funny to sponsor the Deutsche Telekom just because the other end of an SMTP connection was slow). It worked in its crude form. It leaves a hole open which could lead to duplicate transfers. Regards, Uwe

-- setalarm.c ---

    #include <signal.h>
    #include <unistd.h>
    #include <errno.h>
    #include <stdlib.h>
    #include <stdio.h>

    static void die_usage(void)
    {
      fputs("usage: setalarm SECONDS program [arguments ...]\n",stderr);
      exit(111);
    }
    static void die_parse(void)
    {
      fputs("setalarm: fatal: failed to parse the number of seconds\n",stderr);
      exit(111);
    }
    static void die_exec(char *x)
    {
      int e=errno;
      fputs("setalarm: fatal: failed to execute ",stderr);
      errno=e; /* fputs may have changed errno */
      perror(x);
      exit(111);
    }

    int main(int argc, char **argv)
    {
      unsigned long ul;
      char *ep;

      if (argc<3) die_usage();
      errno=0;
      ul=strtoul(argv[1],&ep,10);
      if (*ep || !argv[1][0] || errno==ERANGE) die_parse();
      signal(SIGALRM,SIG_DFL); /* make sure SIGALRM is not inherited as ignored */
      alarm(ul);
      execvp(argv[2],argv+2);
      die_exec(argv[2]);
      return 111; /* not reached */
    }
Re: qmail-remote (cry wolf?)
On Sat, Jun 09, 2001 at 08:11:41PM +, Mark wrote: On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote: But FreeBSD does have a (procfs-based) truss. Right. But it suffers from the same problem that ktrace does in that it starts with the next system call, not the current one. Leastwise it does on a 4.3 I have access to, do you get something different? Nope (I'm still at PRE_SMPNG, waiting for -current to stabilize (hah!)). One idea would be to run the process under truss, and pipe the truss output through multilog, providing one with a syscall activity history without the danger of filling up partitions (as would likely happen when using ktrace). -- Jos Backus [EMAIL PROTECTED] Modularity is not a hack. -- D. J. Bernstein use Std::Disclaimer;
Re: qmail-remote (cry wolf?)
On Thu, Jun 07, 2001 at 10:43:27PM +0300, Mike Jackson wrote: What are the probabilities of the Sendmail server being the one causing the problems? What if the mail admin of mg.hk5.outblaze.com has used some sort of patch that is causing qmail-remote's to hang? Which means it might be exploitable as a DoS... There's been similar problems with mail.com (hosted by outblaze) as well. I still haven't been able to manually connect to any of their servers. It seems as they are under heavy load according to an apology at their home page. Jörgen
RE: qmail-remote (cry wolf?)
Not being a programmer, I have no clue how to trace this, but if someone were able to help me, I'd be glad to give it a go. I'm on FreeBSD 4.2-STABLE at the moment, and will be updating again soon. Qmail is built with patches, a concurrency patch and the patches from the FreeBSD port. qmail-remote itself was not patched from 1.03. What I'm seeing on these stuck processes, is that they're in a state of 'sbwait' (as shown by top). netstat doesn't show any open connections to the remote hosts (smtp or otherwise). This problem doesn't seem to be related to the remote host, no matter the MTA. I've seen several stuck qmail-remote processes to a certain host, but scanning through logs shows that mail has been successfully sent to that same host on multiple occasions, both prior and after the stuck process was launched. This doesn't seem to be a networking problem. On one occasion, I had over 1500 messages queued up because the number of stuck qmail-remote processes ate up my concurrency limit. After clearing up the blockage, the box processed those 1500 messages in less than 30 minutes. However, it left behind another handful of stuck qmail-remote processes. Other messages were undeliverable and left in the queue, and still others were sent back to sender with permanent errors. Logs are intact. There's a start of delivery entry, but if qmail-remote gets stuck, there is no further reference to those messages. Yes, I can read the messages in the queue. They are intact and appear to be properly formatted. There is no proxy server or firewall between this box and the rest of the Internet. Only a Cisco 2924 switch, a 3640 router and a T1 ride out to ATT or Sprint. I hope all this information helps. Anyone should feel free to ask for more details, but please be specific in the information you need. Remember, a lot of us here are admins, not developers.
-- Troy Settle Pulaski Networks 540.994.4254

-Original Message- From: Mark [mailto:[EMAIL PROTECTED]] Sent: Thursday, June 07, 2001 6:00 PM To: [EMAIL PROTECTED] Subject: Re: qmail-remote (cry wolf?)

What are the probabilities of the Sendmail server being the one causing the problems? What if the mail admin of mg.hk5.outblaze.com has used some sort of patch that is causing qmail-remote's to hang? Has anyone communicated with outblaze.com's postmaster?

There is nothing a remote system can do that will hang qmail-remote on a correctly functioning OS. If the local TCP stack has accepted data and indicated available via the select() return, then the remote system has no further say as the read() only fetches the data previously received.

I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be that the local TCP stack - in some circumstance - discards unread data, *then* marks the local socket as unreadable, rather than around the other way. That sort of window would wedge the select/read sequence in qmail-remote. Regards.
Re: qmail-remote (cry wolf?)
processed those 1500 messages in less than 30 minutes. However, it left behind another handfull of stuck qmail-remote processes. Other messages were undeliverable and left in the queue, and still others were sent back to sender with permanent errors. What do you mean by stuck? Do you mean they *never* go away - even after a day or two? As others have pointed out, a slow delivery can take a long, long time. That's not necessarily a problem, that's just the way it is. To find out a bit more about what a stuck qmail-remote is doing, you may want to ktrace it and show us the output. Find the process id of the stuck qmail-remote and then as root go: ktrace -p thepid Leave that running for at least an hour and show us the output. Yes, I mean at least an hour. Regards.
Re: qmail-remote (cry wolf?)
One more time, I did tcpdump and strace on stuck qmail-remote for over an hour. strace shows that qmail-remote is stuck on: 'read(3', and tcpdump shows that nothing comes in. On Fri, Jun 08, 2001 at 03:13:54PM +, Mark wrote: processed those 1500 messages in less than 30 minutes. However, it left behind another handfull of stuck qmail-remote processes. Other messages were undeliverable and left in the queue, and still others were sent back to sender with permanent errors. What do you mean by stuck? Do you mean they *never* go away - even after a day or two? As others have pointed out, a slow delivery can take a long, long time. That's not necessarily a problem, that's just the way it is. To find out a bit more about what a stuck qmail-remote is doing, you may want to ktrace it and show us the output. Find the process id of the stuck qmail-remote and then as root go: ktrace -p thepid Leave that running for at least an hour and show us the output. Yes, I mean at least an hour. Regards. -- Eugene Miretskiy [EMAIL PROTECTED] InVision.com, INC. (631) 543-1000 www.invision.net / www.longisland.com
Re: qmail-remote (cry wolf?)
On Fri, Jun 08, 2001 at 03:51:18PM -0400, Yevgeniy Miretskiy allegedly wrote: One more time, I did tcpdump and strace on stuck qmail-remote for over an hour. strace shows that qmail-remote is stuck on: 'read(3', and tcpdump shows that nothing comes in. One more time. Then it's an OS bug. qmail-remote only gets to the read() if the OS (via select() ) says that the read will not block. Ergo, the OS is lying. Regards.
Re: qmail-remote (cry wolf?)
On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote: Then it's an OS bug. qmail-remote only gets to the read() if the OS (via select() ) says that the read will not block. Ergo, the OS is lying. If it's an OS bug, has anybody heard of such a severe network-related bug in RedHat 6.2? What about FreeBSD 4.2 (I believe somebody reported a problem with FreeBSD as well)? What are the chances of _such_ a bug in _both_ OSes? I'd like to mention that I ran qmail on FreeBSD (starting from 3.x all the way to the latest) for a couple of years and _never_ observed this behaviour on FreeBSD. Is it possible that some external device such as a switch/router/firewall/anything could be causing this problem? -- Eugene Miretskiy [EMAIL PROTECTED] InVision.com, INC. (631) 543-1000 www.invision.net / www.longisland.com
Re: qmail-remote (cry wolf?)
Yevgeniy Miretskiy [EMAIL PROTECTED] wrote: On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote: Then it's an OS bug. qmail-remote only gets to the read() if the OS (via select() ) says that the read will not block. Ergo, the OS is lying. If it's OS bug, anybody heard/knows of such severe network related bug in RedHat 6.2? Many, especially with earlier kernels. Upgrade to 2.2.19-6.2.1 if you haven't already. 2.2.14 in particular was a nasty one, at least as shipped by RedHat. And no, I'm not trolling -- I use RedHat myself. What are the chances of _such_ bug in _both_ OSes? Coincidences happen. Is it possible that some external devices s.a. switch/router/firewall/anything could be causing this problem? Yes, very possible. Some firewalls do transparent SMTP or POP proxying, and there have been many bugs in such implementations. Charles -- --- Charles Cazabon [EMAIL PROTECTED] GPL'ed software available at: http://www.qcc.sk.ca/~charlesc/software/ Any opinions expressed are just that -- my opinions. ---
qmail-remote (cry wolf?)
Sorry, but I'm not all comfortable with this... There's been 4 similar reports of qmail-remote not behaving properly to this list during the last month. http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html We still haven't been able to help any of them... This doesn't look like a coincidence to me since two of the reports concerned the same recipient server (outblaze.com). Unfortunately it seems related to network programming, which I know very little about. Any other thoughts about this? Jörgen
Re: qmail-remote (cry wolf?)
On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson allegedly wrote: Sorry, but I'm not all comfortable with this... There's been 4 similar reports of qmail-remote not behaving properly to this list during the last month. http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html We still haven't been able to help any of them... This doesn't look like a coincidence to me since two of the reports concerned the same recipient server (outblaze.com). Unfortunately it seems related to network programming, which I know very little about. Any other thoughts about this?

If it's an unpatched qmail-remote, then remain suspicious of some OS bug. I spent a long time looking at qmail-remote when a similar problem occurred on a Solaris 2.5 system (or maybe 2.6, I forget now). Here are the two lines of code:

    if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
    if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);

That's about as simple as you can get! I don't see any way that the read() call will occur without select() returning the fdset bit. So, if select() says that a read can occur, then the only reason that the read() can then block is if the OS is lying... So, it's kinda hard to see a problem with qmail-remote here. Do OSes ever get it wrong? Sure. If this is a relatively widespread problem, then you might want to put an alarm() handler into qmail-remote, but if you can't rely on the OS, all bets are really off, right? Regards.
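As discussed elsewhere in this thread, the two lines quoted above depend on select() clearing the descriptor set when it times out. A defensive variant, sketched below, tests the return value of select() instead, so a return of 0 is treated as a timeout regardless of what is left in the set. This is only an illustration: timeoutread_checked is a hypothetical name, not the function qmail ships, and ETIMEDOUT stands in for qmail's error_timeout.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of timeoutread(): wait up to t seconds for fd
   to become readable, then read. Returns the byte count, or -1 with
   errno set (ETIMEDOUT on timeout). */
ssize_t timeoutread_checked(int t, int fd, char *buf, size_t len)
{
  fd_set rfds;
  struct timeval tv;
  int r;

  tv.tv_sec = t;
  tv.tv_usec = 0;
  FD_ZERO(&rfds);
  FD_SET(fd, &rfds);

  r = select(fd + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  if (r == -1) return -1;  /* errno already set by select() */
  if (r == 0) {            /* timed out: do not consult rfds at all */
    errno = ETIMEDOUT;
    return -1;
  }
  return read(fd, buf, len);
}
```

This does not help if the OS reports a descriptor as readable and the read() then blocks anyway; it only removes the dependency on how the sets look after a timeout.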
Re: qmail-remote (cry wolf?)
What are the probabilities of the Sendmail server being the one causing the problems? What if the mail admin of mg.hk5.outblaze.com has used some sort of patch that is causing qmail-remote's to hang? Has anyone communicated with outblaze.com's postmaster? There is nothing a remote system can do that will hang qmail-remote on a correctly functioning OS. If the local TCP stack has accepted data and indicated available via the select() return, then the remote system has no further say as the read() only fetches the data previously received. I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be that the local TCP stack - in some circumstance - discards unread data, *then* marks the local socket as unreadable, rather than around the other way. That sort of window would wedge the select/read sequence in qmail-remote. Regards.
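If the OS really can report a socket as readable and then block in read(), one common way to keep a process from wedging is to put the descriptor in non-blocking mode, so that read() fails with EAGAIN instead of sleeping. The sketch below is an assumption about how that might be done, not something stock qmail-remote does; set_nonblocking is a hypothetical helper name.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to non-blocking mode.
   Returns 0 on success, -1 on failure (errno set by fcntl). */
int set_nonblocking(int fd)
{
  int flags = fcntl(fd, F_GETFL, 0);

  if (flags == -1) return -1;
  if (fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1) return -1;
  return 0;
}
```

A caller using this would have to treat a read() that returns -1 with errno set to EAGAIN (or EWOULDBLOCK) as "nothing there after all" and go back to select(), which is extra logic the original two-line sequence does not need on a correctly functioning OS.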
Re: qmail-remote (cry wolf?)
On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson wrote: http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html redhat 6.2, outblaze.com http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html redhat 6.2, outblaze.com http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html FreeBSD 4.something, not outblaze.com. Some of the hosts are just unreachable. http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html redhat 6.2, outblaze.com We still haven't been able to help any of them... Somebody with rh-6.2 / linux-2.2.14 might want to look into this, though i'd recommend a kernel upgrade. There have been a number of networking problems in the linux kernel, and some of them were quite awful and hard to trigger. I remember that i downgraded my home server to 2.2.13 after 2.2.14 broke my rsync-over-ssh or network tar. Regards, Uwe
Re: qmail-remote (cry wolf?)
Jörgen Persson [EMAIL PROTECTED] wrote: There's been 4 similar reports of qmail-remote not behaving properly to this list during the last month. http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html We still haven't been able to help any of them... This doesn't look like a coincidence to me since two of the reports concerned the same recipient server (outblaze.com). Unfortunately it seems related to network programming, which I know very little about. Any other thoughts about this? Three of the four are running Red Hat 6.2. That could simply be because 75% of qmail systems are running RH 6.2, though. :-) No word on which qmail patches, if any, were installed on these systems. -Dave
Re: qmail-remote (cry wolf?)
On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson wrote: Sorry, but I'm not all comfortable with this... There's been 4 similar reports of qmail-remote not behaving properly to this list during the last month. http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html We still haven't been able to help any of them... Could Neil Kandalgaonkar, Eric Wang, Troy Settle, and Yevgeniy Miretskiy perhaps get together and compare notes? Do you all share an OS (I noticed that two posters appeared to mention RH6.2 -- is this the case for all)? Is there another factor that you all share? (I do note that geography does not appear to be a factor)... This information could allow us to get somewhere. If needed, I'm willing to create a mini-list ala .qmail-something to address all four of the OPs. This doesn't look like a coincidence to me since two of the reports concerned the same recipient server (outblaze.com). Unfortunately it seems related to network programming, which I know very little about. It's really tough to even know what to look at at this point... As soon as I saw that outblaze was in HK, I thought of geographical/routing issues, but none of the posters seems to share common geography. Hmmm... -- Greg White
Re: qmail-remote (cry wolf?)
Jörgen Persson wrote: Sorry, but I'm not all comfortable with this... There's been 4 similar reports of qmail-remote not behaving properly to this list during the last month. We still haven't been able to help any of them... This doesn't look like a coincidence to me since two of the reports concerned the same recipient server (outblaze.com). Unfortunately it seems related to network programming, which I know very little about. Any other thoughts about this? Jörgen

Hi, Just a little investigation.

    $ nslookup
    set type=mx
    outblaze.com
    outblaze.com preference = 20, mail exchanger = mg.hk5.outblaze.com
    outblaze.com preference = 10, mail exchanger = spf1.hq.outblaze.com

I was curious if they both ran the same MTA, so I checked it out.

    $ telnet spf1.hq.outblaze.com 25
    Trying 202.77.223.28...
    Connected to spf1.hq.outblaze.com.
    Escape character is '^]'.
    220 spf1.hq.outblaze.com ESMTP Postfix

    $ telnet mg.hk5.outblaze.com 25
    Trying 202.123.209.152...
    Connected to mg.hk5.outblaze.com.
    Escape character is '^]'.
    220 mg.hk5.outblaze.com ESMTP Sendmail 8.11.2/8.11.2; Thu, 7 Jun 2001 19:26:17 GMT

What are the probabilities of the Sendmail server being the one causing the problems? What if the mail admin of mg.hk5.outblaze.com has used some sort of patch that is causing qmail-remote's to hang? Has anyone communicated with outblaze.com's postmaster? -- Mike
Re: qmail-remote (cry wolf?)
Mark et. al. - It *is* possible, though, for qmail-remote to move slowly enough that it appears to hang (yes, even for hours or days). timeoutremote applies to every read() and write() - in the very worst case, each of these system calls might move only a single byte. Consider a 5000 byte message and a timeoutremote set to 1200 seconds (the default). The worst case just for sending the data alone - not including smtp overhead and reading responses from the remote server - is almost 70 days (1200 * 5000 / 60 / 60 / 24 ~= 69.44). Granted, this is extremely unlikely, but you get the idea - some scenarios can cause qmail-remote to move extraordinarily slowly, while still functioning correctly - that is, within the limits imposed by timeoutremote. IMO, the best thing to look at from the people having this problem would be the output from whatever your system call tracing software is (ktrace for FreeBSD, truss for Solaris, strace for Linux...), run on the offending qmail-remote process. If there is no output for over 'timeoutremote' seconds, there's almost certainly a TCP stack bug; otherwise, I'd tend to blame the problem described above. Thanks, David Lowe On 7 Jun 2001, Mark wrote: What are the probabilities of the Sendmail server being the one causing the problems? What if the mail admin of mg.hk5.outblaze.com has used some sort of patch that is causing qmail-remote's to hang? Has anyone communicated with outblaze.com's postmaster? There is nothing a remote system can do that will hang qmail-remote on a correctly functioning OS. If the local TCP stack has accepted data and indicated available via the select() return, then the remote system has no further say as the read() only fetches the data previously received. I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be that the local TCP stack - in some circumstance - discards unread data, *then* marks the local socket as unreadable, rather than around the other way. 
That sort of window would wedge the select/read sequence in qmail-remote. Regards.
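The worst-case figure in David Lowe's message (1200 * 5000 / 60 / 60 / 24) can be reproduced with a few lines of C. This is just the arithmetic from that message wrapped in a function; worst_case_days is a hypothetical name, and the model (one byte per system call, each call taking the full timeout) is the one he describes, ignoring SMTP overhead.

```c
/* Worst-case transfer time in days, assuming every read()/write()
   moves a single byte and each call takes the full timeout. */
double worst_case_days(double bytes, double timeout_seconds)
{
  return bytes * timeout_seconds / 60.0 / 60.0 / 24.0;
}
```

Calling worst_case_days(5000, 1200) gives roughly 69.44, matching the "almost 70 days" in the message.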
Re: qmail-remote (cry wolf?)
On Thu, Jun 07, 2001 at 04:39:25PM -0700, David Lowe allegedly wrote: Mark et. al. - It *is* possible, though, for qmail-remote to move slowly enough that it appears to hang (yes, even for hours or days). timeoutremote applies to every read() and write() - in the very worst case, each of these system calls might move only a single byte. Consider a 5000 byte message and a timeoutremote set to 1200 seconds (the default). The worst case just for sending the data alone - not including smtp overhead and reading responses from the remote server - is almost 70 days (1200 * 5000 / 60 / 60 / 24 ~= 69.44). Granted, this is extremely unlikely, but you get the idea - some scenarios can cause qmail-remote to move extraordinarily slowly, while still functioning correctly - that is, within the limits imposed by timeoutremote. Right. But in that case, the syscall trace would show qmail-remote blocked on the select() not the read(). The read() only gets executed if select() says there is data in which case the read does not block and the code immediately loops back on the select() again. As I recall, the original syscall trace showed qmail-remote blocked on the read(). Regards.