Re: qmail-remote (cry wolf?)

2001-06-19 Thread James R Grinter

On Tue 19 Jun, 2001, Mark Jefferys [EMAIL PROTECTED] wrote:
> On Sun, Jun 17, 2001 at 08:56:13PM +0100, James R Grinter wrote:
> Go look at timeoutread(), which *is* in your path.  The select is in
> the line right before where you wedge.

sorry, yes. You're right.

> It doesn't.  (Don't know about other people's.)  It assumes that the
> fd_sets will be cleared on timeout.  Setting the fd_sets each time is
> always necessary and doesn't protect against this issue, anyway.

I've now properly read the code, and I see what you're suggesting. I
may be naive in believing manual pages, but in the absence of other
evidence I tend to go with what they say, and the Solaris page explicitly
mentions zeroing the sets upon timeout. I therefore wouldn't have
expected to see this particular problem on Solaris 2.x.

However, it wouldn't be too hard to modify it to log the condition of
timeout being reached and an fdset not being zero.
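
A minimal sketch of that logging idea, for illustration only. The helper
names timeout_left_bit_set and select_read_logged are hypothetical and not
part of qmail, and sending the message to stderr is an assumption about
where such a diagnostic should go:

```c
#include <stdio.h>
#include <sys/select.h>

/* Hypothetical helper, not part of qmail: returns 1 when select()
   reported a timeout (res == 0) yet left fd's bit set in the read set. */
int timeout_left_bit_set(int res, int fd, fd_set *rfds)
{
  return res == 0 && FD_ISSET(fd, rfds) != 0;
}

/* Hypothetical wrapper: waits up to t seconds for fd to become readable,
   logs the anomaly if it occurs, and returns select()'s result unchanged. */
int select_read_logged(int fd, int t)
{
  fd_set rfds;
  struct timeval tv;
  int res;

  tv.tv_sec = t;
  tv.tv_usec = 0;

  /* The set must be rebuilt before every call, as in timeoutread(). */
  FD_ZERO(&rfds);
  FD_SET(fd, &rfds);

  res = select(fd + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  if (timeout_left_bit_set(res, fd, &rfds))
    fprintf(stderr, "select timed out but fd %d is still set\n", fd);
  return res;
}
```

If the message ever shows up, the man page and the running system disagree;
if it never does, this particular theory can probably be dropped.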

> I also put a debugging version of qmail-remote on my system, so if it
> ever decides to hang again I can fling gdb at it.

yes, that is what I should do too.

James.



Re: qmail-remote (cry wolf?)

2001-06-18 Thread Claudio Nieder

Hi,

  [Summary: Some systems leave the fd_sets alone when select times out.]
 I think it isn't relevant. qmail-remote doesn't seem to use select,

It does. timeoutread.c:

int timeoutread(t,fd,buf,len) int t; int fd; char *buf; int len;
{
  fd_set rfds;
  struct timeval tv;

  tv.tv_sec = t;
  tv.tv_usec = 0;

  FD_ZERO(&rfds);
  FD_SET(fd,&rfds);

  if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
  if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);

  errno = error_timeout;
  return -1;
}

When select returns -1 (the error case) everything is fine. When select
returns 0, i.e. in the timeout case, read is called if select has
not cleared the fd bit in rfds. So if there really are OSes
which do not clear the bits, then qmail will potentially block in
read on them.
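
For illustration, here is a sketch of how the function could guard against
that case by trusting select's return value rather than the bits alone. The
name timeoutread_checked is hypothetical, the ANSI-style prototype is a
stylistic change, and ETIMEDOUT stands in for qmail's error_timeout so the
sketch compiles outside the qmail tree:

```c
#include <errno.h>
#include <sys/select.h>
#include <unistd.h>

/* Sketch only: timeoutread() with an explicit check for select()
   returning 0.  ETIMEDOUT is a stand-in for qmail's error_timeout. */
int timeoutread_checked(int t, int fd, char *buf, int len)
{
  fd_set rfds;
  struct timeval tv;
  int res;

  tv.tv_sec = t;
  tv.tv_usec = 0;

  FD_ZERO(&rfds);
  FD_SET(fd, &rfds);

  res = select(fd + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  if (res == -1) return -1;

  /* Read only when select() reported at least one ready descriptor,
     so a stale bit left behind on timeout can never reach read(). */
  if (res > 0 && FD_ISSET(fd, &rfds)) return read(fd, buf, len);

  errno = ETIMEDOUT;
  return -1;
}
```

On a system that does clear the sets on timeout, this behaves the same as
the original; the extra test only matters on one that does not.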

 As to different OS behaviour, Solaris 2.6 (and 7) both say:
  and  errorfds  arguments  are  not modified.  If the timeout
  interval expires without the specified condition being  true
  for  any  of  the  specified  file  descriptors, the objects
  pointed to by the readfs, writefs,  and  errorfds  arguments
  have all bits set to 0.

On Solaris the above code would work without flaws.

 whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:
  descriptor sets.  0 indicates that the time  limit  referred
  to  by  timeout  expired.   On failure, select() returns -1,
  sets errno to indicate the error, and  the  descriptor  sets
  are not changed.

It doesn't say explicitly what happens when select returns 0, but since
only the error case is documented as leaving the bits untouched,
one supposes that they are cleared on timeout, and thus
qmail will not have any problems.

claudio
-- 
Claudio Nieder, Kanalweg 1, CH-8610 Uster, Tel +41 79 357 6743
yahoo messenger: claudionieder aim: claudionieder icq:42315212
mailto:[EMAIL PROTECTED] http://www.claudio.ch



Re: qmail-remote (cry wolf?)

2001-06-18 Thread MarkD

On Mon, Jun 18, 2001 at 11:05:34PM +0200, Claudio Nieder allegedly wrote:

 On Solaris the above code would work without flaws.
 
  whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:
   descriptor sets.  0 indicates that the time  limit  referred
   to  by  timeout  expired.   On failure, select() returns -1,
   sets errno to indicate the error, and  the  descriptor  sets
   are not changed.
 
 It doesn't tell explicitly what it does when it returns 0, but as
 it's mentioned only in the error case, that the bits are not cleared,
 one supposes that in timeout situations they are cleared, and thus
 qmail will not have any problems.

Same with FreeBSD 4.3 - by implication.

 ...  On return, select() replaces the given descriptor sets with
 subsets consisting of those descriptors that are ready for the
 requested operation.

...

RETURN VALUES

 ... If select() returns with an error, including one due to an
 interrupted call, the descriptor sets will be unmodified.



For those who are having significant recurrences of this problem, are
you in a position to change timeoutread.c to check for a zero return
from select? It sure would help isolate this problem if you can.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-18 Thread Mark Jefferys

On Sun, Jun 17, 2001 at 08:56:13PM +0100, James R Grinter wrote:

% I think it isn't relevant. qmail-remote doesn't seem to use select,
% or at least it's nowhere in the path where my qmail-remote wedges.

Go look at timeoutread(), which *is* in your path.  The select is in
the line right before where you wedge.

% As to different OS behaviour, Solaris 2.6 (and 7) both say:

[Man page claims it doesn't do this.]

% whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:

[Man page unclear.]

% and I can tell you that I've not seen the problem happen with
% qmail-remote on SunOS 4.1.4.

Well, I don't necessarily trust man pages to tell the truth,
especially if this was added accidentally (i.e. if it's a bug).

And I still haven't seen anything to really convince me that any OS
actually does this.  I've only seen that a few people think some do,
that it could easily happen as a bug, and that it could explain the
hung qmail-remotes.  And it's easily fixed if it is the problem.

In other words, I'm not saying that this is the cause, only that it's
possible.

%  Indeed, I think DJB's code (and most
% other people's) compensates for both behaviours by setting the
% necessary FD's each time anyway.

It doesn't.  (Don't know about other people's.)  It assumes that the
fd_sets will be cleared on timeout.  Setting the fd_sets each time is
always necessary and doesn't protect against this issue, anyway.


In any case, since I did see (one) stuck process recently I built
myself a test to see if I could reproduce it.  I couldn't.  At least on
RedHat Linux 2.2.19-6.2.1 or -6.2.1smp, it looks like select acts
sanely on a timeout, at least some of the time.
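
A probe along these lines might look like the sketch below. This is not the
test program mentioned above, only a guess at the general shape; the
function name fdset_cleared_on_timeout is hypothetical. It waits on the
read end of an empty pipe and inspects the fd_set once select gives up:

```c
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical probe: returns 1 if select() cleared the read set on
   timeout, 0 if the bit was left set, -1 if the probe itself failed. */
int fdset_cleared_on_timeout(void)
{
  int p[2];
  fd_set rfds;
  struct timeval tv;
  int res;
  int cleared;

  /* Nothing is ever written to the pipe, so p[0] never becomes readable. */
  if (pipe(p) == -1) return -1;

  tv.tv_sec = 0;
  tv.tv_usec = 100000;   /* 0.1 second is enough to force a timeout */

  FD_ZERO(&rfds);
  FD_SET(p[0], &rfds);

  res = select(p[0] + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
  cleared = !FD_ISSET(p[0], &rfds);

  close(p[0]);
  close(p[1]);

  if (res != 0) return -1;   /* not a timeout, so the probe says nothing */
  return cleared;
}
```

A pipe on the local machine does not exercise the TCP code paths, so a
clean result here cannot rule the theory out for sockets.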

I also put a debugging version of qmail-remote on my system, so if it
ever decides to hang again I can fling gdb at it.


Mark




RE: qmail-remote (cry wolf?)

2001-06-18 Thread Troy Settle


Mark,

How would I go about building a debug version of qmail-remote?
Also, how do I terminate the process so that I can 'fling' gdb at it?

With a little luck I can probably have output from gdb within a couple of hours.

--
  Troy Settle
  Pulaski Networks
  540.994.4254






Re: qmail-remote (cry wolf?)

2001-06-18 Thread Mark Jefferys

On Mon, Jun 18, 2001 at 11:20:36PM -0400, Troy Settle wrote:

% How would I need to go about building a debug version of qmail-remote?

I set conf-cc and conf-ld to 'gcc -g', edited timeoutread.c slightly
to save the return value of the select in a variable, then built
qmail-remote and put it in place of the live one.  I'll attach a patch
matching what I did to timeoutread.c.

% Also, how to terminate the process so that I can 'fling' gdb at it?

I wasn't planning on terminating it.  Rather I was thinking of using
gdb's attach command to take over the process, and then start
examining variables.  Mostly, I was going to wing it.

I expect the full attachment sequence to look something like this:

(gdb) attach pid-of-stuck-qmail-remote
(gdb) symbol-file /var/qmail/bin/qmail-remote
(gdb) directory path-to-qmail-source-with-modified-timeoutread.c
(gdb) bt
(gdb) up   -- repeat until at timeoutread() stack frame
(gdb) p res
(gdb) p fd
(gdb) p rfds   -- or something like that

% With a little luck I can probably have output from gdb within a couple hours.

Good luck, then.


Mark



--- timeoutread.c   Mon Jun 15 03:53:16 1998
+++ timeoutread.c   Mon Jun 18 22:23:24 2001
@@ -7,6 +7,7 @@
 {
   fd_set rfds;
   struct timeval tv;
+  int res;
 
   tv.tv_sec = t;
   tv.tv_usec = 0;
@@ -14,7 +15,8 @@
   FD_ZERO(&rfds);
   FD_SET(fd,&rfds);
 
-  if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
+  res = select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv);
+  if (res == -1) return -1;
   if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);
 
   errno = error_timeout;



Re: qmail-remote (cry wolf?)

2001-06-17 Thread James R Grinter

Dave Sill [EMAIL PROTECTED] writes:
 Three of the four are running Red Hat 6.2. That could simply be
 because 75% of qmail systems are running RH 6.2, though. :-)

I see this problem occasionally, with mail being sent from a Solaris
2.6 system. It frequently happens for mail to one particular ISP
(freeserve.co.uk, aka Planet Online/Energis Squared), who run Exim on
(I believe) Linux systems.

I suspect they're using something to load balance the TCP sessions, as
repeatedly connecting to the two A records for their one MX record
shows up several different system names in the 220 banners. This could
be the cause of the TCP session never closing down, but it's clear
that because we're in a read() we never try to send anything that
might elicit a TCP reset.

 No word on which qmail patches, if any, were installed on these

Mine is stock qmail 1.03.

I kept meaning to get around to posting the evidence I collected here,
so here (finally) it is:

Here's my example stuck qmail-remote, with a backtrace from gdb and
also lsof output. Unfortunately I didn't keep truss output for this
one. (I should point out that this output was collected on Jan 11th...)

qmailr  4322   211  0   Nov 03 ?        0:00 qmail-remote oglaroon.freeserve.co.uk mark-thomas-owner-mt=oglaroon.freeserve.c

# gdb /var/qmail/bin/qmail-remote 4322
GNU gdb 4.18
Copyright 1998 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type show copying to see the conditions.
There is absolutely no warranty for GDB.  Type show warranty for details.
This GDB was configured as sparc-sun-solaris2.6...
(no debugging symbols found)...

Attaching to program `/var/qmail/bin/qmail-remote', process 4322
Reading symbols from /usr/lib/libresolv.so.2...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libsocket.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libnsl.so.1...(no debugging symbols found)...
done.
Reading symbols from /usr/lib/libc.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libdl.so.1...(no debugging symbols found)...done.
Reading symbols from /usr/lib/libmp.so.2...(no debugging symbols found)...done.
Symbols already loaded for /usr/lib/libresolv.so.2
Symbols already loaded for /usr/lib/libsocket.so.1
Symbols already loaded for /usr/lib/libnsl.so.1
Symbols already loaded for /usr/lib/libc.so.1
Symbols already loaded for /usr/lib/libdl.so.1
Symbols already loaded for /usr/lib/libmp.so.2
0xef6386b8 in _read () from /usr/lib/libc.so.1
(gdb) bt
#0  0xef6386b8 in _read () from /usr/lib/libc.so.1
#1  0x13c7c in timeoutread ()
#2  0x12524 in saferead ()
#3  0x160e0 in oneread ()
#4  0x161a0 in substdio_feed ()
#5  0x16290 in substdio_get ()
#6  0x12594 in get ()
#7  0x1261c in smtpcode ()
#8  0x12938 in smtp ()
#9  0x133b0 in main ()
(gdb)

# lsof -p 4322
COMMAND    PID   USER   FD   TYPE     DEVICE SIZE/OFF    NODE NAME
qmail-rem 4322 qmailr  cwd   VDIR      85,14      512  328536 /var/qmail
qmail-rem 4322 qmailr  txt   VREG      85,14    63804  361948 /var/qmail/bin/qmail-remote
qmail-rem 4322 qmailr  txt   VREG       85,0    19304   30060 /usr/lib/libmp.so.2
qmail-rem 4322 qmailr  txt   VREG       85,0  1014088   30137 /usr/lib/libc.so.1
qmail-rem 4322 qmailr  txt   VREG       85,0   721916   32170 /usr/lib/libnsl.so.1
qmail-rem 4322 qmailr  txt   VREG       85,0    53656   30072 /usr/lib/libsocket.so.1
qmail-rem 4322 qmailr  txt   VREG       85,0    92952   30061 /usr/lib/libresolv.so.2
qmail-rem 4322 qmailr  txt   VREG       85,0     4280   30124 /usr/lib/libdl.so.1
qmail-rem 4322 qmailr  txt   VREG       85,0   166196   30030 /usr/lib/ld.so.1
qmail-rem 4322 qmailr    0r  VREG      85,14     5021  150345 /var/qmail/queue/mess/17/150345
qmail-rem 4322 qmailr    1u  FIFO 0xf7e0e144      0t0 1091718 PIPE->0xf7e0e0c0
qmail-rem 4322 qmailr    2u  FIFO 0xf7e0e144      0t0 1091718 PIPE->0xf7e0e0c0
qmail-rem 4322 qmailr    3u  inet 0xf77ec040      0t0     TCP agent57.gbnet.net:59889->slb-mail-inG1.svr.pol.co.uk:smtp (ESTABLISHED)



Re: qmail-remote (cry wolf?)

2001-06-17 Thread Mark Jefferys

I came across the following, which *might* explain some of these
deadlocking problems:

http://kt.zork.net/kernel-traffic/kt20010611_121.html#6

[Summary: Some systems leave the fd_sets alone when select times out.]

If I read this right, timeoutconn/read/write (and anything else that
uses select) have to check for a result of 0 explicitly to be
completely portable.

Even if an OS doesn't do this intentionally, it's quite easy to see
someone forgetting to clear the fd_sets on a timeout by accident, so
some defensive coding against the problem (explicitly checking for a
result of 0) may be worthwhile.

Or this may just be a red herring...


Mark

N.B.  Although someone claimed to have seen a BSD man page reporting
that it wouldn't clear the fd_sets on a timeout, I was unable to find
any evidence of such a thing with Google.  And at least one standard
(Single UNIX Specification v2) has forbidden this kind of weirdness.

P.S.  And I just found one of these bloody hung qmail-remotes on one
of my systems!@#$!  Stuck in read of fd 3; directed at email.com (who
clearly have no clue how to set up DNS records for email, and are down
anyway).  Redhat Linux kernel 2.2.19-6.2.1smp.




Re: qmail-remote (cry wolf?)

2001-06-17 Thread James R Grinter

Mark Jefferys [EMAIL PROTECTED] writes:
 [Summary: Some systems leave the fd_sets alone when select times out.]

 Even if an OS doesn't do this intentionally, it's quite easy to see
 someone forgetting to clear the fd_sets on a timeout by accident, so
 some defensive coding against the problem (explicitly checking for a
 result of 0) may be worthwhile.
 
 Or this may just be a red herring...

I think it isn't relevant. qmail-remote doesn't seem to use select,
or at least it's nowhere in the path where my qmail-remote wedges.

As to different OS behaviour, Solaris 2.6 (and 7) both say:

  C Library Functions                                  select(3C)

 On failure, the objects pointed to by the  readfs,  writefs,
 and  errorfds  arguments  are  not modified.  If the timeout
 interval expires without the specified condition being  true
 for  any  of  the  specified  file  descriptors, the objects
 pointed to by the readfs, writefs,  and  errorfds  arguments
 have all bits set to 0.

whereas SunOS 4.1.4 (my usual 'old bsd system' benchmark) says:

  SELECT(2) SYSTEM CALLS  SELECT(2)

 select() returns a non-negative value on success.   A  posi-
 tive  value indicates the number of ready descriptors in the
 descriptor sets.  0 indicates that the time  limit  referred
 to  by  timeout  expired.   On failure, select() returns -1,
 sets errno to indicate the error, and  the  descriptor  sets
 are not changed.

and I can tell you that I've not seen the problem happen with
qmail-remote on SunOS 4.1.4. Indeed, I think DJB's code (and most
other people's) compensates for both behaviours by setting the
necessary FD's each time anyway.

 N.B.  Although someone claimed to have seen a BSD man page reporting
 that it wouldn't clear the fd_sets on a timeout, I was unable to find

See above!

James.



Re: qmail-remote (cry wolf?)

2001-06-15 Thread Eric Calvert

I wanted to let the list know something about this topic.  Sorry if this has
been covered, but I just started following the list again because I'm
having this same problem.

I'm running qmail on Redhat 6.x with a 2.2.12-20smp kernel.  It has been
running for well over a year (I don't remember the exact date I installed
everything).  But, over the last couple of months, I started getting a few
complaints of emails not being sent out in a timely manner.  Every time I
received the complaint, I checked my processes with 'ps ax | grep qmail' and
lo and behold there would be 20 or so copies of qmail-remote that were sitting
doing nothing.  In fact, they had been sitting doing nothing for so long,
they were swapped out.  The first few times, I just killed the qmail-remote
processes and watched as qmail again started sending messages across the
Internet.

After a while though, I noticed a pattern.  Every time I killed
all the stuck qmail-remote processes, almost all the mail going out via the
newly revived qmail-remote processes was directed to
[EMAIL PROTECTED] (XX being a number in the range of 01 to
about 37 or so).  After some further investigation, I found that all of the
messages going to edirectnetwork.net were bounce messages stating that the
user they were trying to send mail to no longer existed.  So, I sent an
email to [EMAIL PROTECTED] to try to explain the situation to
her/him and offered to help solve the problem as she/he was needlessly
wasting both their and my bandwidth.  After a few days and no response
whatsoever, I simply found the IP address for every optXX.edirectnetwork.net
server and added it to my tcpserver rules to be blocked so that they could
no longer send mail to my server.

Here's the good part.  I set up the block over 48 hours ago, and have since
not had a single qmail-remote process lock up.  Before I made the change, I
would normally have at least two to three stuck qmail-remote processes in
that same amount of time.

I don't know how and/or if the two are even related.  I hope this helps.

Again, I apologize if this has already been covered.

Eric Calvert
Caveland Connection





RE: qmail-remote (cry wolf?)

2001-06-09 Thread Troy Settle

** -----Original Message-----
** From: Mark [mailto:[EMAIL PROTECTED]]
** Sent: Friday, June 08, 2001 11:14 AM
** To: [EMAIL PROTECTED]
** Subject: Re: qmail-remote (cry wolf?)
**
**
**  processed those 1500 messages in less than 30 minutes.  However, it left
**  behind another handful of stuck qmail-remote processes.  Other messages
**  were undeliverable and left in the queue, and still others were sent back to
**  sender with permanent errors.
**
** What do you mean by stuck? Do you mean they *never* go away - even
** after a day or two? As others have pointed out, a slow delivery can
** take a long, long time. That's not necessarily a problem, that's just
** the way it is.

Yes, I've had qmail-remote processes sit there for weeks.  I think that
instead of killing them off wholesale, I'll pick one or two processes and
see just how long they'll hang around. I'll post weekly updates if there's
any interest.

I keep hearing that it might be a very slow delivery.  How is this possible
when there isn't any network connection open to the remote host in question,
let alone a connection to its SMTP port?

As far as I can tell, this is a problem between qmail-remote and the kernel.
This is happening on multiple operating systems, so that leads me to believe
that this is not an OS bug.


**
** To find out a bit more about what a stuck qmail-remote is doing, you
** may want to ktrace it and show us the output. Find the process id of the
** stuck qmail-remote and then as root go: ktrace -p thepid
**
** Leave that running for at least an hour and show us the output. Yes, I
** mean at least an hour.
**

Ok, I meant to come back in an hour and stop the trace, but after running
ktrace for 9 hours (while I slept), the resulting ktrace.out file is exactly
0 bytes in length.  Would you like me to send a copy? g

I did verify the behavior of ktrace, and a ktrace on qmail-send generated
tons of data within seconds.  ktrace is working.

Anything else y'all would like me to look at?
--
  Troy Settle
  Pulaski Networks
  540.994.4254




Re: qmail-remote (cry wolf?)

2001-06-09 Thread Yevgeniy Miretskiy

On Sat, Jun 09, 2001 at 06:32:55AM -0400, Troy Settle wrote:
 Yes, I've had qmail-remote processes sit there for weeks.  I think that
 instead of killing them off wholesale, I'll pick one or two processes and
 see just how long they'll hang around. I'll post weekly updates if there's
 any interest.

Here is what I have on one of my mail servers (ps -waux|grep qmail-remote; real email
addresses are removed, but domain names are left.  For readability I kept only the user,
pid, state, start date, and program name columns):
 
qmailr 7365   S May19 qmail-remote iname.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 14602  S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 25415  S May19 qmail-remote careful.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 25875  S May19 qmail-remote programmer.net [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 25902  S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 852    S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 20283  S May25 qmail-remote ziplip.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 29814  S May18 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 25877  S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 25145  S May19 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 27208  S Jun08 qmail-remote hp.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 27070  S Jun08 qmail-remote mail.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 11525  S Jun08 qmail-remote best-service.com [EMAIL PROTECTED] [EMAIL PROTECTED]
qmailr 13766  S Jun08 qmail-remote mad.scientist.com [EMAIL PROTECTED] [EMAIL PROTECTED]

As you can see, processes running since May 19th cannot possibly be explained by
slow delivery -- 20 days is just too much.
The following domains go through outblaze.com mail servers:
  iname.com
  mail.com
  careful.com
  programmer.net
  best-service.com
  mad.scientist.com

The following domains do not go through outblaze:
  ziplip.com
  hp.com

Unfortunately, I cannot explain this situation by blaming everything on outblaze.



-- 
  Eugene Miretskiy [EMAIL PROTECTED]
  InVision.com, INC.  (631) 543-1000
  www.invision.net  /  www.longisland.com 



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Greg White

I think we may have red-herringed on the OS thing -- if RH6.2, as
deployed, had this sort of problem, I think we would have run across it
before this, no? The inclusion of a FreeBSD-4.2-STABLE in the mix seems
to nix a RH-specific bug as well (although it obviously does not rule
it out entirely*). Perhaps we're overlooking some other, more subtle
commonality between these four setups?

Could at least two of the OP's please detail (for me, if not for the
list, at least) the devices that sit between the NIC of the host in
question and the Big Bad Internet? Routers, hubs, transparent firewalls,
everything?

*I highly recommend that the FreeBSD-4.2-STABLE user at least upgrade to
4.3R -- I'm not sure at which point in 4.2-STABLE you froze your local
tree, but a whole bunch of fixes made it into 4.3, and it's been running
great for me.

-- 
Greg White



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

  Is it possible that some external devices such as a
  switch/router/firewall/anything could be causing this problem?
 
 Yes, very possible.  Some firewalls do transparent SMTP or POP proxying, and
 there have been many bugs in such implementations.

No. Regardless of what the other end does, a conforming OS should not
wedge qmail-remote forever. Why do people keep suggesting this?

You have three choices:

1. Show the bug in the code containing the select() and read()

2. Show an interpretation error regarding the semantics of
   read() and select()

If you can do either of those, we can conclude that qmail-remote is
coded incorrectly and needs fixing.

If you can do neither of these, then this leaves you with the
inescapable conclusion that qmail-remote *is* playing by the rules, in
which case you are left with the only alternative:

3. the other side of the C code is not playing by the rules:
   ie a bug in the compiler, libraries or OS.


I will note that no one has done 1 or 2 yet, so that leaves 3.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

On Fri, Jun 08, 2001 at 08:11:21PM -0400, Yevgeniy Miretskiy allegedly wrote:
 On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote:
  Then it's an OS bug.
  
  qmail-remote only gets to the read() if the OS (via select() ) says
  that the read will not block. Ergo, the OS is lying.
 
 If it's OS bug, anybody heard/knows of such severe network related
 bug in RedHat 6.2?
 
 What about FreeBSD 4.2 (I believe somebody reported problem with
 FreeBSD as well)???
 
 What are the chances of _such_ bug in _both_ OSes?
 I'd like to mention, that I ran qmail of FreeBSD (starting from 3.x all
 the way to latest) for couple years and _never_ observed this behaviour
 on FreeBSD.

I ran it on Solaris 2.5/2.6 for years and did experience this sort of
behaviour. It went away on 2.8. So what?

No one has shown that qmail-remote is doing anything wrong. If it's
not doing anything wrong, then maybe the problem is somewhere else?
Indeed, every reading of the code in question suggests that
qmail-remote is doing everything right.

The fact that this problem occurs on at least two OSes simply suggests
to me that the TCP/IP interaction is a boundary condition perhaps
triggered by long-distance connections and perhaps also by an uncommon
remote TCP/IP stack.

Regardless of which, if an OS reneges on the fd-will-not-block
promise, then it can *only* be an OS bug.


Regards.




Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

 As far as I can tell, this is a problem between qmail-remote and the kernel.

Correct.

 This is happening on multiple operating systems, so that leads me to believe
 that this is not an OS bug.

But many OSes share TCP/IP implementations or mis-interpretations of
the protocol. Many coders of TCP/IP stacks look at other
implementations to work out what to do. There is a *lot* of
commonality between OSes in this regard. E.g., the Linux crowd and the
FreeBSD crowd regularly refer to each other's implementations to decide
how to do something (or not do something as the case may be).

 ** To find out a bit more about what a stuck qmail-remote is doing, you
 ** may want to ktrace it and show us the output. Find the process id of the
 ** stuck qmail-remote and then as root go: ktrace -p thepid
 **
 ** Leave that running for at least an hour and show us the output. Yes, I
 ** mean at least an hour.
 **
 
 Ok, I meant to come back in an hour and stop the trace, but after running
 ktrace for 9 hours (while I slept), the resulting ktrace.out file is exactly
 0 bytes in length.  Would you like me to send a copy? g

It's a bummer that ktrace is like that on FreeBSD. It doesn't show the
*current* system call that the process is sitting on. By contrast,
truss on Solaris does this nicely...

You can conclude though that qmail-remote wasn't sitting on the
select(), as that has a timeout and the trace would then show the system
calls associated with the reading loop. If it's not sitting on the select(),
what is it sitting on? If it's the read(), well, how could that be if
select() said the read would not block?


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

On Sat, Jun 09, 2001 at 09:05:00AM -0700, Greg White allegedly wrote:
 I think we may have red-herringed on the OS thing -- if RH6.2, as
 deployed, had this sort of problem, I think we would have run across it
 before this, no? The inclusion of a FreeBSD-4.2-STABLE in the mix seems
 to nix a RH specific bug as well (althought it obviously does not rule
 it out entirely*). Perhaps we're overlooking some other, more subtle
 commonality between these four setups?

Indeed. Using commonality to solve a problem is a fine
technique. However the underlying assumption is that it is a single
problem that is being solved here. We have no certainty of that; all
we do have is a single *symptom* - qmail-remote wedges on some
systems, on some occasions.

If it is a single problem, here are some commonalities that might be
explored:

1.  Bug in qmail-remote
2.  Common compiler (think optimization error)
3.  Common clib error (think semantic error or bug)
4.  Common OS (think semantic error or bug)
5.  Common TCP/IP stack
6.  Common network interface code (perhaps all derived
    from a vendor reference implementation)

All of which *may* only be triggered by a certain set of TCP/IP events
initiated from the peer end. Indeed the peer may be an uncommon
OS/TCP/IP combo which reduces the occurrence of this problem to
isolated situations.

And you can be very certain that this is a very very rare event. Just
consider how many invocations of qmail-remote have successfully
completed in the last 3 years on many many thousands of OSes in many
thousands of locations around the world.

What does that mean? It's probably a tough problem to nail down
without access to the interaction history between all of the above
components.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Russell Nelson

Greg White writes:
  I think we may have red-herringed on the OS thing -- if RH6.2, as
  deployed, had this sort of problem, I think we would have run across it
  before this, no?

Hmmm  I wonder.  I could do a denial of service attack on
qmail-remote by receiving email very, very slowly, and by sending
email to a server which is guaranteed to be received and guaranteed to
bounce.  qmail doesn't keep track of very slow hosts, but only hosts
that time out.

-- 
-russ nelson [EMAIL PROTECTED]  http://russnelson.com
Crynwr sells support for free software  | PGPok | 
521 Pleasant Valley Rd. | +1 315 268 1925 voice | John Hartford, RIP
Potsdam, NY 13676-3213  | +1 315 268 9201 FAX   | 



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Charles Cazabon

Russell Nelson [EMAIL PROTECTED] wrote:
 
 Hmmm  I wonder.  I could do a denial of service attack on
 qmail-remote by receiving email very, very slowly, and by sending
 email to a server which is guaranteed to be received and guaranteed to
 bounce.  qmail doesn't keep track of very slow hosts, but only hosts
 that time out.

I've been thinking along the same lines.  qmail-smtpd would seem to also be
vulnerable to this (not that this is djb's fault).  Lowering timeoutremote and
timeoutsmtpd from their defaults of 1200 would help against this problem
cropping up due to genuinely slow servers, but not against a deliberate attack
(send one byte every ten minutes, or two minutes, or whatever, tying up a
qmail-smtpd process for an indefinite period).

Perhaps something like a maxlifetime control file for qmail-remote and
qmail-smtpd?  At process startup, set an alarm for X seconds -- if the ALRM is
received, abort the connection as gracefully as possible (i.e. try to send
RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit
regardless of whether these attempts to quit gracefully are successful or not.

It doesn't sound too complicated.  Anybody see any major issues with this?

Charles
-- 
---
Charles Cazabon[EMAIL PROTECTED]
GPL'ed software available at:  http://www.qcc.sk.ca/~charlesc/software/
Any opinions expressed are just that -- my opinions.
---



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

On Sat, Jun 09, 2001 at 03:11:59PM -0400, Russell Nelson allegedly wrote:
 Greg White writes:
   I think we may have red-herringed on the OS thing -- if RH6.2, as
   deployed, had this sort of problem, I think we would have run across it
   before this, no?
 
 Hmmm  I wonder.  I could do a denial of service attack on
 qmail-remote by receiving email very, very slowly,

You'd also have to set the TCP/IP receive window size down, otherwise
you may think you're only receiving at one byte every, say, 20
minutes, but in fact your TCP/IP stack got a window full of data at
one time.
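
As a rough sketch of what "setting the receive window down" could look
like on the receiving side: the advertised window is bounded by the
socket receive buffer, which can be requested with SO_RCVBUF. The helper
name shrink_receive_buffer is hypothetical, and the kernel is free to
round the request up to its own minimum, so this only approximates a
very small window.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel for a small receive buffer on a
   TCP socket, which in turn limits the window it advertises. Call it
   before connect() or listen(); the kernel may round the value up. */
int shrink_receive_buffer(int fd, int bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof bytes);
}
```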

But yes, you could slow it down considerably and if you got to the
extreme limit of 20 minutes per byte, a 1M email will take on the
order of 40 years...

It sure is the case that some sort of gross delivery timer makes sense
and it would work around the problem that initiated this thread...

 and by sending
 email to a server which is guaranteed to be received and guaranteed to
 bounce.  qmail doesn't keep track of very slow hosts, but only hosts
 that time out.

Of course it has to be your server that accepts the traffic slowly so
it's a DOS on yourself at the same time. Not the typical MO for a
successful DOS.


This is only a proof of concept, but to implement a gross timer, you
might use this program as a wrapper to qmail-remote (which of course
you move to qmail-remote.real):

#include <unistd.h>

int main(int argc, char **argv)
{
    alarm(5*60*60); /* Max of five hours for a remote delivery */
    execv("/var/qmail/bin/qmail-remote.real", argv);
    _exit(1); /* only reached if execv fails */
}


This wrapper gives qmail-remote an arbitrary 5 hours to make a
delivery at which point qmail-remote gets a SIGALRM which it happens
not to have registered a handler for and thus the OS takes the default
action which is to terminate the process.
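
The mechanism described above (a pending alarm survives the exec, and
an unhandled SIGALRM kills the process) can be observed with a small
sketch like the following. The helper name run_with_deadline is
hypothetical, and it forks so that the caller can inspect how the
child ended; the wrapper itself does not need the fork.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with a hard deadline and return the
   wait status, or -1 on error. */
int run_with_deadline(unsigned int seconds, char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid == -1) return -1;
    if (pid == 0) {
        alarm(seconds);        /* a pending alarm survives exec */
        execvp(argv[0], argv);
        _exit(127);            /* only reached if exec failed */
    }
    if (waitpid(pid, &status, 0) == -1) return -1;
    return status;
}
```

If the returned status satisfies WIFSIGNALED(status) and
WTERMSIG(status) == SIGALRM, the deadline fired before the program
finished. This assumes the program neither ignores nor handles SIGALRM
itself.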


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Jos Backus

On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote:
 It's a bummer that ktrace is like that on FreeBSD. It doesn't show the
 *current* system call that the process is sitting on. Conversely,
 truss on Solaris does this nicely...

But FreeBSD does have a (procfs-based) truss.

-- 
Jos Backus _/  _/_/_/Modularity is not a hack.
  _/  _/   _/-- D. J. Bernstein
 _/  _/_/_/ 
_/  _/  _/_/
[EMAIL PROTECTED] _/_/   _/_/_/use Std::Disclaimer;



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

 Perhaps something like a maxlifetime control file for qmail-remote and

(Serendipity strikes again - I just posted sample code for this).

 qmail-smtpd?  At process startup, set an alarm for X seconds -- if the ALRM is
 received, abort the connection as gracefully as possible (i.e. try to send
 RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit
 regardless of whether these attempts to quit gracefully are successful or not.

Why bother with a graceful exit? You'd have to set yet another alarm
for the (likely) case that your graceful exit fails. That seems like
unnecessary complexity for a connection that is almost certainly dead.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Mark

On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote:
 On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote:
  It's a bummer that ktrace is like that on FreeBSD. It doesn't show the
  *current* system call that the process is sitting on. Conversely,
  truss on Solaris does this nicely...
 
 But FreeBSD does have a (procfs-based) truss.

Right. But it suffers from the same problem that ktrace does in that
it starts with the next system call, not the current one. Leastwise it
does on a 4.3 I have access to, do you get something different?


Regards.



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Peter van Dijk

On Sat, Jun 09, 2001 at 08:12:03PM +, Mark wrote:
 On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote:
  On Sat, Jun 09, 2001 at 05:58:49PM +, Mark wrote:
   It's a bummer that ktrace is like that on FreeBSD. It doesn't show the
   *current* system call that the process is sitting on. Conversely,
   truss on Solaris does this nicely...
  
  But FreeBSD does have a (procfs-based) truss.
 
 Right. But it suffers from the same problem that ktrace does in that
 it starts with the next system call, not the current one. Leastwise it
 does on a 4.3 I have access to, do you get something different?

I get what you get :)

Greetz, Peter
-- 
Against Free Sex!   http://www.dataloss.nl/Megahard_en.html



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Uwe Ohse

On Sat, Jun 09, 2001 at 01:54:37PM -0600, Charles Cazabon wrote:
 
 Perhaps something like a maxlifetime control file for qmail-remote and
 qmail-smtpd?  At process startup, set an alarm for X seconds -- if the ALRM is
 received, abort the connection as gracefully as possible (i.e. try to send
 RSET and QUIT in qmail-remote, issue a 4xx error in qmail-smtpd) but quit
 regardless of whether these attempts to quit gracefully are successful or not.
 
 It doesn't sound too complicated.  Anybody see any major issues with this?

No, but this may be more complicated than needed.
I've been using the attached program on one of my machines for years 
(the machine was behind a dialup line and it was definitively not
funny to sponsor the Deutsche Telekom just because the other end
of a SMTP connection was slow).
It worked in its crude form. It leaves a hole open which could
lead to duplicate transfers.

Regards, Uwe
-- setalarm.c ---
#include <signal.h>
#include <unistd.h>
#include <errno.h>
#include <stdlib.h>
#include <stdio.h>
static void die_usage(void)
{fputs("usage: setalarm SECONDS program [arguments ...]\n",stderr);
 exit(111);}
static void die_parse(void)
{fputs("setalarm: fatal: failed to parse the number of seconds\n",stderr);
 exit(111);}
static void die_exec(char *x)
{int e=errno;
 fputs("setalarm: fatal: failed to execute ",stderr);
 errno=e; /* restore errno in case fputs changed it */
 perror(x);
 exit(111);}

int 
main(int argc, char **argv)
{
unsigned long ul;
char *ep;
if (argc<3) die_usage();
errno=0;
ul=strtoul(argv[1],&ep,10);
if (*ep || !argv[1][0] || errno==ERANGE) die_parse();
signal(SIGALRM,SIG_DFL); /* make sure SIGALRM kills the program */
alarm(ul);
execvp(argv[2],argv+2);
die_exec(argv[2]); /* only reached if execvp fails */
return 111;
}



Re: qmail-remote (cry wolf?)

2001-06-09 Thread Jos Backus

On Sat, Jun 09, 2001 at 08:11:41PM +, Mark wrote:
 On Sat, Jun 09, 2001 at 01:00:46PM -0700, Jos Backus allegedly wrote:
  But FreeBSD does have a (procfs-based) truss.
 
 Right. But it suffers from the same problem that ktrace does in that
 it starts with the next system call, not the current one. Leastwise it
 does on a 4.3 I have access to, do you get something different?

Nope (I'm still at PRE_SMPNG, waiting for -current to stabilize (hah!)).

One idea would be to run the process under truss, and pipe the truss output
through multilog, providing one with a syscall activity history without the
danger of filling up partitions (as would likely happen when using ktrace).

-- 
Jos Backus _/  _/_/_/Modularity is not a hack.
  _/  _/   _/-- D. J. Bernstein
 _/  _/_/_/ 
_/  _/  _/_/
[EMAIL PROTECTED] _/_/   _/_/_/use Std::Disclaimer;



Re: qmail-remote (cry wolf?)

2001-06-08 Thread Jörgen Persson

On Thu, Jun 07, 2001 at 10:43:27PM +0300, Mike Jackson wrote:
  What are the probabilities of the Sendmail server being the one causing
 the problems? What if the mail admin of mg.hk5.outblaze.com has used
 some sort of patch that is causing qmail-remote's to hang?

Which means it might be exploitable as a DoS... 

There's been similar problems with mail.com (hosted by outblaze) as
well. I still haven't been able to manually connect to any of their
servers. It seems as they are under heavy load according to an apology
at their home page.

Jörgen



RE: qmail-remote (cry wolf?)

2001-06-08 Thread Troy Settle


Not being a programmer, I have no clue how to trace this, but if someone
were able to help me, I'd be glad to give it a go.  I'm on FreeBSD
4.2-STABLE at the moment, and will be updating again soon.

Qmail is built with patches, a concurrency patch and the patches from the
FreeBSD port.  qmail-remote itself was not patched from 1.03.

What I'm seeing on these stuck processes, is that they're in a state of
'sbwait' (as shown by top).  netstat doesn't show any open connections to
the remote hosts (smtp or otherwise).

This problem doesn't seem to be related to the remote host, no matter the
MTA.  I've seen several stuck qmail-remote processes to a certain host, but
scanning through logs shows that mail has been successfully sent to that
same host on multiple occasions, both prior and after the stuck process was
launched.

This doesn't seem to be a networking problem.  On one occasion, I had over
1500 messages queued up because the number of stuck qmail-remote processes
ate up my concurrency limit.  After clearing up the blockage, the box
processed those 1500 messages in less than 30 minutes.  However, it left
behind another handful of stuck qmail-remote processes.  Other messages
were undeliverable and left in the queue, and still others were sent back to
sender with permanent errors.

Logs are intact.  There's a start of delivery entry, but if qmail-remote
gets stuck, there is no further reference to those messages.

Yes, I can read the messages in the queue.  They are intact and appear to be
properly formatted.

There is no proxy server or firewall between this box and the rest of the
Internet.  Only a Cisco 2924 switch, a 3640 router and a T1 ride out to AT&T
or Sprint.


I hope all this information helps.  Anyone should feel free to ask for more
details, but please be specific in the information you need.  Remember, a
lot of us here are admins, not developers.


--
  Troy Settle
  Pulaski Networks
  540.994.4254


** -Original Message-
** From: Mark [mailto:[EMAIL PROTECTED]]
** Sent: Thursday, June 07, 2001 6:00 PM
** To: [EMAIL PROTECTED]
** Subject: Re: qmail-remote (cry wolf?)
**
**
**   What are the probabilities of the Sendmail server being the
** one causing
**  the problems? What if the mail admin of mg.hk5.outblaze.com has used
**  some sort of patch that is causing qmail-remote's to hang? Has anyone
**  communicated with outblaze.com's postmaster?
**
** There is nothing a remote system can do that will hang qmail-remote on
** a correctly functioning OS. If the local TCP stack has accepted data
** and indicated available via the select() return, then the remote
** system has no further say as the read() only fetches the data
** previously received.
**
** I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be
** that the local TCP stack - in some circumstance - discards unread
** data, *then* marks the local socket as unreadable, rather than around
** the other way. That sort of window would wedge the select/read
** sequence in qmail-remote.
**
**
** Regards.
**
**




Re: qmail-remote (cry wolf?)

2001-06-08 Thread Mark

 processed those 1500 messages in less than 30 minutes.  However, it left
 behind another handful of stuck qmail-remote processes.  Other messages
 were undeliverable and left in the queue, and still others were sent back to
 sender with permanent errors.

What do you mean by stuck? Do you mean they *never* go away - even
after a day or two? As others have pointed out, a slow delivery can
take a long, long time. That's not necessarily a problem, that's just
the way it is.

To find out a bit more about what a stuck qmail-remote is doing, you
may want to ktrace it and show us the output. Find the process id of the
stuck qmail-remote and then as root go: ktrace -p thepid

Leave that running for at least an hour and show us the output. Yes, I
mean at least an hour.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-08 Thread Yevgeniy Miretskiy

One more time,

I did tcpdump and strace on stuck qmail-remote for over an hour.
strace shows that qmail-remote is stuck on: 'read(3', and tcpdump shows
that nothing comes in.

On Fri, Jun 08, 2001 at 03:13:54PM +, Mark wrote:
  processed those 1500 messages in less than 30 minutes.  However, it left
  behind another handful of stuck qmail-remote processes.  Other messages
  were undeliverable and left in the queue, and still others were sent back to
  sender with permanent errors.
 
 What do you mean by stuck? Do you mean they *never* go away - even
 after a day or two? As others have pointed out, a slow delivery can
 take a long, long time. That's not necessarily a problem, that's just
 the way it is.
 
 To find out a bit more about what a stuck qmail-remote is doing, you
 may want to ktrace it and show us the output. Find the process id of the
 stuck qmail-remote and then as root go: ktrace -p thepid
 
 Leave that running for at least an hour and show us the output. Yes, I
 mean at least an hour.
 
 
 Regards.
 

-- 
  Eugene Miretskiy [EMAIL PROTECTED]
  InVision.com, INC.  (631) 543-1000
  www.invision.net  /  www.longisland.com 



Re: qmail-remote (cry wolf?)

2001-06-08 Thread Mark

On Fri, Jun 08, 2001 at 03:51:18PM -0400, Yevgeniy Miretskiy allegedly wrote:
 One more time,
 
 I did tcpdump and strace on stuck qmail-remote for over an hour.
 strace shows that qmail-remote is stuck on: 'read(3', and tcpdump shows
 that nothing comes in.

One more time. Then it's an OS bug.

qmail-remote only gets to the read() if the OS (via select() ) says
that the read will not block. Ergo, the OS is lying.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-08 Thread Yevgeniy Miretskiy

On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote:
 Then it's an OS bug.
 
 qmail-remote only gets to the read() if the OS (via select() ) says
 that the read will not block. Ergo, the OS is lying.

If it's an OS bug, has anybody heard of such a severe network-related
bug in RedHat 6.2?

What about FreeBSD 4.2 (I believe somebody reported problem with
FreeBSD as well)???

What are the chances of _such_ a bug in _both_ OSes?
I'd like to mention that I ran qmail on FreeBSD (starting from 3.x all
the way to the latest) for a couple of years and _never_ observed this
behaviour on FreeBSD.

Is it possible that some external devices s.a.
switch/router/firewall/anything could be causing this problem?


-- 
  Eugene Miretskiy [EMAIL PROTECTED]
  InVision.com, INC.  (631) 543-1000
  www.invision.net  /  www.longisland.com 



Re: qmail-remote (cry wolf?)

2001-06-08 Thread Charles Cazabon

Yevgeniy Miretskiy [EMAIL PROTECTED] wrote:
 On Fri, Jun 08, 2001 at 09:47:16PM +, Mark wrote:
  Then it's an OS bug.
  
  qmail-remote only gets to the read() if the OS (via select() ) says
  that the read will not block. Ergo, the OS is lying.
 
 If it's OS bug, anybody heard/knows of such severe network related
 bug in RedHat 6.2?

Many, especially with earlier kernels.  Upgrade to 2.2.19-6.2.1 if you haven't
already. 2.2.14 in particular was a nasty one, at least as shipped by RedHat.
And no, I'm not trolling -- I use RedHat myself.

 What are the chances of _such_ bug in _both_ OSes?

Coincidences happen.

 Is it possible that some external devices s.a.
 switch/router/firewall/anything could be causing this problem?

Yes, very possible.  Some firewalls do transparent SMTP or POP proxying, and
there have been many bugs in such implementations.

Charles
-- 
---
Charles Cazabon[EMAIL PROTECTED]
GPL'ed software available at:  http://www.qcc.sk.ca/~charlesc/software/
Any opinions expressed are just that -- my opinions.
---



qmail-remote (cry wolf?)

2001-06-07 Thread Jörgen Persson

Sorry, but I'm not all comfortable with this...

There's been 4 similar reports of qmail-remote not behaving properly to
this list during the last month. 

http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html

We still haven't been able to help any of them...

This doesn't look like a coincidence to me since two of the reports
concerned the same recipient server (outblaze.com). Unfortunately it
seems related to network programming, which I know very little about.

Any other thoughts about this?

Jörgen



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Mark

On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson allegedly wrote:
 Sorry, but I'm not all comfortable with this...
 
 There's been 4 similar reports of qmail-remote not behaving properly to
 this list during the last month. 
 
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html
 
 We still haven't been able to help any of them...
 
 This doesn't look like a coincidence to me since two of the reports
 concerned the same recipient server (outblaze.com). Unfortunately it
 seems related to network programming, which I know very little about.
 
 Any other thoughts about this?

If it's an unpatched qmail-remote, then remain suspicious of some OS
bug. I spent a long time looking at qmail-remote when a similar
problem occurred on a Solaris 2.5 system (or maybe 2.6, I forget
now). Here are the two lines of code:

  if (select(fd + 1,&rfds,(fd_set *) 0,(fd_set *) 0,&tv) == -1) return -1;
  if (FD_ISSET(fd,&rfds)) return read(fd,buf,len);

That's about as simple as you can get!

I don't see any way that the read() call will occur without select()
returning the fdset bit. So, if select() says that a read can occur,
then the only reason that the read() can then block is if the OS is
lying...
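
To make the "select() said so" condition fully explicit, a variant can
also look at select()'s return value before trusting the fd_set. This
is only a sketch under assumptions: timeoutread_checked is a
hypothetical name, the signature uses modern types, and ETIMEDOUT
stands in for qmail's own timeout error code. It is not the code
shipped with qmail.

```c
#include <errno.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical variant of the two lines above: only read() when
   select() itself reports at least one ready descriptor. ETIMEDOUT
   stands in for qmail's own timeout error code. */
ssize_t timeoutread_checked(int t, int fd, char *buf, size_t len)
{
    fd_set rfds;
    struct timeval tv;
    int r;

    tv.tv_sec = t;
    tv.tv_usec = 0;

    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    r = select(fd + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &tv);
    if (r == -1) return -1;                 /* select() failed */
    if (r > 0 && FD_ISSET(fd, &rfds)) return read(fd, buf, len);

    errno = ETIMEDOUT;                      /* r == 0: timed out */
    return -1;
}
```

With this shape, read() is reached only when select() returns a
positive count and the bit for fd is set, whatever state the fd_set is
left in after a timeout.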

So, it's kinda hard to see a problem with qmail-remote here. Do OSes
ever get it wrong? Sure.

If this is a relatively widespread problem, then you might want to put
an alarm() handler into qmail-remote, but if you can't rely on the OS,
all bets are really off, right?


Regards.



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Mark

  What are the probabilities of the Sendmail server being the one causing
 the problems? What if the mail admin of mg.hk5.outblaze.com has used
 some sort of patch that is causing qmail-remote's to hang? Has anyone
 communicated with outblaze.com's postmaster?

There is nothing a remote system can do that will hang qmail-remote on
a correctly functioning OS. If the local TCP stack has accepted data
and indicated that it is available via the select() return, then the remote
system has no further say as the read() only fetches the data
previously received.

I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be
that the local TCP stack - in some circumstance - discards unread
data, *then* marks the local socket as unreadable, rather than around
the other way. That sort of window would wedge the select/read
sequence in qmail-remote.


Regards.



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Uwe Ohse

On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson wrote:
 
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html

redhat 6.2, outblaze.com

 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html

redhat 6.2, outblaze.com

 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html

FreeBSD 4.something, not outblaze.com.
Some of the hosts are just unreachable.


 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html

redhat 6.2, outblaze.com

 
 We still haven't been able to help any of them...

Somebody with rh-6.2 / linux-2.2.14 might want to look into this,
though i'd recommend a kernel upgrade. There have been a number
of networking problems in the linux kernel, and some of them were
quite awful and hard to trigger.
I remember that i downgraded my home server to 2.2.13 after 
2.2.14 broke my rsync-over-ssh or network tar.

Regards, Uwe



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Dave Sill

Jörgen Persson [EMAIL PROTECTED] wrote:

There's been 4 similar reports of qmail-remote not behaving properly to
this list during the last month.

http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html
http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html

We still haven't been able to help any of them...

This doesn't look like a coincidence to me since two of the reports
concerned the same recipient server (outblaze.com). Unfortunately it
seems related to network programming, which I know very little about.

Any other thoughts about this?

Three of the four are running Red Hat 6.2. That could simply be
because 75% of qmail systems are running RH 6.2, though. :-)

No word on which qmail patches, if any, were installed on these
systems.

-Dave



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Greg White

On Thu, Jun 07, 2001 at 07:36:53PM +0200, Jörgen Persson wrote:
 Sorry, but I'm not all comfortable with this...
 
 There's been 4 similar reports of qmail-remote not behaving properly to
 this list during the last month. 
 
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg00558.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/05/msg01332.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00283.html
 http://www.ornl.gov/its/archives/mailing-lists/qmail/2001/06/msg00426.html
 
 We still haven't been able to help any of them...


Could Neil Kandalgaonkar, Eric Wang, Troy Settle, and Yevgeniy Miretskiy
perhaps get together and compare notes? Do you all share an OS? (I
noticed that two posters appeared to mention RH6.2 -- is this the case
for all?) Is there another factor that you all share? (I do note that
geography does not appear to be a factor)... This information could
allow us to get somewhere.

If needed, I'm willing to create a mini-list ala .qmail-something to
address all four of the OPs

 
 This doesn't look like a coincidence to me since two of the reports
 concerned the same recipient server (outblaze.com). Unfortunately it
 seems related to network programming, which I know very little about.

It's really tough to even know what to look at at this point... As soon
as I saw that outblaze was in HK, I thought of geographical/routing
issues, but none of the posters seems to share common geography. Hmmm...

-- 
Greg White



Re: qmail-remote (cry wolf?)

2001-06-07 Thread Mike Jackson

Jörgen Persson wrote:
 
 Sorry, but I'm not all comfortable with this...
 
 There's been 4 similar reports of qmail-remote not behaving properly to
 this list during the last month.

 We still haven't been able to help any of them...
 
 This doesn't look like a coincidence to me since two of the reports
 concerned the same recipient server (outblaze.com). Unfortunately it
 seems related to network programming, which I know very little about.
 
 Any other thoughts about this?
 
 Jörgen

Hi,
 Just a little investigation.

$ nslookup

 set type=mx
 outblaze.com

outblaze.com    preference = 20, mail exchanger = mg.hk5.outblaze.com
outblaze.com    preference = 10, mail exchanger = spf1.hq.outblaze.com


 I was curious if they both ran the same MTA, so I checked it out.


$ telnet spf1.hq.outblaze.com 25
Trying 202.77.223.28...
Connected to spf1.hq.outblaze.com.
Escape character is '^]'.
220 spf1.hq.outblaze.com ESMTP Postfix

$ telnet mg.hk5.outblaze.com 25
Trying 202.123.209.152...
Connected to mg.hk5.outblaze.com.
Escape character is '^]'.
220 mg.hk5.outblaze.com ESMTP Sendmail 8.11.2/8.11.2; Thu, 7 Jun 2001
19:26:17 GMT


 What are the probabilities of the Sendmail server being the one causing
the problems? What if the mail admin of mg.hk5.outblaze.com has used
some sort of patch that is causing qmail-remote's to hang? Has anyone
communicated with outblaze.com's postmaster?

--
Mike



Re: qmail-remote (cry wolf?)

2001-06-07 Thread David Lowe

Mark et al. -

It *is* possible, though, for qmail-remote to move slowly enough that it
appears to hang (yes, even for hours or days).  timeoutremote applies to
every read() and write() - in the very worst case, each of these system
calls might move only a single byte.

Consider a 5000 byte message and a timeoutremote set to 1200 seconds (the
default).  The worst case just for sending the data alone - not including
smtp overhead and reading responses from the remote server - is almost 70
days (1200 * 5000 / 60 / 60 / 24 ~= 69.44).
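
The same arithmetic as a small function, for trying other message
sizes and timeouts. This is only a sketch: the name worst_case_days is
hypothetical, and it assumes every call moves exactly one byte after
waiting the full timeout.

```c
/* Worst-case transfer time in days if every read() or write() moves a
   single byte and each one waits the full per-call timeout. */
double worst_case_days(double message_bytes, double timeout_seconds)
{
    return message_bytes * timeout_seconds / 60.0 / 60.0 / 24.0;
}
```

worst_case_days(5000, 1200) gives roughly 69.44, matching the figure
above.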

Granted, this is extremely unlikely, but you get the idea - some scenarios
can cause qmail-remote to move extraordinarily slowly, while still
functioning correctly - that is, within the limits imposed by
timeoutremote.

IMO, the best thing to look at from the people having this problem would
be the output from whatever your system call tracing software is (ktrace
for FreeBSD, truss for Solaris, strace for Linux...), run on the offending
qmail-remote process.  If there is no output for over 'timeoutremote'
seconds, there's almost certainly a TCP stack bug; otherwise, I'd tend to
blame the problem described above.

Thanks,
David Lowe

On 7 Jun 2001, Mark wrote:

   What are the probabilities of the Sendmail server being the one causing
  the problems? What if the mail admin of mg.hk5.outblaze.com has used
  some sort of patch that is causing qmail-remote's to hang? Has anyone
  communicated with outblaze.com's postmaster?
 
 There is nothing a remote system can do that will hang qmail-remote on
 a correctly functioning OS. If the local TCP stack has accepted data
 and indicated available via the select() return, then the remote
 system has no further say as the read() only fetches the data
 previously received.
 
 I'll bet it's an OS bug - most likely in the TCP stack. Eg, it may be
 that the local TCP stack - in some circumstance - discards unread
 data, *then* marks the local socket as unreadable, rather than around
 the other way. That sort of window would wedge the select/read
 sequence in qmail-remote.
 
 
 Regards.
 











Re: qmail-remote (cry wolf?)

2001-06-07 Thread Mark

On Thu, Jun 07, 2001 at 04:39:25PM -0700, David Lowe allegedly wrote:
 Mark et al. -
 
 It *is* possible, though, for qmail-remote to move slowly enough that it
 appears to hang (yes, even for hours or days).  timeoutremote applies to
 every read() and write() - in the very worst case, each of these system
 calls might move only a single byte.
 
 Consider a 5000 byte message and a timeoutremote set to 1200 seconds (the
 default).  The worst case just for sending the data alone - not including
 smtp overhead and reading responses from the remote server - is almost 70
 days (1200 * 5000 / 60 / 60 / 24 ~= 69.44).
 
 Granted, this is extremely unlikely, but you get the idea - some scenarios
 can cause qmail-remote to move extraordinarily slowly, while still
 functioning correctly - that is, within the limits imposed by
 timeoutremote.

Right. But in that case, the syscall trace would show qmail-remote
blocked on the select() not the read(). The read() only gets executed
if select() says there is data in which case the read does not block
and the code immediately loops back on the select() again.

As I recall, the original syscall trace showed qmail-remote blocked on
the read().


Regards.