Re: Problems with qmail-remote hanging

2001-07-31 Thread MarkD

>   Setting an alarm is a nasty hack in my opinion, but I have to admit
> that it's something I considered.

Well, the qmail-remote connection is well and truly wedged once it's
in this state, and if the select() had timed out as it's meant to,
qmail-remote would have exited with a delivery failure indication, so
it's not that bad a hack. It's also very easy to code: just a single
alarm() call at the top of main().

> A slightly neater solution might be to use
> the SO_KEEPALIVE socket option - if it works (and there isn't a good reason
> not to use it) that is.

It'll be interesting to hear if this works.

>   What would be better is finding out why this happens, of course.

Indeed. Does Linux offer tools or syscalls that would tell you why
select() reported the socket readable, but the read() then blocked?
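One cheap diagnostic, offered as a sketch rather than an answer to "why": right after select() reports the socket readable, ioctl(FIONREAD) (available on Linux via <sys/ioctl.h>) returns how many bytes the kernel has queued. The names bytes_queued(), log_queued() and demo_fionread() are hypothetical. The demo uses a local socketpair() in place of the SMTP connection.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Bytes the kernel reports as queued for reading on fd, or -1 on error. */
static int bytes_queued(int fd)
{
    int n = 0;

    assert(fd >= 0);
    if (ioctl(fd, FIONREAD, &n) == -1) return -1;
    return n;
}

/* Hypothetical hook: call after select() has said fd is readable and just
   before read(). A count of 0 there suggests end of file or a pending socket
   error (read() then returns at once) or else a spurious readiness report. */
static void log_queued(int fd)
{
    fprintf(stderr, "fd %d: FIONREAD reports %d byte(s) queued\n",
            fd, bytes_queued(fd));
}

/* Local demo: queue 5 bytes on one end of a socket pair and ask the kernel
   how many are waiting on the other end. Returns that count, or -1. */
static int demo_fionread(void)
{
    int sv[2];
    int n;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;
    if (write(sv[0], "hello", 5) != 5)
        n = -1;
    else
        n = bytes_queued(sv[1]);
    log_queued(sv[1]);
    close(sv[0]);
    close(sv[1]);
    return n;
}
```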

> P.S. If anyone is keeping track, Linux 2.2.19, concurrencyremote set to 200

I hesitate to say this, but Linux kernels seem to predominate in these
reports, though that may just be because more qmail installations run
on Linux than on other Unixen.


Regards.



RE: Problems with qmail-remote hanging

2001-07-31 Thread Richard Underwood

> This problem's been reported before. If your OS says that an fd is
> readable via select(), then the read() should not block.
> 
> As you observe though, the read is blocking so your OS is probably not
> telling the truth when it returns from the select().
> 
> The archives have plenty of discussion on this and the simplest
> solution is to put a large-value alarm() handler in qmail-remote. No
> one as yet seems to be able to narrow down which OSes do this and
> under what circumstances.

Mark,

Thanks for the reply. I only seem to experience the problem with
large mail-outs. One possibility is that because of the way qmail works,
there's a significant chance that we will be making a large number of
simultaneous connections to some servers.

It's possible that this is causing a connection to be blackholed
somewhere, but that doesn't explain why select() and read() disagree.
Perhaps select() thinks the connection is closed, but read() doesn't.
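For comparison, here is how a closed connection normally behaves. When the peer closes, select() does report the descriptor readable, but read() then returns 0 (end of file) at once instead of blocking. A sketch, using an AF_UNIX socketpair() as a local stand-in for the TCP connection. readable_now() and probe_closed_peer() are hypothetical names.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* 1 if select() with a zero timeout reports fd readable, else 0. */
static int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    if (select(fd + 1, &rfds, NULL, NULL, &tv) <= 0) return 0;
    return FD_ISSET(fd, &rfds) != 0;
}

/* The "peer" end of a connected socket pair hangs up, then the other end is
   checked. Stores select()'s verdict in *readable and read()'s return value
   in *got. Returns 0 on success, -1 if the socket pair could not be made. */
static int probe_closed_peer(int *readable, ssize_t *got)
{
    int sv[2];
    char buf[16];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1) return -1;
    close(sv[0]);                         /* peer closes its end */
    *readable = readable_now(sv[1]);      /* end of file counts as readable */
    *got = read(sv[1], buf, sizeof buf);  /* returns 0 at once, no blocking */
    close(sv[1]);
    return 0;
}
```

So a connection that select() has correctly seen as closed should not leave read() blocked.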

Setting an alarm is a nasty hack in my opinion, but I have to admit
that it's something I considered. A slightly neater solution might be to use
the SO_KEEPALIVE socket option - if it works (and there isn't a good reason
not to use it) that is.
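A minimal sketch of turning SO_KEEPALIVE on and reading it back. enable_keepalive() and keepalive_enabled() are hypothetical helpers. Where to call them in qmail-remote is an assumption; the natural spot would be just after its SMTP socket is created.

There are two caveats. On Linux the first probe goes out only after the connection has been idle for net.ipv4.tcp_keepalive_time, 7200 seconds by default. Keepalive also only detects a peer that has gone away. A peer that still answers the probes but never sends data would keep the connection open.

```c
#include <assert.h>
#include <sys/socket.h>

/* Enables SO_KEEPALIVE on fd. Returns 0 on success, -1 on failure. */
static int enable_keepalive(int fd)
{
    int on = 1;

    assert(fd >= 0);
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Reads the option back: 1 if enabled, 0 if not, -1 on error. */
static int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) == -1) return -1;
    return val != 0;
}
```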

What would be better is finding out why this happens, of course.

Thanks,

Richard

P.S. If anyone is keeping track, Linux 2.2.19, concurrencyremote set to 200



Re: Problems with qmail-remote hanging

2001-07-30 Thread MarkD

>   I've been running qmail on a number of platforms quite happily for a
> while - until now I've had no problems at all. However, I am now
> experiencing a problem with qmail-remote hanging.

>   The problem I see is with qmail-remote failing to terminate when a
> connection times-out. If left alone, the number of "stuck" processes will
> slowly climb, after about a month I had about 25 such processes. The network
> connections remain in the "ESTABLISHED" state.
> 
>   Looking at the process list right now, I have one stuck:
> 
> # ps -ef | grep qmail-remote
> qmailr   12278   662  0 13:13 ?        00:00:00 qmail-remote xx.co.uk xx
> qmailr   19876   662  0 16:09 ?        00:00:00 qmail-remote xx.com
> root     19912 19489  0 16:10 pts/0    00:00:00 grep qmail-remote
> 
> # strace -p 12278
> read(3,  
> 
>   ... all socket read()s in qmail-remote should be protected by a
> select and therefore should not block as this one is doing now. After
> recompiling with debugging and symbols, I get ...

Exactly.

This problem's been reported before. If your OS says that an fd is
readable via select(), then the read() should not block.

As you observe though, the read is blocking so your OS is probably not
telling the truth when it returns from the select().

The archives have plenty of discussion on this and the simplest
solution is to put a large-value alarm() handler in qmail-remote. No
one as yet seems to be able to narrow down which OSes do this and
under what circumstances.


Regards.



Problems with qmail-remote hanging

2001-07-30 Thread Richard Underwood

Hi,

I've been running qmail on a number of platforms quite happily for a
while - until now I've had no problems at all. However, I am now
experiencing a problem with qmail-remote hanging.

I'm running qmail on this server for sending mail from websites and
bulk mail-outs (up to about 40,000 recipients). The server itself
doesn't receive much mail.

It's a dual-CPU Dell running Linux. I have another, very similar
installation which has no problems at all. qmail on this server is
100% standard qmail 1.03.

The problem I see is qmail-remote failing to terminate when a
connection times out. If left alone, the number of "stuck" processes
slowly climbs; after about a month I had about 25 of them. Their network
connections remain in the "ESTABLISHED" state.

Looking at the process list right now, I have one stuck:

# ps -ef | grep qmail-remote
qmailr   12278   662  0 13:13 ?        00:00:00 qmail-remote xx.co.uk xx
qmailr   19876   662  0 16:09 ?        00:00:00 qmail-remote xx.com
root     19912 19489  0 16:10 pts/0    00:00:00 grep qmail-remote

# strace -p 12278
read(3,  

... all socket read()s in qmail-remote should be protected by a
select and therefore should not block as this one is doing now. After
recompiling with debugging and symbols, I get ...

# gdb qmail-remote 12278
GNU gdb 5.0
Attaching to program: /home/qmail/bin/qmail-remote, Pid 12278
Reading symbols from /lib/libresolv.so.2...done.
Loaded symbols for /lib/libresolv.so.2
Reading symbols from /lib/libc.so.6...done.
Loaded symbols for /lib/libc.so.6
Reading symbols from /lib/ld-linux.so.2...done.
Loaded symbols for /lib/ld-linux.so.2
0x40103424 in __libc_read () from /lib/libc.so.6
(gdb) where
#0  0x40103424 in __libc_read () from /lib/libc.so.6
#1  0x3b654f80 in ?? ()
#2  0x8048f05 in saferead (fd=-1, buf=0x8051180 "", len=128)
at qmail-remote.c:113
#3  0x804d193 in oneread (op=0x8048ee8 <saferead>, fd=-1, buf=0x8051180 "", 
len=128) at substdi.c:14
#4  0x804d25e in substdio_feed (s=0x804f3d0) at substdi.c:44
#5  0x804d3ab in substdio_get (s=0x804f3d0, buf=0xbdc7 "", len=1)
at substdi.c:75
#6  0x8048f70 in get (ch=0xbdc7 "") at qmail-remote.c:137
#7  0x8048fda in smtpcode () at qmail-remote.c:150
#8  0x80492cb in smtp () at qmail-remote.c:225
#9  0x8049d31 in main (argc=4, argv=0xbe94) at qmail-remote.c:420
#10 0x4004bf31 in __libc_start_main (main=0x804987c <main>, argc=4, 
ubp_av=0xbe94, init=0x804878c <_init>, fini=0x804dd10 <_fini>, 
rtld_fini=0x4000e274 <_dl_fini>, stack_end=0xbe8c)
at ../sysdeps/generic/libc-start.c:129

... in smtp() ...

220 {
221   unsigned long code;
222   int flagbother;
223   int i;
224  
225 =>if (smtpcode() != 220) quit("ZConnected to "," but greeting failed");
226  
227   substdio_puts(&smtpto,"HELO ");
228   substdio_put(&smtpto,helohost.s,helohost.len);
229   substdio_puts(&smtpto,"\r\n");

saferead() calls timeoutread(), which calls select() and then read().
The fd=-1 is a red herring; it's not used by saferead() in qmail-remote.
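A sketch of the select()-then-read() pattern described above. It is written from that description, not copied from qmail-remote.c or timeoutread.c, and the name timeout_read() and its argument order are assumptions. The point is that read() is reached only after select() has reported the descriptor readable. A read() that then blocks, as in the trace, means that readiness report did not hold.

```c
#include <assert.h>
#include <errno.h>
#include <sys/select.h>
#include <sys/socket.h>  /* socketpair(), only for trying the sketch locally */
#include <unistd.h>

/* Waits up to `seconds` for fd to become readable, then read()s from it.
   Returns read()'s result (0 on end of file), or -1 with errno set:
   ETIMEDOUT if nothing became readable in time, else select()'s error. */
static ssize_t timeout_read(long seconds, int fd, char *buf, size_t len)
{
    fd_set rfds;
    struct timeval tv;

    assert(fd >= 0 && fd < FD_SETSIZE);  /* FD_SET() is undefined otherwise */

    tv.tv_sec = seconds;
    tv.tv_usec = 0;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    if (select(fd + 1, &rfds, NULL, NULL, &tv) == -1) return -1;
    if (FD_ISSET(fd, &rfds))
        return read(fd, buf, len);  /* expected not to block: select() said
                                       readable. This is the read() stuck
                                       in the trace. */

    errno = ETIMEDOUT;  /* stand-in for whatever timeout code the caller uses */
    return -1;
}
```

Tried locally on a socketpair(), it returns the queued bytes when data is waiting. With no data it returns -1 with ETIMEDOUT after the timeout. After the peer closes it returns 0.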

Can anyone explain this, or has anyone experienced anything similar?

Thanks,

Richard