On Fri, Jun 16, 2017 at 09:11:29AM +0200, Gilles Chehade wrote:
> On Fri, Jun 16, 2017 at 01:05:25AM +0300, Henri Kemppainen wrote:
> > I had a little debugging session with awolk@ over at #openbsd-daily. His
> > smtpd would over time end up with hung sessions that never timeout.
> >
> > The problem is related to the data_io path's congestion control which
> > may pause the session. In this case the io system will not wait for
> > read events and as such will not have a chance to timeout until it is
> > resumed.
> >
> > If the pause happens when a full message is just about to pass through
> > the data_io path, the session is never resumed, even though there is
> > obviously no more congestion and the session should be reading more
> > input from the client again.
> >
> > A debug trace excerpt shows the course of events:
> >
> > mtp: 0xe54baa1e000: IO_DATAIN <io:0xe551d0d5000 fd=25 to=300000 fl=R
> > ib=16839 ob=0>
> > debug: smtp: 0xe54baa1e000: filter congestion: pausing session
> > smtp: 0xe54baa1e000 (data): IO_LOWAT <io:0xe55ac8e8c00 fd=32 to=-1 fl=W
> > ib=0 ob=0>
> > debug: smtp: 0xe54baa1e000: data io done (259146 bytes)
> > debug: 0xe54baa1e000: end of message, msgflags=0x0000
> > smtp: 0xe54baa1e000: >>> 250 2.0.0: 4f36f9e3 Message accepted for delivery
> > smtp: 0xe54baa1e000: STATE_BODY -> STATE_HELO
> > smtp: 0xe54baa1e000: IO_LOWAT <io:0xe551d0d5000 fd=25 to=300000 fl=W,F_PI
> > ib=0 ob=0>
> >
> > From this point on, session 0xe54baa1e000 and its io 0xe551d0d5000
> > (which has the pause_in flag) are never seen again in the trace, and
> > fstat shows a corresponding connection to smtpd that never goes away.
> >
> > The proposed fix is to always resume the session if the data_io path
> > hits the low water mark.
> >
> > Mr. Wolk tested this diff against smtpd on 6.1 as well as a against
> > -current version of smtpd (compiled on the same system running 6.1).
> >
>
> Nice catch, the diff reads fine to me, I'll commit later today when I
> have another ok from eric@
Yes, this looks correct. But, I would rather move the resume test before
the EOM test, to avoid touching the session after the transfer has been
finalized by smtp_data_io_done().
Eric.
> side note, smtpd should not have been able to leak more than 5 fd, it
> should not have been able to exhaust, is this what you observed ?
>
> > Index: usr.sbin/smtpd/smtp_session.c
> > ===================================================================
> > RCS file: /cvs/src/usr.sbin/smtpd/smtp_session.c,v
> > retrieving revision 1.303
> > diff -u -p -r1.303 smtp_session.c
> > --- usr.sbin/smtpd/smtp_session.c 17 May 2017 14:00:06 -0000 1.303
> > +++ usr.sbin/smtpd/smtp_session.c 15 Jun 2017 20:28:12 -0000
> > @@ -1474,9 +1474,10 @@ smtp_data_io(struct io *io, int evt, voi
> > break;
> >
> > case IO_LOWAT:
> > - if (s->tx->dataeom && io_queued(s->tx->oev) == 0) {
> > + if (s->tx->dataeom && io_queued(s->tx->oev) == 0)
> > smtp_data_io_done(s);
> > - } else if (io_paused(s->io, IO_IN)) {
> > +
> > + if (io_paused(s->io, IO_IN)) {
> > log_debug("debug: smtp: %p: filter congestion over:
> > resuming session", s);
> > io_resume(s->io, IO_IN);
> > }
> >
> >
>
> --
> Gilles Chehade
>
> https://www.poolp.org @poolpOrg
>