Re: Postfix benchmark: bug or performance regression ?

2019-03-29 Thread Wietse Venema
Ralf Hildebrandt:
> * Viktor Dukhovni :
> > > On Mar 28, 2019, at 12:03 PM, Wietse Venema  wrote:
> > > 
> > > And thank you for your thorough investigation that helped to narrow
> > > down the root cause: under high traffic conditions, LMTP connections
> > > are cached but never reused; therefore those idle cached connections
> > > are exhausting server resources.
> > 
> > Yes, thanks.  I was similarly impressed by the OP's detailed report.
> > 
> > For the record, in case anyone has not read the entire thread, the issue
> > is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.
> 
> Ah, excellent. I was already afraid my setup would occasionally suffer
> from this effect.

LMTP over TCP uses the same code paths as SMTP.
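
For illustration, transport(5) entries for the two flavors might look
like this (the socket path below is the one from the OP's logs; the TCP
host and port are invented for the example):

    example.com    lmtp:unix:/var/imap/socket/lmtp
    example.net    lmtp:inet:mailstore.example.net:24

Only the unix: form is affected by the reuse mismatch discussed in this
thread; the inet: form goes through the same nexthop-based connection
reuse logic as SMTP.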

Wietse


Re: Postfix benchmark: bug or performance regression ?

2019-03-29 Thread Ralf Hildebrandt
* Viktor Dukhovni :
> > On Mar 28, 2019, at 12:03 PM, Wietse Venema  wrote:
> > 
> > And thank you for your thorough investigation that helped to narrow
> > down the root cause: under high traffic conditions, LMTP connections
> > are cached but never reused; therefore those idle cached connections
> > are exhausting server resources.
> 
> Yes, thanks.  I was similarly impressed by the OP's detailed report.
> 
> For the record, in case anyone has not read the entire thread, the issue
> is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.

Ah, excellent. I was already afraid my setup would occasionally suffer
from this effect.

-- 
[*] sys4 AG

https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München
   
Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein


Re: Postfix benchmark: bug or performance regression ?

2019-03-28 Thread Viktor Dukhovni
> On Mar 28, 2019, at 12:03 PM, Wietse Venema  wrote:
> 
> And thank you for your thorough investigation that helped to narrow
> down the root cause: under high traffic conditions, LMTP connections
> are cached but never reused; therefore those idle cached connections
> are exhausting server resources.

Yes, thanks.  I was similarly impressed by the OP's detailed report.

For the record, in case anyone has not read the entire thread, the issue
is only with LMTP over unix-domain sockets.  LMTP over TCP is not affected.

-- 
Viktor.



Re: Postfix benchmark: bug or performance regression ?

2019-03-28 Thread Wietse Venema
Juliana Rodrigueiro:
> > To get rid of the 2s delays:
> >
> > /etc/postfix/main.cf:
> > lmtp_connection_cache_on_demand = no
> >
> > Please let us know if that helps. Meanwhile we can develop a proper fix.
>
> And yes, it worked; that helped a lot. It is not as fast as before, but
> now I see only reasonable delays.
>
> Thank you very much for looking into it!!!

And thank you for your thorough investigation that helped to narrow
down the root cause: under high traffic conditions, LMTP connections
are cached but never reused; therefore those idle cached connections
are exhausting server resources.

Wietse


Re: Postfix benchmark: bug or performance regression ?

2019-03-28 Thread Juliana Rodrigueiro
On Wednesday, 27 March 2019 20:01:49 CET Viktor Dukhovni wrote:
> On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:
> > However, during a benchmark, we realized 3.3.2 was 5 times slower than the
> > version before.
> 
> This is misleading.  Postfix is not 5 times slower, your benchmark
> appears to be measuring the LMTP delivery rate to a single sink
> mailbox, and this artificial work-load may be slower under some
> adverse conditions (below...).
> 

Although the benchmark produces a very large artificial work-load, this
scenario is still a valid use case, even more so considering that as few
as 20 emails are enough to clog up deliveries.

The stark difference in the LMTP server (Cyrus 2.4) log messages is that
most deliveries reuse the same session id when running Postfix <= 2.11.0:

Mar 28 10:22:22 localdomain lmtpunix[4881]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28047> guid=<521e5e4a35095385881941a77f3035f6087c8942>
Mar 28 10:22:22 localdomain lmtpunix[4881]: Delivered: <20190328092221.f2f6...@localdomain.com> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4881]: USAGE dst user: 0.00 sys: 0.00
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28048> guid=
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered: <2019032809.0044...@localdomain.com> to mailbox: user.dst
...
...
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28060> guid=<0dc9e922f8b8cfc01ccc5bfeafa8220812adf8fa>
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered: <2019032809.0da1...@localdomain.com> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4926]: USAGE dst user: 0.00 sys: 0.004000
Mar 28 10:22:22 localdomain lmtpunix[6449]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28061> guid=<781d02ef98dde39e03461a60aa18beb858f8a5d2>
Mar 28 10:22:22 localdomain lmtpunix[6449]: Delivered: <2019032809.0ea9...@localdomain.com> to mailbox: user.dst


On the other hand, with version >= 2.11.1, each session is used only once
per process.

Mar 28 10:04:06 localdomain lmtpunix[4273]: Delivered: <20190328090404.7cf8...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4273]: USAGE dst user: 0.004000 sys: 0.00
Mar 28 10:04:06 localdomain lmtpunix[4292]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28032> guid=<165f88d841750f2cc9a59a74e777ada4c884dae3>
Mar 28 10:04:06 localdomain lmtpunix[4292]: Delivered: <20190328090404.7dee...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4292]: USAGE dst user: 0.00 sys: 0.00
Mar 28 10:04:06 localdomain lmtpunix[4295]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28033> guid=
...
...
Mar 28 10:04:10 localdomain lmtpunix[4295]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28041> guid=<7ba5808ec13795785ab5a5737cfef97ff24d307a>
Mar 28 10:04:10 localdomain lmtpunix[4295]: Delivered: <20190328090404.87ac...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:10 localdomain lmtpunix[4295]: USAGE dst user: 0.00 sys: 0.004000
Mar 28 10:04:12 localdomain lmtpunix[4273]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28042> guid=<8ee41c08e69e9cdf9be61b2f2bad1a2d2c590f12>
Mar 28 10:04:12 localdomain lmtpunix[4273]: Delivered: <20190328090404.88ae...@localdomain.com> to mailbox: user.dst

> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
> This delivery spent 6s waiting in the active queue and 2s to complete
> a connection.  Why are connections to the LMTP server so expensive?
> Is your syslog service logging synchronously?  Perhaps logging of
> connection establishment (in the LMTP server) is causing congestion
> at LMTP connect?  What is the LMTP server doing in connect and LHLO
> processing?
> 

Unlikely: the LMTP server doesn't log any more or less than it did before.

> > -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) != 0)
> > +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0)
> That's a necessary bugfix; without it the test is always true.
> 

I understand that the fix is necessary, and that is why I was puzzled by
the behaviour change.

> > Questions:
> > 
> > (1) Does it mean that the lmtp connections via socket are never cached or
> > reused?
> There is no caching by nexthop domain, since there is no nexthop
> domain for unix-domain LMTP sockets.  All there is is a nexthop
> socket address.  So session caching, if any, would naturally be
> just by endpoint address.  However, in 2.11 the smtp_connect_local()
> function also changed:
> 
> https://github.com/vdukhovni/postfix/commit/072fa2d1f7beaf41756c30a69ef96cb611461cec#diff-24650cb98191c311688117b61626fffbR506

Re: Postfix benchmark: bug or performance regression ?

2019-03-28 Thread Wietse Venema
Wietse Venema:
> Juliana Rodrigueiro:
> > Excerpt of maillog version > 2.11.1:
> > Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> > Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> > Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed
> 
> The a/b/c/d delays logging shows:
> a = time from message arrival to last active queue entry
> b = time from last active queue entry to connection setup
> c = time in connection setup, including DNS, EHLO and STARTTLS
> d = time in message transmission
> 
> Here the delays are 0.01s or less.
> 
> > Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> The message had to wait 2s until a delivery agent became available, for
> reasons that will become clear later.
> 
> > Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> The message had to wait 2s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
> 
> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> This message had to wait 6s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
> 
> There is a clear pattern in here: under load, the LMTP client spends
> 2s in connection setup (including DNS lookup).
> 

To get rid of the 2s delays:

/etc/postfix/main.cf:
lmtp_connection_cache_on_demand = no

Please let us know if that helps. Meanwhile we can develop a proper fix.
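
For anyone applying the workaround on a stock installation, one way to
set it and confirm that it took effect (assuming postconf(1) and root
access) is:

    # postconf -e "lmtp_connection_cache_on_demand = no"
    # postfix reload
    # postconf lmtp_connection_cache_on_demand
    lmtp_connection_cache_on_demand = no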

Wietse


Re: Postfix benchmark: bug or performance regression ?

2019-03-27 Thread Viktor Dukhovni
> On Mar 27, 2019, at 3:01 PM, Viktor Dukhovni  wrote:
> 
> There's likely a bug.  We should either simulate a synthetic nexthop
> ($myhostname?) for unix-domain destinations, and then do nexthop
> reuse (and perhaps do no caching by endpoint address for unix-domain
> destinations), or go back to using re-use by endpoint address.
> Currently there's a mismatch, and we're keeping connections alive
> that can't be reused.

And of course the nexthop can't be "$myhostname".  If there are multiple
LMTP unix-domain socket destinations, we must not mix them up, so the
nexthop needs to be just the socket pathname.
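
To make the collision concrete, consider two unix-domain LMTP
destinations in a transport(5) map (the first path is from the OP's
logs; the second is invented for the example):

    example.com    lmtp:unix:/var/imap/socket/lmtp
    example.org    lmtp:unix:/var/imap2/socket/lmtp

With a shared synthetic nexthop such as "$myhostname", both would get
the same cache key, and a cached connection to one socket could be
handed out for deliveries meant for the other. Keying on the socket
pathname keeps them apart.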

-- 
Viktor.



Re: Postfix benchmark: bug or performance regression ?

2019-03-27 Thread Wietse Venema
Juliana Rodrigueiro:
> Excerpt of maillog version > 2.11.1:
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed

The a/b/c/d delays logging shows:
a = time from message arrival to last active queue entry
b = time from last active queue entry to connection setup
c = time in connection setup, including DNS, EHLO and STARTTLS
d = time in message transmission

Here the delays are 0.01s or less.
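
As a worked example with the delay=8.1 record further down:
delays=0/6/2/0.02 decomposes as a=0s (arrival to active queue), b=6s
(waiting for a free delivery agent), c=2s (connection setup) and
d=0.02s (transmission); the components are rounded for logging, so
0 + 6 + 2 + 0.02 = 8.02s agrees with the logged total of 8.1s up to
rounding.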

> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

The message had to wait 2s until a delivery agent became available, for
reasons that will become clear later.

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

The message had to wait 2s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

This message had to wait 6s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

There is a clear pattern in here: under load, the LMTP client spends
2s in connection setup (including DNS lookup).

Also, the connections are not reused, which may be the result of a
Postfix change that I will look into.

But connection reuse is an optimization; it should never take 2s
to set up an LMTP connection, not even if the LMTP server is on the
other side of the world.

The core problem is that under load, it takes 2s to set up an LMTP
connection. If you can fix that, then Postfix will perform fine
with or without connection reuse.

What kind of LMTP server are you using?

Wietse


Re: Postfix benchmark: bug or performance regression ?

2019-03-27 Thread Viktor Dukhovni
On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:

> However, during a benchmark, we realized 3.3.2 was 5 times slower than the 
> version before.

This is misleading.  Postfix is not 5 times slower, your benchmark
appears to be measuring the LMTP delivery rate to a single sink
mailbox, and this artificial work-load may be slower under some
adverse conditions (below...).

> With version >= 2.11.1, the first emails are received and processed almost
> instantly,

How many?  Are those connections closed or kept open in scache?

> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 7079468: from=, size=356, nrcpt=1 (queue active)
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02, delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2, delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

This message waited 2s in the active queue before it was its turn
to go, which suggests some other deliveries were still in progress...

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4, delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250

This delivery took 2 seconds to complete a connection, why?

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1, delays=0/6/2/0.02, dsn=2.1.5, status=sent (250

This delivery spent 6s waiting in the active queue and 2s to complete
a connection.  Why are connections to the LMTP server so expensive?
Is your syslog service logging synchronously?  Perhaps logging of
connection establishment (in the LMTP server) is causing congestion
at LMTP connect?  What is the LMTP server doing in connect and LHLO
processing?

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AB58261: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

This shows LMTP connection re-use, but that should not make much
difference with unix-domain sockets.  Local connections are supposed
to be fast.

> Mar 27 15:00:35 localdomain postfix/lmtp[26453]: AC80663: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], conn_use=2, delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

Again re-use, but the next connection is fresh:

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AD4EB67: to=, orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

> Going deeper with bisecting the code, I found out that the following line 
> created this behaviour change:
> 
> -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) != 0)
> +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0)

That's a necessary bugfix; without it the test is always true.
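
To spell out why, here is a minimal, self-contained sketch; the VSTRING
and STR() below are simplified stand-ins for the Postfix originals, not
the real definitions:

    #include <stdio.h>

    /* Simplified stand-ins for Postfix's VSTRING and STR(). STR()
     * yields the buffer pointer, which is never a null pointer,
     * even when the string content is empty. */
    typedef struct { char *buf; } VSTRING;
    #define STR(v) ((v)->buf)

    int main(void)
    {
        char    empty[] = "";
        VSTRING request_nexthop = { empty };

        /* Old test: compares the buffer POINTER against 0 (NULL).
         * A valid buffer pointer is never NULL, so this branch is
         * taken even when no nexthop was ever set. */
        if (STR(&request_nexthop) != 0)
            printf("old test: true for an empty nexthop\n");

        /* Fixed test: checks the FIRST BYTE, i.e. whether the
         * string is non-empty, which is what was intended. */
        if (STR(&request_nexthop)[0] != 0)
            printf("new test: nexthop state present\n");
        else
            printf("new test: no nexthop state\n");
        return (0);
    }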

> Questions:
> 
> (1) Does it mean that the lmtp connections via socket are never cached or 
> reused?

There is no caching by nexthop domain, since there is no nexthop
domain for unix-domain LMTP sockets.  All there is is a nexthop
socket address.  So session caching, if any, would naturally be
just by endpoint address.  However, in 2.11 the smtp_connect_local()
function also changed:


https://github.com/vdukhovni/postfix/commit/072fa2d1f7beaf41756c30a69ef96cb611461cec#diff-24650cb98191c311688117b61626fffbR506

to use smtp_reuse_nexthop() rather than smtp_reuse_addr(), so that
SASL-authenticated connections might be re-used. But this looks like a
mistake: we're creating cached sessions that will never be re-used, but
that tie up LMTP server resources, possibly leading to congestion under
load.
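
A minimal sketch of that failure mode, with a hypothetical one-slot
cache standing in for the real connection cache (this is not the
scache(8) interface, and the key choices are illustrative): when the
key used to save a session differs from the key used to look one up,
every lookup misses while the saved connection sits idle on the server.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical one-slot session cache; illustration only. */
    static char cached_key[256];
    static int  cached_fd = -1;

    static void save_session(const char *key, int fd)
    {
        snprintf(cached_key, sizeof(cached_key), "%s", key);
        cached_fd = fd;             /* connection stays open, idle */
    }

    static int reuse_session(const char *key)
    {
        if (cached_fd >= 0 && strcmp(cached_key, key) == 0)
            return (cached_fd);     /* hit: reuse the connection */
        return (-1);                /* miss: must connect afresh */
    }

    int main(void)
    {
        /* Matching keys: save and look up by endpoint address
         * (the socket pathname). Reuse works. */
        save_session("/var/imap/socket/lmtp", 7);
        printf("lookup by address: %s\n",
               reuse_session("/var/imap/socket/lmtp") >= 0 ? "hit" : "miss");

        /* Mismatched keys: save by endpoint address, but look up
         * by a nexthop "domain" that is empty for unix-domain
         * sockets. Every lookup misses, yet the cached connection
         * keeps an LMTP server process busy. */
        save_session("/var/imap/socket/lmtp", 8);
        printf("lookup by nexthop: %s\n",
               reuse_session("") >= 0 ? "hit" : "miss");
        return (0);
    }

Under load this compounds: each delivery leaves behind one more idle
server process until the LMTP server runs out of them.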

> (2) Is this expected behaviour? Is this a bug?

There's likely a bug.  We should either simulate a synthetic nexthop
($myhostname?) for unix-domain destinations, and then do nexthop
reuse (and perhaps do no caching by endpoint address for unix-domain
destinations), or go back to using re-use by endpoint address.
Currently there's a mismatch, and we're keeping connections alive
that can't be reused.

The fix will be easy, once we decide on the fine details.

-- 
Viktor.