Re: Postfix benchmark: bug or performance regression ?
Ralf Hildebrandt:
> * Viktor Dukhovni :
> > > On Mar 28, 2019, at 12:03 PM, Wietse Venema wrote:
> > > 
> > > And thank you for your thorough investigation that helped to narrow
> > > down the root cause: under high traffic conditions, LMTP connections
> > > are cached but never reused, therefore those idle cached connections
> > > are exhausting server resources.
> > 
> > Yes, thanks. I was similarly impressed by the OP's detailed report.
> > 
> > For the record, in case anyone has not read the entire thread, the issue
> > is only with LMTP over unix-domain sockets. LMTP over TCP is not affected.
> 
> Ah, excellent. I was already afraid my setup would occasionally suffer
> from this effect.

LMTP over TCP uses the same code paths as SMTP.

	Wietse
Re: Postfix benchmark: bug or performance regression ?
* Viktor Dukhovni :
> > On Mar 28, 2019, at 12:03 PM, Wietse Venema wrote:
> > 
> > And thank you for your thorough investigation that helped to narrow
> > down the root cause: under high traffic conditions, LMTP connections
> > are cached but never reused, therefore those idle cached connections
> > are exhausting server resources.
> 
> Yes, thanks. I was similarly impressed by the OP's detailed report.
> 
> For the record, in case anyone has not read the entire thread, the issue
> is only with LMTP over unix-domain sockets. LMTP over TCP is not affected.

Ah, excellent. I was already afraid my setup would occasionally suffer
from this effect.

-- 
[*] sys4 AG
https://sys4.de, +49 (89) 30 90 46 64
Schleißheimer Straße 26/MG, 80333 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein
Re: Postfix benchmark: bug or performance regression ?
> On Mar 28, 2019, at 12:03 PM, Wietse Venema wrote:
> 
> And thank you for your thorough investigation that helped to narrow
> down the root cause: under high traffic conditions, LMTP connections
> are cached but never reused, therefore those idle cached connections
> are exhausting server resources.

Yes, thanks. I was similarly impressed by the OP's detailed report.

For the record, in case anyone has not read the entire thread, the issue
is only with LMTP over unix-domain sockets. LMTP over TCP is not affected.

-- 
	Viktor.
Re: Postfix benchmark: bug or performance regression ?
Juliana Rodrigueiro:
> > To get rid of the 2s delays:
> > 
> > /etc/postfix/main.cf:
> >     lmtp_connection_cache_on_demand = no
> > 
> > Please let us know if that helps. Meanwhile we can develop a proper fix.
> 
> And yes, it worked, that helped a lot. Although not as fast as before,
> I now see only reasonable delays.
> 
> Thank you very much for looking into it!!!

And thank you for your thorough investigation that helped to narrow
down the root cause: under high traffic conditions, LMTP connections
are cached but never reused, therefore those idle cached connections
are exhausting server resources.

	Wietse
Re: Postfix benchmark: bug or performance regression ?
On Wednesday, 27 March 2019 20:01:49 CET Viktor Dukhovni wrote:
> On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:
> > However, during a benchmark, we realized 3.3.2 was 5 times slower
> > than the version before.
> 
> This is misleading. Postfix is not 5 times slower, your benchmark
> appears to be measuring the LMTP delivery rate to a single sink
> mailbox, and this artificial work-load may be slower under some
> adverse conditions (below...).

Although the benchmark produces a very large artificial work-load, this
scenario is still a very valid use case, even more so considering that
20 emails are enough to clog up deliveries.

The stark difference in the LMTP server (cyrus 2.4) log messages is that
most deliveries reuse the same session id when running postfix <= 2.11.0:

Mar 28 10:22:22 localdomain lmtpunix[4881]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28047> guid=<521e5e4a35095385881941a77f3035f6087c8942>
Mar 28 10:22:22 localdomain lmtpunix[4881]: Delivered: <20190328092221.f2f6...@localdomain.com> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4881]: USAGE dst user: 0.00 sys: 0.00
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28048> guid=
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered: <2019032809.0044...@localdomain.com> to mailbox: user.dst
...
Mar 28 10:22:22 localdomain lmtpunix[4926]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28060> guid=<0dc9e922f8b8cfc01ccc5bfeafa8220812adf8fa>
Mar 28 10:22:22 localdomain lmtpunix[4926]: Delivered: <2019032809.0da1...@localdomain.com> to mailbox: user.dst
Mar 28 10:22:22 localdomain lmtpunix[4926]: USAGE dst user: 0.00 sys: 0.004000
Mar 28 10:22:22 localdomain lmtpunix[6449]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28061> guid=<781d02ef98dde39e03461a60aa18beb858f8a5d2>
Mar 28 10:22:22 localdomain lmtpunix[6449]: Delivered: <2019032809.0ea9...@localdomain.com> to mailbox: user.dst

On the other hand, for version >= 2.11.1, the sessions are used only
once per process:

Mar 28 10:04:06 localdomain lmtpunix[4273]: Delivered: <20190328090404.7cf8...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4273]: USAGE dst user: 0.004000 sys: 0.00
Mar 28 10:04:06 localdomain lmtpunix[4292]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28032> guid=<165f88d841750f2cc9a59a74e777ada4c884dae3>
Mar 28 10:04:06 localdomain lmtpunix[4292]: Delivered: <20190328090404.7dee...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:06 localdomain lmtpunix[4292]: USAGE dst user: 0.00 sys: 0.00
Mar 28 10:04:06 localdomain lmtpunix[4295]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28033> guid=
...
Mar 28 10:04:10 localdomain lmtpunix[4295]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28041> guid=<7ba5808ec13795785ab5a5737cfef97ff24d307a>
Mar 28 10:04:10 localdomain lmtpunix[4295]: Delivered: <20190328090404.87ac...@localdomain.com> to mailbox: user.dst
Mar 28 10:04:10 localdomain lmtpunix[4295]: USAGE dst user: 0.00 sys: 0.004000
Mar 28 10:04:12 localdomain lmtpunix[4273]: auditlog: append sessionid= mailbox= uniqueid=<3aab65ba5c98d020> uid=<28042> guid=<8ee41c08e69e9cdf9be61b2f2bad1a2d2c590f12>
Mar 28 10:04:12 localdomain lmtpunix[4273]: Delivered: <20190328090404.88ae...@localdomain.com> to mailbox: user.dst

> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1,
> > delays=0/6/2/0.02, dsn=2.1.5, status=sent (250
> 
> This delivery spent 6s waiting in the active queue and 2s to complete
> a connection. Why are connections to the LMTP server so expensive?
> Is your syslog service logging synchronously? Perhaps logging of
> connection establishment (in the LMTP server) is causing congestion
> at LMTP connect? What is the LMTP server doing in connect and LHLO
> processing?

Unlikely, the LMTP server doesn't log any more or less than it did before.

> > -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) != 0)
> > +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0)
> 
> That's a necessary bugfix, without it the test is always true.

I understand the necessary fix, and that is why I was puzzled by the
behaviour change.

> > Questions:
> > 
> > (1) Does it mean that the lmtp connections via socket are never cached
> > or reused?
> 
> There is no caching by nexthop domain, since there is no nexthop
> domain for unix-domain LMTP sockets. All there is is a nexthop
> socket address. So session caching, if any, would naturally be
> just by endpoint address.
> 
> However, in 2.11 the smtp_connect_local() function also changed:
Re: Postfix benchmark: bug or performance regression ?
Wietse Venema:
> Juliana Rodrigueiro:
> > Excerpt of maillog version > 2.11.1:
> > Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> > delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> > Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> > delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> > Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed
> 
> The a/b/c/d delays logging shows:
> 
> a = time from message arrival to last active queue entry
> b = time from last active queue entry to connection setup
> c = time in connection setup, including DNS, EHLO and STARTTLS
> d = time in message transmission
> 
> Here the delays are 0.01s or less.
> 
> > Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2,
> > delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> The message had to wait 2s until a delivery agent became available, for
> reasons that will become clear later.
> 
> > Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4,
> > delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> The message had to wait 2s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
> 
> > Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=,
> > orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1,
> > delays=0/6/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> 
> This message had to wait 6s until a delivery agent became available,
> and then it spent 2s in connection setup (including DNS lookup).
> 
> There is a clear pattern in here: under load, the LMTP client spends
> 2s in connection setup (including DNS lookup).

To get rid of the 2s delays:

/etc/postfix/main.cf:
    lmtp_connection_cache_on_demand = no

Please let us know if that helps. Meanwhile we can develop a proper fix.

	Wietse
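[Editor's note: for anyone applying the workaround above without editing main.cf by hand, the same change can be made with the standard postconf tool. This is a sketch of the administrative steps, not part of the original message; adapt it to your own configuration management.]

```shell
# Disable on-demand LMTP connection caching (the workaround quoted above).
postconf -e 'lmtp_connection_cache_on_demand = no'

# Tell running Postfix processes to pick up the new setting.
postfix reload

# Verify the active value.
postconf lmtp_connection_cache_on_demand
```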
Re: Postfix benchmark: bug or performance regression ?
> On Mar 27, 2019, at 3:01 PM, Viktor Dukhovni wrote:
> 
> There's likely a bug. We should either simulate a synthetic nexthop
> ($myhostname?) for unix-domain destinations, and then do nexthop
> reuse (and perhaps do no caching by endpoint address for unix-domain
> destinations), or go back to using re-use by endpoint address.
> Currently there's a mismatch, and we're keeping connections alive
> that can't be reused.

And of course the nexthop can't be "$myhostname". If there are multiple
LMTP unix-domain socket destinations, we must not mix them up, so the
nexthop needs to be just the socket pathname.

-- 
	Viktor.
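[Editor's note: the mismatch described above — a session saved under one cache key but looked up under a different one — can be illustrated with a toy string-keyed cache. This is purely illustrative C, not Postfix's actual scache mechanism, and the key strings are made up.]

```c
#include <assert.h>
#include <string.h>

#define SLOTS 8

/* Toy session cache: sessions are stored and retrieved by string key. */
struct toy_cache {
    const char *key[SLOTS];
    int fd[SLOTS];
    int used;
};

static void save_session(struct toy_cache *c, const char *key, int fd)
{
    c->key[c->used] = key;
    c->fd[c->used] = fd;
    c->used++;
}

static int find_session(const struct toy_cache *c, const char *key)
{
    for (int i = 0; i < c->used; i++)
        if (strcmp(c->key[i], key) == 0)
            return c->fd[i];
    return -1;                          /* cache miss */
}

int main(void)
{
    struct toy_cache cache = { .used = 0 };

    /* Save a connection under one (hypothetical) key... */
    save_session(&cache, "nexthop:lmtp:/var/imap/socket/lmtp", 7);

    /* ...but look it up under another: never found, so a fresh
     * connection is opened while the cached one stays open and idle,
     * tying up LMTP server resources. */
    assert(find_session(&cache, "addr:/var/imap/socket/lmtp") == -1);

    /* With matching keys, the session would be reused. */
    assert(find_session(&cache, "nexthop:lmtp:/var/imap/socket/lmtp") == 7);
    return 0;
}
```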
Re: Postfix benchmark: bug or performance regression ?
Juliana Rodrigueiro:
> Excerpt of maillog version > 2.11.1:
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 6E8CE63: removed

The a/b/c/d delays logging shows:

a = time from message arrival to last active queue entry
b = time from last active queue entry to connection setup
c = time in connection setup, including DNS, EHLO and STARTTLS
d = time in message transmission

Here the delays are 0.01s or less.

> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2,
> delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

The message had to wait 2s until a delivery agent became available, for
reasons that will become clear later.

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4,
> delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

The message had to wait 2s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1,
> delays=0/6/2/0.02, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

This message had to wait 6s until a delivery agent became available,
and then it spent 2s in connection setup (including DNS lookup).

There is a clear pattern in here: under load, the LMTP client spends
2s in connection setup (including DNS lookup).

Also, the connections are not reused, which may be the result of a
Postfix change that I will look into. But connection reuse is an
optimization; it should never take 2s to set up an LMTP connection,
not even if the LMTP server is on the other side of the world.

The core problem is that under load, it takes 2s to set up an LMTP
connection. If you can fix that, then Postfix will perform fine with
or without connection reuse.

What kind of LMTP server are you using?

	Wietse
Re: Postfix benchmark: bug or performance regression ?
On Wed, Mar 27, 2019 at 03:36:28PM +0100, Juliana Rodrigueiro wrote:

> However, during a benchmark, we realized 3.3.2 was 5 times slower
> than the version before.

This is misleading. Postfix is not 5 times slower, your benchmark
appears to be measuring the LMTP delivery rate to a single sink
mailbox, and this artificial work-load may be slower under some
adverse conditions (below...).

> With version >= 2.11.1, the first emails are received and processed
> somewhat instantly,

How many? Are those connections closed or kept open in scache?

> Mar 27 14:46:50 localdomain postfix/qmgr[24339]: 7079468: from=,
> size=356, nrcpt=1 (queue active)
> Mar 27 14:46:50 localdomain postfix/lmtp[24750]: 6CEFF61: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> delays=0.01/0/0.01/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:50 localdomain postfix/lmtp[24752]: 6E8CE63: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.02,
> delays=0/0/0.01/0.01, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)
> Mar 27 14:46:52 localdomain postfix/lmtp[24752]: 764A06A: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=2,
> delays=0/2/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

This message waited 2s in the active queue before it was its turn to
go; this suggests some other deliveries were still in progress...

> Mar 27 14:46:54 localdomain postfix/lmtp[24750]: 778DC61: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=4,
> delays=0.01/2/2/0.02, dsn=2.1.5, status=sent (250

This delivery took 2 seconds to complete a connection, why?

> Mar 27 14:46:58 localdomain postfix/lmtp[24750]: 7CFDE72: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=8.1,
> delays=0/6/2/0.02, dsn=2.1.5, status=sent (250

This delivery spent 6s waiting in the active queue and 2s to complete
a connection. Why are connections to the LMTP server so expensive?
Is your syslog service logging synchronously? Perhaps logging of
connection establishment (in the LMTP server) is causing congestion
at LMTP connect? What is the LMTP server doing in connect and LHLO
processing?

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AB58261: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], conn_use=2,
> delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok
> SESSIONID=)

This shows LMTP connection re-use, but that should not make much
difference with unix-domain sockets. Local connections are supposed
to be fast.

> Mar 27 15:00:35 localdomain postfix/lmtp[26453]: AC80663: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], conn_use=2,
> delay=0.01, delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok
> SESSIONID=)

Again re-use, but the next connection is fresh:

> Mar 27 15:00:35 localdomain postfix/lmtp[26451]: AD4EB67: to=,
> orig_to=, relay=localdomain.com[/var/imap/socket/lmtp], delay=0.01,
> delays=0/0/0/0, dsn=2.1.5, status=sent (250 2.1.5 Ok SESSIONID=)

> Going deeper with bisecting the code, I found out that the following
> line created this behaviour change:
> 
> -#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop) != 0)
> +#define HAVE_NEXTHOP_STATE(state) (STR((state)->iterator->request_nexthop)[0] != 0)

That's a necessary bugfix; without it the test is always true.

> Questions:
> 
> (1) Does it mean that the lmtp connections via socket are never cached
> or reused?

There is no caching by nexthop domain, since there is no nexthop
domain for unix-domain LMTP sockets. All there is is a nexthop
socket address. So session caching, if any, would naturally be
just by endpoint address.

However, in 2.11 the smtp_connect_local() function also changed:

https://github.com/vdukhovni/postfix/commit/072fa2d1f7beaf41756c30a69ef96cb611461cec#diff-24650cb98191c311688117b61626fffbR506

to use smtp_reuse_nexthop() rather than smtp_reuse_addr(), so that SASL
authenticated connections might be re-used. But this looks like a
mistake, since we're therefore creating cached sessions that will never
be re-used, but that tie up LMTP server resources, possibly leading to
subsequent congestion under load.

> (2) Is this expected behaviour? Is this a bug?

There's likely a bug. We should either simulate a synthetic nexthop
($myhostname?) for unix-domain destinations, and then do nexthop
reuse (and perhaps do no caching by endpoint address for unix-domain
destinations), or go back to using re-use by endpoint address.
Currently there's a mismatch, and we're keeping connections alive
that can't be reused.

The fix will be easy, once we decide on the fine details.

-- 
	Viktor.