Re: [ANNOUNCE] haproxy-2.1.3
Hi Luke,

On Thu, Feb 20, 2020 at 10:22:46AM +0100, Luke Seelenbinder wrote:
> Hi Willy,
>
> > - the H2 mux had an incorrect buffer full detection causing the send
> >   phase to stop on a fragment boundary then to immediately wake up all
> >   waiting threads to go on, resulting in an excessive CPU usage in some
> >   tricky situations. It is possible that those using H2 with many streams
> >   per connection and moderately large objects, like Luke's maps servers,
> >   could observe a CPU usage drop (maybe Luke on his map servers).
>
> We just deployed 2.1.3 across our PoP network last night, and I can indeed
> verify we're seeing better CPU usage--anywhere from 40-50% aggregate
> reduction!

Oh, that's really impressive!

> Once we have a few more days of data, I'll send a pretty chart so you can
> enjoy the fruits of your hard work.

Great, thank you!

Willy
Re: [ANNOUNCE] haproxy-2.1.3
Hi Willy,

> - the H2 mux had an incorrect buffer full detection causing the send
>   phase to stop on a fragment boundary then to immediately wake up all
>   waiting threads to go on, resulting in an excessive CPU usage in some
>   tricky situations. It is possible that those using H2 with many streams
>   per connection and moderately large objects, like Luke's maps servers,
>   could observe a CPU usage drop (maybe Luke on his map servers).

We just deployed 2.1.3 across our PoP network last night, and I can indeed
verify we're seeing better CPU usage—anywhere from 40-50% aggregate
reduction!

Once we have a few more days of data, I'll send a pretty chart so you can
enjoy the fruits of your hard work.

Best,
Luke

—
Luke Seelenbinder
Stadia Maps | Founder
stadiamaps.com

> On 12 Feb 2020, at 17:44, Willy Tarreau wrote:
>
> Hi,
>
> HAProxy 2.1.3 was released on 2020/02/12. It added 86 new commits
> after version 2.1.2.
>
> It's clear that 2.1 has been one of the calmest releases in a while, to
> the point of making us forget that it still had a few fixes pending that
> would be pleasant to have in a released version! So after accumulating
> fixes for 7 weeks, it's about time to have another one!
>
> Here are the most relevant fixes:
>
> - pools: there is an ABA race condition in pool_flush() (which is called
>   when stopping as well as under memory pressure) which can lead to a
>   crash. It's been there since 1.9 and is very hard to trigger, but if
>   you run with many threads and reload very often you may occasionally
>   hit it, seeing a trace of the old process crashing in your system
>   logs.
>
> - there was a bug in the way our various hashes were calculated, some
>   of them were considering the inputs as signed chars instead of
>   unsigned ones, so some non-ASCII characters would hash differently
>   across different architectures and wouldn't match another component's
>   calculation (e.g. a CRC32 inserted in a header would differ when given
>   values with the 8th bit set, or applied to the PROXY protocol header).
>   The bug has been there since 1.5-dev20 but became visible since it
>   affected Postfix's validation of the PROXY protocol's CRC32. It's
>   unlikely that anyone will ever witness it if it didn't happen already,
>   but I tagged it "major" to make sure it is properly backported to
>   distro packages, since not having it on certain nodes may sometimes
>   result in hash inconsistencies which can be very hard to diagnose.
>
> - the addition of the Early-Data header when using 0rtt could wrongly
>   be emitted during SSL handshake as well.
>
> - health checks could crash if using handshakes (e.g. SSL) mixed with
>   DNS that takes time to retrieve an address, causing an attempt to
>   use an incompletely initialized connection.
>
> - the peers listening socket was missing from the seamless reload,
>   possibly causing some failed bindings when not using reuseport,
>   resulting in the new process giving up.
>
> - splicing could often end up on a timeout because after the last block
>   we did not switch back to HTX to complete the message.
>
> - fixed a small race affecting idle connections, allowing one thread to
>   pick a connection at the same moment another one would decide to free
>   it because there are too many idle.
>
> - response redirects were appended to the actual response instead of
>   replacing it. This could cause various errors, including data
>   corruption on the client if the entire response didn't fit into the
>   buffer at once.
>
> - when stopping or when releasing a few connections after a listener's
>   maxconn was reached, we could distribute some work to inexistent
>   threads if the listener had "1/odd" or "1/even" while the process
>   had less than 64 threads. An easy workaround for this is to explicitly
>   reference the thread numbers instead.
>
> - when proxying an HTTP/1 client to an HTTP/2 server, make sure to clean
>   up the "TE" header from anything but "trailers", otherwise the server
>   may reject a request if it came from a browser placing "gzip" there.
>
> - the H2 mux had an incorrect buffer full detection causing the send
>   phase to stop on a fragment boundary then to immediately wake up all
>   waiting threads to go on, resulting in an excessive CPU usage in some
>   tricky situations. It is possible that those using H2 with many streams
>   per connection and moderately large objects, like Luke's maps servers,
>   could observe a CPU usage drop (maybe Luke on his map servers).
>
> - it was possible to lose the master-worker status after a failed reload
>   when it was only mentioned in the config and not on the command line.
>
> - when decoding the Netscaler's CIP protocol we forgot to allocate the
>   storage for the src/dst addresses, crashing the process.
>
> - upon pipe creation failure due to shortage of file descriptors, the
>   struct pipe
[ANNOUNCE] haproxy-2.1.3
Hi,

HAProxy 2.1.3 was released on 2020/02/12. It added 86 new commits
after version 2.1.2.

It's clear that 2.1 has been one of the calmest releases in a while, to
the point of making us forget that it still had a few fixes pending that
would be pleasant to have in a released version! So after accumulating
fixes for 7 weeks, it's about time to have another one!

Here are the most relevant fixes:

- pools: there is an ABA race condition in pool_flush() (which is called
  when stopping as well as under memory pressure) which can lead to a
  crash. It's been there since 1.9 and is very hard to trigger, but if
  you run with many threads and reload very often you may occasionally
  hit it, seeing a trace of the old process crashing in your system
  logs.

- there was a bug in the way our various hashes were calculated: some
  of them were considering the inputs as signed chars instead of
  unsigned ones, so some non-ASCII characters would hash differently
  across different architectures and wouldn't match another component's
  calculation (e.g. a CRC32 inserted in a header would differ when given
  values with the 8th bit set, or applied to the PROXY protocol header).
  The bug has been there since 1.5-dev20 but became visible since it
  affected Postfix's validation of the PROXY protocol's CRC32. It's
  unlikely that anyone will ever witness it if it didn't happen already,
  but I tagged it "major" to make sure it is properly backported to
  distro packages, since not having it on certain nodes may sometimes
  result in hash inconsistencies which can be very hard to diagnose.

- the addition of the Early-Data header when using 0rtt could wrongly
  be emitted during the SSL handshake as well.

- health checks could crash if using handshakes (e.g. SSL) mixed with
  DNS that takes time to retrieve an address, causing an attempt to
  use an incompletely initialized connection.
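To illustrate the class of bug behind the signed-char hashing fix, here is a small standalone sketch. It uses a toy djb2-style hash, not HAProxy's actual hashing code: reading a byte with the 8th bit set through a signed char sign-extends it to a negative value before it enters the accumulator, so the two variants below agree on pure ASCII input but diverge on 8-bit characters.

```c
#include <stddef.h>

/* Toy djb2-style hash in two variants (hypothetical illustration, not
 * HAProxy's actual code). The only difference is the cast applied to
 * each input byte before it is mixed into the accumulator. */
static unsigned long hash_signed(const char *s, size_t len)
{
    unsigned long h = 5381;
    for (size_t i = 0; i < len; i++)
        h = h * 33 + (signed char)s[i];   /* 0xE9 sign-extends to -23 */
    return h;
}

static unsigned long hash_unsigned(const char *s, size_t len)
{
    unsigned long h = 5381;
    for (size_t i = 0; i < len; i++)
        h = h * 33 + (unsigned char)s[i]; /* 0xE9 stays 233 */
    return h;
}
```

Code that simply wrote `s[i]` through a plain char behaves like one variant or the other depending on the platform: plain char is signed on x86 but unsigned on ARM, which is exactly how the same input could hash differently across architectures.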
- the peers listening socket was missing from the seamless reload,
  possibly causing some failed bindings when not using reuseport,
  resulting in the new process giving up.

- splicing could often end up on a timeout because after the last block
  we did not switch back to HTX to complete the message.

- fixed a small race affecting idle connections, allowing one thread to
  pick a connection at the same moment another one would decide to free
  it because there are too many idle.

- response redirects were appended to the actual response instead of
  replacing it. This could cause various errors, including data
  corruption on the client if the entire response didn't fit into the
  buffer at once.

- when stopping or when releasing a few connections after a listener's
  maxconn was reached, we could distribute some work to nonexistent
  threads if the listener had "1/odd" or "1/even" while the process
  had fewer than 64 threads. An easy workaround for this is to
  explicitly reference the thread numbers instead.

- when proxying an HTTP/1 client to an HTTP/2 server, make sure to clean
  up the "TE" header from anything but "trailers", otherwise the server
  may reject a request if it came from a browser placing "gzip" there.

- the H2 mux had an incorrect buffer full detection causing the send
  phase to stop on a fragment boundary then to immediately wake up all
  waiting threads to go on, resulting in excessive CPU usage in some
  tricky situations. It is possible that those using H2 with many streams
  per connection and moderately large objects, like Luke's maps servers,
  could observe a CPU usage drop.

- it was possible to lose the master-worker status after a failed reload
  when it was only mentioned in the config and not on the command line.

- when decoding the Netscaler's CIP protocol we forgot to allocate the
  storage for the src/dst addresses, crashing the process.
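As a sketch of the "1/odd" workaround, assuming a hypothetical 4-thread process and the 2.1-era "process" bind keyword: instead of letting the thread set be derived from "odd", the affected bind lines can name the threads explicitly, one bind line per thread (the file name and frontend name below are made up for the example):

```
# Hypothetical example: rather than
#     bind :443 ssl crt site.pem process 1/odd
# name the odd threads of a 4-thread process explicitly:
frontend fe_main
    bind :443 ssl crt site.pem process 1/1
    bind :443 ssl crt site.pem process 1/3
```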
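For the "TE" fix, the rule comes from RFC 7540 section 8.1.2.2: an HTTP/2 request may only carry a "TE" header with the value "trailers". A minimal sketch of the kind of normalization involved (a hypothetical helper, not HAProxy's actual code) keeps "trailers" if it appears anywhere in the comma-separated list and tells the caller to drop the header otherwise:

```c
#include <stdio.h>
#include <strings.h>

/* Hypothetical sketch of TE normalization for an HTTP/1 -> HTTP/2
 * translation. Scans the comma-separated header value; if a "trailers"
 * token is present, writes "trailers" into <out> and returns 1.
 * Returns 0 when the header should be removed entirely (e.g. "gzip"). */
static int normalize_te(const char *value, char *out, size_t outlen)
{
    const char *p = value;

    while (*p) {
        while (*p == ' ' || *p == ',')
            p++;                              /* skip list separators */
        const char *tok = p;
        while (*p && *p != ',' && *p != ';' && *p != ' ')
            p++;                              /* end of token name */
        if (p - tok == 8 && strncasecmp(tok, "trailers", 8) == 0) {
            snprintf(out, outlen, "trailers");
            return 1;
        }
        while (*p && *p != ',')
            p++;                              /* skip any parameters */
    }
    out[0] = '\0';
    return 0;
}
```

So a browser's "TE: gzip" is dropped before the request is encoded for the HTTP/2 server, while "TE: gzip, trailers" is rewritten to just "TE: trailers".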
- upon pipe creation failure due to shortage of file descriptors, the
  struct pipe was still returned after having been released, quickly
  crashing the process. Fortunately the automatic maxconn/maxpipe
  settings do not allow this situation to happen, but very old configs
  still having "ulimit-n" could have been affected.

- the "tcp-request session" rules would report an error upon a "reject"
  action, making the listener throttle itself to protect resources,
  which could actually amplify the problem.

- the "commit ssl cert" command on the CLI used the old SSL_CTX instead
  of the new one, which caused some certs not to work anymore (found on
  openssl-1.0.2 with ECDSA+ECDHE). There is quite a number of other SSL
  fixes for small bugs that were found while troubleshooting this issue,
  mainly in relation with dynamic cert updates.

- the H1 mux could attempt to perform a sendto() when facing new data
  after having already failed, resulting in excess calls to sendto().

The rest has