Hello,
we're using HAProxy in multiple different scenarios and recently upgraded from 1.8 to 2.0 (2.0.3 at the moment) to benefit from all its nice improvements. However, we've also noticed some strange behaviour we don't understand yet.

After upgrading, we see an increased number of "Connection reset by peer" error logs on the application servers (Java 8 applications) that HAProxy sends the traffic to. We're currently running four instances of HAProxy in parallel, each on its own VM, with identical configuration on all of them. There is one TLS frontend, which also uses client certificates. Clients are not browsers but network devices sending telemetry data at fixed intervals over Thrift in HTTP, so client behaviour is very uniform and predictable. Clients try to use keep-alive to keep the overhead at a minimum. Spread over all instances, we serve around 1700 req/s.

After upgrading from 1.8.20 to 2.0.2, the "Connection reset by peer" errors on the application servers went up from around 12 per hour to around 300 per hour, but we couldn't see any changes in the HAProxy metrics we scrape and store from the stats socket using the standalone Prometheus exporter and Prometheus.

We captured and analysed a tcpdump from one of the application servers and noticed TCP RSTs often (but not always) being sent from HAProxy to the backend server in response to the server's FIN/ACK at the end of the response (we're using option http-server-close). However, with the RST packet, HAProxy still ACKs the last seq of the response, so it looks like nothing is actually lost. On connections where HAProxy initiated the connection close, everything looked fine.

In an attempt to mitigate this, we tried changing to option http-pretend-keepalive. This, however, led to all servers reaching their maxconn, HAProxy starting to queue, and all instances capping at 100% CPU. A day later, HAProxy 2.0.3 was released, and we upgraded to it, thinking "mux-h1: Trim excess server data ..."
might have something to do with that, but it hasn't actually changed anything on that front. We haven't tried option http-pretend-keepalive again, though.

We also have another scenario with automated clients where the clients do not use keep-alive, and we're seeing the same "connection reset" increase there. However, in that scenario, changing to option http-pretend-keepalive didn't change anything (as in: it did not trigger the failure described above, but also had no impact on the connection resets).

I've attached the config from our test environment, which is identical to our production config except for hostnames and IP addresses. CONTROL_BACKEND is the high-volume backend, at about 1500 req/s; MONITORING_BACKEND is at about 200 req/s (summed over all 4 instances).

If you have any ideas how we could tackle this, where we could dig further, or if you see anything wrong in our config, please let me know.

Kind regards,
Julian Poschmann

(See attached file: haproxy.cfg)

LANCOM Systems GmbH
Adenauerstr. 20 / B2
52146 Würselen
Germany
Tel: +49 2405 49936-0
Fax: +49 2405 49936-99
Web: https://www.lancom-systems.de
Managing directors: Ralf Koenzen, Stefan Herrlich
Registered office: Aachen; commercial register: Amtsgericht Aachen, HRB 16976
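PS: In case the attached haproxy.cfg doesn't come through for everyone, here is a rough sketch of the shape of the setup described above. All concrete values (bind address, certificate paths, the routing rule, server addresses, maxconn and timeouts) are placeholders, not our real configuration; the commented-out line shows the variant we tried:

```
# Sketch only -- every concrete value here is a placeholder,
# not taken from our actual config.
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend telemetry_in
    # single TLS frontend, also verifying client certificates
    bind :443 ssl crt /etc/haproxy/server.pem ca-file /etc/haproxy/client-ca.pem verify required
    # routing rule is a placeholder; the real split is in the attached config
    use_backend MONITORING_BACKEND if { path_beg /monitoring }
    default_backend CONTROL_BACKEND

backend CONTROL_BACKEND            # high volume, ~1500 req/s over all 4 instances
    # what we run today: HAProxy closes the server-side connection after each response
    option http-server-close
    # what we tried instead, which drove all servers to their maxconn:
    # option http-pretend-keepalive
    server app1 192.0.2.10:8080 maxconn 200 check

backend MONITORING_BACKEND         # ~200 req/s over all 4 instances
    option http-server-close
    server mon1 192.0.2.20:8080 maxconn 200 check
```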