Hi Sander,

On Tue, Mar 10, 2020 at 10:28:38AM +0100, Sander Klein wrote:
> Hi All,
> 
> I'm looking into a strange issue I'm having and I'm starting to think it is
> HAProxy related.
> 
> I have a setup with HAProxy serving multiple frontends and multiple backends,
> which are Nginx servers with PHP-FPM. Sometimes, all of a sudden, the maxconn
> limit is hit and connections get queued to a backend server, and I do not
> have a clue why. The backend is not overloaded, no traffic is flowing,
> Nginx/PHP-FPM picks up other connections like the health checks from HAProxy
> or our monitoring server, PHP-FPM is not doing anything so there are no
> long-running processes, Nginx is doing nothing, but it does not receive any
> new connection from HAProxy. Sometimes this lasts for 1 second, but it can
> also last for as much as 30 seconds.
> 
> It does not happen on all backend servers at once, just randomly on one
> server. So if I have defined a backend with 2 servers, it happens to only
> one at a time.
There could be two possible explanations for this.

The first one is that the backend server sometimes slows down for whatever
reason on the requests it is processing, resulting in maxconn being reached
on haproxy. This commonly happens on applications where one request is much
more expensive than most others. I've seen some sites use a dedicated backend
with a much lower maxconn for a search button, for example, because they knew
this search request could take multiple seconds; if enough of them happen at
the same time, the maxconn is reached and extra requests get queued even
though they could have been handled.

The other possibility is that some requests produce huge responses that take
a long time to be consumed by the clients. Until such a response finishes
being delivered, no slot is freed on the server.

> I'm running HAProxy 2.0.13 on Debian Buster in a VM. I've tested with 'no
> option http-use-htx' and HAProxy 2.1.3 and I see the problem on both.
> Backends are Nginx with PHP-FPM and only using HTTP/1.1 over port 80, also
> VMs.
> 
> Today I disabled H2 on the frontends and now the problem seems to have
> disappeared. So it seems to be related to that part. But I'm not sure.
> How should I go on and debug this?

The best way to do it is to emit "show sess all", "show info", "show stat"
and "show fd" on the CLI when this happens. This will indicate whether there
are many connections still active to the server in trouble, and/or many
connections from a single source address, for example.

One thing that could happen with H2 is that it's trivial to send many
requests at once (100 by default), so if someone wants to have fun with your
site once they've found an expensive request, sending 100 of them doesn't
take more than a single TCP segment.
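The dedicated-backend trick mentioned above can be sketched like this (the
backend/server names, addresses and the /search path are all made up for
illustration; this is not your config):

```
frontend fe_main
    bind :80
    # route the known-expensive request to its own backend
    use_backend bk_search if { path_beg /search }
    default_backend bk_php

backend bk_php
    server web1 192.0.2.10:80 maxconn 100 check
    server web2 192.0.2.11:80 maxconn 100 check

backend bk_search
    # much lower per-server maxconn: a burst of slow searches queues
    # here instead of eating the slots of the regular traffic
    server web1 192.0.2.10:80 maxconn 5 check
```

Note that the same physical server appears in both backends; each backend
keeps its own maxconn accounting.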
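As a concrete starting point for the CLI commands above, something like the
following can be run against the stats socket when the problem occurs. The
socket path, proxy and server names are hypothetical; the sample CSV below
just stands in for what "show stat" would return so the filter can be shown:

```shell
#!/bin/sh
# Live usage would be, e.g.:
#   echo "show stat" | socat stdio /var/run/haproxy.sock
# (and likewise for "show sess all", "show info", "show fd").
# Here a canned sample of the CSV output is used instead:
stat_csv='# pxname,svname,qcur,qmax,scur,smax,slim,stot
bk_web,web1,12,30,100,100,100,4821
bk_web,web2,0,5,43,97,100,5102'

# Flag any server whose current session count (scur, col 5) has reached
# its configured limit (slim, col 7), and show its queue depth (qcur, col 3).
printf '%s\n' "$stat_csv" | awk -F, '!/^#/ && $7 != "" && $5 >= $7 {
  print $1 "/" $2 " saturated: scur=" $5 " slim=" $7 " queued=" $3
}'
```

Capturing these outputs a few times during an incident makes it much easier
to see whether the slots are held by slow requests or by slow clients.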
Another H2-related possibility: if you're serving heavy pages, a single H2
connection can request many objects at once, and if the network link to the
client is lossy, these can take a while to deliver over that one connection.
Since H2 is much faster than H1 on reliable networks but much worse on lossy
ones, that could be an explanation.

Do you have lots of static files? If so, it might make sense to deliver them
from dedicated servers that are not subject to the very low maxconn. And if
such objects are small, you could also enable some caching to reduce the
number of connections made to the servers to fetch them.

> The config looks a bit like this (very redacted and very, very much
> shortened): (...)

Looks pretty fine at first glance. I'm seeing "prefer-last-server"; it
*might* contribute to the problem if it's caused by sudden spikes, as it
encourages requests from a same client to go to the same server, but that's
not necessarily the case.

Willy