Hello,

A few months ago, my colleague Joey Lynch described a "true zero downtime haproxy reload" on this mailing list [1]. The solution he implemented uses a qdisc to block outgoing SYNs to an haproxy instance that is shutting down. This prevents the operating system from assigning new connections to a socket belonging to an haproxy instance which is no longer accepting requests, and so avoids a race condition between that haproxy's last accept() and its call to close() the socket.

That solution has been working well for us, but lately we have found that a few requests (~100 requests per billion) still receive RST packets when haproxy reloads.
Our existing solution works thus:

  1. coordinator process places a plug qdisc on outgoing SYNs
  2. coordinator process brings up a new haproxy
  3. new haproxy completes initialization, and sends a signal to old haproxy
  4. old haproxy handles the signal and unbinds its listen sockets
  5. new haproxy forks, indicating that it has completed initialization
  6. coordinator process unplugs outgoing SYNs
  7. in the (possibly-distant) future, old haproxy exits

Joey diagnosed the issue as a race condition in this sequence. Specifically, nothing requires that step 4 (old haproxy unbinds) happens before steps 5-6. On a heavily loaded system, the new haproxy can complete initialization, and the coordinator can unplug outgoing SYNs, before the old haproxy has had time to unbind its sockets.

To fix this, I would like to implement a communication mechanism that lets the shutting-down haproxy tell another process that it has finished unbinding its sockets. The modified solution would work like so:

  1. new haproxy initializes and binds sockets (SO_REUSEPORT is required)
  2. new haproxy completes initialization, and indicates it is finished by forking (existing default behavior); old haproxy is still running normally
  3. coordinator process places a plug qdisc on outgoing SYNs
  4. coordinator process sends a soft-stop signal to old haproxy
  5. old haproxy unbinds its listening sockets, and communicates that it has done so to the coordinator
  6. coordinator unplugs outgoing SYNs
  7. in the (possibly-distant) future, old haproxy exits

As the communication mechanism, I would like to propose flock(). I have produced this extremely rough proof-of-concept patch [2] (lacks opt-in, error handling, resumption support, et cetera).
It modifies haproxy at two points:

  * immediately before listening on sockets, haproxy will open a "socketlock" file at a well-known location and flock it exclusively
  * before pausing proxies, haproxy will un-flock() that file

Under this solution, the coordinator process would wait for the socketlock file to be unflocked, and use that to sequence its unplugging of the plug qdisc. This has a number of desirable properties:

  * there is no race condition anymore (that I can see)
  * initialization of the new haproxy no longer requires a plugged qdisc. Requests can continue flowing while the new haproxy is initializing
  * when the old haproxy finishes unbinding, the coordinator is informed (almost) immediately. No polling is necessary by the coordinator process.
  * the coordinator doesn't have to do anything gross (like inspect /proc) to learn the state of the shutting-down haproxy.

My questions for the list are twofold:

  1. does this solution solve the problem it intends to? Are there any lurking issues you can see?
  2. would you accept a patch along the lines of [2]? What improvements (beyond those already mentioned) would you like to see in it?

Thanks,
Josh

[1] https://marc.info/?l=haproxy&m=142894591021699&w=2
[2] https://gist.github.com/hashbrowncipher/a0da32514f2f240cbbbf
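
P.S. To make the coordinator side of steps 3-6 of the modified sequence concrete, here is a minimal Python sketch. It is not part of the patch in [2]. The function names, the lock path argument, and the plug/unplug/stop_old callables are all hypothetical placeholders for whatever the real coordinator uses. The only mechanism it relies on is that a blocking flock() for a shared lock does not return while another open file holds an exclusive lock on the same file.

```python
import fcntl
import os


def wait_until_unflocked(path):
    """Block until no other open file holds an exclusive flock() on path."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # A shared lock cannot be granted while another open file description
        # holds an exclusive one, so this call sleeps in the kernel (no
        # polling) until the holder unlocks the file or exits.
        fcntl.flock(fd, fcntl.LOCK_SH)
        fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)


def soft_reload(lock_path, stop_old, plug, unplug):
    """Steps 3-6: plug SYNs, stop old haproxy, wait for unbind, unplug.

    stop_old, plug and unplug are hypothetical callables supplied by the
    coordinator; this sketch does not implement the qdisc handling itself.
    """
    plug()
    try:
        stop_old()
        wait_until_unflocked(lock_path)
    finally:
        # Always unplug, even on error, so held SYNs are not stuck forever.
        unplug()
```

In practice stop_old could be something like `lambda: os.kill(old_pid, signal.SIGUSR1)`, since SIGUSR1 is haproxy's soft-stop signal. Because the kernel drops an flock() when its holder exits, the wait also ends if the old haproxy dies before unlocking, though a real coordinator would probably still want a timeout.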