Hello,

A few months ago, my colleague Joey Lynch described a "true zero
downtime haproxy reload" on this mailing list [1]. The solution he
implemented uses a qdisc to block outgoing SYNs to an haproxy instance
that is shutting down. This prevents the operating system from
assigning new connections to a socket belonging to an haproxy
instance that is no longer accepting requests, and so avoids a race
between that haproxy's last accept() and its call to close() the
socket. That solution has been working well for us, but lately we
have found that a small number of requests (~100 per billion) still
receive RST packets when haproxy reloads.

Our existing solution works as follows (a rough sketch of the
coordinator's side appears below the list):

  1. coordinator process places a plug qdisc on outgoing SYNs
  2. coordinator process brings up a new haproxy
  3. new haproxy completes initialization, and sends a signal to old haproxy
  4. old haproxy handles the signal and unbinds its listen sockets
  5. new haproxy forks, indicating that it has completed initialization
  6. coordinator process unplugs outgoing SYNs
  7. in the (possibly-distant) future, old haproxy exits
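
For concreteness, the coordinator's side of that sequence looks
roughly like the following. The plug/unplug helpers stand in for the
qdisc manipulation from [1], and the haproxy invocation and paths are
illustrative only:

  #include <sys/types.h>
  #include <sys/wait.h>
  #include <unistd.h>

  /* Stand-ins for the qdisc plug/unplug described in [1]. */
  static void plug_outgoing_syns(void)   { /* ... */ }
  static void unplug_outgoing_syns(void) { /* ... */ }

  void reload(const char *old_pid)
  {
      plug_outgoing_syns();                          /* step 1 */

      pid_t child = fork();                          /* step 2 */
      if (child == 0) {
          /* -sf tells the new haproxy to signal the old one once it
           * has finished initializing (step 3); it then daemonizes
           * (step 5). */
          execlp("haproxy", "haproxy",
                 "-f", "/etc/haproxy/haproxy.cfg",
                 "-p", "/var/run/haproxy.pid",
                 "-sf", old_pid, (char *)NULL);
          _exit(127);
      }

      /* Returns once the new haproxy has forked into the background,
       * i.e. once step 5 has happened. */
      waitpid(child, NULL, 0);

      unplug_outgoing_syns();                        /* step 6 */
  }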

Joey diagnosed the issue as a race condition in this sequence.
Specifically, nothing requires that step 4 (old haproxy unbinds)
happens before steps 5-6. On a heavily loaded system, the new haproxy
can complete initialization before the old one has had time to unbind
its sockets. To fix this, I would like to implement a communication
mechanism that permits the shutting-down haproxy to indicate to
another process that it has finished unbinding its sockets.

The modified solution would work like so:

  1. new haproxy initializes and binds sockets (SO_REUSEPORT is required)
  2. new haproxy completes initialization, and indicates it is
finished by forking (existing default behavior); old haproxy is still
running normally.
  3. coordinator process places a plug qdisc on outgoing SYNs
  4. coordinator process sends a soft-stop signal to old haproxy
  5. old haproxy unbinds its listening sockets, and communicates that
it has done so to the coordinator
  6. coordinator unplugs outgoing SYNs
  7. in the (possibly-distant) future, old haproxy exits

As the communication mechanism, I would like to propose flock(). I
have produced an extremely rough proof-of-concept patch [2] (it lacks
opt-in, error handling, resumption support, et cetera). It modifies
haproxy at two points:
 * immediately before listening on sockets, haproxy will open a
"socketlock" file at a well-known location and flock it exclusively
 * before pausing proxies, haproxy will un-flock() that file
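
In very rough form (the lock file path, function names, and exact
hook points below are simplified relative to what [2] actually does):

  #include <fcntl.h>
  #include <sys/file.h>

  /* Illustrative path; a real patch would make this configurable. */
  #define SOCKETLOCK_PATH "/var/run/haproxy/socketlock"

  static int socketlock_fd = -1;

  /* Called immediately before the listening sockets are bound. */
  int socketlock_take(void)
  {
      socketlock_fd = open(SOCKETLOCK_PATH, O_RDWR | O_CREAT, 0644);
      if (socketlock_fd < 0)
          return -1;
      return flock(socketlock_fd, LOCK_EX);
  }

  /* Called just before the proxies are paused/unbound on soft-stop. */
  void socketlock_release(void)
  {
      if (socketlock_fd >= 0)
          flock(socketlock_fd, LOCK_UN);
  }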

Under this solution, the coordinator process would wait for the
socketlock file to be unflocked, and use that to sequence its
unplugging of the plug qdisc (a sketch follows the list below). This
has a number of desirable properties:
 * there is no race condition anymore (that I can see)
 * initialization of the new haproxy no longer requires a plugged
qdisc. Requests can continue flowing while the new haproxy is
initializing
 * when the old haproxy finishes unbinding, the coordinator is
informed (almost) immediately. No polling by the coordinator process
is necessary.
 * the coordinator doesn't have to do anything gross (like inspect
/proc) to learn the state of the shutting-down haproxy.
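
On the coordinator's side, the wait is then a single blocking flock()
call, along these lines (path illustrative; this could run as a small
helper between the plug and unplug steps):

  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/file.h>
  #include <unistd.h>

  int main(void)
  {
      int fd = open("/var/run/haproxy/socketlock", O_RDONLY);
      if (fd < 0) {
          perror("open socketlock");
          return EXIT_FAILURE;
      }

      /* Blocks until the old haproxy drops its exclusive lock, i.e.
       * until it has unbound its listening sockets. */
      if (flock(fd, LOCK_SH) < 0) {
          perror("flock socketlock");
          return EXIT_FAILURE;
      }

      /* We only needed to observe the release; drop the lock right
       * away so a future haproxy can take the exclusive lock without
       * waiting on us. */
      close(fd);

      /* It is now safe to unplug the qdisc holding outgoing SYNs. */
      return EXIT_SUCCESS;
  }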

My questions for the list are twofold:
  1. does this solution solve the problem it intends to? Are there any
lurking issues you can see?
  2. would you accept a patch along the lines of [2]?  What
improvements (beyond those already mentioned) would you like to see in
it?

Thanks,
Josh

[1] https://marc.info/?l=haproxy&m=142894591021699&w=2
[2] https://gist.github.com/hashbrowncipher/a0da32514f2f240cbbbf
