Randomness in hard-stop-after
First off, thank you very much for the 1.8.4 release!

When reconfiguring across a large web tier, many HAProxy reloads may be initiated at nearly the same time. We have a large number of long-lived TCP sessions, so we must use hard-stop-after to eventually kill them off, or old HAProxy processes will linger for days. It would be very helpful to have an option like `spread-hard-stop-after`, similar to `spread-checks`, to introduce some random variance into the timeout, so that the disconnection of these sessions does not cause a major traffic event across the stack.
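In the meantime, the spread can be approximated from the outside with a randomized delay before each reload; since hard-stop-after counts from the reload, jittering the start also jitters the eventual hard stop. A minimal sketch, assuming a sysvinit-style service and an arbitrary 30-second spread:

    #!/bin/bash
    # Sleep a random 0-29 seconds before reloading, so a fleet-wide config
    # push does not start (and hard-stop-after does not end) every
    # instance at the same instant.
    sleep $(( RANDOM % 30 ))
    /etc/init.d/haproxy reload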
Re: High load average under 1.8 with multiple draining processes
Reload is done via the sysvinit wrapper, which executes:

    $HAPROXY -f "$CONFIG" -p $PIDFILE -sf $(cat $PIDFILE) -D $EXTRAOPTS

No $EXTRAOPTS are specified. We don't use unix sockets and we don't use master-worker. I'll send an anonymized haproxy config directly to you two along with the results of "show sess" et al. We don't use peers, but if we fall back to nbproc we likely will. More info to come shortly.

On 1/15/18 8:22 AM, Willy Tarreau wrote:
> On Mon, Jan 15, 2018 at 08:14:40AM -0600, Samuel Reed wrote:
>> Thank you for the patch and your quick attention to this issue. Results
>> after a few reloads, 8 threads on a 16-core machine; both the draining
>> and new processes have the patches.
>>
>> New process:
>>
>>  % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  96.24    0.432348          16     26917           epoll_wait
>>   3.60    0.016158          16      1023         7 recvfrom
>>   0.12    0.000524          31        17           sendto
>>   0.04    0.000190           0      1126         3 write
>>   0.01    0.000036           1        70        23 read
>>   0.00    0.000000           0        21           close
>>   0.00    0.000000           0         7           socket
>>   0.00    0.000000           0         7         7 connect
>>   0.00    0.000000           0        13           sendmsg
>>   0.00    0.000000           0        17           setsockopt
>>   0.00    0.000000           0         7           fcntl
>>   0.00    0.000000           0         9           epoll_ctl
>>   0.00    0.000000           0         5         5 accept4
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00    0.449256                 29239        45 total
>>
>> Draining process:
>>
>>  % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  78.26    0.379045          16     23424           epoll_wait
>>  13.94    0.067539           7      9877         4 recvfrom
>>   7.80    0.037764           4     10471         6 write
>>   0.00    0.000007           0        29        10 read
>>   0.00    0.000000           0         9           close
>>   0.00    0.000000           0         5           sendto
>>   0.00    0.000000           0         3           shutdown
>>   0.00    0.000000           0        20           epoll_ctl
>> ------ ----------- ----------- --------- --------- ----------------
>> 100.00    0.484355                 43838        20 total
>>
>> I ran this a few times while both processes were live and the numbers
>> weren't significantly different. The new process still has a remarkably
>> high proportion of epoll_wait.
> Thank you Samuel for the test. It's sad, but it may indicate something
> completely different. Christopher, at least I'm willing to integrate your
> fix to rule out this corner case in the future.
>
> Among the differences possible between an old and a new process, we can
> enumerate very few things, for example the peers, which work differently
> for new and old processes. Do you use peers in your config? It would also
> be possible that we pass an fd corresponding to a more or less closed
> listener or something like this. Do you reload with -x to pass FDs across
> processes? Do you use master-worker? Just trying to rule out a number of
> hypotheses. An anonymized version of your config will definitely help here
> I'm afraid.
>
> Thanks!
> Willy
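For reference, answering Willy's -x question: a reload that passes listener FDs from the old process would look roughly like the sketch below. It assumes a stats socket declared with expose-fd listeners, which our setup does not currently have (the socket path here is illustrative):

    # haproxy.cfg (1.8+):
    global
        stats socket /var/run/haproxy.sock mode 600 level admin expose-fd listeners

    # Reload, fetching the old process's listening FDs over the socket:
    $HAPROXY -f "$CONFIG" -p $PIDFILE -x /var/run/haproxy.sock -sf $(cat $PIDFILE) -D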
Re: High load average under 1.8 with multiple draining processes
Thank you for the patch and your quick attention to this issue. Results after a few reloads, 8 threads on a 16-core machine; both the draining and new processes have the patches.

New process:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 96.24    0.432348          16     26917           epoll_wait
  3.60    0.016158          16      1023         7 recvfrom
  0.12    0.000524          31        17           sendto
  0.04    0.000190           0      1126         3 write
  0.01    0.000036           1        70        23 read
  0.00    0.000000           0        21           close
  0.00    0.000000           0         7           socket
  0.00    0.000000           0         7         7 connect
  0.00    0.000000           0        13           sendmsg
  0.00    0.000000           0        17           setsockopt
  0.00    0.000000           0         7           fcntl
  0.00    0.000000           0         9           epoll_ctl
  0.00    0.000000           0         5         5 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.449256                 29239        45 total

Draining process:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 78.26    0.379045          16     23424           epoll_wait
 13.94    0.067539           7      9877         4 recvfrom
  7.80    0.037764           4     10471         6 write
  0.00    0.000007           0        29        10 read
  0.00    0.000000           0         9           close
  0.00    0.000000           0         5           sendto
  0.00    0.000000           0         3           shutdown
  0.00    0.000000           0        20           epoll_ctl
------ ----------- ----------- --------- --------- ----------------
100.00    0.484355                 43838        20 total

I ran this a few times while both processes were live and the numbers weren't significantly different. The new process still has a remarkably high proportion of epoll_wait.

On 1/15/18 7:48 AM, Christopher Faulet wrote:
> On 12/01/2018 at 18:51, Willy Tarreau wrote:
>> On Fri, Jan 12, 2018 at 11:06:32AM -0600, Samuel Reed wrote:
>>> On 1.8-git, similar results on the new process:
>>>
>>>  % time     seconds  usecs/call     calls    errors syscall
>>> ------ ----------- ----------- --------- --------- ----------------
>>>  93.75    0.265450          15     17805           epoll_wait
>>>   4.85    0.013730          49       283           write
>>>   1.40    0.003960          15       266        12 recvfrom
>>>   0.01    0.000018           0        42        12 read
>>>   0.00    0.000000           0        28           close
>>>   0.00    0.000000           0        12           socket
>>>   0.00    0.000000           0        12        12 connect
>>>   0.00    0.000000           0        19         1 sendto
>>>   0.00    0.000000           0        12           sendmsg
>>>   0.00    0.000000           0         6           shutdown
>>>   0.00    0.000000           0        35           setsockopt
>>>   0.00    0.000000           0         7           getsockopt
>>>   0.00    0.000000           0        12           fcntl
>>>   0.00    0.000000           0        13           epoll_ctl
>>>   0.00    0.000000           0         2         2 accept4
>>> ------ ----------- ----------- --------- --------- ----------------
>>> 100.00    0.283158                 18554        39 total
>>>
>>> A cursory look through the strace output looks the same, with the same
>>> three types as in the last email, including the cascade.
>>
>> OK, thank you for testing. On Monday we'll study this with Christopher.
>>
>
> Hi Samuel,
>
> Here are 2 patches that may solve your problem. The idea is to set the
> poller timeout to 0 for a specific thread only when some processing is
> expected for that thread. The job was already done for tasks and
> applets using bitfields. Now we do the same for the FDs.
>
> For now, we don't know if it addresses your problem, but it should
> avoid a thread looping for nothing. Could you check if it works, please?
Re: High load average under 1.8 with multiple draining processes
On 1.8-git, similar results on the new process:

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.75    0.265450          15     17805           epoll_wait
  4.85    0.013730          49       283           write
  1.40    0.003960          15       266        12 recvfrom
  0.01    0.000018           0        42        12 read
  0.00    0.000000           0        28           close
  0.00    0.000000           0        12           socket
  0.00    0.000000           0        12        12 connect
  0.00    0.000000           0        19         1 sendto
  0.00    0.000000           0        12           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        35           setsockopt
  0.00    0.000000           0         7           getsockopt
  0.00    0.000000           0        12           fcntl
  0.00    0.000000           0        13           epoll_ctl
  0.00    0.000000           0         2         2 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.283158                 18554        39 total

A cursory look through the strace output looks the same, with the same three types as in the last email, including the cascade.

On 1/12/18 10:23 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 10:13:55AM -0600, Samuel Reed wrote:
>> Excellent! Please let me know if there's any other output you'd like
>> from this machine.
>>
>> Strace on that new process shows thousands of these types of syscalls,
>> which vary slightly:
>>
>> epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1
> If the u32 value almost doesn't vary, that's an uncaught event. We've
> got a report for this that we just fixed yesterday; it started to
> appear after the system was upgraded with Meltdown fixes. That seems
> unrelated, but reverting made the problem disappear.
>
>> and these:
>>
>> epoll_wait(3, {}, 200, 0) = 0
> This one used to appear in yesterday's report, though it could be caused
> by other bugs as well. That's the one I predicted.
>
>> There is also something of a cascade (each repeats about 10-20x before
>> the next):
>>
>> epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}}, 200, 65) = 2
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}}}, 200, 65) = 4
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}}, {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5
>>
>> I've seen it go as deep as 15. The trace is absolutely dominated by these.
> OK, that's very interesting. Just in doubt, please update to the latest
> 1.8-git to see if it makes this issue disappear.
>
> Thanks,
> Willy
Re: High load average under 1.8 with multiple draining processes
Excellent! Please let me know if there's any other output you'd like from this machine.

Strace on that new process shows thousands of these types of syscalls, which vary slightly:

epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1

and these:

epoll_wait(3, {}, 200, 0) = 0

There is also something of a cascade (each repeats about 10-20x before the next):

epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}}, 200, 65) = 2
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}}}, 200, 65) = 4
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656, u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}}, {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5

I've seen it go as deep as 15. The trace is absolutely dominated by these.

On 1/12/18 10:01 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:50:58AM -0600, Samuel Reed wrote:
>> To accelerate the process, I've increased the number of threads from 4
>> to 8 on a 16-core machine. Ran strace for about 5s on each.
>>
>> Single process (8 threads):
>>
>> $ strace -cp 16807
>>  % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  71.36    0.330172          21     15479           epoll_wait
>>  13.59    0.062861           4     14477         1 write
>>  10.58    0.048955           4     11518        10 recvfrom
>>   4.44    0.020544          38       537       244 read
> This one is OK and shows that quite some time is in fact spent waiting
> for I/O events.
>
>> Two processes (2x8 threads):
>>
>> ## Draining process
>>
>>  % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  48.65    0.544758          30     18359           epoll_wait
>>  28.69    0.321283          14     23540           write
>>  22.60    0.253049          19     13338           recvfrom
>>   0.04    0.000474           1       786       374 read
>>   0.03    0.000287           2       187           sendto
> This one as well.
>
>> ## "New" process
>>
>>  % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  93.87    1.588239          11    149253           epoll_wait
>>   3.84    0.064985          30      2140        31 recvfrom
>>   1.77    0.029905          13      2388           write
>>   0.34    0.005737          10       589       130 read
>>   0.12    0.002018          38        53           close
>>   0.06    0.000960           8       114         2 sendto
> This one is very interesting! So the epoll_wait to other syscalls ratio
> went from roughly 1/2 to 30/1. I'm pretty sure that a regular strace would
> show you a large number of epoll_wait(0)=0, indicating we're missing some
> events. I seem to remember that sometimes there are situations where a
> thread may be notified by epoll() about an fd it cannot take care of, but
> I don't remember in which case; I'll have to discuss with Christopher.
>
> But at least now we have an explanation, and it's not directly related to
> thread contention but more likely to the mapping of FDs to threads, so
> we may have opportunities to improve the situation here.
>
> Thanks!
> Willy
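A rough way to quantify the empty wakeups Willy mentions, over an arbitrary 5-second window (<pid> is a placeholder for the new process; strace writes to stderr, hence the redirect):

    # Count epoll_wait calls that returned zero events.
    timeout -s INT 5 strace -f -e trace=epoll_wait -p <pid> 2>&1 | grep -c '= 0$'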
Re: High load average under 1.8 with multiple draining processes
To accelerate the process, I've increased the number of threads from 4 to 8 on a 16-core machine. Ran strace for about 5s on each.

Single process (8 threads):

$ strace -cp 16807
 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.36    0.330172          21     15479           epoll_wait
 13.59    0.062861           4     14477         1 write
 10.58    0.048955           4     11518        10 recvfrom
  4.44    0.020544          38       537       244 read
  0.02    0.000094           1       135           sendto
  0.01    0.000051           3        16           epoll_ctl
  0.00    0.000000           0        24           close
  0.00    0.000000           0         9           socket
  0.00    0.000000           0         9         9 connect
  0.00    0.000000           0        25           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        24           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         9           fcntl
  0.00    0.000000           0         8         7 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.462677                 42279       271 total

Two processes (2x8 threads):

## Draining process

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.65    0.544758          30     18359           epoll_wait
 28.69    0.321283          14     23540           write
 22.60    0.253049          19     13338           recvfrom
  0.04    0.000474           1       786       374 read
  0.03    0.000287           2       187           sendto
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           sendmsg
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    1.119851                 56215       374 total

## "New" process

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.87    1.588239          11    149253           epoll_wait
  3.84    0.064985          30      2140        31 recvfrom
  1.77    0.029905          13      2388           write
  0.34    0.005737          10       589       130 read
  0.12    0.002018          38        53           close
  0.06    0.000960           8       114         2 sendto
  0.00    0.000031           1        25           shutdown
  0.00    0.000019           0       102           sendmsg
  0.00    0.000019           0        58           epoll_ctl
  0.00    0.000015           0        31           fcntl
  0.00    0.000000           0        31           socket
  0.00    0.000000           0        31        31 connect
  0.00    0.000000           0        94           setsockopt
  0.00    0.000000           0         8           getsockopt
  0.00    0.000000           0        47        29 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.691928                154964       223 total

It does indeed appear the new process is contending with the old, even with just 16 threads on a 16-core box. A third process (oversubscribed to 24 threads on 16 cores):

 % time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.21    0.950863          14     69926           epoll_wait
  2.32    0.022677         208       109           write
  0.42    0.004106          48        85        14 recvfrom
  0.04    0.000439          17        26           close
  0.00    0.000022           1        34           epoll_ctl
  0.00    0.000000           0       136        33 read
  0.00    0.000000           0         1           brk
  0.00    0.000000           0        14           socket
  0.00    0.000000           0        14        14 connect
  0.00    0.000000           0        15         1 sendto
  0.00    0.000000           0        11           sendmsg
  0.00    0.000000           0        13         1 shutdown
  0.00    0.000000           0        50           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        14           fcntl
  0.00    0.000000           0        34        22 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.978107                 70485        85 total

During this time, each of the three processes was running at roughly 250-350% CPU.

On 1/12/18 9:34 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
>> Thanks for your quick answer, Willy.
>>
>> That's a shame to hear but makes sense. We'll try
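One caveat for anyone reproducing these numbers: depending on the strace version, `strace -cp <pid>` may attach only to the named thread, while adding -f follows all threads of the process, which is what you want for an nbthread build. A sketch with an arbitrary 5-second window and an assumed pidfile path:

    # Aggregate syscall counts across all HAProxy threads for ~5 seconds.
    timeout -s INT 5 strace -f -c -p "$(cat /var/run/haproxy.pid)"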
Re: High load average under 1.8 with multiple draining processes
Thanks for your quick answer, Willy.

That's a shame to hear but makes sense. We'll try out some ideas for reducing contention. We don't use cpu-map with nbthread; I considered it best to let the kernel take care of this, especially since there are some other processes on that box. I don't really want to fall back to nbproc, but we may have to, at least until we get the number of reloads down.

On 1/12/18 8:55 AM, Willy Tarreau wrote:
> Hi Samuel,
>
> On Thu, Jan 11, 2018 at 08:29:15PM -0600, Samuel Reed wrote:
>> Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
>> (we didn't see this before with nbproc) or somewhere we should start
>> looking?
> In fact no, nbthread is simply new, so it's not a regression, but we're
> starting to see some side effects. One possibility I can easily imagine
> is that at most places we're using spinlocks, because the locks are very
> short-lived and very small, so they have to be cheap. One limit of
> spinlocks is that it's mandatory that you don't have more threads than
> cores, so that your threads are never scheduled out with a lock held,
> leaving another one to spin for nothing for a timeslice.
>
> The reload makes an interesting case because if you use cpu-map to bind
> your threads to CPU cores, during the soft-stop period they do have to
> coexist on the same cores, and a thread of one process disturbs the
> thread of another process by stealing its CPU regularly.
>
> I can't say I'm seeing any easy solution to this in the short term;
> that's something we have to add to the list of things to improve in the
> future. Maybe something as simple as starting with SCHED_FIFO to prevent
> threads from being preempted outside of the poll loop, and dropping it
> upon reload, could help a lot, but that's just speculation.
>
> We'll have to continue to think about this, I guess. It may be possible
> that if your old processes last very long, you'd continue to get a better
> experience using nbproc than nbthread :-/
>
> Willy
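For the archives, thread pinning in 1.8 would look roughly like the sketch below (values illustrative; the auto: form maps a thread set onto a CPU set):

    global
        nbthread 8
        # Bind threads 1-8 of process 1 to cores 0-7, one thread per core.
        cpu-map auto:1/1-8 0-7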
High load average under 1.8 with multiple draining processes
We've recently upgraded to HAProxy 1.8.3, which we run with `nbthread 4` (we used to run nbproc 4 with older releases). This has generally been good, especially for stick tables & stats.

We terminate SSL and proxy a large number of long-running TCP connections (websockets). When the configuration changes (usually a server going up or down), a reload occurs. For many hours afterward, there may be 2-5 active HAProxy instances as the old ones drain. We use hard-stop-after to keep this reasonable.

I have noticed that load average gets *very* high while these draining processes are still present. Our web tier usually runs at a load average of 5 with 16 cores. Across the board, load averages go up when stale HAProxy instances are active. I saw it go as high as 34 on one machine that had 5 instances, at 100% CPU, most of it sys. Even with just 2, the load average is double what it is with 1. Terminating the old process immediately brings the load down.

Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread (we didn't see this before with nbproc), or somewhere we should start looking? We make (relatively) heavy use of stick tables for DDoS protection purposes and terminate SSL, but aside from that our configuration is pretty vanilla. Nothing changed from 1.7 to 1.8 except changing nbproc to nbthread.

Thanks!
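For context, the relevant global settings look roughly like this sketch (the values are illustrative, not our exact production ones):

    global
        nbthread 4
        # Forcibly stop draining processes after at most 30 minutes.
        hard-stop-after 30m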
Re: Adding HSTS or custom headers on redirect
Pavlos Parissis <pavlos.parissis@...> writes:

> On 2 December 2014 at 09:17, Samuel Reed <samuel.trace.reed at gmail.com> wrote:
>> I'm running the latest 1.5 release. Our site runs primarily on the `www`
>> subdomain, but we want to enable HSTS for all subdomains
>> (includeSubDomains). Unfortunately, due to the way HSTS works, the HSTS
>> header MUST be present on the redirect from https://example.com to
>> https://www.example.com. I am using configuration like:
>>
>> rspadd Strict-Transport-Security:\ max-age=31536000;\ includeSubDomains
>> redirect prefix https://www.example.com code 301 if { hdr(host) -i example.com }
>>
>> For whatever reason, even when the rspadd line is before the redirect,
>> no headers are added to the redirect, making this impossible.
>>
>> I've considered a fake backend with a fake 503 file to get around this -
>> something like:
>>
>> HTTP/1.1 301 Moved Permanently
>> Cache-Control: no-cache
>> Content-Length: 0
>> Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
>> Location: https://www.example.com/
>> Connection: close
>>
>> While this will work, it feels really hacky. Is there a better way to
>> add a header on a redirect?
>
> Have a look at the thread 'add response header based on presence of
> request header'; your case matches the case I mentioned there.
>
> Cheers,
> Pavlos

Hi Pavlos - unfortunately this does not match my use case: I always want to send the HSTS header; it does not depend on the request. I just need to ensure that this header is *always* sent, even on a redirect.
Adding HSTS or custom headers on redirect
I'm running the latest 1.5 release. Our site runs primarily on the `www` subdomain, but we want to enable HSTS for all subdomains (includeSubDomains). Unfortunately, due to the way HSTS works, the HSTS header MUST be present on the redirect from https://example.com to https://www.example.com. I am using configuration like:

rspadd Strict-Transport-Security:\ max-age=31536000;\ includeSubDomains
redirect prefix https://www.example.com code 301 if { hdr(host) -i example.com }

For whatever reason, even when the rspadd line is before the redirect, no headers are added to the redirect, making this impossible. I've considered a fake backend with a fake 503 file to get around this - something like:

HTTP/1.1 301 Moved Permanently
Cache-Control: no-cache
Content-Length: 0
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload
Location: https://www.example.com/
Connection: close

While this will work, it feels really hacky. Is there a better way to add a header on a redirect?

Thanks
Sam
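Concretely, the hack would be a server-less backend whose 503 errorfile holds the raw 301 response above, since errorfile contents are emitted verbatim. A sketch with hypothetical names and paths:

    frontend fe_https
        # ... existing bind / ssl settings ...
        use_backend be_apex_redirect if { hdr(host) -i example.com }

    backend be_apex_redirect
        # No servers, so every request takes the 503 error path; the file
        # below is sent verbatim, 301 status line and HSTS header included.
        errorfile 503 /etc/haproxy/errors/301-hsts.http

where /etc/haproxy/errors/301-hsts.http contains exactly the raw response shown above.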
Stick-table misconfiguration; haproxy should throw error
I just wanted to report an issue that had me hitting my head on the table for a few hours: I reconfigured a development environment to use Chef to generate haproxy config files. Every 2-3 minutes or so, depending on load, haproxy would crash with a `trap divide error` (divide by zero). It had worked fine previously, and for those first few minutes it still worked perfectly, but the trap divide error kept coming back. I tried a new machine: still there. I tried every 1.5 version from 1.5.5 all the way back to the early dev builds: still there.

It turned out the culprit was a misnamed variable on my end that caused my stick-table definition to read `stick-table type ip ,bytes-in-rate(),http_err_rate(10s)`. That is, no period was defined on the bytes-in-rate entry. It would be very nice if haproxy regarded this as a configuration error and refused to start, rather than crashing minutes later with a cryptic trap divide error.

Thanks for a great project.

Best,
Sam
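For reference, a well-formed version of that line, with the rate period present (table size, expiry, and periods here are illustrative; note the store keywords are spelled with underscores):

    backend be_app
        stick-table type ip size 1m expire 10m store bytes_in_rate(10s),http_err_rate(10s)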