Re: [1.9 HEAD] HAProxy using 100% CPU
Ok, I'll wait for an update from you! :)

Mon, 13 May 2019 at 08:03 Willy Tarreau wrote:
> Hi Maciej,
>
> On Mon, May 13, 2019 at 07:21:59AM +0200, Maciej Zdeb wrote:
> > Hi,
> >
> > I'm not observing any issues, so I think it's fixed. :)
> >
> > Willy, Olivier thank you very much!
>
> Great, thanks. I'm going to issue 1.9.8 with the patch I sent you then.
> However, after discussing it with Olivier, we came to the conclusion
> that it is not the cleanest solution, as it addresses the consequence of
> a problem (being flow controlled and being in the send list) instead of
> the cause. Olivier proposed a different one that I'd very much
> appreciate you testing after the release; I don't want to take any new
> risk for 1.9.8 for now. We'll keep you updated.
>
> Cheers,
> Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej,

On Mon, May 13, 2019 at 07:21:59AM +0200, Maciej Zdeb wrote:
> Hi,
>
> I'm not observing any issues, so I think it's fixed. :)
>
> Willy, Olivier thank you very much!

Great, thanks. I'm going to issue 1.9.8 with the patch I sent you then.
However, after discussing it with Olivier, we came to the conclusion
that it is not the cleanest solution, as it addresses the consequence of
a problem (being flow controlled and being in the send list) instead of
the cause. Olivier proposed a different one that I'd very much
appreciate you testing after the release; I don't want to take any new
risk for 1.9.8 for now. We'll keep you updated.

Cheers,
Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi,

I'm not observing any issues, so I think it's fixed. :)

Willy, Olivier thank you very much!

Kind regards,

Sun, 12 May 2019 at 12:11 Willy Tarreau wrote:
> On Sun, May 12, 2019 at 12:00:44PM +0200, Maciej Zdeb wrote:
> > As for now it is stable, no looping. I'll leave it till Monday and
> > return with feedback! :)
>
> Many thanks Maciej for this positive feedback. I'll merge it and issue
> 1.9.8 then. There are definitely enough pending fixes for a release!
>
> Cheers,
> Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
On Sun, May 12, 2019 at 12:00:44PM +0200, Maciej Zdeb wrote:
> As for now it is stable, no looping. I'll leave it till Monday and return
> with feedback! :)

Many thanks Maciej for this positive feedback. I'll merge it and issue
1.9.8 then. There are definitely enough pending fixes for a release!

Cheers,
Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
As for now it is stable, no looping. I'll leave it till Monday and return
with feedback! :)

Thanks,

Sat, 11.05.2019, 15:44 Maciej Zdeb wrote:
> Patch applied, fingers crossed, testing! :-)
>
> Thanks!
>
> Sat, 11 May 2019 at 14:58 Willy Tarreau wrote:
>
>> On Sat, May 11, 2019 at 11:01:42AM +0200, Willy Tarreau wrote:
>> > On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
>> > > I certainly made a few reasoning mistakes above but I don't see
>> > > anything in the code preventing this case from happening.
>> > >
>> > > Thus I'd like you to try the attached patch, which is supposed to
>> > > prevent this scenario from happening. At least I've verified that
>> > > it doesn't break the h2spec test suite.
>> >
>> > While trying to check if it still applied to the latest 1.9 I figured
>> > that it corresponds to what Olivier had also found and fixed in his
>> > latest patch :-/ The positive point is that my analysis was correct.
>> >
>> > So I'm afraid that if it still fails with his fix, we'll need another
>> > core :-(
>>
>> Actually no, Olivier's fix is incomplete regarding the scenario I
>> proposed:
>> - in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove
>>   the element from the list
>> - then in h2_shutr() and h2_shutw(), we check if the list is empty
>>   before subscribing the element, which is true after the case above
>> - then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the
>>   item in the send_list, thus LIST_ADDQ() adds it a second time.
>>
>> Thus the first part of the patch I sent is still required. I'm
>> attaching it again, rebased on top of Olivier's patch and simplified
>> so that we don't detach then re-attach.
>>
>> I'm still keeping hope ;-)
>>
>> Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Patch applied, fingers crossed, testing! :-)

Thanks!

Sat, 11 May 2019 at 14:58 Willy Tarreau wrote:
> On Sat, May 11, 2019 at 11:01:42AM +0200, Willy Tarreau wrote:
> > On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
> > > I certainly made a few reasoning mistakes above but I don't see
> > > anything in the code preventing this case from happening.
> > >
> > > Thus I'd like you to try the attached patch, which is supposed to
> > > prevent this scenario from happening. At least I've verified that it
> > > doesn't break the h2spec test suite.
> >
> > While trying to check if it still applied to the latest 1.9 I figured
> > that it corresponds to what Olivier had also found and fixed in his
> > latest patch :-/ The positive point is that my analysis was correct.
> >
> > So I'm afraid that if it still fails with his fix, we'll need another
> > core :-(
>
> Actually no, Olivier's fix is incomplete regarding the scenario I
> proposed:
> - in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove the
>   element from the list
> - then in h2_shutr() and h2_shutw(), we check if the list is empty
>   before subscribing the element, which is true after the case above
> - then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the
>   item in the send_list, thus LIST_ADDQ() adds it a second time.
>
> Thus the first part of the patch I sent is still required. I'm attaching
> it again, rebased on top of Olivier's patch and simplified so that we
> don't detach then re-attach.
>
> I'm still keeping hope ;-)
>
> Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
On Sat, May 11, 2019 at 11:01:42AM +0200, Willy Tarreau wrote:
> On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
> > I certainly made a few reasoning mistakes above but I don't see
> > anything in the code preventing this case from happening.
> >
> > Thus I'd like you to try the attached patch, which is supposed to
> > prevent this scenario from happening. At least I've verified that it
> > doesn't break the h2spec test suite.
>
> While trying to check if it still applied to the latest 1.9 I figured
> that it corresponds to what Olivier had also found and fixed in his
> latest patch :-/ The positive point is that my analysis was correct.
>
> So I'm afraid that if it still fails with his fix, we'll need another
> core :-(

Actually no, Olivier's fix is incomplete regarding the scenario I proposed:
- in h2s_frt_make_resp_data() we can set H2_SF_BLK_SFCTL and remove the
  element from the list
- then in h2_shutr() and h2_shutw(), we check if the list is empty before
  subscribing the element, which is true after the case above
- then in h2c_update_all_ws() we still have H2_SF_BLK_SFCTL with the item
  in the send_list, thus LIST_ADDQ() adds it a second time.

Thus the first part of the patch I sent is still required. I'm attaching it
again, rebased on top of Olivier's patch and simplified so that we don't
detach then re-attach.

I'm still keeping hope ;-)

Willy

diff --git a/src/mux_h2.c b/src/mux_h2.c
index d201921..94e4e5e 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -1484,9 +1484,8 @@ static void h2c_update_all_ws(struct h2c *h2c, int diff)
 
 		if (h2s->mws > 0 && (h2s->flags & H2_SF_BLK_SFCTL)) {
 			h2s->flags &= ~H2_SF_BLK_SFCTL;
-			if (h2s->send_wait) {
+			if (h2s->send_wait && LIST_ISEMPTY(&h2s->list))
 				LIST_ADDQ(&h2c->send_list, &h2s->list);
-			}
 		}
 
 		node = eb32_next(node);
@@ -1791,9 +1790,8 @@ static int h2c_handle_window_update(struct h2c *h2c, struct h2s *h2s)
 		h2s->mws += inc;
 		if (h2s->mws > 0 && (h2s->flags & H2_SF_BLK_SFCTL)) {
 			h2s->flags &= ~H2_SF_BLK_SFCTL;
-			if (h2s->send_wait) {
+			if (h2s->send_wait && LIST_ISEMPTY(&h2s->list))
 				LIST_ADDQ(&h2c->send_list, &h2s->list);
-			}
 		}
 	}
 	else {
Re: [1.9 HEAD] HAProxy using 100% CPU
> What I find very strange is why you're possibly the only one seeing this
> (and maybe also @serimin on github issue #94). If we could figure out
> what makes your case specific it could help narrow the issue down. I'm
> seeing that you have a very simple Lua service to respond to health
> checks, so I've been thinking that maybe we do have some remaining bugs
> when Lua is accessed over H2 (e.g. incorrect length or whatever), but
> it's not the case on your connection since there are something like 17
> streams, so we can rule out the hypothesis of a health check connection,
> and thus that Lua was used.

Correct, Lua applets are used for health checking and only for a specific
list of clients (distinguished by source IP). Health checks are made over
HTTP 1.0 (not even 1.1), so it should not affect H2 in HAProxy.

This particular HAProxy is used for serving images, so the responses are
quite large and so is the traffic. I'm using HAProxy 1.9.7 on other
machines without that problem, so the traffic profile or the config
matters. The difference from the other HAProxy 1.9.7 instances I'm using
is the consistent hashing (hash-type consistent, balance hdr(...) and a
server id specified in the config for each server) and also the inter,
fastinter, downinter and slowstart settings.
Re: [1.9 HEAD] HAProxy using 100% CPU
On Sat, May 11, 2019 at 10:52:35AM +0200, Willy Tarreau wrote:
> I certainly made a few reasoning mistakes above but I don't see anything
> in the code preventing this case from happening.
>
> Thus I'd like you to try the attached patch, which is supposed to
> prevent this scenario from happening. At least I've verified that it
> doesn't break the h2spec test suite.

While trying to check if it still applied to the latest 1.9 I figured
that it corresponds to what Olivier had also found and fixed in his
latest patch :-/ The positive point is that my analysis was correct.

So I'm afraid that if it still fails with his fix, we'll need another
core :-(

Thanks,
Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
On Sat, May 11, 2019 at 09:56:18AM +0200, Willy Tarreau wrote:
> I'm back to auditing the code to figure out how we can free an h2s
> without first detaching it from the lists. I hope to have yet another
> patch to propose to you.

So I'm seeing something which bothers me in the code. Since I'm not at
ease with these parts I may say stupid things. But overall the idea is the
following:

- everywhere before releasing an h2s we remove it from its lists, so it
  cannot exist in a pool while still belonging to a list; that's fine.
  => it means the element we're looping on is still live in the
  connection.

- at every place where we're adding an element to the list, we take care
  of removing it first, or checking that it was not attached to a list.
  Every single place except one:

      static void h2c_update_all_ws(struct h2c *h2c, int diff)
      (...)
              if (h2s->mws > 0 && (h2s->flags & H2_SF_BLK_SFCTL)) {
                      h2s->flags &= ~H2_SF_BLK_SFCTL;
                      if (h2s->send_wait)
                              LIST_ADDQ(&h2c->send_list, &h2s->list);
              }

  Instead the check is made on send_wait being non-null.

- in h2s_frt_make_resp_data() we test this send_wait pointer before
  deciding to remove the element from the list in case we're blocking on
  flow control:

      if (size <= 0) {
              h2s->flags |= H2_SF_BLK_SFCTL;
              if (h2s->send_wait) {
                      LIST_DEL(&h2s->list);
                      LIST_INIT(&h2s->list);
              }
              goto end;
      }

  So if for any reason send_wait is not set, nobody removes the element
  from the list.

- this function h2s_frt_make_resp_data() is called from h2_snd_buf(),
  which starts by resetting h2s->send_wait. And it doesn't delete the
  stream from the list only in the case where it managed to send some
  data.

So what I'm wondering is if something like the following scenario may
happen:

 1) a stream wants to send data, it calls h2_snd_buf() a first time. The
    output buffer is full, the call fails.

 2) the caller (si_cs_send()) detects the error and calls h2_subscribe()
    to attach the stream to the send_list.

 3) some room is made in the buffer, the list is walked over and this
    stream is woken up.

 4) h2_snd_buf() is called again to try to send, send_wait is immediately
    reset.

 5) h2s_frt_make_resp_data() detects that the stream's flow control
    credit is exhausted and declines. It sets H2_SF_BLK_SFCTL but doesn't
    delete the element from the send list since send_wait is NULL.

 6) another stream still has credit and fills the connection's buffer
    with its data.

 7) the server the first stream is associated to delivers more data while
    the stream is still waiting for a window update, and attempts to send
    again via h2_snd_buf().

 8) h2_snd_buf() doesn't detect that the stream is blocked because the
    check on sending_list doesn't match, so it tries again and sets
    send_wait to NULL again, then tries to send. It fails again because
    the connection's buffer is full.

 9) si_cs_send() above calls h2_subscribe() again to set send_wait and
    try to add the stream to the send_list. But since the stream has the
    SFCTL flag it's not added again, and it remains in the send_list
    where it still is.

10) a window update finally arrives for this stream (or an initial window
    size update for all streams). h2c_handle_window_update() is called,
    sees the stream has the SFCTL flag, removes it and appends it to the
    end of the send_list, where it already was.

11) h2_process_mux() scans the list, finds the first reference to the
    list element (where it was first added), and wakes it up. When going
    to the next one however, it directly jumps to the one that was added
    past the second addition, ignoring the orphans in the list that sit
    between the first place of the element and its last one.

12) h2_snd_buf() is called again for this element, then deletes it from
    the list, which has the consequence of unlinking the element from its
    last predecessor and successors, but not from the first one. Once the
    element is deleted, it's initialized and its next points to itself
    (an empty list head).

13) h2_process_mux() scans the list again to process more streams, and
    reaches the first predecessor of this element, then visits the next,
    which is this element again, looping onto itself.

I certainly made a few reasoning mistakes above but I don't see anything
in the code preventing this case from happening.

Thus I'd like you to try the attached patch, which is supposed to prevent
this scenario from happening. At least I've verified that it doesn't
break the h2spec test suite.

Thanks,
Willy

diff --git a/src/mux_h2.c b/src/mux_h2.c
index 67a297a..e2b1533 100644
--- a/src/mux_h2.c
+++ b/src/mux_h2.c
@@ -1484,9 +1484,10 @@ static void h2c_update_all_ws(struct h2c *h2c, int diff)
 
 		if (h2s->mws > 0 && (h2s->flags & H2_SF_BLK_SFCTL)) {
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej,

On Fri, May 10, 2019 at 06:45:21PM +0200, Maciej Zdeb wrote:
> Olivier, it's still looping, but differently:
>
> 2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb) n
> 2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
(...)
> (gdb) p *h2s
> $1 = {cs = 0x2f84190, sess = 0x819580, h2c = 0x2f841a0, h1m = {state = 48,
> flags = 0, curr_len = 38317, body_len = 103852, next = 413, err_pos = -1,
> err_state = 0}, by_id = {node = {branches = {b = {0x34c0260, 0x321d330}},
> node_p = 0x0, leaf_p = 0x0, bit = 1, pfx = 47005}, key = 3}, id = 3,
> flags = 28675, mws = 1017461, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED,
> status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0,
> head = 0}, wait_event = {task = 0x2cd0ed0, handle = 0x3, events = 0},
> recv_wait = 0x0, send_wait = 0x321d390, list = {n = 0x321d3b8,
> p = 0x321d3b8}, sending_list = {n = 0x3174cf8, p = 0x3174cf8}}
(...)

In fact it's exactly the same. I've analyzed the core you sent me (thanks
a lot for this, by the way), and it explains very well why it's looping,
though I don't yet understand how we entered this situation. It also
explains how it can randomly crash instead of looping.

What happens is that the item you're inspecting above has an empty list
element which points to itself. Thus it doesn't belong to a list and its
next is itself. But it's still attached to its predecessor. In your core
file it was very visible, as the h2c list's head points to the element and
the list tail's tail points to it as well.

So what we're seeing is an element that was *not* unlinked from a list
before being removed / recycled. Once it was reused, it was initialized
with an empty list header, and the next time the connection's list was
visited, it caused the loop.

What I find very strange is why you're possibly the only one seeing this
(and maybe also @serimin on github issue #94). If we could figure out what
makes your case specific it could help narrow the issue down. I'm seeing
that you have a very simple Lua service to respond to health checks, so
I've been thinking that maybe we do have some remaining bugs when Lua is
accessed over H2 (e.g. incorrect length or whatever), but it's not the
case on your connection since there are something like 17 streams, so we
can rule out the hypothesis of a health check connection, and thus that
Lua was used.

I'm back to auditing the code to figure out how we can free an h2s without
first detaching it from the lists. I hope to have yet another patch to
propose to you.

Thanks again for your traces, they've been amazingly useful!
Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Olivier, it's still looping, but differently:

2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610            if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609            list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613            if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb) gcore
warning: target file /proc/12265/cmdline contained unexpected null characters
warning: Memory read failed for corefile section, 12288 bytes at 0x7fff17ff3000.
Saved corefile core.12265
(gdb) p *h2s
$1 = {cs = 0x2f84190, sess = 0x819580, h2c = 0x2f841a0, h1m = {state = 48,
  flags = 0, curr_len = 38317, body_len = 103852, next = 413, err_pos = -1,
  err_state = 0}, by_id = {node = {branches = {b = {0x34c0260, 0x321d330}},
  node_p = 0x0, leaf_p = 0x0, bit = 1, pfx = 47005}, key = 3}, id = 3,
  flags = 28675, mws = 1017461, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED,
  status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0,
  head = 0}, wait_event = {task = 0x2cd0ed0, handle = 0x3, events = 0},
  recv_wait = 0x0, send_wait = 0x321d390, list = {n = 0x321d3b8,
  p = 0x321d3b8}, sending_list = {n = 0x3174cf8, p = 0x3174cf8}}
(gdb) p *h2s_back
$2 = {cs = 0x2f84190, sess = 0x819580, h2c = 0x2f841a0, h1m = {state = 48,
  flags = 0, curr_len = 38317, body_len = 103852, next = 413, err_pos = -1,
  err_state = 0}, by_id = {node = {branches = {b = {0x34c0260, 0x321d330}},
  node_p = 0x0, leaf_p = 0x0, bit = 1, pfx = 47005}, key = 3}, id = 3,
  flags = 28675, mws = 1017461, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED,
  status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0,
  head = 0}, wait_event = {task = 0x2cd0ed0, handle = 0x3, events = 0},
  recv_wait = 0x0, send_wait = 0x321d390, list = {n = 0x321d3b8,
  p = 0x321d3b8}, sending_list = {n = 0x3174cf8, p = 0x3174cf8}}
(gdb) p *h2c
$3 = {conn = 0x2d10700, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
  flags = 0, streams_limit = 100, max_id = 9, rcvd_c = 0, rcvd_s = 0,
  ddht = 0x2f311a0, dbuf = {size = 0, area = 0x0, data = 0, head = 0},
  dsi = 3, dfl = 4, dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000',
  last_sid = -1, mbuf = {size = 16384, area = 0x34f0a10 "",
Re: [1.9 HEAD] HAProxy using 100% CPU
I've just sent some additional data to Willy. :) Sure, I'll test your
patch!

Fri, 10 May 2019 at 15:11 Olivier Houchard wrote:
> Hi Maciej,
>
> On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
> > Hi again,
> >
> > I have bad news, HAProxy 1.9.7-35b44da still looping :/
> >
> > gdb session:
> > h2_process_mux (h2c=0x1432420) at src/mux_h2.c:2609
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) n
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619        if (!h2s->send_wait) {
> > (gdb)
> > 2620        LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619        if (!h2s->send_wait) {
> > (gdb)
> > 2620        LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb)
> > 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> > (gdb)
> > 2619        if (!h2s->send_wait) {
> > (gdb)
> > 2620        LIST_DEL_INIT(&h2s->list);
> > (gdb)
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) p *h2s
> > $1 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m = {state =
> > H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411,
> > err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
> > {0x13dcf50, 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121,
> > bit = 1, pfx = 0}, key = 11}, id = 11, flags = 28675, mws = 977198,
> > errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200,
> > body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0},
> > wait_event = {task = 0x15077a0, handle = 0x0, events = 0},
> > recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p = 0x15b31a8},
> > sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
> > (gdb) p *h2s_back
> > $2 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m = {state =
> > H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411,
> > err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
> > {0x13dcf50, 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121,
> > bit = 1, pfx = 0}, key = 11}, id = 11, flags = 28675, mws = 977198,
> > errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200,
> > body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0},
> > wait_event = {task = 0x15077a0, handle = 0x0, events = 0},
> > recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p = 0x15b31a8},
> > sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
> > (gdb) p *h2c
> > $3 = {conn = 0x17e3310, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
> > flags = 0, streams_limit = 100, max_id = 13, rcvd_c = 0, rcvd_s = 0,
> > ddht = 0x1e99a40, dbuf = {size = 0, area = 0x0, data = 0, head = 0},
> > dsi = 13, dfl = 4, dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000',
> > last_sid = -1, mbuf = {size = 16384, area = 0x1e573a0 "", data = 13700,
> > head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000',
> > miw = 65535, mws = 10159243, mfs = 16384, timeout = 2, shut_timeout = 2,
> > nb_streams = 2, nb_cs = 3, nb_reserved = 0, stream_cnt = 7,
> > proxy = 0xb85fc0, task = 0x126aa30, streams_by_id = {b = {0x125ab91,
> > 0x0}}, send_list = {n = 0x15b31a8, p = 0x125ac18}, fctl_list = {
> > n = 0x14324f8, p = 0x14324f8}, sending_list = {n = 0x1432508,
> > p = 0x1432508}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n =
> > 0x1432528, p = 0x1432528}}, wait_event = {task = 0x1420fa0,
> > handle = 0x0, events = 1}}
> > (gdb) p list
> > $4 = (int *) 0x0
> > (gdb) n
> > 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> > (gdb) n
> > 2609        list_for_each_entry_safe(h2s,
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej,

On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
> Hi again,
>
> I have bad news, HAProxy 1.9.7-35b44da still looping :/
>
> gdb session:
> h2_process_mux (h2c=0x1432420) at src/mux_h2.c:2609
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb) n
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> (gdb)
> 2619        if (!h2s->send_wait) {
> (gdb)
> 2620        LIST_DEL_INIT(&h2s->list);
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> (gdb)
> 2619        if (!h2s->send_wait) {
> (gdb)
> 2620        LIST_DEL_INIT(&h2s->list);
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb)
> 2613        if (!LIST_ISEMPTY(&h2s->sending_list))
> (gdb)
> 2619        if (!h2s->send_wait) {
> (gdb)
> 2620        LIST_DEL_INIT(&h2s->list);
> (gdb)
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb) p *h2s
> $1 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m = {state =
> H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411,
> err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50,
> 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0},
> key = 11}, id = 11, flags = 28675, mws = 977198, errcode =
> H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf =
> {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task =
> 0x15077a0, handle = 0x0, events = 0}, recv_wait = 0x0, send_wait = 0x0,
> list = {n = 0x15b31a8, p = 0x15b31a8}, sending_list = {n = 0x15b31b8,
> p = 0x15b31b8}}
> (gdb) p *h2s_back
> $2 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m = {state =
> H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411,
> err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50,
> 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0},
> key = 11}, id = 11, flags = 28675, mws = 977198, errcode =
> H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf =
> {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task =
> 0x15077a0, handle = 0x0, events = 0}, recv_wait = 0x0, send_wait = 0x0,
> list = {n = 0x15b31a8, p = 0x15b31a8}, sending_list = {n = 0x15b31b8,
> p = 0x15b31b8}}
> (gdb) p *h2c
> $3 = {conn = 0x17e3310, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
> flags = 0, streams_limit = 100, max_id = 13, rcvd_c = 0, rcvd_s = 0,
> ddht = 0x1e99a40, dbuf = {size = 0, area = 0x0, data = 0, head = 0},
> dsi = 13, dfl = 4, dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000',
> last_sid = -1, mbuf = {size = 16384, area = 0x1e573a0 "", data = 13700,
> head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000',
> miw = 65535, mws = 10159243, mfs = 16384, timeout = 2, shut_timeout = 2,
> nb_streams = 2, nb_cs = 3, nb_reserved = 0, stream_cnt = 7,
> proxy = 0xb85fc0, task = 0x126aa30, streams_by_id = {b = {0x125ab91,
> 0x0}}, send_list = {n = 0x15b31a8, p = 0x125ac18}, fctl_list = {
> n = 0x14324f8, p = 0x14324f8}, sending_list = {n = 0x1432508,
> p = 0x1432508}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n =
> 0x1432528, p = 0x1432528}}, wait_event = {task = 0x1420fa0, handle = 0x0,
> events = 1}}
> (gdb) p list
> $4 = (int *) 0x0
> (gdb) n
> 2610        if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb) n
> 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> (gdb) p *h2s
> $5 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m = {state =
> H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411,
> err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50,
> 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0},
> key = 11}, id = 11, flags = 28675,
Re: [1.9 HEAD] HAProxy using 100% CPU
I'm getting old... I failed to remember to dump core :( and I already
killed the process. Sorry, but the issue must reoccur, and I can't say how
long that may take. As soon as I get a core dump I'll return.

Fri, 10.05.2019, 10:35 Willy Tarreau wrote:
> Hi Maciej,
>
> On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote:
> > Hi again,
> >
> > I have bad news, HAProxy 1.9.7-35b44da still looping :/
>
> Well, it's getting really annoying. Something's definitely wrong in
> this list and I can't figure out what.
>
> > 2609        list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
> > (gdb) p *h2s
> > $1 = {cs = 0x1499030, sess = 0x819580, h2c = 0x1432420, h1m
>                                               ^^^
> Seeing things like the above makes me doubt the list's integrity; it
> could again hold an element that was already reused somewhere else.
> Could you share your unstripped executable, a core dump and your config?
> (Not to the list! Just send a private link to Olivier or me.)
>
> At this point either we find what's happening or we'll have to issue
> 1.9.8 with this bug still alive, which doesn't make me feel comfortable,
> to say the least :-/
>
> Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej, On Thu, May 09, 2019 at 07:25:54PM +0200, Maciej Zdeb wrote: > Hi again, > > I have bad news, HAProxy 1.9.7-35b44da still looping :/ Well, it's getting really annoying. Something's definitely wrong in this list and I can't figure out what. > 2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) { > (gdb) p *h2s > $1 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420, h1m ^^^ Seeing things like the above makes me doubt the list's integrity; it could again hold an element that was already reused somewhere else. Could it be possible for you to share your unstripped executable, a core dump and your config? (not to the list! just send a private link to Olivier or me). At this point either we find what's happening or we'll have to issue 1.9.8 with this bug still alive, which doesn't make me feel comfortable to say the least :-/ Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi again,

I have bad news, HAProxy 1.9.7-35b44da still looping :/

gdb session:

h2_process_mux (h2c=0x1432420) at src/mux_h2.c:2609
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) n
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613    if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2619    if (!h2s->send_wait) {
(gdb)
2620    LIST_DEL_INIT(&h2s->list);
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613    if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2619    if (!h2s->send_wait) {
(gdb)
2620    LIST_DEL_INIT(&h2s->list);
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb)
2613    if (!LIST_ISEMPTY(&h2s->sending_list))
(gdb)
2619    if (!h2s->send_wait) {
(gdb)
2620    LIST_DEL_INIT(&h2s->list);
(gdb)
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) p *h2s
$1 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420, h1m = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50, 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0, events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p = 0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
(gdb) p *h2s_back
$2 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420, h1m = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50, 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0, events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p = 0x15b31a8}, sending_list = {n = 0x15b31b8, p = 0x15b31b8}}
(gdb) p *h2c
$3 = {conn = 0x17e3310, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 13, rcvd_c = 0, rcvd_s = 0, ddht = 0x1e99a40, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 13, dfl = 4, dft = 8 '\b', dff = 0 '\000', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x1e573a0 "", data = 13700, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 65535, mws = 10159243, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 2, nb_cs = 3, nb_reserved = 0, stream_cnt = 7, proxy = 0xb85fc0, task = 0x126aa30, streams_by_id = {b = {0x125ab91, 0x0}}, send_list = {n = 0x15b31a8, p = 0x125ac18}, fctl_list = {n = 0x14324f8, p = 0x14324f8}, sending_list = {n = 0x1432508, p = 0x1432508}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x1432528, p = 0x1432528}}, wait_event = {task = 0x1420fa0, handle = 0x0, events = 1}}
(gdb) p list
$4 = (int *) 0x0
(gdb) n
2610    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) n
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) p *h2s
$5 = {cs = 0x1499030, sess = 0x819580 , h2c = 0x1432420, h1m = {state = H1_MSG_DONE, flags = 29, curr_len = 0, body_len = 111976, next = 411, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x13dcf50, 0x15b3120}}, node_p = 0x125ab90, leaf_p = 0x15b3121, bit = 1, pfx = 0}, key = 11}, id = 11, flags = 28675, mws = 977198, errcode = H2_ERR_NO_ERROR, st = H2_SS_CLOSED, status = 200, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x15077a0, handle = 0x0, events = 0}, recv_wait = 0x0, send_wait = 0x0, list = {n = 0x15b31a8, p = 0x15b31a8}, sending_list = {n =
Re: [1.9 HEAD] HAProxy using 100% CPU
I'm happy to help! :) Checking Olivier's patch. Thanks! Thu, 9 May 2019 at 14:34 Willy Tarreau wrote: > On Thu, May 09, 2019 at 02:31:58PM +0200, Maciej Zdeb wrote: > > What bad luck :D I must have compiled it just before you pushed that > > change (segfault above is from haproxy 1.9.7-9b8ac0f). > > Great, so there's still some hope. I really appreciate your help and > feedback here, such issues are extremely difficult to track down and > even to reproduce and your help is invaluable here. > > Cheers, > Willy >
Re: [1.9 HEAD] HAProxy using 100% CPU
On Thu, May 09, 2019 at 02:31:58PM +0200, Maciej Zdeb wrote: > What bad luck :D I must have compiled it just before you pushed that > change (segfault above is from haproxy 1.9.7-9b8ac0f). Great, so there's still some hope. I really appreciate your help and feedback here, such issues are extremely difficult to track down and even to reproduce and your help is invaluable here. Cheers, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
On Thu, May 09, 2019 at 02:19:26PM +0200, Maciej Zdeb wrote: > Hi Willy, > > I've built 1.9 from head, unfortunately something is wrong, right now I've > got segfault: > > Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p > /var/run/haproxy.pid -D -sf 75'. > Program terminated with signal SIGSEGV, Segmentation fault. > #0 0x00484ab4 in h2_process_mux (h2c=0x1cda990) at > src/mux_h2.c:2609 > 2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) { > (gdb) bt f > #0 0x00484ab4 in h2_process_mux (h2c=0x1cda990) at > src/mux_h2.c:2609 > h2s = 0x2 ^ So now we have the proof that the memory is corrupted! Olivier thinks he found a possible culprit one hour ago, which could cause a detached stream to stay referenced, and thus to corrupt the lists. I'm unsure to what extent it can cause the presence of a dead stream in this list, but from the beginning we're seeing wrong stuff there in your reports and it affects the same list :-/ For reference the commit ID in 1.9 is 07a9f0265 ("BUG/MEDIUM: h2: Make sure we set send_list to NULL in h2_detach()."). I hope that's not what you're running so that there's still some hope. Care to double-check? Thanks, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Willy,

I've built 1.9 from head; unfortunately something is wrong, right now I've got a segfault:

Core was generated by `/usr/sbin/haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -D -sf 75'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00484ab4 in h2_process_mux (h2c=0x1cda990) at src/mux_h2.c:2609
2609    list_for_each_entry_safe(h2s, h2s_back, &h2c->send_list, list) {
(gdb) bt f
#0  0x00484ab4 in h2_process_mux (h2c=0x1cda990) at src/mux_h2.c:2609
        h2s = 0x2
        h2s_back =
#1  h2_send (h2c=0x1cda990) at src/mux_h2.c:2747
        flags =
        conn = 0x2d12190
        done = 0
        sent = 0
#2  0x0048d488 in pool_free (ptr=0x1cdaa68, pool=0x0) at include/common/memory.h:327
No locals.
#3  __b_drop (buf=) at include/common/buffer.h:112
No locals.
#4  b_drop (buf=0x1cda9b8) at include/common/buffer.h:119
No locals.
#5  b_free (buf=0x1cda9b8) at include/common/buffer.h:125
No locals.
#6  h2_release_buf (h2c=0x1cda990, bptr=0x1cda9b8) at src/mux_h2.c:407
        bptr = 0x1cda9b8
#7  h2_process (h2c=0x1cda990) at src/mux_h2.c:2889
        conn = 0xfffdd7b8
#8  0x004bf425 in run_thread_poll_loop (data=0x0) at src/haproxy.c:2710
        ptif =
        ptdf =
        start_lock = 0
#9  0x0001 in ?? ()
No symbol table info available.
#10 0x0041fb86 in main (argc=, argv=0x7ffe885af938) at src/haproxy.c:3354 tids = 0x1924900 threads = 0xfffc i = old_sig = {__val = {0, 140199073187936, 140199073187072, 140199071033905, 0, 140199073173696, 1, 0, 18446603340516163585, 140199073187072, 1, 140199073235400, 0, 140199073236256, 0, 24}} blocked_sig = {__val = {1844674406710583, 18446744073709551615 }} err = retry = limit = {rlim_cur = 100, rlim_max = 100} errmsg = "\000\000\000\000\000\000\000\000n\000\000\000w", '\000' , "p1K\243\202\177\000\000\000\000\000\000\000\000\000\000`'\\\242\202\177\000\000\030\000\000\000\000\000\000\000\200f4\001\000\000\000\000>\001\000\024\000\000\000\000 \352J\243\202\177\000\000`\221\065\001\000\000\000\000\340*(\242\202\177\000\000\370\371Z\210" pidfd = Thu, 9 May 2019 at 10:51 Willy Tarreau wrote: > Hi Maciej, > > I've just pushed a number of fixes into 1.9-master, including the one > I was talking about, if you want to try again. > > Cheers, > Willy >
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej, I've just pushed a number of fixes into 1.9-master, including the one I was talking about, if you want to try again. Cheers, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Ok, thanks! Wed, 8.05.2019, 16:50 Willy Tarreau wrote: > On Wed, May 08, 2019 at 04:48:55PM +0200, Maciej Zdeb wrote: > > I'll gladly test Olivier's patch after backporting. :) > > Thank you Maciej. I still have other stuff to deal with before going back > to the pending 1.9 backports, and I'll merge it. > > Willy >
Re: [1.9 HEAD] HAProxy using 100% CPU
I'll gladly test Olivier's patch after backporting. :) Wed, 8.05.2019, 15:29 Willy Tarreau wrote: > On Wed, May 08, 2019 at 03:03:23PM +0200, Olivier Houchard wrote: > > > > I can't seem to remember :) > > > > > > Given the number of bugs we've dealt with in the last few weeks, you're > > > forgiven :-) > > > > > > > I'm afraid I'm getting old :/ > > Ah! you see how it feels! > > > > > I think that patch is safe to backport anyway, but I wouldn't swear > it solves > > > > Maciej's issue. > > > > > > OK. I'll take it then so that we can study the situation under better > > > conditions if the problem happens again. It may be possible that we've > > > also overlooked other situations where this infinite loop could happen > > > with SUB_CALL_UNSUBSCRIBE and that were addressed by your patch. > > > > > > > Yes it is certainly possible it helps, I just can't affirm it with > confidence. > > This is why I prefer to take it. The amount of changes that went into 1.9 > was significant enough to get such hard-to-debug issues. As long as fixes > don't harm, I prefer to take them rather than having to fight one full week with > a bug just to discover that it was already fixed upstream. > > Thanks, > Willy >
Re: [1.9 HEAD] HAProxy using 100% CPU
On Wed, May 08, 2019 at 03:03:23PM +0200, Olivier Houchard wrote: > > > I can't seem to remember :) > > > > Given the number of bugs we've dealt with in the last few weeks, you're > > forgiven :-) > > > > I'm afraid I'm getting old :/ Ah! you see how it feels! > > > I think that patch is safe to backport anyway, but I wouldn't swear it > > > solves > > > Maciej's issue. > > > > OK. I'll take it then so that we can study the situation under better > > conditions if the problem happens again. It may be possible that we've > > also overlooked other situations where this infinite loop could happen > > with SUB_CALL_UNSUBSCRIBE and that were addressed by your patch. > > > > Yes it is certainly possible it helps, I just can't affirm it with confidence. This is why I prefer to take it. The amount of changes that went into 1.9 was significant enough to get such hard-to-debug issues. As long as fixes don't harm, I prefer to take them rather than having to fight one full week with a bug just to discover that it was already fixed upstream. Thanks, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
On Wed, May 08, 2019 at 02:30:07PM +0200, Willy Tarreau wrote: > On Wed, May 08, 2019 at 01:56:05PM +0200, Olivier Houchard wrote: > > > > One of processes stuck in infinite loop, admin socket is not responsive > > > > so > > > > I've got information only from gdb: > > > > > > > > 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at > > > > src/mux_h2.c:2589 > > > > 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > > > > (gdb) n > > > (...) > > > > > > CCing Olivier. Olivier, I'm wondering if this is not directly related to > > > what you addressed with this fix merged in 2.0 but not backported : > > > > > > 998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.") > > > > > > From what I'm seeing there's no error, the stream is in the sending list, > > > there's no blocking flag, well, everything looks OK, but we're looping > > > on SUB_CALL_UNSUBSCRIBE which apparently should not if I understand it > > > right. Do you think we should backport this patch? > > > > > > Remaining of the trace below for reference. > > > > > > Thanks, > > > Willy > > > > > > > I can't seem to remember :) > > Given the number of bugs we've dealt with in the last few weeks, you're > forgiven :-) > > > > I'm afraid I'm getting old :/ > > Ah! you see how it feels! > > > > > I remember a bug involving the h2s being added to the send_list twice, and > > thus leading to an infinite loop when parsing the send_list, but it was > > fixed by 9a0f559676c4d309edbe42ba33197e7dd8935f1c, and backported. > > For me, this one was mostly to make me happier about the code, and stop > > abusing SUB_CALL_UNSUBSCRIBE, and if I didn't mention it should be > > backported > > it's probably because I considered the code in 1.9 was OK (or I just forgot), but > > it's still flagged as "BUG" too (but that could be me having "BUG/MEDIUM" > > hardcoded in my finger). > > I think that patch is safe to backport anyway, but I wouldn't swear > > it solves > > Maciej's issue. > > OK. 
I'll take it then so that we can study the situation under better > conditions if the problem happens again. It may be possible that we've > also overlooked other situations where this infinite loop could happen > with SUB_CALL_UNSUBSCRIBE and that were addressed by your patch. > Yes it is certainly possible it helps, I just can't affirm it with confidence. Regards, Olivier
Re: [1.9 HEAD] HAProxy using 100% CPU
On Wed, May 08, 2019 at 01:56:05PM +0200, Olivier Houchard wrote: > > > One of processes stuck in infinite loop, admin socket is not responsive so > > > I've got information only from gdb: > > > > > > 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589 > > > 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > > > (gdb) n > > (...) > > > > CCing Olivier. Olivier, I'm wondering if this is not directly related to > > what you addressed with this fix merged in 2.0 but not backported : > > > > 998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.") > > > > From what I'm seeing there's no error, the stream is in the sending list, > > there's no blocking flag, well, everything looks OK, but we're looping > > on SUB_CALL_UNSUBSCRIBE which apparently should not if I understand it > > right. Do you think we should backport this patch? > > > > Remaining of the trace below for reference. > > > > Thanks, > > Willy > > > > I can't seem to remember :) Given the number of bugs we've dealt with in the last few weeks, you're forgiven :-) > I remember a bug involving the h2s being added to the send_list twice, and > thus leading to an infinite loop when parsing the send_list, but it was > fixed by 9a0f559676c4d309edbe42ba33197e7dd8935f1c, and backported. > For me, this one was mostly to make me happier about the code, and stop > abusing SUB_CALL_UNSUBSCRIBE, and if I didn't mention it should be backported > it's probably because I considered the code in 1.9 was OK (or I just forgot), but > it's still flagged as "BUG" too (but that could be me having "BUG/MEDIUM" > hardcoded in my finger). > I think that patch is safe to backport anyway, but I wouldn't swear it solves > Maciej's issue. OK. I'll take it then so that we can study the situation under better conditions if the problem happens again. It may be possible that we've also overlooked other situations where this infinite loop could happen with SUB_CALL_UNSUBSCRIBE and that were addressed by your patch. 
Thanks, Willy
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi, On Wed, May 08, 2019 at 07:11:27AM +0200, Willy Tarreau wrote: > Hi Maciej, > > On Tue, May 07, 2019 at 07:08:47PM +0200, Maciej Zdeb wrote: > > Hi, > > > > I've got another bug with 100% CPU on HAProxy process, it is built from > > HEAD of 1.9 branch. > > > > One of processes stuck in infinite loop, admin socket is not responsive so > > I've got information only from gdb: > > > > 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589 > > 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > > (gdb) n > (...) > > CCing Olivier. Olivier, I'm wondering if this is not directly related to > what you addressed with this fix merged in 2.0 but not backported : > > 998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.") > > From what I'm seeing there's no error, the stream is in the sending list, > there's no blocking flag, well, everything looks OK, but we're looping > on SUB_CALL_UNSUBSCRIBE which apparently should not if I understand it > right. Do you think we should backport this patch? > > Remaining of the trace below for reference. > > Thanks, > Willy > I can't seem to remember :) I remember a bug involving the h2s being added to the send_list twice, and thus leading to an infinite loop when parsing the send_list, but it was fixed by 9a0f559676c4d309edbe42ba33197e7dd8935f1c, and backported. For me, this one was mostly to make me happier about the code, and stop abusing SUB_CALL_UNSUBSCRIBE, and if I didn't mention it should be backported it's probably because I considered the code in 1.9 was OK (or I just forgot), but it's still flagged as "BUG" too (but that could be me having "BUG/MEDIUM" hardcoded in my finger). I think that patch is safe to backport anyway, but I wouldn't swear it solves Maciej's issue. Regards, Olivier
Re: [1.9 HEAD] HAProxy using 100% CPU
Hi Maciej, On Tue, May 07, 2019 at 07:08:47PM +0200, Maciej Zdeb wrote: > Hi, > > I've got another bug with 100% CPU on HAProxy process, it is built from > HEAD of 1.9 branch. > > One of processes stuck in infinite loop, admin socket is not responsive so > I've got information only from gdb: > > 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589 > 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE) > (gdb) n (...) CCing Olivier. Olivier, I'm wondering if this is not directly related to what you addressed with this fix merged in 2.0 but not backported : 998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.") From what I'm seeing there's no error, the stream is in the sending list, there's no blocking flag, well, everything looks OK, but we're looping on SUB_CALL_UNSUBSCRIBE which apparently should not if I understand it right. Do you think we should backport this patch? Remaining of the trace below for reference. Thanks, Willy
---
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2586    list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb)
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> (gdb)
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2586    list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb)
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb) p h2c
> $1 = (struct h2c *) 0x2e8ff30
> (gdb) p *h2c
> $2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}}
> (gdb) n
> 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> (gdb) p *h2s
> $3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x2a72250, 0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx = 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle = 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n = 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}}
> (gdb) n
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb) p *h2c
> $4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}}
> (gdb) n
> 2586    list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb) n
> 2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
> (gdb) n
> 2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
>
> HAProxy info:
> HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/
> Build options :
>   TARGET  = linux2628
>   CPU     = generic
>   CC      = gcc
>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
>
[1.9 HEAD] HAProxy using 100% CPU
Hi,

I've got another bug with 100% CPU on a HAProxy process, it is built from HEAD of the 1.9 branch.

One of the processes is stuck in an infinite loop, the admin socket is not responsive so I've got information only from gdb:

0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589
2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) n
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2586    list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb)
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb)
2586    list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) p h2c
$1 = (struct h2c *) 0x2e8ff30
(gdb) p *h2c
$2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}}
(gdb) n
2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) p *h2s
$3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next = 0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b = {0x2a72250, 0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx = 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode = H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size = 0, area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle = 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n = 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}}
(gdb) n
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) p *h2c
$4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR, flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149, dfl = 0, dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft = 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384, timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53, nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0, streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p = 0x3093c18}, fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n = 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle = 0x0, events = 1}}
(gdb) n
2586    list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb) n
2587    if (h2c->st0 >= H2_CS_ERROR || h2c->flags & H2_CF_MUX_BLOCK_ANY)
(gdb) n
2589    if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)

HAProxy info:
HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU     = generic
  CC      = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
-Wno-clobbered -Wno-missing-field-initializers -Wtype-limits -DIP_BIND_ADDRESS_NO_PORT=24 OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_DL=1 USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1 Default settings : maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200 Built with OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019 Running on OpenSSL version : OpenSSL 1.1.1b 26 Feb 2019 OpenSSL library supports TLS extensions : yes OpenSSL library supports SNI : yes OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3 Built with Lua version : Lua 5.3.5 Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT IP_FREEBIND Built with zlib version : 1.2.8 Running on zlib version : 1.2.8 Compression algorithms supported :