Re: [2.0.17] crash with coredump

2020-11-13 Thread Christopher Faulet
On 11/11/2020 at 12:43, Maciej Zdeb wrote: Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0 and 2.2 branches. I'll keep testing it for a couple days just to be sure. So that stacktrace I shared before (on spoe_release_appctx function) was very lucky... Do you think

Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
Wed, 11 Nov 2020 at 12:53, Willy Tarreau wrote: > Two months of chasing a non reproducible > memory corruption with zero initial info is quite an achievement, many > thanks for doing that! > Initially it crashed (once every few hours) only on our most critical HAProxy servers and with SPOA

Re: [2.0.17] crash with coredump

2020-11-11 Thread Willy Tarreau
On Wed, Nov 11, 2020 at 12:43:50PM +0100, Maciej Zdeb wrote: > Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0 > and 2.2 branches. I'll keep testing it for a couple days just to be sure. > > So that stacktrace I shared before (on spoe_release_appctx function) was > very

Re: [2.0.17] crash with coredump

2020-11-11 Thread Maciej Zdeb
Wow! Yes, I can confirm that a crash does not occur now. :) I checked 2.0 and 2.2 branches. I'll keep testing it for a couple days just to be sure. So that stacktrace I shared before (on spoe_release_appctx function) was very lucky... Do you think that it'd be possible to find the bug without the

Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
Hi Christopher, On Tue, Nov 10, 2020 at 09:17:15PM +0100, Christopher Faulet wrote: > On 10/11/2020 at 18:12, Maciej Zdeb wrote: > > Hi, > > > > I'm so happy you're able to replicate it! :) > > > > With that patch that disabled pool_flush I still can reproduce on my r > > server and on

Re: [2.0.17] crash with coredump

2020-11-10 Thread Christopher Faulet
On 10/11/2020 at 18:12, Maciej Zdeb wrote: Hi, I'm so happy you're able to replicate it! :) With that patch that disabled pool_flush I still can reproduce on my r server and on production, just different places of crash: Hi Maciej, Could you test the following patch please ? For now I

Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi, I'm so happy you're able to replicate it! :) With that patch that disabled pool_flush I still can reproduce on my r server and on production, just different places of crash: on r: (gdb) bt #0 tasklet_wakeup (tl=0xd720c300a000) at include/haproxy/task.h:328 #1 h2s_notify_recv

Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
On Tue, Nov 10, 2020 at 04:14:52PM +0100, Willy Tarreau wrote: > Seems like we're getting closer. Will continue digging now. I found that among the 5 crashes I got, 3 were under pool_flush() that is precisely called during the soft stopping. I tried to disable that function with the patch below

Re: [2.0.17] crash with coredump

2020-11-10 Thread Willy Tarreau
Hi Maciej, On Tue, Nov 10, 2020 at 03:21:45PM +0100, Maciej Zdeb wrote: > Hi, > > I'm very sorry that my skills in gdb and knowledge of HAProxy and C are not > sufficient for this debugging process. Quite frankly, you don't have to be sorry for anything :-) I could reproduce the crash on 2.2

Re: [2.0.17] crash with coredump

2020-11-10 Thread Maciej Zdeb
Hi, I'm very sorry that my skills in gdb and knowledge of HAProxy and C are not sufficient for this debugging process. With the patch applied I tried again to use spoa from "contrib/spoa_example/". Example spoa agent does not understand my spoe-message and silently ignores it, but it doesn't

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
It crashed now on first test in process_stream: struct task *process_stream(struct task *t, void *context, unsigned short state) { struct server *srv; struct stream *s = context; struct session *sess = s->sess; unsigned int rqf_last, rpf_last; unsigned int

Re: [2.0.17] crash with coredump

2020-11-09 Thread Christopher Faulet
On 09/11/2020 at 13:10, Maciej Zdeb wrote: I've played a little bit with the patch and it led me to the backend.c file and the connect_server() function: int connect_server(struct stream *s) { [...] if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) { /* set the correct protocol on the

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
I've played a little bit with the patch and it led me to the backend.c file and the connect_server() function: int connect_server(struct stream *s) { [...] if (!conn_xprt_ready(srv_conn) && !srv_conn->mux) { /* set the correct protocol on the output stream interface */ if

Re: [2.0.17] crash with coredump

2020-11-09 Thread Maciej Zdeb
Hi, This time h2s = 0x30 ;) it crashed here: void testcorrupt(void *ptr) { [...] if (h2s->cs != cs) return; [...] Program terminated with signal SIGSEGV, Segmentation fault. #0 0x556b617f0562 in testcorrupt (ptr=0x7f99741d85a0) at src/mux_h2.c:6228 6228 src/mux_h2.c: No

Re: [2.0.17] crash with coredump

2020-11-06 Thread Willy Tarreau
Maciej, I wrote this ugly patch to try to crash as soon as possible when a corrupt h2s->subs is detected. The patch was written for 2.2. I only instrumented roughly 30 places in process_stream() which is a fairly likely candidate. I just hope it happens within the context of the stream itself
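The idea of crashing as early as possible on a corrupt pointer can be sketched as below. This is not the patch from this message: `demo_stream`, `demo_ptr_looks_sane()` and `demo_check_corrupt()` are hypothetical names, and the "above the first 4096 bytes and pointer-aligned" test is an assumed heuristic chosen only for illustration.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a structure that holds the suspect pointer. */
struct demo_stream {
    void *subs;
};

/* Assumed heuristic: NULL is valid ("no subscriber"); any other value
 * must lie above the first 4096 bytes and be aligned like a pointer. */
int demo_ptr_looks_sane(const void *p)
{
    uintptr_t v = (uintptr_t)p;

    if (v == 0)
        return 1;
    if (v < 4096)
        return 0;
    if (v % sizeof(void *) != 0)
        return 0;
    return 1;
}

/* Meant to be called at many points of a suspect function: it aborts on
 * the first implausible value, so the stack trace is taken close to the
 * moment the corruption becomes visible. */
void demo_check_corrupt(const struct demo_stream *s)
{
    if (!demo_ptr_looks_sane(s->subs))
        abort();
}
```

A check like this only catches values that are obviously not pointers; a corrupt value that happens to be aligned and in a plausible range passes unnoticed.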

Re: [2.0.17] crash with coredump

2020-11-06 Thread Willy Tarreau
Hi Kirill, On Fri, Nov 06, 2020 at 06:41:03PM +0100, Kirill A. Korinsky wrote: > Hey, > > I'm wondering, is it related to this code: > > + /* some tasks may have woken other ones up */ > + if (max_processed && thread_has_tasks()) > + goto not_done_yet; > + (...) > as

Re: [2.0.17] crash with coredump

2020-11-06 Thread Kirill A. Korinsky
Hey, I'm wondering, is it related to this code: + /* some tasks may have woken other ones up */ + if (max_processed && thread_has_tasks()) + goto not_done_yet; + from

Re: [2.0.17] crash with coredump

2020-11-03 Thread Maciej Zdeb
I modified h2s struct in 2.2 branch with HEAD set to f96508aae6b49277dcf142caa35042678cf8e2ca "MEDIUM: mux-h2: merge recv_wait and send_wait event notifications" like below (subs is in exact place of removed wait_event): struct h2s { [...] struct tasklet *dummy0; struct

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
I'm wondering, the corrupted address was always at "wait_event" in h2s struct, after its removal in: http://git.haproxy.org/?p=haproxy-2.2.git;a=commitdiff;h=5723f295d85febf5505f8aef6afabb6b23d6fdec;hp=f11be0ea1e8e571234cb41a2fcdde2cf2161df37 crashes went away. But with the above patch and after

Re: [2.0.17] crash with coredump

2020-11-02 Thread Kirill A. Korinsky
Maciej, Looks like the memory corruption is still here, but it just corrupts some other place. Willy, do you agree? -- wbr, Kirill > On 2. Nov 2020, at 15:34, Maciej Zdeb wrote: > > So after Kirill's suggestion to modify the h2s struct so that the tasklet > "shut_tl" is before recv_wait, I verified

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
So after Kirill's suggestion to modify the h2s struct so that the tasklet "shut_tl" is before recv_wait, I verified whether the same crash would occur in 2.2.4, and it did not! After the patch that merges recv_wait and send_wait:

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Great idea Kirill, With such modification: struct h2s { [...] struct tasklet *shut_tl; struct wait_event *recv_wait; /* recv wait_event the conn_stream associated is waiting on (via h2_subscribe) */ struct wait_event *send_wait; /* send wait_event the conn_stream
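Why reordering the fields is a useful experiment can be sketched with `offsetof`. The two structs below are hypothetical, heavily reduced stand-ins for the real `struct h2s` (only the field names come from this thread; the types are placeholders). If a stray write always hits the same offset, moving `shut_tl` into the old `recv_wait` slot makes the write land on a different field, which changes or hides the symptom without removing the bug.

```c
#include <stddef.h>
#include <stdint.h>

/* Reduced, hypothetical layout before the change. */
struct demo_h2s_before {
    uint16_t status;
    void *recv_wait;   /* placeholder for struct wait_event * */
    void *send_wait;   /* placeholder for struct wait_event * */
    void *shut_tl;     /* placeholder for struct tasklet * */
};

/* Same fields, with shut_tl moved in front of recv_wait. */
struct demo_h2s_after {
    uint16_t status;
    void *shut_tl;
    void *recv_wait;
    void *send_wait;
};

/* Returns 1 if shut_tl now sits at the offset recv_wait used to have. */
int demo_shut_tl_in_old_recv_wait_slot(void)
{
    return offsetof(struct demo_h2s_after, shut_tl) ==
           offsetof(struct demo_h2s_before, recv_wait);
}

/* Returns 1 if the reordering kept the overall object size unchanged. */
int demo_same_size(void)
{
    return sizeof(struct demo_h2s_after) == sizeof(struct demo_h2s_before);
}
```

Keeping the object size unchanged matters for such a test: the allocation pattern stays the same, so only the field under the stray write differs.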

Re: [2.0.17] crash with coredump

2020-11-02 Thread Kirill A. Korinsky
Hi, Thanks for the update. After reading Willy's recommendation and the commit that fixed the issue, I'm curious: can you "edit" this commit a bit and move `shut_tl` before `recv_wait`, in place of the removed `wait_event`? It is a quite dumb way to confirm that the memory corruption has gone, and not

Re: [2.0.17] crash with coredump

2020-11-02 Thread Maciej Zdeb
Hi, Update for people on the list that might be interested in the issue, because part of discussion was private. I wanted to check Willy's suggestion and modified h2s struct (added dummy fields): struct h2s { [...] uint16_t status; /* HTTP response status */ unsigned

Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
Very interesting. Anyway, I can see that this piece of code was refactored some time ago: https://github.com/haproxy/haproxy/commit/f96508aae6b49277dcf142caa35042678cf8e2ca Maybe it is worth trying 2.2 or 2.3

Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
On Fri, Sep 25, 2020 at 03:26:47PM +0200, Kirill A. Korinsky wrote: > > On 25. Sep 2020, at 15:06, Willy Tarreau wrote: > > > > Till here your analysis is right but: > > - the overflow would only be at most the number of extra threads running > > init_genrand() concurrently, or more

Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
On Fri, Sep 25, 2020 at 03:26:05PM +0200, Maciej Zdeb wrote: > > Here I can suggest to implement Yarrow PRNG (that is very simple to > > implement) with some lua-pure cryptographic hash function. > > We're using lrandom because of the Mersenne Twister algorithm and its well > known weaknesses and

Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Yes at the same place with same value: (gdb) bt full #0 0x559ce98b964b in h2s_notify_recv (h2s=0x559cef94e7e0) at src/mux_h2.c:783 sw = 0x Fri, 25 Sep 2020 at 15:42, Kirill A. Korinsky wrote: > > On 25. Sep 2020, at 15:26, Maciej Zdeb wrote: > > > > I was mailing

Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
> On 25. Sep 2020, at 15:26, Maciej Zdeb wrote: > > I was mailing outside the list with Willy and Christopher but it's worth > sharing that the problem occurs even with nbthread = 1. I've managed to > confirm it today. I'm curious: did it crash at the same place with the same value? -- wbr,

Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
> On 25. Sep 2020, at 15:06, Willy Tarreau wrote: > > Till here your analysis is right but: > - the overflow would only be at most the number of extra threads running > init_genrand() concurrently, or more precisely the distance between > the most upfront to the latest thread, so in the
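For context, here is a minimal sketch of Mersenne Twister seeding written so that concurrent callers cannot disturb each other's loop index. It uses the well-known MT19937 seeding recurrence; the `demo_*` names and the struct layout are hypothetical, and how lrandom actually stores its state is not shown in this thread. The assumption illustrated is that an overflow of the kind discussed above needs the table and its index to be shared between threads, whereas here both live in a caller-owned struct and the loop counter is a local variable.

```c
#include <stdint.h>

#define DEMO_MT_N 624   /* size of the MT19937 state table */

/* Hypothetical per-caller generator state: nothing here is global. */
struct demo_mt_state {
    uint32_t mt[DEMO_MT_N];
    int idx;
};

/* Fills the state table from a seed using the MT19937 recurrence.
 * The loop index is a local variable, so another thread seeding its
 * own state cannot push it past the end of the table. */
void demo_mt_seed(struct demo_mt_state *st, uint32_t seed)
{
    int i;

    st->mt[0] = seed;
    for (i = 1; i < DEMO_MT_N; i++)
        st->mt[i] = 1812433253U *
                    (st->mt[i - 1] ^ (st->mt[i - 1] >> 30)) + (uint32_t)i;
    st->idx = DEMO_MT_N;
}

/* Usage example: seed a fresh state and return table word i. */
uint32_t demo_mt_seeded_word(uint32_t seed, int i)
{
    struct demo_mt_state st;

    demo_mt_seed(&st, seed);
    return st.mt[i];
}
```

Two threads that shared one `demo_mt_state` would still need a lock; the sketch only shows how to avoid sharing in the first place.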

Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
> Here I can suggest to implement Yarrow PRNG (that is very simple to > implement) with some lua-pure cryptographic hash function. We're using lrandom because of the Mersenne Twister algorithm and its well known weaknesses and strengths. > In fact I know it's possible to call haproxy's internal

Re: [2.0.17] crash with coredump

2020-09-25 Thread Willy Tarreau
Hi Kirill, On Fri, Sep 25, 2020 at 12:34:16PM +0200, Kirill A. Korinsky wrote: > I've extracted a piece of code from lrandom and put it here: > https://gist.github.com/catap/bf862cc0d289083fc1ccd38c905e2416 > > > You can see that

Re: [2.0.17] crash with coredump

2020-09-25 Thread Maciej Zdeb
Hi Kirill, Thanks for your hints and time! Unfortunately, I think lrandom is not the cause of crash. We're using lrandom with threads for couple of months on our other servers without any crash. I think lua in HAproxy is executed in a single thread so your analysis is correct but this assumption

Re: [2.0.17] crash with coredump

2020-09-25 Thread Kirill A. Korinsky
Good day, I'd like to share with you my two cents regarding this topic: >> lrandom (PRNG for lua, we're using it for 2 or 3 years without any >> problems, and soon we will drop it from our build) > > Never heard of this last one, not that it would make it suspicious at > all, just that it

Re: [2.0.17] crash with coredump

2020-09-17 Thread Willy Tarreau
On Thu, Sep 17, 2020 at 10:56:39AM +0200, Maciej Zdeb wrote: > Hi, > > Our config is quite complex and I'm trying to narrow it down. It is > occurring only on one production haproxy cluster (which consists of 6 > servers in each of two data centers) with significant load - crashes occur > on

Re: [2.0.17] crash with coredump

2020-09-17 Thread Maciej Zdeb
Hi, Our config is quite complex and I'm trying to narrow it down. It is occurring only on one production haproxy cluster (which consists of 6 servers in each of two data centers) with significant load - crashes occur on random servers so I would exclude memory corruption. I'm suspecting SPOE

Re: [2.0.17] crash with coredump

2020-09-17 Thread Willy Tarreau
Hi guys, On Thu, Sep 17, 2020 at 11:05:31AM +1000, Igor Cicimov wrote: (...) > > Coredump fragment from thread1: > > (gdb) bt > > #0 0x55cbbf6ed64b in h2s_notify_recv (h2s=0x7f65b8b55130) at > > src/mux_h2.c:783 So the code is this one: 777 static void __maybe_unused
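The listing quoted in this message is cut off here, so the following is only a toy model of a subscribe/notify pattern, not the code at src/mux_h2.c:783. All `demo_*` names are hypothetical. It shows why a non-NULL but garbage `recv_wait` pointer would fault at the first dereference inside a notify function, while a NULL one is harmless.

```c
#include <stddef.h>

#define DEMO_SUB_RECV 0x1   /* hypothetical "waiting for receive" flag */

/* Hypothetical subscriber record. */
struct demo_wait_event {
    int events;    /* bitmask of events still awaited */
    int wakeups;   /* stand-in for waking up a tasklet */
};

/* Hypothetical stream: recv_wait is NULL when nobody is subscribed. */
struct demo_h2s {
    struct demo_wait_event *recv_wait;
};

/* Wakes the subscriber, if any, and consumes the subscription. */
void demo_notify_recv(struct demo_h2s *h2s)
{
    struct demo_wait_event *sw = h2s->recv_wait;

    if (!sw)
        return;
    sw->events &= ~DEMO_SUB_RECV;   /* a garbage sw would fault here */
    sw->wakeups++;
    h2s->recv_wait = NULL;
}

/* Usage example: returns 1 if two notifications wake the subscriber
 * exactly once and leave the stream unsubscribed. */
int demo_notify_once(void)
{
    struct demo_wait_event ev = { DEMO_SUB_RECV, 0 };
    struct demo_h2s h2s = { &ev };

    demo_notify_recv(&h2s);
    demo_notify_recv(&h2s);   /* no-op: subscription already consumed */
    return ev.events == 0 && ev.wakeups == 1 && h2s.recv_wait == NULL;
}
```

The NULL check protects against "no subscriber" only; it cannot tell a valid pointer from an overwritten one, which is why corruption of this field shows up as a crash at the dereference.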

Re: [2.0.17] crash with coredump

2020-09-16 Thread Igor Cicimov
Hi Maciej, On Wed, Sep 16, 2020 at 9:00 PM Maciej Zdeb wrote: > Hi, > > Our HAProxy (2.0.14) started to crash, so first we upgraded to 2.0.17 but > it didn't help. Below you'll find traces from coredump > > Version: > HA-Proxy version 2.0.17 2020/07/31 - https://haproxy.org/ > Build options : >

[2.0.17] crash with coredump

2020-09-16 Thread Maciej Zdeb
Hi, Our HAProxy (2.0.14) started to crash, so first we upgraded to 2.0.17 but it didn't help. Below you'll find traces from coredump Version: HA-Proxy version 2.0.17 2020/07/31 - https://haproxy.org/ Build options : TARGET = linux-glibc CPU = generic CC = gcc CFLAGS = -O0 -g