Re: cygwin compilation error

2019-05-07 Thread Илья Шипицин
I messed up the commit message. One more try.

Wed, 8 May 2019 at 11:33, Илья Шипицин :

> small fix
>
> Wed, 8 May 2019 at 11:12, Willy Tarreau :
>
>> On Wed, May 08, 2019 at 11:09:04AM +0500, Илья Шипицин wrote:
>> > Wed, 8 May 2019 at 11:06, Willy Tarreau :
>> >
>> > > On Wed, May 08, 2019 at 10:59:20AM +0500, Илья Шипицин wrote:
>> > > > travis-ci supports windows builds.
>> > >
>> > > cool!
>> > >
>> >
>> > my current roadmap is
>> >
>> > 1) patch fixes SSL variants (already sent to list). without it we are
>> NOT
>> > building LibreSSL at all (i.e. we use default openssl-1.0.2 for all
>> builds)
>>
>> Pushed just now.
>>
>> > 2) BoringSSL
>> >
>> > 3) update gcc, clang, enable sanitizers
>> >
>> > 4) cygwin
>>
>> OK, sounds good.
>>
>> Thanks,
>> Willy
>>
>
From ad9961e92c692430272c9088a49759c889dac6f1 Mon Sep 17 00:00:00 2001
From: Ilya Shipitsin 
Date: Wed, 8 May 2019 11:32:02 +0500
Subject: [PATCH] BUILD: do not use "RAND_keep_random_devices_open" when
 building against LibreSSL

---
 src/haproxy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haproxy.c b/src/haproxy.c
index 4c371254..c8a8aaf0 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -590,7 +590,7 @@ void mworker_reload()
 		ptdf->fct();
 	if (fdtab)
 		deinit_pollers();
-#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L)
+#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) && !defined(LIBRESSL_VERSION_NUMBER)
 	if (global.ssl_used_frontend || global.ssl_used_backend)
 		/* close random device FDs */
 		RAND_keep_random_devices_open(0);
-- 
2.20.1
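
For context, the extra test is needed because LibreSSL defines
OPENSSL_VERSION_NUMBER as 0x20000000L, so a bare version check wrongly
selects the OpenSSL >= 1.1.1 code path even though LibreSSL does not
provide RAND_keep_random_devices_open(). A minimal sketch of the guard
(the HAVE_* macro name is only illustrative, not part of HAProxy):

/* Sketch only: exclude LibreSSL explicitly, since its reported version
 * number is higher than any real OpenSSL release. */
#include <openssl/opensslv.h>

#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) && \
    !defined(LIBRESSL_VERSION_NUMBER)
#define HAVE_RAND_KEEP_RANDOM_DEVICES_OPEN 1  /* illustrative macro */
#else
#define HAVE_RAND_KEEP_RANDOM_DEVICES_OPEN 0
#endif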



Re: cygwin compilation error

2019-05-07 Thread Илья Шипицин
small fix

Wed, 8 May 2019 at 11:12, Willy Tarreau :

> On Wed, May 08, 2019 at 11:09:04AM +0500, Илья Шипицин wrote:
> > Wed, 8 May 2019 at 11:06, Willy Tarreau :
> >
> > > On Wed, May 08, 2019 at 10:59:20AM +0500, Илья Шипицин wrote:
> > > > travis-ci supports windows builds.
> > >
> > > cool!
> > >
> >
> > my current roadmap is
> >
> > 1) patch fixes SSL variants (already sent to list). without it we are NOT
> > building LibreSSL at all (i.e. we use default openssl-1.0.2 for all
> builds)
>
> Pushed just now.
>
> > 2) BoringSSL
> >
> > 3) update gcc, clang, enable sanitizers
> >
> > 4) cygwin
>
> OK, sounds good.
>
> Thanks,
> Willy
>
From ad9961e92c692430272c9088a49759c889dac6f1 Mon Sep 17 00:00:00 2001
From: Ilya Shipitsin 
Date: Wed, 8 May 2019 11:32:02 +0500
Subject: [PATCH] BUILD: do not use && !defined LIBRESSL_VERSION_NUMBER) when
 building against LibreSSL

---
 src/haproxy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/haproxy.c b/src/haproxy.c
index 4c371254..c8a8aaf0 100644
--- a/src/haproxy.c
+++ b/src/haproxy.c
@@ -590,7 +590,7 @@ void mworker_reload()
 		ptdf->fct();
 	if (fdtab)
 		deinit_pollers();
-#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L)
+#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L) && !defined(LIBRESSL_VERSION_NUMBER)
 	if (global.ssl_used_frontend || global.ssl_used_backend)
 		/* close random device FDs */
 		RAND_keep_random_devices_open(0);
-- 
2.20.1



Re: [PATCH 1/1] BUILD: travis-ci bugfixes and improvements

2019-05-07 Thread Willy Tarreau
On Tue, May 07, 2019 at 01:42:43AM +0500, chipits...@gmail.com wrote:
> From: Ilya Shipitsin 
> 
> Call missing scripts/build-ssl.sh (which actually builds SSL variants)
> Enable OpenSSL, LibreSSL builds caching, it saves a bunch of time
> LibreSSL builds are not allowed to fail anymore
> Add openssl to osx builds

Merged, thanks!
Willy



Re: cygwin compilation error

2019-05-07 Thread Willy Tarreau
On Wed, May 08, 2019 at 11:09:04AM +0500, Илья Шипицин wrote:
> Wed, 8 May 2019 at 11:06, Willy Tarreau :
> 
> > On Wed, May 08, 2019 at 10:59:20AM +0500, Илья Шипицин wrote:
> > > travis-ci supports windows builds.
> >
> > cool!
> >
> 
> my current roadmap is
> 
> 1) patch fixes SSL variants (already sent to list). without it we are NOT
> building LibreSSL at all (i.e. we use default openssl-1.0.2 for all builds)

Pushed just now.

> 2) BoringSSL
> 
> 3) update gcc, clang, enable sanitizers
> 
> 4) cygwin

OK, sounds good.

Thanks,
Willy



Re: cygwin compilation error

2019-05-07 Thread Илья Шипицин
Wed, 8 May 2019 at 11:06, Willy Tarreau :

> On Wed, May 08, 2019 at 10:59:20AM +0500, Илья Шипицин wrote:
> > travis-ci supports windows builds.
>
> cool!
>

my current roadmap is

1) patch fixes SSL variants (already sent to list). without it we are NOT
building LibreSSL at all (i.e. we use default openssl-1.0.2 for all builds)

2) BoringSSL

3) update gcc, clang, enable sanitizers

4) cygwin


>
> > I will add such build a bit later (after
> > we settle with current travis-ci fixes)
>
> ...and this cygwin build issue :-)
>
> Willy
>


Re: cygwin compilation error

2019-05-07 Thread Willy Tarreau
On Wed, May 08, 2019 at 10:59:20AM +0500, Илья Шипицин wrote:
> travis-ci supports windows builds.

cool!

> I will add such build a bit later (after
> we settle with current travis-ci fixes)

...and this cygwin build issue :-)

Willy



Re: haproxy 2.0 docker images

2019-05-07 Thread Willy Tarreau
Hi Aleks,

On Mon, May 06, 2019 at 08:17:23AM +0200, Aleksandar Lazic wrote:
> > The outputs below raises some questions to me.
> >
> > * Should in the OPTIONS output also be the EXTRA_OBJS ?

That's a good question. I was hesitating but given that the goal is
to be able to easily rebuild a similar executable, maybe we should
add it indeed.

> > * Should PCRE2 be used instead of PCRE ?

No opinion :-)

> > * Should PRIVATE_CACHE be used in the default build?

No, because this one disables inter-process sharing of SSL sessions.

> > * Should SLZ be used in the default build?

It's just a matter of choice. I personally always build with it for
prod servers because it saves a huge amount of memory and some CPU,
but it also adds one extra dependency. I'd say that if it doesn't
require extra efforts it's worth it. If it adds some packaging burden
you can simply drop it and fall back to zlib.

> > * Does NS make sense in a container image?

I don't think so indeed, though it doesn't cost much to keep it, at
least so that you use the same build options everywhere.

> > * Can DEVICEATLAS 51DEGREES WURFL be used together?
> >  - From a technical point of view

From a technical point of view I don't see any obvious incompatibility.
However doing automated builds from all 3 of these might not always be
trivial as it will require that you can include these respective
libraries, some of which may only be downloaded after registering on
their site. Please don't ship an executable built with the dummy libs
since it will be useless and misleading (it's only useful for full-
featured builds).

> >  - From a license point of view

You have to carefully check. I believe at least one of them mentions
patents so this can even make the resulting executable look dangerous
for some users and make them stay away from your images. Anyway as
usual with anything related to licensing, the best advice I could give
you is to ask a lawyer :-/  This alone might be a valid reason for not
wasting too much time down this road.

Cheers,
Willy



Re: cygwin compilation error

2019-05-07 Thread Илья Шипицин
travis-ci supports windows builds. I will add such build a bit later (after
we settle with current travis-ci fixes)

Wed, 8 May 2019 at 10:52, Willy Tarreau :

> Hi,
>
> On Mon, May 06, 2019 at 12:54:47PM +0300, Gil Bahat wrote:
> > Hi,
> >
> > is cygwin still supported?
>
> Well, we never know :-)  I mean, we're always open to fixes to make it
> work as long as they don't impact other platforms.
>
> > the target seems to be present in the
> > Makefiles and I'd love to be able to use it. I'm running into what seems
> to
> > be a workable linker error:
> >
> > $ make TARGET=cygwin
> >   LD  haproxy
> > src/http_act.o:http_act.c:(.rdata+0x340): multiple definition of
> > `.weak.ist_uc.'
> > src/ev_poll.o:ev_poll.c:(.rdata+0x20): first defined here
>
> Aie that's really bad, it means the linker doesn't support weak symbols :-(
> Weak symbols are very handy as they are able to be included and linked in
> only once if they are used, and not linked if unused. The info I'm finding
> on the net suggests that symbols must be resolved at link time, which is the
> case here. So maybe it's just a matter of definition.
>
> I can suggest a few things to try in include/common/ist.h :
>
>   - replace "__weak__" with "weak" just in case it's different there
> (I don't even know why I marked it "__weak__", probably just by
> mimicry with "__attribute__" and because it worked)
>
>   - add "#pragma weak ist_lc" and "#pragma weak ist_uc" in ist.h,
> before the definitions
>
>   - add "extern const unsigned char ist_lc[256];" and
> "extern const unsigned char ist_uc[256];" before the definitions
>
> In case one of them is enough to work, we can merge them.
>
> Thanks,
> Willy
>
>


Re: cygwin compilation error

2019-05-07 Thread Willy Tarreau
Hi,

On Mon, May 06, 2019 at 12:54:47PM +0300, Gil Bahat wrote:
> Hi,
> 
> is cygwin still supported?

Well, we never know :-)  I mean, we're always open to fixes to make it
work as long as they don't impact other platforms.

> the target seems to be present in the
> Makefiles and I'd love to be able to use it. I'm running into what seems to
> be a workable linker error:
> 
> $ make TARGET=cygwin
>   LD  haproxy
> src/http_act.o:http_act.c:(.rdata+0x340): multiple definition of
> `.weak.ist_uc.'
> src/ev_poll.o:ev_poll.c:(.rdata+0x20): first defined here

Aie that's really bad, it means the linker doesn't support weak symbols :-(
Weak symbols are very handy as they are able to be included and linked in
only once if they are used, and not linked if unused. The info I'm finding
on the net suggests that symbols must be resolved at link time, which is the
case here. So maybe it's just a matter of definition.

I can suggest a few things to try in include/common/ist.h :

  - replace "__weak__" with "weak" just in case it's different there
(I don't even know why I marked it "__weak__", probably just by
mimicry with "__attribute__" and because it worked)

  - add "#pragma weak ist_lc" and "#pragma weak ist_uc" in ist.h,
before the definitions

  - add "extern const unsigned char ist_lc[256];" and
"extern const unsigned char ist_uc[256];" before the definitions

In case one of them is enough to work, we can merge them.

Thanks,
Willy
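
For reference, the three suggestions above as a rough C sketch (this is not
the actual content of include/common/ist.h; the real tables are 256-entry
case-conversion maps, elided to { 0 } here, and the VARIANT switch only
exists to show the alternatives side by side):

#define VARIANT 1

#if VARIANT == 1
/* 1) plain "weak" attribute instead of "__weak__" */
const unsigned char ist_lc[256] __attribute__((weak)) = { 0 };
const unsigned char ist_uc[256] __attribute__((weak)) = { 0 };

#elif VARIANT == 2
/* 2) pragma form, placed before the definitions */
#pragma weak ist_lc
#pragma weak ist_uc
const unsigned char ist_lc[256] __attribute__((__weak__)) = { 0 };
const unsigned char ist_uc[256] __attribute__((__weak__)) = { 0 };

#else
/* 3) extern declarations preceding the (weak) definitions */
extern const unsigned char ist_lc[256];
extern const unsigned char ist_uc[256];
const unsigned char ist_lc[256] __attribute__((__weak__)) = { 0 };
const unsigned char ist_uc[256] __attribute__((__weak__)) = { 0 };
#endif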



Re: haproxy-1.9 sanitizers finding

2019-05-07 Thread Willy Tarreau
Hi Ilya,

On Tue, May 07, 2019 at 11:47:54AM +0500, Илья Шипицин wrote:
> Hello,
> 
> when running regtests against 1.9 branch there are findings (not seen in
> master branch)
> 
> ***  h10.0
> debug|=
> ***  h10.0 debug|==16493==ERROR: AddressSanitizer: heap-use-after-free
> on address 0x61903c95 at pc 0x006ca207 bp 0x7ffd92124b60 sp
> 0x7ffd92124b50
> ***  h10.0 debug|WRITE of size 1 at 0x61903c95 thread T0
> ***  h10.0 debug|#0 0x6ca206 in update_log_hdr src/log.c:1260
> ***  h10.0 debug|#1 0x6ca206 in __send_log src/log.c:1445
> ***  h10.0 debug|#2 0x6ca48a in send_log src/log.c:1323
(...)

OK, these are the same ones you reported on master; they are fixed there
but not backported yet. The fix should eventually get backported ;-)

Thanks,
Willy



Re: systemd watchdog support?

2019-05-07 Thread Willy Tarreau
Hi guys,

On Tue, May 07, 2019 at 10:40:17PM +0200, William Lallemand wrote:
> Hi Patrick,
> 
> On Tue, May 07, 2019 at 02:23:15PM -0400, Patrick Hemmer wrote:
> > So with the prevalence of the issues lately where haproxy is going 
> > unresponsive and consuming 100% CPU, I wanted to see what thoughts were 
> > on implementing systemd watchdog functionality.

First, let me tell you I'm also all for a watchdog system. For me, an
unresponsive process is the worst thing that can ever happen because
it's the hardest one to detect and it takes time to fix. This is also
why I've been working on lockup detection for the worker processes,
able to produce some context info and possibly an analysable core dump.
I expect to have it for 2.0-final; this is important to accelerate the
discovery of such painful bugs and to fix them early if any remain.

> The master uses a special backend, invisible to the user, which contains 1
> server per worker; it uses the socketpair of the worker for the address. They
> are always connected and they can communicate. This architecture allows
> forwarding commands to the CLI of the worker.
> 
> One of my ideas was to do the equivalent of adding a "check" keyword for each
> of these server lines. We would have to implement a special check which would
> send a CLI command and wait for its response.
> 
> If one of the servers does not respond, we could execute the exit-on-failure
> procedure.

I'd like us to keep a trace of the failed process: send it a SIGXCPU or
SIGABRT, and kill the other ones cleanly.

> > The last idea would be to have the watchdog watch the master only, and 
> > the master watches the workers in turn. If a worker stops responding, 
> > the master would restart just that one worker.
> > 
> 
> Restarting only one worker is not a good idea; it's not possible with
> the current architecture and would be too complicated. In my opinion it's better
> to kill everything so systemd can restart properly with Restart=on-failure;
> this is what is done when one of the workers segfaults, for example.

I totally agree. In the past when nbproc was used a lot, we've had many
reports of people getting caught by one process dying once in a while,
till the point where there were not enough processes left to handle the
traffic, making the service barely responsive but still up. This gives
a terrible outside image of a hosted service, while a dead process would
be detected, failed over or restarted.

Cheers,
Willy



Re: [1.9 HEAD] HAProxy using 100% CPU

2019-05-07 Thread Willy Tarreau
Hi Maciej,

On Tue, May 07, 2019 at 07:08:47PM +0200, Maciej Zdeb wrote:
> Hi,
> 
> I've got another bug with 100% CPU on HAProxy process, it is built from
> HEAD of 1.9 branch.
> 
> One of processes stuck in infinite loop, admin socket is not responsive so
> I've got information only from gdb:
> 
> 0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589
> 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> (gdb) n
(...)

CCing Olivier. Olivier, I'm wondering if this is not directly related to
what you addressed with this fix merged in 2.0 but not backported :

  998410a ("BUG/MEDIUM: h2: Revamp the way send subscriptions works.")

From what I'm seeing there's no error, the stream is in the sending list,
there's no blocking flag, well, everything looks OK, but we're looping
on SUB_CALL_UNSUBSCRIBE, which apparently should not happen if I understand it
right. Do you think we should backport this patch ?

Remaining of the trace below for reference.

Thanks,
Willy

---

> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2586list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb)
> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> (gdb)
> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb)
> 2586list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb)
> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb) p h2c
> $1 = (struct h2c *) 0x2e8ff30
> (gdb) p *h2c
> $2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
> flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
> = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
> dfl = 0,
>   dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
> = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
> 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
>   timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
> nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
> streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
> 0x3093c18},
>   fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
> 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
> = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
> 0x0, events = 1}}
> (gdb) n
> 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> (gdb) p *h2s
> $3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m
> = {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next =
> 0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
> {0x2a72250,
>   0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx
> = 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode =
> H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size =
> 0,
> area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle
> = 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n
> = 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}}
> (gdb) n
> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb) p *h2c
> $4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
> flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
> = 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
> dfl = 0,
>   dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
> = 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
> 0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
>   timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
> nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
> streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
> 0x3093c18},
>   fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
> 0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
> = {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
> 0x0, events = 1}}
> (gdb) n
> 2586list_for_each_entry(h2s, &h2c->send_list, list) {
> (gdb) n
> 2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
> H2_CF_MUX_BLOCK_ANY)
> (gdb) n
> 2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
> 
> 
> HAProxy info:
> HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/
> Build options :
>   TARGET  = linux2628
>   CPU = generic
>   CC  = gcc
>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
> -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
> -W

Re: systemd watchdog support?

2019-05-07 Thread William Lallemand
Hi Patrick,

On Tue, May 07, 2019 at 02:23:15PM -0400, Patrick Hemmer wrote:
> So with the prevalence of the issues lately where haproxy is going 
> unresponsive and consuming 100% CPU, I wanted to see what thoughts were 
> on implementing systemd watchdog functionality.
> 
> In our case, haproxy going unresponsive is extremely problematic as our 
> clustering software (pacemaker+systemd) sees the service still running, 
> and doesn't realize it needs to restart the service or fail over.
> We could look into implementing some sort of custom check resource in 
> pacemaker, but before going down that route I wanted to explore the 
> systemd watchdog functionality.
> 
> The watchdog is implemented by periodically sending "WATCHDOG=1" on the 
> systemd notification socket. However there are a few different ways I 
> can see this being implemented.
> 
> We could put this in the master control process, but this only tells us 
> if the master is functioning, not the workers, which are what really matter.
> 
> So the next thought would be for all of the workers to listen on a 
> shared socket. The master would periodically send a request to that 
> socket, and as long as it gets a response, it pings the watchdog. This 
> tells us that there is at least one worker able to accept traffic.
> 
> > However if a frontend is bound to a specific worker, then that 
> > frontend would be non-responsive, and the watchdog wouldn't restart the 
> > service. For that the master would have to send a request to each worker 
> separately, and require a response from all of them before it pings the 
> watchdog. This would be better able to detect issues, but for some 
> people who aren't using any bound-to-process frontends, they would be 
> able to handle failure of a single worker and potentially schedule a 
> restart/reload at a less impactful time.
>

The master uses a special backend, invisible to the user, which contains 1
server per worker; it uses the socketpair of the worker for the address. They
are always connected and they can communicate. This architecture allows
forwarding commands to the CLI of the worker.

One of my ideas was to do the equivalent of adding a "check" keyword for each
of these server lines. We would have to implement a special check which would
send a CLI command and wait for its response.

If one of the servers does not respond, we could execute the exit-on-failure
procedure.

> 
> The last idea would be to have the watchdog watch the master only, and 
> the master watches the workers in turn. If a worker stops responding, 
> the master would restart just that one worker.
> 

Restarting only one worker is not a good idea; it's not possible with
the current architecture and would be too complicated. In my opinion it's better
to kill everything so systemd can restart properly with Restart=on-failure;
this is what is done when one of the workers segfaults, for example.

> 
> Any thoughts on the matter, or do we not want to do this, and rely on a 
> custom check in the cluster management software?
> 
> -Patrick
> 

-- 
William Lallemand



Re: [PATCH] wurfl device detection build fixes and dummy library

2019-05-07 Thread Aaron Park (Support)
--

Aaron Park, May 7, 15:30 EDT

Hi Willy,

I just wanted to check in to let you know that our engineers are continuing to 
build a patch to address the varying issues you are seeing. 

Because this build will take some time, unless you have any other questions, we 
will close out this ticket for now and will keep you informed of any updates we 
have for you.

Thanks,

Aaron


e: supp...@scientiamobile.com
ScientiaMobile Customer Support Team

--

Willy Tarreau, Apr 19, 10:49 EDT

Sorry, with the patches this time.

Willy

Attachment(s):
0007-WIP-wurfl-pass-fPIC-when-compiling.patch - 
https://support.scientiamobile.com/attachments/token/fAR7m6yvCQlVHA4dqp23I5sJS/?name=0007-WIP-wurfl-pass-fPIC-when-compiling.patch
0008-WIP-wurfl-fix-broken-symlinks.patch - 
https://support.scientiamobile.com/attachments/token/2Nrk5JLukGKWfCDj6rXLNFeik/?name=0008-WIP-wurfl-fix-broken-symlinks.patch
0009-WIP-wurfl-address-build-issues-by-doing-a-static-lib.patch - 
https://support.scientiamobile.com/attachments/token/mJXKrE2ecZHS7RjwXwTwbbQHi/?name=0009-WIP-wurfl-address-build-issues-by-doing-a-static-lib.patch
0010-WIP-wurfl-indicate-in-haproxy-vv-the-wurfl-version-i.patch - 
https://support.scientiamobile.com/attachments/token/Pc7IJxxno71IOlfHSByian8dp/?name=0010-WIP-wurfl-indicate-in-haproxy-vv-the-wurfl-version-i.patch
0011-WIP-wurfl-move-wurfl.h-into-wurfl-to-maintain-direct.patch - 
https://support.scientiamobile.com/attachments/token/qkvpDMK65KKVXVJsJyrNSxp7t/?name=0011-WIP-wurfl-move-wurfl.h-into-wurfl-to-maintain-direct.patch
0012-WIP-wurfl-mention-how-to-build-the-dummy-lib-in-the-.patch - 
https://support.scientiamobile.com/attachments/token/IWptQoEt33SpfGBPLJWNh5ZNL/?name=0012-WIP-wurfl-mention-how-to-build-the-dummy-lib-in-the-.patch
0013-WIP-wurfl-rename-makefile-to-Makefile.patch - 
https://support.scientiamobile.com/attachments/token/BD9188uXGpWCBfCxcungaYVoJ/?name=0013-WIP-wurfl-rename-makefile-to-Makefile.patch

--

Willy Tarreau, Apr 19, 10:46 EDT

Hi Paul,

On Thu, Apr 18, 2019 at 02:46:17PM +0200, Paul Stephen Borile wrote:
> please find attached to this email the 6 patches that cover various areas
> of restyling of
> the WURFL device detection feature for HAProxy. All patches can be back
> ported to 1.9 if necessary.
> Last patch is a dummy WURFL library that can be used to build/run haproxy
> compiled with the USE_WURFL option to make easier checking for any build
> problem in the future.
> We'll try to do the same and make sure that the module does not break
> builds again as happened in the past.

So I gave a look to this patch set and had to perform a few adjustments
to make it work but now it looks OK. I'm attaching the changes I made so
that you can review them, they're all related to the dummy lib in order
to 1) fix its build and 2) ease the testing without having to modify the
build environment (since adding non-standard stuff into /usr/include or
/usr/lib is a no-go on most development environments).

I figured that it was much simpler to build a ".a" from the file so that
it can naturally be loaded by the regular build process. I added the
ability to report the libwurfl version in "haproxy -vv" and when the
dummy lib is detected, it's explicitly mentioned "dummy library" there
so that you don't have to deal with false positives when users report
issues. I also added a little bit of doc explaining to haproxy devs
how to build with wurfl. This way I think it could be added by default
to any developer's build script so that it never breaks in the future.

I'm attaching my changes. I'm fine with retrofitting them into your
patches if they look OK to you. Please just let me know if you're OK to
go with this (and if you're OK with me backporting this to 1.9 so that
we can fix 1.9 once for all).

Thanks!
Willy

PS: I've CCed the contact address in the maintainers file just to verify
that there is no typo there, please confirm that it was properly
received.



Re: HAProxy 1.9.6 unresponsive

2019-05-07 Thread Willy Tarreau
Hi Patrick,

On Tue, May 07, 2019 at 02:01:33PM -0400, Patrick Hemmer wrote:
> Just in case it's useful, we had the issue recur today. However I gleaned a
> little more information from this recurrence. Provided below are several
> outputs from a gdb `bt full`. The important bit is that in the captures, the
> last frame which doesn't change between each capture is the `si_cs_send`
> function. The last stack capture provided has the shortest stack depth of
> all the captures, and is inside `h2_snd_buf`.

Thank you. At first glance this remains similar. Christopher and I have
been studying these issues intensely these days because they have deep
roots into some design choices and tradeoffs we've had to make and that
we're relying on, and we've come to conclusions about some long term
changes to address the causes, and some fixes for 1.9 that now appear
valid. We're still carefully reviewing our changes before pushing them.
Then I think we'll emit 1.9.8 anyway since it will already fix quite a
number of issues addressed since 1.9.7, so for you it will probably be
easier to try again.
 
> Otherwise the behavior is still the same as last time, with `strace`
> showing absolutely nothing, so it's still looping.

I'm not surprised. We managed to break that loop in a dirty way a first
time but it came with impacts (some random errors could be spewed depending
on the frame sizes, which is obviously not acceptable). But yes, this loop
has no way to give up. That's the second argument convincing me to finish
the watchdog so that at least it dies when this happens!

Expect some updates on this this week.

Cheers,
Willy



systemd watchdog support?

2019-05-07 Thread Patrick Hemmer
So with the prevalence of the issues lately where haproxy is going 
unresponsive and consuming 100% CPU, I wanted to see what thoughts were 
on implementing systemd watchdog functionality.


In our case, haproxy going unresponsive is extremely problematic as our 
clustering software (pacemaker+systemd) sees the service still running, 
and doesn't realize it needs to restart the service or fail over.
We could look into implementing some sort of custom check resource in 
pacemaker, but before going down that route I wanted to explore the 
systemd watchdog functionality.



The watchdog is implemented by periodically sending "WATCHDOG=1" on the 
systemd notification socket. However there are a few different ways I 
can see this being implemented.
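
For reference, a minimal sketch of the notification side using libsystemd's
sd_notify()/sd_watchdog_enabled() (watchdog_loop() and workers_are_healthy()
are illustrative placeholders, not existing HAProxy functions):

/* Minimal sketch of a systemd watchdog ping loop; the actual health
 * probe of the workers is whatever mechanism ends up being chosen. */
#include <stdint.h>
#include <unistd.h>
#include <systemd/sd-daemon.h>

static int workers_are_healthy(void) { return 1; }  /* placeholder probe */

static void watchdog_loop(void)
{
    uint64_t usec = 0;

    /* returns > 0 and fills usec when WatchdogSec= is set in the unit */
    if (sd_watchdog_enabled(0, &usec) <= 0)
        return;

    for (;;) {
        if (workers_are_healthy())
            sd_notify(0, "WATCHDOG=1");
        /* ping at half the configured interval, as usually recommended */
        usleep(usec / 2);
    }
}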


We could put this in the master control process, but this only tells us 
if the master is functioning, not the workers, which are what really matter.


So the next thought would be for all of the workers to listen on a 
shared socket. The master would periodically send a request to that 
socket, and as long as it gets a response, it pings the watchdog. This 
tells us that there is at least one worker able to accept traffic.


However if a frontend is bound to a specific worker, then that 
frontend would be non-responsive, and the watchdog wouldn't restart the 
service. For that the master would have to send a request to each worker 
separately, and require a response from all of them before it pings the 
watchdog. This would be better able to detect issues, but for some 
people who aren't using any bound-to-process frontends, they would be 
able to handle failure of a single worker and potentially schedule a 
restart/reload at a less impactful time.


The last idea would be to have the watchdog watch the master only, and 
the master watches the workers in turn. If a worker stops responding, 
the master would restart just that one worker.



Any thoughts on the matter, or do we not want to do this, and rely on a 
custom check in the cluster management software?


-Patrick



Re: HAProxy 1.9.6 unresponsive

2019-05-07 Thread Patrick Hemmer




*From:* Willy Tarreau [mailto:w...@1wt.eu]
*Sent:* Monday, May 6, 2019, 08:42 EDT
*To:* Patrick Hemmer 
*Cc:* haproxy@formilux.org
*Subject:* HAProxy 1.9.6 unresponsive


On Sun, May 05, 2019 at 09:40:02AM +0200, Willy Tarreau wrote:

With this said, after studying the code a little bit more, I'm seeing a
potential case where if we'd have a trailers entry in the HTX buffer but
no end of message, we could loop forever there not consuming this block.
I have no idea if this is possible in an HTX message, I'll ask Christopher
tomorrow. In any case we need to address this one way or another, possibly
reporting an error instead if required. Thus I'm postponing 1.9.8 for
tomorrow.

So the case is indeed possible and at the moment all we can do is try to
minimize the probability of producing it :-(  The issue is caused by the
moment we've received the end of trailers but not the end of the message.
From the H2 protocol perspective, if we've sent the END_STREAM flag, the
stream is closed, and a closed stream gets detached and cannot receive
new traffic, so at best we'll occasionally close too early and report
client failures at the upper layers while everything went OK. We cannot
send trailers without the END_STREAM flag since no frame may follow.
Abusing CONTINUATION is out of question here as this would require to
completely freeze the whole connection (including control frames) for
the time it takes to get this final EOM block. I thought about simply
reporting an error when we're in this situation between trailers and EOM
but it will mean that occasionally some chunked responses of sizes close
to 16N kB with trailers may err out, which is not acceptable either.

For 2.0 we approximately see what needs to be modified to address this
situation, but that will not be trivial and not backportable.

For 1.9 I'm still trying to figure what the "best" solution is. I may
finally end up marking the stream as closed as soon as we see the
trailers pushed down. I'm just unsure right now about all the possible
consequences and need to study the edge cases. Also I fear that this
will be something hard to unroll later, so I'm still studying.

Willy


Just in case it's useful, we had the issue recur today. However I 
gleaned a little more information from this recurrence. Provided below 
are several outputs from a gdb `bt full`. The important bit is that in 
the captures, the last frame which doesn't change between each capture 
is the `si_cs_send` function. The last stack capture provided has the 
shortest stack depth of all the captures, and is inside `h2_snd_buf`.


Otherwise the behavior is still the same as last time, with `strace` 
showing absolutely nothing, so it's still looping.





#0  h1_headers_to_hdr_list (start=0x7f5a4ea6b5fb "grpco\243?", 
stop=0x7f5a4ea6b5ff "o\243?", hdr=hdr@entry=0x7ffdc58f6400, 
hdr_num=hdr_num@entry=101, h1m=h1m@entry=0x7ffdc58f63d0, 
slp=slp@entry=0x0) at src/h1.c:793

    ret = 
    state = 
    ptr = 
    end = 
    hdr_count = 
    skip = 0
    sol = 
    col = 
    eol = 
    sov = 
    sl = 
    skip_update = 
    restarting = 
    n = 
    v = {ptr = 0x7f5a4eb51453 "LZ\177", len = 140025825685243}
#1  0x7f5a4d862539 in h2s_htx_make_trailers 
(h2s=h2s@entry=0x7f5a4ecc7860, htx=htx@entry=0x7f5a4ea67630) at 
src/mux_h2.c:4996
    list = {{n = {ptr = 0x0, len = 0}, v = {ptr = 0x0, len = 0}} 
}

    h2c = 0x7f5a4ec56610
    blk = 
    blk_end = 0x0
    outbuf = {size = 140025844274259, area = 0x7f5a4d996efb 
 
"\205\300~\aHc\320H\001SXH\205\355t\026Lc\310E1\300D\211\351L\211⾃", 
data = 16472, head = 140025845781936}
    h1m = {state = H1_MSG_HDR_NAME, flags = 2056, curr_len = 0, 
body_len = 0, next = 4, err_pos = 0, err_state = 1320431563}

    type = 
    ret = 
    hdr = 0
    idx = 5
    start = 
#2  0x7f5a4d866ef5 in h2_snd_buf (cs=0x7f5a4e9a8980, 
buf=0x7f5a4e777d78, count=4, flags=) at src/mux_h2.c:5372

    h2s = 
    orig_count = 
    total = 16291
    ret = 
    htx = 0x7f5a4ea67630
    blk = 
    btype = 
    idx = 
#3  0x7f5a4d8f4be4 in si_cs_send (cs=cs@entry=0x7f5a4e9a8980) at 
src/stream_interface.c:691

    send_flag = 
    conn = 0x7f5a4e86f4c0
    si = 0x7f5a4e777f98
    oc = 0x7f5a4e777d70
    ret = 
    did_send = 0
#4  0x7f5a4d8f6305 in si_cs_io_cb (t=, 
ctx=0x7f5a4e777f98, state=) at src/stream_interface.c:737

    si = 0x7f5a4e777f98
    cs = 0x7f5a4e9a8980
    ret = 0
#5  0x7f5a4d925f02 in process_runnable_tasks () at src/task.c:437
    t = 
    state = 
    ctx = 
    process = 
    t = 
    max_processed = 
#6  0x7f5a4d89f6ff in run_poll_loop () at src/haproxy.c:2642
    next = 
    exp = 
#7  run_thread_poll_loop (data=data@entry=0x7f5a4e62a9b0) at 
src/haproxy.c:

[1.9 HEAD] HAProxy using 100% CPU

2019-05-07 Thread Maciej Zdeb
Hi,

I've got another bug with 100% CPU on HAProxy process, it is built from
HEAD of 1.9 branch.

One of processes stuck in infinite loop, admin socket is not responsive so
I've got information only from gdb:

0x00484ab8 in h2_process_mux (h2c=0x2e8ff30) at src/mux_h2.c:2589
2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) n
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb)
2586list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb)
2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb)
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb)
2586list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb)
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb) p h2c
$1 = (struct h2c *) 0x2e8ff30
(gdb) p *h2c
$2 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
= 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
dfl = 0,
  dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
= 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
  timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
0x3093c18},
  fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
= {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
0x0, events = 1}}
(gdb) n
2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)
(gdb) p *h2s
$3 = {cs = 0x297bdb0, sess = 0x819580 , h2c = 0x2e8ff30, h1m
= {state = H1_MSG_RPBEFORE, flags = 12, curr_len = 0, body_len = 0, next =
0, err_pos = -1, err_state = 0}, by_id = {node = {branches = {b =
{0x2a72250,
  0x2961c30}}, node_p = 0x2a72251, leaf_p = 0x2961c31, bit = 1, pfx
= 49017}, key = 103}, id = 103, flags = 16385, mws = 6291456, errcode =
H2_ERR_NO_ERROR, st = H2_SS_HREM, status = 0, body_len = 0, rxbuf = {size =
0,
area = 0x0, data = 0, head = 0}, wait_event = {task = 0x2fb3ee0, handle
= 0x0, events = 0}, recv_wait = 0x2b8d700, send_wait = 0x2b8d700, list = {n
= 0x3130108, p = 0x2b02238}, sending_list = {n = 0x3130118, p = 0x2b02248}}
(gdb) n
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb) p *h2c
$4 = {conn = 0x2b4c900, st0 = H2_CS_FRAME_H, errcode = H2_ERR_NO_ERROR,
flags = 0, streams_limit = 100, max_id = 149, rcvd_c = 0, rcvd_s = 0, ddht
= 0x34099c0, dbuf = {size = 0, area = 0x0, data = 0, head = 0}, dsi = 149,
dfl = 0,
  dft = 1 '\001', dff = 37 '%', dpl = 0 '\000', last_sid = -1, mbuf = {size
= 16384, area = 0x2ec3d50 "", data = 0, head = 0}, msi = -1, mfl = 0, mft =
0 '\000', mff = 0 '\000', miw = 6291456, mws = 15443076, mfs = 16384,
  timeout = 2, shut_timeout = 2, nb_streams = 53, nb_cs = 53,
nb_reserved = 0, stream_cnt = 75, proxy = 0x219ffe0, task = 0x34081d0,
streams_by_id = {b = {0x2adc2e1, 0x0}}, send_list = {n = 0x2ac5b38, p =
0x3093c18},
  fctl_list = {n = 0x2e90008, p = 0x2e90008}, sending_list = {n =
0x2ac5b48, p = 0x2ec2798}, buf_wait = {target = 0x0, wakeup_cb = 0x0, list
= {n = 0x2e90038, p = 0x2e90038}}, wait_event = {task = 0x2b2ae90, handle =
0x0, events = 1}}
(gdb) n
2586list_for_each_entry(h2s, &h2c->send_list, list) {
(gdb) n
2587if (h2c->st0 >= H2_CS_ERROR || h2c->flags &
H2_CF_MUX_BLOCK_ANY)
(gdb) n
2589if (h2s->send_wait->events & SUB_CALL_UNSUBSCRIBE)


HAProxy info:
HA-Proxy version 1.9.7-207ba5a 2019/05/05 - https://haproxy.org/
Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter
-Wno-old-style-declaration -Wno-ignored-qualifiers -Wno-clobbered
-Wno-missing-field-initializers -Wtype-limits -DIP_BIND_ADDRESS_NO_PORT=24
  OPTIONS = USE_GETADDRINFO=1 USE_ZLIB=1 USE_REGPARM=1 USE_DL=1
USE_OPENSSL=1 USE_LUA=1 USE_PCRE=1 USE_PCRE_JIT=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
Running on OpenSSL version : OpenSSL 1.1.1b  26 Feb 2019
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2 TLSv1.3
Built with Lua version : Lua 5.3.5
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported 

Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
On 5/7/19 3:35 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/7/19 1:53 PM, Emeric Brun wrote:
>> On 5/7/19 1:24 PM, Marcin Deranek wrote:
>>> Hi Emeric,
>>>
>>> On 5/7/19 11:44 AM, Emeric Brun wrote:
 Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see 
 attachment for end result). Unfortunately after applying the patch there 
 is no change in behavior: we still leak /dev/usdm_drv descriptors and have 
 "stuck" HAProxy instances after reload..
>>> Regards,
>>
>>

 Could you perform a test recompiling the usdm_drv and the engine with this 
 patch, it applies on QAT 1.7 but I've no hardware to test this version 
 here.

 It should fix the fd leak.
>>>
>>> It did fix fd leak:
>>>
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv
>>>
>>> # systemctl reload haproxy.service
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv
>>>
>>> # systemctl reload haproxy.service
>>> # ls -al /proc/2565/fd|fgrep dev
>>> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
>>> lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv
>>>
>>> But there are still stuck processes :-( This is with both patches included: 
>>> for QAT and HAProxy.
>>> Regards,
>>>
>>> Marcin Deranek
>>
>> Thank you Marcin! Anyway it was also a bug.
>>
>> Could you run a 'show fd' command on a stuck process after adding the
>> attached patch.
> 
> I did apply this patch and all previous patches (QAT + HAProxy 
> ssl_free_engine). This is what I got after 1st reload:
> 
> show proc
> # <PID> <type> <relative PID> <reloads> <uptime>
> 8025    master  0   1   0d 00h03m25s
> # workers
> 31269   worker  1   0   0d 00h00m39s
> 31270   worker  2   0   0d 00h00m39s
> 31271   worker  3   0   0d 00h00m39s
> 31272   worker  4   0   0d 00h00m39s
> # old workers
> 9286    worker  [was: 1]    1   0d 00h03m25s
> 9287    worker  [was: 2]    1   0d 00h03m25s
> 9288    worker  [was: 3]    1   0d 00h03m25s
> 9289    worker  [was: 4]    1   0d 00h03m25s
> 
> @!9286 show fd
>  13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 owner=0x23eaae0 
> iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0
>  16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x4e1ab0 
> iocb=0x4e1ab0(thread_sync_io_handler) tmask=0x umask=0x0
>  20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1601b840 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>  21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 owner=0x1f0ec4f0 
> iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x00241300 fe=GLOBAL 
> mux=PASS mux_ctx=0x22ad8630
>    1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bab1f30 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x247e5bc0 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x18883650 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x14476c10 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11a27850 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x12008230 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1bb0a570 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c94790 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1449e050 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x1f00c150 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x15f40550 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x124b6340 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11fe4500 
> iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
>    1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 owner=0x11c70a60 
> iocb=0x4f4d50(

Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

Hi Emeric,

On 5/7/19 1:53 PM, Emeric Brun wrote:

On 5/7/19 1:24 PM, Marcin Deranek wrote:

Hi Emeric,

On 5/7/19 11:44 AM, Emeric Brun wrote:

Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end 
result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv 
descriptors and have "stuck" HAProxy instances after reload..

Regards,





Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


It did fix fd leak:

# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv

But there are still stuck processes :-( This is with both patches included: for 
QAT and HAProxy.
Regards,

Marcin Deranek


Thank you Marcin! Anyway it was also a bug.

Could you run a 'show fd' command on a stuck process after adding the
attached patch.


I did apply this patch and all previous patches (QAT + HAProxy 
ssl_free_engine). This is what I got after 1st reload:


show proc
# <PID> <type> <relative PID> <reloads> <uptime>
8025master  0   1   0d 00h03m25s
# workers
31269   worker  1   0   0d 00h00m39s
31270   worker  2   0   0d 00h00m39s
31271   worker  3   0   0d 00h00m39s
31272   worker  4   0   0d 00h00m39s
# old workers
9286worker  [was: 1]1   0d 00h03m25s
9287worker  [was: 2]1   0d 00h03m25s
9288worker  [was: 3]1   0d 00h03m25s
9289worker  [was: 4]1   0d 00h03m25s

@!9286 show fd
 13 : st=0x05(R:PrA W:pra) ev=0x01(heopI) [lc] cache=0 
owner=0x23eaae0 iocb=0x4877c0(mworker_accept_wrapper) tmask=0x1 umask=0x0
 16 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x4e1ab0 iocb=0x4e1ab0(thread_sync_io_handler) 
tmask=0x umask=0x0
 20 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1601b840 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
 21 : st=0x22(R:pRa W:pRa) ev=0x00(heopi) [lc] cache=0 
owner=0x1f0ec4f0 iocb=0x4ce6e0(conn_fd_handler) tmask=0x1 umask=0x0 
cflg=0x00241300 fe=GLOBAL mux=PASS mux_ctx=0x22ad8630
   1412 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1bab1f30 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1413 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x247e5bc0 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1414 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x18883650 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1415 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x14476c10 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1416 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11a27850 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1418 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x12008230 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1419 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1bb0a570 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1420 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11c94790 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1421 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1449e050 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1422 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1f00c150 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1423 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x15f40550 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1424 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x124b6340 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1425 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11fe4500 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1426 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x11c70a60 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1427 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x12572540 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1428 : st=0x05(R:PrA W:pra) ev=0x00(heopi) [lc] cache=0 
owner=0x1249a420 iocb=0x4f4d50(ssl_async_fd_free) tmask=0x1 umask=0x0
   1430 : st=0x05(R:PrA W:p

Re: leak of handle to /dev/urandom since 1.8?

2019-05-07 Thread William Lallemand
On Fri, May 03, 2019 at 10:49:54AM +, Robert Allen1 wrote:
> For the sake of the list, the patch now looks like:
> 
> +#if defined(USE_OPENSSL) && (OPENSSL_VERSION_NUMBER >= 0x10101000L)
> +   if (global.ssl_used_frontend || global.ssl_used_backend)
> +   /* close random device FDs */
> +   RAND_keep_random_devices_open(0);
> +#endif
> 
> and requests a backport to 1.8 and 1.9 where we noticed this issue (and 
> which
> include the re-exec for reload code, if I followed its history 
> thoroughly).
> 
> Rob
> 

I pushed the patch in master, thanks.

-- 
William Lallemand



Re: [PATCH v2 1/2] MINOR: systemd: Use the variables from /etc/default/haproxy

2019-05-07 Thread William Lallemand
On Mon, May 06, 2019 at 04:07:47PM +0200, William Lallemand wrote:
> On Mon, May 06, 2019 at 02:20:32PM +0200, Vincent Bernat wrote:
> > However, many people prefer /etc/default and /etc/sysconfig to systemd
> > overrides. And for distribution, it enables a smoother transition. For
> > Debian, we would still add the EnvironmentFile directive. You could
> > still be compatible with both styles of distribution with:
> > 
> > EnvironmentFile=-/etc/default/haproxy
> > EnvironmentFile=-/etc/sysconfig/haproxy
> 
> Oh that's right, I forgot that the - was checking if the file exists, looks 
> like
> a good solution.
> 

Just pushed in master the 2 previous patches + a patch which add 
/etc/sysconfig/haproxy.

Thanks everyone.

-- 
William Lallemand



Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
On 5/7/19 1:24 PM, Marcin Deranek wrote:
> Hi Emeric,
> 
> On 5/7/19 11:44 AM, Emeric Brun wrote:
>> Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see 
>> attachment for end result). Unfortunately after applying the patch there is 
>> no change in behavior: we still leak /dev/usdm_drv descriptors and have 
>> "stuck" HAProxy instances after reload..
> Regards,


>>
>> Could you perform a test recompiling the usdm_drv and the engine with this 
>> patch, it applies on QAT 1.7 but I've no hardware to test this version here.
>>
>> It should fix the fd leak.
> 
> It did fix fd leak:
> 
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv
> 
> # systemctl reload haproxy.service
> # ls -al /proc/2565/fd|fgrep dev
> lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
> lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv
> 
> But there are still stuck processes :-( This is with both patches included: 
> for QAT and HAProxy.
> Regards,
> 
> Marcin Deranek

Thank you Marcin! Anyway it was also a bug.

Could you run a 'show fd' command on a stuck process after adding the
attached patch.

R,
Emeric

From d0e095c2aa54f020de8fc50db867eff1ef73350e Mon Sep 17 00:00:00 2001
From: Emeric Brun 
Date: Fri, 19 Apr 2019 17:15:28 +0200
Subject: [PATCH] MINOR: ssl/cli: async fd io-handlers printable on show fd

This patch exports the async fd iohandlers and make them printable
doing a 'show fd' on cli.
---
 include/proto/ssl_sock.h | 4 ++++
 src/cli.c                | 9 +++++++++
 src/ssl_sock.c           | 4 ++--
 3 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/proto/ssl_sock.h b/include/proto/ssl_sock.h
index 62ebcb87..ce52fb74 100644
--- a/include/proto/ssl_sock.h
+++ b/include/proto/ssl_sock.h
@@ -85,6 +85,10 @@ SSL_CTX *ssl_sock_get_generated_cert(unsigned int key, struct bind_conf *bind_co
 int ssl_sock_set_generated_cert(SSL_CTX *ctx, unsigned int key, struct bind_conf *bind_conf);
 unsigned int ssl_sock_generated_cert_key(const void *data, size_t len);
 
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+void ssl_async_fd_handler(int fd);
+void ssl_async_fd_free(int fd);
+#endif
 
 /* ssl shctx macro */
 
diff --git a/src/cli.c b/src/cli.c
index 568ceba2..843c3d04 100644
--- a/src/cli.c
+++ b/src/cli.c
@@ -69,6 +69,9 @@
 #include 
 #include 
 #include 
+#ifdef USE_OPENSSL
+#include <proto/ssl_sock.h>
+#endif
 
 #define PAYLOAD_PATTERN "<<"
 
@@ -998,6 +1001,12 @@ static int cli_io_handler_show_fd(struct appctx *appctx)
 			 (fdt.iocb == listener_accept)  ? "listener_accept" :
 			 (fdt.iocb == poller_pipe_io_handler) ? "poller_pipe_io_handler" :
 			 (fdt.iocb == mworker_accept_wrapper) ? "mworker_accept_wrapper" :
+#ifdef USE_OPENSSL
+#if (OPENSSL_VERSION_NUMBER >= 0x1010000fL) && !defined(OPENSSL_NO_ASYNC)
+			 (fdt.iocb == ssl_async_fd_free) ? "ssl_async_fd_free" :
+			 (fdt.iocb == ssl_async_fd_handler) ? "ssl_async_fd_handler" :
+#endif
+#endif
 			 "unknown");
 
 		if (fdt.iocb == conn_fd_handler) {
diff --git a/src/ssl_sock.c b/src/ssl_sock.c
index 112520c8..58ae8a26 100644
--- a/src/ssl_sock.c
+++ b/src/ssl_sock.c
@@ -573,7 +573,7 @@ fail_get:
 /*
  * openssl async fd handler
  */
-static void ssl_async_fd_handler(int fd)
+void ssl_async_fd_handler(int fd)
 {
 	struct connection *conn = fdtab[fd].owner;
 
@@ -594,7 +594,7 @@ static void ssl_async_fd_handler(int fd)
 /*
  * openssl async delayed SSL_free handler
  */
-static void ssl_async_fd_free(int fd)
+void ssl_async_fd_free(int fd)
 {
 	SSL *ssl = fdtab[fd].owner;
 	OSSL_ASYNC_FD all_fd[32];
-- 
2.17.1



Re: [External] Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

Hi Emeric,

On 5/7/19 11:44 AM, Emeric Brun wrote:

Hi Marcin,>> As I use HAProxy 1.8 I had to adjust the patch (see attachment for end 
result). Unfortunately after applying the patch there is no change in behavior: we still leak /dev/usdm_drv 
descriptors and have "stuck" HAProxy instances after reload..

Regards,





Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


It did fix fd leak:

# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 7 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 8 -> /dev/usdm_drv

# systemctl reload haproxy.service
# ls -al /proc/2565/fd|fgrep dev
lr-x-- 1 root root 64 May  7 13:15 0 -> /dev/null
lrwx-- 1 root root 64 May  7 13:15 9 -> /dev/usdm_drv

But there are still stuck processes :-( This is with both patches 
included: for QAT and HAProxy.

Regards,

Marcin Deranek



Re: QAT intermittent healthcheck errors

2019-05-07 Thread Marcin Deranek

On 5/7/19 11:44 AM, Emeric Brun wrote:


Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.


Will do and report back.

Marcin Deranek
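
For context, the leaked descriptors come from the QAT user-space driver
opening its device nodes without O_CLOEXEC, so they survive the execve()
the haproxy master performs on each reload; the qae_open() changes in
Emeric's patch apply exactly this flag. A minimal illustration (the
function name here is just for the example):

#include <fcntl.h>

int open_qat_device(void)
{
    /* O_CLOEXEC marks the fd close-on-exec so it is not inherited
     * when the master process re-executes itself during a reload. */
    return open("/dev/usdm_drv", O_RDWR | O_CLOEXEC);
}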



Re: QAT intermittent healthcheck errors

2019-05-07 Thread Emeric Brun
Hi Marcin,

>> As I use HAProxy 1.8 I had to adjust the patch (see attachment
>> for end result). Unfortunately after applying the patch there is no change in
>> behavior: we still leak /dev/usdm_drv descriptors and have "stuck" HAProxy
>> instances after reload..
>>> Regards,
>>
>>

Could you perform a test recompiling the usdm_drv and the engine with this 
patch, it applies on QAT 1.7 but I've no hardware to test this version here.

It should fix the fd leak.

R,
Emeric
diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:15.654202291 +0200
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_hugepage_utils.c	2019-05-07 11:35:44.302292417 +0200
@@ -104,7 +104,7 @@
 /* standard page size */
 page_size = getpagesize();
 
-fd = qae_open("/proc/self/pagemap", O_RDONLY);
+fd = qae_open("/proc/self/pagemap", O_RDONLY|O_CLOEXEC);
 if (fd < 0)
 {
 return 0;
diff -urN quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c
--- quickassist.old/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-03-15 15:23:43.0 +0100
+++ quickassist/utilities/libusdm_drv/linux/user_space/qae_mem_utils.c	2019-05-07 11:24:08.755921241 +0200
@@ -745,7 +745,7 @@
 
 if (fd > 0)
 close(fd);
-fd = qae_open(QAE_MEM, O_RDWR);
+fd = qae_open(QAE_MEM, O_RDWR|O_CLOEXEC);
 if (fd < 0)
 {
 CMD_ERROR("%s:%d Unable to initialize memory file handle %s \n",