Re: Connections stuck in CLOSE_WAIT state with h2

2018-07-25 Thread Willy Tarreau
Hi Olivier,

On Wed, Jul 25, 2018 at 05:51:46PM +0200, Olivier Doucet wrote:
> It seems I have the same issue as Milan:
> We activated HTTP/2 in production a few weeks ago, and for some customers
> (not all!) we can observe a very strange behaviour: it seems some
> frontend sessions are never closed, leading to the session limit ('slim')
> being reached if HAProxy runs for several days without being reloaded.
(...)
> With the "flag" debug binary I decoded cflg, and all "lost" sessions are in
> this state:
> conn->flags = CO_FL_XPRT_TRACKED | CO_FL_CONNECTED | CO_FL_ADDR_FROM_SET |
> CO_FL_XPRT_READY | CO_FL_CTRL_READY
> 
> This issue is very close to Milan's bug, which is why I posted as a reply. If
> I'm wrong, I'll split it into another thread.

It's definitely the same as Milan's from my point of view. Many thanks
for reporting it as well.

> Willy, are your patches "production-safe" (meaning it is reasonable
> to run them for a few hours in production)? Can they be applied on the 1.8.12
> release, or do I need to download the latest trunk?

Yes, they are safe enough that I expect them to be backported for
1.8.13. I'm even reasonably confident that they should fix the problem.
You need to apply them on top of the latest 1.8 git maintenance version (in
which case you don't need the first one, which is already merged).

> I can reproduce the issue quickly (~2 hours to be sure) on my side to help!

That would be great!

Thanks!
Willy



Re: Connections stuck in CLOSE_WAIT state with h2

2018-07-25 Thread Olivier Doucet
Hello,

2018-07-25 10:20 GMT+02:00 Willy Tarreau :

> Hi Milan,
>
> On Wed, Jul 25, 2018 at 10:15:50AM +0200, Milan Petruzelka wrote:
> > Now I'll add both patches (WIP-h2 and h2-error.diff) and give it a try in
> > production.
>
> Thank you. At first I thought you still had the errors with them applied
> and was despairing; now I understand there's still hope :-)
>
> Cheers,
> Willy
>


It seems I have the same issue as Milan:
We activated HTTP/2 in production a few weeks ago, and for some customers
(not all!) we can observe a very strange behaviour: it seems some
frontend sessions are never closed, leading to the session limit ('slim')
being reached if HAProxy runs for several days without being reloaded.

What can be observed:
- on a specific frontend, scur keeps growing and growing.
- reloading haproxy (with the -sf parameter) clears the connected sessions.
- it happens only on specific frontends, but I failed to correlate the data
(we have the issue on 2 frontends, both with reasonable traffic, while some
frontends with much more traffic do not have the issue).
- disabling HTTP/2 solves the problem for these specific frontends, so this
is definitely HTTP/2 related.
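For reference, the graceful reload that clears these sessions can be sketched as follows (paths are examples only; adjust to your installation):

```shell
# Graceful reload: start a new haproxy process and pass -sf so the old
# process finishes serving its current sessions and then exits.
# Config and pid file paths below are examples, not the reporter's setup.
haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf "$(cat /var/run/haproxy.pid)"
```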

I ran several "show fd" and "show sess" commands on the haproxy process and
filtered the output by frontend name. The two show a different number of
lines, and the difference grows over time.

I am running HAProxy 1.8.12 with OpenSSL 1.1.0h compiled statically.

Here is one fd with the issue:
 58 : st=0x25(R:PrA W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x46ceeb0
iocb=0x530c20(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80201300
fe=:443 mux=H2 mux_ctx=0x45e27e0
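The comparison between "show fd" and "show sess" described above can be scripted against the admin socket. A minimal sketch, assuming a stats socket at /var/run/haproxy.sock and a frontend named my-frontend (both hypothetical names):

```shell
# Count "show fd" vs "show sess" entries belonging to one frontend via the
# admin socket. Socket path and frontend name are examples -- adjust them.
SOCK=/var/run/haproxy.sock
FE=my-frontend

fd_count=$(echo "show fd" | socat stdio "$SOCK" | grep -c "fe=$FE")
sess_count=$(echo "show sess" | socat stdio "$SOCK" | grep -c "fe=$FE")

echo "show fd:   $fd_count"
echo "show sess: $sess_count"
# A gap between the two counts that keeps growing matches the symptom
# described above: fds whose session is gone but which are never closed.
```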

With the "flag" debug binary I decoded cflg, and all "lost" sessions are in
this state:
conn->flags = CO_FL_XPRT_TRACKED | CO_FL_CONNECTED | CO_FL_ADDR_FROM_SET |
CO_FL_XPRT_READY | CO_FL_CTRL_READY

This issue is very close to Milan's bug, which is why I posted as a reply. If
I'm wrong, I'll split it into another thread.

Willy, are your patches "production-safe" (meaning it is reasonable
to run them for a few hours in production)? Can they be applied on the 1.8.12
release, or do I need to download the latest trunk?

I can reproduce the issue quickly (~2 hours to be sure) on my side to help!

Olivier


Duplicate haproxy processes after setting server to MAINT via stats page

2018-07-25 Thread Alessandro Gherardi
Hi,

Running haproxy 1.5 under Ubuntu Trusty as a service (service haproxy
start/stop), I noticed that sometimes (not always) when I set a server to
MAINT via the haproxy stats page, I end up with duplicate haproxy processes.

Any ideas? Has this problem been fixed in haproxy 1.8?

Thank you in advance,
Alessandro

Re: Configuring HAProxy session limits

2018-07-25 Thread Àbéjídé Àyodélé
Thanks for your response! It clarified a lot.


Re: Issue with TCP splicing

2018-07-25 Thread Olivier Houchard
On Wed, Jul 25, 2018 at 09:02:24AM -0400, Julien Semaan wrote:
> Hi Olivier,
> 
> Thanks for the time you're taking to check the issue.
> 
> I'll get an environment back with TCP splicing enabled and I'll run it in
> GDB and provide you a core dump
> 

That would be great, thank you !

Olivier



Re: Issue with TCP splicing

2018-07-25 Thread Julien Semaan

Hi Olivier,

Thanks for the time you're taking to check the issue.

I'll get an environment back with TCP splicing enabled and I'll run it 
in GDB and provide you a core dump


Best Regards,

--
Julien Semaan
jsem...@inverse.ca   ::  +1 (866) 353-6153 *155  ::www.inverse.ca
Inverse inc. :: Leaders behind SOGo (www.sogo.nu) and PacketFence 
(www.packetfence.org)



On 2018-07-25 08:59 AM, Olivier Houchard wrote:
> Hi Julien,
>
> On Tue, Jul 24, 2018 at 01:29:49PM -0400, Julien Semaan wrote:
>>> Sorry, that was a "can" that really meant "can't" :) I can't reproduce it.
>>     Aw well, I was surprised it was so easy :)
>
> yea, that would be too easy :)
>
>>> Can you try to upgrade to 1.8.12 ? A number of bugs have been fixed since
>>     I did try the upgrade to 1.8.12, got the same results (segfault)
>> although I wasn't able to confirm it did segfault in the TCP splicing.
>
>>> What kind of load do you have when it segfaults ?
>>     Far from enormous, maximum 10 requests per second, but as I said in my
>> first post, the amount of TCP retransmissions and resets is very large due
>> to the fact we're black-holing the traffic since we use haproxy for our
>> captive portal.
>>     I'd be happy to provide a pcap but for privacy reasons I can't extract
>> it from a production environment and I can't seem to replicate it in the lab.
>
> I understand you don't want to send that kind of data.
> I definitely can't seem to reproduce it, with a configuration very similar
> to yours, including using netfilter to drop random packets to try to match
> your setup as best as possible.
> I'm afraid unless we're able to reproduce it, or you at least get a core,
> it'll be very difficult to debug.
>
> Regards,
>
> Olivier




Re: Issue with TCP splicing

2018-07-25 Thread Olivier Houchard
Hi Julien,

On Tue, Jul 24, 2018 at 01:29:49PM -0400, Julien Semaan wrote:
> > Sorry, that was a "can" that really meant "can't" :) I can't reproduce it.
>     Aw well, I was surprised it was so easy :)
> 

yea, that would be too easy :)

> > Can you try to upgrade to 1.8.12 ? A number of bugs have been fixed since
>     I did try the upgrade to 1.8.12, got the same results (segfault)
> although I wasn't able to confirm it did segfault in the TCP splicing.
> 
> > What kind of load do you have when it segfaults ?
>     Far from enormous, maximum 10 requests per second, but as I said in my
> first post, the amount of TCP retransmissions and resets is very large due
> to the fact we're black-holing the traffic since we use haproxy for our
> captive portal
>     I'd be happy to provide a pcap but for privacy reasons I can't extract
> it from a production environment and I can't seem to replicate it in the lab.
> 

I understand you don't want to send that kind of data.
I definitely can't seem to reproduce it, with a configuration very similar
to yours, including using netfilter to drop random packets to try to match
your setup as best as possible.
I'm afraid unless we're able to reproduce it, or you at least get a core,
it'll be very difficult to debug.

Regards,

Olivier



force-persist and use_server combined

2018-07-25 Thread Veiko Kukk

Hi,

I'd like to understand whether I've made a mistake in my configuration or
whether there might be a bug in HAProxy 1.7.11.


The defaults section has "option redispatch".

backend load_balancer
  mode http
  option httplog
  option httpchk HEAD /load_balance_health HTTP/1.1\r\nHost:\ foo.bar
  balance url_param file_id
  hash-type consistent

  acl status0 path_beg -i /dl/
  acl status1 path_beg -i /haproxy
  use-server local_http_frontend if status0 or status1
  force-persist if status0 or status1

  server local_http_frontend /var/run/haproxy.sock.http-frontend check send-proxy

  server remote_http_frontend 192.168.1.52:8080 check send-proxy


The idea here is that the HAProxy statistics page, some other backend
statistics, and some remote health checks running against a path under
/dl/ would always reach only local_http_frontend and never go anywhere
else, even when the local server is really down, not just marked as down.

This config does not work: it forwards a /haproxy?stats request to
remote_http_frontend when local_http_frontend is really down.

Is this expected? Are there any ways to overcome this limitation?
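One possible workaround to sketch (with hypothetical frontend/backend names, not a confirmed fix): route these paths to a dedicated single-server backend from the frontend, so the routing decision no longer competes with server states inside the load-balanced backend:

```
# Hypothetical sketch only -- adapt names, bind lines and options to the
# real configuration above.
frontend fe_main
    mode http
    acl status0 path_beg -i /dl/
    acl status1 path_beg -i /haproxy
    use_backend be_local_only if status0 or status1
    default_backend load_balancer

backend be_local_only
    mode http
    # no health check here: the request is always attempted against the
    # local socket, even when the main backend considers that server down
    server local_http_frontend /var/run/haproxy.sock.http-frontend send-proxy
```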

Thanks in advance,
Veiko




Re: [PATCH] MINOR: ssl: BoringSSL matches OpenSSL 1.1.0

2018-07-25 Thread Emmanuel Hocdet
On 25 Jul 2018, at 10:34, Emmanuel Hocdet wrote:

> Hi Willy
>
>> On 24 Jul 2018, at 18:59, Willy Tarreau wrote:
>>
>> Hi Manu,
>>
>> On Mon, Jul 23, 2018 at 06:12:34PM +0200, Emmanuel Hocdet wrote:
>>> Hi Willy,
>>>
>>> This patch is necessary to build with current BoringSSL (SSL_SESSION is now opaque).
>>> BoringSSL correctly matches OpenSSL 1.1.0 since 3b2ff028 for haproxy needs.
>>> The patch reverts part of haproxy 019f9b10 (openssl-compat.h).
>>> This will not break openssl/libressl compat.
>>
>> OK, but the chunk here seems to contradict this assertion:
>>
>> @@ -119,13 +114,6 @@ static inline const OCSP_CERTID *OCSP_SINGLERESP_get0_id(const OCSP_SINGLERESP *
>> }
>> #endif
>>
>> -#endif
>> -
>> -#if (OPENSSL_VERSION_NUMBER < 0x101fL) || defined(LIBRESSL_VERSION_NUMBER)
>> -/*
>> - * Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL
>> - */
>> -
>> static inline pem_password_cb *SSL_CTX_get_default_passwd_cb(SSL_CTX *ctx)
>> {
>> 	return ctx->default_passwd_callback;
>>
>> I'm seeing that libressl will use a different code that is common
>> with openssl while you seem to have targeted boringssl only. Maybe
>> this part escaped from a larger patch that you used during development?
>
> It's OK because this function is inserted higher up in the patch.
>
> As said, it's only a revert of the 019f9b10 patches for openssl-compat.h.
> From:
> # Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL / BoringSSL
> # Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL
> To:
> # Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL

This patch is easier to read out of context:

0001-MINOR-ssl-BoringSSL-matches-OpenSSL-1.1.0.patch
Description: Binary data


Re: [PATCH] MINOR: ssl: BoringSSL matches OpenSSL 1.1.0

2018-07-25 Thread Emmanuel Hocdet
Hi Willy

> On 24 Jul 2018, at 18:59, Willy Tarreau wrote:
> 
> Hi Manu,
> 
> On Mon, Jul 23, 2018 at 06:12:34PM +0200, Emmanuel Hocdet wrote:
>> Hi Willy,
>> 
>> This patch is necessary to build with current BoringSSL (SSL_SESSION is now 
>> opaque).
>> BoringSSL correctly matches OpenSSL 1.1.0 since 3b2ff028 for haproxy needs.
>> The patch reverts part of haproxy 019f9b10 (openssl-compat.h).
>> This will not break openssl/libressl compat.
> 
> OK, but the chunk here seems to contradict this assertion:
> 
> 
> @@ -119,13 +114,6 @@ static inline const OCSP_CERTID 
> *OCSP_SINGLERESP_get0_id(const OCSP_SINGLERESP *
> }
> #endif
> 
> -#endif
> -
> -#if (OPENSSL_VERSION_NUMBER < 0x101fL) || 
> defined(LIBRESSL_VERSION_NUMBER)
> -/*
> - * Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL
> - */
> -
> static inline pem_password_cb *SSL_CTX_get_default_passwd_cb(SSL_CTX *ctx)
> {
>   return ctx->default_passwd_callback;
> 
> I'm seeing that libressl will use a different code that is common
> with openssl while you seem to have targeted boringssl only. Maybe
> this part escaped from a larger patch that you used during development?
> 

It's OK because this function is inserted higher up in the patch.

As said, it's only a revert of the 019f9b10 patches for openssl-compat.h.
From:
# Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL / BoringSSL
# Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL
To:
# Functions introduced in OpenSSL 1.1.0 and not yet present in LibreSSL

++
Manu




Re: Connections stuck in CLOSE_WAIT state with h2

2018-07-25 Thread Willy Tarreau
Hi Milan,

On Wed, Jul 25, 2018 at 10:15:50AM +0200, Milan Petruzelka wrote:
> Now I'll add both patches (WIP-h2 and h2-error.diff) and give it a try in
> production.

Thank you. At first I thought you still had the errors with them applied
and was despairing; now I understand there's still hope :-)

Cheers,
Willy



Re: Connections stuck in CLOSE_WAIT state with h2

2018-07-25 Thread Milan Petruželka
Hi Willy,

On Tue, 24 Jul 2018 at 14:37, Willy Tarreau  wrote:

> So I'm having one update to emit the missing info on "show fd" (patch merged
> and pushed already, that I'm attaching here if it's easier for you)

I've compiled version 1.8.12-12a4b5-16 from Git and let it run overnight.
Now I have 4 blocked connections; maxid is always equal to lastid:

 14 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x17c51f0
iocb=0x4d34d0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x17a5240 st0=7 err=5 maxid=133 lastid=133
flg=0x1000 nbst=6 nbcs=0 fctl_cnt=0 send_cnt=6 tree_cnt=6 orph_cnt=6
dbuf=0/0 mbuf=0/16384
 19 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x178fab0
iocb=0x4d34d0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1798be0 st0=7 err=5 maxid=131 lastid=131
flg=0x1000 nbst=12 nbcs=0 fctl_cnt=0 send_cnt=12 tree_cnt=12
orph_cnt=12 dbuf=0/0 mbuf=0/16384
 20 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x17bac40
iocb=0x4d34d0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x19e6220 st0=7 err=5 maxid=27 lastid=27
flg=0x1000 nbst=1 nbcs=0 fctl_cnt=0 send_cnt=1 tree_cnt=1 orph_cnt=1
dbuf=0/0 mbuf=0/16384
 22 : st=0x20(R:pra W:pRa) ev=0x00(heopi) [nlc] cache=0 owner=0x18d2410
iocb=0x4d34d0(conn_fd_handler) tmask=0x1 umask=0x0 cflg=0x80203300
fe=fe-http mux=H2 mux_ctx=0x1a0dab0 st0=7 err=5 maxid=107 lastid=107
flg=0x1000 nbst=5 nbcs=0 fctl_cnt=0 send_cnt=5 tree_cnt=5 orph_cnt=5
dbuf=0/0 mbuf=0/16384
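When scanning many of these entries, the interesting fields can be pulled out of a saved dump in one pass; a small sketch, assuming the "show fd" output was stored in a file named show_fd.txt (hypothetical name) with one fd per line, as the patched build prints it:

```shell
# Extract the err/maxid/lastid triple for every H2 mux entry in a saved
# "show fd" dump (file name is an example -- adjust to your capture).
grep 'mux=H2' show_fd.txt | grep -o 'err=[0-9]* maxid=[0-9]* lastid=[0-9]*'
```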



> And I *think* (and hope) that with these 2 patches on top of latest 1.8
> we're OK now. What I would appreciate quite a lot, if you're willing to
> let me abuse your time, is to either git pull or apply
> 0001-MINOR-h2-add-the-error-code-and-the-max-last-stream-.patch on top
> of your up-to-date branch, then apply
> 0001-WIP-h2-try-to-address-possible-causes-for-the-close_.patch, then
> apply h2-error.diff and test again.

Now I'll add both patches (WIP-h2 and h2-error.diff) and give it a try in
production.

> I think we're about to nail it down, to be honest ;-)

That would be great :-)
Milan