Re: Warning: upgrading to openssl master+ enable_tls1_3 (coming v1.1.1) could break handshakes for all protocol versions .

2018-01-12 Thread Gibson, Brian (IMS)
The way I read it, you just have to be sure to specify a valid TLS 1.3 cipher.
I have not tried the configuration myself to confirm, though.
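
For reference, the released OpenSSL 1.1.1 ended up splitting the two settings:
the classic cipher list only covers TLSv1.2 and below, while TLSv1.3 suites
have their own setter and their own names. Below is a minimal illustrative
sketch (not HAProxy code; the context setup and cipher strings are only
examples) of keeping a forced list while still leaving TLSv1.3 something valid
to negotiate:

/* Illustrative sketch, not HAProxy code: in the OpenSSL 1.1.1 API the
 * forced cipher list only applies to TLSv1.2 and below, while TLSv1.3
 * suites are configured separately, so the context keeps something
 * valid to offer for TLSv1.3 handshakes. */
#include <openssl/ssl.h>

static SSL_CTX *make_server_ctx(void)
{
    SSL_CTX *ctx = SSL_CTX_new(TLS_server_method());

    if (!ctx)
        return NULL;

    /* forced list for TLSv1.2 and below (example string) */
    SSL_CTX_set_cipher_list(ctx,
        "ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256");

    /* TLSv1.3 ciphersuites (example string); without a valid entry here
     * or the built-in default, TLSv1.3 has nothing to negotiate */
    SSL_CTX_set_ciphersuites(ctx,
        "TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384");

    return ctx;
}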

Sent from Nine

From: Pavlos Parissis 
Sent: Friday, January 12, 2018 4:55 PM
To: Emeric Brun; haproxy@formilux.org
Subject: Re: Warning: upgrading to openssl master+ enable_tls1_3 (coming 
v1.1.1) could break handshakes for all protocol versions .

On 12/01/2018 03:57 μμ, Emeric Brun wrote:
> Hi All,
>
> FYI: upgrading to the next openssl-1.1.1 could break your prod if you're
> using a forced cipher list, because the handshake will fail regardless of
> the TLS protocol version if you don't specify a cipher valid for TLSv1.3 in
> your cipher list.
>
> https://github.com/openssl/openssl/issues/5057
>
> https://github.com/openssl/openssl/issues/5065
>
> The OpenSSL team doesn't seem to consider this an issue, and I'm just tired
> of discussing it with them.
>
> R,
> Emeric
>


So, if we enable TLSv1.3 together with TLSv1.2 on the server side, then the
client must support TLSv1.3, otherwise it will get a nice SSL error. Am I
right? If I am right, and I hope I'm not, then we have to wait for all clients
to support TLSv1.3 before we enable it on the server side. That doesn't sound
right, and I am pretty sure I am completely wrong here.

Cheers,
Pavlos








Re: Warning: upgrading to openssl master+ enable_tls1_3 (coming v1.1.1) could break handshakes for all protocol versions .

2018-01-12 Thread Pavlos Parissis
On 12/01/2018 03:57 μμ, Emeric Brun wrote:
> Hi All,
> 
> FYI: upgrading to the next openssl-1.1.1 could break your prod if you're
> using a forced cipher list, because the handshake will fail regardless of
> the TLS protocol version if you don't specify a cipher valid for TLSv1.3 in
> your cipher list.
> 
> https://github.com/openssl/openssl/issues/5057
> 
> https://github.com/openssl/openssl/issues/5065
> 
> The OpenSSL team doesn't seem to consider this an issue, and I'm just tired
> of discussing it with them.
> 
> R,
> Emeric
> 


So, if we enable TLSv1.3 together with TLSv1.2 on the server side, then the
client must support TLSv1.3, otherwise it will get a nice SSL error. Am I
right? If I am right, and I hope I'm not, then we have to wait for all clients
to support TLSv1.3 before we enable it on the server side. That doesn't sound
right, and I am pretty sure I am completely wrong here.

Cheers,
Pavlos




signature.asc
Description: OpenPGP digital signature


Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 11:06:32AM -0600, Samuel Reed wrote:
> On 1.8-git, similar results on the new process:
> 
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  93.75    0.265450          15     17805           epoll_wait
>   4.85    0.013730          49       283           write
>   1.40    0.003960          15       266        12 recvfrom
>   0.01    0.000018           0        42        12 read
>   0.00    0.000000           0        28           close
>   0.00    0.000000           0        12           socket
>   0.00    0.000000           0        12        12 connect
>   0.00    0.000000           0        19         1 sendto
>   0.00    0.000000           0        12           sendmsg
>   0.00    0.000000           0         6           shutdown
>   0.00    0.000000           0        35           setsockopt
>   0.00    0.000000           0         7           getsockopt
>   0.00    0.000000           0        12           fcntl
>   0.00    0.000000           0        13           epoll_ctl
>   0.00    0.000000           0         2         2 accept4
> ------ ----------- ----------- --------- --------- ----------------
> 100.00    0.283158                 18554        39 total
> 
> A cursory look through the strace output shows the same thing, with the same
> three patterns as in the last email, including the cascade.

OK thank you for testing. On Monday we'll study this with Christopher.

Have a nice week-end!
Willy



Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Samuel Reed
On 1.8-git, similar results on the new process:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.75    0.265450          15     17805           epoll_wait
  4.85    0.013730          49       283           write
  1.40    0.003960          15       266        12 recvfrom
  0.01    0.000018           0        42        12 read
  0.00    0.000000           0        28           close
  0.00    0.000000           0        12           socket
  0.00    0.000000           0        12        12 connect
  0.00    0.000000           0        19         1 sendto
  0.00    0.000000           0        12           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        35           setsockopt
  0.00    0.000000           0         7           getsockopt
  0.00    0.000000           0        12           fcntl
  0.00    0.000000           0        13           epoll_ctl
  0.00    0.000000           0         2         2 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.283158                 18554        39 total

A cursory look through the strace output shows the same thing, with the same
three patterns as in the last email, including the cascade.


On 1/12/18 10:23 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 10:13:55AM -0600, Samuel Reed wrote:
>> Excellent! Please let me know if there's any other output you'd like
>> from this machine.
>>
>> Strace on that new process shows thousands of these types of syscalls,
>> which vary slightly,
>>
>> epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1
> If the u32 value barely varies, that's an uncaught event. We got a report
> for this that we just fixed yesterday; it started to appear after the system
> was upgraded with the Meltdown fixes. That seems unrelated, but reverting
> made the problem disappear.
>
>> and these:
>>
>> epoll_wait(3, {}, 200, 0)   = 0
> This one used to appear in yesterday's report though it could be caused
> by other bugs as well. That's the one I predicted.
>
>> There is also something of a cascade (each repeats about 10-20x before
>> the next):
>>
>> epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}}, 200, 65) = 2
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
>> u64=785}}}, 200, 65) = 4
>> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
>> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
>> {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5
>>
>> I've seen it go as deep as 15. The trace is absolutely dominated by these.
> OK, that's very interesting. Just in case, please update to the latest
> 1.8-git to see if it makes this issue disappear.
>
> Thanks,
> Willy




Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 10:13:55AM -0600, Samuel Reed wrote:
> Excellent! Please let me know if there's any other output you'd like
> from this machine.
> 
> Strace on that new process shows thousands of these types of syscalls,
> which vary slightly,
> 
> epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1

If the u32 value barely varies, that's an uncaught event. We got a report
for this that we just fixed yesterday; it started to appear after the system
was upgraded with the Meltdown fixes. That seems unrelated, but reverting
made the problem disappear.

> and these:
> 
> epoll_wait(3, {}, 200, 0)   = 0

This one used to appear in yesterday's report though it could be caused
by other bugs as well. That's the one I predicted.

> There is also something of a cascade (each repeats about 10-20x before
> the next):
> 
> epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}}, 200, 65) = 2
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
> u64=785}}}, 200, 65) = 4
> epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
> u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
> {EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5
> 
> I've seen it go as deep as 15. The trace is absolutely dominated by these.

OK, that's very interesting. Just in case, please update to the latest
1.8-git to see if it makes this issue disappear.

Thanks,
Willy



Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Samuel Reed
Excellent! Please let me know if there's any other output you'd like
from this machine.

Strace on that new process shows thousands of these types of syscalls,
which vary slightly,

epoll_wait(3, {{EPOLLIN, {u32=206, u64=206}}}, 200, 239) = 1

and these:

epoll_wait(3, {}, 200, 0)   = 0

There is also something of a cascade (each repeats about 10-20x before
the next):

epoll_wait(3, {{EPOLLIN, {u32=47, u64=47}}}, 200, 71) = 1
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}}, 200, 65) = 2
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}}, 200, 0) = 3
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785,
u64=785}}}, 200, 65) = 4
epoll_wait(3, {{EPOLLIN, {u32=93, u64=93}}, {EPOLLIN, {u32=656,
u64=656}}, {EPOLLIN, {u32=227, u64=227}}, {EPOLLIN, {u32=785, u64=785}},
{EPOLLIN, {u32=639, u64=639}}}, 200, 64) = 5

I've seen it go as deep as 15. The trace is absolutely dominated by these.


On 1/12/18 10:01 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:50:58AM -0600, Samuel Reed wrote:
>> To accelerate the process, I've increased the number of threads from 4
>> to 8 on a 16-core machine. Ran strace for about 5s on each.
>>
>> Single process (8 threads):
>>
>> $ strace -cp 16807
>> % time seconds  usecs/call calls    errors syscall
>> -- --- --- - - 
>>  71.36    0.330172  21 15479   epoll_wait
>>  13.59    0.062861   4 14477 1 write
>>  10.58    0.048955   4 11518    10 recvfrom
>>   4.44    0.020544  38   537   244 read
> This one is OK and shows that quite some time is in fact spent waiting
> for I/O events.
>
>> Two processes (2x8 threads):
>>
>> ## Draining process
>>
>> % time seconds  usecs/call calls    errors syscall
>> -- --- --- - - 
>>  48.65    0.544758  30 18359   epoll_wait
>>  28.69    0.321283  14 23540   write
>>  22.60    0.253049  19 13338   recvfrom
>>   0.04    0.000474   1   786   374 read
>>   0.03    0.000287   2   187   sendto
> This one as well.
>
>> ## "New" process
>>
>> % time seconds  usecs/call calls    errors syscall
>> -- --- --- - - 
>>  93.87    1.588239  11    149253   epoll_wait
>>   3.84    0.064985  30  2140    31 recvfrom
>>   1.77    0.029905  13  2388   write
>>   0.34    0.005737  10   589   130 read
>>   0.12    0.002018  38    53   close
>>   0.06    0.000960   8   114 2 sendto
> This one is very interesting! So the epoll_wait to other syscalls ratio
> went from roughly 1/2 to 30/1. I'm pretty sure that a regular strace would
> show you a large number of epoll_wait(0)=0 indicating we're missing some
> events. I seem to remember that sometimes there are situations where a
> thread may be notified by epoll() about an fd it cannot take care of but
> I don't remember in which case, I'll have to discuss with Christopher.
>
> But at least now we have an explanation, and it's not directly related to
> thread contention but more likely to the mapping of FDs to threads, so
> we may have opportunities to improve the situation here.
>
> Thanks!
> Willy




Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 09:50:58AM -0600, Samuel Reed wrote:
> To accelerate the process, I've increased the number of threads from 4
> to 8 on a 16-core machine. Ran strace for about 5s on each.
> 
> Single process (8 threads):
> 
> $ strace -cp 16807
> % time seconds  usecs/call calls    errors syscall
> -- --- --- - - 
>  71.36    0.330172  21 15479   epoll_wait
>  13.59    0.062861   4 14477 1 write
>  10.58    0.048955   4 11518    10 recvfrom
>   4.44    0.020544  38   537   244 read

This one is OK and shows that quite some time is in fact spent waiting
for I/O events.

> Two processes (2x8 threads):
> 
> ## Draining process
> 
> % time seconds  usecs/call calls    errors syscall
> -- --- --- - - 
>  48.65    0.544758  30 18359   epoll_wait
>  28.69    0.321283  14 23540   write
>  22.60    0.253049  19 13338   recvfrom
>   0.04    0.000474   1   786   374 read
>   0.03    0.000287   2   187   sendto

This one as well.

> ## "New" process
> 
> % time seconds  usecs/call calls    errors syscall
> -- --- --- - - 
>  93.87    1.588239  11    149253   epoll_wait
>   3.84    0.064985  30  2140    31 recvfrom
>   1.77    0.029905  13  2388   write
>   0.34    0.005737  10   589   130 read
>   0.12    0.002018  38    53   close
>   0.06    0.000960   8   114 2 sendto

This one is very interesting! So the epoll_wait to other syscalls ratio
went from roughly 1/2 to 30/1. I'm pretty sure that a regular strace would
show you a large number of epoll_wait(0)=0 indicating we're missing some
events. I seem to remember that sometimes there are situations where a
thread may be notified by epoll() about an fd it cannot take care of but
I don't remember in which case, I'll have to discuss with Christopher.
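
As an illustration of that failure mode (a generic sketch, not HAProxy code):
with level-triggered epoll, an fd that is reported ready but never drained by
the thread that woke up keeps being reported on every call, which yields
exactly this kind of epoll_wait-dominated trace:

/* Generic sketch, not HAProxy code: an fd left undrained with
 * level-triggered epoll is reported ready again on every epoll_wait()
 * call, so the loop below keeps waking up without doing any I/O. */
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

int main(void)
{
    int ep = epoll_create1(0);
    int pfd[2];
    struct epoll_event ev = { .events = EPOLLIN };  /* level-triggered by default */
    struct epoll_event out;

    if (ep < 0 || pipe(pfd) < 0)
        return 1;
    ev.data.fd = pfd[0];
    epoll_ctl(ep, EPOLL_CTL_ADD, pfd[0], &ev);
    if (write(pfd[1], "x", 1) < 0)          /* make the fd readable once */
        return 1;

    for (int i = 0; i < 5; i++) {
        int n = epoll_wait(ep, &out, 1, 100);
        /* the data is never read, e.g. because this thread considers the
         * fd belongs to another one, so readiness never goes away */
        printf("wakeup %d: %d event(s)\n", i, n);
    }
    close(pfd[0]); close(pfd[1]); close(ep);
    return 0;
}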

But at least now we have an explanation, and it's not directly related to
thread contention but more likely to the mapping of FDs to threads, so
we may have opportunities to improve the situation here.

Thanks!
Willy



Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Samuel Reed
To accelerate the process, I've increased the number of threads from 4
to 8 on a 16-core machine. Ran strace for about 5s on each.

Single process (8 threads):

$ strace -cp 16807
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 71.36    0.330172          21     15479           epoll_wait
 13.59    0.062861           4     14477         1 write
 10.58    0.048955           4     11518        10 recvfrom
  4.44    0.020544          38       537       244 read
  0.02    0.000094           1       135           sendto
  0.01    0.000051           3        16           epoll_ctl
  0.00    0.000000           0        24           close
  0.00    0.000000           0         9           socket
  0.00    0.000000           0         9         9 connect
  0.00    0.000000           0        25           sendmsg
  0.00    0.000000           0         6           shutdown
  0.00    0.000000           0        24           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0         9           fcntl
  0.00    0.000000           0         8         7 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.462677                 42279       271 total


Two processes (2x8 threads):

## Draining process

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 48.65    0.544758          30     18359           epoll_wait
 28.69    0.321283          14     23540           write
 22.60    0.253049          19     13338           recvfrom
  0.04    0.000474           1       786       374 read
  0.03    0.000287           2       187           sendto
  0.00    0.000000           0         2           close
  0.00    0.000000           0         1           sendmsg
  0.00    0.000000           0         1           shutdown
  0.00    0.000000           0         1           setsockopt
------ ----------- ----------- --------- --------- ----------------
100.00    1.119851                 56215       374 total

## "New" process

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 93.87    1.588239          11    149253           epoll_wait
  3.84    0.064985          30      2140        31 recvfrom
  1.77    0.029905          13      2388           write
  0.34    0.005737          10       589       130 read
  0.12    0.002018          38        53           close
  0.06    0.000960           8       114         2 sendto
  0.00    0.000031           1        25           shutdown
  0.00    0.000019           0       102           sendmsg
  0.00    0.000019           0        58           epoll_ctl
  0.00    0.000015           0        31           fcntl
  0.00    0.000000           0        31           socket
  0.00    0.000000           0        31        31 connect
  0.00    0.000000           0        94           setsockopt
  0.00    0.000000           0         8           getsockopt
  0.00    0.000000           0        47        29 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    1.691928                154964       223 total


It does indeed appear the new process is contending with the old, even
with just 16 threads on a 16-core box. A third process (oversubscribed
to 24 threads on 16 cores):

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 97.21    0.950863          14     69926           epoll_wait
  2.32    0.022677         208       109           write
  0.42    0.004106          48        85        14 recvfrom
  0.04    0.000439          17        26           close
  0.00    0.000022           1        34           epoll_ctl
  0.00    0.000000           0       136        33 read
  0.00    0.000000           0         1           brk
  0.00    0.000000           0        14           socket
  0.00    0.000000           0        14        14 connect
  0.00    0.000000           0        15         1 sendto
  0.00    0.000000           0        11           sendmsg
  0.00    0.000000           0        13         1 shutdown
  0.00    0.000000           0        50           setsockopt
  0.00    0.000000           0         3           getsockopt
  0.00    0.000000           0        14           fcntl
  0.00    0.000000           0        34        22 accept4
------ ----------- ----------- --------- --------- ----------------
100.00    0.978107                 70485        85 total


During this time, each of the three processes was running at roughly
250-350% CPU.


On 1/12/18 9:34 AM, Willy Tarreau wrote:
> On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
>> Thanks for your quick answer, Willy.
>>
>> That's a shame to hear but makes sense. We'll try out some ideas for
>> reducing 

Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 09:28:54AM -0600, Samuel Reed wrote:
> Thanks for your quick answer, Willy.
> 
> That's a shame to hear but makes sense. We'll try out some ideas for
> reducing contention. We don't use cpu-map with nbthread; I considered it
> best to let the kernel take care of this, especially since there are
> some other processes on that box.

So that definitely explains why 5 instances start to give you a high load
with 4 threads on 16 cores. Note, do you happen to see some processes
running at 100% CPU (or in fact 400% since you have 4 threads) ? It would
be possible that some remaining bugs would cause older processes and their
threads to spin a bit too much.

If you're interested, when this happens you could run "strace -cp $pid"
for a few seconds; it will report the syscall counts over that period. A
typical rule of thumb is that if you see more epoll_wait() than recvfrom()
or read(), there's an issue somewhere in the code.

> I don't really want to fall back to
> nbproc but we may have to, at least until we get the number of reloads down.

It's possible, but let's see if there's a way to improve the situation a
bit by gathering some elements first ;-)

Willy



Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Samuel Reed
Thanks for your quick answer, Willy.

That's a shame to hear but makes sense. We'll try out some ideas for
reducing contention. We don't use cpu-map with nbthread; I considered it
best to let the kernel take care of this, especially since there are
some other processes on that box. I don't really want to fall back to
nbproc but we may have to, at least until we get the number of reloads down.


On 1/12/18 8:55 AM, Willy Tarreau wrote:
> Hi Samuel,
>
> On Thu, Jan 11, 2018 at 08:29:15PM -0600, Samuel Reed wrote:
>> Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
>> (we didn't see this before with nbproc) or somewhere we should start
>> looking?
> In fact no, nbthread is simply new, so it's not a regression, but we're
> starting to see some side effects. One possibility I can easily imagine
> is that in most places we're using spinlocks, because the locks are very
> short-lived and very small, so they have to be cheap. One limit of
> spinlocks is that you must never have more threads than cores, so that a
> thread is never scheduled out while holding a lock, leaving another one
> spinning for nothing for a whole timeslice.
>
> The reload makes an interesting case because if you use cpu-map to bind
> your threads to CPU cores, then during the soft-stop period they have to
> coexist on the same cores, and a thread of one process disturbs a thread
> of another process by regularly stealing its CPU.
>
> I can't say I'm seeing any easy solution to this in the short term, that's
> something we have to add to the list of things to improve in the future.
> Maybe something as simple as starting with SCHED_FIFO to prevent threads
> from being preempted outside of the poll loop, and dropping it upon reload
> could help a lot, but that's just speculation.
>
> We'll have to continue to think about this I guess. It may be possible
> that if your old processes last very long you'd continue to get a better
> experience using nbproc than nbthread :-/
>
> Willy




FW: Your exhibition stand at Engine Expo 2018

2018-01-12 Thread Brendan C
Hello Again,

If you are attending the IEX Insulation Expo in Cologne this May (or indeed any 
shows on the European Mainland or the UK) we would like to offer you a 
complimentary 3D Design for your stand. Just send us your brief (Please check 
the questions under my signature below for the information we need) and we will 
send you a no obligation quality design. All we ask in return is that you use 
us to construct the stand if you want to use our design.


  *   We specialize in stand builds throughout Germany, Spain, Italy and the
UK. With production facilities and labor partners in Poland and in each of
these countries, we can offer close to the best-priced quality and professional
stands in the European marketplace.
  *   We have built hundreds of stands in all the major venues around Europe
(frequently extending that to other parts of the world, including Asia and the
Middle East). We know how to cut through all the red tape, in all the
languages, to ensure our clients turn up to a perfect stand that is built to
brief, on budget and on time.
  *   We offer a complete solution - Design, drawings, fabrication, 
installation, removal, storage, furniture, AV, Electrics, Graphics, 
documentation, approvals and professional project management.

Just take a minute to reply to this email and we will aim to send you back a 
professional 5 page 3D design within 5 working days.

Thanks for your attention and I look forward to hearing back from you,




Brendan Coote
European Commercial and  Projects Director


41 High Street, East Grinstead, West Sussex, RH19 3AF
Poznan, Poland

Mobile: 0044 (0) 7789 500055
Office: 0044 (0) 1290 3202119

www.globalexhibitionworks.com
brend...@globalexhibitionworks.com




Briefing requirements:-
1.   Size of your stand (length x width)?
2.   How many sides are open?
3.   Will you require meeting room(s)? Open/Closed/Semi-Open?
4.   Will you require kitchen or storage room(s)?
5.   Bar area?
6.   Reception area?
7.   Presentation area?
8.   Will you want to display any of your products and require us to design
promotional display structures for these? Please elaborate with quantities,
sizes and descriptions.
9.   Will you require backlights for your graphics? (Increases the cost of
graphics by 50%)
10.   Hanging signage from the ceiling? (Will involve a hoisting/rigging fee
from the organisers)
11.   Material preference (steel, wood, laminate)?
12.   Raised floor? Carpet/laminate/wood finish?
13.   Any other needs or ideas?
14.   What is your budget?




Warning: upgrading to openssl master+ enable_tls1_3 (coming v1.1.1) could break handshakes for all protocol versions .

2018-01-12 Thread Emeric Brun
Hi All,

FYI: upgrading to the next openssl-1.1.1 could break your prod if you're using
a forced cipher list, because the handshake will fail regardless of the TLS
protocol version if you don't specify a cipher valid for TLSv1.3 in your
cipher list.

https://github.com/openssl/openssl/issues/5057

https://github.com/openssl/openssl/issues/5065

The OpenSSL team doesn't seem to consider this an issue, and I'm just tired of
discussing it with them.

R,
Emeric



Re: High load average under 1.8 with multiple draining processes

2018-01-12 Thread Willy Tarreau
Hi Samuel,

On Thu, Jan 11, 2018 at 08:29:15PM -0600, Samuel Reed wrote:
> Is there a regression in the 1.8 series with SO_REUSEPORT and nbthread
> (we didn't see this before with nbproc) or somewhere we should start
> looking?

In fact no, nbthread is simply new, so it's not a regression, but we're
starting to see some side effects. One possibility I can easily imagine
is that in most places we're using spinlocks, because the locks are very
short-lived and very small, so they have to be cheap. One limit of
spinlocks is that you must never have more threads than cores, so that a
thread is never scheduled out while holding a lock, leaving another one
spinning for nothing for a whole timeslice.
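
For readers unfamiliar with that constraint, here is a minimal generic
spinlock sketch (an illustration, not HAProxy's actual implementation): a
waiter burns CPU until the flag clears, so if the holder is scheduled out in
the middle of its critical section, every other thread spins uselessly for the
rest of its timeslice.

/* Generic spinlock sketch, not HAProxy code. */
#include <stdatomic.h>

typedef struct {
    atomic_flag flag;        /* clear = unlocked, set = locked */
} spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

static void spin_lock(spinlock_t *l)
{
    /* busy-wait: if the holder was preempted while holding the lock,
     * every waiter spins here until the holder runs again */
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;
}

static void spin_unlock(spinlock_t *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}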

The reload makes an interesting case because if you use cpu-map to bind
your threads to CPU cores, then during the soft-stop period they have to
coexist on the same cores, and a thread of one process disturbs a thread
of another process by regularly stealing its CPU.

I can't say I'm seeing any easy solution to this in the short term, that's
something we have to add to the list of things to improve in the future.
Maybe something as simple as starting with SCHED_FIFO to prevent threads
from being preempted outside of the poll loop, and dropping it upon reload
could help a lot, but that's just speculation.
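
To make that speculation concrete, here is a hypothetical sketch (not an
actual HAProxy patch; the function name is made up) of what such a toggle
could look like on Linux with the standard pthread scheduling API. Raising a
thread to SCHED_FIFO requires CAP_SYS_NICE or root.

/* Hypothetical sketch of the SCHED_FIFO idea above, not HAProxy code.
 * Promote (on != 0) or demote (on == 0) the calling thread: a SCHED_FIFO
 * thread is not preempted by regular SCHED_OTHER threads, so a poll loop
 * would keep its CPU; dropping back on soft-stop lets the new process's
 * threads run normally. */
#include <pthread.h>
#include <sched.h>
#include <string.h>

static int poll_thread_set_rt(int on)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = on ? 1 : 0;   /* SCHED_OTHER requires priority 0 */
    return pthread_setschedparam(pthread_self(),
                                 on ? SCHED_FIFO : SCHED_OTHER, &sp);
}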

We'll have to continue to think about this I guess. It may be possible
that if your old processes last very long you'd continue to get a better
experience using nbproc than nbthread :-/

Willy



Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Emmanuel Hocdet

> On 12 Jan 2018, at 15:23, Aleksandar Lazic wrote:
> 
> 
> -- Original Message --
> From: "Willy Tarreau" 
> To: "Emmanuel Hocdet" 
> Cc: "haproxy" 
> Sent: 12.01.2018 13:04:02
> Subject: Re: [BUG] 100% cpu on each threads
> 
>> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>>> When syndrome appear, i see such line on syslog:
>>> (for one or all servers)
>>> 
>>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
>>> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
>>> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>> 
>> Wow, that's scary! This means we have a problem with server-side connections
>> and I really have no idea what it's about at the moment :-(
> @Emmanuel: Wild guess: is this a meltdown/spectre-patched server, and have
> you seen these errors since the patch?
> 

No, nothing has changed on the servers (and they run different Linux kernels),
only the haproxy versions from 1.8-dev onwards … and I see the issue with
1.8.3 (and I don't know whether the issue exists in 1.8.2).





Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Cyril Bonté
Hi all,

- Original Mail -
> From: "Willy Tarreau" 
> To: "Emmanuel Hocdet" 
> Cc: "haproxy" 
> Sent: Friday, 12 January 2018 15:24:54
> Subject: Re: [BUG] 100% cpu on each threads
> 
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> > When syndrome appear, i see such line on syslog:
> > (for one or all servers)
> > 
> > Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info:
> > "Bad file descriptor", check duration: 2018ms. 0 active and 1
> > backup servers left. Running on backup. 0 sessions active, 0
> > requeued, 0 remaining in queue.
> 
> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

I'm not sure you saw Samuel Reed's mail.
He reported a similar issue a few hours ago ("High load average under 1.8 with
multiple draining processes"). It would be interesting to find a common
configuration to reproduce the issue, so I'm adding him to the thread.

Cyril




Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Emmanuel Hocdet

> On 12 Jan 2018, at 15:24, Willy Tarreau wrote:
> 
> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>> When syndrome appear, i see such line on syslog:
>> (for one or all servers)
>> 
>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file 
>> descriptor", check duration: 2018ms. 0 active and 1 backup servers left. 
>> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
> 

Or a new one:
Jan 12 13:25:13 webacc1 haproxy_ssl[31002]: Server tls/L7_1 is DOWN, reason: 
Layer4 connection problem, info: "General socket error (Bad file descriptor)", 
check duration: 0ms. 0 active and 1 backup servers left. Running on backup. 0 
sessions active, 0 requeued, 0 remaining in queue.


> So I tried a bit but found no way to reproduce this. I'll need some
> more info like the type of health-checks, probably the "server" line
> settings, stuff like this. Does it appear quickly or does it take a
> long time ? Also, does it recover from this on subsequent checks or
> does it stay stuck in this state ?

Yep, config included below.
The issue is not seen without checks (but that is also without traffic).

Manu

global
user haproxy
group haproxy
daemon

# for master-worker (-W)
stats socket /var/run/haproxy_ssl.sock expose-fd listeners
nbthread 8

log /dev/log daemon warning
log /dev/log local0

tune.ssl.cachesize 20
tune.ssl.lifetime 5m

ssl-default-bind-options no-sslv3
ssl-default-bind-ciphers 
ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES128-SHA:ECDHE-ECDSA-AES256-SHA:ECDHE-RSA-AES128-SHA:ECDHE-RSA-AES256-SHA:AES128-GCM-SHA256:AES256-GCM-SHA384:AES128-SHA:AES256-SHA


defaults
log global
log-tag "haproxy_ssl"
option dontlognull
maxconn 4
timeout connect 500ms
source 0.0.0.0

timeout client 207s
retries 3
timeout server 207s

listen tls

mode tcp
bind 127.0.0.1:463,X.Y.Z.B:463 accept-proxy ssl  tls-ticket-keys 
/var/lib/haproxy/ssl/tls_keys.cfg strict-sni crt-list 
/var/lib/haproxy/ssl/crtlist.cfg 

log-format 'resumed:%[ssl_fc_is_resumed] cipher:%sslc tlsv:%sslv'

balance roundrobin
option allbackups
fullconn 3

server L7_1 127.0.0.1:483 check send-proxy 

server L7_2 X.Y.Z.C:483 check send-proxy backup 



Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
> 
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file 
> descriptor", check duration: 2018ms. 0 active and 1 backup servers left. 
> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

So I tried a bit but found no way to reproduce this. I'll need some
more info like the type of health-checks, probably the "server" line
settings, stuff like this. Does it appear quickly or does it take a
long time ? Also, does it recover from this on subsequent checks or
does it stay stuck in this state ?

Willy



Re[2]: [BUG] 100% cpu on each threads

2018-01-12 Thread Aleksandar Lazic


-- Original Message --
From: "Willy Tarreau" 
To: "Emmanuel Hocdet" 
Cc: "haproxy" 
Sent: 12.01.2018 13:04:02
Subject: Re: [BUG] 100% cpu on each threads


> On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
>
>> When syndrome appear, i see such line on syslog:
>> (for one or all servers)
>>
>> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
>> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
>> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.
>
> Wow, that's scary! This means we have a problem with server-side connections
> and I really have no idea what it's about at the moment :-(

@Emmanuel: Wild guess: is this a meltdown/spectre-patched server, and have
you seen these errors since the patch?

> Willy

Aleks




Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Willy Tarreau
On Fri, Jan 12, 2018 at 12:01:15PM +0100, Emmanuel Hocdet wrote:
> When syndrome appear, i see such line on syslog:
> (for one or all servers)
> 
> Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file
> descriptor", check duration: 2018ms. 0 active and 1 backup servers left.
> Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Wow, that's scary! This means we have a problem with server-side connections
and I really have no idea what it's about at the moment :-(

Willy



Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Emmanuel Hocdet
Hi Willy

> On 12 Jan 2018, at 11:38, Willy Tarreau wrote:
> 
> Hi Manu,
> 
> On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
>> 
>> Hi,
>> 
>> with 1.8.3 + threads (with mworker)
>> I notice 100% CPU per thread (epoll_wait + gettimeofday in a loop).
>> The syndrome appears regularly on start/reload.
> 
> We got a similar report yesterday affecting 1.5 to 1.8 caused by
> a client aborting during a server redispatch. I don't know if it
> could be related at all to what you're seeing, but would you care
> to try the latest 1.8 git to see if it fixes it, since it contains
> the fix ?
> 

Same syndrome with the latest 1.8.

When syndrome appear, i see such line on syslog:
(for one or all servers)

Server tls/L7_1 is DOWN, reason: Layer4 connection problem, info: "Bad file 
descriptor", check duration: 2018ms. 0 active and 1 backup servers left. 
Running on backup. 0 sessions active, 0 requeued, 0 remaining in queue.

Manu




Re: Segfault on haproxy 1.7.10 with state file and slowstart

2018-01-12 Thread Willy Tarreau
Hello Raghu,

On Thu, Jan 11, 2018 at 02:20:34PM +0530, Raghu Udiyar wrote:
> Hello,
> 
> Haproxy 1.7.10 segfaults when the srv_admin_state is set to
> SRV_ADMF_CMAINT (0x04)
> for a backend server, and that backend has the `slowstart` option set.
> 
> The following configuration reproduces it :
(...)

Thanks for all the details, they made it easy to reproduce.

From what I'm seeing, it's a fundamental design issue in the state
file handling in 1.7. It starts checks before they have been
initialized, and tries to wake up a NULL task. In 1.8, due to the more
dynamic config, the initialization sequence has changed and checks
are initialized before parsing the state file, but I don't feel at
ease doing that in 1.7 since I don't know whether some config elements
may remain non-updated.

So instead I've just protected against using the task wakeups during
the state file parsing, and they will be initialized later with the
appropriate parameters.

Could you please check the attached patch on top of 1.7.10 ?

Thanks,
Willy
diff --git a/src/server.c b/src/server.c
index 66e8f8a..e847723 100644
--- a/src/server.c
+++ b/src/server.c
@@ -318,8 +318,10 @@ void srv_set_running(struct server *s, const char *reason)
 	s->last_change = now.tv_sec;
 
 	s->state = SRV_ST_STARTING;
-	if (s->slowstart > 0)
-		task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+	if (s->slowstart > 0) {
+		if (s->warmup)
+			task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+	}
 	else
 		s->state = SRV_ST_RUNNING;
 
@@ -622,8 +624,10 @@ void srv_clr_admin_flag(struct server *s, enum srv_admin mode)
 			s->state = SRV_ST_STOPPING;
 		else {
 			s->state = SRV_ST_STARTING;
-			if (s->slowstart > 0)
-				task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+			if (s->slowstart > 0) {
+				if (s->warmup)
+					task_schedule(s->warmup, tick_add(now_ms, MS_TO_TICKS(MAX(1000, s->slowstart / 20))));
+			}
 			else
 				s->state = SRV_ST_RUNNING;
 		}


Re: [BUG] 100% cpu on each threads

2018-01-12 Thread Willy Tarreau
Hi Manu,

On Fri, Jan 12, 2018 at 11:14:57AM +0100, Emmanuel Hocdet wrote:
> 
> Hi,
> 
> with 1.8.3 + threads (with mworker)
> I notice 100% CPU per thread (epoll_wait + gettimeofday in a loop).
> The syndrome appears regularly on start/reload.

We got a similar report yesterday affecting 1.5 to 1.8 caused by
a client aborting during a server redispatch. I don't know if it
could be related at all to what you're seeing, but would you care
to try the latest 1.8 git to see if it fixes it, since it contains
the fix ?

> My configuration includes one bind line with ssl in tcp mode.

There was TCP mode there as well :-/

Willy



[BUG] 100% cpu on each threads

2018-01-12 Thread Emmanuel Hocdet

Hi,

with 1.8.3 + threads (with mworker)
I notice 100% CPU per thread (epoll_wait + gettimeofday in a loop).
The syndrome appears regularly on start/reload.

My configuration includes one bind line with ssl in tcp mode.

Is this a known issue?

++
Manu