Re: Few problems seen in haproxy? (threads, connections).

2018-10-16 Thread Krishna Kumar (Engineering)
Hi Willy,

My systems were out of rotation for some other tests, so I did not get to
this till now. I have pulled the latest bits just now and tested. Regarding
maxconn, I simply kept maxconn in global/defaults at 1 million and have
this line in the backend section:
default-server maxconn 100
I have not seen the Queue/Max you mentioned earlier.
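For reference, the relevant pieces look roughly like this (a sketch;
section and server names are placeholders):

    global
        maxconn 1000000

    defaults
        maxconn 1000000

    backend http-be
        default-server maxconn 100
        server s1 ...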

The FD time has gone down to zero, but the LB time has increased by about
50% since last time (7700 ns to 11600 ns; I am using 'balance leastconn').
The test was run for 1 minute:
$ wrk -c 4800 -t 48 -d 60s http://www.flipkart.com/128

The results were for 32 threads, which is the same configuration I tested
with earlier. Both of these tests were done with threads pinned to NUMA-1
cores (cores 1, 3, 5, ..47) and irq's to NUMA-0 (0, 2, 4, ..46). However,
the cpu list recycles from 1-47 back to 1-15 for the thread pinning, so
some cores carry two threads, which may explain the much higher lock
numbers that I am seeing. When I changed this to use all cpus (0-31), the
LBPRM lock took 74339.117 ns per operation, but performance dropped from
210K to 80K.
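For 24 threads, the same odd-core pinning can be expressed without the
wrap-around, roughly as follows (a sketch; core list as described above):

    global
        nbthread 24
        cpu-map auto:1/1-24 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47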

Overall, I am not at ease with threading; I may have to settle for 12
threads on the 12 non-hyperthreaded cores of a single socket.

The lock output for the case where all threads are pinned to NUMA-1 cores
(and hence two threads share a core on some of them) is inlined at the end
of this mail.

Thanks,
- Krishna

Stats about Lock FD:
# write lock  : 2
# write unlock: 2 (0)
# wait time for write : 0.001 msec
# wait time for write/lock: 302.000 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock  : 373317
# write unlock: 373317 (0)
# wait time for write : 341.875 msec
# wait time for write/lock: 915.775 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock  : 373432
# write unlock: 373432 (0)
# wait time for write : 491.524 msec
# wait time for write/lock: 1316.235 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock  : 1248
# write unlock: 1248 (0)
# wait time for write : 0.295 msec
# wait time for write/lock: 236.341 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock  : 12524202
# write unlock: 12524202 (0)
# wait time for write : 20979.972 msec
# wait time for write/lock: 1675.154 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock  : 50100330
# write unlock: 50100330 (0)
# wait time for write : 76908.311 msec
# wait time for write/lock: 1535.086 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
# write lock  : 50096808
# write unlock: 50096808 (0)
# wait time for write : 584505.012 msec
# wait time for write/lock: 11667.510 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
# write lock  : 35653802
# write unlock: 35653802 (0)
# wait time for write : 80406.420 msec
# wait time for write/lock: 2255.199 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock STRMS:
# write lock  : 9602
# write unlock: 9602 (0)
# wait time for write : 5.613 msec
# wait time for write/lock: 584.594 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock VARS:
# write lock  : 37596611
# write unlock: 37596611 (0)
# wait time for write : 2285.148 msec
# wait time for write/lock: 60.781 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec


On Mon, Oct 15, 2018 at 11:14 PM Willy Tarreau  wrote:

> Hi again,
>
> finally I got rid of the FD lock for single-threaded accesses (most of
> them), and based on Olivier's suggestion, I implemented a per-thread
> wait queue, and cache-aligned some list heads to avoid undesired cache
> line sharing. For me all of this combined resulted in a performance
> increase of 25% on a 12-threads workload. I'm interested in your test
> results, all of this is in the latest master.
>
> If you still see LBPRM a lot, I can send you the experimental patch
> to move the element inside the tree without unlinking/relinking it
> and we can see if that provides any benefit or not (I'm not convinced).
>
> Cheers,
> Willy
>


Re: Few problems seen in haproxy? (threads, connections).

2018-10-11 Thread Krishna Kumar (Engineering)
I must say the improvements are pretty impressive!

Earlier number reported with 24 processes:  519K
Earlier number reported with 24 threads:     79K
New RPS with system irq tuning, today's git,
  and configuration changes, 24 threads:    353K
Old code with the same tuning gave:         290K

My test machine is a 2-NUMA-node server: cpus 0, 2, ..., 22 on node 0 and
1, 3, ..., 23 on node 1, which adds up to 24 cpus. The remaining 24 cores
are HT siblings: node 0 has 24, 26, ..., 46 and node 1 has 25, 27, ..., 47.
This may explain why it scales well up to 24. The 16 irq's of the NIC are
pinned to cpus 0, 2, ..., 30. Hoping performance further improves.
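The irq pinning itself was scripted along these lines (a sketch; it
assumes the Mellanox irqs show up with "mlx" in /proc/interrupts, so
adjust the match for your driver):

    systemctl stop irqbalance
    cpu=0
    for irq in $(awk '/mlx/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
        echo $cpu > /proc/irq/$irq/smp_affinity_list
        cpu=$((cpu + 2))
    done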

12 threads: 280K
16 threads: 318K
24 threads: 353K (occasional drop till 330K)
32 threads: 238K

I am attaching 2 text files of the lock metrics for 24 and 32 threads. A
vimdiff shows the differences nicely (fd, task_rq, task_wq, proxy, server,
lbprm and buf_wq increased significantly).

Thanks!

On Thu, Oct 11, 2018 at 8:53 AM Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Thanks, will do that.
>
> On Thu, Oct 11, 2018 at 8:37 AM Willy Tarreau  wrote:
>
>> On Thu, Oct 11, 2018 at 08:18:21AM +0530, Krishna Kumar (Engineering)
>> wrote:
>> > Hi Willy,
>> >
>> > Thank you very much for the in-depth analysis and configuration setting
>> > suggestions.
>> > I believe I have got the 3 key items to continue based on your mail:
>> >
>> > 1. Thread pinning
>> > 2. Fix system irq pinning accordingly
>> > 3. Listen on all threads.
>> >
>> > I will post the configuration changes and the results.
>>
>> By the way, please pull the latest master fixes. I've addressed two issues
>> there with locking :
>>   - one where the scheduler work was slightly too high, increasing the
>> time
>> spent on RQ lock
>>   - another one where I messed up on a fix, causing lock-free pools to be
>> disabled (as seen in your output, where the POOL lock appears a lot)
>>
>> On some tests I've run here, I've found the stick-tables lock to be a
>> bottleneck when tracking is enabled. I don't have a short-term solution
>> to this, but looking at the code it's obvious that it can significantly
>> be improved (though it will take quite some time). I'll probably at least
>> try to replace it with an RW lock as I think it could improve the
>> situation.
>>
>> The FD lock is another one requiring some lift-up. I'm certain it's
>> possible,
>> I just don't know if it will not degrade low-thread count performance by
>> using too many atomic ops instead. We'll have to experiment.
>>
>> Cheers,
>> Willy
>>
>
Stats about Lock FD: 
 # write lock  : 407304162
 # write unlock: 407304155 (-7)
 # wait time for write : 39627.620 msec
 # wait time for write/lock: 97.292 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ: 
 # write lock  : 2230051
 # write unlock: 2230051 (0)
 # wait time for write : 63163.277 msec
 # wait time for write/lock: 28323.692 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ: 
 # write lock  : 14897430
 # write unlock: 14897430 (0)
 # wait time for write : 49136.313 msec
 # wait time for write/lock: 3298.308 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about Lock POOL: 
 # write lock  : 0
 # write unlock: 0 (0)
 # wait time for write : 0.000 msec
 # wait time for write/lock: 0.000 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER: 
 # write lock  : 5500
 # write unlock: 5500 (0)
 # wait time for write : 0.076 msec
 # wait time for write/lock: 13.734 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about Lock PROXY: 
 # write lock  : 7368276
 # write unlock: 7368276 (0)
 # wait time for write : 16768.394 msec
 # wait time for write/lock: 2275.755 nsec
 # read lock   : 0
 # read unlock : 0 (0)
 # wait time for read  : 0.000 msec
 # wait time for read/lock : 0.000 nsec
Stats about 

Re: Few problems seen in haproxy? (threads, connections).

2018-10-10 Thread Krishna Kumar (Engineering)
Hi Willy,

Thank you very much for the in-depth analysis and configuration setting
suggestions.
I believe I have got the 3 key items to continue based on your mail:

1. Thread pinning
2. Fix system irq pinning accordingly
3. Listen on all threads.

I will post the configuration changes and the results.
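For concreteness, the shape of the change I plan to test is roughly the
following (a sketch; thread count, core list and frontend name are
illustrative, and the irq pinning happens on the OS side):

    global
        nbthread 12
        cpu-map auto:1/1-12 1 3 5 7 9 11 13 15 17 19 21 23

    frontend http-fe
        bind :80 process 1/1
        bind :80 process 1/2
        ...
        bind :80 process 1/12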

Regards,
- Krishna


On Wed, Oct 10, 2018 at 6:39 PM Willy Tarreau  wrote:

> Hi Krishna,
>
> On Tue, Oct 02, 2018 at 09:18:19PM +0530, Krishna Kumar (Engineering)
> wrote:
> (...)
> > 1. HAProxy system:
> > Kernel: 4.17.13,
> > CPU: 48 core E5-2670 v3
> > Memory: 128GB memory
> > NIC: Mellanox 40g with IRQ pinning
> >
> > 2. Client, 48 core similar to server. Test command line:
> > wrk -c 4800 -t 48 -d 30s http:///128
> >
> > 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
> as
> > of
> > Oct 2nd).
> > # haproxy-git -vv
> > HA-Proxy version 1.9-dev3 2018/09/29
> (...)
> > 4. HAProxy results for #processes and #threads
> > #   Threads-RPS   Procs-RPS
> > 1         20903       19280
> > 2         46400       51045
> > 4         96587      142801
> > 8        172224      254720
> > 12       210451      437488
> > 16       173034      437375
> > 24        79069      519367
> > 32        55607      586367
> > 48        31739      596148
>
> Our largest thread test was on 12 cores and it happens that in your case
> it's also the optimal one.
>
> However I do have some comments about your config, before going back to
> real thread issues :
>
> > # cpu-map auto:1/1-48 0-39
>   => you must absolutely pin your processes, and they must be pinned
>  to cores *not* shared with the network card. That's critical.
>  Moreover it's also important that threads are not split across
>  multiple physical CPUs because the remote L3 cache access time
>  over QPI/UPI is terrible. When you run on 12 threads with two
>  12-cores/24-threads CPUs, you could very well have haproxy using
>  12 threads from 6 cores, and the NIC using 12 threads from the
>  other 6 cores of the same physical CPU. The second socket is,
>  as usual, useless for anything requiring low latency. However
>  it's perfect to run SSL. So you could be interested in testing
>  if running the NIC on one socket (try to figure what node the
>  PCIe lanes are physically connected to), and haproxy on the other
>  one. It *could* be possible that you get more performance from 12
>  cores of each but I strongly doubt it based on a number of tests.
>  If you use SSL however it's different as you will benefit from
>  lots of cores much more than low latency.
>
> > bind :80 process 1/1-48
>   => it's also capital for scalability to have individual bind lines. Here
>  you have a single socket accessed from all 48 threads. There's no
>  efficient thread load balancing here. By having this :
>
>  bind :80 process 1/1
>  bind :80 process 1/2
>  ...
>  bind :80 process 1/47
>  bind :80 process 1/48
>
>  You will let the kernel perform the load balancing and distribute a
>  fair load to all threads. This way none of them will risk to pick a
>  larger share of the incoming connections than optimal. I know it's
>  annoying to configure at the moment, I've been thinking about having
>  a way to automatically iterate from a single config line (like the
>  "auto" feature of cpu-map), but for now it's not done.
>
> Now back to the thread measurements :
>
> > 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more
> time
> >to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
> > Stats about Lock FD:
> > # write lock  : 143933900
> > # write unlock: 143933895 (-5)
> > # wait time for write : 11370.245 msec
>
> This one definitely is huge. We know some work is still needed on this lock
> and that there are still a few low hanging fruits but not much savings to
> expect short term. This output is very revealing however of the importance
> of this lock.
>
> > # wait time for write/lock: 78.996 nsec
>
> That's roughly the time it takes to access the other CPU's cache, so using
> your two sockets for the same process definitely hurts a lot here.
>
> > Stats about Lock TASK_RQ:
> > # write lock  : 2062874
> > # write unlock: 2062875 (1)
> > # wait time for write : 7820.234 msec
>
> This one is still far too large for what we'd hope, even though it
> has significantly shrunk since 1.8. It could be related to the poor
> distribution of the incoming connections across threads.
>
> > # wait time for write/lock: 3790.941 nsec
>
> Wow, 3.8 microseconds to acquire the wr

Re: Few problems seen in haproxy? (threads, connections).

2018-10-05 Thread Krishna Kumar (Engineering)
Sorry for repeating once again, but this is my last unsolicited
mail on this topic. Any directions for what to look out for?

Thanks,
- Krishna


On Thu, Oct 4, 2018 at 8:42 AM Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Re-sending in case this mail was missed. To summarise the 3 issues seen:
>
> 1. Performance drops 18x with higher number of nbthreads as compared to
> nbprocs.
> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds (for
> 1.9-dev3
> for nbprocs and nbthreads).
> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it remains in
> either
>  CLOSE-WAIT (towards clients) and ESTAB (towards the backend servers),
> till
>  the server/client timeout expires.
>
> The tests for threads and processes were done on the same systems, so
> there is
> no difference in system parameters.
>
> Thanks,
> - Krishna
>
>
> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
> krishna...@flipkart.com> wrote:
>
>> Hi Willy, and community developers,
>>
>> I am not sure if I am doing something wrong, but wanted to report
>> some issues that I am seeing. Please let me know if this is a problem.
>>
>> 1. HAProxy system:
>> Kernel: 4.17.13,
>> CPU: 48 core E5-2670 v3
>> Memory: 128GB memory
>> NIC: Mellanox 40g with IRQ pinning
>>
>> 2. Client, 48 core similar to server. Test command line:
>> wrk -c 4800 -t 48 -d 30s http:///128
>>
>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
>> as of
>> Oct 2nd).
>> # haproxy-git -vv
>> HA-Proxy version 1.9-dev3 2018/09/29
>> Copyright 2000-2018 Willy Tarreau 
>>
>> Build options :
>>   TARGET  = linux2628
>>   CPU = generic
>>   CC  = gcc
>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>
>> Default settings :
>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>>
>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>> OpenSSL library supports TLS extensions : yes
>> OpenSSL library supports SNI : yes
>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>> Built with transparent proxy support using: IP_TRANSPARENT
>> IPV6_TRANSPARENT IP_FREEBIND
>> Encrypted password support via crypt(3): yes
>> Built with multi-threading support.
>> Built with PCRE version : 8.38 2015-11-23
>> Running on PCRE version : 8.38 2015-11-23
>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>> Built with zlib version : 1.2.8
>> Running on zlib version : 1.2.8
>> Compression algorithms supported : identity("identity"),
>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>> Built with network namespace support.
>>
>> Available polling systems :
>>   epoll : pref=300,  test result OK
>>poll : pref=200,  test result OK
>>  select : pref=150,  test result OK
>> Total: 3 (3 usable), will use epoll.
>>
>> Available multiplexer protocols :
>> (protocols markes as  cannot be specified using 'proto' keyword)
>>   h2 : mode=HTTP   side=FE
>> : mode=TCP|HTTP   side=FE|BE
>>
>> Available filters :
>> [SPOE] spoe
>> [COMP] compression
>> [TRACE] trace
>>
>> 4. HAProxy results for #processes and #threads
>> #   Threads-RPS   Procs-RPS
>> 1         20903       19280
>> 2         46400       51045
>> 4         96587      142801
>> 8        172224      254720
>> 12       210451      437488
>> 16       173034      437375
>> 24        79069      519367
>> 32        55607      586367
>> 48        31739      596148
>>
>> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more
>> time
>>to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
>> Stats about Lock FD:
>> # write lock  : 143933900
>> # write unlock: 143933895 (-5)
>> # wait time for write : 11370.245 msec
>> # wait time for write/lock: 78.996 nsec
>> # read lock   : 0
>> # read unlock : 0 (0)
>> # wait time for read  : 0.000 msec
>> # wait time for read/lock : 0.000 nsec
>> Stats about Lock TASK_RQ:
>> # write lock  : 2062874
>> # write unlock: 2062875 (1)
>> # wait time for write : 7820.234 ms

Re: Few problems seen in haproxy? (threads, connections).

2018-10-04 Thread Krishna Kumar (Engineering)
Thanks, will take a look!

On Thu, Oct 4, 2018 at 12:58 PM Илья Шипицин  wrote:

> what I going to try (when I will have some spare time) is sampling with
> google perftools
>
> https://github.com/gperftools/gperftools
>
> they are great in cpu profiling.
> you can try them youself if you have time/wish :)
>
>
> Thu, 4 Oct 2018 at 11:53, Krishna Kumar (Engineering) <
> krishna...@flipkart.com>:
>
>> 1. haproxy config: Same as given above (both processes and threads were
>> given in the mail)
>> 2. nginx: default, no changes.
>> 3. sysctl's: nothing set. All changes as described earlier (e.g.
>> irqbalance, irq pinning, etc).
>> 4. nf_conntrack: disabled
>> 5. dmesg: no messages.
>>
>> With the same system and settings, threads gives 18x lesser RPS than
>> processes, along with
>> the other 2 issues given in my mail today.
>>
>>
>> On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин 
>> wrote:
>>
>>> haproxy config, nginx config
>>> non default sysctl (if any)
>>>
>>> as a side note, can you have a look at "dmesg" output ? do you have nf
>>> conntrack enabled ? what are its limits ?
>>>
>>> Thu, 4 Oct 2018 at 9:59, Krishna Kumar (Engineering) <
>>> krishna...@flipkart.com>:
>>>
>>>> Sure.
>>>>
>>>> 1. Client: Use one of the following two setup's:
>>>> - a single baremetal (48 core, 40g) system
>>>>   Run: "wrk -c 4800 -t 48 -d 30s http://:80/128", or,
>>>> - 100 2 core vm's.
>>>>   Run "wrk -c 16 -t 2 -d 30s http://:80/128" from
>>>>   each VM and summarize the results using some
>>>>   parallel-ssh setup.
>>>>
>>>> 2. HAProxy running on a single baremetal (same system config
>>>> as client - 48 core, 40g, 4.17.13 kernel, irq tuned to use different
>>>> cores of the same NUMA node for each irq, kill irqbalance, with
>>>> haproxy configuration file as given in my first mail. Around 60
>>>> backend servers are configured in haproxy.
>>>>
>>>> 3. Backend servers are 2 core VM's running nginx and serving
>>>> a file called "/128", which is 128 bytes in size.
>>>>
>>>> Let me know if you need more information.
>>>>
>>>> Thanks,
>>>> - Krishna
>>>>
>>>>
>>>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин 
>>>> wrote:
>>>>
>>>>> load testing is somewhat good.
>>>>> can you describe an overall setup ? (I want to reproduce and play with
>>>>> it)
>>>>>
>>>>> Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering) <
>>>>> krishna...@flipkart.com>:
>>>>>
>>>>>> Re-sending in case this mail was missed. To summarise the 3 issues
>>>>>> seen:
>>>>>>
>>>>>> 1. Performance drops 18x with higher number of nbthreads as compared
>>>>>> to nbprocs.
>>>>>> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds
>>>>>> (for 1.9-dev3
>>>>>> for nbprocs and nbthreads).
>>>>>> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it
>>>>>> remains in either
>>>>>>  CLOSE-WAIT (towards clients) and ESTAB (towards the backend
>>>>>> servers), till
>>>>>>  the server/client timeout expires.
>>>>>>
>>>>>> The tests for threads and processes were done on the same systems, so
>>>>>> there is
>>>>>> no difference in system parameters.
>>>>>>
>>>>>> Thanks,
>>>>>> - Krishna
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>>>>>> krishna...@flipkart.com> wrote:
>>>>>>
>>>>>>> Hi Willy, and community developers,
>>>>>>>
>>>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>>>> some issues that I am seeing. Please let me know if this is a
>>>>>>> problem.
>>>>>>>
>>>>>>> 1. HAProxy system:
>>>>>>> Kernel: 4.17.13,
>>>>>>> CPU: 48 core E5-2670 v3
>>>

Re: Few problems seen in haproxy? (threads, connections).

2018-10-04 Thread Krishna Kumar (Engineering)
1. haproxy config: Same as given above (both processes and threads were
given in the mail)
2. nginx: default, no changes.
3. sysctl's: nothing set. All changes as described earlier (e.g.
irqbalance, irq pinning, etc).
4. nf_conntrack: disabled
5. dmesg: no messages.

With the same system and settings, threads give 18x lower RPS than
processes, along with the other 2 issues given in my mail today.
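(For reference, items 4 and 5 can be re-checked with something like the
following; the conntrack sysctl exists only when the module is loaded:)

    dmesg | tail -50
    lsmod | grep nf_conntrack
    sysctl net.netfilter.nf_conntrack_max 2>/dev/null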


On Thu, Oct 4, 2018 at 12:09 PM Илья Шипицин  wrote:

> haproxy config, nginx config
> non default sysctl (if any)
>
> as a side note, can you have a look at "dmesg" output ? do you have nf
> conntrack enabled ? what are its limits ?
>
> Thu, 4 Oct 2018 at 9:59, Krishna Kumar (Engineering) <
> krishna...@flipkart.com>:
>
>> Sure.
>>
>> 1. Client: Use one of the following two setup's:
>> - a single baremetal (48 core, 40g) system
>>   Run: "wrk -c 4800 -t 48 -d 30s http://:80/128", or,
>> - 100 2 core vm's.
>>   Run "wrk -c 16 -t 2 -d 30s http://:80/128" from
>>   each VM and summarize the results using some
>>   parallel-ssh setup.
>>
>> 2. HAProxy running on a single baremetal (same system config
>> as client - 48 core, 40g, 4.17.13 kernel, irq tuned to use different
>> cores of the same NUMA node for each irq, kill irqbalance, with
>> haproxy configuration file as given in my first mail. Around 60
>> backend servers are configured in haproxy.
>>
>> 3. Backend servers are 2 core VM's running nginx and serving
>> a file called "/128", which is 128 bytes in size.
>>
>> Let me know if you need more information.
>>
>> Thanks,
>> - Krishna
>>
>>
>> On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин 
>> wrote:
>>
>>> load testing is somewhat good.
>>> can you describe an overall setup ? (I want to reproduce and play with
>>> it)
>>>
>>> Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering) <
>>> krishna...@flipkart.com>:
>>>
>>>> Re-sending in case this mail was missed. To summarise the 3 issues seen:
>>>>
>>>> 1. Performance drops 18x with higher number of nbthreads as compared to
>>>> nbprocs.
>>>> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds
>>>> (for 1.9-dev3
>>>> for nbprocs and nbthreads).
>>>> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it remains
>>>> in either
>>>>  CLOSE-WAIT (towards clients) and ESTAB (towards the backend
>>>> servers), till
>>>>  the server/client timeout expires.
>>>>
>>>> The tests for threads and processes were done on the same systems, so
>>>> there is
>>>> no difference in system parameters.
>>>>
>>>> Thanks,
>>>> - Krishna
>>>>
>>>>
>>>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>>>> krishna...@flipkart.com> wrote:
>>>>
>>>>> Hi Willy, and community developers,
>>>>>
>>>>> I am not sure if I am doing something wrong, but wanted to report
>>>>> some issues that I am seeing. Please let me know if this is a problem.
>>>>>
>>>>> 1. HAProxy system:
>>>>> Kernel: 4.17.13,
>>>>> CPU: 48 core E5-2670 v3
>>>>> Memory: 128GB memory
>>>>> NIC: Mellanox 40g with IRQ pinning
>>>>>
>>>>> 2. Client, 48 core similar to server. Test command line:
>>>>> wrk -c 4800 -t 48 -d 30s http:///128
>>>>>
>>>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git
>>>>> checkout as of
>>>>> Oct 2nd).
>>>>> # haproxy-git -vv
>>>>> HA-Proxy version 1.9-dev3 2018/09/29
>>>>> Copyright 2000-2018 Willy Tarreau 
>>>>>
>>>>> Build options :
>>>>>   TARGET  = linux2628
>>>>>   CPU = generic
>>>>>   CC  = gcc
>>>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>>>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>>>>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>>>>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>>>
>>>>> Default settings :
>>

Re: Few problems seen in haproxy? (threads, connections).

2018-10-03 Thread Krishna Kumar (Engineering)
Sure.

1. Client: Use one of the following two setups:
- a single baremetal (48 core, 40g) system
  Run: "wrk -c 4800 -t 48 -d 30s http://:80/128", or,
- 100 2-core VMs.
  Run "wrk -c 16 -t 2 -d 30s http://:80/128" from
  each VM and summarize the results using some
  parallel-ssh setup.

2. HAProxy running on a single baremetal (same system config as the
client: 48 core, 40g, 4.17.13 kernel, irq's tuned to use a different core
of the same NUMA node for each irq, irqbalance killed), with the haproxy
configuration file as given in my first mail. Around 60 backend servers
are configured in haproxy.

3. Backend servers are 2-core VMs running nginx and serving a file called
"/128", which is 128 bytes in size.
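(The "/128" file can be created with something like the following; the
docroot path is an assumption:)

    head -c 128 /dev/zero > /usr/share/nginx/html/128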

Let me know if you need more information.

Thanks,
- Krishna


On Thu, Oct 4, 2018 at 10:21 AM Илья Шипицин  wrote:

> load testing is somewhat good.
> can you describe an overall setup ? (I want to reproduce and play with it)
>
> Thu, 4 Oct 2018 at 8:16, Krishna Kumar (Engineering) <
> krishna...@flipkart.com>:
>
>> Re-sending in case this mail was missed. To summarise the 3 issues seen:
>>
>> 1. Performance drops 18x with higher number of nbthreads as compared to
>> nbprocs.
>> 2. CPU utilisation remains at 100% after wrk finishes for 30 seconds (for
>> 1.9-dev3
>> for nbprocs and nbthreads).
>> 3. Sockets on client remain in FIN-WAIT-2, while on HAProxy it remains in
>> either
>>  CLOSE-WAIT (towards clients) and ESTAB (towards the backend
>> servers), till
>>  the server/client timeout expires.
>>
>> The tests for threads and processes were done on the same systems, so
>> there is
>> no difference in system parameters.
>>
>> Thanks,
>> - Krishna
>>
>>
>> On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
>> krishna...@flipkart.com> wrote:
>>
>>> Hi Willy, and community developers,
>>>
>>> I am not sure if I am doing something wrong, but wanted to report
>>> some issues that I am seeing. Please let me know if this is a problem.
>>>
>>> 1. HAProxy system:
>>> Kernel: 4.17.13,
>>> CPU: 48 core E5-2670 v3
>>> Memory: 128GB memory
>>> NIC: Mellanox 40g with IRQ pinning
>>>
>>> 2. Client, 48 core similar to server. Test command line:
>>> wrk -c 4800 -t 48 -d 30s http:///128
>>>
>>> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout
>>> as of
>>> Oct 2nd).
>>> # haproxy-git -vv
>>> HA-Proxy version 1.9-dev3 2018/09/29
>>> Copyright 2000-2018 Willy Tarreau 
>>>
>>> Build options :
>>>   TARGET  = linux2628
>>>   CPU = generic
>>>   CC  = gcc
>>>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
>>> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
>>> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
>>> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>>>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>>>
>>> Default settings :
>>>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>>>
>>> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
>>> OpenSSL library supports TLS extensions : yes
>>> OpenSSL library supports SNI : yes
>>> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
>>> Built with transparent proxy support using: IP_TRANSPARENT
>>> IPV6_TRANSPARENT IP_FREEBIND
>>> Encrypted password support via crypt(3): yes
>>> Built with multi-threading support.
>>> Built with PCRE version : 8.38 2015-11-23
>>> Running on PCRE version : 8.38 2015-11-23
>>> PCRE library supports JIT : no (USE_PCRE_JIT not set)
>>> Built with zlib version : 1.2.8
>>> Running on zlib version : 1.2.8
>>> Compression algorithms supported : identity("identity"),
>>> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
>>> Built with network namespace support.
>>>
>>> Available polling systems :
>>>   epoll : pref=300,  test result OK
>>>poll : pref=200,  test result OK
>>>  select : pref=150,  test result OK
>>> Total: 3 (3 usable), will use epoll.
>>>
>>> Available multiplexer protocols :
>>> (protocols markes as  cannot be specified using 'proto' keyword)
>>>

Re: Few problems seen in haproxy? (threads, connections).

2018-10-03 Thread Krishna Kumar (Engineering)
Re-sending in case this mail was missed. To summarise the 3 issues seen:

1. Performance drops 18x with a higher number of nbthread threads as
   compared to nbproc processes.
2. CPU utilisation remains at 100% after the 30-second wrk run finishes
   (for 1.9-dev3, for both nbproc and nbthread).
3. Sockets on the client remain in FIN-WAIT-2, while on HAProxy they
   remain in either CLOSE-WAIT (towards clients) or ESTAB (towards the
   backend servers) till the server/client timeout expires; the states
   can be checked with the ss commands below.
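A sketch of those checks (standard ss state filters; run on the relevant
host):

    ss -tn state fin-wait-2    # on the client
    ss -tn state close-wait    # on haproxy, towards the clients
    ss -tn state established   # on haproxy, towards the backends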

The tests for threads and processes were done on the same systems, so
there is no difference in system parameters.

Thanks,
- Krishna


On Tue, Oct 2, 2018 at 9:18 PM Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Hi Willy, and community developers,
>
> I am not sure if I am doing something wrong, but wanted to report
> some issues that I am seeing. Please let me know if this is a problem.
>
> 1. HAProxy system:
> Kernel: 4.17.13,
> CPU: 48 core E5-2670 v3
> Memory: 128GB memory
> NIC: Mellanox 40g with IRQ pinning
>
> 2. Client, 48 core similar to server. Test command line:
> wrk -c 4800 -t 48 -d 30s http:///128
>
> 3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as
> of
> Oct 2nd).
> # haproxy-git -vv
> HA-Proxy version 1.9-dev3 2018/09/29
> Copyright 2000-2018 Willy Tarreau 
>
> Build options :
>   TARGET  = linux2628
>   CPU = generic
>   CC  = gcc
>   CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
> -fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
> -Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
> -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
>   OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>
> Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
> Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
> OpenSSL library supports TLS extensions : yes
> OpenSSL library supports SNI : yes
> OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
> Built with transparent proxy support using: IP_TRANSPARENT
> IPV6_TRANSPARENT IP_FREEBIND
> Encrypted password support via crypt(3): yes
> Built with multi-threading support.
> Built with PCRE version : 8.38 2015-11-23
> Running on PCRE version : 8.38 2015-11-23
> PCRE library supports JIT : no (USE_PCRE_JIT not set)
> Built with zlib version : 1.2.8
> Running on zlib version : 1.2.8
> Compression algorithms supported : identity("identity"),
> deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
> Built with network namespace support.
>
> Available polling systems :
>   epoll : pref=300,  test result OK
>poll : pref=200,  test result OK
>  select : pref=150,  test result OK
> Total: 3 (3 usable), will use epoll.
>
> Available multiplexer protocols :
> (protocols markes as  cannot be specified using 'proto' keyword)
>   h2 : mode=HTTP   side=FE
> : mode=TCP|HTTP   side=FE|BE
>
> Available filters :
> [SPOE] spoe
> [COMP] compression
> [TRACE] trace
>
> 4. HAProxy results for #processes and #threads
> #   Threads-RPS   Procs-RPS
> 1         20903       19280
> 2         46400       51045
> 4         96587      142801
> 8        172224      254720
> 12       210451      437488
> 16       173034      437375
> 24        79069      519367
> 32        55607      586367
> 48        31739      596148
>
> 5. Lock stats for 1.9-dev3: Some write locks on average took a lot more
> time
>to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
> Stats about Lock FD:
> # write lock  : 143933900
> # write unlock: 143933895 (-5)
> # wait time for write : 11370.245 msec
> # wait time for write/lock: 78.996 nsec
> # read lock   : 0
> # read unlock : 0 (0)
> # wait time for read  : 0.000 msec
> # wait time for read/lock : 0.000 nsec
> Stats about Lock TASK_RQ:
> # write lock  : 2062874
> # write unlock: 2062875 (1)
> # wait time for write : 7820.234 msec
> # wait time for write/lock: 3790.941 nsec
> # read lock   : 0
> # read unlock : 0 (0)
> # wait time for read  : 0.000 msec
> # wait time for read/lock : 0.000 nsec
> Stats about Lock TASK_WQ:
> # write lock  : 2601227
> # write unlock: 2601227 (0)
> # wait time for write : 5019.811 msec
> # wait time for write/lock: 1929.786 nsec
> # read lock   : 0
> # read unlock : 0 (0)
> # wait time for read  : 0.000 msec
> # wait time for read/lock : 0.000 nsec
> Stats about Lock POOL:
> # write lock  : 2823393
> # write unlock: 2823393 (0)
> # wait time for write : 11984.706 msec
> # wait time for write/lock: 4244.788 nsec
> # read lock   : 0
> # read unlock : 0 (0)

Few problems seen in haproxy? (threads, connections).

2018-10-02 Thread Krishna Kumar (Engineering)
Hi Willy, and community developers,

I am not sure if I am doing something wrong, but wanted to report
some issues that I am seeing. Please let me know if this is a problem.

1. HAProxy system:
Kernel: 4.17.13,
CPU: 48 core E5-2670 v3
Memory: 128GB memory
NIC: Mellanox 40g with IRQ pinning

2. Client, 48 core similar to server. Test command line:
wrk -c 4800 -t 48 -d 30s http:///128

3. HAProxy version: I am testing both 1.8.14 and 1.9-dev3 (git checkout as
of
Oct 2nd).
# haproxy-git -vv
HA-Proxy version 1.9-dev3 2018/09/29
Copyright 2000-2018 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -fno-strict-overflow -Wno-unused-label -Wno-sign-compare
-Wno-unused-parameter -Wno-old-style-declaration -Wno-ignored-qualifiers
-Wno-clobbered -Wno-missing-field-initializers -Wtype-limits
  OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available multiplexer protocols :
(protocols markes as  cannot be specified using 'proto' keyword)
  h2 : mode=HTTP   side=FE
: mode=TCP|HTTP   side=FE|BE

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

4. HAProxy results for #processes and #threads
#   Threads-RPS   Procs-RPS
1         20903       19280
2         46400       51045
4         96587      142801
8        172224      254720
12       210451      437488
16       173034      437375
24        79069      519367
32        55607      586367
48        31739      596148

5. Lock stats for 1.9-dev3: Some write locks on average took a lot more time
   to acquire, e.g. "POOL" and "TASK_WQ". For 48 threads, I get:
Stats about Lock FD:
# write lock  : 143933900
# write unlock: 143933895 (-5)
# wait time for write : 11370.245 msec
# wait time for write/lock: 78.996 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_RQ:
# write lock  : 2062874
# write unlock: 2062875 (1)
# wait time for write : 7820.234 msec
# wait time for write/lock: 3790.941 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock TASK_WQ:
# write lock  : 2601227
# write unlock: 2601227 (0)
# wait time for write : 5019.811 msec
# wait time for write/lock: 1929.786 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock POOL:
# write lock  : 2823393
# write unlock: 2823393 (0)
# wait time for write : 11984.706 msec
# wait time for write/lock: 4244.788 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LISTENER:
# write lock  : 184
# write unlock: 184 (0)
# wait time for write : 0.011 msec
# wait time for write/lock: 60.554 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock PROXY:
# write lock  : 291557
# write unlock: 291557 (0)
# wait time for write : 109.694 msec
# wait time for write/lock: 376.235 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock SERVER:
# write lock  : 1188511
# write unlock: 1188511 (0)
# wait time for write : 854.171 msec
# wait time for write/lock: 718.690 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock LBPRM:
# write lock  : 1184709
# write unlock: 1184709 (0)
# wait time for write : 778.947 msec
# wait time for write/lock: 657.501 nsec
# read lock   : 0
# read unlock : 0 (0)
# wait time for read  : 0.000 msec
# wait time for read/lock : 0.000 nsec
Stats about Lock BUF_WQ:
# write lock  : 669247
# write unlock: 669247 (0)
# wait time for write : 252.265 msec
# wait time for write/lock: 376.939 nsec
# read lock   

Re: Idle HAProxy 1.8 spins at 100% in user space

2018-03-12 Thread Krishna Kumar (Engineering)
Hi Cyril,

Thanks, this patch fixes it; CPU is now back to 0%. I confirmed it a few
times: undoing the patch takes it back to 100%, and re-adding the patch
brings it back to 0%. It fixes the problem perfectly.

Thanks,
- Krishna


On Mon, Mar 12, 2018 at 5:23 PM, Willy Tarreau  wrote:

> On Mon, Mar 12, 2018 at 12:36:05PM +0100, Cyril Bonté wrote:
> > I confirm I can reproduce the issue once 32 (and more) threads are used
> : the main process enters an endless loop.
> > I think the same issue may occur with nbproc on FreeBSD (the same code
> in an #ifdef FreeBSD__).
> >
> > Can you try the patch attached ? I'll send a clean one later.
>
> Ah good catch, I'm pretty sure you nailed it down indeed! The fun thing
> is that the initial purpose of that patch was precisely to avoid this
> kind of annoying stuff in the first place!
>
> Cheers,
> Willy
>


Idle HAProxy 1.8 spins at 100% in user space

2018-03-12 Thread Krishna Kumar (Engineering)
As an aside, could someone also post a simple configuration file to
enable 40 listeners (threads)?

I get 100% cpu utilisation when running a high number of threads (>30, on
a 48-core system). I have tried both these versions:

HA-Proxy version 1.8.4-1ppa1~xenial 2018/02/10: Installed via .deb file
HA-Proxy version 1.8.4-1deb90d 2018/02/08: Built from source
 http://www.haproxy.org/download/1.8/src/haproxy-1.8.4.tar.gz


1. Distro/kernel: Ubuntu 16.04.1 LTS, 4.4.0-36-generic

2. Top:
# top -d 1 -b  | head -12
top - 11:59:06 up 4 days, 41 min,  1 user,  load average: 1.00, 1.00, 2.14
Tasks: 492 total,   2 running, 464 sleeping,   0 stopped,  26 zombie
%Cpu(s):  0.5 us,  0.2 sy,  0.0 ni, 99.2 id,  0.0 wa,  0.0 hi,  0.1 si,
0.0 st
KiB Mem : 13191999+total, 9520 free,  1222684 used, 52917792 buff/cache
KiB Swap:0 total,0 free,0 used. 12986652+avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 87994 haproxy   20   0  896624  14996   1468 R 100.0  0.0   3:09.60 haproxy
 1 root  20   0   38856   7088   4132 S   0.0  0.0   0:08.69 systemd
 2 root  20   0   0  0  0 S   0.0  0.0   0:00.08
kthreadd
 3 root  20   0   0  0  0 S   0.0  0.0   4:05.79
ksoftirqd/0
 5 root   0 -20   0  0  0 S   0.0  0.0   0:00.00
kworker/0:0H

3.  As to what it is doing:
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,
0.0 st

4. Minimal configuration file to reproduce this (using this blog:
   https://www.haproxy.com/blog/multithreading-in-haproxy/):

global
daemon
nbproc 1
nbthread 40
cpu-map auto:1/1-40 0-39

frontend test-fe
mode http
bind 10.33.110.118:80 process all/all
use_backend test-be

backend test-be
mode http
server 10.33.5.62 10.33.5.62:80 weight 255

5. Problem disappears when "cpu-map auto:1/1-40 0-39" is commented out.
Same strace output, so it is in user space as shown by 'top' above.

6. Version/build (gcc version 5.4.0 20160609 (Ubuntu
5.4.0-6ubuntu1~16.04.2))

# haproxy -vv
HA-Proxy version 1.8.4-1deb90d 2018/02/08
Copyright 2000-2018 Willy Tarreau 

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O2 -g -fno-strict-aliasing -Wdeclaration-after-statement
-fwrapv -Wno-unused-label
  OPTIONS = USE_ZLIB=yes USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Built with OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
Running on OpenSSL version : OpenSSL 1.0.2g  1 Mar 2016
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports : TLSv1.0 TLSv1.1 TLSv1.2
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND
Encrypted password support via crypt(3): yes
Built with multi-threading support.
Built with PCRE version : 8.38 2015-11-23
Running on PCRE version : 8.38 2015-11-23
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with zlib version : 1.2.8
Running on zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with network namespace support.

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Available filters :
[SPOE] spoe
[COMP] compression
[TRACE] trace

7. Strace of the process:
88033 11:57:18.946030 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001144>
88032 11:57:18.946046 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001149>
88033 11:57:18.946078 epoll_wait(47, <unfinished ...>
88034 11:57:18.946092 epoll_wait(48, <unfinished ...>
88032 11:57:18.946104 epoll_wait(46, <unfinished ...>
88031 11:57:18.946115 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001153>
88030 11:57:18.946128 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001154>
88029 11:57:18.946140 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001155>
88028 11:57:18.946152 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001216>
88031 11:57:18.946169 epoll_wait(44, <unfinished ...>
88027 11:57:18.946181 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001183>
88030 11:57:18.946196 epoll_wait(43, <unfinished ...>
88029 11:57:18.946208 epoll_wait(40, <unfinished ...>
88028 11:57:18.946219 epoll_wait(39, <unfinished ...>
88027 11:57:18.946231 epoll_wait(38, <unfinished ...>
88026 11:57:18.946244 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001296>
88025 11:57:18.946257 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001248>
88024 11:57:18.946269 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001226>
88023 11:57:18.946282 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001210>
88026 11:57:18.946293 epoll_wait(37, <unfinished ...>
88022 11:57:18.946307 <... epoll_wait resumed> [], 200, 1000) = 0 <1.001224>
88025 11:57:18.946320 epoll_wait(36, <unfinished ...>

Re: [Working update]: Request rate limiting on the backend section

2017-11-08 Thread Krishna Kumar (Engineering)
To remove the reported "margin of error", the config needed
a fix:

   acl within_limit sc2_gpc0_rate() lt 1000

since the first request is counted at rate==0 and the last one at 999.


On Wed, Nov 8, 2017 at 3:11 PM, Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> I finally got the backend rate limiting working pretty well. Here is the
> configuration settings in case it helps anyone else do the same:
>
> frontend http-fe
> bind 
> default_backend http-be
>
> backend http-be
> http-request track-sc2 fe_id
> stick-table type integer size 1k expire 30s store
>   http_req_rate(1s),gpc0,gpc0_rate(1s)
> acl within_limit sc2_gpc0_rate() le 1000
> acl increment_gpc0 sc2_inc_gpc0 ge 0
> http-request allow if within_limit increment_gpc0
> http-request deny deny_status 429 if !within_limit
> server my-server 
>
> During the test, the stick table contents were:
> 0x16e593c: key=3 use=98 exp=2 gpc0=44622 gpc0_rate(1000)=1000
> http_req_rate(1000)=69326
>
> Test results:
> # wrk -t 40 -c 4000 -d 30s 
> RPS: 1003.05 (Total requests: 2031922 Good: 30192 Errors: 2001730 Time:
> 30.10)
>
> Margin of error: 0.3%
>
> Thanks,
> - Krishna
>
>
> On Wed, Nov 8, 2017 at 10:02 AM, Krishna Kumar (Engineering) <
> krishna...@flipkart.com> wrote:
>
>> On Tue, Nov 7, 2017 at 11:57 PM, Lukas Tribus <lu...@ltri.eu> wrote:
>>
>> Hi Lukas,
>>
>> > Yes, in 1.7 you can change server maxconn values in real time using
>> > the admin socket:
>> > https://cbonte.github.io/haproxy-dconv/1.7/management.html#9.3-set%20maxconn%20server
>>
>> Thanks, will take a look at if we can use this. The only issue is that we
>> want to
>> be able to change rps very often, and some backend sections contain upto
>> 500 servers (and much more during sales), and doing that on the fly at
>> high
>> frequency may not scale.
>>
>> > You are reluctant to elaborate on the bigger picture, so I guess
>> > generic advice is not what you are looking for. I just hope you are
>> > not trying to build some kind of distributed rate-limiting
>> > functionality with this.
>>
>> Sorry, not reluctance, I thought giving too much detail would put off
>> people
>> from taking a look :) So you are right, we are trying to build a
>> distributed rate
>> limiting feature, and the control plane is mostly ready (thanks to HAProxy
>> developers for making such a performant/configurable system). The service
>> monitors current http_request_rate and current RPS setting via uds every
>> second, and updates these values to a central repository (zookeeper), and
>> on demand, tries to increase capacity by requesting capacity from other
>> servers so as to keep capacity constant at the configured value (e.g. 1000
>> RPS). Is this something you would not recommend?
>>
>> > I don't have enough experience with stick-tables to comment on this
>> > generally, but I would suggest you upgrade to a current 1.7 release
>> > first of all and retry your tests. There are currently 223 bugs fixed
>> > in releases AFTER 1.6.3:
>> > http://www.haproxy.org/bugs/bugs-1.6.3.html
>>
>> Thanks, we are considering moving to this version.
>>
>> > Maybe someone more stick-table savvy can comment on your specific
>> question.
>>
>> If anyone else has done something similar, would really like to hear from
>> you on
>> how to control RPS in the backend.
>>
>> Regards,
>> - Krishna
>>
>>
>


[Working update]: Request rate limiting on the backend section

2017-11-08 Thread Krishna Kumar (Engineering)
I finally got the backend rate limiting working pretty well. Here is the
configuration settings in case it helps anyone else do the same:

frontend http-fe
bind 
default_backend http-be

backend http-be
http-request track-sc2 fe_id
stick-table type integer size 1k expire 30s store http_req_rate(1s),gpc0,gpc0_rate(1s)
acl within_limit sc2_gpc0_rate() le 1000
acl increment_gpc0 sc2_inc_gpc0 ge 0
http-request allow if within_limit increment_gpc0
http-request deny deny_status 429 if !within_limit
server my-server 

During the test, the stick table contents were:
0x16e593c: key=3 use=98 exp=2 gpc0=44622 gpc0_rate(1000)=1000
http_req_rate(1000)=69326
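(The dump above comes from the admin socket, along these lines; the
socket path is an assumption:)

    echo "show table http-be" | socat stdio /var/run/haproxy.sock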

Test results:
# wrk -t 40 -c 4000 -d 30s 
RPS: 1003.05 (Total requests: 2031922 Good: 30192 Errors: 2001730 Time:
30.10)

Margin of error: 0.3%

Thanks,
- Krishna


On Wed, Nov 8, 2017 at 10:02 AM, Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> On Tue, Nov 7, 2017 at 11:57 PM, Lukas Tribus <lu...@ltri.eu> wrote:
>
> Hi Lukas,
>
> > Yes, in 1.7 you can change server maxconn values in real time using
> > the admin socket:
> > https://cbonte.github.io/haproxy-dconv/1.7/management.html#9.3-set%20maxconn%20server
>
> Thanks, will take a look at if we can use this. The only issue is that we
> want to
> be able to change rps very often, and some backend sections contain upto
> 500 servers (and much more during sales), and doing that on the fly at high
> frequency may not scale.
>
> > You are reluctant to elaborate on the bigger picture, so I guess
> > generic advice is not what you are looking for. I just hope you are
> > not trying to build some kind of distributed rate-limiting
> > functionality with this.
>
> Sorry, not reluctance, I thought giving too much detail would put off
> people
> from taking a look :) So you are right, we are trying to build a
> distributed rate
> limiting feature, and the control plane is mostly ready (thanks to HAProxy
> developers for making such a performant/configurable system). The service
> monitors current http_request_rate and current RPS setting via uds every
> second, and updates these values to a central repository (zookeeper), and
> on demand, tries to increase capacity by requesting capacity from other
> servers so as to keep capacity constant at the configured value (e.g. 1000
> RPS). Is this something you would not recommend?
>
> > I don't have enough experience with stick-tables to comment on this
> > generally, but I would suggest you upgrade to a current 1.7 release
> > first of all and retry your tests. There are currently 223 bugs fixed
> > in releases AFTER 1.6.3:
> > http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> Thanks, we are considering moving to this version.
>
> > Maybe someone more stick-table savvy can comment on your specific
> question.
>
> If anyone else has done something similar, would really like to hear from
> you on
> how to control RPS in the backend.
>
> Regards,
> - Krishna
>
>


Re: Request rate limiting on the backend section

2017-11-07 Thread Krishna Kumar (Engineering)
On Tue, Nov 7, 2017 at 11:57 PM, Lukas Tribus  wrote:

Hi Lukas,

> Yes, in 1.7 you can change server maxconn values in real time using
> the admin socket:
> https://cbonte.github.io/haproxy-dconv/1.7/management.html#9.3-set%20maxconn%20server

Thanks, will take a look at whether we can use this. The only issue is
that we want to be able to change RPS very often, and some backend
sections contain up to 500 servers (and many more during sales); doing
that on the fly at high frequency may not scale.
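For reference, the per-server runtime command from that doc section looks
like the following (backend/server names and socket path are
placeholders); one such call would be needed per server per update:

    echo "set maxconn server http-be/my-server 100" | socat stdio /var/run/haproxy.sock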

> You are reluctant to elaborate on the bigger picture, so I guess
> generic advice is not what you are looking for. I just hope you are
> not trying to build some kind of distributed rate-limiting
> functionality with this.

Sorry, not reluctance; I thought giving too much detail would put people
off taking a look :) So you are right, we are trying to build a
distributed rate limiting feature, and the control plane is mostly ready
(thanks to HAProxy developers for making such a performant/configurable
system). The service monitors the current http_request_rate and the
current RPS setting via uds every second, updates these values to a
central repository (zookeeper), and on demand tries to increase capacity
by requesting it from other servers, so as to keep total capacity
constant at the configured value (e.g. 1000 RPS). Is this something you
would not recommend?

> I don't have enough experience with stick-tables to comment on this
> generally, but I would suggest you upgrade to a current 1.7 release
> first of all and retry your tests. There are currently 223 bugs fixed
> in releases AFTER 1.6.3:
> http://www.haproxy.org/bugs/bugs-1.6.3.html

Thanks, we are considering moving to this version.

> Maybe someone more stick-table savvy can comment on your specific
question.

If anyone else has done something similar, I would really like to hear
from you on how to control RPS in the backend.

Regards,
- Krishna


Re: Request rate limiting on the backend section

2017-11-07 Thread Krishna Kumar (Engineering)
Hi Lukas,

On Tue, Nov 7, 2017 at 6:46 PM, Lukas Tribus  wrote:

> I'd suggest to use maxconn. This limits the amount of connections opened
> to a single server, and is therefor equivalent to in-flight requests.
That's is
> a more appropriate limit than RPS because it doesn't matter if the
responses
> take a long time to compute or not.

Thanks for your suggestion. Unfortunately, per-server maxconn may be
unsuitable for our particular case for 2 reasons:

1. We want to modify the limit dynamically at high frequency, e.g. every
second. The maxconn setting on a server does not seem to be modifiable at
present without a reload. We are using haproxy-1.6.3; is this feature
present in a newer release?

2. We have haproxy running on *many* servers, even for a single service
(many configuration files also use nbproc). It is easier to give an
RPS=1000 for the entire service than to break it up per process and per
server, which may not be possible at the rate we plan.

Is there any way this can be done in the backend using stick-tables? I was
wondering if there was an elementary mistake in the configuration options
being used.

Regards,
- Krishna


Request rate limiting on the backend section

2017-11-07 Thread Krishna Kumar (Engineering)
Hi all,

I am trying to implement request rate limiting to protect our servers from
too many requests. While we were able to get this working correctly in the
frontend section, we need to implement the same in the backend section due
to the configuration we use in our data center for different services.

This is the current configuration that I tried, to get 1000 RPS for the
entire backend section:

backend HTTP-be
http-request track-sc2 fe_id
stick-table type integer size 1m expire 60s store http_req_rate(100ms),gpc0,gpc0_rate(100ms)
acl mark_seen sc2_inc_gpc0 ge 0
acl outside_limit sc2_gpc0_rate() gt 100
http-request deny deny_status 429 if mark_seen outside_limit
server my-server :80

But running "wrk -t 1 -c 1 " gives:
 RPS: 19.60 (Total requests: 18669 Good: 100 Errors: 18569)
with following haproxy metrics:
 0x2765d1c: key=3 use=1 exp=6 gpc0=7270 gpc0_rate(100)=364
http_req_rate(100)=364

and "wrk -t 20 -c 1000 " gives:
RPS: 6.62 (Total requests: 1100022 Good: 100 Errors: 1099922)
with following haproxy metrics:
0x2765d1c: key=3 use=94 exp=5 gpc0=228218 gpc0_rate(100)=7229
http_req_rate(100)=7229

As seen above, only a total of 100 requests succeeded, not 100 * 10 * 20 =
20K (for the 20-second duration). gpc0_rate does not *seem* to get reset
to zero at the configured interval (100ms), maybe due to a mistake in the
configuration settings above.

Another setting that we tried was:

backend HTTP-be
acl too_fast be_sess_rate gt 1000
tcp-request inspect-delay 1s
tcp-request content accept if ! too_fast
tcp-request content accept if WAIT_END
server my-server :80

While this behaved much better than earlier, I still get a lot of
discrepancy based
on the traffic volume:
wrk -t 40 -c 200 -d 15s 
RPS: 841.14 (Total requests: 12634 Good: 12634 Errors: 0 Time: 15.02)

wrk -t 40 -c 400 -d 15s 
RPS: 1063.78 (Total requests: 15978 Good: 15978 Errors: 0 Time: 15.02)

wrk -t 40 -c 800 -d 15s 
RPS: 1123.03 (Total requests: 16868 Good: 16868 Errors: 0 Time: 15.02)

wrk -t 40 -c 1600 -d 15s 
RPS: 1382.98 (Total requests: 20883 Good: 20883 Errors: 0 Time: 15.10)

The last one is off by 38%.

Could someone help on what I am doing wrong, or how to achieve this? I
prefer using the first approach with the stick table if possible, as it
provides finer time granularity.

Thanks for any help.

Regards,
- Krishna


Re: Throughput issue after moving between kernels.

2017-11-03 Thread Krishna Kumar (Engineering)
Though it would not cause your problem, the reason for this is:

In 3.10.18:
https://elixir.free-electrons.com/linux/v3.10.18/source/net/ipv4/tcp.c

void tcp_init_mem(struct net *net)
{
	unsigned long limit = nr_free_buffer_pages() / 8;

	limit = max(limit, 128UL);
	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
	net->ipv4.sysctl_tcp_mem[1] = limit;
	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
}

In 4.4.49:
https://elixir.free-electrons.com/linux/v4.4.49/source/net/ipv4/tcp.c

static void __init tcp_init_mem(void)
{
	unsigned long limit = nr_free_buffer_pages() / 16;

	limit = max(limit, 128UL);
	sysctl_tcp_mem[0] = limit / 4 * 3;		/* 4.68 % */
	sysctl_tcp_mem[1] = limit;			/* 6.25 % */
	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;	/* 9.37 % */
}

As you can see, the limit is halved in 4.4.49. The relevant commit:

commit b66e91ccbc34ebd5a2f90f9e1bc1597e2924a500
Author: Eric Dumazet 
Date:   Fri May 15 12:39:30 2015 -0700

tcp: halves tcp_mem[] limits

Allowing tcp to use ~19% of physical memory is way too much,
and allowed bugs to be hidden. Add to this that some drivers use a full
page per incoming frame, so real cost can be twice the advertized one.

Reduce tcp_mem by 50 % as a first step to sanity.

tcp_mem[0,1,2] defaults are now 4.68%, 6.25%, 9.37% of physical memory.
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
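
If the halved defaults are a problem after the upgrade, they can be pinned
explicitly at boot; a sketch using the 3.10.18 values reported for this
box elsewhere in the thread (adjust for your RAM):

 sysctl -w net.ipv4.tcp_mem="89544 119392 179088"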


Regards,
- Krishna


On Fri, Nov 3, 2017 at 6:19 PM, Mark Brookes  wrote:

> Hi All,
>
> We have been investigating an issue with reduced throughput (it's quite
> possible that it's nothing to do with HAProxy). I thought I would just
> check here to see if this rings a bell with anyone.
>
> We are currently looking to update our kernel from 3.10.18 to 4.4.49. It
> appears that in the move from 3.x.x to 4.x.x, at some point the kernel
> devs changed the tcp_mem calculation, which results in halving the values
> based on the same amount of RAM. Although that isn't the problem, it just
> highlighted it.
>
> Our test setup is
>
> Multiple Clients --> HAProxy --> Real Server.
>
> If I run a fairly heavy load using iperf through haproxy on the 3.10.18
> kernel and check:
>
>
> cat /proc/net/sockstat
> sockets: used 193
> TCP: inuse 116 orphan 0 tw 17 alloc 118 mem 25591
> UDP: inuse 12 mem 3
> UDPLITE: inuse 0
> RAW: inuse 1
> FRAG: inuse 0 memory 0
>
> cat /proc/sys/net/ipv4/tcp_mem
> 89544 119392 179088
>
> When I reboot into the 4.4.49 kernel and run the same test I get -
>
> cat /proc/net/sockstat
> sockets: used 198
> TCP: inuse 115 orphan 0 tw 18 alloc 117 mem 43957
> UDP: inuse 12 mem 2
> UDPLITE: inuse 0
> RAW: inuse 1
> FRAG: inuse 0 memory 0
>
> cat /proc/sys/net/ipv4/tcp_mem
> 44721 59631 89442
>
> Haproxy --
> Build options :
>   TARGET  = linux2628
>   CPU = generic
>   CC  = gcc
>   CFLAGS  = -m64 -march=x86-64 -O2 -g -fno-strict-aliasing
> -Wdeclaration-after-statement -fwrapv
>   OPTIONS = USE_ZLIB=1 USE_REGPARM=1 USE_OPENSSL=1 USE_STATIC_PCRE=1
>
> Default settings :
>   maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200
>
> Encrypted password support via 

Re: Long Running TCP Connections and Reloads

2017-09-14 Thread Krishna Kumar (Engineering)
Regarding #1, I think this was fixed some time back. Maybe you are running
an old version of haproxy?

commit e39683c4d4c527d1b561c3ba3983d26cc3e7f42d
Author: Hongbo Long 
Date:   Fri Mar 10 18:41:51 2017 +0100

BUG/MEDIUM: stream: fix client-fin/server-fin handling

A tcp half connection can cause 100% CPU on expiration.



On Thu, Sep 14, 2017 at 6:59 PM, Pean, David S. 
wrote:

> Hello!
>
> I am using a TCP front-end that potentially keeps connections open for
> several hours, while also frequently issuing reloads due to an id to server
> mapping that is changing constantly. This causes many processes to be
> running at any given time, which generally works as expected. However,
> after some time I see some strange behavior with the processes and stats
> that doesn’t appear to have any pattern to it.
>
> Here is the setup in general:
>
> Every two minutes, there is a process that checks if HAProxy should be
> reloaded. If that is the case, this command is run:
>
> /usr/local/sbin/haproxy -D -f -sf PID
>
> The PID is the current HAProxy process. If there are TCP connections to
> that process, it will stay running until those connection drop, then
> generally it will get killed.
>
> 1. Sometimes a process will appear to not get killed, and have no
> connections. It will be running for several hours at 99% CPU. When
> straced, it doesn't appear to be actually doing anything -- just clock and
> polls very frequently. Is there some sort of timeout for the graceful
> shutdown of the old processes?
>
> 2. Is it possible for the old processes to accept new connections? Even
> though a pid has been sent the shutdown signal, I have seen requests
> reference old server mappings that would have been in an earlier process.
>
> 3. Often the stats page will become out of whack over time. The number of
> requests per second will become drastically different from what is actually
> occurring. It looks like the old stuck processes might be sending more data
> that is maybe not getting cleared?
>
> Are there any considerations for starting up or reloading when dealing
> with long running connections?
>
> Thanks!
>
> David Pean
>
>
>


Using http_proxy in HAProxy 1.6.3

2017-07-16 Thread Krishna Kumar (Engineering)
I have configured a backend as follows:

backend be-testing
    mode http
    option httpclose
    option http_proxy

and hit this using Google's IP as:
 wget  --header="Host: http://216.58.197.46:80"

but this fails with 503 error, debug shows:
0003:fp-testing.clireq[0009:]: GET / HTTP/1.1
0003:fp-testing.clihdr[0009:]: User-Agent: Wget/1.16 (linux-gnu)
0003:fp-testing.clihdr[0009:]: Accept: */*
0003:fp-testing.clihdr[0009:]: Host: http://216.58.197.46:80
0003:fp-testing.clihdr[0009:]: Connection: Keep-Alive

What am I doing wrong? Is it also possible to use a DNS name with a newer
version of HAProxy?
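
For reference, the documentation says "option http_proxy" expects the
destination as an absolute URI in the request line, not in the Host
header, and that no DNS resolution is performed (only IP addresses work).
A hypothetical invocation, assuming haproxy's frontend listens on
127.0.0.1:8080, would be:

 http_proxy=http://127.0.0.1:8080 wget http://216.58.197.46:80/

which makes wget send "GET http://216.58.197.46:80/ HTTP/1.1" to haproxy.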

Thanks,
- Krishna


Re: HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your response/debug details.

> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?

You are right! We have this: "timeout client-fin 3ms"

> So at least you have 3 times 196 bugs in production :-)

And many 'x' times that, we have *lots* of servers to handle the Flipkart
traffic. Thanks for pointing out this information.

So we will upgrade after internal processes are sorted out. Thanks once
again for this quick information on the source of the problem.

Regards,
- Krishna


On Fri, May 19, 2017 at 10:34 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Fri, May 19, 2017 at 09:47:52AM +0530, Krishna Kumar (Engineering)
> wrote:
> > I saw many similar issues posted earlier by others, but could not find a
> > thread
> > where this is resolved or fixed in a newer release. We are using Ubuntu
> > 16.04
> > with distro HAProxy (1.6.3), and see that HAProxy spins at 100% with 1-10
> > TCP
> > connections, sometimes just 1 - a stale connection that does not seem to
> > belong to any frontend session. Strace with -T shows the following:
>
> In fact a few bugs have caused this situation and all known ones were
> fixed, which doesn't mean there is none left of course. However your
> version is totally outdated and contains tons of known bugs which were
> later fixed (196 total, 22 major, 78 medium, 96 minor) :
>
>http://www.haproxy.org/bugs/bugs-1.6.3.html
>
> > The single connection has this session information:
> > 0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
> > source=a.a.a.a:35297
> >   flags=0x1ce, conn_retries=0, srv_conn=0xca4000, pend_pos=(nil)
> >   frontend=fe-fe-fe-fe-fe-fe (id=3 mode=tcp), listener=? (id=1)
> > addr=b.b.b.b:5667
> >   backend=be-be-be-be-be-be (id=4 mode=tcp) addr=c.c.c.c:11870
> >   server=d.d.d.d (id=4) addr=d.d.d.d:5667
> >   task=0xd1d710 (state=0x04 nice=0 calls=1117789229 exp=, running
> > age=12d11h)
> >   si[0]=0xd1d988 (state=CLO flags=0x00 endp0=CONN:0xd771c0 exp=,
> > et=0x000)
> >   si[1]=0xd1d9a8 (state=EST flags=0x10 endp1=CONN:0xccadb0 exp=,
> > et=0x000)
> >   co0=0xd771c0 ctrl=NONE xprt=NONE data=STRM target=LISTENER:0xc76ae0
> >   flags=0x002f9000 fd=55 fd.state=00 fd.cache=0 updt=0
> >   co1=0xccadb0 ctrl=tcpv4 xprt=RAW data=STRM target=SERVER:0xca4000
> >   flags=0x0020b310 fd=9 fd_spec_e=22 fd_spec_p=0 updt=0
> >   req=0xd1d7a0 (f=0x80a020 an=0x0 pipe=0 tofwd=-1 total=0)
> >   an_exp= rex=? wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 req.next=0 i=0 size=0
> >   res=0xd1d7e0 (f=0x8000a020 an=0x0 pipe=0 tofwd=0 total=0)
> >   an_exp= rex= wex=
> >   buf=0x6e9120 data=0x6e9134 o=0 p=0 rsp.next=0 i=0 size=0
>
>
> That's quite useful, thanks!
>
>  - connection with client is closed
>  - connection with server is still established and theorically stopped from
>polling
>  - the request channel is closed in both directions
>  - the response channel is closed in both directions
>  - both buffers are empty
>
> It seems that something is preventing the connection close from being
> considered, while the task is woken up on a timeout and on I/O. This
> exactly reminds me of the client-fin/server-fin bug in fact. Do you
> have any of these timeouts in your config ?
>
> I'm also noticing that the session is aged 12.5 days. So either it has
> been looping for this long (after all the function has been called 1
> billion times), or it was a long session which recently timed out.
>
> > We have 3 systems running the identical configuration and haproxy binary,
>
> So at least you have 3 times 196 bugs in production :-)
>
> > and
> > the 100% cpu is ongoing for the last 17 days on one system. The client
> > connection is no longer present. I am assuming that a haproxy reload
> would
> > solve this as the frontend connection is not present, but have not tested
> > it out yet. Since this box is in production, I am unable to do invasive
> > debugging
> > (e.g. gdb).
>
> For sure. At least an upgrade to 1.6.12 would get rid of most of these
> known bugs. You could perform a rolling upgrade, starting with the machine
> having been in that situation for the longest time.
>
> > Please let me know if this is fixed in a later release, or any more
> > information that
> > can help find the root cause.
>
> For me everything here looks like the client-fin/server-fin bug that was
> fixed two months ago, so if you're using this it's very likely fixed. If
> not, there's still a small probability that the fixes made to better
> deal with wakeup events in the case of the server-fin bug could have
> addressed a wider class of bugs : often we find one way to enter a
> certain bogus condition and hardly imagine all other possibilities.
>
> Regards,
> Willy
>


HAProxy 1.6.3: 100% cpu utilization for >17 days with 1 connection

2017-05-18 Thread Krishna Kumar (Engineering)
Hi,

First of all, thanks for a great product that is working extremely well for
Flipkart!

I saw many similar issues posted earlier by others, but could not find a
thread where this is resolved or fixed in a newer release. We are using
Ubuntu 16.04 with distro HAProxy (1.6.3), and see that HAProxy spins at
100% with 1-10 TCP connections, sometimes just 1 - a stale connection that
does not seem to belong to any frontend session. Strace with -T shows the
following:

epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.20>
epoll_wait(0, [], 200, 0)   = 0 <0.09>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN|EPOLLHUP|EPOLLRDHUP, {u32=2, u64=2}}], 200, 0) = 1
<0.06>
epoll_wait(0, [{EPOLLIN, {u32=11, u64=11}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.29>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.21>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.11>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [{EPOLLIN, {u32=7, u64=7}}], 200, 0) = 1 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [{EPOLLOUT, {u32=2, u64=2}}], 200, 0) = 1 <0.15>
epoll_wait(0, [], 200, 0)   = 0 <0.07>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.08>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.17>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [{EPOLLIN, {u32=10, u64=10}}], 200, 0) = 1 <0.09>
epoll_wait(0, [{EPOLLIN|EPOLLRDHUP, {u32=10, u64=10}}], 200, 0) = 1
<0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.16>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.06>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.05>
epoll_wait(0, [], 200, 0)   = 0 <0.17>

The single connection has this session information:
0xd1d790: [06/May/2017:02:44:37.373636] id=286529830 proto=tcpv4
source=a.a.a.a:35297
  flags=0x1ce, conn_retries=0, srv_conn=0xca4000, 

Re: Restricting RPS to a service

2017-04-19 Thread Krishna Kumar (Engineering)
Hi Holgar,

Thanks once again. However, I understand that a session means the same as
a connection. The rate-limit documentation confirms that: "When the
frontend reaches the specified number of new sessions per second, it stops
accepting *new connections* until the rate drops below the limit again".

As you say, maxconn defines the maximum number of established sessions,
but rate-limit tells how much of that can come per second. I really want
existing or new sessions to not exceed a particular RPS.

Regards,
- Krishna


On Wed, Apr 19, 2017 at 9:16 PM, Holger Just <hapr...@meine-er.de> wrote:

> Hi Krishna,
>
> Krishna Kumar (Engineering) wrote:
> > Thanks for your response. However, I want to restrict the requests
> > per second either at the frontend or backend, not session rate. I
> > may have only 10 connections from clients, but the backends can
> > handle only 100 RPS. How do I deny or delay requests when they
> > cross a limit?
>
> A "session" is this context is equivalent to a request-response pair. It
> is not connected to a session of your applciation which might be
> represented by a session cookie.
>
> As such, to restrict the number of requests per second for a frontend,
> rate-limit sessions is exactly the option you are looking for. It does
> not limit the concurrent number of sessions (as maxconn would do) but
> the rate with which the requests are comming in.
>
> If there are more requests per second than the configured value, haproxy
> waits until the session rate drops below the configured value. Once the
> socket's backlog backlog is full, requests will not be accepted by the
> kernel anymore until it clears.
>
> If you want to deny requsts with a custom http error instead, you could
> use a custom `http-request deny` rule and use the fe_sess_rate or
> be_sess_rate values.
>
> Cheers,
> Holger
>
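
Putting both of Holger's suggestions into one place, a minimal sketch
(threshold and names are illustrative):

frontend fe-http
    mode http
    bind :8080
    # queue new connections in the kernel once 1000 sessions/s is reached
    rate-limit sessions 1000
    # or, alternatively, answer with an error instead of queueing:
    http-request deny deny_status 429 if { fe_sess_rate gt 1000 }
    default_backend HTTP-be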


Restricting RPS to a service

2017-04-19 Thread Krishna Kumar (Engineering)
Hi Willy, others,

I have seen documents that describe how to rate limit from a single
client. What is the way to rate limit the entire service, without caring
about which client is hitting it? Something like "All RPS should be <
1000/sec"?

Thanks,
- Krishna


Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-15 Thread Krishna Kumar (Engineering)
Hi Willy,

I am facing one problem with using a system port range.

Distro: Ubuntu 16.04.1, kernel: 4.4.0-53-generic

When I set the range to 50000-50999, the kernel allocates ports in the
range 50000 to 50499; the remaining 500 ports do not seem to ever get
allocated despite running a few thousand connections in parallel. A simple
test program that I wrote, which does a bind to an IP and then connects,
uses all 1000 ports. Quickly checking the tcp code, I noticed that the
kernel tries to allocate an odd port for bind, leaving the even ports for
connect. Any idea why I don't get the full port range in bind? I am using
something like the following when specifying the server:
 server abcd google.com:80 source e1.e2.e3.e4
and with the following sysctl:
 sysctl -w net.ipv4.ip_local_port_range="50000 50999"
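
A minimal sketch of such a bind-then-connect test (all addresses are
placeholders; sockets are deliberately left open so each one keeps holding
the port it was given):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
	for (int i = 0; i < 1000; i++) {
		int fd = socket(AF_INET, SOCK_STREAM, 0);
		struct sockaddr_in src = { .sin_family = AF_INET };
		struct sockaddr_in dst = { .sin_family = AF_INET,
		                           .sin_port = htons(80) };
		socklen_t len = sizeof(src);

		inet_pton(AF_INET, "192.0.2.10", &src.sin_addr);  /* source IP */
		inet_pton(AF_INET, "203.0.113.1", &dst.sin_addr); /* server */
		if (fd < 0 || bind(fd, (struct sockaddr *)&src, sizeof(src)) < 0 ||
		    connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
			perror("bind/connect");
			close(fd);
			continue;
		}
		getsockname(fd, (struct sockaddr *)&src, &len);
		printf("got source port %u\n", ntohs(src.sin_port));
	}
	return 0;
}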

I hope it is OK to add an unrelated question relating to a feature to this
thread:

Is it possible to tell haproxy to use one backend for a request (GET) and,
if the response was 404 (Not Found), use another backend? This resource
may be present in the 2nd backend, but is there any way to try that upon
getting a 404 from the first?

Thanks,
- Krishna


On Thu, Mar 9, 2017 at 2:22 PM, Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Hi Willy,
>
> Excellent, I will try this idea, it should definitely help!
> Thanks for the explanations.
>
> Regards,
> - Krishna
>
>
> On Thu, Mar 9, 2017 at 1:37 PM, Willy Tarreau <w...@1wt.eu> wrote:
>
>> On Thu, Mar 09, 2017 at 12:50:16PM +0530, Krishna Kumar (Engineering)
>> wrote:
>> > 1. About 'retries', I am not sure if it works for connect() failing
>> > synchronously on the
>> > local system (as opposed to getting a timeout/refused via callback).
>>
>> Yes it normally does. I've been using it for the same purpose in certain
>> situations (eg: binding to a source port range while some daemons are
>> later bound into that range).
>>
>> > The
>> > document
>> > on retries says:
>> >
>> > "  is the number of times a connection attempt should be
>> retried
>> > on
>> >   a server when a connection either is refused or times
>> out. The
>> >   default value is 3.
>> > "
>> >
>> > The two conditions above don't fall in our use case.
>>
>> It's still a refused connection :-)
>>
>> > The way I understood was that
>> > retries happen during the callback handler. Also I am not sure if
>> there is
>> > any way to circumvent the "1 second" gap for a retry.
>>
>> Hmmm I have to check. In fact when the LB algorithm is not deterministic
>> we immediately retry on another server. If we're supposed to end up only
>> on the same server we indeed apply the delay. But if it's a synchronous
>> error, I don't know. And I think it's one case (especially -EADDRNOTAVAIL)
>> where we should immediately retry.
>>
>> > 2. For nolinger, it was not recommended in the document,
>>
>> It's indeed strongly recommended against, mainly because we've started
>> to see it in configs copy-pasted from blogs without understanding the
>> impacts.
>>
>> > and also I wonder if any data
>> > loss can happen if the socket is not lingered for some time beyond the
>> FIN
>> > packet that
>> > the remote server sent for doing the close(), delayed data packets, etc.
>>
>> The data loss happens only with outgoing data, so for HTTP it's data
>> sent to the client which are at risk. Data coming from the server are
>> properly consumed. In fact, when you configure "http-server-close",
>> the nolinger is automatically enabled in your back so that haproxy
>> can close the server connection without accumulating time-waits.
>>
>> > 3. Ports: Actually each HAProxy process has 400 ports limitation to a
>> > single backend,
>> > and there are many haproxy processes on this and other servers. The
>> ports
>> > are split per
>> > process and per system. E.g. system1 has 'n' processes and each have a
>> > separate port
>> > range from each other, system2 has 'n' processes and a completely
>> different
>> > port range.
>> > For infra reasons, we are restricting the total port range. The unique
>> > ports for different
>> > haproxy processes running on same system is to avoid attempting to use
>> the
>> > same port
>> > (first port# in the range) by two processes and failing in connect, when
>> > attempting to
> connect to the same remote server. Hope I explained that clearly.

Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-09 Thread Krishna Kumar (Engineering)
Hi Willy,

Excellent, I will try this idea, it should definitely help!
Thanks for the explanations.

Regards,
- Krishna


On Thu, Mar 9, 2017 at 1:37 PM, Willy Tarreau <w...@1wt.eu> wrote:

> On Thu, Mar 09, 2017 at 12:50:16PM +0530, Krishna Kumar (Engineering)
> wrote:
> > 1. About 'retries', I am not sure if it works for connect() failing
> > synchronously on the
> > local system (as opposed to getting a timeout/refused via callback).
>
> Yes it normally does. I've been using it for the same purpose in certain
> situations (eg: binding to a source port range while some daemons are
> later bound into that range).
>
> > The
> > document
> > on retries says:
> >
> > "  is the number of times a connection attempt should be
> retried
> > on
> >   a server when a connection either is refused or times out.
> The
> >   default value is 3.
> > "
> >
> > The two conditions above don't fall in our use case.
>
> It's still a refused connection :-)
>
> > The way I understood was that
> > retries happen during the callback handler. Also I am not sure if there
> is
> > any way to circumvent the "1 second" gap for a retry.
>
> Hmmm I have to check. In fact when the LB algorithm is not deterministic
> we immediately retry on another server. If we're supposed to end up only
> on the same server we indeed apply the delay. But if it's a synchronous
> error, I don't know. And I think it's one case (especially -EADDRNOTAVAIL)
> where we should immediately retry.
>
> > 2. For nolinger, it was not recommended in the document,
>
> It's indeed strongly recommended against, mainly because we've started
> to see it in configs copy-pasted from blogs without understanding the
> impacts.
>
> > and also I wonder if any data
> > loss can happen if the socket is not lingered for some time beyond the
> FIN
> > packet that
> > the remote server sent for doing the close(), delayed data packets, etc.
>
> The data loss happens only with outgoing data, so for HTTP it's data
> sent to the client which are at risk. Data coming from the server are
> properly consumed. In fact, when you configure "http-server-close",
> the nolinger is automatically enabled in your back so that haproxy
> can close the server connection without accumulating time-waits.
>
> > 3. Ports: Actually each HAProxy process has 400 ports limitation to a
> > single backend,
> > and there are many haproxy processes on this and other servers. The ports
> > are split per
> > process and per system. E.g. system1 has 'n' processes and each have a
> > separate port
> > range from each other, system2 has 'n' processes and a completely
> different
> > port range.
> > For infra reasons, we are restricting the total port range. The unique
> > ports for different
> > haproxy processes running on same system is to avoid attempting to use
> the
> > same port
> > (first port# in the range) by two processes and failing in connect, when
> > attempting to
> > connect to the same remote server. Hope I explained that clearly.
>
> Yep I clearly see the use case. That's one of the rare cases where it's
> interesting to use SNAT between your haproxy nodes and the internet. This
> way you'll use a unified ports pool for all your nodes and will not have
> to reserve port ranges per system and per process. Each process will then
> share the system's local source ports, and each system will have a
> different
> address. Then the SNAT will convert these IP1..N:port1..N to the public IP
> address and an available port. This will offer you more flexibility to add
> or remove nodes/processes etc. Maybe your total traffic cannot pass through
> a single SNAT box though in which case I understand that you don't have
> much choice. However you could then at least not force each process' port
> range and instead fix the system's local port range so that you know that
> all processes of a single machine share a same port range. That's already
> better because you won't be forced to assign ports from unfinished
> connections.
>
> Willy
>


Re: [PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-08 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your comments.

1. About 'retries', I am not sure if it works for connect() failing
synchronously on the
local system (as opposed to getting a timeout/refused via callback). The
document
on retries says:

"  is the number of times a connection attempt should be retried
on
  a server when a connection either is refused or times out. The
  default value is 3.
"

The two conditions above don't fall in our use case. The way I understood
it was that retries happen during the callback handler. Also I am not sure
if there is any way to circumvent the "1 second" gap for a retry.

2. For nolinger, it was not recommended in the document, and also I wonder
if any data loss can happen if the socket is not lingered for some time
beyond the FIN packet that the remote server sent for doing the close(),
delayed data packets, etc.

3. Ports: Actually each HAProxy process has a 400-port limitation to a
single backend, and there are many haproxy processes on this and other
servers. The ports are split per process and per system. E.g. system1 has
'n' processes and each has a separate port range from the others; system2
has 'n' processes and a completely different port range. For infra
reasons, we are restricting the total port range. The unique ports for
different haproxy processes running on the same system are to avoid two
processes attempting to use the same port (first port# in the range) and
failing in connect when attempting to connect to the same remote server.
Hope I explained that clearly.

Thanks,
- Krishna


On Thu, Mar 9, 2017 at 12:19 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Thu, Mar 09, 2017 at 12:03:19PM +0530, Krishna Kumar (Engineering)
> wrote:
> > Hi Willy,
> >
> > We use HAProxy as a Forward Proxy (I know this is not the intended
> > application for HAProxy) to access outside world from within the DC, and
> > this requires setting a source port range for return traffic to reach the
> > correct
> > box from which a connection was established. On our production boxes, we
> > see around 500 "no free ports" errors per day, but this could increase to
> > about 120K errors during big sale events. The reason for this is due to
> > connect getting a EADDRNOTAVAIL error, since an earlier closed socket
> > may be in last-ack state, as it may take some time for the remote server
> to
> > send the final ack.
> >
> > The attached patch reduces the number of errors by attempting more ports,
> > if they are available.
> >
> > Please review, and let me know if this sounds reasonable to implement.
>
> Well, while the patch looks clean I'm really not convinced it's the correct
> approach. Normally you should simply be using the "retries" parameter to
> increase the amount of connect retries. There's nothing wrong with setting
> it to a really high value if needed. Doesn't it work in your case ?
>
> Also a few other points :
>   - when the remote server sends the FIN with the last segment, your
> connection ends up in CLOSE_WAIT state. Haproxy then closes as
> well, sending a FIN and your socket ends up in LAST_ACK waiting
> for the server to respond. You may instead ask haproxy to close
> with an RST by setting "option nolinger" in the backend. The port
> will then always be free locally. The side effect is that if the
> RST is lost, the SYN of a new outgoing connection may get an ACK
> instead of a SYN-ACK as a reply and will respond to it with an
> RST and try again. This will result in all connections working,
> some taking slightly longer a time (typically 1 second).
>
>   - 500 outgoing ports is a very low value. You should keep in mind
> that nowadays most servers use 60 seconds FIN_WAIT/TIME_WAIT
> delays (the remote server remains in FIN_WAIT1 while waiting for
> your ACK, then enters TIME_WAIT when receiving your FIN). So with
> only 500 ports, you can *safely* support only 500/60 = 8 connections
> per second. Fortunately in practice it doesn't work like this
> since most of the time connections are correctly closed. But if
> you start to enter big trouble, you need to understand that you
> can very quickly reach some limits. And 500 outgoing ports means
> you don't expect to support more than 500 concurrent conns per
> proxy, which seems quite low.
>
> Thus normally what you're experiencing should only be dealt with
> using configuration :
>   - increase retries setting
>   - possibly enable option nolinger (backend only, never on a frontend)
>   - try to increase the available source port ranges.
>
> Regards,
> Willy
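
Willy's three recommendations translate to something like the following
backend (addresses and ranges are placeholders):

backend be-forward
    # retry harder when a source port is still held by a dying connection
    retries 10
    # close server-side connections with an RST so the port frees at once
    option nolinger
    server out 203.0.113.1:80 source 192.0.2.10:50000-50999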
>


[PATCH] [MEDIUM] Improve "no free ports" error case

2017-03-08 Thread Krishna Kumar (Engineering)
Hi Willy,

We use HAProxy as a Forward Proxy (I know this is not the intended
application for HAProxy) to access outside world from within the DC, and
this requires setting a source port range for return traffic to reach the
correct
box from which a connection was established. On our production boxes, we
see around 500 "no free ports" errors per day, but this could increase to
about 120K errors during big sale events. The reason for this is that
connect gets an EADDRNOTAVAIL error, since an earlier closed socket
may be in last-ack state, as it may take some time for the remote server to
send the final ack.

The attached patch reduces the number of errors by attempting more ports,
if they are available.

Please review, and let me know if this sounds reasonable to implement.

Thanks,
- Krishna
From 2946ac0b9ba5567284d5364445ad1f9102365e38 Mon Sep 17 00:00:00 2001
From: Krishna Kumar 
Date: Thu, 9 Mar 2017 11:24:06 +0530
Subject: [PATCH] [MEDIUM] Improve "no free ports" error.

 When source IP and source port range are specified, sometimes HAProxy fails
 to connect to a server and prints "no free ports" message. This happens
 when HAProxy recently closed the socket with the same port#, but was not
 completely closed in the kernel. To fix this, attempt a few more connects
 with different port numbers based on the available ports in the port_range.

Following are some log lines with this patch, and when running out of ports:

Early Connect() failed for backend bk-noports: no free ports.
Connect(port=50116) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
Connect(port=50227) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 1)
Connect(port=50116) failed for backend bk-noports: no free ports.

When running #parallel wgets == #source-port-range, the following messages
were printed, but none of the connects failed (though it could fail if the
the ports were completely exhausted, e.g. max_attempts = 0):

Connect(port=50241) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
Connect(port=50226) failed for backend bk-noports, socket not fully closed, RETRYING... (max_attempts = 2)
---
 include/proto/port_range.h |  18 ++
 include/proto/proto_tcp.h  |  19 ++
 src/proto_tcp.c| 496 +++--
 3 files changed, 338 insertions(+), 195 deletions(-)

diff --git a/include/proto/port_range.h b/include/proto/port_range.h
index 8c63fac..7a64caa 100644
--- a/include/proto/port_range.h
+++ b/include/proto/port_range.h
@@ -55,6 +55,24 @@ static inline void port_range_release_port(struct port_range *range, int port)
 		range->put = 0;
 }
 
+/*
+ * Return the maximum number of ports that can be used to attempt a connect().
+ * This is to handle the following problem:
+ *	- haproxy closes a port (kernel does TCP close on socket).
+ *	- haproxy allocates the same port.
+ *	- haproxy attempts to connect, but fails as the kernel has not
+ *	  finished the close.
+ *
+ * To handle this, we attempt to connect() at most 'range->avail' times, as
+ * this guarantees a different free port#'s each time. Beyond 'avail', we
+ * recycle the same ports which is likely to fail again, and hence is not
+ * useful. The caller must ensure that range is not NULL.
+ */
+static inline int port_range_avail(struct port_range *range)
+{
+	return range->avail;
+}
+
 /* return a new initialized port range of N ports. The ports are not
  * filled in, it's up to the caller to do it.
  */
diff --git a/include/proto/proto_tcp.h b/include/proto/proto_tcp.h
index 13d7a78..05c2b0e 100644
--- a/include/proto/proto_tcp.h
+++ b/include/proto/proto_tcp.h
@@ -40,6 +40,25 @@ int tcp_drain(int fd);
 /* Export some samples. */
 int smp_fetch_src(const struct arg *args, struct sample *smp, const char *kw, void *private);
 
+
+/*
+ * The maximum number of attempts to try to bind to a free source port. This
+ * is required in case some other process has bound to the same IP/port#.
+ */
+#define MAX_BIND_ATTEMPTS		10
+
+/*
+ * The maximum number of attempts to try to connect to a server. This is
+ * required when haproxy configuration file contains directive to bind to a
+ * source IP and port range. In this case, haproxy selects a port that we
+ * think is free, bind to it (which works even if the socket was not fully
+ * closed due to SO_REUSEADDR), but fail in connect() as the socket tuple
+ * may not be fully closed in the kernel, e.g., it may be in LAST-ACK state.
+ * These retries are to try avoiding getting an EADDRNOTAVAIL error during a
+ * socket connect.
+ */
+#define MAX_CONNECT_ATTEMPTS		3
+
 #endif /* _PROTO_PROTO_TCP_H */
 
 /*
diff --git a/src/proto_tcp.c b/src/proto_tcp.c
index 4741651..a09fd66 100644
--- a/src/proto_tcp.c
+++ b/src/proto_tcp.c
@@ -244,6 +244,139 @@ static int create_server_socket(struct connection *conn)
 }
 
 /*
+ * This internal function should not be called directly. 

Re: RFC: HAProxy shared health-check for nbproc > 1

2017-02-14 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your comments, I did not realize that this was discussed earlier.

Let me go through your feedback and get back. Sorry that I am taking time
on this, but that is due to work-related reasons.

Regards
- Krishna


On Tue, Feb 14, 2017 at 2:44 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Tue, Feb 14, 2017 at 12:45:31PM +0530, Krishna Kumar (Engineering)
> wrote:
> > Hi Willy,
> >
> > Some time back, I had worked on making health checks being done by only
> > one HAProxy process, and to share this information on a UP/DOWN event to
> > other processes (tested for 64 processes). Before I finish it
> completely, I
> > wanted to check with you if this feature is useful. At that time, I was
> > able to
> > propagate the status to all processes on UP/DOWN, and state of the
> servers
> > on the other haproxy processes changed accordingly.
> >
> > The implementation was as follows:
> >
> > - For a backend section that requires shared health check (and which has
> >   nbproc>1), add a new option specifying that hc is "shared", with an
> >   argument which is a multicast address that is used to send/receive HC
> >   messages. Use different unique MC addresses for different backend
> >   sections.
> > - Process #0 becomes the Master process while others are Slaves for HC.
> > - Processes #1 to #n-1 listen on the MC address (all via the existing
> >   generic epoll API).
> > - When the Master finds that a server has gone UP or DOWN, it sends the
> >   information from "struct check", along with proxy-id and server-id, on
> >   the MC address.
> > - When Slaves receive this message, they find the correct server and
> >   update its notion of health (each Slave gets the proxy as argument via
> >   the "struct dgram_conn") whenever this file descriptor is ready for
> >   reading.
> >
> > There may be other issues with this approach, including what happens
> > during reload (not tested yet), support for non-epoll, or if process #0
> > gets killed, or if the MC message is "lost", etc. One option is to have
> > HCs done by slaves at a much lower frequency to validate things are
> > sane. CLI shows good HC values, but the GUI dashboard was showing a DOWN
> > server in GREEN color, and other minor things that were not fixed at
> > that time.
> >
> > Please let me know if this functionality/approach makes sense, and adds
> > value.
>
> It's interesting that you worked on this, this is among the things we have
> in the pipe as well.
>
> I have some comments, some of which overlap with what you already
> identified.
> The use of multicast can indeed be an issue during reloads, and even when
> dealing with multiple parallel instances of haproxy, requiring the ability
> to configure the multicast group. Another option which seems reasonable is
> to use pipes to communicate between processes (it can be socketpairs as
> well
> but pipes are even cheaper). And the nice thing is that you can then even
> have full-mesh communications for free thanks to inheritance of the FDs.
> Pipes do not provide atomicity in full-mesh however so you can end up with
> some processes writing partial messages, immediately followed by other
> partial messages. But with socketpairs and sendmsg() it's not an issue.
>
> Another point is the fact that only one process runs the checks. As you
> mentionned, there are some drawbacks. But there are even other ones, such
> as the impossibility for a "slave" process to decide to turn a server down
> or to switch to fastinter after an error on regular traffic when some
> options like "observe layer7 on-error shutdown-server" are enabled. In my
> opinion this is the biggest issue.
>
> However there is a solution to let every process update the state for all
> other processes. It's not much complicated. The principle is that before
> sending a health check, each process just has to verify if the age of the
> last check is still fresh or not, and to only run the check when it's not
> fresh anymore. This way, all processes still have their health check tasks
> but when it's their turn to run, most of them realize they don't need to
> start a check and can be rescheduled.
>
> We already gave some thoughts about this mechanism for use with the peers
> protocol so that multiple LB nodes can share their checks, so the principle
> with inter-process communications could very well be the same here.
>
> It's worth noting that with a basic synchronization (ie "here's my check
> result"), there will still be some occasio

RFC: HAProxy shared health-check for nbproc > 1

2017-02-13 Thread Krishna Kumar (Engineering)
Hi Willy,

Some time back, I had worked on making health checks being done by only
one HAProxy process, and to share this information on a UP/DOWN event to
other processes (tested for 64 processes). Before I finish it completely, I
wanted to check with you if this feature is useful. At that time, I was
able to propagate the status to all processes on UP/DOWN, and the state of
the servers on the other haproxy processes changed accordingly.

The implementation was as follows:

- For a backend section that requires shared health check (and which has
  nbproc>1), add a new option specifying that hc is "shared", with an
  argument which is a multicast address that is used to send/receive HC
  messages. Use different unique MC addresses for different backend
  sections.
- Process #0 becomes the Master process while others are Slaves for HC.
- Processes #1 to #n-1 listen on the MC address (all via the existing
  generic epoll API).
- When the Master finds that a server has gone UP or DOWN, it sends the
  information from "struct check", along with proxy-id and server-id, on
  the MC address.
- When Slaves receive this message, they find the correct server and
  update its notion of health (each Slave gets the proxy as argument via
  the "struct dgram_conn") whenever this file descriptor is ready for
  reading.

There may be other issues with this approach, including what happens
during reload (not tested yet), support for non-epoll, or if process #0
gets killed, or if the MC message is "lost", etc. One option is to have
HCs done by slaves at a much lower frequency to validate things are sane.
CLI shows good HC values, but the GUI dashboard was showing a DOWN server
in GREEN color, and other minor things that were not fixed at that time.

Please let me know if this functionality/approach makes sense, and adds
value.

Thanks,
- Krishna


Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2017-01-31 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your detailed mail. I will get back to you very soon on this.

Regards,
- Krishna


On Tue, Jan 31, 2017 at 12:54 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> back on earth ;-)
>
> On Tue, Jan 03, 2017 at 03:07:26PM +0530, Krishna Kumar (Engineering)
> wrote:
> > I explored your suggestion of "hard-coded periods", and have some
> > problems: code complexity seems to be very high at updates (as well
> > as retrievals possibly); and I may not be able to get accurate results.
> > E.g. I have data for 1, 4, 16 seconds; and at 18 seconds, a request is
> > made for retrieval of the last 16 seconds (or 1,4,16). At this time I
> have
> > values for last 18 seconds not 16 seconds. I explored using timers to
> > cascade (will not work as it may run into races with the setters, and
> > also adds too much overhead) vs doing this synchronously when the
> > event happens. Both are complicated and have the above issue of not
> > able to get accurate information depending on when the request is
> > made.
>
> The principle is quite simple but requires a small explanation of how our
> frequency counters manage to report such accurate values first.
>
> We're counting events that happen over a period, represented by a cross
> over the time line below :
>
> X X  XX  XX  X X X X   X
>   |> t
>
> In order to be able to measure an approximately accurate frequency without
> storing too much data, we'll report an average over a period. The problem
> is, if we only report a past average, we'll get slowly changing data and
> in particular it would not be usable to detect quick changes such as a high
> rate happening over the last second. Thus we always keep two measures, the
> one of the previous period, and the one of the current period.
>
> X X  XX  XX  X X X X   X
>   |-|--> t
>   <--- previous --->|<--- current --->
>count=6count=5
>
> And using this we implement a pseudo sliding window. With a true sliding
> window, you would count events between two points that are always exactly
> the same distance apart, which requires to memorize all of them. With
> this pseudo sliding window, instead, we make use of the fact that events
> follow an approximately homogenous distribution over each period and have
> approximately the same frequency over both periods. Thus the mechanism
> becomes more of a "fading" window in fact : we consider one part of the
> previous window that we sum with the current one. The closer the current
> date is from the end of the current window, the least we count on the
> previous window. See below, I've represented 5 different sliding windows
> with the ratio of the previous window that we consider as still valid
> marked with stars and the ratio of the old period at the end of the line :
>
> X X  XX  XX  X X X X   X
>   |-|> t
> |***--|  80%
>   |*|  60%
> |***--|  40%
>   |*|  20%
> |-|  0%
>
> Thus the measure exactly is current + (1-relative_position) * previous.
> Once the period is over, the "current" value overwrites the "previous"
> one, and is cleared again, waiting for new events. This is done during
> insertion and possibly during value retrieval if it's noticed that the
> period has switched since last time.
>
>
> In fact this mechanism can be repeated over multiple periods if needed,
> the principle must always be the same :
>
> X X  XX  XX  X X X X   X
>   |||||--> t
>   old  N-3 N-2   N-1current
>
>
> The "old" period is the one on which we apply the fading effect. The
> current period is the one being entirely accounted (since it's being
> filled as we're looking at it) so here you could always keep a few
> past values (rotating copies of "current") and keep the last one for
> the fade out. In the simplified case above we just don't have the
> N-1..N-M, we only have "current" and "old".
>
> Thus as you can see here, the "current" period is the only dynamic one,
> which is the exact sum of events over the 
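
In outline, the pseudo sliding window reads like this (a simplified sketch
of the idea, not haproxy's actual code; times are in milliseconds):

struct freq_ctr {
	unsigned int curr_tick; /* start of the current period */
	unsigned int curr_ctr;  /* events seen in the current period */
	unsigned int prev_ctr;  /* events seen in the previous period */
};

/* measure = current + (1 - relative_position) * previous */
static unsigned int read_freq(const struct freq_ctr *f,
                              unsigned int period, unsigned int now)
{
	unsigned int elapsed = now - f->curr_tick;

	if (elapsed >= period)      /* rotation pending: the previous */
		return f->curr_ctr; /* period no longer contributes  */
	return f->curr_ctr + f->prev_ctr * (period - elapsed) / period;
}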

Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2017-01-23 Thread Krishna Kumar (Engineering)
Hi Willy,

Sorry to bother you again, but a quick note in case you have
forgotten this patch/email-thread.

Regards,
- Krishna


On Thu, Jan 5, 2017 at 12:53 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Thu, Jan 05, 2017 at 11:15:46AM +0530, Krishna Kumar (Engineering)
> wrote:
> > Hi Willy,
> >
> > If required, I can try to make the "hard-coded periods" changes too, but
> > want
> > to hear your opinion as the code gets very complicated, and IMHO, may not
> > give correct results depending on when the request is made. All the other
> > changes are doable.
> >
> > Hoping to hear from you on this topic, please let me know your opinion.
>
> I've started to think about it but had to stop, I'm just busy dealing with
> some painful bugs so it takes me more time to review code additions. I'm
> intentionally keeping your mail marked unread in order to get back to it
> ASAP.
>
> Thanks,
> Willy
>


Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2017-01-04 Thread Krishna Kumar (Engineering)
Hi Willy,

If required, I can try to make the "hard-coded periods" changes too, but I
want to hear your opinion, as the code gets very complicated and, IMHO,
may not give correct results depending on when the request is made. All
the other changes are doable.

Hoping to hear from you on this topic, please let me know your opinion.

Regards,
- Krishna


On Tue, Jan 3, 2017 at 3:07 PM, Krishna Kumar (Engineering) <
krishna...@flipkart.com> wrote:

> Hi Willy,
>
> Sorry for the late response as I was out during the year end, and thanks
> once again for your review comments.
>
> I explored your suggestion of "hard-coded periods", and have some
> problems: code complexity seems to be very high at updates (as well
> as retrievals possibly); and I may not be able to get accurate results.
> E.g. I have data for 1, 4, 16 seconds; and at 18 seconds, a request is
> made for retrieval of the last 16 seconds (or 1,4,16). At this time I have
> values for last 18 seconds not 16 seconds. I explored using timers to
> cascade (will not work as it may run into races with the setters, and
> also adds too much overhead) vs doing this synchronously when the
> event happens. Both are complicated and have the above issue of not
> able to get accurate information depending on when the request is
> made.
>
> To implement your suggestion of say histograms, the retrieval code can
> calculate the 4 values (1, 4, 16, and 64 seconds) by averaging across
> the correct intervals. In this case, the new CLI command is not required,
> and by default it prints all 4 values. Would this work in your opinion?
>
> Ack all your other suggestions, will incorporate those changes and
> re-send. Please let me know if this sounds reasonable.
>
> Thanks,
> - Krishna
>
>
> On Thu, Dec 22, 2016 at 4:23 PM, Willy Tarreau <w...@1wt.eu> wrote:
>
>> Hi Krishna,
>>
>> On Thu, Dec 22, 2016 at 09:41:49AM +0530, Krishna Kumar (Engineering)
>> wrote:
>> > We have found that the current mechanism of qtime, ctime, rtime, and
>> ttime
>> > based on last 1024 requests is not the most suitable to debug/visualize
>> > latency issues with servers, especially if they happen to last a very
>> short
>> > time. For live dashboards showing server timings, we found an additional
>> > last-'n' seconds metrics useful. The logs could also be parsed to derive
>> > these
>> > values, but suffers from delays at high volume, requiring higher
>> processing
>> > power and enabling logs.
>> >
>> > The 'last-n' seconds metrics per server/backend can be configured as
>> follows
>> > in the HAProxy configuration file:
>> > backend backend-1
>> > stats-period 32
>> > ...
>> >
>> > To retrieve these stats at the CLI (in addition to existing metrics),
>> run:
>> > echo show stat-duration time 3 | socat /var/run/admin.sock stdio
>> >
>> > These are also available on the GUI.
>> >
>> > The justification for this patch are:
>> > 1. Allows to capture spikes for a server during a short period. This
>> helps
>> >having dashboards that show server response times every few seconds
>> (e.g.
>> >every 1 second), so as to be able to chart it across timelines.
>> > 2. Be able to get an average across different time intervals, e.g.  the
>> >configuration file may specify to save the last 32 seconds, but the
>> cli
>> >interface can request for average across any interval upto 32
>> seconds.
>> > E.g.
>> >the following command prints the existing metrics appended by the
>> time
>> >based ones for the last 1 second:
>> > echo show stat-duration time 1 | socat /var/run/admin.sock stdio
>> >Running the following existing command appends the time-based metric
>> > values
>> >based on the time period configured in the configuration file per
>> >backend/server:
>> > echo show stat | socat /var/run/admin.sock stdio
>> > 3. Option per backend for configuring the server stat's time interval,
>> and.
>> >no API breakage to stats (new metrics are added at end of line).
>> >
>> > Please review, any feedback on the code/usability/extensibility is very
>> much
>> > appreciated.
>>
>> First, thanks for this work. I'm having several concerns and comments
>> however
>> about it.
>>
>> The first one is that the amount of storage is overkill if the output can
>> only emit an average over a few periods. I mean, the purpose of stats is
>> to emit what we know i

Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2017-01-03 Thread Krishna Kumar (Engineering)
Hi Willy,

Sorry for the late response as I was out during the year end, and thanks
once again for your review comments.

I explored your suggestion of "hard-coded periods", and have some
problems: code complexity seems to be very high at updates (as well
as retrievals possibly); and I may not be able to get accurate results.
E.g. I have data for 1, 4, 16 seconds; and at 18 seconds, a request is
made for retrieval of the last 16 seconds (or 1,4,16). At this time I have
values for last 18 seconds not 16 seconds. I explored using timers to
cascade (will not work as it may run into races with the setters, and
also adds too much overhead) vs doing this synchronously when the
event happens. Both are complicated and have the above issue of not
able to get accurate information depending on when the request is
made.

To implement your suggestion of say histograms, the retrieval code can
calculate the 4 values (1, 4, 16, and 64 seconds) by averaging across
the correct intervals. In this case, the new CLI command is not required,
and by default it prints all 4 values. Would this work in your opinion?

Ack all your other suggestions, will incorporate those changes and
re-send. Please let me know if this sounds reasonable.

Thanks,
- Krishna


On Thu, Dec 22, 2016 at 4:23 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Thu, Dec 22, 2016 at 09:41:49AM +0530, Krishna Kumar (Engineering)
> wrote:
> > We have found that the current mechanism of qtime, ctime, rtime, and
> ttime
> > based on last 1024 requests is not the most suitable to debug/visualize
> > latency issues with servers, especially if they happen to last a very
> short
> > time. For live dashboards showing server timings, we found an additional
> > last-'n' seconds metrics useful. The logs could also be parsed to derive
> > these
> > values, but suffers from delays at high volume, requiring higher
> processing
> > power and enabling logs.
> >
> > The 'last-n' seconds metrics per server/backend can be configured as
> follows
> > in the HAProxy configuration file:
> > backend backend-1
> > stats-period 32
> > ...
> >
> > To retrieve these stats at the CLI (in addition to existing metrics),
> run:
> > echo show stat-duration time 3 | socat /var/run/admin.sock stdio
> >
> > These are also available on the GUI.
> >
> > The justification for this patch are:
> > 1. Allows to capture spikes for a server during a short period. This
> helps
> >having dashboards that show server response times every few seconds
> (e.g.
> >every 1 second), so as to be able to chart it across timelines.
> > 2. Be able to get an average across different time intervals, e.g.  the
> >configuration file may specify to save the last 32 seconds, but the
> cli
> >interface can request for average across any interval upto 32 seconds.
> > E.g.
> >the following command prints the existing metrics appended by the time
> >based ones for the last 1 second:
> > echo show stat-duration time 1 | socat /var/run/admin.sock stdio
> >Running the following existing command appends the time-based metric
> > values
> >based on the time period configured in the configuration file per
> >backend/server:
> > echo show stat | socat /var/run/admin.sock stdio
> > 3. Option per backend for configuring the server stat's time interval,
> and.
> >no API breakage to stats (new metrics are added at end of line).
> >
> > Please review, any feedback on the code/usability/extensibility is very
> much
> > appreciated.
>
> First, thanks for this work. I'm having several concerns and comments
> however
> about it.
>
> The first one is that the amount of storage is overkill if the output can
> only emit an average over a few periods. I mean, the purpose of stats is
> to emit what we know internally. Some people might want to see histograms,
> and while we have everything internally with your patch, it's not possible
> to produce them.
>
> For this reason I think we should proceed differently and always emit these
> stats over a few hard-coded periods. You proved that they don't take that
> much space, and I think it would make sense probably to emit them over a
> small series of power of 4 seconds : 1s, 4s, 16s, 64s. That's quite cheap
> to store and easy to compute because it's not needed anymore to store all
> individual values, you can cascade them while filling a bucket.
>
> And if you go down that route then there's no need anymore for adding yet
> another setting in the configuration, it will be done by default. Another
> point regarding this setting "stats-period" is that you currently do not
> 
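
The "cascade them while filling a bucket" idea can be sketched as follows
(all names assumed): level 0 holds the last completed 1s bucket, and every
completed bucket is folded into the level above, which itself completes
after four of them, yielding 1s/4s/16s/64s totals:

#define LEVELS 4 /* periods of 1s, 4s, 16s and 64s */

struct cascade {
	unsigned long cur[LEVELS];  /* bucket currently being filled */
	unsigned int  n[LEVELS];    /* sub-buckets accumulated so far */
	unsigned long done[LEVELS]; /* last completed bucket per level */
};

/* Called once per second with that second's event count. */
static void cascade_tick(struct cascade *c, unsigned long events)
{
	unsigned long carry = events;

	for (int i = 0; i < LEVELS; i++) {
		c->cur[i] += carry;
		if (++c->n[i] < (i == 0 ? 1 : 4))
			break;              /* this level's period not over */
		c->done[i] = c->cur[i];     /* level i completed: publish */
		carry = c->cur[i];          /* and cascade into level i+1 */
		c->cur[i] = 0;
		c->n[i] = 0;
	}
}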

Re: [PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2016-12-22 Thread Krishna Kumar (Engineering)
Thanks, Willy, for your detailed review, especially some design-related
points that I was not aware of. I will go through these and respond
accordingly.

Regards,
- Krishna


On Thu, Dec 22, 2016 at 4:23 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Thu, Dec 22, 2016 at 09:41:49AM +0530, Krishna Kumar (Engineering)
> wrote:
> > We have found that the current mechanism of qtime, ctime, rtime, and ttime
> > based on the last 1024 requests is not the most suitable for debugging or
> > visualizing latency issues with servers, especially if those issues last a
> > very short time. For live dashboards showing server timings, we found an
> > additional last-'n'-seconds metric useful. The logs could also be parsed
> > to derive these values, but that suffers from delays at high volume,
> > requires more processing power, and requires logging to be enabled.
> >
> > The last-'n'-seconds metrics per server/backend can be configured as
> > follows in the HAProxy configuration file:
> > backend backend-1
> > stats-period 32
> > ...
> >
> > To retrieve these stats at the CLI (in addition to existing metrics), run:
> > echo show stat-duration time 3 | socat /var/run/admin.sock stdio
> >
> > These are also available on the GUI.
> >
> > The justifications for this patch are:
> > 1. It allows capturing spikes for a server during a short period. This
> >    helps in having dashboards that show server response times every few
> >    seconds (e.g. every 1 second), so as to be able to chart them across
> >    timelines.
> > 2. It makes it possible to get an average across different time
> >    intervals; e.g. the configuration file may specify to save the last
> >    32 seconds, but the CLI interface can request an average across any
> >    interval up to 32 seconds. For example, the following command prints
> >    the existing metrics appended by the time-based ones for the last 1
> >    second:
> > echo show stat-duration time 1 | socat /var/run/admin.sock stdio
> >    Running the following existing command appends the time-based metric
> >    values based on the time period configured in the configuration file
> >    per backend/server:
> > echo show stat | socat /var/run/admin.sock stdio
> > 3. The servers' stats time interval is configurable per backend, and
> >    there is no API breakage to stats (new metrics are added at the end
> >    of the line).
> >
> > Please review; any feedback on the code/usability/extensibility is very
> > much appreciated.
>
> First, thanks for this work. However, I have several concerns and comments
> about it.
>
> The first one is that the amount of storage is overkill if the output can
> only emit an average over a few periods. I mean, the purpose of stats is
> to emit what we know internally. Some people might want to see histograms,
> and while we have everything internally with your patch, it's not possible
> to produce them.
>
> For this reason I think we should proceed differently and always emit these
> stats over a few hard-coded periods. You proved that they don't take that
> much space, and I think it would probably make sense to emit them over a
> small series of powers of 4 seconds: 1s, 4s, 16s, 64s. That's quite cheap
> to store and easy to compute, because it's no longer necessary to store all
> individual values; you can cascade them while filling a bucket.
>
> And if you go down that route then there's no need anymore for adding yet
> another setting in the configuration, it will be done by default. Another
> point regarding this setting "stats-period" is that you currently do not
> support the "defaults" section while I guess almost all users will want to
> have it there. That's another reason for keeping it hard-coded.
>
> Now some general comments and guidance on the code itself :
>
> > diff --git a/include/proto/stats.h b/include/proto/stats.h
> > index ac893b8..f33ceb1 100644
> > --- a/include/proto/stats.h
> > +++ b/include/proto/stats.h
> > @@ -95,7 +95,8 @@ int stats_fill_fe_stats(struct proxy *px, struct field *stats, int len);
> >  int stats_fill_li_stats(struct proxy *px, struct listener *l, int flags,
> >  struct field *stats, int len);
> >  int stats_fill_sv_stats(struct proxy *px, struct server *sv, int flags,
> > -struct field *stats, int len);
> > +struct field *stats, int len,
> > +struct stream_interface *si);
>
> Please avoid re-introducing the stream_interface in stats, we're trying
> hard to abstract it. From what I'm seeing you 
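
As a sketch of the cascading power-of-4 buckets described above (a minimal
illustration with assumed names and layout, not HAProxy's actual code), each
level only needs the bucket being filled plus the last completed one:

#include <string.h>

/* Four cascading levels covering 1s, 4s, 16s and 64s. Level L holds the
 * sum and count of samples over 4^L seconds; whenever a level completes,
 * its totals cascade into the next coarser level, so no per-request
 * history needs to be stored.
 */
#define LEVELS 4

struct bucket {
	unsigned long long sum; /* summed latency, e.g. in milliseconds */
	unsigned int count;     /* number of samples */
};

struct cascade {
	struct bucket cur[LEVELS];  /* bucket currently being filled */
	struct bucket done[LEVELS]; /* last completed bucket per level */
	unsigned int ticks;         /* seconds elapsed */
};

/* Called for every finished request with its measured time. */
static void sample_add(struct cascade *c, unsigned int ms)
{
	c->cur[0].sum += ms;
	c->cur[0].count++;
}

/* Called once per second by a timer task. */
static void sample_tick(struct cascade *c)
{
	int lvl;

	c->ticks++;
	for (lvl = 0; lvl < LEVELS; lvl++) {
		/* level lvl completes every 4^lvl seconds */
		if (c->ticks % (1u << (2 * lvl)))
			break;
		c->done[lvl] = c->cur[lvl];
		if (lvl + 1 < LEVELS) {
			/* cascade this bucket into the coarser level */
			c->cur[lvl + 1].sum += c->cur[lvl].sum;
			c->cur[lvl + 1].count += c->cur[lvl].count;
		}
		memset(&c->cur[lvl], 0, sizeof(c->cur[lvl]));
	}
}

/* Average over the last completed 4^lvl-second window; 0 if idle. */
static unsigned int sample_avg(const struct cascade *c, int lvl)
{
	return c->done[lvl].count ? c->done[lvl].sum / c->done[lvl].count : 0;
}

Emitting "show stat" could then print sample_avg(c, 0..3) for the
1s/4s/16s/64s columns without any extra configuration.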

[PATCH] MEDIUM/RFC: Implement time-based server latency metrics

2016-12-21 Thread Krishna Kumar (Engineering)
We have found that the current mechanism of qtime, ctime, rtime, and ttime
based on the last 1024 requests is not the most suitable for debugging or
visualizing latency issues with servers, especially if those issues last a
very short time. For live dashboards showing server timings, we found an
additional last-'n'-seconds metric useful. The logs could also be parsed to
derive these values, but that suffers from delays at high volume, requires
more processing power, and requires logging to be enabled.
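
For context, a rough sketch of the kind of running average behind those
existing counters (illustrative only; the names and exact formula are
assumptions, not HAProxy's actual code):

/* Sliding running average over approximately the last N samples; the
 * sum stays scaled by N, so the average is read back as sum / N.
 */
#define TIME_STATS_SAMPLES 1024

static inline void sample_add(unsigned int *sum, unsigned int value)
{
	/* drop 1/N of the accumulated sum, then add the new sample */
	*sum = *sum - *sum / TIME_STATS_SAMPLES + value;
}

static inline unsigned int sample_avg(const unsigned int *sum)
{
	return *sum / TIME_STATS_SAMPLES;
}

With such a scheme, a one-second spike contributes only a small fraction of
its weight to the reported value, which is why short-lived latency problems
are hard to see and why a time-based window helps.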

The last-'n'-seconds metrics per server/backend can be configured as follows
in the HAProxy configuration file:
backend backend-1
stats-period 32
...

To retrieve these stats at the CLI (in addition to existing metrics), run:
echo show stat-duration time 3 | socat /var/run/admin.sock stdio

These are also available on the GUI.

The justifications for this patch are:
1. It allows capturing spikes for a server during a short period. This helps
   in having dashboards that show server response times every few seconds
   (e.g. every 1 second), so as to be able to chart them across timelines.
2. It makes it possible to get an average across different time intervals;
   e.g. the configuration file may specify to save the last 32 seconds, but
   the CLI interface can request an average across any interval up to 32
   seconds. For example, the following command prints the existing metrics
   appended by the time-based ones for the last 1 second:
echo show stat-duration time 1 | socat /var/run/admin.sock stdio
   Running the following existing command appends the time-based metric
   values based on the time period configured in the configuration file per
   backend/server:
echo show stat | socat /var/run/admin.sock stdio
3. The servers' stats time interval is configurable per backend, and there
   is no API breakage to stats (new metrics are added at the end of the
   line).

Please review; any feedback on the code/usability/extensibility is very much
appreciated.

Thanks,
- Krishna Kumar
From 55d8966ee185031814aa97368adc262c78540705 Mon Sep 17 00:00:00 2001
From: Krishna Kumar 
Date: Thu, 22 Dec 2016 09:28:02 +0530
Subject: [PATCH] Implement last-'n'-seconds-based counters for
 backend/server queue, connect, response and session times.

---
 doc/configuration.txt     |  29 ++
 include/common/cfgparse.h |   3 +
 include/proto/stats.h     |   3 +-
 include/types/applet.h    |   1 +
 include/types/counters.h  |  35 +++
 include/types/proxy.h     |   4 +
 include/types/server.h    |   1 +
 include/types/stats.h     |   7 ++
 src/cfgparse.c            |  51 ++-
 src/haproxy.c             |   6 ++
 src/hlua_fcn.c            |   2 +-
 src/proxy.c               |  39 
 src/server.c              |   7 ++
 src/stats.c               | 226 +-
 src/stream.c              |   7 ++
 15 files changed, 414 insertions(+), 7 deletions(-)

diff --git a/doc/configuration.txt b/doc/configuration.txt
index f654c8e..dc0ddae 100644
--- a/doc/configuration.txt
+++ b/doc/configuration.txt
@@ -8078,6 +8078,35 @@ stats uri 
 
   See also : "stats auth", "stats enable", "stats realm"
 
+stats-period <nsecs>
+  Enable time based average queue, connect, response and session times for
+  this backend and all servers contained in it.
+  May be used in sections :   defaults | frontend | listen | backend
+                                  no   |    no    |   no   |   yes
+  Arguments :
+    <nsecs> is the maximum number of seconds for which to maintain the four
+      averages (queue, connect, response and session times). 'nsecs' is a
+      non-negative number, and should be a power of 2, with the maximum
+      value being 128. To retrieve the averages, use one of these methods:
+      - echo show stat | socat /var/run/admin.sock stdio
+        Shows existing statistics, appended by 4 metrics corresponding to
+        average queue, connect, response and session times. These four
+        times are averaged across the last 'nsecs' seconds.
+      - echo show stat-duration time <n> | socat /var/run/admin.sock stdio
+        Shows existing statistics, appended by 4 metrics corresponding to
+        average queue, connect, response and session times. These four
+        times are averaged across the last 'n' seconds.
+
+  Note : When the backend does not specify the 'stats-period' option, these
+         metrics are zeroed out.
+
+  Example :
+    # Preserve last 32 seconds of average queue, connect, response and
+    # session times for a backend and all servers in that backend:
+    backend backend-1
+        stats-period 32
+        server a.b.c.d a.b.c.d:80
+        server a.b.c.e a.b.c.e:80
 
stick match <pattern> [table <table>] [{if | unless} <cond>]
   Define a request pattern matching condition to stick a user to a server
diff --git a/include/common/cfgparse.h b/include/common/cfgparse.h
index fd04b14..101cad9 100644
--- a/include/common/cfgparse.h
+++ b/include/common/cfgparse.h
@@ -110,6 +110,9 @@ static inline int warnifnotcap(struct proxy *proxy, 

Re: haproxy hangs on old linux kernels (2.6.24)

2016-04-12 Thread Krishna Kumar (Engineering)
This is actually a kernel Oops, as the kernel is accessing an invalid memory
location: fff4 (a bug in the kernel code). Only a kernel upgrade can fix that.

On Tue, Apr 12, 2016 at 12:43 PM, Alexey Vlasov  wrote:

> Hi,
>
> I have some Linux boxes with very old kernels. Unfortunately, I cannot
> upgrade them because they work very stably; for example, their uptime is
> already several years, which is not the case with modern kernels.
> But there is one problem: HAProxy hangs when I turn on SSL options.
>
> # haproxy -v
> HA-Proxy version 1.5.4 2014/09/02
>
> My config:
> global
> tune.ssl.default-dh-param 2048
> ssl-default-bind-ciphers
> ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-AES256-GCM-SHA384:DHE-RSA-AES128-GCM-SHA256:DHE-DSS-AES128-GCM-SHA256:kEDH+AESGCM:ECDHE-RSA-AES128-SHA256:ECDHE-ECDSA-AES128-SHA256:ECDHE-RSA-AES128-SHA:ECDHE-ECDSA-AES128-SHA:ECDHE-RSA-AES256-SHA384:ECDHE-ECDSA-AES256-SHA384:ECDHE-RSA-AES256-SHA:ECDHE-ECDSA-AES256-SHA:DHE-RSA-AES128-SHA256:DHE-RSA-AES128-SHA:DHE-DSS-AES128-SHA256:DHE-RSA-AES256-SHA256:DHE-DSS-AES256-SHA:DHE-RSA-AES256-SHA:!aNULL:!eNULL:!EXPORT:!DES:!RC4:!3DES:!MD5:!PSK
>
> frontend https-in
> bind111.222.111.222:443 ssl strict-sni no-sslv3 crt-list
> /etc/haproxy_aux2_pools/crt.list
> errorfile   408 /dev/null
> option  http-keep-alive
> option  http-server-close
> http-requestadd-header X-Forwarded-Port %[dst_port]
> http-requestadd-header X-Forwarded-Proto https
> use_backend apache_aux2_workers
>
> # ps -o s,pid,start,comm -C haproxy_aux2_pools
> S   PID  STARTED COMMAND
> D   472   Apr 07 haproxy_aux2_po
> D   725   Apr 07 haproxy_aux2_po
> D  1185   Apr 07 haproxy_aux2_po
> D  1706   Apr 07 haproxy_aux2_po
> D  2168   Apr 07 haproxy_aux2_po
> D  2749   Apr 07 haproxy_aux2_po
> D  2996   Apr 07 haproxy_aux2_po
> D  3620   Apr 07 haproxy_aux2_po
> D  3960   Apr 07 haproxy_aux2_po
>
> and kernel trace:
> Apr  7 17:40:23 l4 kernel: Unable to handle kernel paging request at
> fff4 RIP:
> Apr  7 17:40:23 l4 kernel: []
> dma_unpin_iovec_pages+0x10/0x80
> Apr  7 17:40:23 l4 kernel: PGD 203067 PUD 204067 PMD 0
> Apr  7 17:40:23 l4 kernel: Oops:  [1] SMP
> Apr  7 17:40:23 l4 kernel: CPU 0
> Apr  7 17:40:23 l4 kernel: Pid: 17747, comm: haproxy_aux2_po Not tainted
> 2.6.24-1gb-1 #4
> Apr  7 17:40:23 l4 kernel: RIP: 0010:[]
> [] dma_unpin_iovec_pages+0x10/0x80
> Apr  7 17:40:23 l4 kernel: RSP: 0018:8101164dbbb8  EFLAGS: 00010282
> Apr  7 17:40:23 l4 kernel: RAX: 0001 RBX: 
> RCX: 
> Apr  7 17:40:23 l4 kernel: RDX:  RSI: 
> RDI: fff4
> Apr  7 17:40:23 l4 kernel: RBP: 8102acf5c6b0 R08: 0040
> R09: 
> Apr  7 17:40:23 l4 kernel: R10: 80629900 R11: 80398920
> R12: 8102acf5c600
> Apr  7 17:40:23 l4 kernel: R13: 8102acf5c6b0 R14: fff4
> R15: 7fff
> Apr  7 17:40:23 l4 kernel: FS:  2b5d03469b20()
> GS:8062f000() knlGS:
> Apr  7 17:40:23 l4 kernel: CS:  0010 DS:  ES:  CR0:
> 80050033
> Apr  7 17:40:23 l4 kernel: CR2: fff4 CR3: 0001c50f2000
> CR4: 06e0
> Apr  7 17:40:23 l4 kernel: DR0:  DR1: 
> DR2: 
> Apr  7 17:40:23 l4 kernel: DR3:  DR6: 0ff0
> DR7: 0400
> Apr  7 17:40:23 l4 kernel: Process haproxy_aux2_po (pid: 17747, threadinfo
> 8101164da000, task 8101b733a000)
> Apr  7 17:40:23 l4 kernel: Stack:   8102acf5c6b0
> 8102acf5c600 8102acf5c6b0
> Apr  7 17:40:23 l4 kernel: 8102acf5c9dc 804e2f11
> 810010535900 804e1a53
> Apr  7 17:40:23 l4 kernel:  4020
> 8101164dbee8 07524a80
> Apr  7 17:40:23 l4 kernel: Call Trace:
> Apr  7 17:40:23 l4 kernel: Call Trace:
> Apr  7 17:40:23 l4 kernel: [] tcp_recvmsg+0x581/0xcd0
> Apr  7 17:40:23 l4 kernel: [] tcp_sendmsg+0x593/0xc30
> Apr  7 17:40:23 l4 kernel: [] _spin_lock_bh+0x9/0x20
> Apr  7 17:40:23 l4 kernel: [] release_sock+0x13/0xb0
> Apr  7 17:40:23 l4 kernel: []
> sock_common_recvmsg+0x30/0x50
> Apr  7 17:40:23 l4 kernel: [] sock_recvmsg+0x14a/0x160
> Apr  7 17:40:23 l4 kernel: [] filemap_fault+0x21e/0x420
> Apr  7 17:40:23 l4 kernel: []
> autoremove_wake_function+0x0/0x30
> Apr  7 17:40:23 l4 kernel: [] __do_fault+0x1e5/0x460
> Apr  7 17:40:23 l4 kernel: [] handle_mm_fault+0x1af/0x7c0
> Apr  7 17:40:23 l4 kernel: [] sys_recvfrom+0xfe/0x1a0
> Apr  7 17:40:23 l4 kernel: [] do_page_fault+0x1e0/0x830
> Apr  7 17:40:23 l4 kernel: [] vma_merge+0x161/0x1f0
> Apr  7 17:40:23 l4 kernel: [] system_call+0x7e/0x83
> Apr  7 17:40:23 l4 kernel:
> Apr  7 17:40:23 l4 kernel:
> Apr  7 17:40:23 l4 kernel: Code: 8b 37 85 f6 7e 51 48 8d 6f 08 45 31 ed 

Setting TOS field dynamically in 1.5.12 (or later)

2016-01-12 Thread Krishna Kumar (Engineering)
Hi all,

Is there any way to set the TOS field dynamically (any mechanism, acl or
others) for responses back to the client? We need to be able to dynamically
select one of the ISP links when a client makes a new connection. One option
was to set the TOS field for responses from HAProxy.
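
For context, at the socket level this is a per-connection setsockopt(); a
minimal sketch in plain C (not HAProxy internals; the helper name is an
assumption):

#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/ip.h>

/* Mark all responses on a connected client socket with a TOS value so
 * that upstream routing can steer them to a specific ISP link.
 * Returns 0 on success, -1 on error with errno set.
 */
static int set_conn_tos(int fd, int tos)
{
	return setsockopt(fd, IPPROTO_IP, IP_TOS, &tos, sizeof(tos));
}

The open question above is how to drive such a value dynamically (per ACL or
similar) from the configuration.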

Thanks,
- Krishna Kumar


Re: [1.6.1] Utilizing http-reuse

2015-12-08 Thread Krishna Kumar (Engineering)
Great! Initial tests show that only one connection was established, and it
was closed once. The behavior is as follows:

- telnet and a manual GET: a connection to haproxy and a connection to the
  server (port 2004).
- Run ab: a new connection to haproxy, which reuses the same connection
  (port 2004) to the server. When 'ab' finishes, the client->haproxy
  connection gets closed, which results in an immediate drop of the
  haproxy->server connection (port 2004) too.
- Do another GET in the telnet: a new connection is established from
  HAProxy -> server (port 2005).
- Kill telnet: the connection to haproxy is killed. HAProxy kills the
  port 2005 connection.

This behavior works for us, thanks a lot for the quick fix. The above
behavior validates the second point you mentioned in your earlier mail:

"I'll see). If the client closes an idle connection while there are still
other connections left, the server connection is not moved back to the
server's idle list and is closed. It's not dramatic, but is a waste of
resources since we could maintain that connection open. I'll see if we can
do something simple regarding this case."

Thanks,
Krishna






On Tue, Dec 8, 2015 at 12:32 PM, Willy Tarreau <w...@1wt.eu> wrote:

> On Tue, Dec 08, 2015 at 07:44:45AM +0530, Krishna Kumar (Engineering)
> wrote:
> > Great, will be glad to test and report on the finding. Thanks!
>
> Sorry I forgot to post the patch after committing it. Here it comes.
> Regarding the second point, in the end it's not a bug, it's simply
> because we don't have connection pools yet, and I forgot that keeping
> an orphan backend connection was only possible with connection pools :-)
>
> Willy
>
>


Re: [1.6.1] Utilizing http-reuse

2015-12-07 Thread Krishna Kumar (Engineering)
Great, will be glad to test and report on the finding. Thanks!

Regards,
- Krishna

On Mon, Dec 7, 2015 at 9:07 PM, Willy Tarreau  wrote:

> Hi Krishna,
>
> I found a bug explaining your observations and noticed a second one that I
> have not yet troubleshot.
>
> The bug causing your issue is that before moving the idle connection back
> to the server's pool, we check the backend's http-reuse mode. But we're
> doing this after calling http_reset_txn(), which prepares the transaction
> to accept a new request and sets the backend to the frontend. So we're in
> fact checking the frontend's option. That's why it doesn't work in your
> case. That's a stupid bug that I managed to fix.
>
> While testing this I discovered another issue (probably less easy to fix,
> I'll see). If the client closes an idle connection while there are still
> other connections left, the server connection is not moved back to the
> server's idle list and is closed. It's not dramatic, but is a waste of
> resources since we could maintain that connection open. I'll see if we can
> do something simple regarding this case.
>
> I'll send a patch soon for the first case.
>
> Thanks,
> Willy
>
>


Re: [1.6.1] Utilizing http-reuse

2015-12-06 Thread Krishna Kumar (Engineering)
Hi Willy, Baptiste,

Apologies for the delay in reproducing this issue and in responding.

I am using HAProxy 1.6.2 and am still finding that connection reuse is not
happening in my setup. Attaching the configuration file, command line
arguments, and the tcpdump (80 packets in all), in case it helps. HAProxy
is configured with a single backend. The same client makes two requests,
one via telnet with a GET request for a 128-byte file, and the second via an
'ab -k' command to retrieve the same file.

172.20.97.36: Client
10.34.73.174: HAProxy
10.32.121.94: Server

Telnet from client with GET:
GET /128 HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)

Followed immediately with:
ab -k -n 10 -c 1 http://10.34.73.174/1K

Packets #1-7: telnet to haproxy, and a GET request made.
Packets #8-9: HAProxy opens a connection to the single backend.
Packets #10-15: response from the server, data relayed back to the client;
    the client->HAProxy and HAProxy->server connections are kept open.
Packets #16-19 (5 seconds later): same client, run 'ab -k'.
Packets #20-72: new connection to the same backend, and data transfer.
Packet #73: 'ab' closes its connection to HAProxy.
Packet #74: HAProxy closes the connection to 'ab'.
Packet #75: HAProxy closes the connection to the backend.
Packets #77-81: telnet closes its connection.

Configuration file:
--
global
daemon
maxconn 1

defaults
mode http
option http-keep-alive
balance leastconn
option splice-response
option clitcpka
option srvtcpka
option tcp-smart-accept
option tcp-smart-connect
option contstats
timeout http-keep-alive 1800s
timeout http-request 1800s
timeout connect 60s
timeout client 1800s
timeout server 1800s

frontend private-frontend
mode http
maxconn 1
bind 10.34.73.174:80
default_backend private-backend

backend private-backend
http-reuse always
server 10.32.121.94 10.32.121.94:80 maxconn 1

From the above, it is seen that HAProxy opens a second connection to the
server on the same GET request from the client.

Can you please take a look and suggest what needs to be done to get reuse
working?

$ haproxy -vv
HA-Proxy version 1.6.2 2015/11/03
Copyright 2000-2015 Willy Tarreau <wi...@haproxy.org>

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -O3 -g -fno-strict-aliasing -Wdeclaration-after-statement
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 1024, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity("identity"),
deflate("deflate"), raw-deflate("deflate"), gzip("gzip")
Built with OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
Running on OpenSSL version : OpenSSL 1.0.1f 6 Jan 2014
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.35 2014-04-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built without Lua support
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.

Thanks,
- Krishna Kumar


On Thu, Nov 12, 2015 at 12:50 PM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Wed, Nov 11, 2015 at 03:22:54PM +0530, Krishna Kumar (Engineering)
> wrote:
> > I just tested with 128K byte file (run 4 wgets
> > in parallel each retrieving 128K). Here, I see 4 connections being
> opened, and
> > lots of data packets in the middle, finally followed by 4 connections
> > being closed. I
> > can test with "ab -k" option to simulate a browser, I guess, will try
> that.
>
> In my tests, ab -k definitely works.
>
> > > Is wget advertising HTTP/1.1 in the request ? If not that could
> >
> > Yes, tcpdump shows HTTP/1.1 in the GET request.
>
> OK.
>
> > >   - proxy protocol used to the server
> > >   - SNI sent to the server
> > >   - source IP binding to client's IP address
> > >   - source IP binding to anything dynamic (eg: header)
> > >   - 401/407 received on a server connection.
> >
> > I am not doing any of these specifically. Its a very simple setup where
> the
> > client@ip1 connects to haproxy@ip2 and requests 128 byte file, which
> > is handled by server@ip3.
>
> OK. I don't see any reason for this not to work then.
>
> > I was doing this in telnet:
> >
> > GET /128 HTTP/1.1
> > Host: www.example.com
> > User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)
>
> Looks fine as well. Very strange. I have no idea what would not at the
> moment, I suspect this is something stupid and obvious but am failing
> to spot it :-/
>
> Willy
>
>


packets.pcap
Description: application/vnd.tcpdump.pcap


Re: [1.6.1] Utilizing http-reuse

2015-12-06 Thread Krishna Kumar (Engineering)
Thanks a lot, Willy.

Regards,
- Krishna

On Mon, Dec 7, 2015 at 11:59 AM, Willy Tarreau <w...@1wt.eu> wrote:

> Hi Krishna,
>
> On Mon, Dec 07, 2015 at 08:31:19AM +0530, Krishna Kumar (Engineering)
> wrote:
> > Hi Willy, Baptiste,
> >
> > Apologies for the delay in reproducing this issue and in responding.
> >
> > I am using HAProxy 1.6.2 and am still finding that connection reuse is
> not
> > happening in my setup. Attaching the configuration file, command line
> > arguments, and the tcpdump (80 packets in all), in case it helps. HAProxy
> > is configured with a single backend. The same client makes two requests,
> > one a telnet with a GET request for a 128 byte file, and the second 'ab
> -k'
> > command to retrieve the same file.
> (...)
> > Can you please take a look and suggest what needs to be done to get reuse
> > working?
>
> Thank you for this detailed report. I agree that your config shows that
> it should work and the pcap shows that it doesn't. I've taken a quick
> look at the code and have no idea why it does this. I'm going to
> investigate
> and will keep you informed.
>
> Thanks!
> Willy
>
>


Re: [1.6.1] Utilizing http-reuse

2015-11-11 Thread Krishna Kumar (Engineering)
Hi Willy,

>> B. Run 8 wgets in parallel. Each opens a new connection to get a 128 byte 
>> file.
>>  Again, 8 separate connections are opened to the backend server.
>
> But are they *really* processed in parallel ? If the file is only 128 bytes,
> I can easily imagine that the connections are opened and closed immediately.
> Please keep in mind that wget doesn't work like a browser *at all*. A browser
> keeps connections alive. Wget fetches one object and closes. That's a huge
> difference because the browser connection remains reusable while wget's not.

Yes, they were not really in parallel. I just tested with a 128K-byte file
(ran 4 wgets in parallel, each retrieving 128K). Here, I see 4 connections
being opened, lots of data packets in the middle, and finally 4 connections
being closed. I can test with the "ab -k" option to simulate a browser, I
guess; will try that.

>> D.  Run 5 "wget -i  " in parallel. 5 connections are opened by the 5
>>     wgets, and 5 connections are opened by haproxy to the single server;
>>     finally all are closed by RSTs.
>
> Is wget advertising HTTP/1.1 in the request ? If not that could

Yes, tcpdump shows HTTP/1.1 in the GET request.

> explain why they're not merged, we only merge connections from
> HTTP/1.1 compliant clients. Also we keep private any connection
> which sees a 401 or 407 status code so that authentication doesn't
> mix up with other clients and we remain compatible with broken
> auth schemes like NTLM which violates HTTP. There are other criteria
> to mark a connection private :
>   - proxy protocol used to the server
>   - SNI sent to the server
>   - source IP binding to client's IP address
>   - source IP binding to anything dynamic (eg: header)
>   - 401/407 received on a server connection.

I am not doing any of these specifically. It's a very simple setup where the
client@ip1 connects to haproxy@ip2 and requests a 128-byte file, which is
handled by server@ip3.

>> I also modified step #1 above to do a telnet, followed by a GET in the
>> telnet to actually open a server connection, and then ran the other tests.
>> I still don't see connection reuse having any effect.
>
> How did you make your test, what exact request did you type ?

I was doing this in telnet:

GET /128 HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)

Thanks for your response & help,

Regards,
- Krishna Kumar



[1.6.1] Utilizing http-reuse

2015-11-10 Thread Krishna Kumar (Engineering)
Dear all,

I am comparing 1.6.1 with 1.5.12. Following are the relevant snippets from the
configuration file:

global
   maxconn 100
defaults
   option http-keep-alive
   option clitcpka
   option srvtcpka
frontend private-frontend
   maxconn 100
   mode http
   bind IP1:80
   default_backend private-backend
backend private-backend
http-reuse always (only in the 1.6.1 configuration)
server IP2 IP2:80 maxconn 10

Client runs a single command to retrieve a file of 128 bytes:
  ab -k -n 20 -c 12 http:///128

Tcpdump shows that 12 connections were established to the frontend, 10
connections were then made to the server, and after those 10 were serviced
once (GET), two new connections were opened to the server and serviced once
(GET); finally, 8 requests were done on the first set of server connections.
In the end, all 12 connections were closed together. There is no difference
between 1.5.12 and 1.6.1 in the number of packets or the sequence of packets.

How do I actually re-use idle connections? Do I need to run ab's in parallel
with some delay, etc., to see old connections being reused? I also separately
ran the following script to get a file of 4K, to introduce parallel
connections with delays, etc.:

for i in {1..20}
do
ab -k -n 100 -c 50 http://10.34.73.174/4K &
sleep 0.4
done
wait

But the total number of packets for 1.5.12 and 1.6.1 was similar (no drops
in tcpdump, no connection drops at the client, with 24.6K packets for 1.5.12
and 24.8K packets for 1.6.1). Could someone please let me know what I should
change in the configuration or the client to see the effect of http-reuse?

Thanks,
- Krishna Kumar



Re: [1.6.1] Utilizing http-reuse

2015-11-10 Thread Krishna Kumar (Engineering)
Thanks Baptiste. My configuration file is very basic:

global
  maxconn 100
defaults
mode http
option http-keep-alive
option splice-response
option clitcpka
option srvtcpka
option tcp-smart-accept
option tcp-smart-connect
timeout connect 60s
timeout client 1800s
timeout server 1800s
timeout http-request 1800s
timeout http-keep-alive 1800s
frontend private-frontend
maxconn 100
mode http
bind IP1:80
default_backend private-backend
backend private-backend
 http-reuse always
 server IP2 IP2:80 maxconn 10

As described by you, I did the following tests:

1. Telnet to the HAProxy IP, and then run each of the following tests:

A.  Serial: run wget; sleep 0.5; wget; sleep 0.5; ... (8 times). tcpdump
    shows that when each wget finishes, the client closes the connection and
    haproxy sends an RST to the single backend. The next wget opens a new
    connection to haproxy, and in turn to the server upon request.

B.  Run 8 wgets in parallel. Each opens a new connection to get a 128-byte
    file. Again, 8 separate connections are opened to the backend server.

C.  Run "wget -i ". wget uses keepalive to not close the connection. Here,
    wget opens only 1 connection to haproxy, and haproxy opens 1 connection
    to the backend, over which wget transfers the 5 files one after the
    other. Behavior is identical to 1.5.12 (same config file, except without
    the reuse directive).

D.  Run 5 "wget -i  " in parallel. 5 connections are opened by the 5 wgets,
    and 5 connections are opened by haproxy to the single server; finally
    all are closed by RSTs.

I also modified step #1 above to do a telnet, followed by a GET in the
telnet to actually open a server connection, and then ran the other tests.
I still don't see connection reuse having any effect.

Is this test scenario different from what you had suggested?

Thanks once again.

Regards,
- Krishna Kumar


On Tue, Nov 10, 2015 at 6:19 PM, Baptiste <bed...@gmail.com> wrote:
> On Tue, Nov 10, 2015 at 11:44 AM, Krishna Kumar (Engineering)
> <krishna...@flipkart.com> wrote:
>> Dear all,
>>
>> I am comparing 1.6.1 with 1.5.12. Following are the relevant snippets from 
>> the
>> configuration file:
>>
>> global
>>maxconn 100
>> defaults
>>option http-keep-alive
>>option clitcpka
>>option srvtcpka
>> frontend private-frontend
>>maxconn 100
>>mode http
>>bind IP1:80
>>default_backend private-backend
>> backend private-backend
>> http-reuse always (only in the 1.6.1 configuration)
>> server IP2 IP2:80 maxconn 10
>>
>> Client runs a single command to retrieve file of 128 bytes:
>>   ab -k -n 20 -c 12 http:///128
>>
>> Tcpdump shows that 12 connections were established to the frontend, 10
>> connections
>> were then made to the server, and after the 10 were serviced once (GET), two 
>> new
>> connections were opened to the server and serviced once (GET), finally
>> 8 requests
>> were done on the first set of server connections. Finally all 12
>> connections were
>> closed together. There is no difference in #packets between 1.5.12 and
>> 1.6.1 or the
>> sequence of packets.
>>
>> How do I actually re-use idle connections? Do I need to run ab's in
>> parallel with
>> some delay, etc, to see old connections being reused? I also ran separately 
>> the
>> following script to get file of 4K, to introduce parallel connections
>> with delay's, etc:
>>
>> for i in {1..20}
>> do
>> ab -k -n 100 -c 50 http://10.34.73.174/4K &
>> sleep 0.4
>> done
>> wait
>>
>> But the total# packets for 1.5.12 and 1.6.1 were similar (no drops in 
>> tcpdump,
>> no Connection drop in client, with 24.6K packets for 1.5.12 and 24.8K packets
>> for 1.6.1). Could someone please let me know what I should change in the
>> configuration or the client to see the effect of http-reuse?
>>
>> Thanks,
>> - Krishna Kumar
>>
>
>
> Hi Krishna,
>
> Actually, your timeouts are very important as well.
> I would also enable "option prefer-last-server", especially if you
> have many servers in the farm.
>
> Now, to test the reuse, simply try opening a session using telnet and
> faking a keepalive session,
> then do a few wgets and confirm that all the traffic uses the previously
> established session.
>
> Baptiste



Re: Unexpected error messages

2015-10-16 Thread Krishna Kumar (Engineering)
Hi Baptiste,

Thanks for your follow up!

Sorry, I was unable to test that since it was seen only on the production
server. However, I tested the same on a test box, with retries=1 and
redispatch, and see that redispatch does happen even with retries=1
when the backend is down (health check is disabled, retries=1, redispatch
is enabled). However, retries=0 does not redispatch as expected.

So the issue remains. We are also checking if they are facing packet
losses that might explain this problem. The confusing part is that the error
contains status = 200.

Thanks,
- Krishna Kumar


On Fri, Oct 16, 2015 at 3:49 PM, Baptiste <bed...@gmail.com> wrote:
> Is your problem fixed?
>
> We may emit a warning for such configuration.
>
> Baptiste
>
> Le 15 oct. 2015 07:34, "Krishna Kumar (Engineering)"
> <krishna...@flipkart.com> a écrit :
>>
>> Hi Baptiste,
>>
>> Thank you for the advise and solution, I didn't realize retries had to be
>> >1.
>>
>> Regards,
>> - Krishna Kumar
>>
>> On Wed, Oct 14, 2015 at 7:51 PM, Baptiste <bed...@gmail.com> wrote:
>> > On Wed, Oct 14, 2015 at 3:03 PM, Krishna Kumar (Engineering)
>> > <krishna...@flipkart.com> wrote:
>> >> Hi all,
>> >>
>> >> We are occasionally getting these messages (about 25 errors/per
>> >> occurrence,
>> >> 1 occurrence per hour) in the *error* log:
>> >>
>> >> 10.xx.xxx.xx:60086 [14/Oct/2015:04:21:25.048] Alert-FE
>> >> Alert-BE/10.xx.xx.xx 0/5000/1/32/+5033 200 +149 - - --NN 370/4/1/0/+1
>> >> 0/0 {10.xx.x.xxx||367||} {|||432} "POST /fk-alert-service/nsca
>> >> HTTP/1.1"
>> >> 10.xx.xxx.xx:60046 [14/Oct/2015:04:21:19.936] Alert-FE
>> >> Alert-BE/10.xx.xx.xx 0/5000/1/21/+5022 200 +149 - - --NN 302/8/2/0/+1
>> >> 0/0 {10.xx.x.xxx||237||} {|||302} "POST /fk-alert-service/nsca
>> >> HTTP/1.1"
>> >> ...
>> >>
>> >> We are unsure what errors were seen at the client. What could possibly
>> >> be the
>> >> reason for these? Every error line has retries value as "+1", as seen
>> >> above. The
>> >> specific options in the configuration are (HAProxy v1.5.12):
>> >>
>> >> 1. "retries 1"
>> >> 2. "option redispatch"
>> >> 3. "option logasap"
>> >> 4. "timeout connect 5000", server and client timeouts are high - 300s
>> >> 5. Number of backend servers is 7.
>> >> 6. ulimit is 512K
>> >> 7. balance is "roundrobin"
>> >>
>> >> Thank you for any leads/insights.
>> >>
>> >> Regards,
>> >> - Krishna Kumar
>> >>
>> >
>> > Hi Krishna,
>> >
>> > First, I don't understand how the "retries 1" and the "redispatch"
>> > works together in your case.
>> > I mean, redispatch is supposed to be applied at 'retries - 1'...
>> >
>> > So basically, what may be happening:
>> > - because of logasap, HAProxy does not wait until the end of the
>> > session to generate the log line
>> > - this log is in error because a connection was attempted (and failed)
>> > on a server
>> >
>> > You should not setup any ulimit and let HAProxy do the job for you.
>> >
>> > Baptiste



Re: [blog] What's new in HAProxy 1.6

2015-10-15 Thread Krishna Kumar (Engineering)
Extremely useful, thanks a lot.

On Thu, Oct 15, 2015 at 5:13 AM, Igor Cicimov
 wrote:
>
> On 14/10/2015 9:41 PM, "Baptiste"  wrote:
>>
>> Hey,
>>
>> I summarized what's new in HAProxy 1.6 with some configuration
>> examples in a blog post to help quick adoption of new features:
>> http://blog.haproxy.com/2015/10/14/whats-new-in-haproxy-1-6/
>>
>> Baptiste
>>
> Awesome, thank you!
>
> Igor



Unexpected error messages

2015-10-14 Thread Krishna Kumar (Engineering)
Hi all,

We are occasionally getting these messages (about 25 errors per occurrence,
1 occurrence per hour) in the *error* log:

10.xx.xxx.xx:60086 [14/Oct/2015:04:21:25.048] Alert-FE
Alert-BE/10.xx.xx.xx 0/5000/1/32/+5033 200 +149 - - --NN 370/4/1/0/+1
0/0 {10.xx.x.xxx||367||} {|||432} "POST /fk-alert-service/nsca
HTTP/1.1"
10.xx.xxx.xx:60046 [14/Oct/2015:04:21:19.936] Alert-FE
Alert-BE/10.xx.xx.xx 0/5000/1/21/+5022 200 +149 - - --NN 302/8/2/0/+1
0/0 {10.xx.x.xxx||237||} {|||302} "POST /fk-alert-service/nsca
HTTP/1.1"
...

We are unsure what errors were seen at the client. What could possibly be
the reason for these? Every error line has a retries value of "+1", as seen
above. The specific options in the configuration are (HAProxy v1.5.12):

1. "retries 1"
2. "option redispatch"
3. "option logasap"
4. "timeout connect 5000", server and client timeouts are high - 300s
5. Number of backend servers is 7.
6. ulimit is 512K
7. balance is "roundrobin"

Thank you for any leads/insights.

Regards,
- Krishna Kumar



Re: Unexpected error messages

2015-10-14 Thread Krishna Kumar (Engineering)
Hi Baptiste,

Thank you for the advise and solution, I didn't realize retries had to be >1.

Regards,
- Krishna Kumar

On Wed, Oct 14, 2015 at 7:51 PM, Baptiste <bed...@gmail.com> wrote:
> On Wed, Oct 14, 2015 at 3:03 PM, Krishna Kumar (Engineering)
> <krishna...@flipkart.com> wrote:
>> Hi all,
>>
>> We are occasionally getting these messages (about 25 errors/per occurrence,
>> 1 occurrence per hour) in the *error* log:
>>
>> 10.xx.xxx.xx:60086 [14/Oct/2015:04:21:25.048] Alert-FE
>> Alert-BE/10.xx.xx.xx 0/5000/1/32/+5033 200 +149 - - --NN 370/4/1/0/+1
>> 0/0 {10.xx.x.xxx||367||} {|||432} "POST /fk-alert-service/nsca
>> HTTP/1.1"
>> 10.xx.xxx.xx:60046 [14/Oct/2015:04:21:19.936] Alert-FE
>> Alert-BE/10.xx.xx.xx 0/5000/1/21/+5022 200 +149 - - --NN 302/8/2/0/+1
>> 0/0 {10.xx.x.xxx||237||} {|||302} "POST /fk-alert-service/nsca
>> HTTP/1.1"
>> ...
>>
>> We are unsure what errors were seen at the client. What could possibly be the
>> reason for these? Every error line has retries value as "+1", as seen above. 
>> The
>> specific options in the configuration are (HAProxy v1.5.12):
>>
>> 1. "retries 1"
>> 2. "option redispatch"
>> 3. "option logasap"
>> 4. "timeout connect 5000", server and client timeouts are high - 300s
>> 5. Number of backend servers is 7.
>> 6. ulimit is 512K
>> 7. balance is "roundrobin"
>>
>> Thank you for any leads/insights.
>>
>> Regards,
>> - Krishna Kumar
>>
>
> Hi Krishna,
>
> First, I don't understand how the "retries 1" and the "redispatch"
> works together in your case.
> I mean, redispatch is supposed to be applied at 'retries - 1'...
>
> So basically, what may be happening:
> - because of logasap, HAProxy does not wait until the end of the
> session to generate the log line
> - this log is in error because a connection was attempted (and failed)
> on a server
>
> You should not set up any ulimit; let HAProxy do the job for you.
>
> Baptiste



Re: Multi-part message failure during http mode (haproxy 1.5.12)

2015-08-06 Thread Krishna Kumar (Engineering)
Not to spam again, but this is a request to anyone who has faced this and
knows how to get around or fix it. I checked the source a bit; the only
reference to multipart messages is in the compression code (do not
compress multi-part).

Thank you.

On Wed, Aug 5, 2015 at 9:26 AM, Krishna Kumar (Engineering)
krishna...@flipkart.com wrote:
 Hi all,

 We are getting either ECONNRESET, or sometimes ETIMEDOUT, errors
 when the backend sends a large multi-part message via haproxy. It
 works for small file sizes of 4K and 8K, but fails for > 2 Mb files. Is
 there any option or setup that will help fix this?

 Thanks,
 - Krishna Kumar




Multi-part message failure during http mode (haproxy 1.5.12)

2015-08-04 Thread Krishna Kumar (Engineering)
Hi all,

We are getting either ECONNRESET, or sometimes ETIMEDOUT, errors
when the backend sends a large multi-part message via haproxy. It
works for small file sizes of 4K and 8K, but fails for > 2 Mb files. Is
there any option or setup that will help fix this?

Thanks,
- Krishna Kumar




How to disable backend servers without health check

2015-07-16 Thread Krishna Kumar (Engineering)
Hi all,

We have a large set of machines running haproxy (1.5.12), and each of
them has hundreds of backends, many of which are the same across
systems. nbproc is set to 12 at present for our 48-core systems. We are
planning a centralized health check, and to disable the same in haproxy, to
avoid each process on each server doing a health check for the same
backend servers.

Is there any way to disable a single backend from the command line, such
that each haproxy instance finds that this backend is disabled? Using
socat with the socket only makes the handling process set its status for
the backend to MAINT, but the others don't get this information.

Appreciate if someone can show if this can be done.

Regards,
- Krishna Kumar



Re: Strange system behaviour of during haproxy run

2015-07-16 Thread Krishna Kumar (Engineering)
On Tue, Jul 7, 2015 at 2:31 PM, Willy Tarreau w...@1wt.eu wrote:

Hi Willy,

Thank you once again for the quick response, and apologies for my tardiness.

Unless I'm wrong, for me ixgbe does its own RSS and uses the Rx queues. Each
 Rx queue is bound to an IRQ, and each IRQ may be delivered to a set of CPUs.
 I'd say that I've never seen even an iota of CPU usage on the wrong CPU when
 using ixgbe, so I'm quite sure that some of your IRQs are not properly
 bound, which could be verified by simply running grep eth in
 /proc/interrupts, though I can easily understand that it's not easy to read
 with 48 columns.


The irq counts increase for all cores, though haproxy runs only on even
cores. And smp_affinity is set correctly, going from 1, 2, 4, 8, 10, 20, 40,
80, 100 (all in hex), etc. I have not seen this issue under lower load, but
when the cpus are at 100%, I am getting this behavior. I will try to
recreate it with low and high load and validate this statement.
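
As an aside, a small sketch of what our boot-time pinning does (illustrative
C only; in practice this is a shell script, and the IRQ list and the
IRQ-to-CPU mapping shown here are assumptions):

#include <stdio.h>

/* Bind each IRQ in irqs[] to a single CPU by writing a one-bit hex mask
 * to /proc/irq/<n>/smp_affinity. Here IRQ i goes to cpu 2*i, i.e. the
 * even-numbered (NUMA-0) cores; masks wider than 64 cpus need commas.
 */
static int pin_irqs(const int *irqs, int n)
{
	char path[64];
	int i;

	for (i = 0; i < n; i++) {
		FILE *f;

		snprintf(path, sizeof(path), "/proc/irq/%d/smp_affinity", irqs[i]);
		f = fopen(path, "w");
		if (!f)
			return -1;
		fprintf(f, "%llx\n", 1ULL << (2 * i)); /* 1, 4, 10, 40, ... */
		fclose(f);
	}
	return 0;
}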

No it's not that small because a flow is something ephemeral. Even when
 you run with hundreds of thousands of concurrent connections, the time
 between an outgoing ACK and the next data packet is short enough for the
 flow to be matched, especially on a local network when it's a matter of
 hundreds of microseconds at worst.


Yes, that seems reasonable. But if the 8K flows are exhausted during a
single tx iteration, I assume RX for some flows will not find the flow
cached for RSS.

Just out of curiosity, why not configure the lab to reproduce the same
 setup as the prod, ie 12 processes as well ?


Yes, I will try this in a little while, as we are busy with production
setup at present.


 At the very least, check the interrupts per CPU. There I'm sure you'll
 find that you missed some of them and that they're still delivered to
 the wrong socket. Also keep in mind that a down-up cycle on an interface
 unbinds the IRQs, so you have to rebind them by hand. This is something
 which easily happens by mistake during troubleshooting sessions.


Since we are doing this for hundreds of systems, the IRQs are being set at
boot time through an rc2 script.

Thanks once again for your inputs.

Regards,
- Krishna Kumar



Re: How to disable backend servers without health check

2015-07-16 Thread Krishna Kumar (Engineering)
Hi John,

Your suggestion works very well, and exactly what I was looking for.
Thank you very much.

Regards,
- Krishna Kumar


On Thu, Jul 16, 2015 at 5:48 PM, John Skarbek jskar...@rallydev.com wrote:

 Krishna,

 I've recently had to deal with this as well.  Our solution involves a
 couple of aspects.  Firstly, one must configure an admin socket per
 process.  In our case we run with 20 processes, so we've got a
 configuration that looks similar to this in our global section:

   stats socket /var/run/haproxy_admin1.sock mode 600 level admin process 1
   stats socket /var/run/haproxy_admin2.sock mode 600 level admin process 2
   stats socket /var/run/haproxy_admin3.sock mode 600 level admin process 3
   stats socket /var/run/haproxy_admin4.sock mode 600 level admin process 4

 Counting all the way up to 20...

 After that we can do a simple one liner that disables a single server;
 using bash:

 for i in {1..20}; do echo 'disable server the_backend/the_server' | socat
 /var/run/haproxy_admin$i.sock stdio; done

 This loops through each admin socket and disables 'the_server' from
 'the_backend'.

 I hope this gets you started in looking for a solution.

 I like your route of accomplishing this though.  With our 20-proc
 configuration we've decided to deal with the pain of 20 health checks,
 which has caused us some issues, but nothing that is a show stopper.

 On Thu, Jul 16, 2015 at 5:53 AM, Krishna Kumar (Engineering) 
 krishna...@flipkart.com wrote:

 Hi all,

 We have a large set of machines running haproxy (1.5.12), and each of
 them have hundreds of backends, many of which are the same across
 systems. nbproc is set to 12 at present for our 48 core systems. We are
 planning a centralized health check, and disable the same in haproxy, to
 avoid each process on each server doing health check for the same
 backend servers.

 Is there any way to disable a single backend from the command line, such
 that each haproxy instance finds that this backend is disabled? Using
 socat with the socket only makes the handling process set it's status of
 the
 backend as MAINT, but others don't get this information.

 Appreciate if someone can show if this can be done.

 Regards,
 - Krishna Kumar



 --

 This email and any files transmitted with it are confidential and
 intended solely for the use of the individual or entity to whom they are
 addressed. If you have received this email in error please notify the
 system manager. This message contains confidential information and is
 intended only for the individual named. If you are not the named addressee
 you should not disseminate, distribute or copy this e-mail. Please notify
 the sender immediately by e-mail if you have received this e-mail by
 mistake and delete this e-mail from your system. If you are not the
 intended recipient you are notified that disclosing, copying, distributing
 or taking any action in reliance on the contents of this information is
 strictly prohibited. Although Flipkart has taken reasonable precautions to
 ensure no viruses are present in this email, the company cannot accept
 responsibility for any loss or damage arising from the use of this email or
 attachments




 --


 John T Skarbek | jskar...@rallydev.com

 Infrastructure Engineer, Engineering

 1101 Haynes Street, Suite 105, Raleigh, NC 27604

 720.921.8126 Office




Re: How to disable backend servers without health check

2015-07-16 Thread Krishna Kumar (Engineering)
Thanks Pavlos, this looks very promising, I will take a look on how we
can use this.

Regards,
- Krishna Kumar

On Thu, Jul 16, 2015 at 8:36 PM, Pavlos Parissis pavlos.paris...@gmail.com
wrote:



 On 16/07/2015 04:02 μμ, Krishna Kumar (Engineering) wrote:
  Hi John,
 
  Your suggestion works very well, and exactly what I was looking for.
  Thank you very much.
 


 You could also try  https://github.com/unixsurfer/haproxytool

 Cheers,
 Pavlos





Re: Strange system behaviour of during haproxy run

2015-07-07 Thread Krishna Kumar (Engineering)
Hi Willy,

Thanks for your response, I know this is not at all related to haproxy, so
your help is really appreciated! Also irqbalance is not running on the
systems.

 That's normal if you pinned the IRQs to cpus 0-47, you'd like to pin IRQs
 only to the CPUs of the first socket (ie: 0, 2, 4, 6, ..., 46 if I
 understand it right).

I think that will still not solve the issue, since packets can come in on
irq1, which is directly wired to cpu0, but the rx receive code path on cpu0
decides which cpu this is meant for, takes that lock, and still does an IPI.
I expected that packets for a flow would go to the correct cpu, and thought
this could be related to the size of the Intel Flow Director table?

root@1098366dea41:~# ethtool -S em1 | grep fdir_
 fdir_match: 1596533665 (29%)
 fdir_miss:3908617542 (71%)
 fdir_overflow: 25408

The Intel document says: And the Intel® Ethernet Flow Director
Perfect-Match Filter Table has to be large enough to capture the
unique flows a given controller would typically see. For that reason,
Intel Ethernet Flow Director’s Perfect-Match Filter Table has 8k entries.

http://www.intel.in/content/dam/www/public/us/en/documents/white-papers/intel-ethernet-flow-director.pdf

8K is very small for haproxy serving many more connections; indeed, the
fdir_miss counter above shows 71% of packets missing the filter table.

While trying to simulate this in a small lab setup, I ran 4 wrk's against
haproxy; top showed 3 haproxy processes running on cpus 2, 4 and 8, and the
tx/rx counters of only queues 2, 4 and 8 incremented during the run, always
consistently. So it worked as expected, but the load on the system was very
low.

In our production testing, a thousand client VMs are doing 50K connections
each to haproxy (running on 60 servers), and here I noticed that haproxy on
every server runs on cpus 0,2,4..22, but the rx/tx counters of all queues
increment.

Assuming this is the issue, which I am not sure of, do you have any ideas
how to get around this, or any other suggestions?

Thanks,
- Krishna Kumar


On Tue, Jul 7, 2015 at 1:38 PM, Willy Tarreau w...@1wt.eu wrote:

 Hi,

 On Tue, Jul 07, 2015 at 01:24:28PM +0530, Krishna Kumar (Engineering)
 wrote:
  Hi all,
 
  This is not related to haproxy, but I am having a performance issue with
  number of
  packets processed. I am running haproxy on a 48 core system (we have 64
  such servers
  at present, which is going to increase for production testing), where
 cpus
  0,2,4,6,..46
  are part of NUMA node 1, and cpus 1,3,5,7,.. 47 are part of NUMA node 2.
  The systems
  are running Debian 7, with 3.16.0-23 (kernel has both CONFIG_XPS and
  CONFIG_RPS
  enabled). nbproc is set to 12, and each haproxy is bound to cpus 0,2,4,
 ...
  22, so that
  they are on the same socket, as seen here:
 
  # ps -efF | egrep hap|PID | cut -c1-80
  UID PID   PPID  CSZ   RSS PSR STIME TTY  TIME CMD
  haproxy3099  1 17 89697 324024  0 18:37 ?00:11:19 haproxy
  -f hap
  haproxy3100  1 18 87171 314324  2 18:37 ?00:12:00 haproxy
  -f hap
  haproxy3101  1 18 87214 305328  4 18:37 ?00:12:00 haproxy
  -f hap
  haproxy3102  1 19 89215 322676  6 18:37 ?00:12:02 haproxy
  -f hap
  haproxy3103  1 18 86788 310976  8 18:37 ?00:11:59 haproxy
  -f hap
  haproxy3104  1 18 87197 314888 10 18:37 ?00:12:00 haproxy
  -f hap
  haproxy3105  1 18 91311 319784 12 18:37 ?00:11:59 haproxy
  -f hap
  haproxy3106  1 18 88785 305576 14 18:37 ?00:12:00 haproxy
  -f hap
  haproxy3107  1 19 90366 326428 16 18:37 ?00:12:09 haproxy
  -f hap
  haproxy3108  1 19 89758 320780 18 18:37 ?00:12:09 haproxy
  -f hap
  haproxy3109  1 19 87670 314752 20 18:37 ?00:12:07 haproxy
  -f hap
  haproxy3110  1 19 87763 316672 22 18:37 ?00:12:10 haproxy
  -f hap
 
  set_irq_affinity.sh was run on the ixgbe card, and
 /proc/irq/*/smp_affinity
  shows that each
  irq is bound to cpus 0-47 correctly. However, I see that packets are
 being
  processed on
  cpus of the 2nd socket too, though user/system usage is zero on those as
  haproxy does
  not run on those cores.

 That's normal if you pinned the IRQs to cpus 0-47, you'd like to pin IRQs
 only to the CPUs of the first socket (ie: 0, 2, 4, 6, ..., 46 if I
 understand
 it right).

 Also, double-check that you don't have irqbalance. It can change the
 settings in your back, that's really unpleasant.

 Willy




Strange system behaviour during haproxy run

2015-07-07 Thread Krishna Kumar (Engineering)
Hi all,

This is not related to haproxy, but I am having a performance issue with
number of
packets processed. I am running haproxy on a 48 core system (we have 64
such servers
at present, which is going to increase for production testing), where cpus
0,2,4,6,..46
are part of NUMA node 1, and cpus 1,3,5,7,.. 47 are part of NUMA node 2.
The systems
are running Debian 7, with 3.16.0-23 (kernel has both CONFIG_XPS and
CONFIG_RPS
enabled). nbproc is set to 12, and each haproxy is bound to cpus 0,2,4, ...
22, so that
they are on the same socket, as seen here:

# ps -efF | egrep 'hap|PID' | cut -c1-80
UID PID   PPID  CSZ   RSS PSR STIME TTY  TIME CMD
haproxy3099  1 17 89697 324024  0 18:37 ?00:11:19 haproxy
-f hap
haproxy3100  1 18 87171 314324  2 18:37 ?00:12:00 haproxy
-f hap
haproxy3101  1 18 87214 305328  4 18:37 ?00:12:00 haproxy
-f hap
haproxy3102  1 19 89215 322676  6 18:37 ?00:12:02 haproxy
-f hap
haproxy3103  1 18 86788 310976  8 18:37 ?00:11:59 haproxy
-f hap
haproxy3104  1 18 87197 314888 10 18:37 ?00:12:00 haproxy
-f hap
haproxy3105  1 18 91311 319784 12 18:37 ?00:11:59 haproxy
-f hap
haproxy3106  1 18 88785 305576 14 18:37 ?00:12:00 haproxy
-f hap
haproxy3107  1 19 90366 326428 16 18:37 ?00:12:09 haproxy
-f hap
haproxy3108  1 19 89758 320780 18 18:37 ?00:12:09 haproxy
-f hap
haproxy3109  1 19 87670 314752 20 18:37 ?00:12:07 haproxy
-f hap
haproxy3110  1 19 87763 316672 22 18:37 ?00:12:10 haproxy
-f hap

set_irq_affinity.sh was run on the ixgbe card, and /proc/irq/*/smp_affinity
shows that each
irq is bound to cpus 0-47 correctly. However, I see that packets are being
processed on
cpus of the 2nd socket too, though user/system usage is zero on those as
haproxy does
not run on those cores. The following shows the difference of number of
packets processed
after 10 seconds on the different rx/tx queues:

# ./rx_tx   /tmp/ethtool_start /tmp/ethtool_end
Significant difference in #packets processed after 10 seconds on the
various rx/tx queues:
Queue#      TX      RX
0   2623165 2826065
1   2564573 2749859
2   2901998 2801043
3   2636856 2794000
4   2892465 2742228
5   3087442 2795762
6   2936588 2760732
7   2934087 2767705
8   2260933 2767707
9   2165087 2759038
10  2144893 2814390
11  2302304 2835790
12  3037722 2748335
13  2940284 2727689
14  2348277 2830378
15  2117679 2838013
16  2679899 487703
17  2447832 438733
18  2505330 429834
19  2611643 447960
20  2595708 449729
21  2534836 447217
22  2616150 466920
23  2522947 450145
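
(The rx_tx helper used above is essentially a diff of two 'ethtool -S'
snapshots; a minimal sketch of the same idea, assuming eth0:)

ethtool -S eth0 > /tmp/ethtool_start
sleep 10
ethtool -S eth0 > /tmp/ethtool_end
# print every per-queue counter that moved, with its delta
awk 'NR == FNR { start[$1] = $2; next }
     $1 ~ /queue/ && ($2 - start[$1]) > 0 { print $1, $2 - start[$1] }' \
    /tmp/ethtool_start /tmp/ethtool_end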

mpstat shows that the even-numbered cpus 0-22 are heavily used, while the odd
ones only do softirq processing:

Average:    CPU    %usr    %sys   %soft    %idle
Average:      0   15.47   60.0   24.47     0.00
Average:      1    0.00    0.00  12.86    87.14
Average:      2   20.32   58.49  21.19     0.00
Average:      3    0.10    0.00   2.59    97.30
Average:      4   18.20   60.87  20.93     0.00
Average:      5    0.10    0.00   4.15    95.75
Average:      6   18.75   59.37  21.88     0.00
Average:      7    0.00    0.00   3.03    96.97
Average:      8   22.75   57.71  19.55     0.00
Average:      9    0.00    0.00   2.78    97.22
Average:     10   21.87   57.67  20.47     0.00
Average:     11    0.00    0.00   2.80    97.20
Average:     12   19.48   59.84  20.68     0.00
Average:     13    0.00    0.00   1.76    98.24
Average:     14   22.58   57.16  20.25     0.00
Average:     15    0.00    0.00   1.57    98.43
Average:     16   27.00   67.00   6.00     0.00
Average:     17    0.00    0.07   0.59    99.27
Average:     18   26.17   67.84   5.93     0.07
Average:     19    0.00    0.00   0.15    99.78
Average:     20   26.52   67.36   6.13     0.00
Average:     21    0.00    0.00   0.30    99.63
Average:     22   27.69   66.71   5.60     0.00
Average:     23    0.00    0.00   0.07    99.93
Average:     24    0.00    0.00   0.00   100.00
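
Since the kernel here has CONFIG_XPS, one knob that can keep transmit work on
the cpus actually running haproxy is the per-tx-queue XPS masks; a minimal
sketch, assuming eth0 and the 12 even-numbered cpus 0-22 used above:

q=0
for f in /sys/class/net/eth0/queues/tx-*/xps_cpus; do
    cpu=$(( (q % 12) * 2 ))              # 0,2,4,...,22: the haproxy cpus
    printf '%x\n' $((1 << cpu)) > "$f"   # one-cpu hex mask per tx queue
    q=$((q + 1))
done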

Re: LB as a first row of defence against DDoS

2015-06-24 Thread Krishna Kumar (Engineering)
 On Wed, Jun 24, 2015 at 11:33 PM, Shawn Heisey hapr...@elyograg.org
wrote:

I agree - the blog talks of handling multiple attacks individually, but what we
are trying to understand is how to handle multiple types of attacks in a single
configuration. Not the exact configuration file, but the concept for
implementing this (assuming it is something that can be explained).
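
As a rough illustration of the combining (only a sketch with made-up
thresholds, not a recommended config): several protections can be stacked in
one frontend, all tracking the same stick-table entry for the source address:

frontend www
    bind *:80
    stick-table type ip size 200k expire 30s store conn_cur,conn_rate(10s),http_req_rate(10s)
    tcp-request connection track-sc1 src
    # each rule targets a different attack pattern
    tcp-request connection reject if { sc1_conn_cur gt 50 }     # too many concurrent conns
    tcp-request connection reject if { sc1_conn_rate gt 100 }   # connection flood
    http-request deny if { sc1_http_req_rate gt 500 }           # HTTP request flood
    default_backend www-backend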

I think this is a great and very performant piece of software, with a very
helpful community. Thanks a lot to all contributors, and especially to Willy
and Baptiste for the useful blogs that have helped people adopt haproxy for
their LB needs.

Regards,
- Krishna Kumar


 On 6/24/2015 11:12 AM, Willy Tarreau wrote:
  The problem with configs posted on a blog is that people blindly
 copy-paste
  them without understanding and then break a lot of things and ask for
 help.
  Baptiste takes care of explaining how things work so that people can pick
  what they need. There's no universal anti-ddos config, we've built a lot
 of
  different ones in the past. Each config is almost unique in fact,
 depending
  on business cases. You need to keep in mind that fighting DDoS consists
 in
  differentiating what looks like a regular visitor *in your case* and what
  is not. Quite commonly it's extremely tricky and even between various
  applications hosted behind the same LB you can apply different
 mechanisms.
  For example for certain apps it's totally abnormal to have more than X
  concurrent connections from a single IP address while in other cases it's
  normal, even to have a lot of requests using a same cookie (think
 completion
  for example).
 
  So it is important to understand the concepts, how the tools work and can
  help, then to analyse what happens in your situation and how to fight
 when
  the problem happens. You'll even notice that you'll change your
 protections
  from one attack to another.

 I always treat sample configs as a starting point that will need
 significant tweaking for my specific situation.  For instance, I already
 know that 10 connections from one IP address won't be enough for several
 of our websites, partly because there are some customers who have
 several users in one location who will almost certainly be connecting
 from the same public IP address.

 That said, I know that there are plenty of people out there who will
 copy/paste a sample config and expect it to make their bed and fillet
 their fish.  I get irritated with those people who won't make an effort
 to actually understand what their systems are doing.

 For this specific situation, I'm hoping to learn how to successfully
 combine the techniques on the blog post into one config without screwing
 it up.  If I run into trouble, I will try to solve it on my own before I
 come back here to ask for help, and if that's required, I will try to
 ask intelligent questions and provide all relevant information at the
 start.

  The subject is really vast. You could have one week full of training on
 the
  subject and still feel naked at the end.

 I've gotten that impression.  I use a number of other open source
 projects which have even steeper learning curves.  The basics of haproxy
 were quite easy to grasp, but I know that there's a lot of unexplored
 depth, some of which I may never use.

 Thank you for everything you do.  You are one of the unsung heroes who
 make the guts of the Internet possible.

 Shawn






LB as a first row of defence against DDoS

2015-06-17 Thread Krishna Kumar (Engineering)
Referring to Baptiste's excellent blog, Use a lb as a first row of defense
against DDoS, @

http://blog.haproxy.com/2012/02/27/use-a-load-balancer-as-a-first-row-of-defense-against-ddos/

I am not able to find a follow-up, if one was written, on combining the
configuration examples to improve protection. Is there another article
explaining how to combine the configuration settings to protect against
multiple types of DoS attacks, and if not, how would one do this?

Thanks,
- Krishna Kumar



Re: Health check of backends without explicit health-check?

2015-06-17 Thread Krishna Kumar (Engineering)
On Tue, Jun 16, 2015 at 4:29 PM, Krishna Kumar (Engineering) 
krishna...@flipkart.com wrote:

I was referring to HAProxy as the LB here. If there is any means to do this,
kindly let me know.

Thanks,
- Krishna Kumar


Hi list,

 Is there any way to log, or report, or notify, or identify any backend
 that is not responding, without using explicit health-checks? The
 reason for this is that we are planning a big deployment of LB/servers,
 something along the lines of:

 LB1, LB2, LB100 or more
 ^
 |
 v
 Thousands of servers as backends

 where many of the LB's could share the same backend. Doing a health-
 check from many LB's to the same servers is a possible load issue on
 the servers. Is there any other way, based on response timeout, or
 something else, to determine which of the backends are not responding,
 and be able to retrieve that information?

 Thanks,
 - Krishna Kumar




Re: Health check of backends without explicit health-check?

2015-06-17 Thread Krishna Kumar (Engineering)
Thanks Baptiste.

Regards,
- Krishna Kumar

On Wed, Jun 17, 2015 at 3:38 PM, Baptiste bed...@gmail.com wrote:

 Hi Krishna,

 Usually, people use a service discovery tool to do this.
 Some other people use a local service to cache the check response and
 serve it to all haproxy servers.

 Baptiste


 On Wed, Jun 17, 2015 at 11:38 AM, Krishna Kumar (Engineering)
 krishna...@flipkart.com wrote:
  On Tue, Jun 16, 2015 at 4:29 PM, Krishna Kumar (Engineering)
  krishna...@flipkart.com wrote:
 
  I was referring to HAProxy as the LB here. If there is any means to do
 this,
  kindly let me know.
 
  Thanks,
  - Krishna Kumar
 
 
 
  Hi list,
 
  Is there any way to log, or report, or notify, or identify any backend
  that is not responding, without using explicit health-checks? The
  reason for this is that we are planning a big deployment of LB/servers,
  something along the lines of:
 
  LB1, LB2, LB100 or more
  ^
  |
  v
  Thousands of servers as backends
 
  where many of the LB's could share the same backend. Doing a health-
  check from many LB's to the same servers is a possible load issue on
  the servers. Is there any other way, based on response timeout, or
  something else, to determine which of the backends are not responding,
  and be able to retrieve that information?
 
  Thanks,
  - Krishna Kumar





Health check of backends without explicit health-check?

2015-06-16 Thread Krishna Kumar (Engineering)
Hi list,

Is there any way to log, or report, or notify, or identify any backend
that is not responding, without using explicit health-checks? The
reason for this is that we are planning a big deployment of LB/servers,
something along the lines of:

LB1, LB2, LB100 or more
^
|
v
Thousands of servers as backends

where many of the LB's could share the same backend. Doing a health-
check from many LB's to the same servers is a possible load issue on
the servers. Is there any other way, based on response timeout, or
something else, to determine which of the backends are not responding,
and be able to retrieve that information?
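
One mechanism that seems to fit this case is haproxy's passive monitoring: the
'observe' server keyword judges a server by its responses to live traffic
instead of by probes. A minimal sketch (addresses and thresholds made up):

backend www-backend
    balance roundrobin
    # mark a server down after 10 consecutive errors seen on real responses
    server app1 10.0.0.1:80 observe layer7 error-limit 10 on-error mark-down
    server app2 10.0.0.2:80 observe layer7 error-limit 10 on-error mark-down

Note that a server marked down this way stays down until it is re-enabled (for
example via the stats socket), since there is no active check to bring it back.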

Thanks,
- Krishna Kumar



Re: HAProxy SSL performance issue

2015-05-22 Thread Krishna Kumar (Engineering)
 On Thu, May 21, 2015 at 5:58 PM, Willy Tarreau w...@1wt.eu wrote:

Hi Willy,

Thank you for your reply.

 I suspect the BW unit is bytes per second above though I could be

That's correct, and the BW is as you had stated: 8 Gbps vs 2.8 Gbps.

 Hmmm, would you be running from multiple load generators connected via

No, I am running a single 'ab' command from 1 node.

 I'm thinking about something else, could you retry with less or more total
 objects in the /128 case (or the 16k case) ? The thing is that ab starts

I tried with -n 1000 but it also hangs at 90%. More details on this below.

 You may want to try openssl-1.0.2a which significantly improved
performance

Thank you, I upgraded to 1.0.2a today before testing further.

 You should do this instead to have 3 distinct sockets each in its own
  process (but warning, this requires a kernel >= 3.9):

Yes, I am running 3.19.6, so have made this change too, and for :443.
Thanks for
the explanation.

 Another thing that can be done is to compare the setup above with
6-process
 per frontend. You can even have everything in the same frontend by the
way :

I tried this without any improvement.

 I fail to see how this is possible, the Xeon E5-2670 is 8-core and
 supports 2 CPU configurations max. So that's 16 cores max in total.

It is the v3 processor Intel Xeon Processor E5-2670 v3. lscpu shows:
NUMA node0 CPU(s):
0,2,4,6,8,10,12,14,16,18,20,22,24,26,28,30,32,34,36,38,40,42,44,46
NUMA node1 CPU(s):
1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,45,47

 OK. What do you mean by correct, you mean the same CPU package as
 the one running haproxy so as not to pass the data over QPI, right ?

Yes, I used David Miller's set_irq_affinity.sh script, which maps irq0 to cpu0,
irq1 to cpu1, and so on. Following are the interrupt counts on the different
irq's after a reboot and a test run:

irq#   List of cpus#'s (and #interrupts on each)
IRQ-0 (36):0(1652395)
IRQ-1 (37):1(267916)
IRQ-2 (38):2(1639163)
IRQ-3 (39):3(270744)
IRQ-4 (40):4(1651315)
IRQ-5 (41):5(270939)
IRQ-6 (42):6(1637431)
IRQ-7 (43):7(270505)
IRQ-8 (44):8(1643712)
IRQ-9 (45):9(271290)
IRQ-10 (46):   10(1644798)
IRQ-11 (47):   11(269653)
IRQ-12 (48):   12(270003)
IRQ-13 (49):   13(271268)
IRQ-14 (50):   14(270255)
IRQ-15 (51):   15(271206)

When 'ab' at the client uses -k, interrupts are generated on the even cpu's
0-10 of the haproxy node (which explains why the odd irq's above have counts
too, though smaller, due to a mix of testing with and without the -k option).
Without -k, interrupts are generated on all cpus 0-10, including the odd ones.

 This certainly is a side effect of the imbalance above combined with ab
which
 keeps the same connection from the beginning to the end of the test.

With the new configuration file (below), I was able to get some more
information on
what is going on:

1. Without the -k option to 'ab', the SSL test works and finishes for all I/O
   sizes. With the following configuration file (processes 1-3 on :80,
   processes 4-6 on :443), 3 haproxy's run and finish the work.
2. With the -k option to 'ab', 3 haproxy's start off in response; they run for
   about 1 second (as seen in 'top'), then 2 stop handling work while only 1
   continues, and after 90%, the sole haproxy also stops, and the client soon
   prints the 70007 error. Sometimes the sole working haproxy stops
   immediately, and I get an error before 10% is done. This happens only for
   large I/O, like 128K. With 128 bytes, all haproxies run till 'ab' completes
   successfully. Similarly, it works for I/O of 7000 bytes, but fails at
   >= 8000.
3. 'ab' to the backend, with or without -k, works without issues for any size.

#2 above seems very suspicious, and happens every time. With your above
suggestion to have a single frontend, I saw that all 6 start, and 5 stop at
about 1 second, and the test finally hangs. Without -k, all 6 run and 'ab'
finishes.

Regards,
- Krishna Kumar

Configuration file (I have tried bind-process 1 2 3 and bind-process 4 5 6 in
the two backends below; there was no difference in the above behavior):

global
    daemon
    quiet
    nbproc 6
    cpu-map 1 0
    cpu-map 2 2
    cpu-map 3 4
    cpu-map 4 6
    cpu-map 5 8
    cpu-map 6 10
    user haproxy
    group haproxy
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m
    tune.bufsize 32768

userlist stats-auth
    group admin users admin
    user admin insecure-password admin

defaults
    mode http
    retries 3
    option forwardfor
    option redispatch
    option prefer-last-server
    option splice-auto

frontend www-http
    bind *:80 process 1
    bind *:80 process 2
    bind *:80 process 3
    stats uri /stats
    stats enable
    acl AUTH http_auth(stats-auth)
    acl AUTH_ADMIN http_auth(stats-auth) admin
    stats http-request auth unless AUTH
    default_backend www-backend

frontend www-https
    bind *:443 

HAProxy SSL performance issue

2015-05-21 Thread Krishna Kumar (Engineering)
Hi all,

I am getting a big performance hit with SSL termination for small I/O, and
errors
when testing with bigger I/O sizes (ab version is 2.3):

1. Non-SSL vs SSL for small I/O (128 bytes):
   ab -k -n 100 -c 500 http://HAPROXY/128

   RPS: 181763.65 vs 133611.69 - 27% drop
   BW:  63546.28  vs 46711.90  - 27% drop

2. Non-SSL vs SSL for medium I/O (16 KB):
   ab -k -n 100 -c 500 http://HAPROXY/16K

   RPS: 62646.13   vs 21876.33  (fails mostly with 70007 error as below) - 65% drop
   BW:  1016531.41 vs 354977.59 (fails mostly with 70007 error)          - 65% drop

3. Non-SSL vs SSL for large I/O (128 KB):
   ab -k -n 10 -c 500 http://HAPROXY/128K

   RPS: 8476.99    vs apr_poll: The timeout specified has expired (70007)
   BW:  1086983.11 vs same error; this happens after 9 requests
   (always reproducible).

--- HAProxy Build info
-
HA-Proxy version 1.5.12 2015/05/02
Copyright 2000-2015 Willy Tarreau w...@1wt.eu

Build options :
  TARGET  = linux2628
  CPU = native
  CC  = gcc
  CFLAGS  = -O3 -march=native -g -fno-strict-aliasing
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.8
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.35 2014-04-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.
--- Config file - even cpu cores are on the 1st socket on the motherboard, odd
cpus are on the 2nd ---
global
    daemon
    maxconn 5
    quiet
    nbproc 6
    cpu-map 1 0
    cpu-map 2 2
    cpu-map 3 4
    cpu-map 4 6
    cpu-map 5 8
    cpu-map 6 10
    user haproxy
    group haproxy
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m
    tune.bufsize 32768

userlist stats-auth
    group admin users admin
    user admin insecure-password admin

defaults
    mode http
    maxconn 5
    retries 3
    option forwardfor
    option redispatch
    option prefer-last-server
    option splice-auto

frontend www-http
    bind-process 1 2 3
    bind *:80
    stats uri /stats
    stats enable
    acl AUTH http_auth(stats-auth)
    acl AUTH_ADMIN http_auth(stats-auth) admin
    stats http-request auth unless AUTH
    default_backend www-backend

frontend www-https
    bind-process 4 5 6
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem
    reqadd X-Forwarded-Proto:\ https
    default_backend www-backend-ssl

backend www-backend
    bind-process 1 2 3
    mode http
    balance roundrobin
    cookie FKSID prefix indirect nocache
    server nginx-1 172.20.232.122:80 maxconn 25000 check
    server nginx-2 172.20.232.125:80 maxconn 25000 check

backend www-backend-ssl
    bind-process 4 5 6
    mode http
    balance roundrobin
    cookie FKSID prefix indirect nocache
    server nginx-1 172.20.232.122:80 maxconn 25000 check
    server nginx-2 172.20.232.125:80 maxconn 25000 check
---
CPU is E5-2670, 48 core system, nic interrupts are pinned to the correct cpu's,
etc. Can someone suggest what change is required to get better results, as well
as fix the 70007 error, or share their config settings? The stats are also
captured. For 128 bytes, all 3 haproxy's are running, but for 16K and for 128K,
only the last haproxy is being used (and seen consistently):

-- MPSTAT and PIDSTAT
-
128 byte, port 80
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:    0  22.33   0.00  39.43     0.00  0.00   9.98    0.00    0.00    0.00  28.27
Average:    2  22.00   0.00  33.56     0.00  0.00  15.11    0.00    0.00    0.00  29.33
Average:    4  23.39   0.00  36.99     0.00  0.00  10.50    0.00    0.00    0.00  29.12

(First 3 haproxy's are used, last 3 are zero and not shown):
Average:  UID   PID   %usr  %system  %guest   %CPU  CPU  Command
Average:  110  5728  22.80    50.00    0.00  72.80    -  haproxy
Average:  110  5729  22.20    48.60    0.00  70.80    -  haproxy
Average:  110  5730  24.20    48.00    0.00  72.20    -  haproxy

128 byte, port 443
Average:  CPU   %usr  %nice   %sys 

Re: Issue with SSL

2015-05-13 Thread Krishna Kumar (Engineering)
Hi Baptiste,

Thank you very much for the tips. I have nbproc=8 in my configuration. Made
the following changes:

Added both the bind and tune.bufsize changes: result - works.
Removed the tune.bufsize: result - works.
Added bind-process for frontend and backend as bind-process 1,2,3,4,5,6,7,8:
result - works.
Removed the bind-process: result - fails.

(the bind-process change you suggested worked for 16K and also for 128K, which
was what I was initially testing before going smaller, to find that 16K failed
and 4K worked)
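
In other words, the working shape is both sides pinned to the same process
set; a minimal sketch of that rule (abbreviated, addresses as in my config):

frontend www-https
    bind-process 1-8
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem
    default_backend www-backend

backend www-backend
    bind-process 1-8    # same process list as the frontend that uses it
    server nginx-1 172.20.232.122:80 check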

The performance for SSL is also very much lower than for regular traffic; it
may be related to configuration settings (about 2x to 3x worse):

128 bytes I/O:
SSL:    BW: 22168.31 KB/s    RPS: 63408.79
NO-SSL: BW: 61193.31 KB/s    RPS: 175033.38

64K bytes I/O:
SSL:    BW: 506393.55 KB/s   RPS: 7884.49
NO-SSL: BW: 1101296.07 KB/s  RPS: 17147.05

I will send the configuration a little later, as it needs heavy cleaning
up, there are
lots of things I want to clean before that.

Thanks,
- Krishna Kumar


On Wed, May 13, 2015 at 3:05 PM, Baptiste bed...@gmail.com wrote:

 On Wed, May 13, 2015 at 10:07 AM, Krishna Kumar (Engineering)
 krishna...@flipkart.com wrote:
  Hi all,
 
  I am having the following problem with SSL + large I/O. Details are:
 
  Distribution: Debian 7, Kernel: 3.19.6, ab version: 2.3, haproxy: 1.5.12,
  nginx: 1.2.1
 
  $ ab -k -n 10 -c 100 http://IP:80/128K
  Works correctly.
 
  $ ab -k -n 1 -c 10 https://IP:443/4K
  Works correctly.
 
  $ ab -k -n 1 -c 10 https://IP:443/128K
  No output, finally the only message is:
  apr_poll: The timeout specified has expired (70007)
 
  $ ab -k -n 1 -c 10 https://IP:443/16K
  No output, finally the only message is:
  apr_poll: The timeout specified has expired (70007)
 
  Configuration file (SSL parts only):
  defaults:
  nbproc=8
  ssl-default-bind-ciphers
 
 kEECDH+aRSA+AES:kRSA+AES:+AES256:RC4-SHA:!kEDH:!LOW:!EXP:!MD5:!aNULL:!eNULL
  ssl-default-bind-options no-sslv3
 
  frontend www-https
  bind *:443 ssl crt /etc/ssl/private/haproxy.pem
  reqadd X-Forwarded-Proto:\ https
  default_backend www-backend
 
  $ haproxy -vv | egrep -i 'ssl|tls'
OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
  Built with OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
  Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
  OpenSSL library supports TLS extensions : yes
  OpenSSL library supports SNI : yes
  OpenSSL library supports prefer-server-ciphers : yes
 
  I found that setting nbproc=1 works for SSL, but setting it to > 1 (2, 4,
 8)
  hangs
  as above. With nbproc=2, I make slightly more progress than with 8
 (system
  has 48 cores though):
 
  $ ab -k -n 1 -c 10 https://IP:443/128K
  apr_poll: The timeout specified has expired (70007)
  Total of 200 requests completed
 
  I tried adding the following to frontend and backend respectively:
 To the frontend - bind-process 1,2
 To the backend - bind-process 3,4,5,6,7,8
 
  How can I fix this issue?
 
  Thanks,
  - Krishna Kumar
 

 Hi Krishna,

 Well, a frontend and a backend must be on the same HAProxy process.
 Please try again by binding all frontend and backend to the same
 process and let us know if you still have the issue.

 Also, could you share with us your whole configuration, since some
 global parameters may have some impact on HAProxy.

 That said, it's weird that it breaks up at 16K...;
 Could you add the following directive in the global section:
 tune.bufsize 32000 and run again the 16K test and report any issue?
 (it's simply a test and should not be used in any case as a workaround!)

 Baptiste



Re: Issue with SSL

2015-05-13 Thread Krishna Kumar (Engineering)
Thanks Baptiste.

The performance for SSL vs regular is very bad. Could someone help
with that? Following is the configuration, test result and the monitoring
tool results (the last is interesting).

- Configuration file   -
global
    daemon
    maxconn 6
    quiet
    nbproc 6
    cpu-map 1 0
    cpu-map 2 1
    cpu-map 3 2
    cpu-map 4 3
    cpu-map 5 4
    cpu-map 6 5
    tune.pipesize 524288
    user haproxy
    group haproxy
    stats socket /var/run/haproxy.sock mode 600 level admin
    stats timeout 2m

defaults
    mode http
    option forwardfor
    retries 3
    option redispatch
    maxconn 6
    option splice-auto
    option prefer-last-server
    timeout connect 5000ms
    timeout client 5ms
    timeout server 5ms

frontend www-http
    bind HAPROXY:80
    default_backend www-backend

frontend www-https
    bind-process 1,2,3,4
    bind HAPROXY:443 ssl crt /etc/ssl/private/haproxy.pem
    default_backend www-backend

backend www-backend
    bind-process 1,2,3,4   # Have tested with this removed also, no difference
    mode http
    maxconn 6
    stats enable
    stats uri /stats
    balance roundrobin
    option prefer-last-server
    option forwardfor
    option splice-auto
    cookie FKSID prefix indirect nocache
    server nginx-1 NGINX-1:80 maxconn 2 check
    server nginx-2 NGINX-2:80 maxconn 2 check

--   Test results   
Test result without SSL:
Requests per second:    181224.69 [#/sec] (mean)
Transfer rate:          51854.33 [Kbytes/sec] received

Test result with SSL:
SSL/TLS Protocol:       TLSv1/SSLv3,ECDHE-RSA-AES256-GCM-SHA384,1024,256
Requests per second:    65313.33 [#/sec] (mean)
Transfer rate:          18688.29 [Kbytes/sec] received
   Monitoring tools results  ---
Pidstat/mpstat without SSL:
pidstat shows each haproxy using similar system resources:
Average:  110  4730  19.60  29.20  0.00  48.80  -  haproxy
(and all remaining 5 are similar)
mpstat is similar:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:    0  18.34   0.00  24.12     0.00  0.00   2.26    0.00    0.00    0.00  55.28

Pidstat/mpstat with SSL:
pidstat shows the first haproxy (on cpu0) is the only one active; the rest are
at 0% for all fields.
Average:  110  4839  35.64  63.56  0.00  99.20  -  haproxy
mpstat is similar to pidstat:
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all   0.75   0.00   0.82     0.01  0.00   0.52    0.00    0.00    0.00  97.90
Average:    0  35.73   0.00  38.93     0.00  0.00  24.80    0.00    0.00    0.00   0.53

Setting bind-process seems to have the effect of reducing performance to 1/3rd
of the original (even for http requests, if I add bind-process in the http
frontend section), as only the first cpu in bind-process is running. If I
remove bind-process, https of 64 bytes goes up 2.5 times to 140K (lower than
http, which is 180K). But of course 64K doesn't work then.
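
One way around the only-one-cpu effect, per Willy's suggestion in the SSL
performance thread, is one bind line per process, so each process gets its own
listening socket (a sketch; needs a kernel >= 3.9 for SO_REUSEPORT):

frontend www-https
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem process 1
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem process 2
    bind *:443 ssl crt /etc/ssl/private/haproxy.pem process 3
    default_backend www-backend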

Thanks again,
- Krishna Kumar


On Wed, May 13, 2015 at 7:05 PM, Baptiste bed...@gmail.com wrote:

 On Wed, May 13, 2015 at 2:16 PM, Krishna Kumar (Engineering)
 krishna...@flipkart.com wrote:
  Hi Baptiste,
 
  Thank you very much for the tips. I have nbproc=8 in my configuration.
 Made
  the
  following changes:
 
  Added both bind and tune.bufsize changeresult -
  works.
  Removed the tune.bufsize
  result - works.
  Added bind-process for frontend and backend as:
  bind-process 1,2,3,4,5,6,7,8
  result - works
  Removed the bind-process
  result - fails.
 
  (the bind-process change you suggested worked for 16K and also for 128K,
  which
  was what I was initially testing before going smaller to find that 16K
  failed and 4K
  worked)
 
  The performance for SSL is also very much lower compared to regular
 traffic,
  it may be related to configuration settings (about 2x to 3x worse):
 
  128 bytes I/O:
  SSL:BW: 22168.31 KB/s  RPS: 63408.79
  NO-SSL: BW: 61193.31 KB/s   RPS: 175033.38
 
  64K bytes I/O:
  SSL:BW: 506393.55 KB/s RPS: 7884.49 rps
  NO-SSL: BW: 1101296.07 KB/sRPS: 17147.05 rps
 
  I will send the configuration a little later, as it needs heavy cleaning
 up,
  there are
  lots of things I want to clean before that.
 
  Thanks,
  - Krishna Kumar
 


 Ok, so we spotted a bug there :)
 At least, HAProxy should warn you that your backend and frontend aren't on
 the same process.
 In my mind, HAProxy silently creates the backend on the frontend's
 process, even if it was not supposed to be there. But this behavior
 may have changed recently.

 No time to dig further in it, but I'll let Willy know so he can check
 about it.

 Simply bear this rule

Issue with SSL

2015-05-13 Thread Krishna Kumar (Engineering)
Hi all,

I am having the following problem with SSL + large I/O. Details are:

Distribution: Debian 7, Kernel: 3.19.6, ab version: 2.3, haproxy: 1.5.12,
nginx: 1.2.1

$ ab -k -n 10 -c 100 http://IP:80/128K
Works correctly.

$ ab -k -n 1 -c 10 https://IP:443/4K
Works correctly.

$ ab -k -n 1 -c 10 https://IP:443/128K
No output, finally the only message is:
apr_poll: The timeout specified has expired (70007)

$ ab -k -n 1 -c 10 https://IP:443/16K
No output, finally the only message is:
apr_poll: The timeout specified has expired (70007)

Configuration file (SSL parts only):
defaults:
nbproc=8
ssl-default-bind-ciphers
kEECDH+aRSA+AES:kRSA+AES:+AES256:RC4-SHA:!kEDH:!LOW:!EXP:!MD5:!aNULL:!eNULL
ssl-default-bind-options no-sslv3

frontend www-https
bind *:443 ssl crt /etc/ssl/private/haproxy.pem
reqadd X-Forwarded-Proto:\ https
default_backend www-backend

$ haproxy -vv | egrep -i 'ssl|tls'
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1 USE_TFO=1
Built with OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes

I found that setting nbproc=1 works for SSL, but setting it to > 1 (2, 4, 8)
hangs
as above. With nbproc=2, I make slightly more progress than with 8 (system
has 48 cores though):

$ ab -k -n 1 -c 10 https://IP:443/128K
apr_poll: The timeout specified has expired (70007)
Total of 200 requests completed

I tried adding the following to frontend and backend respectively:
   To the frontend - bind-process 1,2
   To the backend - bind-process 3,4,5,6,7,8

How can I fix this issue?

Thanks,
- Krishna Kumar


Re: HA Proxy

2015-05-07 Thread Krishna Kumar (Engineering)
 let me know how the backend server are busy in too the admin
 and how load balancing works?

I didn't understand the question, please clarify what exactly you are
looking
for.

In short, the LB algo can be set to simple roundrobin, and you provide a set
of backend servers to be load-balanced in the configuration file. haproxy
starts up, reads the config file, and acts accordingly. All new connections
are sent to the different backends in a roundrobin fashion, and assuming each
connection does a similar amount of work, the traffic gets reasonably
distributed among the servers.
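
A minimal sketch of such a config (names and addresses made up):

backend app
    balance roundrobin
    server s1 10.0.0.1:80 maxconn 1000
    server s2 10.0.0.2:80 maxconn 1000
    server s3 10.0.0.3:80 maxconn 1000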

Thanks,
- Krishna Kumar

On Thu, May 7, 2015 at 11:28 AM, ANISH S IYER 
anish.subramaniai...@gmail.com wrote:


 -- Forwarded message --
 From: Krishna Kumar (Engineering) krishna...@flipkart.com
 Date: Thu, May 7, 2015 at 11:21 AM
 Subject: Re: HA Proxy
 To: ANISH S IYER anish.subramaniai...@gmail.com


 Please send mail to the full list, so that people can also respond and
 confirm
 what I am saying is right. I am also new to haproxy. Please cc all

 On Thu, May 7, 2015 at 11:20 AM, ANISH S IYER 
 anish.subramaniai...@gmail.com wrote:

 hI

 Thanks for your replay

  let me know how the backend server are busy in too the admin
  and how load balancing works?
  i googled this did not find an correct result

 let me know more details

 regards

 anish

 On Thu, May 7, 2015 at 10:22 AM, Krishna Kumar (Engineering) 
 krishna...@flipkart.com wrote:

 On Thu, May 7, 2015 at 9:44 AM, ANISH S IYER 
 anish.subramaniai...@gmail.com wrote:

 1) how ha proxy is know both of his front and backend server is waiting
 or busy.?


 I am not sure if I understood this right. Depending on the algo, the
 backend is picked.
 It should not care if the backend is ready or busy doing some work. The
 new connection
 will go to the selected backend (assuming maxconn for backend is not
 full), and if that
 backend is busy, the connection is queued at the backend.

 2)  when a new server is up how it can added to load balancing
 automatically.


 I think the correct way to add a new server is to update the
 configuration file with the
 server information, and run: /etc/init.d/haproxy reload

 I have seen that large I/O requests sometimes drop during this time (1
 in 20 or 30 times), but
 more often than not, it works perfectly.

 - Krishna Kumar







Couple of questions on future support

2015-05-06 Thread Krishna Kumar (Engineering)
Hi all,

1. Is there any plan to support HTTP/2? Any estimate on the amount of
work/time
it would take to implement?

2. Is there any plan to have support for Geolocation (other than what is
mentioned
in the homepage)?

Thanks,
- Krishna Kumar


Re: Couple of questions on future support

2015-05-06 Thread Krishna Kumar (Engineering)
On Wed, May 6, 2015 at 3:26 PM, Baptiste bed...@gmail.com wrote:

Hi Baptiste,

you can do it natively with maps and conversion of maxmind ip ranges
 into HAProxy's subnets.


Thank you, using this information, I was able to find your article:
http://blog.haproxy.com/2012/07/02/use-geoip-database-within-haproxy/

Let me see what is required and respond.
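
For reference, the approach in that article boils down to a map file of
networks to country codes plus the map_ip converter; a minimal sketch (file
name and entries made up):

# /etc/haproxy/geoip.map: "network value" pairs converted from GeoIP ranges
#   1.0.0.0/24 AU
#   1.0.1.0/24 CN

frontend www
    bind *:80
    # tag each request with the country of the source address
    http-request set-header X-Country %[src,map_ip(/etc/haproxy/geoip.map,unknown)]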

Regards,
- Krishna Kumar


Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-05-06 Thread Krishna Kumar (Engineering)
The performance is really good now, thanks to the great responses on this
list. I also increased the nginx's keepalive to 1m as Pavlos suggested.

# ab -k -n 100 -c 500 http://haproxy:80/64
Requests per second:    181623.35 [#/sec] (mean)
Transfer rate:          53414.40 [Kbytes/sec] received
(both values are as good as doing direct backend)

# ab -k -n 10 -c 500 http://haproxy:80/256K
Requests per second:    4191.92 [#/sec] (mean)
Transfer rate:          1074111.06 [Kbytes/sec] received
(4.8% less for both numbers as compared to direct backend)

If it is helpful, I can post the various parameters that were set (system
level + haproxy + backend) if it will be useful for someone else in future.

Thanks,
- Krishna Kumar

On Thu, May 7, 2015 at 8:31 AM, Baptiste bed...@gmail.com wrote:


 On 7 May 2015 at 04:24, Krishna Kumar (Engineering)
 krishna...@flipkart.com wrote:
 
  I found the source of the problem. One of the backends was being shared
  with another person who was testing iptables rules/tunnel setups, and
  that might have caused some connection drops. I have now removed that
  backend from my setup and use dedicated systems, after which the original
  configuration without specifying source port is working, no connection
 flaps
  now.
 
  Thanks,
  - Krishna Kumar

 How much performance do you have now?

 Baptiste



Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-05-06 Thread Krishna Kumar (Engineering)
I found the source of the problem. One of the backends was being shared
with another person who was testing iptables rules/tunnel setups, and
that might have caused some connection drops. I have now removed that
backend from my setup and use dedicated systems, after which the original
configuration without specifying source port is working, no connection flaps
now.

Thanks,
- Krishna Kumar

On Wed, May 6, 2015 at 4:53 PM, Willy Tarreau w...@1wt.eu wrote:

 On Wed, May 06, 2015 at 12:03:12PM +0200, Baptiste wrote:
  On Wed, May 6, 2015 at 7:15 AM, Krishna Kumar (Engineering)
  krishna...@flipkart.com wrote:
   Hi Baptiste,
  
   On Wed, May 6, 2015 at 1:24 AM, Baptiste bed...@gmail.com wrote:
  
Also, during the test, the status of various backend's change often
between
OK to DOWN,
and then gets back to OK almost immediately:
   
   
   
 www-backend,nginx-3,0,0,0,10,3,184,23843,96517588,,0,,27,0,0,180,DOWN
   
   
 1/2,1,1,0,7,3,6,39,,7,3,1,,220,,2,0,,37,L4CON,,0,0,184,0,0,0,0,00,0,6,Out
of local source ports on the system,,0,2,3,92,
  
   this error is curious with the type of traffic your generating!
   Maybe you should let HAProxy manage the source ports on behalf of the
   server.
   Try adding the source 0.0.0.0:1024-65535 parameter in your backend
   description.
  
  
   Yes, this has fixed the issue - I no longer get state change after an
 hour
   testing.
   The performance didn't improve though. I will check the sysctl
 parameters
   that
   were different between haproxy/nginx nodes.
  
   Thanks,
   - Krishna Kumar
 
 
  You have to investigate why this issue happened.
  I mean, it is not normal. As Pavlos mentioned, your connection rate is
  very low, since you do keep alive and you opened only 500 ports.
 
  Wait, I know, could you share the keep-alive connection from your nginx
 servers?
  By default, they close connections every 100 requests... This might be
  the root of the issue.

 But even then there is no reason why the local ports would remain in use.
 There definitely is a big problem. It also explains why servers are going
 up and down all the time and errors are reported.

 Willy




Re: HA Proxy

2015-05-06 Thread Krishna Kumar (Engineering)
On Thu, May 7, 2015 at 9:44 AM, ANISH S IYER anish.subramaniai...@gmail.com
 wrote:

1) how ha proxy is know both of his front and backend server is waiting or
 busy.?


I am not sure if I understood this right. Depending on the algo, the
backend is picked.
It should not care if the backend is ready or busy doing some work. The new
connection
will go to the selected backend (assuming maxconn for backend is not full),
and if that
backend is busy, the connection is queued at the backend.

2)  when a new server is up how it can added to load balancing
 automatically.


I think the correct way to add a new server is to update the configuration
file with the
server information, and run: /etc/init.d/haproxy reload

I have seen that large I/O requests sometimes drop during this time (1 in
20 or 30 times), but
more often than not, it works perfectly.
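
For reference, the reload is a graceful handover underneath; a minimal sketch
of what the init script effectively runs (paths assumed):

haproxy -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid \
        -sf $(cat /var/run/haproxy.pid)
# -sf: the new process takes over the listeners, then asks the old pids
# to finish their existing connections and exit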

- Krishna Kumar


Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-05-05 Thread Krishna Kumar (Engineering)
Hi Willy, Pavlos,

Thank you once again for your advice.

 Requests per second:19071.55 [#/sec] (mean)
  Transfer rate:  9461.28 [Kbytes/sec] received

 These numbers are extremely low and very likely indicate an http
 close mode combined with an untuned nf_conntrack.


Yes, it was due to http close mode and wrong irq pinning (nf_conntrack_max was
set to 640K).
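
For anyone hitting the same wall, the usual conntrack tuning is along these
lines (a sketch; the exact values depend on memory and traffic):

# grow the tracking table and, just as important, its hash table
sysctl -w net.netfilter.nf_conntrack_max=2000000
echo 500000 > /sys/module/nf_conntrack/parameters/hashsize
# let TIME_WAIT entries expire faster so slots recycle
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_time_wait=30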


  mpstat (first 4 processors only, rest are almost zero):
  Average: CPU%usr   %nice%sys %iowait%irq   %soft  %steal
  %guest  %gnice   %idle
  Average:   00.250.000.750.000.00   98.010.00
  0.000.001.00

 This CPU is spending its time in softirq, probably due to conntrack
 spending a lot of time looking for the session for each packet in too
 small a hash table.


I had not done irq pinning. Today I am getting much better results with irq
pinning
and keepalive.

Note, this is about 2 Gbps. How is your network configured ? You should
 normally see either 1 Gbps with a gig NIC or 10 Gbps with a 10G NIC,
 because retrieving a static file is very cheap. Would you happen to be
 using bonding in round-robin mode maybe ? If that's the case, it's a
 performance disaster due to out-of-order packets and could explain some
 of the high %softirq.


My setup is as follows (no bonding etc., and Sys stands for a baremetal
system, each with 48 cores, 128GB mem, and a single-port ixgbe ethernet card).

Sys1-with-ab -eth0- Sys1-with-Haproxy, which uses two nginx backend systems
over the same eth0 card (that is the current restriction, no extra ethernet
interface for separate frontend/backend, etc). Today I am getting a high of
7.7 Gbps with your suggestions. Is it possible to get higher than that (direct
to server gets 8.6 Gbps)?

Please retry without http-server-close to maintain keep-alive to the
 servers, that will avoid the session setup/teardown. If that becomes
 better, there's definitely something to fix in the conntrack or maybe
 in iptables rules if you have some. But in any case don't put such a


There are a few iptables rules, which seem clean. The results now are:

ab -k -n 100 -c 500 http://haproxy:80/64 (I am getting some errors though,
which are not present when running against the backend directly):

Document Length:        64 bytes
Concurrency Level:      500
Time taken for tests:   6.181 seconds
Complete requests:      100
Failed requests:        18991
   (Connect: 0, Receive: 0, Length: 9675, Exceptions: 9316)
Write errors:           0
Keep-Alive requests:    990330
Total transferred:      296554848 bytes
HTML transferred:       63381120 bytes
Requests per second:    161783.42 [#/sec] (mean)
Time per request:       3.091 [ms] (mean)
Time per request:       0.006 [ms] (mean, across all concurrent requests)
Transfer rate:          46853.18 [Kbytes/sec] received

Connection Times (ms)
  min  mean[+/-sd] median   max
Connect:00   0.5  0   8
Processing: 03   6.2  31005
Waiting:03   6.2  31005
Total:  03   6.3  31010

Percentage of the requests served within a certain time (ms)
  50%  3
  66%  3
  75%  3
  80%  3
  90%  4
  95%  5
  98%  6
  99%  8
 100%   1010 (longest request)

pidstat (some %system numbers are very high, around 50%, maybe due to small
packet sizes?):
Average:  UID    PID   %usr  %system  %guest   %CPU  CPU  Command
Average:  110  52601   6.00     9.33    0.00  15.33    -  haproxy
Average:  110  52602   6.33    11.83    0.00  18.17    -  haproxy
Average:  110  52603  11.33    17.83    0.00  29.17    -  haproxy
Average:  110  52604  17.50    30.33    0.00  47.83    -  haproxy
Average:  110  52605  20.50    38.50    0.00  59.00    -  haproxy
Average:  110  52606  24.50    51.33    0.00  75.83    -  haproxy
Average:  110  52607  22.50    51.33    0.00  73.83    -  haproxy
Average:  110  52608  23.67    47.17    0.00  70.83    -  haproxy

mpstat (of interesting cpus only):
Average:  CPU   %usr  %nice   %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all   2.58   0.00   4.36     0.00  0.00   0.89    0.00    0.00    0.00  92.17
Average:    0   6.84   0.00  11.46     0.00  0.00   2.03    0.00    0.00    0.00  79.67
Average:    1  11.15   0.00  19.85     0.00  0.00   5.29    0.00    0.00    0.00  63.71
Average:    2   8.32   0.00  12.20     0.00  0.00   2.22    0.00    0.00    0.00  77.26
Average:    3   7.92   0.00  11.97     0.00  0.00   2.39    0.00    0.00    0.00  77.72
Average:    4   8.81   0.00  13.76     0.00  0.00   2.39    0.00    0.00    0.00  75.05
Average:    5   6.96   0.00  12.27     0.00  0.00   2.38    0.00    0.00    0.00  78.39
Average:    6   9.21   0.00  12.52     0.00  0.00   3.31    0.00    0.00    0.00  74.95
Average:    7   7.56   0.00  13.65     0.00  0.00 

Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-05-05 Thread Krishna Kumar (Engineering)
Hi Baptiste,

On Wed, May 6, 2015 at 1:24 AM, Baptiste bed...@gmail.com wrote:

  Also, during the test, the status of various backend's change often
 between
  OK to DOWN,
  and then gets back to OK almost immediately:
 
  www-backend,nginx-3,0,0,0,10,3,184,23843,96517588,,0,,27,0,0,180,DOWN
 
 1/2,1,1,0,7,3,6,39,,7,3,1,,220,,2,0,,37,L4CON,,0,0,184,0,0,0,0,00,0,6,Out
  of local source ports on the system,,0,2,3,92,

 this error is curious with the type of traffic your generating!
 Maybe you should let HAProxy manage the source ports on behalf of the
 server.
 Try adding the source 0.0.0.0:1024-65535 parameter in your backend
 description.


Yes, this has fixed the issue - I no longer get state change after an hour
testing.
The performance didn't improve though. I will check the sysctl parameters
that
were different between haproxy/nginx nodes.

Thanks,
- Krishna Kumar


Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-05-05 Thread Krishna Kumar (Engineering)
Hi Pavlos

On Wed, May 6, 2015 at 1:24 AM, Pavlos Parissis pavlos.paris...@gmail.com
wrote:

Shall I assume that you have run the same tests without iptables and got
 the same results?


Yes, I had tried it yesterday and saw no measurable difference.

May I suggest to try also httpress and wrk tool?


I tried it today, will post it after your result below.


 Have you compared 'sysctl -a' between haproxy and nginx server?


Yes, the difference is very little:
11c11
< fs.dentry-state = 266125  130939  45  0   0   0
---
> fs.dentry-state = 19119   0   45  0   0   0
13,17c13,17
< fs.epoll.max_user_watches = 27046277
< fs.file-max = 1048576
< fs.file-nr = 1536 0   1048576
< fs.inode-nr = 262766  98714
< fs.inode-state = 262766   98714   0   0   0   0   0
---
> fs.epoll.max_user_watches = 27046297
> fs.file-max = 262144
> fs.file-nr = 1536 0   262144
> fs.inode-nr = 27290   8946
> fs.inode-state = 27290  8946  0   0   0   0   0

134c134
< kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 2305
---
> kernel.sched_domain.cpu0.domain0.max_newidle_lb_cost = 3820

(and for each cpu, a similar lb_cost)

Have you checked if you got all backends reported down at the same time?


Yes, I checked; that has not happened. After Baptiste's suggestion of adding
the source port range, this has disappeared completely.

How many workers do you use on your Nginx which acts as LB?


I was using the default of 4. Increasing to 16 seems to improve numbers by 10-20%.


 
 www-backend,nginx-3,0,0,0,10,3,184,23843,96517588,,0,,27,0,0,180,DOWN
 1/2,1,1,0,7,3,6,39,,7,3,1,,220,,2,0,,37,L4CON,,0,0,184,0,0,0,0,00,0,6,Out
  of local source ports on the system,,0,2,3,92,
 

 Hold on a second, what is this 'Out  of local source ports on the
 system' message? ab reports 'Concurrency Level:  500' and you said
 that HAProxy runs in keepalive mode(default on 1.5 releases) which means
 there will be only 500 TCP connections opened from HAProxy towards the
 backends, which it isn't that high and you shouldn't get that message
 unless net.ipv4.ip_local_port_range is very small( I don't think so).


It was set to net.ipv4.ip_local_port_range = 32768 61000. I have not seen this
issue after making the change Baptiste suggested, though I could increase the
range above and check too.
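
Checking is cheap either way; a sketch of widening the range and counting
sockets currently parked in TIME_WAIT:

sysctl -w net.ipv4.ip_local_port_range="1024 65535"
ss -tan state time-wait | wc -l   # sockets currently holding a local port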


 # wrk --timeout 3s --latency -c 1000 -d 5m -t 24 http://a.b.c.d
 Running 5m test @ http://a.b.c.d
   24 threads and 1000 connections
   Thread Stats   Avg       Stdev      Max     +/- Stdev
     Latency     87.07ms   593.84ms    7.85s    95.63%
     Req/Sec     16.45k      7.43k    60.89k    74.25%
   Latency Distribution
      50%    1.75ms
      75%    2.40ms
      90%    3.57ms
      99%    3.27s
   111452585 requests in 5.00m, 15.98GB read
   Socket errors: connect 0, read 0, write 0, timeout 33520
 Requests/sec: 371504.85
 Transfer/sec:     54.56MB


I get very strange results:

# wrk --timeout 3s --latency -c 1000 -d 1m -t 24 http://haproxy
Running 1m test @ http://haproxy
  24 threads and 1000 connections
  Thread Stats   Avg      Stdev     Max     +/- Stdev
    Latency      2.40ms   26.64ms   1.02s    99.28%
    Req/Sec      8.77k     8.20k   26.98k    62.39%
  Latency Distribution
     50%    1.14ms
     75%    1.68ms
     90%    2.40ms
     99%    6.14ms
  98400 requests in 1.00m, 34.06MB read
Requests/sec:   1637.26
Transfer/sec:    580.36KB

# wrk --timeout 3s --latency -c 1000 -d 1m -t 24 http://nginx
Running 1m test @ http://nginx
  24 threads and 1000 connections
  Thread Stats   Avg      Stdev      Max      +/- Stdev
    Latency      5.56ms   12.01ms   444.71ms   99.41%
    Req/Sec      8.53k   825.80     18.50k     90.91%
  Latency Distribution
     50%    4.81ms
     75%    6.80ms
     90%    8.58ms
     99%   11.92ms
  12175205 requests in 1.00m, 4.31GB read
Requests/sec: 202584.48
Transfer/sec:     73.41MB

Thank you,

Regards,
- Krishna Kumar


Re: [haproxy]: Performance of haproxy-to-4-nginx vs direct-to-nginx

2015-04-29 Thread Krishna Kumar (Engineering)
Dear all,

Sorry, my lab systems were down for many days and I could not get back on this
earlier. After new systems were allocated, I managed to get all the requested
information with a fresh run. (Sorry, this is a long mail too!) There are now
4 physical servers, running Debian 3.2.0-4-amd64, connected directly to a
common switch:

server1: Run 'ab' in a container, no cpu/memory restriction.
server2: Run haproxy in a container, configured with 4 nginx's,
cpu/memory configured as
  shown below.
server3: Run 2 different nginx containers, no cpu/mem restriction.
server4: Run 2 different nginx containers, for a total of 4 nginx, no
cpu/mem restriction.

The servers have 2 sockets, each with 24 cores. Socket 0 has cores
0,2,4,..,46 and Socket 1 has cores 1,3,5,..,47. The NIC (ixgbe) is bound to
CPU 0. Haproxy is started on cpus 2,4,6,8,10,12,14,16, so that it is in the
same cache domain as the nic (nginx is run on different servers as explained
above). No tuning on the nginx servers, as the comparison is between
'ab' -> nginx and 'ab' -> haproxy -> nginx(s). The cpus are Intel(R) Xeon(R)
CPU E5-2670 v3 @ 2.30GHz. The containers are all configured with 8GB, the
server having 128GB memory.

mpstat and iostat were captured during the test, where the capture started
after 'ab' started and
capture ended just before 'ab' finished so as to get warm numbers.


Request directly to 1 nginx backend server, size=256 bytes:

Command: ab -k -n 10 -c 1000 nginx:80/256
Requests per second:    69749.02 [#/sec] (mean)
Transfer rate:          34600.18 [Kbytes/sec] received

Request to haproxy configured with 4 nginx backends (nbproc=4), size=256 bytes:

Command: ab -k -n 10 -c 1000 haproxy:80/256
Requests per second:    19071.55 [#/sec] (mean)
Transfer rate:          9461.28 [Kbytes/sec] received

mpstat (first 4 processors only, rest are almost zero):
Average:  CPU  %usr  %nice  %sys  %iowait  %irq  %soft  %steal  %guest  %gnice  %idle
Average:  all  0.44   0.00  1.59     0.00  0.00   2.96    0.00    0.00    0.00  95.01
Average:    0  0.25   0.00  0.75     0.00  0.00  98.01    0.00    0.00    0.00   1.00
Average:    1  1.26   0.00  5.28     0.00  0.00   2.51    0.00    0.00    0.00  90.95
Average:    2  2.76   0.00  8.79     0.00  0.00   5.78    0.00    0.00    0.00  82.66
Average:    3  1.51   0.00  6.78     0.00  0.00   3.02    0.00    0.00    0.00  88.69

pidstat:
Average:  105  471  5.00  33.50  0.00  38.50  -  haproxy
Average:  105  472  6.50  44.00  0.00  50.50  -  haproxy
Average:  105  473  8.50  40.00  0.00  48.50  -  haproxy
Average:  105  475  2.50  14.00  0.00  16.50  -  haproxy

Request directly to 1 nginx backend server, size=64K

Command: ab -k -n 10 -c 1000 nginx:80/64K
Requests per second:    3342.56 [#/sec] (mean)
Transfer rate:          214759.11 [Kbytes/sec] received

Request to haproxy configured with 4 nginx backends (nbproc=4), size=64K

Command: ab -k -n 10 -c 1000 haproxy:80/64K

Requests per second:    1283.62 [#/sec] (mean)
Transfer rate:          82472.35 [Kbytes/sec] received

mpstat (first 4 processors only, rest are almost zero):
Average:  CPU  %usr  %nice  %sys  %iowait  %irq   %soft  %steal  %guest  %gnice  %idle
Average:  all  0.08   0.00  0.74     0.01  0.00    2.62    0.00    0.00    0.00  96.55
Average:    0  0.00   0.00  0.00     0.00  0.00  100.00    0.00    0.00    0.00   0.00
Average:    1  1.03   0.00  9.98     0.21  0.00    7.67    0.00    0.00    0.00  81.10
Average:    2  0.70   0.00  6.32     0.00  0.00    4.50    0.00    0.00    0.00  88.48
Average:    3  0.15   0.00  2.04     0.06  0.00    1.73    0.00    0.00    0.00  96.03

pidstat:
Average:  UID  PID  %usr  %system  %guest   %CPU  CPU  Command
Average:  105  471  0.93    14.70    0.00  15.63    -  haproxy
Average:  105  472  1.12    21.55    0.00  22.67    -  haproxy
Average:  105  473  1.41    20.95    0.00  22.36    -  haproxy
Average:  105  475  0.22     4.85    0.00   5.07    -  haproxy
--
Build information:

HA-Proxy version 1.5.8 2014/10/31
Copyright 2000-2014 Willy Tarreau w...@1wt.eu

Build options :
  TARGET  = linux2628
  

Backend status changes continuously

2015-04-21 Thread Krishna Kumar (Engineering)
Hi all,

While running the command: ab -n 10 -c 1000 192.168.122.110:80/256,
the haproxy stats page shows the 4 different backend servers changing status
between 'Active up', 'going down', 'Active or backup down', 'Down', and
'Backup down, going UP'; sometimes all 4 backends are in DOWN state. The
result is very poor performance reported by 'ab' as compared to running
directly against a single backend.

What could be the reason for this continuous state change?

root@HAPROXY:~# haproxy -vv
HA-Proxy version 1.5.8 2014/10/31
Copyright 2000-2014 Willy Tarreau w...@1wt.eu

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat
-Werror=format-security -D_FORTIFY_SOURCE=2
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.


Thanks,
- Krishna Kumar


Re: Backend status changes continuously

2015-04-21 Thread Krishna Kumar (Engineering)
Hi Baptiste,

Sorry I didn't provide more details earlier.

--
1. root@HAPROXY:~# haproxy -vv
HA-Proxy version 1.5.8 2014/10/31
Copyright 2000-2014 Willy Tarreau w...@1wt.eu

Build options :
  TARGET  = linux2628
  CPU = generic
  CC  = gcc
  CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat
-Werror=format-security -D_FORTIFY_SOURCE=2
  OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

Default settings :
  maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

Encrypted password support via crypt(3): yes
Built with zlib version : 1.2.7
Compression algorithms supported : identity, deflate, gzip
Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
Running on OpenSSL version : OpenSSL 1.0.1k 8 Jan 2015
OpenSSL library supports TLS extensions : yes
OpenSSL library supports SNI : yes
OpenSSL library supports prefer-server-ciphers : yes
Built with PCRE version : 8.30 2012-02-04
PCRE library supports JIT : no (USE_PCRE_JIT not set)
Built with transparent proxy support using: IP_TRANSPARENT IPV6_TRANSPARENT
IP_FREEBIND

Available polling systems :
  epoll : pref=300,  test result OK
   poll : pref=200,  test result OK
 select : pref=150,  test result OK
Total: 3 (3 usable), will use epoll.
--
2. Configuration file:
global
daemon
maxconn  6
quiet
nbproc 2
maxpipes 16384
user haproxy
group haproxy
stats socket /var/run/haproxy.sock mode 600 level admin
stats timeout 2m

defaults
option  dontlognull
option forwardfor
option http-server-close
retries 3
option redispatch
maxconn 6
option splice-auto
option prefer-last-server
timeout connect 5000ms
timeout client 5ms
timeout server 5ms

frontend www-http
bind *:80
reqadd X-Forwarded-Proto:\ http
default_backend www-backend

frontend www-https
bind *:443 ssl crt /etc/ssl/private/haproxy.pem ciphers
AES:ALL:!aNULL:!eNULL:+RC4:@STRENGTH
rspadd Strict-Transport-Security:\ max-age=31536000
reqadd X-Forwarded-Proto:\ https
default_backend www-backend

userlist stats-auth
group admin users admin
user  admin insecure-password admin
group readonly users user
user  user  insecure-password user

backend www-backend
mode http
maxconn 6
stats enable
stats uri /stats
acl AUTH       http_auth(stats-auth)
acl AUTH_ADMIN http_auth(stats-auth) admin
stats http-request auth unless AUTH
balance roundrobin
option prefer-last-server
option forwardfor
option splice-auto
option splice-request
option splice-response
compression offload
compression algo gzip
compression type text/html text/plain text/javascript
application/javascript application/xml text/css application/octet-stream
server nginx-1 192.168.122.101:80 maxconn 15000 cookie S1 check
server nginx-2 192.168.122.102:80 maxconn 15000 cookie S2 check
server nginx-3 192.168.122.103:80 maxconn 15000 cookie S3 check
server nginx-4 192.168.122.104:80 maxconn 15000 cookie S4 check
--
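
A bare "check" on the server lines uses haproxy's defaults (probe every
2s, 3 failures to mark DOWN, 2 successes to mark UP), so under a load
test that saturates the VM or the host, probes can time out and servers
will flap. A minimal sketch of making the check timing explicit and more
tolerant; the values are illustrative, not from the original post:

backend www-backend
    # Illustrative values: give each probe up to 5s, probe every 2s,
    # 3 consecutive failures => DOWN, 2 successes => UP.
    timeout check 5s
    default-server inter 2s fall 3 rise 2
    # ... existing server lines unchanged ...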

3. A 24-processor Ubuntu system runs 2 nginx VMs (KVM, 2 vcpu, 1GB)
and 1 haproxy VM (KVM, 2 vcpu, 1GB). 'ab' runs on the host and tests
either against the haproxy VM or directly against one of the 2 nginx VMs.

Sometimes during the test I also see many "nf_conntrack: table full,
dropping packet" messages on the host system.
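
A "nf_conntrack: table full, dropping packet" message means the host's
connection-tracking table overflowed; once full, netfilter silently drops
new packets, and since bridged VM traffic can pass through the host's
iptables, this can also drop haproxy's health-check packets to the
backends, which by itself would make servers flap. A minimal sketch for
checking and temporarily raising the limit on the host (the value below
is illustrative, not a recommendation):

# Compare current entries against the limit.
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# Illustrative: raise the limit for the duration of the test.
sysctl -w net.netfilter.nf_conntrack_max=262144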

Thanks.
- Krishna


On Tue, Apr 21, 2015 at 1:29 PM, Krishna Kumar (Engineering) 
krishna...@flipkart.com wrote:

 Hi all,

 While running the command ab -n 10 -c 1000 192.168.122.110:80/256,
 the haproxy stats page shows the 4 backend servers continuously changing
 status between "Active UP, going down", "Active or backup DOWN", "DOWN",
 and "Backup DOWN, going UP"; sometimes all 4 backends are in the DOWN
 state at once. The result is much poorer performance reported by 'ab'
 than when running directly against a single backend.

 What could be the reason for this continuous state change?

 root@HAPROXY:~# haproxy -vv
 HA-Proxy version 1.5.8 2014/10/31
 Copyright 2000-2014 Willy Tarreau w...@1wt.eu

 Build options :
   TARGET  = linux2628
   CPU = generic
   CC  = gcc
   CFLAGS  = -g -O2 -fstack-protector --param=ssp-buffer-size=4 -Wformat
 -Werror=format-security -D_FORTIFY_SOURCE=2
   OPTIONS = USE_ZLIB=1 USE_OPENSSL=1 USE_PCRE=1

 Default settings :
   maxconn = 2000, bufsize = 16384, maxrewrite = 8192, maxpollevents = 200

 Encrypted password support via crypt(3): yes
 Built with zlib version : 1.2.7
 Compression algorithms supported : identity, deflate, gzip
 Built with OpenSSL version : OpenSSL 1.0.1e 11 Feb 2013
 Running on OpenSSL version : OpenSSL