RE: Tuning suggestions for high-core-count Linux servers

2017-06-05 Thread Browne, Stuart
So, a different tack today, namely monitoring '/proc/net/softnet_stat' to try to 
reduce potential errors on the interface.

End result: 517k qps.

Final changes for the day:
sysctl -w net.core.netdev_max_backlog=32768
sysctl -w net.core.netdev_budget=2700
/root/nic_balance.sh em1 0 2

netdev_max_backlog:

A need to increase this value is indicated by an increase in the 2nd column of 
/proc/net/softnet_stat. The default value starts at a reasonable amount, however 
even 500k qps pushes the limits of this buffer when pinning IRQs to cores. 
Doubled it.

netdev_budget:

A need to increase this value is indicated by an increase in the 3rd column of 
/proc/net/softnet_stat. The default value is quite low (300) and is easily blown 
away, especially if all of the NIC IRQs are pinned to a single CPU core. Tried 
various values until the increase was small (at 2700).

As the best numbers have been when using 2 cores, this value can probably be 
lowered. It seems stable at 2700, though, so I didn't re-test at lower numbers.
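
A quick way to watch both of these while a test runs (a rough sketch assuming GNU 
awk; the values in softnet_stat are hexadecimal, one row per CPU):

# column 2 = packets dropped because the backlog queue (netdev_max_backlog) was full
# column 3 = times the softirq stopped early because netdev_budget was exhausted
awk '{ printf "cpu%-2d dropped=%-8d squeezed=%d\n", NR-1, strtonum("0x"$2), strtonum("0x"$3) }' /proc/net/softnet_stat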

'/root/nic_balance.sh em1 0 2':
(Custom Script based off of RH 20150325_network_performance_tuning.pdf)

Pin all the IRQ's for the 'em1' NIC to the first 2 CPU cores of the local NUMA 
node.
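
Roughly, the script does something like this (a sketch, not the script itself; 
argument meaning and round-robin spreading of the queue IRQs are assumptions):

#!/bin/bash
# nic_balance.sh <nic> <first_core> <core_count>  -- argument meaning assumed
nic=$1; first=$2; count=$3
i=0
# pin every IRQ whose /proc/interrupts line mentions the NIC (em1, em1-TxRx-0, ...)
for irq in $(awk -v n="$nic" '$0 ~ n { sub(":", "", $1); print $1 }' /proc/interrupts); do
    cpu=$(( first + i % count ))
    echo "$cpu" > "/proc/irq/$irq/smp_affinity_list"
    i=$(( i + 1 ))
done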

This had the most noticeable effect. By default, the 'irqbalance' service and 
the system in general will create numerous rx/tx listening threads for the NIC, 
each with a soft interrupt. When these are spread across multiple NUMA nodes, 
each ingress packet gets delayed as it is switched over to the NUMA node where 
the rest of the process lives.

At low throughput, this isn't a concern. At high throughput, this becomes quite 
noticeable; roughly 100k qps difference.

I tried various levels of tuning (spread across 12 cores, spread across 8, 4 
and pinned to a single core), finding 2 cores the best on the bare-metal node.

...

Whilst 'softnet_stat' didn't show any dropped packets (2nd column), 'netstat -s 
-u' still shows 'packet receive errors'. Still uncertain how they differ and 
how I can fix netstat's problem.
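
The raw counters behind netstat's figure can at least be watched directly (they 
count drops at the socket receive buffer, a later stage than the softnet backlog, 
which may be why the two don't move together):

# InErrors / RcvbufErrors are what netstat -su rolls up into
# "packet receive errors" / "receive buffer errors"
watch -d 'grep ^Udp: /proc/net/snmp'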

Stuart


RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

2017-06-04 Thread Browne, Stuart
Ugh, let me try that again (apologies if you got the half-composed version).



> The lab uses Dell R430s running Fedora Core 23 with Intel X710 10GB NICs
> and each populated with a single Xeon E5-2680 v3 2.5 GHz 12-core CPU.

R630 chassis I believe, same NICs, smaller processor (E5-2650 v4 @ 2.2GHz).



> The only major setting I've found which both helps performance and
> improves consistency is to ensure that each NIC rx/tx queue IRQ is
> assigned to a specific CPU core, with irqbalance disabled.

I've been stopping irqbalance, and have confirmed that the rx/tx queue IRQ's 
aren't jumping around.
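
One way to confirm that (a rough check; 'em1' is the NIC name used elsewhere in 
this thread):

# show which cores each em1 queue IRQ is allowed on; re-check /proc/interrupts
# during a run to make sure the counts only grow on those cores
for irq in $(awk '/em1/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
    printf "IRQ %-4s -> CPUs %s\n" "$irq" "$(cat /proc/irq/$irq/smp_affinity_list)"
done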

> This is with a _single_ dnsperf client, too.  The settings I use are
> -c24 -q82 -T6 -x2048.   However I do use a tweaked version of dnsperf
> which assigns each thread pair (it uses separate threads for rx and tx)
> to its own core.

I didn't think of using -T. *tries that* ..

> You may find the presentation I made at the recent DNS-OARC workshop of
> interest:
> 
> https://indico.dns-oarc.net/event/26/session/3/contribution/18

Reading it now. Many thanks.
 
> You didn't mention precisely which 9.10 series version you're running.
> Note that versions prior to 9.10.4 defaulted to a -U value of ncores/2,
> but investigation showed that on modern systems this was sub-optimal so
> it was changed to ncores-1.  This makes a *very* big difference.

BIND 9.10.4-P8.

> kind regards,
> 
> Ray Bellis
> ISC Research Fellow

Stuart


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Paul Kosinski
It's been some years now, but I had worked on developing code for a high
throughput network server (not BIND). We found that on multi-socketed
NUMA machines we could have similar contention problems, and it was
quite important to make sure that threads which needed access to the
same memory areas weren't split across sockets. Luckily, the various
services being run were sufficiently separate that we could assign the
service processes to different sockets and avoid a lot of contention.

With BIND, it's basically all one service, so this is not directly
possible. 

It might be possible, however, to run two (or more) *separate*
instances of BIND and do some strictly internal routing of the IP
traffic to those separate instances, or even to have separate NICs
feeding the separate processes. In other words, have several BIND
servers in one chassis, each with its own NUMA memory area.



On Fri, 2 Jun 2017 07:12:09 +
"Browne, Stuart"  wrote:

> Just some interesting investigation results. One of the URL's Matthew
> Ian Eis linked to talked about using a tool called 'perf'. For the
> hell of it, I gave it a shot.
> 
> Sure enough it tells some very interesting things.
> 
> When BIND was restricted to using a single NUMA node, the biggest
> call (to _raw_spin_lock) showed 7.05% overhead.
> 
> When BIND was allowed to use both NUMA nodes, the same call showed
> 49.74% overhead; an astonishing difference.
> 
> As it was running unrestricted, memory from both nodes was in use:
> 
> [root@kr20s2601 ~]# numastat -p 22441
> 
> Per-node process memory usage (in MBs) for PID 22441 (named)
>                 Node 0      Node 1       Total
>          ----------  ----------  ----------
> Huge           0.00        0.00        0.00
> Heap           0.45        0.12        0.57
> Stack          0.71        0.64        1.35
> Private        5.28     9415.30     9420.57
>          ----------  ----------  ----------
> Total          6.43     9416.07     9422.50
> 
> Given the numbers here, you wouldn't think it should make much of a
> difference.
> 
> Sadly, I didn't get which CPU the UDP listener was attached to.
> 
> Anyway, what I've changed so far:
> 
> vm.swappiness = 0
> vm.dirty_ratio = 1
> vm.dirty_background_ratio = 1
> kernel.sched_min_granularity_ns = 1000
> kernel.sched_migration_cost_ns = 500
> 
> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps
> 
> Stuart
> 
> 'perf' data collected during a 3 minute test run:
> 
> [root@kr20s2601 ~]# ls -al perf.data*
> -rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
> -rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48
> 
> 'perf' top 5 (24 cores, numa restricted):
> 
> Overhead  Command  Shared Object       Symbol
>    7.05%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    6.96%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.84%  named    libc-2.17.so        [.] vfprintf
>    2.36%  named    libdns.so.165.0.7   [.] dns_name_fullcompare
>    2.02%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
> 
> 'perf' top 5 (48 cores):
> 
> Overhead  Command  Shared Object       Symbol
>   49.74%  named    [kernel.kallsyms]   [k] _raw_spin_lock
>    4.52%  named    libpthread-2.17.so  [.] pthread_mutex_lock
>    3.09%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
>    1.84%  named    [kernel.kallsyms]   [k] _raw_spin_lock_bh
>    1.56%  named    libc-2.17.so        [.] vfprintf


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Ray Bellis
On 01/06/2017 23:26, Mathew Ian Eis wrote:

> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s
> supposedly scalable UDP load balancer. (As long as you don’t get a
> performance hit, it also opens up other interesting possibilities
> like being able to shift production load for maintenance on the named
> backends).

It's relatively trivial to patch the BIND source to enable SO_REUSEPORT
on the more recent Linux kernels that support it (3.8+, ISTR?) so that
you can just start two BIND instances listening on the exact same ports
and the kernel will do the load balancing for you.

For a NUMA system, make sure each instance is locked to one die, but
beware of NUMA bus transfers caused by incoming packet buffers being
handled by a kernel task running on one die but then delivered to a BIND
instance running on another.

In the meantime we're also looking at SO_REUSEPORT even for single
instance installations because it appears to offer an advantage over
letting multiple threads all fight over one shared file descriptor.

Ray

Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Ray Bellis
On 02/06/2017 08:12, Browne, Stuart wrote:

> Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
> Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

In our internal Performance Lab I've achieved nearly 900 kqps on small
authoritative zones when we had hyperthreading enabled, and 700 kqps
without.

The lab uses Dell R430s running Fedora Core 23 with Intel X710 10GB NICs
and each populated with a single Xeon E5-2680 v3 2.5 GHz 12-core CPU.

These systems have had *negligible* tuning applied - the vast majority
of the system settings changes I've made have been to improve the
repeatability of results, not the absolute performance.

The only major setting I've found which both helps performance and
improves consistency is to ensure that each NIC rx/tx queue IRQ is
assigned to a specific CPU core, with irqbalance disabled.

This is with a _single_ dnsperf client, too.  The settings I use are
-c24 -q82 -T6 -x2048.   However I do use a tweaked version of dnsperf
which assigns each thread pair (it uses separate threads for rx and tx)
to its own core.

You may find the presentation I made at the recent DNS-OARC workshop of
interest:

https://indico.dns-oarc.net/event/26/session/3/contribution/18

You didn't mention precisely which 9.10 series version you're running.
Note that versions prior to 9.10.4 defaulted to a -U value of ncores/2,
but investigation showed that on modern systems this was sub-optimal so
it was changed to ncores-1.  This makes a *very* big difference.
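
Illustratively, on a 24-core box that's the difference between the old default of 
-U 12 and the new default of -U 23; either can also be forced explicitly when 
starting named, e.g. (other flags taken from elsewhere in this thread):

/usr/sbin/named -u named -n 24 -U 23 -f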

kind regards,

Ray Bellis
ISC Research Fellow


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Phil Mayers

On 02/06/17 08:12, Browne, Stuart wrote:

Just some interesting investigation results. One of the URL's Matthew
Ian Eis linked to talked about using a tool called 'perf'. For the
hell of it, I gave it a shot.


perf is super-powerful.

On a sufficiently recent kernel you can also do interesting things with 
the enhanced eBPF-based tracing - see:


http://www.brendangregg.com/ebpf.html

...but those are not going to be usable on a RH7 kernel I believe :o(


Re: Tuning suggestions for high-core-count Linux servers

2017-06-02 Thread Browne, Stuart
Just some interesting investigation results. One of the URLs Mathew Ian Eis 
linked to talked about using a tool called 'perf'. For the hell of it, I gave 
it a shot.

Sure enough it tells some very interesting things.

When BIND was restricted to using a single NUMA node, the biggest call (to 
_raw_spin_lock) showed 7.05% overhead.

When BIND was allowed to use both NUMA nodes, the same call showed 49.74% 
overhead; an astonishing difference.

As it was running unrestricted, memory from both nodes was in use:

[root@kr20s2601 ~]# numastat -p 22441

Per-node process memory usage (in MBs) for PID 22441 (named)
                Node 0      Node 1       Total
         ----------  ----------  ----------
Huge           0.00        0.00        0.00
Heap           0.45        0.12        0.57
Stack          0.71        0.64        1.35
Private        5.28     9415.30     9420.57
         ----------  ----------  ----------
Total          6.43     9416.07     9422.50

Given the numbers here, you wouldn't think it should make much of a difference.

Sadly, I didn't get which CPU the UDP listener was attached to.

Anyway, what I've changed so far:

vm.swappiness = 0
vm.dirty_ratio = 1
vm.dirty_background_ratio = 1
kernel.sched_min_granularity_ns = 1000
kernel.sched_migration_cost_ns = 500
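
To keep these across reboots they could go in a sysctl drop-in and be reloaded 
with 'sysctl --system' (the filename is just an example; values as listed above):

# /etc/sysctl.d/90-bind-tuning.conf
vm.swappiness = 0
vm.dirty_ratio = 1
vm.dirty_background_ratio = 1
kernel.sched_min_granularity_ns = 1000
kernel.sched_migration_cost_ns = 500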

Query rate thus far reached (on 24 cores, numa node restricted): 426k qps
Query rate thus far reached (on 48 cores, numa nodes unrestricted): 321k qps

Stuart

'perf' data collected during a 3 minute test run:

[root@kr20s2601 ~]# ls -al perf.data*
-rw-------. 1 root root  717350012 Jun  2 08:36 perf.data.24
-rw-------. 1 root root 1366620296 Jun  2 08:53 perf.data.48
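
For the record, something along these lines produces and reads files like those 
(system-wide sampling; not necessarily the exact invocation used here):

perf record -a -g -o perf.data.24 -- sleep 180   # sample all CPUs for the 3 minute run
perf report -i perf.data.24                      # interactive breakdown, summarised below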

'perf' top 5 (24 cores, numa restricted):

Overhead  Command  Shared Object       Symbol
   7.05%  named    [kernel.kallsyms]   [k] _raw_spin_lock
   6.96%  named    libpthread-2.17.so  [.] pthread_mutex_lock
   3.84%  named    libc-2.17.so        [.] vfprintf
   2.36%  named    libdns.so.165.0.7   [.] dns_name_fullcompare
   2.02%  named    libisc.so.160.1.2   [.] isc_log_wouldlog

'perf' top 5 (48 cores):

Overhead  Command  Shared Object       Symbol
  49.74%  named    [kernel.kallsyms]   [k] _raw_spin_lock
   4.52%  named    libpthread-2.17.so  [.] pthread_mutex_lock
   3.09%  named    libisc.so.160.1.2   [.] isc_log_wouldlog
   1.84%  named    [kernel.kallsyms]   [k] _raw_spin_lock_bh
   1.56%  named    libc-2.17.so        [.] vfprintf


RE: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart


> -Original Message-
> From: Plhu [mailto:p...@seznam.cz]


> a few simple ideas to your tests:
>  - have you inspected the per-thread CPU? Aren't some of the threads
> overloaded?

I've tested both the auto-calculated value (one thread per available core) and 
explicit overrides. NUMA boundaries seem to be where things get wonky.

>  - have you tried to get the statistics from the Bind server using the
>  XML or JSON interface? It may bring you another insight to the errors.


>  - I may have missed the connection count you use for testing - can you
>  post it? More, how many entries do you have in your database? Can you
>  share your named.conf (without any compromising entries)?

I'm testing to flood: approximately 5 test instances x 400 clients each (dnsperf), 
with a 500-query backlog per test instance.

Theoretically this should mean up to 4,500 active or back-logged connections (or 
just 2,500 if I read that documentation wrong).

>  - what is your network environment? How many switches/routers are there
>  between your simulator and the Bind server host?

This is a very closed environment: server-switch-server, all 10Gbit or 25Gbit. 
Verified the switch stats today; it is capable of 10x what I'm pushing through it 
currently.

>  - is Bind the only running process on the tested server?

As always, there's the rest of the OS helper stuff, but BIND is the only thing 
actively doing anything (beyond the monitoring I'm doing). So no, nothing else 
is drawing massive amounts of either CPU or network resources.

>  - what CPUs is the Bind server being run on?

From procinfo:
Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz

2 of them.


>  - is there numad running and while trying the taskset, have you
>  selected the CPUs on the same processor? What does numastat show during
>  the test?

I was manually issuing taskset after confirming the CPU allocations:

taskset -c 1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35,37,39,41,43,46,47 
/usr/sbin/named -u named -n 24 -f

This is all of the cores (including HT) on the 2nd socket. There was almost no 
performance difference between 12 (just the actual cores, no HTs) and 24 (with 
the HTs).
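
A quick sanity check that the mask took effect (assuming a single named process):

taskset -cp $(pidof named)    # prints the list of CPUs named may run on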

>  - how many UDP sockets are in use during your test?

See above.

> 
> Curious for the responses.
> 
>   Lukas

Stuart


RE: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart


> -Original Message-
> From: Mathew Ian Eis [mailto:mathew@nau.edu]
>

> 
> Basically the math here is “large enough that you can queue up the
> 9X.XXXth percentile of traffic bursts without dropping them, but not so
> large that you waste processing time fiddling with the queue”. Since that
> percentile varies widely across environments it’s not easy to provide a
> specific formula. And on that note:

Yup. Experimentation seems to be the name of the day.

> > Will keep spinning test but using smaller increments to the wmem/rmem
> > values
>
> Tightening is nice for finding some theoretical limits but in practice
> not so much. Be careful about making them too tight, lest under your
> “bursty” production loads you drop all sorts of queries without intending
> to.

Yup.

> dropwatch is an easy indicator of whether the throughput issue is on or
> off the system. Seeing packets being dropped in the system combined with
> apparently low CPU usage suggests you might be able to increase
> throughput. `dropwatch -l kas` should tell you the methods that are
> dropping the packets, which can help you understand where in the kernel
> they are being dropped and why. For anything beyond that, I expect your
> Google-fu is as good as mine ;-)

I like the '-l kas' output:

830 drops at udp_queue_rcv_skb+374 (0x815e1c64)
 15 drops at __udp_queue_rcv_skb+91 (0x815df171)

Well and truly buried in the code.

https://blog.packagecloud.io/eng/2016/06/22/monitoring-tuning-linux-networking-stack-receiving-data/#udpqueuercvskb

This seems like a nice explanation as to what's going on. Still reading through 
it all.


> If your CPU utilization is still apparently low, you might be onto
> something with taskset/numa… Related things I have toyed with but don’t
> currently have in production:
> 
> increasing kernel.sched_migration_cost a couple of orders of magnitude
> 
> setting kernel.sched_autogroup_enabled=0
> 
> systemctl stop irqbalance

I've had irqbalance stopped previously, and sched_autogroup_enabled is already 
set to 0. Initial mucking about a bit with sched_migration_cost gets a few more 
QPS through, so will run more tests.

Thanks for this one, hadn't used it before.

> > Lastly (mostly for posterity for the list, please don’t take this as
> “rtfm” if you’ve seen them already) here are some very useful in-depth
> (but generalized) performance tuning guides:

Will give them a read. I do like manuals :P



> … and for one last really crazy idea, you could try running a pair of
> named instances on the machine and fronting them with nginx’s supposedly
> scalable UDP load balancer. (As long as you don’t get a performance hit,
> it also opens up other interesting possibilities like being able to shift
> production load for maintenance on the named backends).

Yeah, I've had this thought.

I'm fairly sure I've pretty much reached the limit of what BIND can do in a 
single NUMA node for the moment.

I will report back if any great inspiration or successful increases in 
throughput occur.

Stuart

Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Mathew Ian Eis
Howdy Stuart,

>  Re: net.core.rmem - I'd love to figure out what the math here should be. 'X 
> number of simultaneous connections multiplied by Y socket memory size = rmem' 
> or some such.

Basically the math here is “large enough that you can queue up the 9X.XXXth 
percentile of traffic bursts without dropping them, but not so large that you 
waste processing time fiddling with the queue”. Since that percentile varies 
widely across environments it’s not easy to provide a specific formula. And on 
that note:

> Will keep spinning test but using smaller increments to the wmem/rmem values

Tightening is nice for finding some theoretical limits but in practice not so 
much. Be careful about making them too tight, lest under your “bursty” 
production loads you drop all sorts of queries without intending to.

> Re: dropwatch - Oo! new tool! More google-fu to figure out how to use that 
> information for good

dropwatch is an easy indicator of whether the throughput issue is on or off the 
system. Seeing packets being dropped in the system combined with apparently low 
CPU usage suggests you might be able to increase throughput. `dropwatch -l kas` 
should tell you the methods that are dropping the packets, which can help you 
understand where in the kernel they are being dropped and why. For anything 
beyond that, I expect your Google-fu is as good as mine ;-)

If your CPU utilization is still apparently low, you might be onto something 
with taskset/numa… Related things I have toyed with but don’t currently have in 
production:

increasing kernel.sched_migration_cost a couple of orders of magnitude
setting kernel.sched_autogroup_enabled=0
systemctl stop irqbalance
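
As shell commands (illustrative values; on RHEL7 the first knob is spelled 
kernel.sched_migration_cost_ns, whose stock value is 500000 ns):

sysctl -w kernel.sched_migration_cost_ns=50000000   # a couple of orders of magnitude up
sysctl -w kernel.sched_autogroup_enabled=0
systemctl stop irqbalance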

Lastly (mostly for posterity for the list, please don’t take this as “rtfm” if 
you’ve seen them already) here are some very useful in-depth (but generalized) 
performance tuning guides:

https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html-single/Performance_Tuning_Guide/
https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v1.1.pdf

… and for one last really crazy idea, you could try running a pair of named 
instances on the machine and fronting them with nginx’s supposedly scalable UDP 
load balancer. (As long as you don’t get a performance hit, it also opens up 
other interesting possibilities like being able to shift production load for 
maintenance on the named backends).

Best of luck! Let us know where you cap out!

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-Original Message-
From: "Browne, Stuart" <stuart.bro...@neustar.biz>
Date: Thursday, June 1, 2017 at 12:27 AM
To: Mathew Ian Eis <mathew@nau.edu>, "bind-users@lists.isc.org" 
<bind-users@lists.isc.org>
Subject: RE: Tuning suggestions for high-core-count Linux servers

Cheers Matthew.

1)  Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
(x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would 
block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code 
(lib/isc/unix/socket.c in doio_send), I don't understand the underlying reason 
for it being set (errno == EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), 
but no real difference after tweaks. I did another run with stupidly-large 
core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a bit 
so over tuning isn't good either. Need to figure out a good balance here.

I'd love to figure out what the math here should be.  'X number of 
simultaneous connections multiplied by Y socket memory size = rmem' or some 
such.

2) I am still seeing some udp receive errors and receive buffer errors; 
about 1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State
udp   382976  17664 192.168.1.21:53 0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the 
send-queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog, small increase 
in throughput for my tests. Still need to figure out how to read softnet_stat. 
More google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero 
in the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x81

Re: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Plhu

  Hello Stuart,
a few simple ideas to your tests:
 - have you inspected the per-thread CPU? Aren't some of the threads overloaded?
 - have you tried to get the statistics from the Bind server using the
 XML or JSON interface? It may bring you another insight to the errors.
 - I may have missed the connection count you use for testing - can you
 post it? More, how many entries do you have in your database? Can you
 share your named.conf (without any compromising entries)?
 - what is your network environment? How many switches/routers are there
 between your simulator and the Bind server host?
 - is Bind the only running process on the tested server?
 - what CPUs is the Bind server being run on?
 - is there numad running and while trying the taskset, have you
 selected the CPUs on the same processor? What does numastat show during
 the test?
 - how many UDP sockets are in use during your test?

Curious for the responses.

  Lukas

Browne, Stuart  writes:

> Cheers Matthew.
>
> 1)  Not seeing that error, seeing this one instead:
>
> 01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
> (x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: 
> would block
>
> Only seeing a few of them per run (out of ~70 million requests).
>
> Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c 
> in doio_send), I don't understand the underlying reason for it being set 
> (errno == EWOULDBLOCK || errno == EAGAIN).
>
> I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), 
> but no real difference after tweaks. I did another run with stupidly-large 
> core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a 
> bit so over tuning isn't good either. Need to figure out a good balance here.
>
> I'd love to figure out what the math here should be.  'X number of 
> simultaneous connections multiplied by Y socket memory size = rmem' or some 
> such.
>
> 2) I am still seeing some udp receive errors and receive buffer errors; about 
> 1.3% of received packets.
>
> From a 'netstat' point of view, I see:
>
> Active Internet connections (servers and established)
> Proto Recv-Q Send-Q Local Address   Foreign Address State
> udp   382976  17664 192.168.1.21:53 0.0.0.0:*
>
> The numbers in the receive queue stay in the 200-300k range whilst the 
> send-queue floats around the 20-40k range. wmem already bumped.
>
> 3) Huh, didn't know about this one. Bumped up the backlog, small increase in 
> throughput for my tests. Still need to figure out how to read softnet_stat. 
> More google-fu in my future.
>
> After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in 
> the 2nd column.
>
> 4) Yes, max_dgram_qlen is already set to 512.
>
> 5) Oo! new tool! :)
>
> --
> ...
> 11 drops at location 0x815df171
> 854 drops at location 0x815e1c64
> 12 drops at location 0x815df171
> 822 drops at location 0x815e1c64
> ...


RE: Tuning suggestions for high-core-count Linux servers

2017-06-01 Thread Browne, Stuart
Cheers Mathew.

1)  Not seeing that error, seeing this one instead:

01-Jun-2017 01:46:27.952 client: warning: client 192.168.0.23#38125 
(x41fe848-f3d1-4eec-967e-039d075ee864.perf1000): error sending response: would 
block

Only seeing a few of them per run (out of ~70 million requests).

Whilst I can see where this is raised in the BIND code (lib/isc/unix/socket.c 
in doio_send), I don't understand the underlying reason for it being set (errno 
== EWOULDBLOCK || errno == EAGAIN).

I've not bumped wmem/rmem up as much as the link (only to 16MB, not 40MB), but 
no real difference after tweaks. I did another run with stupidly-large 
core.{rmem,wmem}_{max,default} (64MB), this actually degraded performance a bit 
so over tuning isn't good either. Need to figure out a good balance here.

I'd love to figure out what the math here should be.  'X number of simultaneous 
connections multiplied by Y socket memory size = rmem' or some such.

2) I am still seeing some udp receive errors and receive buffer errors; about 
1.3% of received packets.

From a 'netstat' point of view, I see:

Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address   Foreign Address State
udp   382976  17664 192.168.1.21:53 0.0.0.0:*

The numbers in the receive queue stay in the 200-300k range whilst the 
send-queue floats around the 20-40k range. wmem already bumped.

3) Huh, didn't know about this one. Bumped up the backlog, small increase in 
throughput for my tests. Still need to figure out how to read softnet_stat. More 
google-fu in my future.

After a reboot and the wmem/rmem/backlog increases, no longer any non-zero in 
the 2nd column.

4) Yes, max_dgram_qlen is already set to 512.

5) Oo! new tool! :)

--
...
11 drops at location 0x815df171
854 drops at location 0x815e1c64
12 drops at location 0x815df171
822 drops at location 0x815e1c64
...
--

I'm pretty sure it's just showing more details of the 'netstat -u -s'. More 
google-fu to figure out how to use that information for good rather than, well, 
.. frustration? .. :)

Will keep spinning test but using smaller increments to the wmem/rmem values, 
see if I can eke anything more than 360k out of it.

Thanks for your suggestions, Mathew!

Stuart


-Original Message-
From: Mathew Ian Eis [mailto:mathew@nau.edu] 
Sent: Thursday, 1 June 2017 10:30 AM
To: bind-users@lists.isc.org
Cc: Browne, Stuart
Subject: [EXTERNAL] Re: Tuning suggestions for high-core-count Linux servers

360k qps is actually quite good… the best I have heard of until now on EL was 
180k [1]. There, it was recommended to manually tune the number of subthreads 
with the -U parameter.



Since you’ve mentioned rmem/wmem changes, specifically you want to:



1. check for send buffer overflow; as indicated in named logs:

31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): 
error sending response: unset



fix: increase wmem via sysctl:

net.core.wmem_max

net.core.wmem_default



2. check for receive buffer overflow; as indicated by netstat:

# netstat -u -s

Udp:

34772479 packet receive errors



fix: increase rmem and backlog via sysctl:

net.core.rmem_max

net.core.rmem_default



… and other ideas:



3. check 2nd column in /proc/net/softnet_stat for any non-zero numbers 
(indicating dropped packets).

If any are non-zero, increase net.core.netdev_max_backlog



4. You may also want to increase net.unix.max_dgram_qlen (although since EL7 
defaults this to 512, this is not much of an issue - double check that it is 
512).



5. Try running dropwatch to see where packets are being lost. If it shows 
nothing then you need to look outside the system. If it shows something you may 
have a hint where to tune next.



Please post your outcomes in any case, since you are already having some 
excellent results.



[1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html


Regards,



Mathew Eis

Northern Arizona University

Information Technology Services



-Original Message-

From: bind-users <bind-users-boun...@lists.isc.org> on behalf of "Browne, 
Stuart" <stuart.bro...@neustar.biz>

Date: Wednesday, May 31, 2017 at 12:25 AM

To: "bind-users@lists.isc.org" <bind-users@lists.isc.org>

Subject: Tuning suggestions for high-core-count Linux servers



Hi,



I've been able to get my hands on some rather nice servers with 2 x 12 core 
Intel CPU's and was wondering if anybody had any decent tuning tips to get BIND 
to respond at a faster rate.



I'm seeing that pretty much cpu count beyond a single die doesn't get any 
real improvement. I understand the NUMA boundaries etc., but this hasn't been 
my experience on previous iterations of the Intel CPU's, at least not this 
dramatically. When I use more than a single die, CPU utilization continues to 
match the core count however throughput doesn't increase to match.

Re: Tuning suggestions for high-core-count Linux servers

2017-05-31 Thread Mathew Ian Eis
360k qps is actually quite good… the best I have heard of until now on EL was 
180k [1]. There, it was recommended to manually tune the number of subthreads 
with the -U parameter.

Since you’ve mentioned rmem/wmem changes, specifically you want to:

1. check for send buffer overflow; as indicated in named logs:
31-Mar-2017 12:30:55.521 client: warning: client 10.0.0.5#51342 (test.com): 
error sending response: unset

fix: increase wmem via sysctl:
net.core.wmem_max
net.core.wmem_default

2. check for receive buffer overflow; as indicated by netstat:
# netstat -u -s
Udp:
34772479 packet receive errors

fix: increase rmem and backlog via sysctl:
net.core.rmem_max
net.core.rmem_default

… and other ideas:

3. check 2nd column in /proc/net/softnet_stat for any non-zero numbers 
(indicating dropped packets).
If any are non-zero, increase net.core.netdev_max_backlog

4. You may also want to increase net.unix.max_dgram_qlen (although since EL7 
defaults this to 512, this is not much of an issue - double check that it is 
512).

5. Try running dropwatch to see where packets are being lost. If it shows 
nothing then you need to look outside the system. If it shows something you may 
have a hint where to tune next.
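
Concrete forms of the sysctl changes in items 1-3, with purely illustrative sizes 
(the right values are workload-dependent, as discussed elsewhere in this thread):

sysctl -w net.core.wmem_max=16777216 net.core.wmem_default=16777216   # send-side socket buffers
sysctl -w net.core.rmem_max=16777216 net.core.rmem_default=16777216   # receive-side socket buffers
sysctl -w net.core.netdev_max_backlog=32768                           # per-CPU input backlog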

Please post your outcomes in any case, since you are already having some 
excellent results.

[1] https://lists.dns-oarc.net/pipermail/dns-operations/2014-April/011543.html

Regards,

Mathew Eis
Northern Arizona University
Information Technology Services

-Original Message-
From: bind-users  on behalf of "Browne, 
Stuart" 
Date: Wednesday, May 31, 2017 at 12:25 AM
To: "bind-users@lists.isc.org" 
Subject: Tuning suggestions for high-core-count Linux servers

Hi,

I've been able to get my hands on some rather nice servers with 2 x 12 core 
Intel CPU's and was wondering if anybody had any decent tuning tips to get BIND 
to respond at a faster rate.

I'm seeing that pretty much cpu count beyond a single die doesn't get any 
real improvement. I understand the NUMA boundaries etc., but this hasn't been 
my experience on previous iterations of the Intel CPU's, at least not this 
dramatically. When I use more than a single die, CPU utilization continues to 
match the core count however throughput doesn't increase to match.

All the testing I've been doing for now (dnsperf from multiple sources for 
now) seems to be plateauing around 340k qps per BIND host.

Some notes:
- Primarily looking at UDP throughput here
- Intention is for high-throughput, authoritative only
- The zone files used for testing are fairly small and reside completely 
in-memory; no disk IO involved
- RHEL7, bind 9.10 series, iptables 'NOTRACK' firmly in place
- Current configure:

built by make with '--build=x86_64-redhat-linux-gnu' 
'--host=x86_64-redhat-linux-gnu' '--program-prefix=' 
'--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' 
'--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' 
'--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' 
'--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' 
'--mandir=/usr/share/man' '--infodir=/usr/share/info' '--localstatedir=/var' 
'--with-libtool' '--enable-threads' '--enable-ipv6' '--with-pic' 
'--enable-shared' '--disable-static' '--disable-openssl-version-check' 
'--with-tuning=large' '--with-libxml2' '--with-libjson' 
'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 
'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 
-mtune=generic -fPIC' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS= -DDIG_SIGCHASE -fPIC'

Things tried:
- Using 'taskset' to bind to a single CPU die and limiting BIND to '-n' 
cpu's doesn't improve much beyond letting BIND make its own decision
- NIC interfaces are set for TOE
- rmem & wmem changes (beyond a point) seem to do little to improve 
performance, mainly just make throughput more consistent

I've yet to investigate the switch throughput or tweaking (don't yet have 
access to it).

So, any thoughts?

Stuart

Re: Tuning suggestions for high-core-count Linux servers

2017-05-31 Thread Reindl Harald


Am 31.05.2017 um 14:42 schrieb MURTARI, JOHN:

Stuart, You didn't mention what OS you are using


Subject: RE: Tuning suggestions for high-core-count Linux servers



RE: Tuning suggestions for high-core-count Linux servers

2017-05-31 Thread MURTARI, JOHN
Stuart,
You didn't mention what OS you are using; I assume some version of Linux. What 
you are seeing may not be a BIND limit, but the OS. One thing we noted with 
Redhat is that the kernel just couldn't keep up with the inbound UDP packets 
(queue overflow). The kernel does keep a count of dropped UDP packets; 
unfortunately, I can't recall the command we used to monitor it. Found this on 
Google: 
https://linux-tips.com/t/udp-packet-drops-and-packet-receive-error-difference/237
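
It was probably one of these (the same tools show up elsewhere in this thread):

netstat -su                              # "packet receive errors" / "receive buffer errors"
watch -d 'grep ^Udp: /proc/net/snmp'     # raw InErrors / RcvbufErrors counters behind those lines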

Perhaps the other folks have better details.
Best regards!
John

Date: Wed, 31 May 2017 07:25:44 +
From: "Browne, Stuart" 
To: "bind-users@lists.isc.org" 
Subject: Tuning suggestions for high-core-count Linux servers

Hi,

I've been able to get my hands on some rather nice servers with 2 x 12 core 
Intel CPU's and was wondering if anybody had any decent tuning tips to get BIND 
to respond at a faster rate.

I'm seeing that pretty much cpu count beyond a single die doesn't get any real 
improvement. I understand the NUMA boundaries etc., but this hasn't been my 
experience on previous iterations of the Intel CPU's, at least not this 
dramatically. When I use more than a single die, CPU utilization continues to 
match the core count however throughput doesn't increase to match.

All the testing I've been doing for now (dnsperf from multiple sources for now) 
seems to be plateauing around 340k qps per BIND host.

Some notes:
- Primarily looking at UDP throughput here
- Intention is for high-throughput, authoritative only
- The zone files used for testing are fairly small and reside completely 
in-memory; no disk IO involved
- RHEL7, bind 9.10 series, iptables 'NOTRACK' firmly in place
- Current configure:

built by make with '--build=x86_64-redhat-linux-gnu' 
'--host=x86_64-redhat-linux-gnu' '--program-prefix=' 
'--disable-dependency-tracking' '--prefix=/usr' '--exec-prefix=/usr' 
'--bindir=/usr/bin' '--sbindir=/usr/sbin' '--sysconfdir=/etc' 
'--datadir=/usr/share' '--includedir=/usr/include' '--libdir=/usr/lib64' 
'--libexecdir=/usr/libexec' '--sharedstatedir=/var/lib' 
'--mandir=/usr/share/man' '--infodir=/usr/share/info' '--localstatedir=/var' 
'--with-libtool' '--enable-threads' '--enable-ipv6' '--with-pic' 
'--enable-shared' '--disable-static' '--disable-openssl-version-check' 
'--with-tuning=large' '--with-libxml2' '--with-libjson' 
'build_alias=x86_64-redhat-linux-gnu' 'host_alias=x86_64-redhat-linux-gnu' 
'CFLAGS= -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions 
-fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -m64 
-mtune=generic -fPIC' 'LDFLAGS=-Wl,-z,relro ' 'CPPFLAGS= -DDIG_SIGCHASE -fPIC'

Things tried:
- Using 'taskset' to bind to a single CPU die and limiting BIND to '-n' cpu's 
doesn't improve much beyond letting BIND make its own decision
- NIC interfaces are set for TOE
- rmem & wmem changes (beyond a point) seem to do little to improve 
performance, mainly just make throughput more consistent

I've yet to investigate the switch throughput or tweaking (don't yet have 
access to it).

So, any thoughts?

Stuart

