Re: dhclient sucks cpu usage...

2014-06-10 Thread Alexander V. Chernikov

On 10.06.2014 07:03, Bryan Venteicher wrote:

Hi,

- Original Message -

So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting good
performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall on memstick!)

So, the first one is:
https://www.funkthat.com/~jmg/em.stack.svg

As I was browsing around, the em_handle_que was consuming quite a bit
of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
me that the taskqueue for em was consuming about 50% cpu...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I kill off dhclient, and instantly, the taskqueue
thread for em drops down to 40% CPU... (transfer rate only marginally
improves, if it does)

I decide to run another flame graph w/o dhclient running:
https://www.funkthat.com/~jmg/em.stack.nodhclient.svg

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...


Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.

Hm, pretty interesting.
dhclient should set up a proper filter (and it looks like it does so:
13:10 [0] m@ptichko s netstat -B
  Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
 1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
)
see match count.
And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for 
each consumer on the interface).

It should not introduce significant performance penalties.



diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index cb3ed27..9751986 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -2013,9 +2013,11 @@ bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m)
return (BPF_TSTAMP_EXTERN);
}
}
+#if 0
if (quality == BPF_TSTAMP_NORMAL)
binuptime(bt);
else
+#endif

bpf_gettime() is called IFF the packet filter matches some traffic.
Can you show your netstat -B output?

getbinuptime(bt);
  
  	return (quality);




--
   John-Mark Gurney Voice: +1 415 225 5579

  All that I will do, has been done, All that I have, has not.


Re: dhclient sucks cpu usage...

2014-06-10 Thread John-Mark Gurney
Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 +0400:
 On 10.06.2014 07:03, Bryan Venteicher wrote:
 Hi,
 
 - Original Message -
 So, after finding out that nc has a stupidly small buffer size (2k
 even though there is space for 16k), I was still not getting good
 performance using nc between machines, so I decided to generate some
 flame graphs to try to identify issues...  (Thanks to whoever included a
 full set of modules, including dtraceall on memstick!)
 
 So, the first one is:
 https://www.funkthat.com/~jmg/em.stack.svg
 
 As I was browsing around, the em_handle_que was consuming quite a bit
 of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
 me that the taskqueue for em was consuming about 50% cpu...  Also pretty
 high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
 consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
 or anything, but I think dhclient uses bpf to be able to inject packets
 and listen in on them, so I kill off dhclient, and instantly, the 
 taskqueue
 thread for em drops down to 40% CPU... (transfer rate only marginally
 improves, if it does)
 
 I decide to run another flame graph w/o dhclient running:
 https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
 
 and now _rxeof drops from 17.22% to 11.94%, pretty significant...
 
 So, if you care about performance, don't run dhclient...
 
 Yes, I've noticed the same issue. It can absolutely kill performance
 in a VM guest. It is much more pronounced on only some of my systems,
 and I hadn't tracked it down yet. I wonder if this is fallout from
 the callout work, or if there was some bpf change.
 
 I've been using the kludgey workaround patch below.
 Hm, pretty interesting.
 dhclient should set up a proper filter (and it looks like it does so:
 13:10 [0] m@ptichko s netstat -B
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
 )
 see match count.
 And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for 
 each consumer on the interface).
 It should not introduce significant performance penalties.

Don't forget that it has to process the returning ack's... So, you're
looking at around 10k+ pps that you have to handle and pass through the
filter...  That's a lot of packets to process...

Just for a bit more of a double check, instead of using the HD as a
source, I used /dev/zero...   I ran netstat -w 1 -I em0 while
running the test, and I was getting ~50.7MiB/s w/ dhclient running and
then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
launched dhclient again, and it dropped back to ~50MiB/s...

and some of this slowness is due to nc using small buffers which I will
fix shortly..

And with witness disabled it goes from 58MiB/s to 65.7MiB/s..  In
both cases, that's a 13% performance improvement by running w/o
dhclient...
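(As a sanity check on that figure: 50.7 -> 57.1 MiB/s is (57.1 - 50.7)/50.7 ~= 12.6%,
and 58 -> 65.7 MiB/s is (65.7 - 58)/58 ~= 13.3%, so both runs land right around 13%.)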

This is using the latest memstick image, r266655, on a Lenovo T61:
FreeBSD 11.0-CURRENT #0 r266655: Sun May 25 18:55:02 UTC 2014
r...@grind.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
WARNING: WITNESS option enabled, expect reduced performance.
CPU: Intel(R) Core(TM)2 Duo CPU T7300  @ 2.00GHz (1995.05-MHz K8-class CPU)
  Origin=GenuineIntel  Id=0x6fb  Family=0x6  Model=0xf  Stepping=11
  
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
  Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
  AMD Features=0x20100800<SYSCALL,NX,LM>
  AMD Features2=0x1<LAHF>
  TSC: P-state invariant, performance statistics
real memory  = 2147483648 (2048 MB)
avail memory = 2014019584 (1920 MB)

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: dhclient sucks cpu usage...

2014-06-10 Thread Alexander V. Chernikov

On 10.06.2014 20:24, John-Mark Gurney wrote:

Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 +0400:

On 10.06.2014 07:03, Bryan Venteicher wrote:

Hi,

- Original Message -

So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting good
performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall on memstick!)

So, the first one is:
https://www.funkthat.com/~jmg/em.stack.svg

As I was browsing around, the em_handle_que was consuming quite a bit
of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
me that the taskqueue for em was consuming about 50% cpu...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I kill off dhclient, and instantly, the
taskqueue
thread for em drops down to 40% CPU... (transfer rate only marginally
improves, if it does)

I decide to run another flame graph w/o dhclient running:
https://www.funkthat.com/~jmg/em.stack.nodhclient.svg

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...


Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.

Hm, pretty interesting.
dhclient should set up a proper filter (and it looks like it does so:
13:10 [0] m@ptichko s netstat -B
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
)
see match count.
And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
each consumer on the interface).
It should not introduce significant performance penalties.

Don't forget that it has to process the returning ack's... So, you're
Well, it can still be captured with a proper filter like ip && udp && 
port 67 or port 68.
We're using tcpdump at high packet rates (>1M pps) and it does not 
influence the process _much_.
We should probably convert its rwlock to an rmlock and use per-CPU 
counters for statistics, but that's a different story.
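(For the record, the primitives I have in mind are rmlock(9) and counter(9).
A rough kernel-context sketch of the pattern follows -- the names are made up
and this is not a proposed bpf.c change, just an illustration of the idea:

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/counter.h>
#include <sys/lock.h>
#include <sys/malloc.h>
#include <sys/rmlock.h>

/* Read-mostly lock: readers take a cheap, per-CPU acquisition. */
static struct rmlock snoop_rm;
/* Per-CPU 64-bit counter: no shared cache line bounced on every packet. */
static counter_u64_t snoop_seen;

static void
snoop_init(void)
{
	rm_init(&snoop_rm, "snoop_rm");
	snoop_seen = counter_u64_alloc(M_WAITOK);
}

/* Hot path, called for every packet. */
static void
snoop_input(void)
{
	struct rm_priotracker trk;

	rm_rlock(&snoop_rm, &trk);	/* readers do not contend with each other */
	counter_u64_add(snoop_seen, 1);	/* per-CPU increment */
	rm_runlock(&snoop_rm, &trk);
}

/* Slow path, e.g. when a consumer attaches or detaches. */
static void
snoop_change(void)
{
	rm_wlock(&snoop_rm);
	/* ... modify the consumer list ... */
	rm_wunlock(&snoop_rm);
}

With that, per-packet statistics never write to a shared cache line and the
read path stays cheap; only attach/detach takes the expensive write lock.)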

looking at around 10k+ pps that you have to handle and pass through the
filter...  That's a lot of packets to process...

Just for a bit more of a double check, instead of using the HD as a
source, I used /dev/zero...   I ran netstat -w 1 -I em0 while
running the test, and I was getting ~50.7MiB/s w/ dhclient running and
then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
launched dhclient again, and it dropped back to ~50MiB/s...
dhclient uses different BPF sockets for reading and writing (and it 
moves the write socket to a privileged child process via fork()).
The problem we're facing is that dhclient does not set 
_any_ read filter on the write socket:

21:27 [0] zfscurr0# netstat -B
  Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
 1529    em0 --fs--l 86774 86769 86784  4044  3180 dhclient
--- ^ --
 1526    em0 -ifs--l 86789     0     1     0     0 dhclient

so all traffic is pushed to it, introducing contention on the BPF 
descriptor mutex.


(That's why I've asked for netstat -B output.)

Please try the attached patch to fix this. This is not the right way to 
fix it; we'd better change BPF behavior not to attach write-only 
consumers to interface readers.
This has been partially implemented as the net.bpf.optimize_writers hack, 
but it does not work for all direct BPF consumers (those which are not 
using the pcap(3) API).
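To illustrate the idea: a write-only descriptor can install a read filter that
matches nothing, so the tap never queues anything for it. A minimal userland
sketch of that idea (the helper name is made up; this is not dhclient's actual
code, and error handling is simplified):

#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <net/bpf.h>
#include <net/if.h>
#include <err.h>
#include <fcntl.h>
#include <string.h>

/*
 * Open a BPF descriptor intended only for sending packets and install a
 * "match nothing" read filter so the kernel never buffers received
 * packets for this descriptor.
 */
static int
open_bpf_writeonly(const char *ifname)
{
	/* A single BPF_RET 0 instruction: accept zero bytes, i.e. nothing. */
	static struct bpf_insn reject_all[] = {
		BPF_STMT(BPF_RET + BPF_K, 0)
	};
	struct bpf_program prog = {
		.bf_len = sizeof(reject_all) / sizeof(reject_all[0]),
		.bf_insns = reject_all
	};
	struct ifreq ifr;
	int fd;

	if ((fd = open("/dev/bpf", O_WRONLY)) == -1)
		err(1, "open /dev/bpf");

	memset(&ifr, 0, sizeof(ifr));
	strlcpy(ifr.ifr_name, ifname, sizeof(ifr.ifr_name));
	if (ioctl(fd, BIOCSETIF, &ifr) == -1)	/* attach to the interface */
		err(1, "BIOCSETIF %s", ifname);

	if (ioctl(fd, BIOCSETF, &prog) == -1)	/* read filter: match nothing */
		err(1, "BIOCSETF");

	return (fd);
}

With such a filter the descriptor's Recv counter in netstat -B still grows, but
Match and Drop stay at zero, so nothing is copied or buffered per packet;
net.bpf.optimize_writers goes further and avoids attaching write-only
descriptors to the interface's reader list at all.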




and some of this slowness is due to nc using small buffers which I will
fix shortly..

And with witness disabled it goes from 58MiB/s to 65.7MiB/s..  In
both cases, that's a 13% performance improvement by running w/o
dhclient...

This is using the latest memstick image, r266655, on a Lenovo T61:
FreeBSD 11.0-CURRENT #0 r266655: Sun May 25 18:55:02 UTC 2014
 r...@grind.freebsd.org:/usr/obj/usr/src/sys/GENERIC amd64
FreeBSD clang version 3.4.1 (tags/RELEASE_34/dot1-final 208032) 20140512
WARNING: WITNESS option enabled, expect reduced performance.
CPU: Intel(R) Core(TM)2 Duo CPU T7300  @ 2.00GHz (1995.05-MHz K8-class CPU)
   Origin=GenuineIntel  Id=0x6fb  Family=0x6  Model=0xf  Stepping=11
   
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
   Features2=0xe3bd<SSE3,DTES64,MON,DS_CPL,VMX,EST,TM2,SSSE3,CX16,xTPR,PDCM>
   AMD 

Re: dhclient sucks cpu usage...

2014-06-10 Thread Bryan Venteicher


- Original Message -
 On 10.06.2014 07:03, Bryan Venteicher wrote:
  Hi,
 
  - Original Message -
  So, after finding out that nc has a stupidly small buffer size (2k
  even though there is space for 16k), I was still not getting good
  performance using nc between machines, so I decided to generate some
  flame graphs to try to identify issues...  (Thanks to whoever included a
  full set of modules, including dtraceall on memstick!)
 
  So, the first one is:
  https://www.funkthat.com/~jmg/em.stack.svg
 
  As I was browsing around, the em_handle_que was consuming quite a bit
  of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
  me that the taskqueue for em was consuming about 50% cpu...  Also pretty
  high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
  consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
  or anything, but I think dhclient uses bpf to be able to inject packets
  and listen in on them, so I kill off dhclient, and instantly, the
  taskqueue
  thread for em drops down to 40% CPU... (transfer rate only marginally
  improves, if it does)
 
  I decide to run another flame graph w/o dhclient running:
  https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
 
  and now _rxeof drops from 17.22% to 11.94%, pretty significant...
 
  So, if you care about performance, don't run dhclient...
 
  Yes, I've noticed the same issue. It can absolutely kill performance
  in a VM guest. It is much more pronounced on only some of my systems,
  and I hadn't tracked it down yet. I wonder if this is fallout from
  the callout work, or if there was some bpf change.
 
  I've been using the kludgey workaround patch below.
 Hm, pretty interesting.
 dhclient should set up a proper filter (and it looks like it does so:
 13:10 [0] m@ptichko s netstat -B
Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
   1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
 )
 see match count.
 And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
 each consumer on the interface).
 It should not introduce significant performance penalties.
 


It will be a bit before I'm able to capture that. Here's a Flamegraph from
earlier in the year showing an absurd amount of time spent in bpf_mtap():

http://people.freebsd.org/~bryanv/vtnet/vtnet-bpf-10.svg


 
  diff --git a/sys/net/bpf.c b/sys/net/bpf.c
  index cb3ed27..9751986 100644
  --- a/sys/net/bpf.c
  +++ b/sys/net/bpf.c
  @@ -2013,9 +2013,11 @@ bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m)
  return (BPF_TSTAMP_EXTERN);
  }
  }
  +#if 0
  if (quality == BPF_TSTAMP_NORMAL)
  binuptime(bt);
  else
  +#endif
 bpf_gettime() is called IFF the packet filter matches some traffic.
 Can you show your netstat -B output?
  getbinuptime(bt);

  return (quality);
 
 
  --
 John-Mark Gurney    Voice: +1 415 225 5579
 
All that I will do, has been done, All that I have, has not.
 
 
 


Re: dhclient sucks cpu usage...

2014-06-10 Thread Alexander V. Chernikov

On 10.06.2014 22:11, Bryan Venteicher wrote:


- Original Message -

On 10.06.2014 07:03, Bryan Venteicher wrote:

Hi,

- Original Message -

So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting good
performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall on memstick!)

So, the first one is:
https://www.funkthat.com/~jmg/em.stack.svg

As I was browsing around, the em_handle_que was consuming quite a bit
of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
me that the taskqueue for em was consuming about 50% cpu...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I kill off dhclient, and instantly, the
taskqueue
thread for em drops down to 40% CPU... (transfer rate only marginally
improves, if it does)

I decide to run another flame graph w/o dhclient running:
https://www.funkthat.com/~jmg/em.stack.nodhclient.svg

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...


Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.

Hm, pretty interesting.
dhclient should set up a proper filter (and it looks like it does so:
13:10 [0] m@ptichko s netstat -B
Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
   1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
)
see match count.
And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
each consumer on the interface).
It should not introduce significant performance penalties.



It will be a bit before I'm able to capture that. Here's a Flamegraph from
earlier in the year showing an absurd amount of time spent in bpf_mtap():

Can you briefly describe the test setup?
(Actually I'm interested in the overall pps rate, the BPF filter used and the 
match ratio).


For example, for some random box at $work:
22:17 [0] m@sas1-fw1 netstat -I vlan802 -w1
input  (vlan802)   output
   packets  errs idrops      bytes    packets  errs      bytes colls
430418 0 0  337712454 396282 0  333207773 0
CPU:  0.4% user,  0.0% nice,  1.2% system, 15.9% interrupt, 82.5% idle

22:17 [0] sas1-fw1# tcpdump -i vlan802 -lnps0 icmp and host X.X.X.X
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on vlan802, link-type EN10MB (Ethernet), capture size 65535 bytes
22:17:14.866085 IP X.X.X.X > Y.Y.Y.Y: ICMP echo request, id 6730, seq 1, 
length 64


22:17 [0] m@sas1-fw1 s netstat -B 2>/dev/null | grep tcpdump
98520 vlan802 ---s---  27979422     0    40     0     0 tcpdump

CPU:  0.9% user,  0.0% nice,  2.7% system, 17.6% interrupt, 78.8% idle
(Actually the load is floating in the 14-20% range due to bursty traffic, but 
I can't see much difference with tcpdump turned on/off).




http://people.freebsd.org/~bryanv/vtnet/vtnet-bpf-10.svg



diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index cb3ed27..9751986 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -2013,9 +2013,11 @@ bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m)
return (BPF_TSTAMP_EXTERN);
}
}
+#if 0
if (quality == BPF_TSTAMP_NORMAL)
binuptime(bt);
else
+#endif

bpf_gettime() is called IFF the packet filter matches some traffic.
Can you show your netstat -B output?

getbinuptime(bt);
   
   	return (quality);




--
John-Mark Gurney    Voice: +1 415 225 5579

   All that I will do, has been done, All that I have, has not.







Re: dhclient sucks cpu usage...

2014-06-10 Thread John-Mark Gurney
Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 22:21 +0400:
 On 10.06.2014 22:11, Bryan Venteicher wrote:
 
 - Original Message -
 On 10.06.2014 07:03, Bryan Venteicher wrote:
 Hi,
 
 - Original Message -
 So, after finding out that nc has a stupidly small buffer size (2k
 even though there is space for 16k), I was still not getting good
 performance using nc between machines, so I decided to generate some
 flame graphs to try to identify issues...  (Thanks to whoever included a
 full set of modules, including dtraceall on memstick!)
 
 So, the first one is:
 https://www.funkthat.com/~jmg/em.stack.svg
 
 As I was browsing around, the em_handle_que was consuming quite a bit
 of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
 me that the taskqueue for em was consuming about 50% cpu...  Also pretty
 high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
 consuming ~3.18% (under ether_nh_input)..  I know I'm not running 
 tcpdump
 or anything, but I think dhclient uses bpf to be able to inject packets
 and listen in on them, so I kill off dhclient, and instantly, the
 taskqueue
 thread for em drops down to 40% CPU... (transfer rate only marginally
 improves, if it does)
 
 I decide to run another flame graph w/o dhclient running:
 https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
 
 and now _rxeof drops from 17.22% to 11.94%, pretty significant...
 
 So, if you care about performance, don't run dhclient...
 
 Yes, I've noticed the same issue. It can absolutely kill performance
 in a VM guest. It is much more pronounced on only some of my systems,
 and I hadn't tracked it down yet. I wonder if this is fallout from
 the callout work, or if there was some bpf change.
 
 I've been using the kludgey workaround patch below.
 Hm, pretty interesting.
 dhclient should set up a proper filter (and it looks like it does so:
 13:10 [0] m@ptichko s netstat -B
 Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
    1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
 )
 see match count.
 And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
 each consumer on the interface).
 It should not introduce significant performance penalties.
 
 
 It will be a bit before I'm able to capture that. Here's a Flamegraph from
 earlier in the year showing an absurd amount of time spent in bpf_mtap():
 Can you briefly describe the test setup?

For mine, one machine is sink:
nc -l 2387 > /dev/null

The machine w/ dhclient is source:
nc carbon 2387 < /dev/zero

 (Actually I'm interested in the overall pps rate, the BPF filter used and the 
 match ratio).

the overall rate is ~26k pps both in and out (so total ~52kpps)...

So, netstat -B; sleep 5; netstat -B gives:
  Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
  919    em0 --fs--l   6275907   6275938   6275961  4060  2236 dhclient
  937    em0 -ifs--l   6275992         0         1     0     0 dhclient
  Pid  Netif   Flags      Recv      Drop     Match Sblen Hblen Command
  919    em0 --fs--l   6539717   6539752   6539775  4060  2236 dhclient
  937    em0 -ifs--l   6539806         0         1     0     0 dhclient
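(Those counters line up with the quoted rate: on the write-side descriptor,
(6539717 - 6275907) / 5 s ~= 52.8k packets received per second, i.e. roughly
the ~52kpps mentioned above.)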

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: dhclient sucks cpu usage...

2014-06-10 Thread John-Mark Gurney
Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 21:33 +0400:
 On 10.06.2014 20:24, John-Mark Gurney wrote:
 Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 
 +0400:
 On 10.06.2014 07:03, Bryan Venteicher wrote:
 Hi,
 
 - Original Message -
 So, after finding out that nc has a stupidly small buffer size (2k
 even though there is space for 16k), I was still not getting good
 performance using nc between machines, so I decided to generate some
 flame graphs to try to identify issues...  (Thanks to whoever included a
 full set of modules, including dtraceall on memstick!)
 
 So, the first one is:
 https://www.funkthat.com/~jmg/em.stack.svg
 
 As I was browsing around, the em_handle_que was consuming quite a bit
 of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
 me that the taskqueue for em was consuming about 50% cpu...  Also pretty
 high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
 consuming ~3.18% (under ether_nh_input)..  I know I'm not running 
 tcpdump
 or anything, but I think dhclient uses bpf to be able to inject packets
 and listen in on them, so I kill off dhclient, and instantly, the
 taskqueue
 thread for em drops down to 40% CPU... (transfer rate only marginally
 improves, if it does)
 
 I decide to run another flame graph w/o dhclient running:
 https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
 
 and now _rxeof drops from 17.22% to 11.94%, pretty significant...
 
 So, if you care about performance, don't run dhclient...
 
 Yes, I've noticed the same issue. It can absolutely kill performance
 in a VM guest. It is much more pronounced on only some of my systems,
 and I hadn't tracked it down yet. I wonder if this is fallout from
 the callout work, or if there was some bpf change.
 
 I've been using the kludgey workaround patch below.
 Hm, pretty interesting.
 dhclient should set up a proper filter (and it looks like it does so:
 13:10 [0] m@ptichko s netstat -B
Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
   1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
 )
 see match count.
 And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
 each consumer on the interface).
 It should not introduce significant performance penalties.
 Don't forget that it has to process the returning ack's... So, you're
 Well, it can still be captured with a proper filter like ip && udp && 
 port 67 or port 68.
 We're using tcpdump at high packet rates (>1M pps) and it does not 
 influence the process _much_.
 We should probably convert its rwlock to an rmlock and use per-CPU 
 counters for statistics, but that's a different story.
 looking at around 10k+ pps that you have to handle and pass through the
 filter...  That's a lot of packets to process...
 
 Just for a bit more of a double check, instead of using the HD as a
 source, I used /dev/zero...   I ran netstat -w 1 -I em0 while
 running the test, and I was getting ~50.7MiB/s w/ dhclient running and
 then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
 launched dhclient again, and it dropped back to ~50MiB/s...
 dhclient uses different BPF sockets for reading and writing (and it 
 moves the write socket to a privileged child process via fork()).
 The problem we're facing is that dhclient does not set 
 _any_ read filter on the write socket:
 21:27 [0] zfscurr0# netstat -B
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  1529    em0 --fs--l 86774 86769 86784  4044  3180 dhclient
 --- ^ --
  1526    em0 -ifs--l 86789     0     1     0     0 dhclient
 
 so all traffic is pushed to it, introducing contention on the BPF 
 descriptor mutex.
 
 (That's why I've asked for netstat -B output.)
 
 Please try the attached patch to fix this. This is not the right way to 
 fix it; we'd better change BPF behavior not to attach write-only 
 consumers to interface readers.
 This has been partially implemented as the net.bpf.optimize_writers hack, 
 but it does not work for all direct BPF consumers (those which are not 
 using the pcap(3) API).

Ok, looks like this patch helps the issue...

netstat -B; sleep 5; netstat -B:
  Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  958    em0 --fs--l   3881435  3868  2236 dhclient
  976    em0 -ifs--l   3880014     0     1     0     0 dhclient
  Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  958    em0 --fs--l   41785251435  3868  2236 dhclient
  976    em0 -ifs--l   4178539     0     1     0     0 dhclient

and now the rate only drops from ~66MiB/s to ~63MiB/s when dhclient is
running...  Still a significant drop (5%), but better than before...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.

Re: dhclient sucks cpu usage...

2014-06-10 Thread Alexander V. Chernikov
On 10.06.2014 22:56, John-Mark Gurney wrote:
 Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 21:33 +0400:
 On 10.06.2014 20:24, John-Mark Gurney wrote:
 Alexander V. Chernikov wrote this message on Tue, Jun 10, 2014 at 13:17 
 +0400:
 On 10.06.2014 07:03, Bryan Venteicher wrote:
 Hi,

 - Original Message -
 So, after finding out that nc has a stupidly small buffer size (2k
 even though there is space for 16k), I was still not getting good
 performance using nc between machines, so I decided to generate some
 flame graphs to try to identify issues...  (Thanks to whoever included a
 full set of modules, including dtraceall on memstick!)

 So, the first one is:
 https://www.funkthat.com/~jmg/em.stack.svg

 As I was browsing around, the em_handle_que was consuming quite a bit
 of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
 me that the taskqueue for em was consuming about 50% cpu...  Also pretty
 high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
 consuming ~3.18% (under ether_nh_input)..  I know I'm not running 
 tcpdump
 or anything, but I think dhclient uses bpf to be able to inject packets
 and listen in on them, so I kill off dhclient, and instantly, the
 taskqueue
 thread for em drops down to 40% CPU... (transfer rate only marginally
 improves, if it does)

 I decide to run another flame graph w/o dhclient running:
 https://www.funkthat.com/~jmg/em.stack.nodhclient.svg

 and now _rxeof drops from 17.22% to 11.94%, pretty significant...

 So, if you care about performance, don't run dhclient...

 Yes, I've noticed the same issue. It can absolutely kill performance
 in a VM guest. It is much more pronounced on only some of my systems,
 and I hadn't tracked it down yet. I wonder if this is fallout from
 the callout work, or if there was some bpf change.

 I've been using the kludgey workaround patch below.
 Hm, pretty interesting.
 dhclient should set up a proper filter (and it looks like it does so:
 13:10 [0] m@ptichko s netstat -B
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  1224    em0 -ifs--l  41225922     0    11     0     0 dhclient
 )
 see match count.
 And BPF itself adds the cost of a read rwlock (+ bpf_filter() calls for
 each consumer on the interface).
 It should not introduce significant performance penalties.
 Don't forget that it has to process the returning ack's... So, you're
 Well, it can still be captured with a proper filter like ip && udp && 
 port 67 or port 68.
 We're using tcpdump at high packet rates (>1M pps) and it does not 
 influence the process _much_.
 We should probably convert its rwlock to an rmlock and use per-CPU 
 counters for statistics, but that's a different story.
 looking at around 10k+ pps that you have to handle and pass through the
 filter...  That's a lot of packets to process...

 Just for a bit more of a double check, instead of using the HD as a
 source, I used /dev/zero...   I ran netstat -w 1 -I em0 while
 running the test, and I was getting ~50.7MiB/s w/ dhclient running and
 then I killed dhclient and it instantly jumped up to ~57.1MiB/s.. So I
 launched dhclient again, and it dropped back to ~50MiB/s...
 dhclient uses different BPF sockets for reading and writing (and it 
 moves the write socket to a privileged child process via fork()).
 The problem we're facing is that dhclient does not set 
 _any_ read filter on the write socket:
 21:27 [0] zfscurr0# netstat -B
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
  1529    em0 --fs--l 86774 86769 86784  4044  3180 dhclient
 --- ^ --
  1526    em0 -ifs--l 86789     0     1     0     0 dhclient

 so all traffic is pushed to it, introducing contention on the BPF 
 descriptor mutex.

 (That's why I've asked for netstat -B output.)

 Please try the attached patch to fix this. This is not the right way to 
 fix it; we'd better change BPF behavior not to attach write-only 
 consumers to interface readers.
 This has been partially implemented as the net.bpf.optimize_writers hack, 
 but it does not work for all direct BPF consumers (those which are not 
 using the pcap(3) API).
 
 Ok, looks like this patch helps the issue...
 
 netstat -B; sleep 5; netstat -B:
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
   958    em0 --fs--l   3881435  3868  2236 dhclient
   976    em0 -ifs--l   3880014     0     1     0     0 dhclient
   Pid  Netif   Flags  Recv  Drop Match Sblen Hblen Command
   958    em0 --fs--l   41785251435  3868  2236 dhclient
   976    em0 -ifs--l   4178539     0     1     0     0 dhclient
 
 and now the rate only drops from ~66MiB/s to ~63MiB/s when dhclient is
 running...  Still a significant drop (5%), but better than before...
Interesting.
Can you provide some traces (pmc or dtrace ones)?
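(In case it helps, the usual recipe for kernel flame graphs like the ones
linked earlier in the thread is a profile-provider DTrace one-liner plus the
FlameGraph scripts -- a sketch of the common invocation, not necessarily the
exact one used for those SVGs:

# sample on-CPU kernel stacks at 997 Hz for 60 seconds
dtrace -x stackframes=100 \
    -n 'profile-997 /arg0/ { @[stack()] = count(); } tick-60s { exit(0); }' \
    -o out.kern_stacks
# fold and render with Brendan Gregg's FlameGraph tools
stackcollapse.pl out.kern_stacks > out.folded
flamegraph.pl out.folded > out.svg

pmcstat sampling (-O to record, then -R with -G to dump call graphs) gives
equivalent data via hwpmc if that is preferred over DTrace.)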

I'm unsure if this will help, but it's 

dhclient sucks cpu usage...

2014-06-09 Thread John-Mark Gurney
So, after finding out that nc has a stupidly small buffer size (2k
even though there is space for 16k), I was still not getting good
performance using nc between machines, so I decided to generate some
flame graphs to try to identify issues...  (Thanks to whoever included a
full set of modules, including dtraceall on memstick!)

So, the first one is:
https://www.funkthat.com/~jmg/em.stack.svg

As I was browsing around, the em_handle_que was consuming quite a bit
of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
me that the taskqueue for em was consuming about 50% cpu...  Also pretty
high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
or anything, but I think dhclient uses bpf to be able to inject packets
and listen in on them, so I kill off dhclient, and instantly, the taskqueue
thread for em drops down to 40% CPU... (transfer rate only marginally
improves, if it does)

I decide to run another flame graph w/o dhclient running:
https://www.funkthat.com/~jmg/em.stack.nodhclient.svg

and now _rxeof drops from 17.22% to 11.94%, pretty significant...

So, if you care about performance, don't run dhclient...

-- 
  John-Mark Gurney  Voice: +1 415 225 5579

 All that I will do, has been done, All that I have, has not.


Re: dhclient sucks cpu usage...

2014-06-09 Thread Bryan Venteicher
Hi,

- Original Message -
 So, after finding out that nc has a stupidly small buffer size (2k
 even though there is space for 16k), I was still not getting good
 performance using nc between machines, so I decided to generate some
 flame graphs to try to identify issues...  (Thanks to whoever included a
 full set of modules, including dtraceall on memstick!)
 
 So, the first one is:
 https://www.funkthat.com/~jmg/em.stack.svg
 
 As I was browsing around, the em_handle_que was consuming quite a bit
 of cpu usage for only doing ~50MB/sec over gige..  Running top -SH shows
 me that the taskqueue for em was consuming about 50% cpu...  Also pretty
 high for only 50MB/sec...  Looking closer, you'll see that bpf_mtap is
 consuming ~3.18% (under ether_nh_input)..  I know I'm not running tcpdump
 or anything, but I think dhclient uses bpf to be able to inject packets
 and listen in on them, so I kill off dhclient, and instantly, the taskqueue
 thread for em drops down to 40% CPU... (transfer rate only marginally
 improves, if it does)
 
 I decide to run another flame graph w/o dhclient running:
 https://www.funkthat.com/~jmg/em.stack.nodhclient.svg
 
 and now _rxeof drops from 17.22% to 11.94%, pretty significant...
 
 So, if you care about performance, don't run dhclient...
 

Yes, I've noticed the same issue. It can absolutely kill performance
in a VM guest. It is much more pronounced on only some of my systems,
and I hadn't tracked it down yet. I wonder if this is fallout from
the callout work, or if there was some bpf change.

I've been using the kludgey workaround patch below.

diff --git a/sys/net/bpf.c b/sys/net/bpf.c
index cb3ed27..9751986 100644
--- a/sys/net/bpf.c
+++ b/sys/net/bpf.c
@@ -2013,9 +2013,11 @@ bpf_gettime(struct bintime *bt, int tstype, struct mbuf *m)
return (BPF_TSTAMP_EXTERN);
}
}
+#if 0
if (quality == BPF_TSTAMP_NORMAL)
binuptime(bt);
else
+#endif
getbinuptime(bt);
 
return (quality);


 --
   John-Mark Gurney    Voice: +1 415 225 5579
 
  All that I will do, has been done, All that I have, has not.
 