Hi all,

After the ifnet/ifaddr per-CPU stats work (i.e. cache pollution is
avoided), IP forwarding performance improves again!  Here is the
performance I currently get as of git 8eb1b0.

Per-CPU stats give me:
+210Kpps for 2 bidirectional normal IP forwarding (now 4.61Mpps)
+440Kpps for 2 bidirectional fast IP forwarding (now 5.67Mpps)

For fast IP forwarding, we are _not_ that far away from maxing out the
4 GigE interfaces (which is 5.95Mpps).
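(For reference, 5.95Mpps is the theoretical limit of 4 GigE ports for
minimum-size frames: the 18 byte UDP payload plus UDP/IP/Ethernet
headers and FCS gives a 64 byte frame, which occupies 84 bytes on the
wire once the preamble and inter-frame gap are added, i.e. 672 bits,
so one port tops out at 10^9 / 672 ~= 1.488Mpps and four ports at
~5.95Mpps.)

For those who have not followed the per-CPU stats work, the sketch
below shows the general idea only; it is not the actual ifnet/ifaddr
code, and the names (if_stats_pcpu, SKETCH_MAXCPU) are made up for the
example.  Each CPU bumps its own cache-line aligned counter slot, so
the hot RX/TX paths never dirty a cache line shared with another CPU;
a reader simply sums the slots.

/*
 * Minimal illustrative sketch of per-CPU interface counters.
 * NOT the actual DragonFly ifnet/ifaddr implementation.
 */
#include <stdint.h>

#define SKETCH_MAXCPU   8       /* assumed CPU count (i7-2600: 4C/8T) */
#define CACHE_LINE_SIZE 64

struct if_stats_pcpu {
        uint64_t        ipackets;       /* packets received on this CPU */
        uint64_t        opackets;       /* packets sent from this CPU */
} __attribute__((__aligned__(CACHE_LINE_SIZE)));

static struct if_stats_pcpu if_stats[SKETCH_MAXCPU];

/* Hot path: runs on the owning CPU; no locks, no shared cache lines. */
static inline void
if_inc_ipackets(int cpuid)
{
        if_stats[cpuid].ipackets++;
}

/* Slow path (e.g. statistics readout): sum all per-CPU slots. */
static uint64_t
if_sum_ipackets(void)
{
        uint64_t sum = 0;
        int i;

        for (i = 0; i < SKETCH_MAXCPU; i++)
                sum += if_stats[i].ipackets;
        return (sum);
}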
Detailed information is the same as in the last measurement.

Best Regards,
sephe

On Fri, Feb 1, 2013 at 5:53 PM, Sepherosa Ziehau <[email protected]> wrote:
> Hi all,
>
> Now the multiple TX queue support is finished in master (the generic
> layer is done, igb(4) is converted).
> Here is the performance I currently get as of git 2fb36fa.
>
> Quick summary, the multiple TX queue support gives me:
> +200Kpps for 2 bidirectional normal IP forwarding (now 4.40Mpps)
> +160Kpps for 2 bidirectional fast IP forwarding (now 5.23Mpps)
>
> During the performance measurement, the system is very responsive.
>
> For detailed information, please read the following inline comments.
>
> On Thu, Dec 27, 2012 at 4:42 PM, Sepherosa Ziehau <[email protected]> wrote:
>> Hi all,
>>
>> Before I move on to the next big ticket (multiple TX queue support),
>> here is the performance I currently get as of git 2aa7f7f.
>>
>> Quick summary, the IFQ packet staging mechanism gives me:
>> +80Kpps for 2 bidirectional normal IP forwarding (now 4.20Mpps)
>> +30Kpps for 2 bidirectional fast forwarding (now 5.07Mpps)
>>
>> For detailed information, please read the following inline comments.
>>
>> On Thu, Dec 20, 2012 at 3:03 PM, Sepherosa Ziehau <[email protected]> wrote:
>>> On Fri, Dec 14, 2012 at 5:47 PM, Sepherosa Ziehau <[email protected]> wrote:
>>>> Hi all,
>>>>
>>>> This email serves as the base performance measurement for further
>>>> network stack optimization (as of git 107282b).
>>>
>>> Since bidirectional fast IP forwarding already maxes out the GigE
>>> limit, I have increased the measurement strength a bit.  The new
>>> measurement is against git 7e1fbcf.
>>>
>>>> The hardware:
>>>> mobo                  ASUS P867H-M
>>>> memory                4x4G DDR3
>>>> CPU                   i7-2600 (w/ HT and Turbo Boost enabled, 4C/8T)
>>>> Forwarding NIC        Intel 82576EB dual copper
>>>
>>> The forwarding NIC is now changed to an 82580EB quad copper.
>>>
>>>> Packet generator NICs Intel 82571EB dual copper
>>>>
>>>> A emx1 <---> igb0 forwarder igb1 <---> emx1 B
>>>
>>> The testing topology is changed into the following configuration:
>>>
>>> +---+                  +-----------+                  +---+
>>> |   | emx1 <---> igb0  |           |  igb1 <---> emx1 |   |
>>> | A |                  | forwarder |                  | B |
>>> |   | emx2 <---> igb2  |           |  igb3 <---> emx2 |   |
>>> +---+                  +-----------+                  +---+
>>>
>>> Streams:
>>> A.emx1 <---> B.emx1 (bidirectional)
>>> A.emx2 <---> B.emx2 (bidirectional)
>>>
>>>> A and "forwarder", B and "forwarder" are directly connected using
>>>> CAT6 cables.
>>>> Polling(4) is enabled on igb1 and igb0 on "forwarder".  The
>>>> following tunables are in /boot/loader.conf:
>>>> kern.ipc.nmbclusters="524288"
>>>> net.ifpoll.user_frac="10"
>>>> net.ifpoll.status_frac="1000"
>>
>> net.link.ifq_stage_cntmax="8"
>>
>>>> The following sysctl is changed before putting igb1 into polling mode:
>>>> sysctl hw.igb1.npoll_txoff=4
>>>
>>> sysctl hw.igb1.npoll_txoff=1
>>> sysctl hw.igb2.npoll_txoff=2
>>> sysctl hw.igb3.npoll_txoff=3
>
> The above sysctls are no longer needed, since all 8 hardware TX queues
> are enabled.  The CPUID offset is always 0 (the i7-2600 has 8 HT
> threads).
>
>> sysctl hw.igb0.tx_wreg_nsegs=16
>> sysctl hw.igb1.tx_wreg_nsegs=16
>> sysctl hw.igb2.tx_wreg_nsegs=16
>> sysctl hw.igb3.tx_wreg_nsegs=16
>>
>>>> First, for the users that are only interested in the bulk forwarding
>>>> performance: the 32 netperf TCP_STREAMs running on A could do
>>>> 941Mbps.
>>>>
>>>> Now the tiny packet forwarding performance:
>>>>
>>>> A and B generate 18 byte UDP datagrams using
>>>> tools/tools/netrate/pktgen.  The destination addresses of the UDP
>>>> datagrams are selected so that the generated UDP datagrams will be
>>>> evenly distributed to the 8 RX queues, which should be common in a
>>>> production environment.
>>>>
>>>> Bidirectional normal IP forwarding:
>>>> 1.42Mpps in each direction, so a total of 2.84Mpps is forwarded.
>>>> CPU usage:
>>>> On CPUs that are doing TX in addition to RX: 85% ~ 90% (max allowed
>>>> by polling's user_frac)
>>>> On CPUs that are only doing RX: 40% ~ 50%
>>>
>>> Two sets of bidirectional normal IP forwarding:
>>> 1.03Mpps in each direction, so a total of 4.12Mpps is forwarded.
>>
>> 1.05+Mpps in each direction, so a total of 4.20Mpps is forwarded.
>
> 1.10+Mpps in each direction, so a total of 4.40Mpps is forwarded.
>
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 70% ~ 80%
>>
>> Not much improvement in CPU usage.
>
> All CPUs now do RX and TX; the CPU usage is 90% (max allowed by
> polling's user_frac).
>
>>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>>
>> IPI rate on CPUs that are doing TX in addition to RX: ~4.5K/s
>
> No more cross-CPU IPIs; packet processing is now completely CPU
> localized.
>
>>>> Bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>>> 1.48Mpps in each direction, so a total of 2.96Mpps is forwarded.
>>>> CPU usage:
>>>> On CPUs that are doing TX in addition to RX: 65% ~ 70%
>>>> On CPUs that are doing RX: 30% ~ 40%
>>>
>>> Two sets of bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>> 1.26Mpps in each direction, so a total of 5.04Mpps is forwarded.
>>
>> ~1.27Mpps in each direction, so a total of 5.07Mpps is forwarded.
>
> ~1.31Mpps in each direction, so a total of 5.23Mpps is forwarded.
>
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 60% ~ 70%
>>
>> Not much improvement in CPU usage.
>
> All CPUs now do RX and TX; the CPU usage is 90% (max allowed by
> polling's user_frac).
>
>>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>>
>> IPI rate on CPUs that are doing TX in addition to RX: ~5K/s
>
> No more cross-CPU IPIs; packet processing is now completely CPU
> localized.
>
> Best Regards,
> sephe
>
> --
> Tomorrow Will Never Die

--
Tomorrow Will Never Die
