Hi all,

The multiple TX queue support is now finished in master (the generic
layer is done and igb(4) is converted).  Here is the performance I
currently get as of git 2fb36fa.
Quick summary, the multiple TX queue support gives me:
+200Kpps for 2 bidirectional normal IP forwarding (now 4.40Mpps)
+160Kpps for 2 bidirectional fast IP forwarding (now 5.23Mpps)

During the performance measurement, the system stays very responsive.

For detailed information, please read the following inline comments.

On Thu, Dec 27, 2012 at 4:42 PM, Sepherosa Ziehau <[email protected]> wrote:
> Hi all,
>
> Before I move on to the next big ticket (multiple-tx queue support),
> here is the performance I currently got as of git 2aa7f7f.
>
> Quick summary, the IFQ packets staging mechanism gives me:
> +80Kpps for 2 bidirectional normal IP forwarding (now 4.20Mpps)
> +30Kpps for 2 bidirectional fast forwarding (now 5.07Mpps)
>
> Detailed information, please read the following inline comment.
>
> On Thu, Dec 20, 2012 at 3:03 PM, Sepherosa Ziehau <[email protected]> wrote:
>> On Fri, Dec 14, 2012 at 5:47 PM, Sepherosa Ziehau <[email protected]> wrote:
>>> Hi all,
>>>
>>> This email serves as the base performance measurement for further
>>> network stack optimization (as of git 107282b).
>>
>> Since bidirectional fast IP forwarding already maxes out the GigE
>> limit, I increase the measurement strength a bit.  The new measurement
>> is against git 7e1fbcf.
>>
>>> The hardware:
>>> mobo ASUS P867H-M
>>> 4x4G DDR3 memory
>>> CPU i7-2600 (w/ HT and Turbo Boost enabled, 4C/8T)
>>> Forwarding NIC Intel 82576EB dual copper
>>
>> The forwarding NIC is now changed to 82580EB quad copper.
>>
>>> Packet generator NICs Intel 82571EB dual copper
>>>
>>> A emx1 <---> igb0 forwarder igb1 <---> emx1 B
>>
>> The testing topology is changed into the following configuration:
>>
>> +---+                 +-----------+                 +---+
>> |   | emx1 <---> igb0 |           | igb1 <---> emx1 |   |
>> | A |                 | forwarder |                 | B |
>> |   | emx2 <---> igb2 |           | igb3 <---> emx2 |   |
>> +---+                 +-----------+                 +---+
>>
>> Streams:
>> A.emx1 <---> B.emx1 (bidirectional)
>> A.emx2 <---> B.emx2 (bidirectional)
>>
>>> A and "forwarder", B and "forwarder" are directly connected using CAT6
>>> cables.
>>> Polling(4) is enabled on igb1 and igb0 on "forwarder".  Following
>>> tunables are in /boot/loader.conf:
>>> kern.ipc.nmbclusters="524288"
>>> net.ifpoll.user_frac="10"
>>> net.ifpoll.status_frac="1000"
>
> net.link.ifq_stage_cntmax="8"
>
>>> Following sysctl is changed before putting igb1 into polling mode:
>>> sysctl hw.igb1.npoll_txoff=4
>>
>> sysctl hw.igb1.npoll_txoff=1
>> sysctl hw.igb2.npoll_txoff=2
>> sysctl hw.igb3.npoll_txoff=3

The above sysctls are no longer needed, since all 8 hardware TX queues
are enabled.  The CPUID offset is always 0 (the i7-2600 has 8 HT
threads).

>
> sysctl hw.igb0.tx_wreg_nsegs=16
> sysctl hw.igb1.tx_wreg_nsegs=16
> sysctl hw.igb2.tx_wreg_nsegs=16
> sysctl hw.igb3.tx_wreg_nsegs=16
>
>>> First for the users that are only interested in the bulk forwarding
>>> performance:  The 32 netperf TCP_STREAMs running on A could do
>>> 941Mbps.
>>>
>>> Now the tiny packets forwarding performance:
>>>
>>> A and B generate 18 bytes UDP datagrams using
>>> tools/tools/netrate/pktgen.  The destination addresses of the UDP
>>> datagrams are selected so that the generated UDP datagrams will be
>>> evenly distributed to the 8 RX queues, which should be common in the
>>> production environment.
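
BTW, in case anyone wonders how the NIC spreads the packets over the RX
queues: igb(4)'s RX queue selection is RSS based, i.e. a Toeplitz-style
hash over the IP addresses (for plain UDP this is typically just the
source and destination IPv4 addresses), so varying the destination
address varies the hash and therefore the RX ring.  The following is
only an illustrative sketch of that hash, not the in-tree code:

#include <stdint.h>

/*
 * Toeplitz hash sketch (illustration only).  'key' must be at least
 * datalen + 4 bytes long, e.g. the usual 40-byte RSS key.
 */
static uint32_t
toeplitz_hash(const uint8_t *key, const uint8_t *data, int datalen)
{
	uint32_t hash = 0, v;
	int i, b;

	/* v is the sliding 32-bit window into the key. */
	v = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
	    ((uint32_t)key[2] << 8) | key[3];
	for (i = 0; i < datalen; ++i) {
		for (b = 7; b >= 0; --b) {
			/* For each set input bit, XOR in the key window. */
			if (data[i] & (1 << b))
				hash ^= v;
			/* Slide the window left by one key bit. */
			v <<= 1;
			if (key[i + 4] & (1 << b))
				v |= 1;
		}
	}
	return hash;
}

Roughly speaking, the low bits of the resulting hash then index the
NIC's redirection table to pick the RX ring, which is why a suitable
spread of destination addresses lands evenly on all 8 RX queues.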
>>> Bidirectional normal IP forwarding:
>>> 1.42Mpps in each direction, so total 2.84Mpps are forwarded.
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 85% ~ 90% (max allowed by
>>> polling's user_frac)
>>> On CPUs that are only doing RX: 40% ~ 50%
>>
>> Two sets of bidirectional normal IP forwarding:
>> 1.03Mpps in each direction, so total 4.12Mpps are forwarded.
>
> 1.05+Mpps in each direction, so total 4.20Mpps are forwarded.

1.10+Mpps in each direction, so total 4.40Mpps are forwarded.

>> CPU usage:
>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>> polling's user_frac)
>> On CPUs that are only doing RX: 70% ~ 80%
>
> Not much improvement on CPU usage.

All CPUs now do both RX and TX; the CPU usage is 90% (max allowed by
polling's user_frac).

>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>
> IPI rate on CPUs that are doing TX in addition to RX: ~4.5K/s

No more cross-CPU IPIs; packet processing is now completely CPU
localized.

>>> Bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>>> 1.48Mpps in each direction, so total 2.96Mpps are forwarded.
>>> CPU usage:
>>> On CPUs that are doing TX in addition to RX: 65% ~ 70%
>>> On CPUs that are doing RX: 30% ~ 40%
>>
>> Two sets of bidirectional fast IP forwarding: (net.inet.ip.fastforwarding=1)
>> 1.26Mpps in each direction, so total 5.04Mpps are forwarded.
>
> ~1.27Mpps in each direction, so total 5.07Mpps are forwarded.

~1.31Mpps in each direction, so total 5.23Mpps are forwarded.

>> CPU usage:
>> On CPUs that are doing TX in addition to RX: 90% (max allowed by
>> polling's user_frac)
>> On CPUs that are only doing RX: 60% ~ 70%
>
> Not much improvement on CPU usage.

All CPUs now do both RX and TX; the CPU usage is 90% (max allowed by
polling's user_frac).

>> IPI rate on CPUs that are doing TX in addition to RX: ~10K/s
>
> IPI rate on CPUs that are doing TX in addition to RX: ~5K/s

No more cross-CPU IPIs; packet processing is now completely CPU
localized.
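
To make the "completely CPU localized" point a bit more concrete: with
a single TX queue, a packet received and forwarded on one CPU often had
to be handed to the CPU that owned the TX queue, which is where the
cross-CPU IPIs in the earlier runs came from.  With one hardware TX
queue per CPU, the forwarding path can simply use the ring owned by the
CPU it is already running on.  A minimal sketch of the idea follows;
the names below are made up for illustration and do not match the
in-tree igb(4)/ifq code:

#include <sys/types.h>

#define NTXQ	8			/* i7-2600: 8 HT threads */

struct mbuf;
extern int	mycpuid;		/* id of the CPU we are running on */

struct tx_ring {
	int	txr_cpuid;		/* CPU that owns/serializes this ring */
	/* descriptors, staged mbufs, ... */
};

struct igb_softc_sketch {
	struct tx_ring	tx_rings[NTXQ];	/* one hardware TX queue per CPU */
};

/* hypothetical helper: enqueue onto a ring; only called on txr_cpuid */
void	tx_ring_enqueue(struct tx_ring *txr, struct mbuf *m);

static void
if_start_localized(struct igb_softc_sketch *sc, struct mbuf *m)
{
	/*
	 * Pick the TX ring owned by the current CPU.  RX polling, IP
	 * forwarding and TX now all happen on this CPU, so the packet
	 * never has to be pushed to another CPU and no IPI is needed
	 * to kick a remote transmit.
	 */
	struct tx_ring *txr = &sc->tx_rings[mycpuid % NTXQ];

	tx_ring_enqueue(txr, m);
}

This is also why the npoll_txoff fiddling above goes away: with a TX
queue per CPU there is no longer a separate "TX CPU" to offset onto.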
Best Regards,
sephe

--
Tomorrow Will Never Die