On Thu, Jan 28, 2016 at 12:17 AM, Taylor R Campbell
<campbell+netbsd-tech-k...@mumble.net> wrote:
> Date: Wed, 27 Jan 2016 16:51:22 +0900
> From: Ryota Ozaki <ozak...@netbsd.org>
>
> Here it is: http://www.netbsd.org/~ozaki-r/softint-if_input-ifqueue.diff
>
> Results of performance measurements of it are also added to
> https://gist.github.com/ozaki-r/975b06216a54a084debc
>
> The results are good but bother me; the patch achieves better
> performance than vanilla (and the 1st implementation) under high
> load (IP forwarding). For fast forward, it also beats the 1st one.
>
> I thought that holding splnet during ifp->if_input (splnet is needed
> for the ifqueue operations, so the patch keeps holding it) might
> affect the results. So I tried releasing it during ifp->if_input,
> but the results didn't change much (the IP forwarding result is
> still better than vanilla).
>
> Anyone have any ideas?
>
> Here's a wild guess: with vanilla, each CPU does
>
>       wm_rxeof loop iteration
>       if_input processing
>       wm_rxeof loop iteration
>       if_input processing
>       ...
>
> back and forth. With softint-rx-ifq, each CPU does
>
>       wm_rxeof loop iteration
>       wm_rxeof loop iteration
>       ...
>       if_input processing
>       if_input processing
>       ...
>
> because softint processing is blocked until the hardintr handler
> completes. So vanilla might make less efficient use of the CPU cache,
> and vanilla might leave the rxq full for longer so that the device
> cannot fill it as quickly with incoming packets.

That might be true. If so, the real question may be why the old
implementation isn't as efficient as the new one.
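For reference, the difference between the two receive paths is roughly
the sketch below. This is not the actual wm(4) code or the patch; the
function and variable names (vanilla_rxeof, deferred_rxeof, rx_softint,
rx_intrq, rx_si) and the get_next_rx_packet() helper are made up for
illustration, and it assumes the rx interrupt is established at IPL_NET.

    /* vanilla: the rx interrupt handler pushes each packet into the
     * stack immediately, so ring processing and if_input alternate. */
    static void
    vanilla_rxeof(struct ifnet *ifp)
    {
            struct mbuf *m;

            while ((m = get_next_rx_packet()) != NULL)  /* hypothetical helper */
                    (*ifp->if_input)(ifp, m);           /* stack work per packet */
    }

    /* softint-rx-ifq: the rx interrupt handler only drains the ring
     * into an ifqueue and schedules a softint; if_input runs later,
     * in batch, once the hardintr handler has completed. */
    static struct ifqueue rx_intrq;     /* would be initialized elsewhere */
    static void *rx_si;  /* from softint_establish(SOFTINT_NET, rx_softint, ifp) */

    static void
    deferred_rxeof(struct ifnet *ifp)
    {
            struct mbuf *m;

            while ((m = get_next_rx_packet()) != NULL)
                    /* no stack work here; the rx interrupt runs at IPL_NET,
                     * so the splnet() section in the softint excludes it */
                    IF_ENQUEUE(&rx_intrq, m);
            softint_schedule(rx_si);
    }

    static void
    rx_softint(void *arg)
    {
            struct ifnet *ifp = arg;
            struct mbuf *m;
            int s;

            for (;;) {
                    s = splnet();               /* serialize ifqueue access */
                    IF_DEQUEUE(&rx_intrq, m);
                    splx(s);
                    if (m == NULL)
                            break;
                    (*ifp->if_input)(ifp, m);   /* batched stack work */
            }
    }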
> Another experiment that might be worthwhile is to bind the interrupt
> to a specific CPU, and then use splnet instead of WM_RX_LOCK to avoid
> acquiring and releasing a lock for each packet.

In the measurements, all interrupts are already delivered to CPU#0.
Removing the lock doesn't change the results; I guess acquiring and
releasing an uncontended lock is cheap. Note that wm has an RX lock
per HW queue, so RX processing can basically be done without lock
contention (see the sketch at the end of this mail).

> (On Intel >=Haswell,
> we should use transactional memory to avoid bus traffic for that
> anyway (and maybe invent an MD pcq(9) that does the same). But the
> experiment with wm(4) is easier, and not everyone has transactional
> memory.)

How does transactional memory help?
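For reference, by "an RX lock per HW queue" I mean roughly the
arrangement below (again just a sketch with made-up names, not the
actual if_wm.c structures):

    struct rxq {
            kmutex_t        rxq_lock;   /* protects only this queue's state */
            /* descriptor ring, mbuf pointers, stats, ... */
    };

    static void
    rxq_service(struct rxq *rxq)
    {
            /* rxq_lock would be mutex_init()'d with IPL_NET */
            mutex_enter(&rxq->rxq_lock);        /* normally uncontended: only
                                                   the CPU servicing this queue
                                                   takes it */
            /* drain this queue's descriptors and hand the packets up */
            mutex_exit(&rxq->rxq_lock);
    }

  ozaki-r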