On 30.11.2015. 12:55, David Gwynne wrote:
> this tweaks the guts of if_start so it guarantees that there's only
> ever one call to ifp->if_start running in the system at a time.
> previously this was implicit because it could only be called with
> the KERNEL_LOCK held.
>
> as we move forward it would be nice to run the queue without having
> to take the biglock. however, because we also want to dequeue the
> packets in order, it only makes sense to run a single instance of
> the function in the whole system.
>
> also, if a driver is recovering from an oactive situation (ie, it's
> been able to free space on the tx ring) it should be able to start
> tx again from an mpsafe interrupt context.
>
> because most of our drivers assume that they're run under the
> KERNEL_LOCK, this diff uses a flag for the internals of the if_start
> call to differentiate between them. it defaults to kernel locked,
> but drivers can opt in to an mpsafe version that can call
> ifp->if_start without the biglock held.
>
> the kernel locked code takes KERNEL_LOCK and splnet before calling
> ifp->if_start.
>
> the mpsafe code uses the serialisation mechanism that the scsi
> midlayer and pool runqueue use, but implemented with atomics instead
> of operations under a mutex.
>
> the semantic is that work is queued onto a list protected by a
> mutex (ie, the guts of struct ifqueue), and then a cpu tries to
> enter a critical section that runs a function to service the queued
> work. the cpu that enters the critical section has to dequeue work
> in a loop, which is what all our drivers do.
>
> if another cpu tries to enter the same critical section after
> queueing more work, it will return immediately rather than spin on
> the lock. the first cpu, which is currently dequeueing work in the
> critical section, will be told to spin again to guarantee that it
> services the work the other cpu added.
> so the network stack may be transmitting packets on cpu1 while an
> interrupt on cpu0 occurs which frees up tx descriptors. if cpu0
> calls if_start, it will return immediately because cpu1 will end
> up doing the work it wanted to do anyway.
>
> if the start routine can run on multiple cpus, then it becomes
> necessary to know it is NOT running anymore when tearing a nic
> down. to that end i have added an if_start_barrier function. an
> mpsafe driver can call that when it's being brought down to guarantee
> that another cpu isn't fiddling with the tx ring before freeing it.
>
> a driver opts in to the mpsafe if_start call by doing the following:
>
> 1. set ifp->if_xflags = IFXF_MPSAFE.
> 2. call if_start() instead of its own start routine (eg, myx_start).
> 3. clear IFF_RUNNING before calling if_start_barrier() on its way down.
> 4. only use IFQ_DEQUEUE (not ifq_deq_begin/commit/rollback).
>
> anyway, this is the diff i have come up with after playing with
> several ideas. it removes the IFXF_TXREADY semantics, ie, tx
> mitigation, and reuses the flag bit for IFXF_MPSAFE.
>
> the reason for that is that juggling or deferring the start routine
> made if_start_barrier annoyingly complicated, and all my attempts
> at it introduced a significant performance hit or were insanely
> complicated.
>
> tx mitigation only ever gave me back 5 to 10% before it was badly
> tweaked, and we've made a lot of other performance improvements
> since then. while i'm sad to see it go, i'd rather move forward
> than dwell on it.
>
> in the future i would like to try delegating the work to mpsafe
> taskqs, but in my attempts i lost something like 30% of my tx rate
> by doing that. i'd like to investigate that further, just not
> right now.
>
> finally, the last thing to consider is lock ordering problems.
> because contention on the ifq_serializer causes the second context
> to return immediately (that's true even if you call if_start from
> within a critical section), i think all the problems are avoided.
> i am more concerned with the ifq mutex than with the serialiser.
>
> anyway, here's the diff to look at. happy to discuss further.
>
> tests would be welcome too.
...and i bought a 10G-PCIE2-8B2L-2S although i'm an 82599 fan and i will never give up on IX :))))