On 30.11.2015. 12:55, David Gwynne wrote:
> this tweaks the guts of if_start so it guarantees that there's only
> ever one call to ifp->if_start running in the system at a time.
> previously this was implicit because it could only be called with
> the KERNEL_LOCK held.
> 
> as we move forward it would be nice to run the queue without having
> to take the biglock. however, because we also want to dequeue the
> packets in order, it only makes sense to run a single instance of
> the function in the whole system.
> 
> also, if a driver is recovering from an oactive situation (ie, it's
> been able to free space on the tx ring) it should be able to start
> tx again from an mpsafe interrupt context.
> 
> because most of our drivers assume that they're run under the
> KERNEL_LOCK, this diff uses a flag for the internals of the if_start
> call to differentiate between them. it defaults to kernel locked,
> but drivers can opt in to an mpsafe version that can call ifp->if_start
> without the mplock held.
> 
> the kernel locked code takes KERNEL_LOCK and splnet before calling
> ifp->if_start.
> 
> the mpsafe code uses the serialisation mechanism that the scsi
> midlayer and pool runqueue use, but implemented with atomics instead
> of operations under a mutex.
> 
> the semantics are that work will be queued onto a list protected by
> a mutex (ie, the guts of struct ifqueue), and then a cpu will try
> to enter a critical section that runs a function to service the
> queued work. the cpu that enters the critical section has to dequeue
> work in a loop, which is what all our drivers do.
> 
> if another cpu tries to enter the same critical section after
> queueing more work, it will return immediately rather than spin on
> the lock. the first cpu that is currently dequeueing work in the
> critical section will be told to spin again to guarantee that it
> will service the work the other cpu added.
> 
> so the network stack may be transmitting packets on cpu1 while an
> interrupt occurs on cpu0 that frees up tx descriptors. if cpu0
> calls if_start, it will return immediately because cpu1 will end
> up doing the work it wanted to do anyway.
> 
> if the start routine can run on multiple cpus, then it becomes
> necessary to know it is NOT running anymore when tearing a nic down.
> to that end i have added an if_start_barrier function. an mpsafe
> driver can call that when it's being brought down to guarantee that
> another cpu isn't fiddling with the tx ring before freeing it.
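for illustration only, one conceivable shape for that barrier, assuming
the counter-based serializer described above. every name and primitive
here is made up; this is not necessarily what the diff does:

```c
/*
 * hypothetical sketch, not the diff's code: wait until no cpu is
 * inside ifp->if_start. the driver has already cleared IFF_RUNNING,
 * so any start call that sneaks in after this point dequeues nothing
 * and leaves the tx ring alone.
 */
void
if_start_barrier(struct ifnet *ifp)
{
	KASSERT(!ISSET(ifp->if_flags, IFF_RUNNING));

	/* if_serializer is the hypothetical enter/leave counter:
	 * 0 means no cpu is in the critical section */
	while (ifp->if_serializer != 0)
		yield();	/* sleep or spin; sketch only */
}
```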
> 
> a driver opts in to the mpsafe if_start call by doing the following:
> 
> 1. setting ifp->if_xflags = IFXF_MPSAFE.
> 2. calling if_start() instead of its own start routine (eg, myx_start).
> 3. clearing IFF_RUNNING before calling if_start_barrier() on its way down.
> 4. only using IFQ_DEQUEUE (not ifq_deq_begin/commit/rollback).
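in driver terms the four steps look something like the sketch below.
the myx-flavoured function names are borrowed for illustration and the
bodies are invented, so treat this as a shape, not the actual driver
code:

```c
/* sketch only: invented bodies showing the four opt-in steps */

void
myx_attach(struct myx_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ac.ac_if;

	ifp->if_xflags = IFXF_MPSAFE;		/* step 1 */
	ifp->if_start = myx_start;
	/* ... */
}

void
myx_txeof(struct myx_softc *sc)
{
	/* tx interrupt freed descriptors. step 2: kick the common
	 * if_start(), never myx_start() directly, so the serializer
	 * stays in charge even from mpsafe interrupt context */
	if_start(&sc->sc_ac.ac_if);
}

void
myx_down(struct myx_softc *sc)
{
	struct ifnet *ifp = &sc->sc_ac.ac_if;

	CLR(ifp->if_flags, IFF_RUNNING);	/* step 3 */
	if_start_barrier(ifp);

	/* no other cpu is in myx_start now: safe to free the ring */
}

void
myx_start(struct ifnet *ifp)
{
	struct mbuf *m;

	if (!ISSET(ifp->if_flags, IFF_RUNNING))
		return;

	for (;;) {
		/* step 4: IFQ_DEQUEUE only, no begin/commit/rollback */
		IFQ_DEQUEUE(&ifp->if_snd, m);
		if (m == NULL)
			break;
		/* ... load m onto the tx ring ... */
	}
}
```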
> 
> anyway, this is the diff i have come up with after playing with
> several ideas. it removes the IFXF_TXREADY semantics, ie, tx
> mitigation, and reuses the flag bit for IFXF_MPSAFE.
> 
> the reason for that is that juggling or deferring the start routine
> made if_start_barrier annoyingly complicated, and all my attempts at
> it either introduced a significant performance hit or were insanely
> complicated.
> 
> tx mitigation only ever gave me back 5 to 10% before it was badly
> tweaked, and we've made a lot of other performance improvements
> since then. while i'm sad to see it go, i'd rather move forward than
> dwell on it.
> 
> in the future i would like to try delegating the work to mpsafe
> taskqs, but in my attempts i lost something like 30% of my tx rate
> by doing that. i'd like to investigate further, just not right now.
> 
> finally, the last thing to consider is lock ordering problems.
> because contention on the ifq_serializer causes the second context
> to return immediately (that's true even if you call if_start from
> within a critical section), i think all the problems are avoided.
> i am more concerned with the ifq mutex than i am with the serialiser.
> 
> anyway, here's the diff to look at. happy to discuss further.
> 
> tests would be welcome too.

...and i bought a 10G-PCIE2-8B2L-2S although i'm an 82599 fan and i
will never give up on IX  :))))

