On Tue, Aug 30, 2016 at 10:48:53AM +0200, Mike Belopuhov wrote:
> On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > > From: Mike Belopuhov <m...@belopuhov.com>
> > > 
> > > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > > Due to a recent change in -current the socket sending routine
> > > > > > has started producing small data packets crossing memory page
> > > > > > boundary.  This is not supported by Xen and kernels with this
> > > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > > We're working on fixing it.
> > > > > 
> > > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > > performance problems when sending data with tcpbench.
> > > > > 
> > > > > em 82573E drops to 200 MBit/sec output, 82546GB and 82540EM do only
> > > > > 10 MBit anymore.
> > > > > 
> > > > > With the patch below I get 946, 642, 422 MBit/sec output performance
> > > > > over these chips respectively.
> > > > > 
> > > > > Don't know whether PAGE_SIZE is the correct fix as I think the problem
> > > > > is more related to the network chip than to the processor's page
> > > > > size.
> > > > 
> > > > does this diff help those chips?
> > > >
> > > 
> > > This diff defeats the purpose of the sosend change by punishing
> > > every other chip not suffering from the aforementioned problem.
> > > Lots of packets from the bulk TCP transfer will have to be
> > > defragmented for no good reason.
> > 
> > No, this em diff will still do proper scatter/gather.  It might
> > consume more descriptors as it will use two descriptors for packets
> > crossing a page boundary.  But the fact that we collect more data into
> > an mbuf will actually reduce the number of descriptors in other cases.
> >
> 
> Right, my bad.  I didn't think this through.
> 
> > Regarding the xnf(4) issue; I think any driver that can't properly
> > deal with an mbuf crossing a page boundary is broken.  I can't think
> > of any modern dma engine that can't handle that properly, or doesn't
> > at least support scatter/gather of some sort.
> 
> To set things straight: xnf does support and utilize fragmented packets.
> 
> This functionality is limited in several ways, however.  First of all
> it may not be supported: some (old?) NetBSD based setups don't support
> scatter-gather at all and require that the packet fit in a single 4k buffer.
> This now requires a bcopy into a temporary buffer, while previously it
> didn't.

I think not much changes here.  If you get more than one segment, you lose.
Bad HW, bad performance...
 
> The second limitation applies when scatter-gather i/o is supported:
> Netfront provides us with 256 general purpose ring descriptors that describe
> either a complete packet or one of up to 18 chunks of the said packet.
> Therefore there's no traditional fragment SGL attached to a descriptor,
> but the whole 256 entry ring is one big fragment SGL itself.
> 
> Furthermore each one of the 256 entries has a reference to a single 4k
> buffer.  This reference is a limited resource itself as it's an entry
> in an IOMMU-like structure called Grant Tables.  Typically there are
> only 32 grant table pages and each page holds 512 entries (IIRC).  One
> xnf device uses 256 entries for Rx ring, several entries for the ring
> header and 256 * NFRAG entries for the Tx ring.  Right now this NFRAG
> is 1.  Bumping it to 2 is probably not a problem.  However if we want
> (and we do) to support jumbo frames (9000 no less) we'd have to bump
> it up to 4 entries to fit one jumbo frame which eats up two whole
> grant table pages (1024 entries).  That's roughly 3 pages per xnf
> in a *typical* setup.  Since it's a shared resource for all Xen PV
> drivers, this limits the number of xnf interfaces to about 9.  If the
> disk driver appears, we might be limited to a much smaller number of
> supported interfaces.  But at the moment it's speculation at best.
> 
> Now that limitations of the interface are specified, we can see that
> bus_dmamap_load_mbuf would be a tremendously wasteful interface:
> 256 descriptors * 18 fragments = 4608 grant table entries
> 4608 / 512 entries per grant table page = 9 pages per xnf out of 32
> in total per system.  This is the reason for the manual mbuf chain
> traversal code that does a bus_dmamap_load into a single buffer.

This is about how almost all HW rings work.
The driver creates the DMA map with nsegments = 18, maxsegsz = PAGE_SIZE and
boundary = PAGE_SIZE. This will waste a few resources in the bus_dmamap_t
but that's it.  bus_dmamap_load_mbuf is used to load the mbuf chain into
the DMA map, and then the driver loops over the map's dm_segs and fills
the 256-entry ring.  You can check that you have at least 18 free entries in the
SGL before doing the work and if bus_dmamap_load_mbuf fails because the
mbuf chain is too scattered then m_defrag can be used to defragment the
chain. bus_dmamap_load_mbuf does not by itself allocate grant table
entries -- at least I don't see how that would happen. So I think you
can still run everything with the 256 entry ring and so 256 grant table
entries.
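As a sketch, the pattern described above might look like this (uncompiled
kernel pseudocode against the OpenBSD bus_dma API; `ring_slots_free()` and
`ring_fill_slot()` stand in for xnf's actual ring bookkeeping and are made
up for illustration):

```c
/* once, at attach time: a map that can never hand us a segment
 * crossing a page boundary */
error = bus_dmamap_create(sc->sc_dmat, MCLBYTES, 18 /* nsegments */,
    PAGE_SIZE /* maxsegsz */, PAGE_SIZE /* boundary */,
    BUS_DMA_WAITOK, &txb->txb_dmamap);

/* per packet, in the start routine */
error = bus_dmamap_load_mbuf(sc->sc_dmat, txb->txb_dmamap, m,
    BUS_DMA_NOWAIT);
if (error == EFBIG) {
	/* chain too scattered: compact it and retry once */
	if (m_defrag(m, M_DONTWAIT) == 0)
		error = bus_dmamap_load_mbuf(sc->sc_dmat,
		    txb->txb_dmamap, m, BUS_DMA_NOWAIT);
}
if (error)
	goto drop;

/* make sure the whole SGL fits before touching the ring */
if (txb->txb_dmamap->dm_nsegs > ring_slots_free(sc))
	goto requeue;

for (i = 0; i < txb->txb_dmamap->dm_nsegs; i++) {
	bus_dma_segment_t *seg = &txb->txb_dmamap->dm_segs[i];

	/* one ring descriptor per segment; each segment is at most
	 * PAGE_SIZE long and never crosses a page boundary, so one
	 * grant table entry per ring slot still suffices */
	ring_fill_slot(sc, seg->ds_addr, seg->ds_len,
	    i == txb->txb_dmamap->dm_nsegs - 1 /* last chunk */);
}
```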

> At the same time this is how fragments are used right now: every
> m->m_data within a chain is its own fragment.  The sosend change
> requires an additional change to support multiple segments for
> each m->m_data and use of additional descriptors to cover for that.
> While it's possible to do, this is a requirement that was pushed
> on me without any notice.
> 
> Hope it clears the situation up.
> 
> > There may be old crufty
> > stuff though that can't deal with it, but those probably already have
> > "bcopy" drivers.  Now there may be drivers that don't enforce the
> > boundary properly.  Those will mysteriously stop working.  Will we be
> > able to fix all of those before 6.1 gets released?
> > 
> 
> Since it depends on users providing test coverage, I wouldn't bet on it.
> 

-- 
:wq Claudio
