On Tue, Aug 30, 2016 at 10:48:53AM +0200, Mike Belopuhov wrote:
> On Tue, Aug 30, 2016 at 08:31 +0200, Mark Kettenis wrote:
> > > Date: Tue, 30 Aug 2016 07:48:09 +0200
> > > From: Mike Belopuhov <m...@belopuhov.com>
> > > 
> > > On Tue, Aug 30, 2016 at 09:58 +1000, David Gwynne wrote:
> > > > On Mon, Aug 29, 2016 at 08:30:37PM +0200, Alexander Bluhm wrote:
> > > > > On Mon, Aug 29, 2016 at 07:10:48PM +0200, Mike Belopuhov wrote:
> > > > > > Due to a recent change in -current the socket sending routine
> > > > > > has started producing small data packets crossing a memory page
> > > > > > boundary. This is not supported by Xen, and kernels with this
> > > > > > change will experience broken bulk TCP transmit behaviour.
> > > > > > We're working on fixing it.
> > > > > 
> > > > > For the same reason some old i386 machines from 2006 and 2005 have
> > > > > performance problems when sending data with tcpbench.
> > > > > 
> > > > > em 82573E drops to 200 MBit/sec output; 82546GB and 82540EM do only
> > > > > 10 MBit anymore.
> > > > > 
> > > > > With the patch below I get 946, 642 and 422 MBit/sec output
> > > > > performance over these chips respectively.
> > > > > 
> > > > > I don't know whether PAGE_SIZE is the correct fix, as I think the
> > > > > problem is more related to the network chip than to the processor's
> > > > > page size.
> > > > 
> > > > does this diff help those chips?
> > > 
> > > This diff defeats the purpose of the sosend change by punishing
> > > every other chip not suffering from the aforementioned problem.
> > > Lots of packets from the bulk TCP transfer will have to be
> > > defragmented for no good reason.
> > 
> > No, this em diff will still do proper scatter/gather. It might
> > consume more descriptors, as it will use two descriptors for packets
> > crossing a page boundary. But the fact that we collect more data into
> > an mbuf will actually reduce the number of descriptors in other cases.
> 
> Right, my bad. I didn't think this through.
> 
> > Regarding the xnf(4) issue; I think any driver that can't properly
> > deal with an mbuf crossing a page boundary is broken. I can't think
> > of any modern dma engine that can't handle that properly, or doesn't
> > at least support scatter/gather of some sort.
> 
> To set things straight: xnf does support and utilize fragmented packets.
> 
> This functionality is limited in several ways, however. First of all,
> it may not be supported: some (old?) NetBSD-based setups don't support
> scatter-gather at all and require that the packet fit into one 4k
> buffer. This now requires a bcopy into a temporary buffer, while
> previously it didn't.
I think not much changes here. If you get more than one segment, you
lose. Bad HW, bad performance...

> The second limitation applies when scatter-gather i/o is supported:
> Netfront provides us with 256 general purpose ring descriptors that
> describe either a complete packet or one of up to 18 chunks of the
> said packet. Therefore there's no traditional fragment SGL attached
> to a descriptor; the whole 256-entry ring is one big fragment SGL
> itself.
> 
> Furthermore, each one of the 256 entries has a reference to a single
> 4k buffer. This reference is a limited resource itself, as it's an
> entry in an IOMMU-like structure called Grant Tables. Typically there
> are only 32 grant table pages and each page holds 512 entries (IIRC).
> One xnf device uses 256 entries for the Rx ring, several entries for
> the ring header and 256 * NFRAG entries for the Tx ring. Right now
> this NFRAG is 1. Bumping it to 2 is probably not a problem. However,
> if we want (and we do) to support jumbo frames (9000 bytes, no less),
> we'd have to bump it up to 4 entries to fit one jumbo frame, which
> eats up two whole grant table pages (1024 entries). That's roughly
> 3 pages per xnf in a *typical* setup. Since it's a shared resource
> for all Xen PV drivers, this limits the number of xnf interfaces to
> about 9. If the disk driver appears, we might be limited to far fewer
> supported interfaces. But at the moment that's speculation at best.
> 
> Now that the limitations of the interface are spelled out, we can see
> that bus_dmamap_load_mbuf would be a tremendously wasteful interface:
> 
>   256 descriptors * 18 fragments = 4608 grant table entries
>   4608 / 512 entries per grant table page = 9 pages per xnf
> 
> out of 32 in total per system. This is the reason for the manual mbuf
> chain traversal code that does a bus_dmamap_load into a single buffer.

This is about how almost all HW rings work. The driver creates the DMA
map with nsegments = 18, maxsegsz = PAGE_SIZE and boundary = PAGE_SIZE.
This will waste a few resources in the bus_dmamap_t, but that's it.
bus_dmamap_load_mbuf is used to load the mbuf chain into the DMA map,
and then the driver loops over the map's dm_segs and fills the
256-entry ring. You can check that you have at least 18 free entries
in the ring before doing the work, and if bus_dmamap_load_mbuf fails
because the mbuf chain is too scattered, m_defrag can be used to
defragment the chain.

bus_dmamap_load_mbuf does not by itself allocate grant table entries --
at least I don't see how that would happen. So I think you can still
run everything with the 256-entry ring and therefore 256 grant table
entries.

> At the same time, this is how fragments are used right now: every
> m->m_data within a chain is its own fragment. The sosend change
> requires an additional change to support multiple segments for
> each m->m_data and the use of additional descriptors to cover for
> that. While it's possible to do, this is a requirement that was
> pushed on me w/o any notification.
> 
> Hope it clears the situation up.
> 
> > There may be old crufty
> > stuff though that can't deal with it, but those probably already have
> > "bcopy" drivers. Now there may be drivers that don't enforce the
> > boundary properly. Those will mysteriously stop working. Will we be
> > able to fix all of those before 6.1 gets released?
> 
> Since it depends on users providing test coverage, I wouldn't bet on it.

-- 
:wq Claudio
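[Editor's note: the tx pattern described above, as a rough, untested
sketch; this is not the actual xnf code, and XNF_MAX_PKT, the softc
fields and xnf_put_slot() are invented names for illustration.]

```c
/* illustrative fragment, not the real xnf driver */
int i, error;
bus_dmamap_t map;

/* at attach time: segments are capped at PAGE_SIZE and may not
 * cross a page boundary */
if (bus_dmamap_create(sc->sc_dmat, XNF_MAX_PKT /* invented */, 18,
    PAGE_SIZE /* maxsegsz */, PAGE_SIZE /* boundary */,
    BUS_DMA_WAITOK, &map) != 0)
	return (ENOMEM);

/* at transmit time: load the chain; defragment once if it is too
 * scattered to fit 18 segments */
error = bus_dmamap_load_mbuf(sc->sc_dmat, map, m, BUS_DMA_NOWAIT);
if (error == EFBIG && m_defrag(m, M_DONTWAIT) == 0)
	error = bus_dmamap_load_mbuf(sc->sc_dmat, map, m, BUS_DMA_NOWAIT);
if (error != 0) {
	m_freem(m);
	return (error);
}

/* one ring slot per dma segment; the caller has already checked
 * that at least dm_nsegs slots (<= 18) are free */
for (i = 0; i < map->dm_nsegs; i++)
	xnf_put_slot(sc, map->dm_segs[i].ds_addr,
	    map->dm_segs[i].ds_len, i == map->dm_nsegs - 1);
bus_dmamap_sync(sc->sc_dmat, map, 0, map->dm_mapsize,
    BUS_DMA_PREWRITE);
```

Only bus_dmamap_create(9), bus_dmamap_load_mbuf(9), bus_dmamap_sync(9),
m_defrag(9) and m_freem(9) are real interfaces here; the ring fill and
free-slot bookkeeping are the driver's own business.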