RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
> -----Original Message-----
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <m...@redhat.com>
> Cc: Björn Töpel <bjorn.to...@gmail.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> ravineet.si...@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
>
> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karls...@intel.com>
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support
> > and mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less
> > > a copy of one of your first virtio 1.1 proposals from a little over
> > > a year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > queues.
> > > Maybe I should have only changed that aspect and kept the valid
> > > flag.
> >
> > Is there a summary about unidirectional queues anywhere? I'm curious
> > to know whether there are any lessons here to be learned for virtio
> > or ptr_ring.
>
> I did a quick hack in which I used your ptr_ring for the fill queue
> instead of our head/tail based one. In the corner cases (usually empty
> or usually full), there is basically no difference. But for the case
> when the queue is always half full, the ptr_ring implementation boosts
> the performance from 5.6 to 5.7 Mpps (as there is no cache line
> bouncing in this case) on my system (slower than Björn's that was used
> for the numbers in the RFC).
>
> So I think this should be implemented properly so we can get some real
> numbers. Especially since 0.1 Mpps with copies will likely become much
> more with zero-copy, as we are really chasing cycles there. We will
> get back a better evaluation in a few days.
>
> Thanks: Magnus
>
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now made a
real implementation of the ptr_ring principles in the af_xdp code. We
just added a switch in bind (only for the purpose of this test) to be
able to pick which ring implementation to use from the user space test
program. The main difference between our version of ptr_ring and our
head/tail ring is that the ptr_ring version uses the idx field to
signal whether the entry is available or not (idx == 0 indicating an
empty descriptor) and that it does not use the head and tail pointers
at all. Even though it is not a "ring of pointers" in our
implementation, we will still call it ptr_ring for the purpose of this
mail.

In summary, our experiments show that the two rings perform the same in
our micro benchmarks when the queues are balanced and rarely full or
empty, but the head/tail version performs better for RX when the queues
are not perfectly balanced. Why is that?
We do not exactly know, but there are a number of differences between a
ptr_ring in the kernel and one between user space and kernel space for
use in af_xdp:

* The user space descriptors have to be validated as we are
  communicating between user space and kernel space. This is done
  slightly differently for the two rings due to the batching below.

* The RX and TX rings have descriptors that are larger than one
  pointer, so we need to have barriers here even with ptr_ring. We
  cannot rely on an address dependency because the descriptor is not a
  pointer.

* Batching is performed slightly differently in the two versions. We
  avoid touching head and tail for as long as possible.
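To make the second difference concrete, here is a rough userspace
sketch (C11 atomics instead of kernel barrier primitives; the names
ring_produce/ring_consume and the field layout are made up for
illustration, this is not the af_xdp code) of a multi-word descriptor
whose idx field doubles as the valid flag, in the spirit of the
ptr_ring variant described above. Because the consumer reads payload
fields besides idx, there is no address dependency to order the loads,
so an explicit release/acquire pair is needed:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SZ 8 /* power of two, so masking is cheap */

/* A multi-word descriptor: idx == 0 means "slot empty". */
struct desc {
	_Atomic uint32_t idx;	/* 0 = empty, otherwise frame index + 1 */
	uint32_t len;
	uint16_t offset;
};

static struct desc ring[RING_SZ];
static uint32_t prod_pos, cons_pos;	/* private to each side */

/* Producer: fill the payload first, then publish idx with release
 * semantics so the consumer can never see idx != 0 with a stale
 * payload. */
static int ring_produce(uint32_t frame, uint32_t len, uint16_t off)
{
	struct desc *d = &ring[prod_pos & (RING_SZ - 1)];

	if (atomic_load_explicit(&d->idx, memory_order_acquire) != 0)
		return -1;	/* slot still owned by consumer: ring full */
	d->len = len;
	d->offset = off;
	atomic_store_explicit(&d->idx, frame + 1, memory_order_release);
	prod_pos++;
	return 0;
}

/* Consumer: acquire-load idx, read the payload, then hand the slot
 * back to the producer by writing 0. */
static int ring_consume(uint32_t *frame, uint32_t *len)
{
	struct desc *d = &ring[cons_pos & (RING_SZ - 1)];
	uint32_t idx = atomic_load_explicit(&d->idx, memory_order_acquire);

	if (idx == 0)
		return -1;	/* empty */
	*frame = idx - 1;
	*len = d->len;
	atomic_store_explicit(&d->idx, 0, memory_order_release);
	cons_pos++;
	return 0;
}
```

With a single-word descriptor the release/acquire pair on idx could in
principle be relaxed to a dependency-ordered load, which is exactly why
the multi-word RX/TX descriptors are the costly case.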
RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, April 12, 2018 4:05 PM
> To: Karlsson, Magnus <magnus.karls...@intel.com>
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
>
> On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> > I think you are definitely right in that there are ways in which we
> > can improve performance here. That said, the current queue performs
> > slightly better than the previous one we had that was more or less a
> > copy of one of your first virtio 1.1 proposals from little over a
> > year ago. It had bidirectional queues and a valid flag in the
> > descriptor itself. The reason we abandoned this was not poor
> > performance (it was good), but a need to go to unidirectional
> > queues. Maybe I should have only changed that aspect and kept the
> > valid flag.
>
> Is there a summary about unidirectional queues anywhere? I'm curious to
> know whether there are any lessons here to be learned for virtio or
> ptr_ring.

I did a quick hack in which I used your ptr_ring for the fill queue
instead of our head/tail based one. In the corner cases (usually empty
or usually full), there is basically no difference. But for the case
when the queue is always half full, the ptr_ring implementation boosts
the performance from 5.6 to 5.7 Mpps (as there is no cache line
bouncing in this case) on my system (slower than Björn's that was used
for the numbers in the RFC).
So I think this should be implemented properly so we can get some real
numbers. Especially since 0.1 Mpps with copies will likely become much
more with zero-copy, as we are really chasing cycles there. We will get
back a better evaluation in a few days.

Thanks: Magnus

> --
> MST
Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.

Is there a summary about unidirectional queues anywhere? I'm curious
to know whether there are any lessons here to be learned for virtio or
ptr_ring.

-- 
MST
Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
On Thu, 12 Apr 2018 07:38:25 + "Karlsson, Magnus"
<magnus.karls...@intel.com> wrote:

> > -----Original Message-----
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Thursday, April 12, 2018 4:16 AM
> > To: Björn Töpel <bjorn.to...@gmail.com>
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support
> > and mmap
> >
> > On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> > > 	__u32 frame_headroom; /* Frame head room */
> > > };
> > >
> > > +/* Pgoff for mmaping the rings */
> > > +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x1
> > > +
> > > +struct xdp_queue {
> > > +	__u32 head_idx __attribute__((aligned(64)));
> > > +	__u32 tail_idx __attribute__((aligned(64)));
> > > +};
> > > +
> > > +/* Used for the fill and completion queues for buffers */
> > > +struct xdp_umem_queue {
> > > +	struct xdp_queue ptrs;
> > > +	__u32 desc[0] __attribute__((aligned(64)));
> > > +};
> > > +
> > > #endif /* _LINUX_IF_XDP_H */
> >
> > So IIUC it's a head/tail ring of 32 bit descriptors.
> >
> > In my experience (from implementing ptr_ring) this implies that
> > head/tail cache lines bounce a lot between CPUs. Caching will help
> > some. You are also forced to use barriers to check validity which is
> > slow on some architectures.
> >
> > If instead you can use a special descriptor value (e.g.
> > 0) as a valid signal,
> > things work much better:
> >
> > - you read descriptor atomically, if it's not 0 it's fine
> > - same with write - write 0 to pass it to the other side
> > - there is a data dependency so no need for barriers (except on dec alpha)
> > - no need for power of 2 limitations, you can make it any size you like
> > - easy to resize too
> >
> > architecture (if not implementation) would be shared with ptr_ring
> > so some of the optimization ideas like batched updates could be
> > lifted from there.
> >
> > When I was building ptr_ring, any head/tail design underperformed
> > storing valid flag with data itself. YMMV.

I fully agree with MST here. This is also my experience. I even dropped
my own Array-based Lock-Free (ALF) queue implementation[1] in favor of
ptr_ring. (There I try to amortize this cost by bulking, but this
causes the queue to become non-wait-free.)

[1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/include/linux/alf_queue.h

> I think you are definitely right in that there are ways in which
> we can improve performance here. That said, the current queue
> performs slightly better than the previous one we had that was
> more or less a copy of one of your first virtio 1.1 proposals
> from little over a year ago. It had bidirectional queues and a
> valid flag in the descriptor itself. The reason we abandoned this
> was not poor performance (it was good), but a need to go to
> unidirectional queues. Maybe I should have only changed that
> aspect and kept the valid flag.
>
> Anyway, I will take a look at ptr_ring and run some experiments
> along the lines of what you propose to get some numbers.
> Considering your experience with these kinds of structures, you
> are likely right. I just need to convince myself :-).

When benchmarking, be careful that you don't measure the "wrong" queue
situation.
When doing this kind of "overload" benchmarking, you will likely create
a situation where the queue is always full (which hopefully isn't a
production use-case). In the almost/always-full queue situation, using
the element values to sync on (like MST proposes) will still cause the
cache-line bouncing (that we want to avoid). MST explains and has
addressed this situation for ptr_ring in:

 commit fb9de9704775 ("ptr_ring: batch ring zeroing")
 https://git.kernel.org/torvalds/c/fb9de9704775

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
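The batched-zeroing idea from that commit can be illustrated with a toy
single-consumer sketch (names and sizes here are illustrative, not the
kernel's ptr_ring): instead of writing NULL back one entry at a time,
which makes producer and consumer ping-pong on the same cache line when
the ring runs full, the consumer defers the invalidating writes and
performs them a whole batch at a time, so the two sides mostly touch
different cache lines.

```c
#include <stddef.h>

#define RING_SZ    256
#define ZERO_BATCH 16	/* tune to cover one or more cache lines */

static void *ring[RING_SZ];	/* NULL = empty slot */
static int consumer_head;	/* next entry to consume */
static int consumer_tail;	/* first consumed but not-yet-zeroed entry */

static void *ring_consume_batched(void)
{
	void *entry = ring[consumer_head];

	if (!entry)
		return NULL;	/* ring empty */
	consumer_head = (consumer_head + 1) % RING_SZ;

	/* Defer the invalidating writes until a whole batch has been
	 * consumed, then zero the batch in one pass. The trade-off is
	 * that the producer sees slots become free in bursts rather
	 * than one at a time. */
	if ((consumer_head - consumer_tail + RING_SZ) % RING_SZ >= ZERO_BATCH) {
		while (consumer_tail != consumer_head) {
			ring[consumer_tail] = NULL;
			consumer_tail = (consumer_tail + 1) % RING_SZ;
		}
	}
	return entry;
}
```

This is the same effect Jesper describes: in the always-full case the
producer's polling loads and the consumer's write-backs no longer hit
the same cache line on every operation.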
RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
> -----Original Message-----
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, April 12, 2018 4:16 AM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
>
> On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> > 	__u32 frame_headroom; /* Frame head room */
> > };
> >
> > +/* Pgoff for mmaping the rings */
> > +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x1
> > +
> > +struct xdp_queue {
> > +	__u32 head_idx __attribute__((aligned(64)));
> > +	__u32 tail_idx __attribute__((aligned(64)));
> > +};
> > +
> > +/* Used for the fill and completion queues for buffers */
> > +struct xdp_umem_queue {
> > +	struct xdp_queue ptrs;
> > +	__u32 desc[0] __attribute__((aligned(64)));
> > +};
> > +
> > #endif /* _LINUX_IF_XDP_H */
>
> So IIUC it's a head/tail ring of 32 bit descriptors.
>
> In my experience (from implementing ptr_ring) this implies that
> head/tail cache lines bounce a lot between CPUs. Caching will help
> some. You are also forced to use barriers to check validity which is
> slow on some architectures.
>
> If instead you can use a special descriptor value (e.g.
> 0) as a valid signal,
> things work much better:
>
> - you read descriptor atomically, if it's not 0 it's fine
> - same with write - write 0 to pass it to the other side
> - there is a data dependency so no need for barriers (except on dec alpha)
> - no need for power of 2 limitations, you can make it any size you like
> - easy to resize too
>
> architecture (if not implementation) would be shared with ptr_ring so
> some of the optimization ideas like batched updates could be lifted
> from there.
>
> When I was building ptr_ring, any head/tail design underperformed
> storing valid flag with data itself. YMMV.
>
> --
> MST

I think you are definitely right in that there are ways in which we can
improve performance here. That said, the current queue performs
slightly better than the previous one we had, which was more or less a
copy of one of your first virtio 1.1 proposals from a little over a
year ago. It had bidirectional queues and a valid flag in the
descriptor itself. The reason we abandoned it was not poor performance
(it was good), but a need to go to unidirectional queues. Maybe I
should have only changed that aspect and kept the valid flag.

Anyway, I will take a look at ptr_ring and run some experiments along
the lines of what you propose to get some numbers. Considering your
experience with these kinds of structures, you are likely right. I just
need to convince myself :-).

/Magnus
Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap
On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> 	__u32 frame_headroom; /* Frame head room */
> };
>
> +/* Pgoff for mmaping the rings */
> +#define XDP_UMEM_PGOFF_FILL_QUEUE	0x1
> +
> +struct xdp_queue {
> +	__u32 head_idx __attribute__((aligned(64)));
> +	__u32 tail_idx __attribute__((aligned(64)));
> +};
> +
> +/* Used for the fill and completion queues for buffers */
> +struct xdp_umem_queue {
> +	struct xdp_queue ptrs;
> +	__u32 desc[0] __attribute__((aligned(64)));
> +};
> +
> #endif /* _LINUX_IF_XDP_H */

So IIUC it's a head/tail ring of 32 bit descriptors.

In my experience (from implementing ptr_ring) this implies that
head/tail cache lines bounce a lot between CPUs. Caching will help
some. You are also forced to use barriers to check validity which is
slow on some architectures.

If instead you can use a special descriptor value (e.g. 0) as a valid
signal, things work much better:

- you read descriptor atomically, if it's not 0 it's fine
- same with write - write 0 to pass it to the other side
- there is a data dependency so no need for barriers (except on dec alpha)
- no need for power of 2 limitations, you can make it any size you like
- easy to resize too

architecture (if not implementation) would be shared with ptr_ring so
some of the optimization ideas like batched updates could be lifted
from there.

When I was building ptr_ring, any head/tail design underperformed
storing valid flag with data itself. YMMV.

-- 
MST
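For illustration, the scheme in the bullet list above might be sketched
in userspace C11 like this (a single-producer/single-consumer toy, not
the kernel's ptr_ring implementation; ring_push/ring_pop are made-up
names): the slot value itself is the valid flag, each side keeps a
purely private cursor so no head/tail cache line is shared, and the
ring size need not be a power of two.

```c
#include <stdatomic.h>
#include <stddef.h>

#define RING_SZ 5	/* any size works; no power-of-2 requirement */

static void *_Atomic ring[RING_SZ];	/* NULL = slot empty */
static size_t prod, cons;		/* private cursors, never shared */

static int ring_push(void *p)	/* p must not be NULL */
{
	/* If the slot is still non-NULL, the consumer owns it: full. */
	if (atomic_load_explicit(&ring[prod], memory_order_acquire))
		return -1;
	atomic_store_explicit(&ring[prod], p, memory_order_release);
	if (++prod == RING_SZ)
		prod = 0;
	return 0;
}

static void *ring_pop(void)
{
	/* Reading the pointer is the validity check; dereferencing it
	 * later is ordered by the data dependency on most
	 * architectures (consume ordering). */
	void *p = atomic_load_explicit(&ring[cons], memory_order_consume);

	if (!p)
		return NULL;	/* empty */
	/* Pass the slot back to the producer by writing NULL. */
	atomic_store_explicit(&ring[cons], NULL, memory_order_release);
	if (++cons == RING_SZ)
		cons = 0;
	return p;
}
```

Note how this matches the bullets: the read and write of each slot are
single atomic operations, NULL/non-NULL replaces any head/tail
comparison, and resizing only means reallocating the array, since no
index arithmetic depends on the size being a power of two.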