RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-23 Thread Karlsson, Magnus


> -Original Message-
> From: Karlsson, Magnus
> Sent: Thursday, April 12, 2018 5:20 PM
> To: Michael S. Tsirkin <m...@redhat.com>
> Cc: Björn Töpel <bjorn.to...@gmail.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> ravineet.si...@ericsson.com
> Subject: RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> 
> 
> > -Original Message-
> > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > Sent: Thursday, April 12, 2018 4:05 PM
> > To: Karlsson, Magnus <magnus.karls...@intel.com>
> > Cc: Björn Töpel <bjorn.to...@gmail.com>; Duyck, Alexander H
> > <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> > john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> > willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> > netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> > Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> > <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> > ravineet.si...@ericsson.com
> > Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> > mmap
> >
> > On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> > > I think you are definitely right in that there are ways in which we
> > > can improve performance here. That said, the current queue performs
> > > slightly better than the previous one we had that was more or less a
> > > copy of one of your first virtio 1.1 proposals from a little over a
> > > year ago. It had bidirectional queues and a valid flag in the
> > > descriptor itself. The reason we abandoned this was not poor
> > > performance (it was good), but a need to go to unidirectional
> > > queues. Maybe I should have only changed that aspect and kept the
> > > valid flag.
> >
> > Is there a summary about unidirectional queues anywhere?  I'm curious
> > to know whether there are any lessons here to be learned for virtio or
> > ptr_ring.
> 
> I did a quick hack in which I used your ptr_ring for the fill queue instead of
> our head/tail based one. In the corner cases (usually empty or usually full),
> there is basically no difference. But for the case when the queue is always
> half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
> Mpps (as there is no cache line bouncing in this case) on my system (slower
> than Björn's that was used for the numbers in the RFC).
> 
> So I think this should be implemented properly so we can get some real
> numbers.
> Especially since 0.1 Mpps with copies will likely become much more with
> zero-copy as we are really chasing cycles there. We will get back a better
> evaluation in a few days.
> 
> Thanks: Magnus
> 
> > --
> > MST

Hi Michael,

Sorry for the late reply. Been travelling. Björn and I have now
made a real implementation of the ptr_ring principles in the
af_xdp code. We just added a switch in bind (only for the purpose
of this test) to be able to pick what ring implementation to use
from the user space test program. The main difference between our
version of ptr_ring and our head/tail ring is that the ptr_ring
version uses the idx field to signal if the entry is available or
not (idx == 0 indicating empty descriptor) and that it does not
use the head and tail pointers at all. Even though it is not
a "ring of pointers" in our implementation, we will still call it
ptr_ring for the purpose of this mail.
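
For readers following along, the difference between the two flavors can be
sketched in a few lines of plain C. This is an illustrative
single-producer/single-consumer toy (names and layout are invented for this
mail, not the actual af_xdp or ptr_ring code; barriers and descriptor
validation are omitted):

```c
#include <stdint.h>

#define RING_SIZE 8 /* power of two, as the head/tail variant requires */

/* Head/tail flavor: emptiness/fullness is derived from two shared
 * indices, so producer and consumer bounce each other's cache lines. */
struct ht_ring {
	uint32_t head, tail;
	uint32_t desc[RING_SIZE];
};

static int ht_enqueue(struct ht_ring *r, uint32_t idx)
{
	if (r->head - r->tail == RING_SIZE)
		return -1;			/* full */
	r->desc[r->head & (RING_SIZE - 1)] = idx;
	r->head++;				/* producer writes head... */
	return 0;
}

static int ht_dequeue(struct ht_ring *r, uint32_t *idx)
{
	if (r->head == r->tail)
		return -1;			/* empty */
	*idx = r->desc[r->tail & (RING_SIZE - 1)];
	r->tail++;				/* ...consumer writes tail */
	return 0;
}

/* ptr_ring flavor: idx == 0 marks an empty slot (so id 0 is reserved),
 * and no index is shared; each side keeps a private cursor. */
struct pr_ring {
	uint32_t prod, cons;		/* never read by the other side */
	uint32_t desc[RING_SIZE];
};

static int pr_enqueue(struct pr_ring *r, uint32_t idx)
{
	if (r->desc[r->prod] != 0)
		return -1;		/* slot still owned by consumer */
	r->desc[r->prod] = idx;
	r->prod = (r->prod + 1) % RING_SIZE;
	return 0;
}

static int pr_dequeue(struct pr_ring *r, uint32_t *idx)
{
	if (r->desc[r->cons] == 0)
		return -1;		/* empty */
	*idx = r->desc[r->cons];
	r->desc[r->cons] = 0;		/* hand the slot back to the producer */
	r->cons = (r->cons + 1) % RING_SIZE;
	return 0;
}
```

In the pr_ring flavor the only state shared per operation is the slot itself,
which is why it avoids the head/tail cache line bouncing in the balanced,
half-full case.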

In summary, our experiments show that the two rings perform the
same in our micro benchmarks when the queues are balanced and
rarely full or empty, but the head/tail version performs better
for RX when the queues are not perfectly balanced. Why is that?
We do not exactly know, but there are a number of differences
between a ptr_ring in the kernel and one between user and kernel
space for the use in af_xdp.

* The user space descriptors have to be validated as we are
  communicating between user space and kernel space. This is done
  slightly differently for the two rings due to the batching below.

* The RX and TX rings have descriptors that are larger than one
  pointer, so we need barriers here even with ptr_ring. We cannot
  rely on an address dependency because the entry is not a pointer.

* Batching is performed slightly differently in the two versions. We
  avoid touching head and tail for as long as
RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-12 Thread Karlsson, Magnus


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, April 12, 2018 4:05 PM
> To: Karlsson, Magnus <magnus.karls...@intel.com>
> Cc: Björn Töpel <bjorn.to...@gmail.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> ravineet.si...@ericsson.com
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> On Thu, Apr 12, 2018 at 07:38:25AM +, Karlsson, Magnus wrote:
> > I think you are definitely right in that there are ways in which we
> > can improve performance here. That said, the current queue performs
> > slightly better than the previous one we had that was more or less a
> > copy of one of your first virtio 1.1 proposals from a little over a year
> > ago. It had bidirectional queues and a valid flag in the descriptor
> > itself. The reason we abandoned this was not poor performance (it was
> > good), but a need to go to unidirectional queues. Maybe I should have
> > only changed that aspect and kept the valid flag.
> 
> Is there a summary about unidirectional queues anywhere?  I'm curious to
> know whether there are any lessons here to be learned for virtio or ptr_ring.

I did a quick hack in which I used your ptr_ring for the fill queue instead of
our head/tail based one. In the corner cases (usually empty or usually full),
there is basically no difference. But for the case when the queue is always
half full, the ptr_ring implementation boosts the performance from 5.6 to 5.7
Mpps (as there is no cache line bouncing in this case) on my system (slower
than Björn's that was used for the numbers in the RFC).

So I think this should be implemented properly so we can get some real
numbers. Especially since 0.1 Mpps with copies will likely become much more
with zero-copy as we are really chasing cycles there. We will get back a
better evaluation in a few days.

Thanks: Magnus

> --
> MST


RE: [RFC PATCH v2 14/14] samples/bpf: sample application for AF_XDP sockets

2018-04-12 Thread Karlsson, Magnus


> -Original Message-
> From: Jesper Dangaard Brouer [mailto:bro...@redhat.com]
> Sent: Thursday, April 12, 2018 1:05 PM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> ravineet.si...@ericsson.com; Topel, Bjorn <bjorn.to...@intel.com>;
> bro...@redhat.com
> Subject: Re: [RFC PATCH v2 14/14] samples/bpf: sample application for
> AF_XDP sockets
> 
> On Tue, 27 Mar 2018 18:59:19 +0200
> Björn Töpel <bjorn.to...@gmail.com> wrote:
> 
> > +static void dump_stats(void)
> > +{
> > +   unsigned long stop_time = get_nsecs();
> > +   long dt = stop_time - start_time;
> > +   int i;
> > +
> > +   for (i = 0; i < num_socks; i++) {
> > +           double rx_pps = xsks[i]->rx_npkts * 1000000000. / dt;
> > +           double tx_pps = xsks[i]->tx_npkts * 1000000000. / dt;
> > +           char *fmt = "%-15s %'-11.0f %'-11lu\n";
> > +
> > +           printf("\n sock%d@", i);
> > +           print_benchmark(false);
> > +           printf("\n");
> > +
> > +           printf("%-15s %-11s %-11s %-11.2f\n", "", "pps", "pkts",
> > +                  dt / 1000000000.);
> > +           printf(fmt, "rx", rx_pps, xsks[i]->rx_npkts);
> > +           printf(fmt, "tx", tx_pps, xsks[i]->tx_npkts);
> > +   }
> > +}
> > +
> > +static void *poller(void *arg)
> > +{
> > +   (void)arg;
> > +   for (;;) {
> > +   sleep(1);
> > +   dump_stats();
> > +   }
> > +
> > +   return NULL;
> > +}
> 
> You are printing the "pps" (packets per sec) as an average over the entire
> test run... could you please change that to, at least also, show a more
> up-to-date value, like the rate since the last measurement?
> 
> The problem is that when you start the test, the first reading will be too
> low, and it takes time to average out/up. For ixgbe, the first reading will
> be zero, because it does a link-down+up, which stops my pktgen.
> 
> The second annoyance is that I like to change system/kernel settings during
> the run, and observe the effect. E.g. change CPU sleep states (via
> tuned-adm) during the test-run to see the effect, which I cannot with this
> long average.

Good points. Will fix.
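
A hypothetical sketch of what such a fix could look like (invented helper
name; the actual patch may differ). The idea is to keep the previous sample
so each dump can report the rate over the last interval in addition to the
running average:

```c
/* Compute packets per second over the interval since the previous
 * sample, then remember this sample for the next call. Timestamps are
 * in nanoseconds, matching get_nsecs() in the sample application. */
static double interval_pps(unsigned long now_ns, unsigned long npkts,
			   unsigned long *prev_ns, unsigned long *prev_npkts)
{
	double pps = (npkts - *prev_npkts) * 1000000000. /
		     (now_ns - *prev_ns);

	*prev_ns = now_ns;
	*prev_npkts = npkts;
	return pps;
}
```

Called once per poller wakeup, this also makes the effect of mid-run
system/kernel changes visible within one reporting interval.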

/Magnus

> 
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


RE: [RFC PATCH v2 03/14] xsk: add umem fill queue support and mmap

2018-04-12 Thread Karlsson, Magnus


> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Thursday, April 12, 2018 4:16 AM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; alexander.du...@gmail.com;
> john.fastab...@gmail.com; a...@fb.com; bro...@redhat.com;
> willemdebruijn.ker...@gmail.com; dan...@iogearbox.net;
> netdev@vger.kernel.org; michael.lundkv...@ericsson.com; Brandeburg,
> Jesse <jesse.brandeb...@intel.com>; Singhai, Anjali
> <anjali.sing...@intel.com>; Zhang, Qi Z <qi.z.zh...@intel.com>;
> ravineet.si...@ericsson.com
> Subject: Re: [RFC PATCH v2 03/14] xsk: add umem fill queue support and
> mmap
> 
> On Tue, Mar 27, 2018 at 06:59:08PM +0200, Björn Töpel wrote:
> > @@ -30,4 +31,18 @@ struct xdp_umem_reg {
> >     __u32 frame_headroom; /* Frame head room */
> >  };
> >
> > +/* Pgoff for mmaping the rings */
> > +#define XDP_UMEM_PGOFF_FILL_QUEUE  0x1
> > +
> > +struct xdp_queue {
> > +   __u32 head_idx __attribute__((aligned(64)));
> > +   __u32 tail_idx __attribute__((aligned(64)));
> > +};
> > +
> > +/* Used for the fill and completion queues for buffers */
> > +struct xdp_umem_queue {
> > +   struct xdp_queue ptrs;
> > +   __u32 desc[0] __attribute__((aligned(64)));
> > +};
> > +
> >  #endif /* _LINUX_IF_XDP_H */
> 
> So IIUC it's a head/tail ring of 32 bit descriptors.
> 
> In my experience (from implementing ptr_ring) this implies that head/tail
> cache lines bounce a lot between CPUs. Caching will help some. You are also
> forced to use barriers to check validity which is slow on some architectures.
> 
> If instead you can use a special descriptor value (e.g. 0) as a valid signal,
> things work much better:
> 
> - you read descriptor atomically, if it's not 0 it's fine
> - same with write - write 0 to pass it to the other side
> - there is a data dependency so no need for barriers (except on dec alpha)
> - no need for power of 2 limitations, you can make it any size you like
> - easy to resize too
> 
> The architecture (if not the implementation) would be shared with ptr_ring,
> so some of the optimization ideas like batched updates could be lifted from
> there.
> 
> When I was building ptr_ring, any head/tail design underperformed storing
> valid flag with data itself. YMMV.
> 
> --
> MST

I think you are definitely right in that there are ways in which
we can improve performance here. That said, the current queue
performs slightly better than the previous one we had that was
more or less a copy of one of your first virtio 1.1 proposals
from a little over a year ago. It had bidirectional queues and a
valid flag in the descriptor itself. The reason we abandoned this
was not poor performance (it was good), but a need to go to
unidirectional queues. Maybe I should have only changed that
aspect and kept the valid flag.

Anyway, I will take a look at ptr_ring and run some experiments
along the lines of what you propose to get some
numbers. Considering your experience with these kinds of
structures, you are likely right. I just need to convince
myself :-).
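
As a concrete illustration of the barrier question for our case: our
descriptors are larger than one pointer, so the data-dependency argument
above does not carry over directly, and the valid marker needs explicit
ordering. A rough userspace sketch (C11 atomics standing in for the kernel's
barrier macros; all names invented):

```c
#include <stdatomic.h>
#include <stdint.h>

struct rx_desc {
	uint64_t addr;		/* offset into the packet buffer */
	uint32_t len;
	_Atomic uint32_t idx;	/* idx != 0 doubles as the valid flag */
};

/* Producer: fill the payload fields first, then publish the marker with
 * release semantics so the consumer cannot observe a nonzero idx before
 * addr/len are written. */
static void publish(struct rx_desc *d, uint64_t addr, uint32_t len,
		    uint32_t idx)
{
	d->addr = addr;
	d->len = len;
	atomic_store_explicit(&d->idx, idx, memory_order_release);
}

/* Consumer: acquire-load the marker before reading the rest. With a
 * ring of plain pointers a data dependency would order the reads for
 * free (except on DEC Alpha); with a multi-word descriptor it will not. */
static int consume(struct rx_desc *d, uint64_t *addr, uint32_t *len)
{
	uint32_t idx = atomic_load_explicit(&d->idx, memory_order_acquire);

	if (idx == 0)
		return -1;	/* not yet valid */
	*addr = d->addr;
	*len = d->len;
	return (int)idx;
}
```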

/Magnus


RE: [RFC PATCH 00/14] Introducing AF_PACKET V4 support

2017-11-03 Thread Karlsson, Magnus


> -Original Message-
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Friday, November 3, 2017 5:35 AM
> To: Björn Töpel <bjorn.to...@gmail.com>
> Cc: Karlsson, Magnus <magnus.karls...@intel.com>; Duyck, Alexander H
> <alexander.h.du...@intel.com>; Alexander Duyck
> <alexander.du...@gmail.com>; John Fastabend
> <john.fastab...@gmail.com>; Alexei Starovoitov <a...@fb.com>; Jesper
> Dangaard Brouer <bro...@redhat.com>; michael.lundkv...@ericsson.com;
> ravineet.si...@ericsson.com; Daniel Borkmann <dan...@iogearbox.net>;
> Network Development <netdev@vger.kernel.org>; Topel, Bjorn
> <bjorn.to...@intel.com>; Brandeburg, Jesse
> <jesse.brandeb...@intel.com>; Singhai, Anjali <anjali.sing...@intel.com>;
> Rosen, Rami <rami.ro...@intel.com>; Shaw, Jeffrey B
> <jeffrey.b.s...@intel.com>; Yigit, Ferruh <ferruh.yi...@intel.com>; Zhang,
> Qi Z <qi.z.zh...@intel.com>
> Subject: Re: [RFC PATCH 00/14] Introducing AF_PACKET V4 support
> 
> On Tue, Oct 31, 2017 at 9:41 PM, Björn Töpel <bjorn.to...@gmail.com>
> wrote:
> > From: Björn Töpel <bjorn.to...@intel.com>
> >
> > This RFC introduces AF_PACKET_V4 and PACKET_ZEROCOPY that are
> > optimized for high performance packet processing and zero-copy
> > semantics. Throughput improvements can be up to 40x compared to V2
> and
> > V3 for the micro benchmarks included. Would be great to get your
> > feedback on it.
> >
> > The main difference between V4 and V2/V3 is that TX and RX descriptors
> > are separated from packet buffers.
> 
> Cool feature. I'm looking forward to the netdev talk. Aside from the inline
> comments in the patches, a few architecture questions.

Glad to hear. Are you going to Netdev in Seoul? If so, let us hook up
and discuss your comments in further detail. Some initial thoughts
below.

> Is TX support needed? Existing PACKET_TX_RING already sends out packets
> without copying directly from the tx_ring. Indirection through a descriptor
> ring is not helpful on TX if all packets still have to come from a
> pre-registered packet pool. The patch set adds a lot of tx-only code and is
> complex enough without it.

That is correct, but what if the packet you are going to transmit came
in from the receive path and is already in the packet buffer? This
might happen if the application is examining/sniffing packets then
sending them out, or doing some modification to them. In that case we
avoid a copy in V4 since the packet is already in the packet
buffer. With V2 and V3, a copy from the RX ring to the TX ring would
be needed. In the PACKET_ZEROCOPY case, avoiding this copy increases
performance quite a lot.

> Can you use the existing PACKET_V2 format for the packet pool? The
> v4 format is nearly the same as V2. Using the same version might avoid some
> code duplication and simplify upgrading existing legacy code.
> Instead of continuing to add new versions whose behavior is implicit,
> perhaps we can add explicit mode PACKET_INDIRECT to PACKET_V2.

Interesting idea that I think is worth thinking more about. One
problem, though, with the V2 ring format model (and the current V4
format too, by the way) when applied to user-space allocated memory
is that the rings are symmetric, i.e. user space and kernel space have
to produce and consume the same number of entries (within the length
of the descriptor area). For example, user space sends down a buffer
entry that the kernel fills in for RX. Symmetric queues do not work when
you have a shared packet buffer between two processes. (This is not a
requirement, but someone might do a mmap with MAP_SHARED for the
packet buffer and then fork off a child that then inherits this packet
buffer.) One of the processes might just receive packets, while the
other one is transmitting. Or if you have a veth link pair between two
processes and they have been set up to share a packet buffer area. With
a symmetric queue you have to copy even if they share the same packet
buffer, but with an asymmetric queue, you do not and the driver only
needs to copy the packet buffer id between the TX desc ring of the
sender to the RX desc ring of the receiver, not the data. I think this
gives an indication that we need a new structure. Anyway, I like your
idea and I think it is worth thinking more about it. Let us have a
discussion about this at Netdev, if you are there.
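
To illustrate the argument, here is a toy sketch of that forwarding step
(invented names, single-threaded, no synchronization): with a shared packet
buffer, the "driver" only moves buffer ids from the sender's TX descriptor
ring to the receiver's RX descriptor ring, never the payload.

```c
#include <stdint.h>

#define RING_SIZE 8	/* power of two */

struct desc_ring {
	uint32_t prod, cons;
	uint32_t ids[RING_SIZE];	/* buffer ids into the shared pool */
};

static int ring_put(struct desc_ring *r, uint32_t id)
{
	if (r->prod - r->cons == RING_SIZE)
		return -1;		/* full */
	r->ids[r->prod++ & (RING_SIZE - 1)] = id;
	return 0;
}

static int ring_get(struct desc_ring *r, uint32_t *id)
{
	if (r->prod == r->cons)
		return -1;		/* empty */
	*id = r->ids[r->cons++ & (RING_SIZE - 1)];
	return 0;
}

/* "Driver" step: shuttle buffer ids from A's TX ring to B's RX ring.
 * Toy simplification: a full RX ring would drop the frame here.
 * Returns the number of frames forwarded without touching packet data. */
static int forward(struct desc_ring *tx, struct desc_ring *rx)
{
	uint32_t id;
	int n = 0;

	while (ring_get(tx, &id) == 0 && ring_put(rx, id) == 0)
		n++;
	return n;
}
```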

> Finally, is it necessary to define a new descriptor ring format? Same for the
> packet array and frame set. The kernel already has a few, such as virtio for
> the first, skb_array/ptr_ring, even linux list for the second. These
> containers add a lot of new boilerplate code. If new formats are absolutely
> necessary, at least we should consider making them