Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
Den lör 8 dec. 2018 kl 16:12 skrev Jesper Dangaard Brouer :
>
> On Fri, 7 Dec 2018 13:21:08 -0800
> Alexei Starovoitov  wrote:
>
> > for production I suspect the users would want
> > an easy way to stay safe when they're playing with AF_XDP.
> > So another builtin program that redirects ssh and ping traffic
> > back to the kernel would be a nice addition.
>
> Are you saying a builtin program that needs to parse different kinds of
> Eth-type headers (DSA, VLAN, QinQ) and find the TCP port to match port
> 22 to return XDP_PASS, or else call AF_XDP redirect? That seems to be
> pure overhead for this fast-path builtin program for AF_XDP.
>
> Would a solution be to install a NIC hardware filter that redirects SSH
> port 22 to another RX-queue, and then have a builtin program that
> returns XDP_PASS installed on that RX-queue? And change Björn's
> semantics, such that RX-queue programs take precedence over the global
> XDP program. This would also be a good fail-safe in general for XDP.
>

Exactly this; I'd say this is the most common way of using AF_XDP,
i.e. steer a certain flow to an Rx queue, and have *all* packets on
that queue pulled to the AF_XDP socket. This is why the builtin is the
way it is: that is how it is actually being used, and not only in
benchmarking scenarios.
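
For concreteness, such steering is typically done with ethtool's
ntuple filters, something like this (just a sketch; it assumes a NIC
with flow steering support, and the port and queue numbers are
arbitrary):

  # steer the AF_XDP flow (UDP port 4242) to Rx queue 4
  ethtool -N eth0 flow-type udp4 dst-port 4242 action 4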

> If the RX-queues take precedence, I can use this fail-safe approach.
> E.g. when I want to test my new global XDP program, I'll use ethtool to
> match my management IP and send that to a specific RX-queue and my
> fail-safe BPF program.
>

Interesting take, to have a *per-queue* XDP program that overrides the
regular one. Maybe this is a better way to use builtins?

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
Den lör 8 dec. 2018 kl 15:52 skrev Jesper Dangaard Brouer :
>
> On Fri, 7 Dec 2018 15:01:55 +0100
> Björn Töpel  wrote:
>
> > Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer :
> > >
> > > On Fri,  7 Dec 2018 12:44:24 +0100
> > > Björn Töpel  wrote:
> > >
> > > > The rationale behind attach is performance and ease of use. Many XDP
> > > > socket users just need a simple way of creating/binding a socket and
> > > > receiving frames right away without loading an XDP program.
> > > >
> > > > XDP_ATTACH adds a mechanism we call "builtin XDP program", which is
> > > > simply a kernel-provided XDP program that is installed to the netdev
> > > > when XDP_ATTACH is passed as a bind() flag.
> > > >
> > > > The builtin program is the simplest program possible to redirect a
> > > > frame to an attached socket. In restricted C it would look like this:
> > > >
> > > >   SEC("xdp")
> > > >   int xdp_prog(struct xdp_md *ctx)
> > > >   {
> > > > return bpf_xsk_redirect(ctx);
> > > >   }
> > > >
> > > > The builtin program loaded via XDP_ATTACH behaves, from an
> > > > install-to-netdev/uninstall-from-netdev point of view, differently
> > > > from regular XDP programs. The easiest way to look at it is as a
> > > > 2-level hierarchy, where regular XDP programs have precedence over
> > > > the builtin one.
> > > >
> > > > If no regular XDP program is installed to the netdev, the builtin
> > > > one will be installed. If the builtin program is installed, and a
> > > > regular one is then installed, the regular XDP program will take
> > > > precedence over the builtin one.
> > > >
> > > > Further, if a regular program is installed, and later removed, the
> > > > builtin one will automatically be installed.
> > > >
> > > > The sxdp_flags field of struct sockaddr_xdp gets two new options,
> > > > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which map to the
> > > > corresponding XDP netlink install flags.
> > > >
> > > > The builtin XDP program functionally adds even more complexity to
> > > > the already hard-to-read dev_change_xdp_fd. Maybe it would be
> > > > simpler to store the program in the struct net_device together with
> > > > the install flags, instead of calling ndo_bpf multiple times?
> > >
> > > (As far as I can see from reading the code, correct me if I'm wrong.)
> > >
> > > If an AF_XDP program uses XDP_ATTACH, then it installs the
> > > builtin program as the XDP program on the "entire" device.  That
> > > means all RX-queues will call this XDP/BPF program (indirect call),
> > > while it is actually only relevant for the specific queue_index.
> > > Yes, the helper call does check the 'xdp->rxq->queue_index' for an
> > > attached 'xsk', and returns XDP_PASS if it is NULL:
> > >
> >
> > Yes, you are correct. The builtin XDP program, just like a regular XDP
> > program, affects the whole netdev. So, yes, the non-AF_XDP queues would
> > get a performance hit from this. Just to reiterate -- this isn't new
> > for this series; this has always been the case for XDP when acting on
> > just one queue.
> >
> > > +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> > > +{
> > > +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> > > +   struct xdp_sock *xsk;
> > > +
> > > +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> > > +   if (xsk) {
> > > +   ri->xsk = xsk;
> > > +   return XDP_REDIRECT;
> > > +   }
> > > +
> > > +   return XDP_PASS;
> > > +}
> > >
> > > Why does every normal XDP_PASS packet have to pay this overhead
> > > (indirect call), when someone loads an AF_XDP socket program?  The
> > > AF_XDP socket is tied hard to, and only relevant for, a specific
> > > RX-queue (which is why we get a performance boost due to the SPSC
> > > queues).
> > >
> > > I acknowledge there is a need for this, but this use-case shows there
> > > is a need for attaching XDP programs on a per-RX-queue basis.
> > >
> >
> > From my AF_XDP perspective, having a program per queue would make
> > sense. The discussion of a per-queue program has been up before, and I
> > think the conclusion was that it would be too complex from a
> > configuration/tooling point-of-view. Again, for AF_XDP this would be
> > great.

Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-08 Thread Björn Töpel
Den fre 7 dec. 2018 kl 22:21 skrev Alexei Starovoitov :
>
> On Fri, Dec 07, 2018 at 12:44:24PM +0100, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > Hi!
> >
> > This patch set adds support for a new XDP socket bind option,
> > XDP_ATTACH.
> >
> > The rationale behind attach is performance and ease of use. Many XDP
> > socket users just need a simple way of creating/binding a socket and
> > receiving frames right away without loading an XDP program.
> >
> > XDP_ATTACH adds a mechanism we call "builtin XDP program", which is
> > simply a kernel-provided XDP program that is installed to the netdev
> > when XDP_ATTACH is passed as a bind() flag.
> >
> > The builtin program is the simplest program possible to redirect a
> > frame to an attached socket. In restricted C it would look like this:
> >
> >   SEC("xdp")
> >   int xdp_prog(struct xdp_md *ctx)
> >   {
> > return bpf_xsk_redirect(ctx);
> >   }
> >
> > The builtin program loaded via XDP_ATTACH behaves, from an
> > install-to-netdev/uninstall-from-netdev point of view, differently
> > from regular XDP programs. The easiest way to look at it is as a
> > 2-level hierarchy, where regular XDP programs have precedence over the
> > builtin one.
>
> The feature makes sense to me.
> May be XDP_ATTACH_BUILTIN would be a better name ?

Yes, agree, or maybe XDP_BUILTIN_ATTACH? Regardless, I'll change the
name for the next revision.

> Also I think it needs another parameter to say which builtin
> program to use.

Yup, I had a plan to add the parameter when there's actually more than
*one* builtin, but you're right, let's do it right away. I'll add a
builtin prog enum field to struct sockaddr_xdp.
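
Something along these lines (just a sketch of the uapi addition; the
names are made up at this point):

  /* hypothetical, for the next revision */
  enum xdp_builtin_prog_id {
          XDP_BUILTIN_XSK_REDIRECT = 0,   /* the program above */
          XDP_BUILTIN_XSK_REDIRECT_SAFE,  /* e.g. pass ssh/ping to the stack */
  };

  struct sockaddr_xdp {
          __u16 sxdp_family;
          __u16 sxdp_flags;
          __u32 sxdp_ifindex;
          __u32 sxdp_queue_id;
          __u32 sxdp_shared_umem_fd;
          __u32 sxdp_builtin_prog;        /* new: selects the builtin */
  };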

> This unconditional xsk_redirect is fine for performance
> benchmarking, but for production I suspect the users would want
> an easy way to stay safe when they're playing with AF_XDP.

For setups that don't direct the flows explicitly via HW filters, yes!

> So another builtin program that redirects ssh and ping traffic
> back to the kernel would be a nice addition.
>

I suspect AF_XDP users would prefer redirecting packets to the kernel
via the CPUMAP instead of XDP_PASS -- not paying for the ipstack on
the AF_XDP core. Another builtin would be a tcpdump-like behavior, but
that would require an XDP clone (which Magnus is actually
experimenting with!).
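
In restricted C, such a "safe" builtin could look something like this
(a sketch only: bpf_xsk_redirect is the helper from this series, while
the cpu_map setup and the CPU selection are assumptions):

  struct bpf_map_def SEC("maps") cpu_map = {
          .type = BPF_MAP_TYPE_CPUMAP,
          .key_size = sizeof(__u32),
          .value_size = sizeof(__u32), /* queue size */
          .max_entries = 1,
  };

  SEC("xdp")
  int xdp_prog(struct xdp_md *ctx)
  {
          int act = bpf_xsk_redirect(ctx);

          if (act == XDP_REDIRECT)
                  return act;
          /* No socket attached: hand the frame to the stack on
           * another core, instead of paying for it on this one.
           */
          return bpf_redirect_map(&cpu_map, 0, 0);
  }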

I'll address your input and get back with a new revision. Thanks for
spending time on the series!


Björn


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-07 Thread Björn Töpel
Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer :
>
> On Fri,  7 Dec 2018 12:44:24 +0100
> Björn Töpel  wrote:
>
> > The rationale behind attach is performance and ease of use. Many XDP
> > socket users just need a simple way of creating/binding a socket and
> > receiving frames right away without loading an XDP program.
> >
> > XDP_ATTACH adds a mechanism we call "builtin XDP program", which is
> > simply a kernel-provided XDP program that is installed to the netdev
> > when XDP_ATTACH is passed as a bind() flag.
> >
> > The builtin program is the simplest program possible to redirect a
> > frame to an attached socket. In restricted C it would look like this:
> >
> >   SEC("xdp")
> >   int xdp_prog(struct xdp_md *ctx)
> >   {
> > return bpf_xsk_redirect(ctx);
> >   }
> >
> > The builtin program loaded via XDP_ATTACH behaves, from an
> > install-to-netdev/uninstall-from-netdev point of view, differently
> > from regular XDP programs. The easiest way to look at it is as a
> > 2-level hierarchy, where regular XDP programs have precedence over
> > the builtin one.
> >
> > If no regular XDP program is installed to the netdev, the builtin one
> > will be installed. If the builtin program is installed, and a regular
> > one is then installed, the regular XDP program will take precedence
> > over the builtin one.
> >
> > Further, if a regular program is installed, and later removed, the
> > builtin one will automatically be installed.
> >
> > The sxdp_flags field of struct sockaddr_xdp gets two new options,
> > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which map to the
> > corresponding XDP netlink install flags.
> >
> > The builtin XDP program functionally adds even more complexity to the
> > already hard-to-read dev_change_xdp_fd. Maybe it would be simpler to
> > store the program in the struct net_device together with the install
> > flags, instead of calling ndo_bpf multiple times?
>
> (As far as I can see from reading the code, correct me if I'm wrong.)
>
> If an AF_XDP program uses XDP_ATTACH, then it installs the
> builtin program as the XDP program on the "entire" device.  That
> means all RX-queues will call this XDP/BPF program (indirect call),
> while it is actually only relevant for the specific queue_index.
> Yes, the helper call does check the 'xdp->rxq->queue_index' for an
> attached 'xsk', and returns XDP_PASS if it is NULL:
>

A side-note: What one can do, and what I did for the Plumbers work, is
bypass the indirect call in bpf_prog_run_xdp by doing an "if the XDP
program is a builtin, call internal_bpf_xsk_redirect directly". Then
the XDP_PASS path won't be taxed by the indirect call -- but only for
this special XDP_ATTACH program. And you'll still get an additional
if-statement in the skb path.
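
Roughly like this (a sketch; the xsk_builtin flag and the
internal_bpf_xsk_redirect call are assumed names from my experiment):

  static __always_inline u32 bpf_prog_run_xdp(const struct bpf_prog *prog,
                                              struct xdp_buff *xdp)
  {
          if (prog->aux->xsk_builtin) /* hypothetical flag */
                  return internal_bpf_xsk_redirect(xdp);
          return BPF_PROG_RUN(prog, xdp);
  }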


Björn

> +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> +{
> +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +   struct xdp_sock *xsk;
> +
> +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> +   if (xsk) {
> +   ri->xsk = xsk;
> +   return XDP_REDIRECT;
> +   }
> +
> +   return XDP_PASS;
> +}
>
> Why does every normal XDP_PASS packet have to pay this overhead
> (indirect call), when someone loads an AF_XDP socket program?  The
> AF_XDP socket is tied hard to, and only relevant for, a specific
> RX-queue (which is why we get a performance boost due to the SPSC
> queues).
>
> I acknowledge there is a need for this, but this use-case shows there
> is a need for attaching XDP programs on a per-RX-queue basis.
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-07 Thread Björn Töpel
Den fre 7 dec. 2018 kl 14:42 skrev Jesper Dangaard Brouer :
>
> On Fri,  7 Dec 2018 12:44:24 +0100
> Björn Töpel  wrote:
>
> > The rationale behind attach is performance and ease of use. Many XDP
> > socket users just need a simple way of creating/binding a socket and
> > receiving frames right away without loading an XDP program.
> >
> > XDP_ATTACH adds a mechanism we call "builtin XDP program", which is
> > simply a kernel-provided XDP program that is installed to the netdev
> > when XDP_ATTACH is passed as a bind() flag.
> >
> > The builtin program is the simplest program possible to redirect a
> > frame to an attached socket. In restricted C it would look like this:
> >
> >   SEC("xdp")
> >   int xdp_prog(struct xdp_md *ctx)
> >   {
> > return bpf_xsk_redirect(ctx);
> >   }
> >
> > The builtin program loaded via XDP_ATTACH behaves, from an
> > install-to-netdev/uninstall-from-netdev point of view, differently
> > from regular XDP programs. The easiest way to look at it is as a
> > 2-level hierarchy, where regular XDP programs have precedence over
> > the builtin one.
> >
> > If no regular XDP program is installed to the netdev, the builtin one
> > will be installed. If the builtin program is installed, and a regular
> > one is then installed, the regular XDP program will take precedence
> > over the builtin one.
> >
> > Further, if a regular program is installed, and later removed, the
> > builtin one will automatically be installed.
> >
> > The sxdp_flags field of struct sockaddr_xdp gets two new options,
> > XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which map to the
> > corresponding XDP netlink install flags.
> >
> > The builtin XDP program functionally adds even more complexity to the
> > already hard-to-read dev_change_xdp_fd. Maybe it would be simpler to
> > store the program in the struct net_device together with the install
> > flags, instead of calling ndo_bpf multiple times?
>
> (As far as I can see from reading the code, correct me if I'm wrong.)
>
> If an AF_XDP program uses XDP_ATTACH, then it installs the
> builtin program as the XDP program on the "entire" device.  That
> means all RX-queues will call this XDP/BPF program (indirect call),
> while it is actually only relevant for the specific queue_index.
> Yes, the helper call does check the 'xdp->rxq->queue_index' for an
> attached 'xsk', and returns XDP_PASS if it is NULL:
>

Yes, you are correct. The builtin XDP program, just like a regular XDP
program, affects the whole netdev. So, yes, the non-AF_XDP queues would
get a performance hit from this. Just to reiterate -- this isn't new
for this series; this has always been the case for XDP when acting on
just one queue.

> +BPF_CALL_1(bpf_xdp_xsk_redirect, struct xdp_buff *, xdp)
> +{
> +   struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
> +   struct xdp_sock *xsk;
> +
> +   xsk = READ_ONCE(xdp->rxq->dev->_rx[xdp->rxq->queue_index].xsk);
> +   if (xsk) {
> +   ri->xsk = xsk;
> +   return XDP_REDIRECT;
> +   }
> +
> +   return XDP_PASS;
> +}
>
> Why does every normal XDP_PASS packet have to pay this overhead
> (indirect call), when someone loads an AF_XDP socket program?  The
> AF_XDP socket is tied hard to, and only relevant for, a specific
> RX-queue (which is why we get a performance boost due to the SPSC
> queues).
>
> I acknowledge there is a need for this, but this use-case shows there
> is a need for attaching XDP programs on a per-RX-queue basis.
>

From my AF_XDP perspective, having a program per queue would make
sense. The discussion of a per-queue program has been up before, and I
think the conclusion was that it would be too complex from a
configuration/tooling point-of-view. Again, for AF_XDP this would be
great.

When we started to hack on AF_PACKET v4, we had some ideas about doing
the "queue slicing" at the netdev level. So, e.g., take a netdev and
create, say, macvlans that take over parts of the parent's queues
(something in line with what John did with NETIF_F_HW_L2FW_DOFFLOAD for
macvlan) and then use the macvlan interface as the dedicated AF_XDP
interface.

Personally, I like the current queue slicing model, and having a way
of loading an XDP program per queue would be nice -- unless the UX for
the poor sysadmin would be terrible. :-)


Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


[PATCH bpf-next 6/7] xsk: load a builtin XDP program on XDP_ATTACH

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

This commit extends the XDP_ATTACH bind option by loading a builtin
XDP program.

The builtin program is the simplest program possible to redirect a
frame to an attached socket. In restricted C it would look like this:

  SEC("xdp")
  int xdp_prog(struct xdp_md *ctx)
  {
return bpf_xsk_redirect(ctx);
  }

For many XDP socket users, this program would be the most common one.

The builtin program loaded via XDP_ATTACH behaves, from an
install-to-netdev/uninstall-from-netdev point of view, differently
from regular XDP programs. The easiest way to look at it is as a
2-level hierarchy, where regular XDP programs have precedence over the
builtin one.

If no regular XDP program is installed to the netdev, the builtin one
will be installed. If the builtin program is installed, and a regular
one is then installed, the regular XDP program will take precedence
over the builtin one.

Further, if a regular program is installed, and later removed, the
builtin one will automatically be installed.

The sxdp_flags field of struct sockaddr_xdp gets two new options,
XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which map to the
corresponding XDP netlink install flags.

Signed-off-by: Björn Töpel 
---
 include/linux/netdevice.h   | 10 +
 include/uapi/linux/if_xdp.h | 10 +++--
 net/core/dev.c  | 84 ---
 net/xdp/xsk.c   | 88 +++--
 4 files changed, 179 insertions(+), 13 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a6cc68d2504c..a3094f1a9fcb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2039,6 +2039,13 @@ struct net_device {
struct lock_class_key   *qdisc_running_key;
boolproto_down;
unsignedwol_enabled:1;
+
+#ifdef CONFIG_XDP_SOCKETS
+   struct bpf_prog *xsk_prog;
+   u32 xsk_prog_flags;
+   boolxsk_prog_running;
+   int xsk_prog_ref;
+#endif
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
@@ -3638,6 +3645,9 @@ struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, int *ret);
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+int dev_xsk_prog_install(struct net_device *dev, struct bpf_prog *prog,
+u32 flags);
+void dev_xsk_prog_uninstall(struct net_device *dev);
 int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
  int fd, u32 flags);
 u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index bd76235c2749..b8fb3200f640 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -13,10 +13,12 @@
 #include 
 
 /* Options for the sxdp_flags field */
-#define XDP_SHARED_UMEM	(1 << 0)
-#define XDP_COPY   (1 << 1) /* Force copy-mode */
-#define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */
-#define XDP_ATTACH (1 << 3)
+#define XDP_SHARED_UMEM	(1 << 0)
+#define XDP_COPY   (1 << 1) /* Force copy-mode */
+#define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */
+#define XDP_ATTACH (1 << 3)
+#define XDP_BUILTIN_SKB_MODE   (1 << 4)
+#define XDP_BUILTIN_DRV_MODE   (1 << 5)
 
 struct sockaddr_xdp {
__u16 sxdp_family;
diff --git a/net/core/dev.c b/net/core/dev.c
index abe50c424b29..0a1c30da2f87 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -7879,6 +7879,70 @@ static void dev_xdp_uninstall(struct net_device *dev)
NULL));
 }
 
+#ifdef CONFIG_XDP_SOCKETS
+int dev_xsk_prog_install(struct net_device *dev, struct bpf_prog *prog,
+u32 flags)
+{
+   ASSERT_RTNL();
+
+   if (dev->xsk_prog) {
+   if (prog != dev->xsk_prog)
+   return -EINVAL;
+   if (flags && flags != dev->xsk_prog_flags)
+   return -EINVAL;
+   }
+
+   if (dev->xsk_prog) {
+   dev->xsk_prog_ref++;
+   return 0;
+   }
+
+   dev->xsk_prog = bpf_prog_inc(prog);
+   dev->xsk_prog_flags = flags | XDP_FLAGS_UPDATE_IF_NOEXIST;
+   dev->xsk_prog_ref = 1;
+   (void)dev_change_xdp_fd(dev, NULL, -1, dev->xsk_prog_flags);
+   return 0;
+}
+
+void dev_xsk_prog_uninstall(struct net_device *dev)
+{
+   ASSERT_RTNL();
+
+   if (--dev->xsk_prog_ref == 0) {
+   bpf_prog_put(dev->xsk_prog);
+   dev->xsk_prog = NULL;
+   if (dev->xsk_prog_running)
+   (void)dev_change_xdp_fd(dev, NULL, -1,
+   dev->xsk_prog_flags);

[PATCH bpf-next 5/7] bpf: add function to load builtin BPF program

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

The added bpf_prog_load_builtin can be used to load and verify a BPF
program that originates from the kernel. We call this a "builtin BPF
program". A builtin program can be used for convenience, e.g. it
allows the kernel to use the BPF infrastructure for internal
tasks.

This functionality will be used by AF_XDP sockets in a later commit.
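
For illustration, a kernel-side user would do something like this (a
sketch only; the two-instruction program simply returns XDP_PASS):

  static struct bpf_prog *load_builtin_example(void)
  {
          static struct bpf_insn insns[] = {
                  BPF_MOV64_IMM(BPF_REG_0, XDP_PASS),
                  BPF_EXIT_INSN(),
          };
          union bpf_attr attr = {};

          attr.prog_type = BPF_PROG_TYPE_XDP;
          attr.insns = (unsigned long)insns;
          attr.insn_cnt = ARRAY_SIZE(insns);
          attr.license = (unsigned long)"GPL";

          return bpf_prog_load_builtin(&attr);
  }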

Signed-off-by: Björn Töpel 
---
 include/linux/bpf.h  |  2 ++
 kernel/bpf/syscall.c | 32 
 2 files changed, 26 insertions(+), 8 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index e82b7039fc66..e810bfeb6239 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -563,6 +563,8 @@ static inline int bpf_map_attr_numa_node(const union bpf_attr *attr)
 struct bpf_prog *bpf_prog_get_type_path(const char *name, enum bpf_prog_type type);
 int array_map_alloc_check(union bpf_attr *attr);
 
+struct bpf_prog *bpf_prog_load_builtin(union bpf_attr *attr);
+
 #else /* !CONFIG_BPF_SYSCALL */
 static inline struct bpf_prog *bpf_prog_get(u32 ufd)
 {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index ee1328625330..323831e1a1e2 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1461,10 +1461,16 @@ static struct bpf_prog *__bpf_prog_load(union bpf_attr *attr,
!capable(CAP_SYS_ADMIN))
return ERR_PTR(-EPERM);
 
-   /* copy eBPF program license from user space */
-   if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
- sizeof(license) - 1) < 0)
-   return ERR_PTR(-EFAULT);
+   /* NB! If uattr is NULL, a builtin BPF is being loaded. */
+   if (uattr) {
+   /* copy eBPF program license from user space */
+   if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
+ sizeof(license) - 1) < 0)
+   return ERR_PTR(-EFAULT);
+   } else {
+   strncpy(license, (const char *)(unsigned long)attr->license,
+   sizeof(license) - 1);
+   }
license[sizeof(license) - 1] = 0;
 
/* eBPF programs must be GPL compatible to use GPL-ed functions */
@@ -1505,10 +1511,15 @@ static struct bpf_prog *__bpf_prog_load(union bpf_attr *attr,
 
prog->len = attr->insn_cnt;
 
-   err = -EFAULT;
-   if (copy_from_user(prog->insns, u64_to_user_ptr(attr->insns),
-  bpf_prog_insn_size(prog)) != 0)
-   goto free_prog;
+   if (uattr) {
+   err = -EFAULT;
+   if (copy_from_user(prog->insns, u64_to_user_ptr(attr->insns),
+  bpf_prog_insn_size(prog)) != 0)
+   goto free_prog;
+   } else {
+   memcpy(prog->insns, (void *)(unsigned long)attr->insns,
+  bpf_prog_insn_size(prog));
+   }
 
prog->orig_prog = NULL;
prog->jited = 0;
@@ -1584,6 +1595,11 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
return fd;
 }
 
+struct bpf_prog *bpf_prog_load_builtin(union bpf_attr *attr)
+{
+   return __bpf_prog_load(attr, NULL);
+}
+
 #define BPF_OBJ_LAST_FIELD file_flags
 
 static int bpf_obj_pin(const union bpf_attr *attr)
-- 
2.19.1



[PATCH bpf-next 3/7] bpf: add bpf_xsk_redirect function

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

The bpf_xsk_redirect function is a new redirect bpf function, in
addition to bpf_redirect/bpf_redirect_map. If an XDP socket has been
attached to a netdev Rx queue via the XDP_ATTACH bind() option and
bpf_xsk_redirect is called, the packet will be redirected to the
attached socket.

The bpf_xsk_redirect function returns XDP_REDIRECT if there is a
socket attached to the originating queue, otherwise XDP_PASS.

This commit also adds the corresponding trace points for the redirect
call.

Signed-off-by: Björn Töpel 
---
 include/linux/filter.h |   4 ++
 include/trace/events/xdp.h |  61 ++
 include/uapi/linux/bpf.h   |  14 +-
 net/core/filter.c  | 100 +
 4 files changed, 178 insertions(+), 1 deletion(-)

diff --git a/include/linux/filter.h b/include/linux/filter.h
index d16deead65c6..691b5c1003c8 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -525,6 +525,10 @@ struct bpf_redirect_info {
u32 flags;
struct bpf_map *map;
struct bpf_map *map_to_flush;
+#ifdef CONFIG_XDP_SOCKETS
+   struct xdp_sock *xsk;
+   struct xdp_sock *xsk_to_flush;
+#endif
u32 kern_flags;
 };
 
diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
index e95cb86b65cf..30f399bd462b 100644
--- a/include/trace/events/xdp.h
+++ b/include/trace/events/xdp.h
@@ -158,6 +158,67 @@ struct _bpf_dtab_netdev {
 trace_xdp_redirect_map_err(dev, xdp, devmap_ifindex(fwd, map), \
err, map, idx)
 
+DECLARE_EVENT_CLASS(xsk_redirect_template,
+
+   TP_PROTO(const struct net_device *dev,
+const struct bpf_prog *xdp,
+int err,
+struct xdp_buff *xbuff),
+
+   TP_ARGS(dev, xdp, err, xbuff),
+
+   TP_STRUCT__entry(
+   __field(int, prog_id)
+   __field(u32, act)
+   __field(int, ifindex)
+   __field(int, err)
+   __field(u32, queue_index)
+   __field(enum xdp_mem_type, mem_type)
+   ),
+
+   TP_fast_assign(
+   __entry->prog_id= xdp->aux->id;
+   __entry->act= XDP_REDIRECT;
+   __entry->ifindex= dev->ifindex;
+   __entry->err= err;
+   __entry->queue_index= xbuff->rxq->queue_index;
+   __entry->mem_type   = xbuff->rxq->mem.type;
+   ),
+
+   TP_printk("prog_id=%d action=%s ifindex=%d err=%d queue_index=%d"
+ " mem_type=%d",
+ __entry->prog_id,
+ __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
+ __entry->ifindex,
+ __entry->err,
+ __entry->queue_index,
+ __entry->mem_type)
+);
+
+DEFINE_EVENT(xsk_redirect_template, xsk_redirect,
+   TP_PROTO(const struct net_device *dev,
+const struct bpf_prog *xdp,
+int err,
+struct xdp_buff *xbuff),
+
+   TP_ARGS(dev, xdp, err, xbuff)
+);
+
+DEFINE_EVENT(xsk_redirect_template, xsk_redirect_err,
+   TP_PROTO(const struct net_device *dev,
+const struct bpf_prog *xdp,
+int err,
+struct xdp_buff *xbuff),
+
+   TP_ARGS(dev, xdp, err, xbuff)
+);
+
+#define _trace_xsk_redirect(dev, xdp, xbuff)   \
+trace_xsk_redirect(dev, xdp, 0, xbuff)
+
+#define _trace_xsk_redirect_err(dev, xdp, xbuff, err)  \
+trace_xsk_redirect_err(dev, xdp, err, xbuff)
+
 TRACE_EVENT(xdp_cpumap_kthread,
 
TP_PROTO(int map_id, unsigned int processed,  unsigned int drops,
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index a84fd232d934..2912d87a39ba 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -2298,6 +2298,17 @@ union bpf_attr {
  * payload and/or *pop* value being to large.
  * Return
  * 0 on success, or a negative error in case of failure.
+ *
+ * int bpf_xsk_redirect(struct xdp_buff *xdp_md)
+ *  Description
+ * Redirect the packet to the attached XDP socket, if any.
+ * An XDP socket can be attached to a network interface Rx
+ * queue by passing the XDP_ATTACH option at bind point of
+ * the socket.
+ *
+ * Return
+ * **XDP_REDIRECT** if there is an XDP socket attached to the Rx
+ * queue receiving the frame, otherwise **XDP_PASS**.
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -2391,7 +2402,8 @@ union bpf_attr {
FN(map_pop_elem),   \
FN(map_peek_elem),  \
FN(msg_push_data),  \
-   FN(msg_pop_data),
+   FN(msg_pop_data),   \
+   FN(xsk_redirect),
 
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
 * function eBPF program intends to call
 */

[PATCH bpf-next 4/7] bpf: prepare for builtin bpf program

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

Break up bpf_prog_load into one function that allocates, initializes
and verifies a bpf program, and one that allocates a file descriptor.

The former function will be used in a later commit to load a builtin
BPF program.

Signed-off-by: Björn Töpel 
---
 kernel/bpf/syscall.c | 59 ++--
 1 file changed, 35 insertions(+), 24 deletions(-)

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index aa05aa38f4a8..ee1328625330 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1441,7 +1441,8 @@ bpf_prog_load_check_attach_type(enum bpf_prog_type prog_type,
 /* last field in 'union bpf_attr' used by this command */
 #defineBPF_PROG_LOAD_LAST_FIELD func_info_cnt
 
-static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
+static struct bpf_prog *__bpf_prog_load(union bpf_attr *attr,
+   union bpf_attr __user *uattr)
 {
enum bpf_prog_type type = attr->prog_type;
struct bpf_prog *prog;
@@ -1450,45 +1451,45 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
bool is_gpl;
 
if (CHECK_ATTR(BPF_PROG_LOAD))
-   return -EINVAL;
+   return ERR_PTR(-EINVAL);
 
if (attr->prog_flags & ~(BPF_F_STRICT_ALIGNMENT | BPF_F_ANY_ALIGNMENT))
-   return -EINVAL;
+   return ERR_PTR(-EINVAL);
 
if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
(attr->prog_flags & BPF_F_ANY_ALIGNMENT) &&
!capable(CAP_SYS_ADMIN))
-   return -EPERM;
+   return ERR_PTR(-EPERM);
 
/* copy eBPF program license from user space */
if (strncpy_from_user(license, u64_to_user_ptr(attr->license),
  sizeof(license) - 1) < 0)
-   return -EFAULT;
+   return ERR_PTR(-EFAULT);
license[sizeof(license) - 1] = 0;
 
/* eBPF programs must be GPL compatible to use GPL-ed functions */
is_gpl = license_is_gpl_compatible(license);
 
if (attr->insn_cnt == 0 || attr->insn_cnt > BPF_MAXINSNS)
-   return -E2BIG;
+   return ERR_PTR(-E2BIG);
 
if (type == BPF_PROG_TYPE_KPROBE &&
attr->kern_version != LINUX_VERSION_CODE)
-   return -EINVAL;
+   return  ERR_PTR(-EINVAL);
 
if (type != BPF_PROG_TYPE_SOCKET_FILTER &&
type != BPF_PROG_TYPE_CGROUP_SKB &&
!capable(CAP_SYS_ADMIN))
-   return -EPERM;
+   return ERR_PTR(-EPERM);
 
bpf_prog_load_fixup_attach_type(attr);
if (bpf_prog_load_check_attach_type(type, attr->expected_attach_type))
-   return -EINVAL;
+   return ERR_PTR(-EINVAL);
 
/* plain bpf_prog allocation */
prog = bpf_prog_alloc(bpf_prog_size(attr->insn_cnt), GFP_USER);
if (!prog)
-   return -ENOMEM;
+   return ERR_PTR(-ENOMEM);
 
prog->expected_attach_type = attr->expected_attach_type;
 
@@ -1544,20 +1545,8 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
if (err)
goto free_used_maps;
 
-   err = bpf_prog_new_fd(prog);
-   if (err < 0) {
-   /* failed to allocate fd.
-* bpf_prog_put() is needed because the above
-* bpf_prog_alloc_id() has published the prog
-* to the userspace and the userspace may
-* have refcnt-ed it through BPF_PROG_GET_FD_BY_ID.
-*/
-   bpf_prog_put(prog);
-   return err;
-   }
-
bpf_prog_kallsyms_add(prog);
-   return err;
+   return prog;
 
 free_used_maps:
kvfree(prog->aux->func_info);
@@ -1570,7 +1559,29 @@ static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
security_bpf_prog_free(prog->aux);
 free_prog_nouncharge:
bpf_prog_free(prog);
-   return err;
+   return ERR_PTR(err);
+}
+
+static int bpf_prog_load(union bpf_attr *attr, union bpf_attr __user *uattr)
+{
+   struct bpf_prog *prog = __bpf_prog_load(attr, uattr);
+   int fd;
+
+   if (IS_ERR(prog))
+   return PTR_ERR(prog);
+
+   fd = bpf_prog_new_fd(prog);
+   if (fd < 0) {
+   /* failed to allocate fd.
+* bpf_prog_put() is needed because the above
+* bpf_prog_alloc_id() has published the prog
+* to the userspace and the userspace may
+* have refcnt-ed it through BPF_PROG_GET_FD_BY_ID.
+*/
+   bpf_prog_put(prog);
+   }
+
+   return fd;
 }
 
 #define BPF_OBJ_LAST_FIELD file_flags
-- 
2.19.1



[PATCH bpf-next 2/7] xsk: add XDP_ATTACH bind() flag

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

In this commit the XDP_ATTACH bind() flag is introduced. When an XDP
socket is bound with this flag set, the socket will be associated with
a certain netdev Rx queue. The idea is that the XDP socket users do
not have to deal with the XSKMAP, or even an XDP program. Instead,
XDP_ATTACH will "attach" an XDP socket to a queue, and load a builtin
XDP program that forwards all packets received on the attached queue
to the socket.

An XDP socket bound with this option performs better, since the BPF
program is smaller, and the kernel code-path also has fewer
instructions.

This commit only introduces the first part of XDP_ATTACH, namely
associating the XDP socket to a netdev Rx queue.

To redirect XDP frames to an attached socket, the XDP program must use
the bpf_xsk_redirect helper, which will be introduced in the next
commit.
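
In user space, attaching then boils down to this (a sketch; fd is the
AF_XDP socket, and ifindex/queue_id select the Rx queue):

  struct sockaddr_xdp sxdp = {};

  sxdp.sxdp_family = AF_XDP;
  sxdp.sxdp_ifindex = ifindex;
  sxdp.sxdp_queue_id = queue_id;
  sxdp.sxdp_flags = XDP_ATTACH;

  if (bind(fd, (struct sockaddr *)&sxdp, sizeof(sxdp)))
          /* handle error */;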

Signed-off-by: Björn Töpel 
---
 include/linux/netdevice.h   |  1 +
 include/net/xdp_sock.h  |  2 ++
 include/uapi/linux/if_xdp.h |  1 +
 net/xdp/xsk.c   | 50 +
 4 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 94fb2e12f117..a6cc68d2504c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -743,6 +743,7 @@ struct netdev_rx_queue {
struct xdp_rxq_info xdp_rxq;
 #ifdef CONFIG_XDP_SOCKETS
struct xdp_umem *umem;
+   struct xdp_sock *xsk;
 #endif
 } cacheline_aligned_in_smp;
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 13acb9803a6d..95315eb0410a 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -72,7 +72,9 @@ struct xdp_sock {
 
 struct xdp_buff;
 #ifdef CONFIG_XDP_SOCKETS
+int xsk_generic_attached_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
+int xsk_attached_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 void xsk_flush(struct xdp_sock *xs);
 bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs);
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index caed8b1614ff..bd76235c2749 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -16,6 +16,7 @@
 #define XDP_SHARED_UMEM	(1 << 0)
 #define XDP_COPY   (1 << 1) /* Force copy-mode */
 #define XDP_ZEROCOPY   (1 << 2) /* Force zero-copy mode */
+#define XDP_ATTACH (1 << 3)
 
 struct sockaddr_xdp {
__u16 sxdp_family;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index a03268454a27..1eff7ac8596d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -100,17 +100,20 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
return err;
 }
 
-int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+int xsk_attached_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-   u32 len;
+   u32 len = xdp->data_end - xdp->data;
+
+   return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
+   __xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len);
+}
 
+int xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
return -EINVAL;
 
-   len = xdp->data_end - xdp->data;
-
-   return (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY) ?
-   __xsk_rcv_zc(xs, xdp, len) : __xsk_rcv(xs, xdp, len);
+   return xsk_attached_rcv(xs, xdp);
 }
 
 void xsk_flush(struct xdp_sock *xs)
@@ -119,7 +122,7 @@ void xsk_flush(struct xdp_sock *xs)
	xs->sk.sk_data_ready(&xs->sk);
 }
 
-int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+int xsk_generic_attached_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
u32 metalen = xdp->data - xdp->data_meta;
u32 len = xdp->data_end - xdp->data;
@@ -127,9 +130,6 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
u64 addr;
int err;
 
-   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
-   return -EINVAL;
-
	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
@@ -152,6 +152,14 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return err;
 }
 
+int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
+{
+   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+   return -EINVAL;
+
+   return xsk_generic_attached_rcv(xs, xdp);
+}
+
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
 {
xskq_produce_flush_addr_n(umem->cq, nb_entries);
@@ -339,6 +347,18 @@ static int xsk_init_queue(u32 entries, struct xsk_queue **queue,
 

[PATCH bpf-next 1/7] xsk: simplify AF_XDP socket teardown

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

Prior to this commit, when the struct socket object was being released,
the UMEM did not have its reference count decreased. Instead, this was
done in the struct sock sk_destruct function.

There is no reason to keep the UMEM reference around when the socket
is being orphaned, so in this patch xdp_put_umem is called in the
xsk_release function. As a result, the xsk_destruct function can be
removed!

Note that a struct xsk_sock reference might still linger in the XSKMAP
after the UMEM is released, e.g. if a user does not clear the XSKMAP
prior to closing the process. This sock will be in a "released"
zombie-like state until the XSKMAP entry is removed.

Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 16 +---
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 07156f43d295..a03268454a27 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -366,6 +366,7 @@ static int xsk_release(struct socket *sock)
 
xskq_destroy(xs->rx);
xskq_destroy(xs->tx);
+   xdp_put_umem(xs->umem);
 
sock_orphan(sk);
sock->sk = NULL;
@@ -713,18 +714,6 @@ static const struct proto_ops xsk_proto_ops = {
.sendpage   = sock_no_sendpage,
 };
 
-static void xsk_destruct(struct sock *sk)
-{
-   struct xdp_sock *xs = xdp_sk(sk);
-
-   if (!sock_flag(sk, SOCK_DEAD))
-   return;
-
-   xdp_put_umem(xs->umem);
-
-   sk_refcnt_debug_dec(sk);
-}
-
 static int xsk_create(struct net *net, struct socket *sock, int protocol,
  int kern)
 {
@@ -751,9 +740,6 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
 
sk->sk_family = PF_XDP;
 
-   sk->sk_destruct = xsk_destruct;
-   sk_refcnt_debug_inc(sk);
-
sock_set_flag(sk, SOCK_RCU_FREE);
 
xs = xdp_sk(sk);
-- 
2.19.1



[PATCH bpf-next 0/7] Add XDP_ATTACH bind() flag to AF_XDP sockets

2018-12-07 Thread Björn Töpel
From: Björn Töpel 

Hi!

This patch set adds support for a new XDP socket bind option,
XDP_ATTACH.

The rationale behind attach is performance and ease of use. Many XDP
socket users just need a simple way of creating/binding a socket and
receiving frames right away without loading an XDP program.

XDP_ATTACH adds a mechanism we call "builtin XDP program", which is
simply a kernel-provided XDP program that is installed to the netdev
when XDP_ATTACH is passed as a bind() flag.

The builtin program is the simplest program possible to redirect a
frame to an attached socket. In restricted C it would look like this:

  SEC("xdp")
  int xdp_prog(struct xdp_md *ctx)
  {
return bpf_xsk_redirect(ctx);
  }

The builtin program loaded via XDP_ATTACH behaves, from an
install-to-netdev/uninstall-from-netdev point of view, differently
from regular XDP programs. The easiest way to look at it is as a
2-level hierarchy, where regular XDP programs have precedence over the
builtin one.

If no regular XDP program is installed to the netdev, the builtin one
will be installed. If the builtin program is installed, and a regular
one is then installed, the regular XDP program will take precedence
over the builtin one.

Further, if a regular program is installed, and later removed, the
builtin one will automatically be installed.

The sxdp_flags field of struct sockaddr_xdp gets two new options,
XDP_BUILTIN_SKB_MODE and XDP_BUILTIN_DRV_MODE, which map to the
corresponding XDP netlink install flags.

The builtin XDP program functionally adds even more complexity to the
already hard-to-read dev_change_xdp_fd. Maybe it would be simpler to
store the program in the struct net_device together with the install
flags, instead of calling ndo_bpf multiple times?

The outline of the series is as following:
  patch 1-2: Introduce the first part of XDP_ATTACH, simply adding
 the socket to the netdev structure.
  patch 3:   Add a new BPF function, bpf_xsk_redirect, that 
 redirects a frame to an attached socket.
  patch 4-5: Preparatory commits for built in BPF programs
  patch 6:   Make XDP_ATTACH load a builtin XDP program
  patch 7:   Extend the samples application with XDP_ATTACH
 support

Patch 1 through 3 give the performance boost and make it possible to
use AF_XDP sockets without an XSKMAP, but still require an explicit
XDP program to be loaded.

Patch 4 through 6 make it possible to use an XDP socket without
explicitly loading an XDP program.

The performance numbers for rxdrop (Intel(R) Xeon(R) Gold 6154 CPU @
3.00GHz):

XDP_SKB:
XSKMAP: 2.8 Mpps
XDP_ATTACH: 2.9 Mpps

XDP_DRV - copy:
XSKMAP: 8.5 Mpps
XDP_ATTACH: 9.3 Mpps

XDP_DRV - zero-copy:
XSKMAP: 15.1 Mpps
XDP_ATTACH: 17.3 Mpps

Thanks!
Björn


Björn Töpel (7):
  xsk: simplify AF_XDP socket teardown
  xsk: add XDP_ATTACH bind() flag
  bpf: add bpf_xsk_redirect function
  bpf: prepare for builtin bpf program
  bpf: add function to load builtin BPF program
  xsk: load a builtin XDP program on XDP_ATTACH
  samples: bpf: add support for XDP_ATTACH to xdpsock

 include/linux/bpf.h |   2 +
 include/linux/filter.h  |   4 +
 include/linux/netdevice.h   |  11 +++
 include/net/xdp_sock.h  |   2 +
 include/trace/events/xdp.h  |  61 +++
 include/uapi/linux/bpf.h|  14 +++-
 include/uapi/linux/if_xdp.h |   9 ++-
 kernel/bpf/syscall.c|  91 ++
 net/core/dev.c  |  84 +++--
 net/core/filter.c   | 100 
 net/xdp/xsk.c   | 146 +---
 samples/bpf/xdpsock_user.c  | 108 --
 12 files changed, 524 insertions(+), 108 deletions(-)

-- 
2.19.1



[PATCH bpf] xsk: do not call synchronize_net() under RCU read lock

2018-10-08 Thread Björn Töpel
From: Björn Töpel 

The XSKMAP update and delete functions called synchronize_net(), which
can sleep. It is not allowed to sleep during an RCU read section.

Instead we need to make sure that the sock sk_destruct (xsk_destruct)
function is asynchronously called after an RCU grace period. Setting
the SOCK_RCU_FREE flag for XDP sockets takes care of this.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet 
Signed-off-by: Björn Töpel 
---
 kernel/bpf/xskmap.c | 10 ++
 net/xdp/xsk.c   |  2 ++
 2 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..47147c9e184d 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -192,11 +192,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
sock_hold(sock->sk);
 
	old_xs = xchg(&m->xsk_map[i], xs);
-   if (old_xs) {
-   /* Make sure we've flushed everything. */
-   synchronize_net();
+   if (old_xs)
sock_put((struct sock *)old_xs);
-   }
 
sockfd_put(sock);
return 0;
@@ -212,11 +209,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
return -EINVAL;
 
	old_xs = xchg(&m->xsk_map[k], NULL);
-   if (old_xs) {
-   /* Make sure we've flushed everything. */
-   synchronize_net();
+   if (old_xs)
sock_put((struct sock *)old_xs);
-   }
 
return 0;
 }
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0577cd49aa72..07156f43d295 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -754,6 +754,8 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
sk->sk_destruct = xsk_destruct;
sk_refcnt_debug_inc(sk);
 
+   sock_set_flag(sk, SOCK_RCU_FREE);
+
xs = xdp_sk(sk);
mutex_init(>mutex);
spin_lock_init(>tx_completion_lock);
-- 
2.17.1



Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 18:55 skrev Eric Dumazet :
>
[...]
>
> You might take a look at SOCK_RCU_FREE flag for sockets.
>

Ah, thanks! I'll use this instead.


Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 18:05 skrev Björn Töpel :
>
> Den mån 8 okt. 2018 kl 17:31 skrev Eric Dumazet :
> >
[...]
> > So it is illegal to call synchronize_net(), since it is a reschedule point.
> >
>
> Thanks for finding and pointing this out, Eric!
>
> I'll have look and get back with a patch.
>

Eric, something along the lines of the patch below? Or is it considered
bad practice to use call_rcu in this context (prone to DoSing the
kernel)?

Thanks for spending time on the xskmap code. Very much appreciated!

From 491f7bd87705f72c45e59242fc6c3b1db9d3b56d Mon Sep 17 00:00:00 2001
From: Björn Töpel 
Date: Mon, 8 Oct 2018 18:34:11 +0200
Subject: [PATCH] xsk: do not call synchronize_net() under RCU read lock
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

XSKMAP update and delete functions called synchronize_net(), which can
sleep. It is not allowed to sleep during an RCU read section.

Fixes: fbfc504a24f5 ("bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP")
Reported-by: Eric Dumazet 
Signed-off-by: Björn Töpel 
---
 include/net/xdp_sock.h |  1 +
 kernel/bpf/xskmap.c| 21 +++--
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 13acb9803a6d..5b430141a3f6 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -68,6 +68,7 @@ struct xdp_sock {
  */
 spinlock_t tx_completion_lock;
 u64 rx_dropped;
+struct rcu_head rcu;
 };

 struct xdp_buff;
diff --git a/kernel/bpf/xskmap.c b/kernel/bpf/xskmap.c
index 9f8463afda9c..51e8e2785612 100644
--- a/kernel/bpf/xskmap.c
+++ b/kernel/bpf/xskmap.c
@@ -157,6 +157,13 @@ static void *xsk_map_lookup_elem(struct bpf_map *map, void *key)
 return NULL;
 }

+static void __xsk_map_remove_async(struct rcu_head *rcu)
+{
+struct xdp_sock *xs = container_of(rcu, struct xdp_sock, rcu);
+
+sock_put((struct sock *)xs);
+}
+
 static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
u64 map_flags)
 {
@@ -192,11 +199,8 @@ static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
 sock_hold(sock->sk);

old_xs = xchg(&m->xsk_map[i], xs);
-if (old_xs) {
-/* Make sure we've flushed everything. */
-synchronize_net();
-sock_put((struct sock *)old_xs);
-}
+if (old_xs)
+call_rcu(&old_xs->rcu, __xsk_map_remove_async);

 sockfd_put(sock);
 return 0;
@@ -212,11 +216,8 @@ static int xsk_map_delete_elem(struct bpf_map *map, void *key)
 return -EINVAL;

old_xs = xchg(&m->xsk_map[k], NULL);
-if (old_xs) {
-/* Make sure we've flushed everything. */
-synchronize_net();
-sock_put((struct sock *)old_xs);
-}
+if (old_xs)
+call_rcu(&old_xs->rcu, __xsk_map_remove_async);

 return 0;
 }
-- 
2.17.1


Re: [PATCH bpf-next v3 07/15] bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP

2018-10-08 Thread Björn Töpel
Den mån 8 okt. 2018 kl 17:31 skrev Eric Dumazet :
>
> On 05/02/2018 04:01 AM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > The xskmap is yet another BPF map, very much inspired by
> > dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
> > adds AF_XDP sockets into the map, and by using the bpf_redirect_map
> > helper, an XDP program can redirect XDP frames to an AF_XDP socket.
> >
> > Note that a socket that is bound to certain ifindex/queue index will
> > *only* accept XDP frames from that netdev/queue index. If an XDP
> > program tries to redirect from a netdev/queue index other than what
> > the socket is bound to, the frame will not be received on the socket.
> >
> > A socket can reside in multiple maps.
> >
> > v3: Fixed race and simplified code.
> > v2: Removed one indirection in map lookup.
> >
> > Signed-off-by: Björn Töpel 
> > ---
> >  include/linux/bpf.h   |  25 +
> >  include/linux/bpf_types.h |   3 +
> >  include/net/xdp_sock.h|   7 ++
> >  include/uapi/linux/bpf.h  |   1 +
> >  kernel/bpf/Makefile   |   3 +
> >  kernel/bpf/verifier.c |   8 +-
> >  kernel/bpf/xskmap.c   | 239 ++
> >  net/xdp/xsk.c |   5 +
> >  8 files changed, 289 insertions(+), 2 deletions(-)
> >  create mode 100644 kernel/bpf/xskmap.c
> >
>
> This function is called under rcu_read_lock(), from map_update_elem().
>
> > +
> > +static int xsk_map_update_elem(struct bpf_map *map, void *key, void *value,
> > +u64 map_flags)
> > +{
> > + struct xsk_map *m = container_of(map, struct xsk_map, map);
> > + u32 i = *(u32 *)key, fd = *(u32 *)value;
> > + struct xdp_sock *xs, *old_xs;
> > + struct socket *sock;
> > + int err;
> > +
> > + if (unlikely(map_flags > BPF_EXIST))
> > + return -EINVAL;
> > + if (unlikely(i >= m->map.max_entries))
> > + return -E2BIG;
> > + if (unlikely(map_flags == BPF_NOEXIST))
> > + return -EEXIST;
> > +
> > + sock = sockfd_lookup(fd, &err);
> > + if (!sock)
> > + return err;
> > +
> > + if (sock->sk->sk_family != PF_XDP) {
> > + sockfd_put(sock);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + xs = (struct xdp_sock *)sock->sk;
> > +
> > + if (!xsk_is_setup_for_bpf_map(xs)) {
> > + sockfd_put(sock);
> > + return -EOPNOTSUPP;
> > + }
> > +
> > + sock_hold(sock->sk);
> > +
> > + old_xs = xchg(&m->xsk_map[i], xs);
> > + if (old_xs) {
> > + /* Make sure we've flushed everything. */
>
> So it is illegal to call synchronize_net(), since it is a reschedule point.
>

Thanks for finding and pointing this out, Eric!

I'll have look and get back with a patch.


Björn


> > + synchronize_net();
> > + sock_put((struct sock *)old_xs);
> > + }
> > +
> > + sockfd_put(sock);
> > + return 0;
> > +}
> >
>
>
>


Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-05 Thread Björn Töpel

On 2018-10-05 06:59, Björn Töpel wrote:

On 2018-10-04 23:18, Jesper Dangaard Brouer wrote:

I see similar performance numbers, but my system can crash with 'txonly'.


Thanks for finding this, Jesper!

Can you give me your "lspci -vvv" dump of your NIC, so I know what ixgbe
flavor you've got?

I'll dig into it right away.



Jesper, there's (hopefully) a fix for the crash here:

  https://patchwork.ozlabs.org/patch/979442/

Thanks for spending time on the ixgbe ZC patches!




Björn


[PATCH bpf-next] xsk: proper AF_XDP socket teardown ordering

2018-10-05 Thread Björn Töpel
From: Björn Töpel 

The AF_XDP socket struct can exist in three different, implicit
states: setup, bound and released. Setup is prior to the socket being
bound to a device. Bound is when the socket is active for receive and
send. Released is when the process/userspace side of the socket is
released, but the sock object is still lingering, e.g. when there is a
reference to the socket in an XSKMAP after process termination.

The Rx fast-path code uses the "dev" member of struct xdp_sock to
check whether a socket is bound or released, and the Tx code uses the
struct xdp_umem "xsk_list" member in conjunction with "dev" to
determine the state of a socket.

However, the transition from bound to released did not tear the socket
down in correct order.

On the Rx side "dev" was cleared after synchronize_net() making the
synchronization useless. On the Tx side, the internal queues were
destroyed prior removing them from the "xsk_list".

This commit corrects the cleanup order, and by doing so
xdp_del_sk_umem() can be simplified and one synchronize_net() can be
removed.

Fixes: 965a99098443 ("xsk: add support for bind for Rx")
Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
Reported-by: Jesper Dangaard Brouer 
Signed-off-by: Björn Töpel 
---
 net/xdp/xdp_umem.c | 11 +++
 net/xdp/xsk.c  | 13 -
 2 files changed, 11 insertions(+), 13 deletions(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index c6007c58231c..a264cf2accd0 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -32,14 +32,9 @@ void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
 {
unsigned long flags;
 
-   if (xs->dev) {
-   spin_lock_irqsave(&umem->xsk_list_lock, flags);
-   list_del_rcu(&xs->list);
-   spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
-
-   if (umem->zc)
-   synchronize_net();
-   }
+   spin_lock_irqsave(&umem->xsk_list_lock, flags);
+   list_del_rcu(&xs->list);
+   spin_unlock_irqrestore(&umem->xsk_list_lock, flags);
 }
 
 /* The umem is stored both in the _rx struct and the _tx struct as we do
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index caeddad15b7c..0577cd49aa72 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -355,12 +355,18 @@ static int xsk_release(struct socket *sock)
local_bh_enable();
 
if (xs->dev) {
+   struct net_device *dev = xs->dev;
+
/* Wait for driver to stop using the xdp socket. */
-   synchronize_net();
-   dev_put(xs->dev);
+   xdp_del_sk_umem(xs->umem, xs);
xs->dev = NULL;
+   synchronize_net();
+   dev_put(dev);
}
 
+   xskq_destroy(xs->rx);
+   xskq_destroy(xs->tx);
+
sock_orphan(sk);
sock->sk = NULL;
 
@@ -714,9 +720,6 @@ static void xsk_destruct(struct sock *sk)
if (!sock_flag(sk, SOCK_DEAD))
return;
 
-   xskq_destroy(xs->rx);
-   xskq_destroy(xs->tx);
-   xdp_del_sk_umem(xs->umem, xs);
xdp_put_umem(xs->umem);
 
sk_refcnt_debug_dec(sk);
-- 
2.17.1



Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-04 Thread Björn Töpel

On 2018-10-04 23:18, Jesper Dangaard Brouer wrote:

I see similar performance numbers, but my system can crash with 'txonly'.


Thanks for finding this, Jesper!

Can you give me your "lspci -vvv" dump of your NIC, so I know what ixgbe
flavor you've got?

I'll dig into it right away.


Björn


Re: [PATCH v2] typo fix in Documentation/networking/af_xdp.rst

2018-10-04 Thread Björn Töpel
Den tors 4 okt. 2018 kl 19:03 skrev Konrad Djimeli :
>
> Fix a simple typo: Completetion -> Completion
>
> Signed-off-by: Konrad Djimeli 
> ---
> Changes in v2:
> - Update line below to be same length as text above
>
>  Documentation/networking/af_xdp.rst | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
> index ff929cfab4f4..4ae4f9d8f8fe 100644
> --- a/Documentation/networking/af_xdp.rst
> +++ b/Documentation/networking/af_xdp.rst
> @@ -159,8 +159,8 @@ log2(2048) LSB of the addr will be masked off, meaning that 2048, 2050
>  and 3000 refers to the same chunk.
>
>
> -UMEM Completetion Ring
> -~~
> +UMEM Completion Ring
> +
>
>  The Completion Ring is used transfer ownership of UMEM frames from
>  kernel-space to user-space. Just like the Fill ring, UMEM indicies are
> --
> 2.17.1
>

Thanks Konrad! For future patches, you should tag your patch with
'bpf' or 'bpf-next' as pointed out in
Documentation/bpf/bpf_devel_QA.rst. I guess this should go to 'bpf'.

Acked-by: Björn Töpel 


Re: [PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-02 Thread Björn Töpel

On 2018-10-02 20:23, William Tu wrote:

On Tue, Oct 2, 2018 at 1:01 AM Björn Töpel  wrote:


From: Björn Töpel 

Jeff: Please remove the v1 patches from your dev-queue!

This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
driver.

The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
analogous to the i40e ZC support. Again, as in i40e, code paths have
been copied from the XDP path to the zero-copy path. Going forward we
will try to generalize more code between the AF_XDP ZC drivers, and
also reduce the heavy C&P.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for TR/RX and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is GCC 7.3.0. The NIC is Intel
82599ES/X520-2 10Gbit/s using the ixgbe driver.

Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
for 64B and 1500B packets, generated by a commercial packet generator
HW blasting packets at full 10Gbit/s line rate. The results are with
retpoline and all other spectre and meltdown fixes.

AF_XDP performance 64B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      14.7
txpush      14.6
l2fwd       11.1

AF_XDP performance 1500B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      0.8
l2fwd       0.8

XDP performance on our system as a base line.

64B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  14.7   0

1500B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  0.80

The structure of the patch set is as follows:

Patch 1: Introduce Rx/Tx ring enable/disable functionality
Patch 2: Preparatory patch to ixgbe driver code for RX
Patch 3: ixgbe zero-copy support for RX
Patch 4: Preparatory patch to ixgbe driver code for TX
Patch 5: ixgbe zero-copy support for TX

Changes since v1:

* Removed redundant AF_XDP precondition checks, pointed out by
   Jakub. Now, the preconditions are only checked at XDP enable time.
* Fixed a crash in the egress path, due to incorrect usage of
   ixgbe_ring queue_index member. In v2 a ring_idx back reference is
   introduced, and used in favor of queue_index. William reported the
   crash, and helped me smoke out the issue. Kudos!


Thanks! I tested this series and no more crash.


Thank you for spending time on this!


The number is pretty good (*without* spectre and meltdown fixes)
model name : Intel(R) Xeon(R) CPU E5-2440 v2 @ 1.90GHz, total 16 cores/

AF_XDP performance 64B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      20
txpush      18
l2fwd       20



What is 20 here? Given that 14.8Mpps is maximum for 64B@10Gbit/s for
one queue, is this multiple queues? Is this xdpsock or OvS with AF_XDP?
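
(For reference, 14.8 Mpps is the theoretical 64B line rate: each 64B
frame occupies 64 + 8 (preamble/SFD) + 12 (inter-frame gap) = 84 bytes
= 672 bits on the wire, and 10^10 bit/s / 672 bit ~ 14.88 Mpps per 10G
port.)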


Cheers,
Björn


Regards,
William



[PATCH v2 5/5] ixgbe: add AF_XDP zero-copy Tx support

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

This patch adds zero-copy Tx support for AF_XDP sockets. It implements
the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
NAPI context. This means pulling egress packets from the Tx ring,
placing the frames on the NIC HW descriptor ring and completing sent
frames back to the application via the completion ring.

The regular XDP Tx ring is used for AF_XDP as well. The rationale for
this is as follows: XDP_REDIRECT guarantees mutual exclusion between
different NAPI contexts based on CPU id. In other words, a netdev can
XDP_REDIRECT to another netdev with a different NAPI context, since
the operation is bound to a specific core and each core has its own
hardware ring.

As the AF_XDP Tx action is running in the same NAPI context and using
the same ring, it will also be protected from XDP_REDIRECT actions
with the exact same mechanism.
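
As a one-line illustration of that per-CPU binding -- this is how the
XDP Tx ring is picked later in the series (see ixgbe_xmit_xdp_ring),
so two NAPI contexts can never select the same ring:

	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];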

As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
ixgbe_xsk.c.
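
A minimal sketch (simplified, not the verbatim driver code) of what
the async-xmit entry point boils down to: validate the queue id, then
kick the ring's NAPI context so the transmit work described above runs
from softirq:

	static int xsk_async_xmit_sketch(struct net_device *dev, u32 queue_id)
	{
		struct ixgbe_adapter *adapter = netdev_priv(dev);
		struct ixgbe_ring *ring;

		if (queue_id >= adapter->num_xdp_queues)
			return -ENXIO;

		ring = adapter->xdp_ring[queue_id];
		if (!ring->xsk_umem)
			return -ENXIO;

		/* defer the actual Tx processing to the NAPI poll loop */
		napi_schedule(&ring->q_vector->napi);
		return 0;
	}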

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  17 +-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |   4 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 175 ++
 3 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index b211032f8682..ec31b32d6674 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3161,7 +3161,11 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 #endif
 
ixgbe_for_each_ring(ring, q_vector->tx) {
-   if (!ixgbe_clean_tx_irq(q_vector, ring, budget))
+   bool wd = ring->xsk_umem ?
+ ixgbe_clean_xdp_tx_irq(q_vector, ring, budget) :
+ ixgbe_clean_tx_irq(q_vector, ring, budget);
+
+   if (!wd)
clean_complete = false;
}
 
@@ -3472,6 +3476,10 @@ void ixgbe_configure_tx_ring(struct ixgbe_adapter 
*adapter,
u32 txdctl = IXGBE_TXDCTL_ENABLE;
u8 reg_idx = ring->reg_idx;
 
+   ring->xsk_umem = NULL;
+   if (ring_is_xdp(ring))
+   ring->xsk_umem = ixgbe_xsk_umem(adapter, ring);
+
/* disable queue to avoid issues while updating state */
IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), 0);
IXGBE_WRITE_FLUSH(hw);
@@ -5944,6 +5952,11 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
u16 i = tx_ring->next_to_clean;
	struct ixgbe_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
 
+   if (tx_ring->xsk_umem) {
+   ixgbe_xsk_clean_tx_ring(tx_ring);
+   goto out;
+   }
+
while (i != tx_ring->next_to_use) {
union ixgbe_adv_tx_desc *eop_desc, *tx_desc;
 
@@ -5995,6 +6008,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
if (!ring_is_xdp(tx_ring))
netdev_tx_reset_queue(txring_txq(tx_ring));
 
+out:
/* reset next_to_use and next_to_clean */
tx_ring->next_to_use = 0;
tx_ring->next_to_clean = 0;
@@ -10350,6 +10364,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_features_check = ixgbe_features_check,
.ndo_bpf= ixgbe_xdp,
.ndo_xdp_xmit   = ixgbe_xdp_xmit,
+   .ndo_xsk_async_xmit = ixgbe_xsk_async_xmit,
 };
 
 static void ixgbe_disable_txr_hw(struct ixgbe_adapter *adapter,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index 56afb685c648..53d4089f5644 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -42,5 +42,9 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
  struct ixgbe_ring *rx_ring,
  const int budget);
 void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring);
+bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
+   struct ixgbe_ring *tx_ring, int napi_budget);
+int ixgbe_xsk_async_xmit(struct net_device *dev, u32 queue_id);
+void ixgbe_xsk_clean_tx_ring(struct ixgbe_ring *tx_ring);
 
 #endif /* #define _IXGBE_TXRX_COMMON_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 61259036ff4b..cf1c6f2d97e5 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -626,3 +626,178 @@ void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring)
}
}
 }
+
+static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
+{
+   union ixgbe_adv_tx_desc *tx_desc = NULL;
+   struct ixgbe_tx_buffer *tx_bi;
+   bool work_done = true;
+   u32 len, cmd_type;
+   dma_addr_t dma;
+
+   while (budget-- > 0) {

[PATCH v2 4/5] ixgbe: move common Tx functions to ixgbe_txrx_common.h

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

This patch prepares for the upcoming zero-copy Tx functionality by
moving common functions used both by the regular path and zero-copy
path.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 9 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h | 5 +
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 547092b8fe54..b211032f8682 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -895,8 +895,8 @@ static void ixgbe_set_ivar(struct ixgbe_adapter *adapter, 
s8 direction,
}
 }
 
-static inline void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter,
- u64 qmask)
+void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter,
+   u64 qmask)
 {
u32 mask;
 
@@ -8156,9 +8156,6 @@ static inline int ixgbe_maybe_stop_tx(struct ixgbe_ring 
*tx_ring, u16 size)
return __ixgbe_maybe_stop_tx(tx_ring, size);
 }
 
-#define IXGBE_TXD_CMD (IXGBE_TXD_CMD_EOP | \
-  IXGBE_TXD_CMD_RS)
-
 static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
struct ixgbe_tx_buffer *first,
const u8 hdr_len)
@@ -10259,7 +10256,7 @@ static int ixgbe_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
-static void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
+void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
 {
/* Force memory writes to complete before letting h/w know there
 * are new descriptors to fetch.
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index cf219f4e009d..56afb685c648 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -9,6 +9,9 @@
 #define IXGBE_XDP_TX   BIT(1)
 #define IXGBE_XDP_REDIRBIT(2)
 
+#define IXGBE_TXD_CMD (IXGBE_TXD_CMD_EOP | \
+  IXGBE_TXD_CMD_RS)
+
 int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
struct xdp_frame *xdpf);
 bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
@@ -19,6 +22,8 @@ void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
  struct sk_buff *skb);
 void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
  struct sk_buff *skb);
+void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring);
+void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter, u64 qmask);
 
 void ixgbe_txrx_ring_disable(struct ixgbe_adapter *adapter, int ring);
 void ixgbe_txrx_ring_enable(struct ixgbe_adapter *adapter, int ring);
-- 
2.17.1



[PATCH v2 3/5] ixgbe: add AF_XDP zero-copy Rx support

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
queue.
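
In rough strokes, and assuming the xdp_rxq_info API of this kernel,
the per-queue switch looks like this: when a UMEM is attached, the Rx
queue's memory model is registered as MEM_TYPE_ZERO_COPY with the
ring's zero-copy allocator (the zca member added below):

	err = xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
					 MEM_TYPE_ZERO_COPY,
					 &rx_ring->zca);
	if (err)
		return err; /* cannot enable zero-copy for this queue */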

All AF_XDP specific functions are added to a new file, ixgbe_xsk.c.

Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
will allocate a new buffer and copy the zero-copy frame prior to
passing it to the kernel stack.
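
A minimal sketch of that copy step, assuming only standard helpers
(napi_alloc_skb/__skb_put); the frame must be copied because the UMEM
buffer is owned by user space and will be recycled:

	static struct sk_buff *xsk_construct_skb(struct napi_struct *napi,
						 struct xdp_buff *xdp)
	{
		unsigned int len = xdp->data_end - xdp->data;
		struct sk_buff *skb;

		skb = napi_alloc_skb(napi, len);
		if (!skb)
			return NULL;

		/* copy out of the UMEM so the ZC buffer can be reused */
		memcpy(__skb_put(skb, len), xdp->data, len);
		return skb;
	}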

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/Makefile |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  27 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  17 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  78 ++-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  15 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 628 ++
 6 files changed, 747 insertions(+), 21 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile 
b/drivers/net/ethernet/intel/ixgbe/Makefile
index 5414685189ce..ca6b0c458e4a 100644
--- a/drivers/net/ethernet/intel/ixgbe/Makefile
+++ b/drivers/net/ethernet/intel/ixgbe/Makefile
@@ -8,7 +8,8 @@ obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
   ixgbe_82599.o ixgbe_82598.o ixgbe_phy.o ixgbe_sriov.o \
-  ixgbe_mbx.o ixgbe_x540.o ixgbe_x550.o ixgbe_lib.o ixgbe_ptp.o
+  ixgbe_mbx.o ixgbe_x540.o ixgbe_x550.o ixgbe_lib.o ixgbe_ptp.o \
+  ixgbe_xsk.o
 
 ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
   ixgbe_dcb_82599.o ixgbe_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 265db172042a..7a7679e7be84 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -228,13 +228,17 @@ struct ixgbe_tx_buffer {
 struct ixgbe_rx_buffer {
struct sk_buff *skb;
dma_addr_t dma;
-   struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-   __u32 page_offset;
-#else
-   __u16 page_offset;
-#endif
-   __u16 pagecnt_bias;
+   union {
+   struct {
+   struct page *page;
+   __u32 page_offset;
+   __u16 pagecnt_bias;
+   };
+   struct {
+   void *addr;
+   u64 handle;
+   };
+   };
 };
 
 struct ixgbe_queue_stats {
@@ -348,6 +352,10 @@ struct ixgbe_ring {
struct ixgbe_rx_queue_stats rx_stats;
};
struct xdp_rxq_info xdp_rxq;
+   struct xdp_umem *xsk_umem;
+   struct zero_copy_allocator zca; /* ZC allocator anchor */
+   u16 ring_idx;   /* {rx,tx,xdp}_ring back reference idx */
+   u16 rx_buf_len;
 } cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
@@ -765,6 +773,11 @@ struct ixgbe_adapter {
 #ifdef CONFIG_XFRM_OFFLOAD
struct ixgbe_ipsec *ipsec;
 #endif /* CONFIG_XFRM_OFFLOAD */
+
+   /* AF_XDP zero-copy */
+   struct xdp_umem **xsk_umems;
+   u16 num_xsk_umems_used;
+   u16 num_xsk_umems;
 };
 
 static inline u8 ixgbe_max_rss_indices(struct ixgbe_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
index d361f570ca37..62e6499e4146 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c
@@ -1055,7 +1055,7 @@ static int ixgbe_alloc_q_vectors(struct ixgbe_adapter 
*adapter)
int txr_remaining = adapter->num_tx_queues;
int xdp_remaining = adapter->num_xdp_queues;
int rxr_idx = 0, txr_idx = 0, xdp_idx = 0, v_idx = 0;
-   int err;
+   int err, i;
 
/* only one q_vector if MSI-X is disabled. */
if (!(adapter->flags & IXGBE_FLAG_MSIX_ENABLED))
@@ -1097,6 +1097,21 @@ static int ixgbe_alloc_q_vectors(struct ixgbe_adapter 
*adapter)
xdp_idx += xqpv;
}
 
+   for (i = 0; i < adapter->num_rx_queues; i++) {
+   if (adapter->rx_ring[i])
+   adapter->rx_ring[i]->ring_idx = i;
+   }
+
+   for (i = 0; i < adapter->num_tx_queues; i++) {
+   if (adapter->tx_ring[i])
+   adapter->tx_ring[i]->ring_idx = i;
+   }
+
+   for (i = 0; i < adapter->num_xdp_queues; i++) {
+   if (adapter->xdp_ring[i])
+   adapter->xdp_ring[i]->ring_idx = i;
+   }
+
return 0;
 
 err_out:
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index cc655c4e24fd..547092b8fe54 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -34,6 +34,7 @@

[PATCH v2 2/5] ixgbe: move common Rx functions to ixgbe_txrx_common.h

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

This patch prepares for the upcoming zero-copy Rx functionality, by
moving/changing linkage of common functions, used both by the regular
path and zero-copy path.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 29 +++
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  | 26 +
 2 files changed, 37 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 6ff886498882..cc655c4e24fd 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -40,6 +40,7 @@
 #include "ixgbe_dcb_82599.h"
 #include "ixgbe_sriov.h"
 #include "ixgbe_model.h"
+#include "ixgbe_txrx_common.h"
 
 char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
@@ -1673,9 +1674,9 @@ static void ixgbe_update_rsc_stats(struct ixgbe_ring 
*rx_ring,
  * order to populate the hash, checksum, VLAN, timestamp, protocol, and
  * other fields within the skb.
  **/
-static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
-union ixgbe_adv_rx_desc *rx_desc,
-struct sk_buff *skb)
+void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
+ union ixgbe_adv_rx_desc *rx_desc,
+ struct sk_buff *skb)
 {
struct net_device *dev = rx_ring->netdev;
u32 flags = rx_ring->q_vector->adapter->flags;
@@ -1708,8 +1709,8 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring 
*rx_ring,
skb->protocol = eth_type_trans(skb, dev);
 }
 
-static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
-struct sk_buff *skb)
+void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
+ struct sk_buff *skb)
 {
	napi_gro_receive(&q_vector->napi, skb);
 }
@@ -1868,9 +1869,9 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring 
*rx_ring,
  *
  * Returns true if an error was encountered and skb was freed.
  **/
-static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
- union ixgbe_adv_rx_desc *rx_desc,
- struct sk_buff *skb)
+bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
+  union ixgbe_adv_rx_desc *rx_desc,
+  struct sk_buff *skb)
 {
struct net_device *netdev = rx_ring->netdev;
 
@@ -2186,14 +2187,6 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring 
*rx_ring,
return skb;
 }
 
-#define IXGBE_XDP_PASS 0
-#define IXGBE_XDP_CONSUMED BIT(0)
-#define IXGBE_XDP_TX   BIT(1)
-#define IXGBE_XDP_REDIRBIT(2)
-
-static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_frame *xdpf);
-
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 struct ixgbe_ring *rx_ring,
 struct xdp_buff *xdp)
@@ -8471,8 +8464,8 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
 }
 
 #endif
-static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_frame *xdpf)
+int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
+   struct xdp_frame *xdpf)
 {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
struct ixgbe_tx_buffer *tx_buffer;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
new file mode 100644
index ..3780d315b991
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef _IXGBE_TXRX_COMMON_H_
+#define _IXGBE_TXRX_COMMON_H_
+
+#define IXGBE_XDP_PASS 0
+#define IXGBE_XDP_CONSUMED BIT(0)
+#define IXGBE_XDP_TX   BIT(1)
+#define IXGBE_XDP_REDIRBIT(2)
+
+int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
+   struct xdp_frame *xdpf);
+bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
+  union ixgbe_adv_rx_desc *rx_desc,
+  struct sk_buff *skb);
+void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
+ union ixgbe_adv_rx_desc *rx_desc,
+ struct sk_buff *skb);
+void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
+ struct sk_buff *skb);
+
+void ixgbe_txrx_ring_disable(struct ixgbe_adapter *adapter, int ring);
+void ixgbe_txrx_ring_enable(struct ixgbe_adapter *adapter, int ring);
+
+#endif /* #define _IXGBE_TXRX_COMMON_H_ */
-- 
2.17.1



[PATCH v2 1/5] ixgbe: added Rx/Tx ring disable/enable functions

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

Add functions for Rx/Tx ring enable/disable. Instead of resetting the
whole device, only the affected ring is disabled or enabled.

This plumbing is used in later commits, when zero-copy AF_XDP support
is introduced.
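
In rough strokes, the intended use of the helpers this patch exports
is a narrow reconfiguration window instead of a full device reset:

	ixgbe_txrx_ring_disable(adapter, ring);
	/* e.g. attach or detach the AF_XDP UMEM for this queue */
	ixgbe_txrx_ring_enable(adapter, ring);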

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 158 ++
 2 files changed, 159 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 5c6fd42e90ed..265db172042a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -271,6 +271,7 @@ enum ixgbe_ring_state_t {
__IXGBE_TX_DETECT_HANG,
__IXGBE_HANG_CHECK_ARMED,
__IXGBE_TX_XDP_RING,
+   __IXGBE_TX_DISABLED,
 };
 
 #define ring_uses_build_skb(ring) \
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 187b78f950b5..6ff886498882 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8694,6 +8694,8 @@ static netdev_tx_t __ixgbe_xmit_frame(struct sk_buff *skb,
return NETDEV_TX_OK;
 
tx_ring = ring ? ring : adapter->tx_ring[skb->queue_mapping];
+   if (unlikely(test_bit(__IXGBE_TX_DISABLED, &tx_ring->state)))
+   return NETDEV_TX_BUSY;
 
return ixgbe_xmit_frame_ring(skb, adapter, tx_ring);
 }
@@ -10240,6 +10242,9 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
if (unlikely(!ring))
return -ENXIO;
 
+   if (unlikely(test_bit(__IXGBE_TX_DISABLED, &ring->state)))
+   return -ENXIO;
+
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
int err;
@@ -10303,6 +10308,159 @@ static const struct net_device_ops ixgbe_netdev_ops = 
{
.ndo_xdp_xmit   = ixgbe_xdp_xmit,
 };
 
+static void ixgbe_disable_txr_hw(struct ixgbe_adapter *adapter,
+struct ixgbe_ring *tx_ring)
+{
+   unsigned long wait_delay, delay_interval;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u8 reg_idx = tx_ring->reg_idx;
+   int wait_loop;
+   u32 txdctl;
+
+   IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), IXGBE_TXDCTL_SWFLSH);
+
+   /* delay mechanism from ixgbe_disable_tx */
+   delay_interval = ixgbe_get_completion_timeout(adapter) / 100;
+
+   wait_loop = IXGBE_MAX_RX_DESC_POLL;
+   wait_delay = delay_interval;
+
+   while (wait_loop--) {
+   usleep_range(wait_delay, wait_delay + 10);
+   wait_delay += delay_interval * 2;
+   txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(reg_idx));
+
+   if (!(txdctl & IXGBE_TXDCTL_ENABLE))
+   return;
+   }
+
+   e_err(drv, "TXDCTL.ENABLE not cleared within the polling period\n");
+}
+
+static void ixgbe_disable_txr(struct ixgbe_adapter *adapter,
+ struct ixgbe_ring *tx_ring)
+{
+   set_bit(__IXGBE_TX_DISABLED, &tx_ring->state);
+   ixgbe_disable_txr_hw(adapter, tx_ring);
+}
+
+static void ixgbe_disable_rxr_hw(struct ixgbe_adapter *adapter,
+struct ixgbe_ring *rx_ring)
+{
+   unsigned long wait_delay, delay_interval;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u8 reg_idx = rx_ring->reg_idx;
+   int wait_loop;
+   u32 rxdctl;
+
+   rxdctl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(reg_idx));
+   rxdctl &= ~IXGBE_RXDCTL_ENABLE;
+   rxdctl |= IXGBE_RXDCTL_SWFLSH;
+
+   /* write value back with RXDCTL.ENABLE bit cleared */
+   IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(reg_idx), rxdctl);
+
+   /* RXDCTL.EN may not change on 82598 if link is down, so skip it */
+   if (hw->mac.type == ixgbe_mac_82598EB &&
+   !(IXGBE_READ_REG(hw, IXGBE_LINKS) & IXGBE_LINKS_UP))
+   return;
+
+   /* delay mechanism from ixgbe_disable_rx */
+   delay_interval = ixgbe_get_completion_timeout(adapter) / 100;
+
+   wait_loop = IXGBE_MAX_RX_DESC_POLL;
+   wait_delay = delay_interval;
+
+   while (wait_loop--) {
+   usleep_range(wait_delay, wait_delay + 10);
+   wait_delay += delay_interval * 2;
+   rxdctl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(reg_idx));
+
+   if (!(rxdctl & IXGBE_RXDCTL_ENABLE))
+   return;
+   }
+
+   e_err(drv, "RXDCTL.ENABLE not cleared within the polling period\n");
+}
+
+static void ixgbe_reset_txr_stats(struct ixgbe_ring *tx_ring)
+{
+   memset(&tx_ring->stats, 0, sizeof(tx_ring->stats));
+   memset(&tx_ring->tx_stats, 0, sizeof(tx_ring->tx_stats));
+}
+
+static void ixgbe_reset_rxr_stats(struct ixgbe_ring *rx_ring)
+{
+   memset(&rx_ring->stats, 0, sizeof(rx_ring->stats));
+   memset(&rx_ring->rx_stats,

[PATCH v2 0/5] Introducing ixgbe AF_XDP ZC support

2018-10-02 Thread Björn Töpel
From: Björn Töpel 

Jeff: Please remove the v1 patches from your dev-queue!

This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
driver.

The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
analogous to the i40e ZC support. Again, as in i40e, code paths have
been copied from the XDP path to the zero-copy path. Going forward we
will try to generalize more code between the AF_XDP ZC drivers, and
also reduce the heavy C&P.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for Tx/Rx and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is GCC 7.3.0. The NIC is Intel
82599ES/X520-2 10Gbit/s using the ixgbe driver.

Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
for 64B and 1500B packets, generated by a commercial packet generator
HW blasting packets at full 10Gbit/s line rate. The results are with
retpoline and all other spectre and meltdown fixes.

AF_XDP performance 64B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      14.7
txpush      14.6
l2fwd       11.1

AF_XDP performance 1500B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      0.8
l2fwd       0.8

XDP performance on our system as a base line.

64B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  14.7   0

1500B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  0.8    0

The structure of the patch set is as follows:

Patch 1: Introduce Rx/Tx ring enable/disable functionality
Patch 2: Preparatory patch to ixgbe driver code for RX
Patch 3: ixgbe zero-copy support for RX
Patch 4: Preparatory patch to ixgbe driver code for TX
Patch 5: ixgbe zero-copy support for TX

Changes since v1:

* Removed redundant AF_XDP precondition checks, pointed out by
  Jakub. Now, the preconditions are only checked at XDP enable time.
* Fixed a crash in the egress path, due to incorrect usage of
  ixgbe_ring queue_index member. In v2 a ring_idx back reference is
  introduced, and used in favor of queue_index. William reported the
  crash, and helped me smoke out the issue. Kudos!
* In ixgbe_xsk_async_xmit, validate qid against num_xdp_queues,
  instead of num_rx_queues.

Cheers!
Björn

Björn Töpel (5):
  ixgbe: added Rx/Tx ring disable/enable functions
  ixgbe: move common Rx functions to ixgbe_txrx_common.h
  ixgbe: add AF_XDP zero-copy Rx support
  ixgbe: move common Tx functions to ixgbe_txrx_common.h
  ixgbe: add AF_XDP zero-copy Tx support

 drivers/net/ethernet/intel/ixgbe/Makefile |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  28 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_lib.c  |  17 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 291 ++-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  50 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 803 ++
 6 files changed, 1146 insertions(+), 46 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

-- 
2.17.1



Re: [net-next, PATCH 0/2, v3] net: socionext: XDP support

2018-10-01 Thread Björn Töpel

On 2018-09-29 13:28, Ilias Apalodimas wrote:

This patch series adds AF_XDP support to the socionext netsec driver


This series adds *XDP* support and as a result, the AF_XDP batteries
are included. ;-)


Björn


In addition the new dma allocation scheme offers a 10% boost on Rx
pps rate using 64b packets

- patch [1/2]: Use a different allocation scheme for Rx DMA buffers to
   prepare the driver for AF_XDP support
- patch [2/2]: Add XDP support without zero-copy

test and performance numbers (64b packets):
---
- Normal SKBs on Rx: ~217kpps
test: pktgen -> intel i210 -> netsec -> XDP_TX/XDP_REDIRECT
- XDP_TX: 320kpps
- XDP_REDIRECT: 320kpps

qemu -> pktgen -> virtio -> ndo_xdp_xmit -> netsec
- ndo_xdp_xmit: Could not send more than 120kpps. Interface forwarded that
 with success

Changes since v2:
  - Always allocate Rx buffers with XDP_PACKET_HEADROOM
  
  Björn Töpel:

  - Added locking in the Tx queue

  Jesper Dangaard Brouer:
  - Added support for .ndo_xdp_xmit
  - XDP_TX does not flush every packet

Changes since v1:
- patch [2/2]:
  Toshiaki Makita:
  - Added XDP_PACKET_HEADROOM
  - Fixed a bug on XDP_PASS case
  - Using extack for error messaging instead of netdev_warn, when
trying to setup XDP

Ilias Apalodimas (2):
   net: socionext: different approach on DMA
   net: socionext: add XDP support

  drivers/net/ethernet/socionext/netsec.c | 541 +---
  1 file changed, 426 insertions(+), 115 deletions(-)



Re: [PATCH 3/5] ixgbe: add AF_XDP zero-copy Rx support

2018-09-26 Thread Björn Töpel
Den tis 25 sep. 2018 kl 16:57 skrev Jakub Kicinski
:
>
> On Mon, 24 Sep 2018 18:35:55 +0200, Björn Töpel wrote:
> > + if (adapter->flags & IXGBE_FLAG_SRIOV_ENABLED)
> > + return -EINVAL;
> > +
> > + if (adapter->flags & IXGBE_FLAG_DCB_ENABLED)
> > + return -EINVAL;
>
> Hm, should you add UMEM checks to all the places these may get
> enabled?  Like fabf1bce103a ("ixgbe: Prevent unsupported configurations
> with XDP") did?

Actually, I can remove these checks, since they are already checked by
the XDP path. AF_XDP ZC is enabled only when XDP is enabled, so there's
no need to have the XDP checks in the AF_XDP path.

Björn


[PATCH 5/5] ixgbe: add AF_XDP zero-copy Tx support

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

This patch adds zero-copy Tx support for AF_XDP sockets. It implements
the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
NAPI context. This means pulling egress packets from the Tx ring,
placing the frames on the NIC HW descriptor ring and completing sent
frames back to the application via the completion ring.

The regular XDP Tx ring is used for AF_XDP as well. The rationale for
this is as follows: XDP_REDIRECT guarantees mutual exclusion between
different NAPI contexts based on CPU id. In other words, a netdev can
XDP_REDIRECT to another netdev with a different NAPI context, since
the operation is bound to a specific core and each core has its own
hardware ring.

As the AF_XDP Tx action is running in the same NAPI context and using
the same ring, it will also be protected from XDP_REDIRECT actions
with the exact same mechanism.

As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
ixgbe_xsk.c.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  16 +-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |   4 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 175 ++
 3 files changed, 194 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4e9b28894a5b..64b19346396d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -3161,7 +3161,11 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
 #endif
 
ixgbe_for_each_ring(ring, q_vector->tx) {
-   if (!ixgbe_clean_tx_irq(q_vector, ring, budget))
+   bool wd = ring->xsk_umem ?
+ ixgbe_clean_xdp_tx_irq(q_vector, ring, budget) :
+ ixgbe_clean_tx_irq(q_vector, ring, budget);
+
+   if (!wd)
clean_complete = false;
}
 
@@ -3472,6 +3476,9 @@ void ixgbe_configure_tx_ring(struct ixgbe_adapter 
*adapter,
u32 txdctl = IXGBE_TXDCTL_ENABLE;
u8 reg_idx = ring->reg_idx;
 
+   if (ring_is_xdp(ring))
+   ring->xsk_umem = ixgbe_xsk_umem(adapter, ring);
+
/* disable queue to avoid issues while updating state */
IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), 0);
IXGBE_WRITE_FLUSH(hw);
@@ -5938,6 +5945,11 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
u16 i = tx_ring->next_to_clean;
	struct ixgbe_tx_buffer *tx_buffer = &tx_ring->tx_buffer_info[i];
 
+   if (tx_ring->xsk_umem) {
+   ixgbe_xsk_clean_tx_ring(tx_ring);
+   goto out;
+   }
+
while (i != tx_ring->next_to_use) {
union ixgbe_adv_tx_desc *eop_desc, *tx_desc;
 
@@ -5989,6 +6001,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring 
*tx_ring)
if (!ring_is_xdp(tx_ring))
netdev_tx_reset_queue(txring_txq(tx_ring));
 
+out:
/* reset next_to_use and next_to_clean */
tx_ring->next_to_use = 0;
tx_ring->next_to_clean = 0;
@@ -10369,6 +10382,7 @@ static const struct net_device_ops ixgbe_netdev_ops = {
.ndo_features_check = ixgbe_features_check,
.ndo_bpf= ixgbe_xdp,
.ndo_xdp_xmit   = ixgbe_xdp_xmit,
+   .ndo_xsk_async_xmit = ixgbe_xsk_async_xmit,
 };
 
 static void ixgbe_disable_txr_hw(struct ixgbe_adapter *adapter,
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index 56afb685c648..53d4089f5644 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -42,5 +42,9 @@ int ixgbe_clean_rx_irq_zc(struct ixgbe_q_vector *q_vector,
  struct ixgbe_ring *rx_ring,
  const int budget);
 void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring);
+bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
+   struct ixgbe_ring *tx_ring, int napi_budget);
+int ixgbe_xsk_async_xmit(struct net_device *dev, u32 queue_id);
+void ixgbe_xsk_clean_tx_ring(struct ixgbe_ring *tx_ring);
 
 #endif /* #define _IXGBE_TXRX_COMMON_H_ */
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 253ce3cfbcf1..e998ed880460 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -637,3 +637,178 @@ void ixgbe_xsk_clean_rx_ring(struct ixgbe_ring *rx_ring)
}
}
 }
+
+static bool ixgbe_xmit_zc(struct ixgbe_ring *xdp_ring, unsigned int budget)
+{
+   union ixgbe_adv_tx_desc *tx_desc = NULL;
+   struct ixgbe_tx_buffer *tx_bi;
+   bool work_done = true;
+   u32 len, cmd_type;
+   dma_addr_t dma;
+
+   while (budget-- > 0) {
+   if (unl

[PATCH 4/5] ixgbe: move common Tx functions to ixgbe_txrx_common.h

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

This patch prepares for the upcoming zero-copy Tx functionality by
moving common functions used both by the regular path and zero-copy
path.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c| 9 +++--
 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h | 5 +
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4e6726c623a8..4e9b28894a5b 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -895,8 +895,8 @@ static void ixgbe_set_ivar(struct ixgbe_adapter *adapter, 
s8 direction,
}
 }
 
-static inline void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter,
- u64 qmask)
+void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter,
+   u64 qmask)
 {
u32 mask;
 
@@ -8150,9 +8150,6 @@ static inline int ixgbe_maybe_stop_tx(struct ixgbe_ring 
*tx_ring, u16 size)
return __ixgbe_maybe_stop_tx(tx_ring, size);
 }
 
-#define IXGBE_TXD_CMD (IXGBE_TXD_CMD_EOP | \
-  IXGBE_TXD_CMD_RS)
-
 static int ixgbe_tx_map(struct ixgbe_ring *tx_ring,
struct ixgbe_tx_buffer *first,
const u8 hdr_len)
@@ -10275,7 +10272,7 @@ static int ixgbe_xdp(struct net_device *dev, struct 
netdev_bpf *xdp)
}
 }
 
-static void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
+void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring)
 {
/* Force memory writes to complete before letting h/w know there
 * are new descriptors to fetch.
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
index cf219f4e009d..56afb685c648 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -9,6 +9,9 @@
 #define IXGBE_XDP_TX   BIT(1)
 #define IXGBE_XDP_REDIRBIT(2)
 
+#define IXGBE_TXD_CMD (IXGBE_TXD_CMD_EOP | \
+  IXGBE_TXD_CMD_RS)
+
 int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
struct xdp_frame *xdpf);
 bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
@@ -19,6 +22,8 @@ void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
  struct sk_buff *skb);
 void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
  struct sk_buff *skb);
+void ixgbe_xdp_ring_update_tail(struct ixgbe_ring *ring);
+void ixgbe_irq_rearm_queues(struct ixgbe_adapter *adapter, u64 qmask);
 
 void ixgbe_txrx_ring_disable(struct ixgbe_adapter *adapter, int ring);
 void ixgbe_txrx_ring_enable(struct ixgbe_adapter *adapter, int ring);
-- 
2.17.1



[PATCH 1/5] ixgbe: added Rx/Tx ring disable/enable functions

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

Add functions for Rx/Tx ring enable/disable. Instead of resetting the
whole device, only the affected ring is disabled or enabled.

This plumbing is used in later commits, when zero-copy AF_XDP support
is introduced.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |   1 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 158 ++
 2 files changed, 159 insertions(+)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 5c6fd42e90ed..265db172042a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -271,6 +271,7 @@ enum ixgbe_ring_state_t {
__IXGBE_TX_DETECT_HANG,
__IXGBE_HANG_CHECK_ARMED,
__IXGBE_TX_XDP_RING,
+   __IXGBE_TX_DISABLED,
 };
 
 #define ring_uses_build_skb(ring) \
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index d4d47707662f..3b68e20c067d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -8688,6 +8688,8 @@ static netdev_tx_t __ixgbe_xmit_frame(struct sk_buff *skb,
return NETDEV_TX_OK;
 
tx_ring = ring ? ring : adapter->tx_ring[skb->queue_mapping];
+   if (unlikely(test_bit(__IXGBE_TX_DISABLED, &tx_ring->state)))
+   return NETDEV_TX_BUSY;
 
return ixgbe_xmit_frame_ring(skb, adapter, tx_ring);
 }
@@ -10256,6 +10258,9 @@ static int ixgbe_xdp_xmit(struct net_device *dev, int n,
if (unlikely(!ring))
return -ENXIO;
 
+   if (unlikely(test_bit(__IXGBE_TX_DISABLED, &ring->state)))
+   return -ENXIO;
+
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
int err;
@@ -10322,6 +10327,159 @@ static const struct net_device_ops ixgbe_netdev_ops = 
{
.ndo_xdp_xmit   = ixgbe_xdp_xmit,
 };
 
+static void ixgbe_disable_txr_hw(struct ixgbe_adapter *adapter,
+struct ixgbe_ring *tx_ring)
+{
+   unsigned long wait_delay, delay_interval;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u8 reg_idx = tx_ring->reg_idx;
+   int wait_loop;
+   u32 txdctl;
+
+   IXGBE_WRITE_REG(hw, IXGBE_TXDCTL(reg_idx), IXGBE_TXDCTL_SWFLSH);
+
+   /* delay mechanism from ixgbe_disable_tx */
+   delay_interval = ixgbe_get_completion_timeout(adapter) / 100;
+
+   wait_loop = IXGBE_MAX_RX_DESC_POLL;
+   wait_delay = delay_interval;
+
+   while (wait_loop--) {
+   usleep_range(wait_delay, wait_delay + 10);
+   wait_delay += delay_interval * 2;
+   txdctl = IXGBE_READ_REG(hw, IXGBE_TXDCTL(reg_idx));
+
+   if (!(txdctl & IXGBE_TXDCTL_ENABLE))
+   return;
+   }
+
+   e_err(drv, "TXDCTL.ENABLE not cleared within the polling period\n");
+}
+
+static void ixgbe_disable_txr(struct ixgbe_adapter *adapter,
+ struct ixgbe_ring *tx_ring)
+{
+   set_bit(__IXGBE_TX_DISABLED, &tx_ring->state);
+   ixgbe_disable_txr_hw(adapter, tx_ring);
+}
+
+static void ixgbe_disable_rxr_hw(struct ixgbe_adapter *adapter,
+struct ixgbe_ring *rx_ring)
+{
+   unsigned long wait_delay, delay_interval;
+   struct ixgbe_hw *hw = &adapter->hw;
+   u8 reg_idx = rx_ring->reg_idx;
+   int wait_loop;
+   u32 rxdctl;
+
+   rxdctl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(reg_idx));
+   rxdctl &= ~IXGBE_RXDCTL_ENABLE;
+   rxdctl |= IXGBE_RXDCTL_SWFLSH;
+
+   /* write value back with RXDCTL.ENABLE bit cleared */
+   IXGBE_WRITE_REG(hw, IXGBE_RXDCTL(reg_idx), rxdctl);
+
+   /* RXDCTL.EN may not change on 82598 if link is down, so skip it */
+   if (hw->mac.type == ixgbe_mac_82598EB &&
+   !(IXGBE_READ_REG(hw, IXGBE_LINKS) & IXGBE_LINKS_UP))
+   return;
+
+   /* delay mechanism from ixgbe_disable_rx */
+   delay_interval = ixgbe_get_completion_timeout(adapter) / 100;
+
+   wait_loop = IXGBE_MAX_RX_DESC_POLL;
+   wait_delay = delay_interval;
+
+   while (wait_loop--) {
+   usleep_range(wait_delay, wait_delay + 10);
+   wait_delay += delay_interval * 2;
+   rxdctl = IXGBE_READ_REG(hw, IXGBE_RXDCTL(reg_idx));
+
+   if (!(rxdctl & IXGBE_RXDCTL_ENABLE))
+   return;
+   }
+
+   e_err(drv, "RXDCTL.ENABLE not cleared within the polling period\n");
+}
+
+static void ixgbe_reset_txr_stats(struct ixgbe_ring *tx_ring)
+{
+   memset(&tx_ring->stats, 0, sizeof(tx_ring->stats));
+   memset(&tx_ring->tx_stats, 0, sizeof(tx_ring->tx_stats));
+}
+
+static void ixgbe_reset_rxr_stats(struct ixgbe_ring *rx_ring)
+{
+   memset(&rx_ring->stats, 0, sizeof(rx_ring->stats));
+   memset(&rx_ring->rx_stats,

[PATCH 2/5] ixgbe: move common Rx functions to ixgbe_txrx_common.h

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

This patch prepares for the upcoming zero-copy Rx functionality, by
moving/changing linkage of common functions, used both by the regular
path and zero-copy path.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 29 +++
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  | 26 +
 2 files changed, 37 insertions(+), 18 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3b68e20c067d..3e2e6fb2215a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -40,6 +40,7 @@
 #include "ixgbe_dcb_82599.h"
 #include "ixgbe_sriov.h"
 #include "ixgbe_model.h"
+#include "ixgbe_txrx_common.h"
 
 char ixgbe_driver_name[] = "ixgbe";
 static const char ixgbe_driver_string[] =
@@ -1673,9 +1674,9 @@ static void ixgbe_update_rsc_stats(struct ixgbe_ring 
*rx_ring,
  * order to populate the hash, checksum, VLAN, timestamp, protocol, and
  * other fields within the skb.
  **/
-static void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
-union ixgbe_adv_rx_desc *rx_desc,
-struct sk_buff *skb)
+void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
+ union ixgbe_adv_rx_desc *rx_desc,
+ struct sk_buff *skb)
 {
struct net_device *dev = rx_ring->netdev;
u32 flags = rx_ring->q_vector->adapter->flags;
@@ -1708,8 +1709,8 @@ static void ixgbe_process_skb_fields(struct ixgbe_ring 
*rx_ring,
skb->protocol = eth_type_trans(skb, dev);
 }
 
-static void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
-struct sk_buff *skb)
+void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
+ struct sk_buff *skb)
 {
	napi_gro_receive(&q_vector->napi, skb);
 }
@@ -1868,9 +1869,9 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring 
*rx_ring,
  *
  * Returns true if an error was encountered and skb was freed.
  **/
-static bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
- union ixgbe_adv_rx_desc *rx_desc,
- struct sk_buff *skb)
+bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
+  union ixgbe_adv_rx_desc *rx_desc,
+  struct sk_buff *skb)
 {
struct net_device *netdev = rx_ring->netdev;
 
@@ -2186,14 +2187,6 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring 
*rx_ring,
return skb;
 }
 
-#define IXGBE_XDP_PASS 0
-#define IXGBE_XDP_CONSUMED BIT(0)
-#define IXGBE_XDP_TX   BIT(1)
-#define IXGBE_XDP_REDIRBIT(2)
-
-static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_frame *xdpf);
-
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 struct ixgbe_ring *rx_ring,
 struct xdp_buff *xdp)
@@ -8465,8 +8458,8 @@ static u16 ixgbe_select_queue(struct net_device *dev, 
struct sk_buff *skb,
 }
 
 #endif
-static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-  struct xdp_frame *xdpf)
+int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
+   struct xdp_frame *xdpf)
 {
struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
struct ixgbe_tx_buffer *tx_buffer;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
new file mode 100644
index ..3780d315b991
--- /dev/null
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
@@ -0,0 +1,26 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright(c) 2018 Intel Corporation. */
+
+#ifndef _IXGBE_TXRX_COMMON_H_
+#define _IXGBE_TXRX_COMMON_H_
+
+#define IXGBE_XDP_PASS 0
+#define IXGBE_XDP_CONSUMED BIT(0)
+#define IXGBE_XDP_TX   BIT(1)
+#define IXGBE_XDP_REDIRBIT(2)
+
+int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
+   struct xdp_frame *xdpf);
+bool ixgbe_cleanup_headers(struct ixgbe_ring *rx_ring,
+  union ixgbe_adv_rx_desc *rx_desc,
+  struct sk_buff *skb);
+void ixgbe_process_skb_fields(struct ixgbe_ring *rx_ring,
+ union ixgbe_adv_rx_desc *rx_desc,
+ struct sk_buff *skb);
+void ixgbe_rx_skb(struct ixgbe_q_vector *q_vector,
+ struct sk_buff *skb);
+
+void ixgbe_txrx_ring_disable(struct ixgbe_adapter *adapter, int ring);
+void ixgbe_txrx_ring_enable(struct ixgbe_adapter *adapter, int ring);
+
+#endif /* #define _IXGBE_TXRX_COMMON_H_ */
-- 
2.17.1



[PATCH 3/5] ixgbe: add AF_XDP zero-copy Rx support

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
queue.

All AF_XDP specific functions are added to a new file, ixgbe_xsk.c.

Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
will allocate a new buffer and copy the zero-copy frame prior passing
it to the kernel stack.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/ixgbe/Makefile |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  26 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  78 ++-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  15 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 639 ++
 5 files changed, 741 insertions(+), 20 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

diff --git a/drivers/net/ethernet/intel/ixgbe/Makefile 
b/drivers/net/ethernet/intel/ixgbe/Makefile
index 5414685189ce..ca6b0c458e4a 100644
--- a/drivers/net/ethernet/intel/ixgbe/Makefile
+++ b/drivers/net/ethernet/intel/ixgbe/Makefile
@@ -8,7 +8,8 @@ obj-$(CONFIG_IXGBE) += ixgbe.o
 
 ixgbe-objs := ixgbe_main.o ixgbe_common.o ixgbe_ethtool.o \
   ixgbe_82599.o ixgbe_82598.o ixgbe_phy.o ixgbe_sriov.o \
-  ixgbe_mbx.o ixgbe_x540.o ixgbe_x550.o ixgbe_lib.o ixgbe_ptp.o
+  ixgbe_mbx.o ixgbe_x540.o ixgbe_x550.o ixgbe_lib.o ixgbe_ptp.o \
+  ixgbe_xsk.o
 
 ixgbe-$(CONFIG_IXGBE_DCB) +=  ixgbe_dcb.o ixgbe_dcb_82598.o \
   ixgbe_dcb_82599.o ixgbe_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h 
b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 265db172042a..421fdac3a76d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -228,13 +228,17 @@ struct ixgbe_tx_buffer {
 struct ixgbe_rx_buffer {
struct sk_buff *skb;
dma_addr_t dma;
-   struct page *page;
-#if (BITS_PER_LONG > 32) || (PAGE_SIZE >= 65536)
-   __u32 page_offset;
-#else
-   __u16 page_offset;
-#endif
-   __u16 pagecnt_bias;
+   union {
+   struct {
+   struct page *page;
+   __u32 page_offset;
+   __u16 pagecnt_bias;
+   };
+   struct {
+   void *addr;
+   u64 handle;
+   };
+   };
 };
 
 struct ixgbe_queue_stats {
@@ -348,6 +352,9 @@ struct ixgbe_ring {
struct ixgbe_rx_queue_stats rx_stats;
};
struct xdp_rxq_info xdp_rxq;
+   struct xdp_umem *xsk_umem;
+   struct zero_copy_allocator zca; /* ZC allocator anchor */
+   u16 rx_buf_len;
 } cacheline_internodealigned_in_smp;
 
 enum ixgbe_ring_f_enum {
@@ -765,6 +772,11 @@ struct ixgbe_adapter {
 #ifdef CONFIG_XFRM_OFFLOAD
struct ixgbe_ipsec *ipsec;
 #endif /* CONFIG_XFRM_OFFLOAD */
+
+   /* AF_XDP zero-copy */
+   struct xdp_umem **xsk_umems;
+   u16 num_xsk_umems_used;
+   u16 num_xsk_umems;
 };
 
 static inline u8 ixgbe_max_rss_indices(struct ixgbe_adapter *adapter)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c 
b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 3e2e6fb2215a..4e6726c623a8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -34,6 +34,7 @@
 #include 
 #include 
 #include 
+#include <net/xdp_sock.h>
 
 #include "ixgbe.h"
 #include "ixgbe_common.h"
@@ -3176,7 +3177,10 @@ int ixgbe_poll(struct napi_struct *napi, int budget)
per_ring_budget = budget;
 
ixgbe_for_each_ring(ring, q_vector->rx) {
-   int cleaned = ixgbe_clean_rx_irq(q_vector, ring,
+   int cleaned = ring->xsk_umem ?
+ ixgbe_clean_rx_irq_zc(q_vector, ring,
+   per_ring_budget) :
+ ixgbe_clean_rx_irq(q_vector, ring,
 per_ring_budget);
 
work_done += cleaned;
@@ -3706,10 +3710,27 @@ static void ixgbe_configure_srrctl(struct ixgbe_adapter 
*adapter,
srrctl = IXGBE_RX_HDR_SIZE << IXGBE_SRRCTL_BSIZEHDRSIZE_SHIFT;
 
/* configure the packet buffer length */
-   if (test_bit(__IXGBE_RX_3K_BUFFER, &rx_ring->state))
+   if (rx_ring->xsk_umem) {
+   u32 xsk_buf_len = rx_ring->xsk_umem->chunk_size_nohr -
+ XDP_PACKET_HEADROOM;
+
+   /* If the MAC support setting RXDCTL.RLPML, the
+* SRRCTL[n].BSIZEPKT is set to PAGE_SIZE and
+* RXDCTL.RLPML is set to the actual UMEM buffer
+* size. If not, then we are stuck with a 1k buffer
+* size resolution. In this case frames larger than
+  

[PATCH 0/5] Introducing ixgbe AF_XDP ZC support

2018-09-24 Thread Björn Töpel
From: Björn Töpel 

This patch set introduces zero-copy AF_XDP support for Intel's ixgbe
driver.

The ixgbe zero-copy code is located in its own file ixgbe_xsk.[ch],
analogous to the i40e ZC support. Again, as in i40e, code paths have
been copied from the XDP path to the zero-copy path. Going forward we
will try to generalize more code between the AF_XDP ZC drivers, and
also reduce the heavy C&P.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for Tx/Rx and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is GCC 7.3.0. The NIC is Intel
82599ES/X520-2 10Gbit/s using the ixgbe driver.

Below are the results in Mpps of the 82599ES/X520-2 NIC benchmark runs
for 64B and 1500B packets, generated by a commercial packet generator
HW blasting packets at full 10Gbit/s line rate. The results are with
retpoline and all other spectre and meltdown fixes.

AF_XDP performance 64B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      14.7
txpush      14.6
l2fwd       11.1

AF_XDP performance 1500B packets:
Benchmark   XDP_DRV with zerocopy
rxdrop      0.8
l2fwd       0.8

XDP performance on our system as a base line.

64B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  14.7   0

1500B packets:
XDP stats   CPU Mpps   issue-pps
XDP-RX CPU  16  0.8    0

The structure of the patch set is as follows:

Patch 1: Introduce Rx/Tx ring enable/disable functionality
Patch 2: Preparatory patch to ixgbe driver code for RX
Patch 3: ixgbe zero-copy support for RX
Patch 4: Preparatory patch to ixgbe driver code for TX
Patch 5: ixgbe zero-copy support for TX

Cheers!
Björn

Björn Töpel (5):
  ixgbe: added Rx/Tx ring disable/enable functions
  ixgbe: move common Rx functions to ixgbe_txrx_common.h
  ixgbe: add AF_XDP zero-copy Rx support
  ixgbe: move common Tx functions to ixgbe_txrx_common.h
  ixgbe: add AF_XDP zero-copy Tx support

 drivers/net/ethernet/intel/ixgbe/Makefile |   3 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe.h  |  27 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 290 ++-
 .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  50 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 814 ++
 5 files changed, 1139 insertions(+), 45 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_txrx_common.h
 create mode 100644 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c

-- 
2.17.1



Re: [Intel-wired-lan] [PATCH v2 1/4] i40e: clean zero-copy XDP Tx ring on shutdown/reset

2018-09-21 Thread Björn Töpel
Jeff,

Den fre 7 sep. 2018 kl 10:29 skrev Björn Töpel :
>
> From: Björn Töpel 
>
> When the zero-copy enabled XDP Tx ring is torn down, due to
> configuration changes, outstanding frames on the hardware descriptor
> ring are queued on the completion ring.
>
> The completion ring has a back-pressure mechanism that will guarantee
> that there is sufficient space on the ring.
>
> Signed-off-by: Björn Töpel 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 17 +++
>  .../ethernet/intel/i40e/i40e_txrx_common.h|  2 ++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c| 30 +++
>  3 files changed, 43 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
> b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 37bd4e50ccde..7f85d4ba8b54 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -636,13 +636,18 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
> unsigned long bi_size;
> u16 i;
>
> -   /* ring already cleared, nothing to do */
> -   if (!tx_ring->tx_bi)
> -   return;
> +   if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) {
> +   i40e_xsk_clean_tx_ring(tx_ring);
> +   } else {
> +   /* ring already cleared, nothing to do */
> +   if (!tx_ring->tx_bi)
> +   return;
>
> -   /* Free all the Tx ring sk_buffs */
> -   for (i = 0; i < tx_ring->count; i++)
> -   i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
> +   /* Free all the Tx ring sk_buffs */
> +   for (i = 0; i < tx_ring->count; i++)
> +   i40e_unmap_and_free_tx_resource(tx_ring,
> +   &tx_ring->tx_bi[i]);
> +   }
>
> bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
> memset(tx_ring->tx_bi, 0, bi_size);
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
> b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> index b5afd479a9c5..29c68b29d36f 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
> @@ -87,4 +87,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
> }
>  }
>
> +void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
> +
>  #endif /* I40E_TXRX_COMMON_ */
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
> b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> index 2ebfc78bbd09..99116277c4d2 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
> @@ -830,3 +830,33 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 
> queue_id)
>
> return 0;
>  }
> +
> +/**
> + * i40e_xsk_clean_xdp_ring - Clean the XDP Tx ring on shutdown
> + * @xdp_ring: XDP Tx ring
> + **/
> +void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
> +{
> +   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
> +   struct xdp_umem *umem = tx_ring->xsk_umem;
> +   struct i40e_tx_buffer *tx_bi;
> +   u32 xsk_frames = 0;
> +
> +   while (ntc != ntu) {
> +   tx_bi = &tx_ring->tx_bi[ntc];
> +
> +   if (tx_bi->xdpf)
> +   i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
> +   else
> +   xsk_frames++;
> +
> +   tx_bi->xdpf = NULL;
> +
> +   ntc++;
> +   if (ntc > tx_ring->count)

This is an off-by-one error, and should be:
if (ntc == tx_ring->count)
Otherwise the wrap to zero happens one iteration too late, and
tx_ring->tx_bi[tx_ring->count] is read one element past the end of the
ring.

Can you fix it up, or should I respin the patch?

Thanks!
Björn


> +   ntc = 0;
> +   }
> +
> +   if (xsk_frames)
> +   xsk_umem_complete_tx(umem, xsk_frames);
> +}
> --
> 2.17.1
>
> ___
> Intel-wired-lan mailing list
> intel-wired-...@osuosl.org
> https://lists.osuosl.org/mailman/listinfo/intel-wired-lan


Re: [net-next, PATCH 2/2, v2] net: socionext: add XDP support

2018-09-12 Thread Björn Töpel
Den ons 12 sep. 2018 kl 11:21 skrev Ilias Apalodimas
:
>
> On Wed, Sep 12, 2018 at 11:14:57AM +0200, Jesper Dangaard Brouer wrote:
> > On Wed, 12 Sep 2018 12:02:38 +0300
> > Ilias Apalodimas  wrote:
> >
> > > @@ -1003,20 +1076,29 @@ static int netsec_setup_rx_dring(struct 
> > > netsec_priv *priv)
> > > u16 len;
> > >
> > > buf = netsec_alloc_rx_data(priv, &dma_handle, &len);
> > > -   if (!buf) {
> > > -   netsec_uninit_pkt_dring(priv, NETSEC_RING_RX);
> > > +   if (!buf)
> > > goto err_out;
> > > -   }
> > > desc->dma_addr = dma_handle;
> > > desc->addr = buf;
> > > desc->len = len;
> > > }
> > >
> > > netsec_rx_fill(priv, 0, DESC_NUM);
> > > +   err = xdp_rxq_info_reg(&priv->xdp_rxq, priv->ndev, 0);
> >
> > Do you only have 1 RX queue? (last arg to xdp_rxq_info_reg is 0),
> >
> >
> Yes the current driver is only supporting a single queue (same for Tx)

XDP and skbuff paths sharing the same queue? You'll probably need some
means of synchronization between the .ndo_xdp_xmit and .ndo_start_xmit
implementations. And it looks like .ndo_xdp_xmit is missing!
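
A sketch of the kind of synchronization meant here, with a
hypothetical tx_lock in netsec_priv guarding the single shared
descriptor ring from both entry points:

	static netdev_tx_t netsec_start_xmit(struct sk_buff *skb,
					     struct net_device *ndev)
	{
		struct netsec_priv *priv = netdev_priv(ndev);

		spin_lock(&priv->tx_lock);
		/* ... post skb to the shared Tx descriptor ring ... */
		spin_unlock(&priv->tx_lock);

		return NETDEV_TX_OK;
	}

The .ndo_xdp_xmit implementation would take the same lock around its
descriptor writes.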


Björn

> > > +   if (err)
> > > +   goto err_out;
> > > +
> > > +   err = xdp_rxq_info_reg_mem_model(&priv->xdp_rxq, 
> > > MEM_TYPE_PAGE_SHARED,
> > > +NULL);
> > > +   if (err) {
> > > +   xdp_rxq_info_unreg(&priv->xdp_rxq);
> > > +   goto err_out;
> > > +   }
> > >
> > > return 0;
> > >
> >
> >
> > --
> > Best regards,
> >   Jesper Dangaard Brouer
> >   MSc.CS, Principal Kernel Engineer at Red Hat
> >   LinkedIn: http://www.linkedin.com/in/brouer
>
>
> Thanks for looking at this
>
> /Ilias


Re: tools/bpf regression causing samples/bpf/ to hang

2018-09-11 Thread Björn Töpel
Den tis 11 sep. 2018 kl 20:21 skrev Yonghong Song :
>
>
>
> On 9/11/18 10:15 AM, Björn Töpel wrote:
> > Den tis 11 sep. 2018 kl 18:47 skrev Yonghong Song :
> >>
> >>
> >>
> >> On 9/11/18 4:11 AM, Björn Töpel wrote:
> >>> Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
> >>> today, and was hit by a regression.
> >>>
> >>> Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> >>> functions into a new file") adds a while(1) around the recv call in
> >>> bpf_set_link_xdp_fd making that call getting stuck in an infinite
> >>> loop.
> >>>
> >>> I simply removed the loop, and that solved my problem (patch below).
> >>>
> >>> However, I don't know if removing the loop would break bpftool for
> >>> you. If not, I can submit the patch as a proper one for bpf-next.
> >>
> >> Hi, Björn, thanks for reporting the problem.
> >> The while loop is needed since the "recv" syscall buffer size
> >> may not be big enough to hold all the returned information, in
> >> which cases, multiple "recv" calls are needed.
> >>
> >> Could you try the following patch to see whether it fixed your
> >> issue? Thanks!
> >>
> >
> > Nope, it doesn't -- but if you move that hunk after the for-loop it works.
>
> Could you try this patch?
>

Works! Thanks!

Tested-by: Björn Töpel 

> commit 9a7fb19899ce87594fe8012f8a23fc8fc7b6b764 (HEAD -> fix)
> Author: Yonghong Song 
> Date:   Tue Sep 11 08:58:20 2018 -0700
>
>  tools/bpf: fix a netlink recv issue
>
>  Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>  functions into a new file") introduced a while loop for the
>  netlink recv path. This while loop is needed since the
>  buffer in recv syscall may not be enough to hold all the
>  information and in such cases multiple recv calls are needed.
>
>  There is a bug introduced by the above commit: the
>  while loop may block on the recv syscall if no more
>  messages are expected. The netlink message header
>  flag NLM_F_MULTI is used to indicate that more messages
>  are expected, and this patch fixes the bug by doing a
>  further recv syscall only if a multipart message is expected.
>
>  The patch adds another fix regarding a message length of 0.
>  When netlink recv returns a message length of 0, no more
>  messages with data will follow, so the while loop
>  can end.
>
>  Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file")
>  Reported-by: Björn Töpel 
>  Signed-off-by: Yonghong Song 
>
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 469e068dd0c5..fde1d7bf8199 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -65,18 +65,23 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
>  __dump_nlmsg_t _fn, dump_nlmsg_t fn,
>  void *cookie)
>   {
> +   bool multipart = true;
>  struct nlmsgerr *err;
>  struct nlmsghdr *nh;
>  char buf[4096];
>  int len, ret;
>
> -   while (1) {
> +   while (multipart) {
> +   multipart = false;
>  len = recv(sock, buf, sizeof(buf), 0);
>  if (len < 0) {
>  ret = -errno;
>  goto done;
>  }
>
> +   if (len == 0)
> +   break;
> +
>  for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>   nh = NLMSG_NEXT(nh, len)) {
>  if (nh->nlmsg_pid != nl_pid) {
> @@ -87,6 +92,8 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
>  ret = -LIBBPF_ERRNO__INVSEQ;
>  goto done;
>  }
> +   if (nh->nlmsg_flags & NLM_F_MULTI)
> +   multipart = true;
>  switch (nh->nlmsg_type) {
>  case NLMSG_ERROR:
>  err = (struct nlmsgerr *)NLMSG_DATA(nh);
>
>
> >
> > Cheers,
> > Björn
> >
> >> commit 3eb1c0249dfc3ea4ad61aa223dce32262af7e049 (HEAD -> fix)
> >> Author: Yonghong Song 
> >> Date:   Tue Sep 11 08:58:20 2018 -0700
> >>
> >>   tools/bpf: f

Re: tools/bpf regression causing samples/bpf/ to hang

2018-09-11 Thread Björn Töpel
On Tue, 11 Sep 2018 at 18:47, Yonghong Song wrote:
>
>
>
> On 9/11/18 4:11 AM, Björn Töpel wrote:
> > Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
> > today, and was hit by a regression.
> >
> > Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> > functions into a new file") adds a while(1) around the recv call in
> > bpf_set_link_xdp_fd, making that call get stuck in an infinite
> > loop.
> >
> > I simply removed the loop, and that solved my problem (patch below).
> >
> > However, I don't know if removing the loop would break bpftool for
> > you. If not, I can submit the patch as a proper one for bpf-next.
>
> Hi, Björn, thanks for reporting the problem.
> The while loop is needed since the "recv" syscall buffer size
> may not be big enough to hold all the returned information, in
> which case multiple "recv" calls are needed.
>
> Could you try the following patch to see whether it fixed your
> issue? Thanks!
>

Nope, it doesn't -- but if you move that hunk after the for-loop it works.

Cheers,
Björn

> commit 3eb1c0249dfc3ea4ad61aa223dce32262af7e049 (HEAD -> fix)
> Author: Yonghong Song 
> Date:   Tue Sep 11 08:58:20 2018 -0700
>
>  tools/bpf: fix a netlink recv issue
>
>  Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
>  functions into a new file") introduced a while loop for the
>  netlink recv path. This while loop is needed since the
>  buffer in recv syscall may not be big enough to hold all the
>  information and in such cases multiple recv calls are needed.
>
>  When netlink recv returns a message length of 0, no more
>  messages with data will follow, so the while loop
>  can end.
>
>  Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> functions into a new file")
>  Reported-by: Björn Töpel 
>  Signed-off-by: Yonghong Song 
>
> diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> index 469e068dd0c5..37827319a50a 100644
> --- a/tools/lib/bpf/netlink.c
> +++ b/tools/lib/bpf/netlink.c
> @@ -77,6 +77,9 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid,
> int seq,
>  goto done;
>  }
>
> +   if (len == 0)
> +   break;
> +
>  for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
>   nh = NLMSG_NEXT(nh, len)) {
>  if (nh->nlmsg_pid != nl_pid) {
>
>
> >
> > Thanks!
> > Björn
> >
> > From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= 
> > Date: Tue, 11 Sep 2018 12:35:44 +0200
> > Subject: [PATCH] tools/bpf: remove loop around netlink recv
> >
> > Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
> > functions into a new file") moved the bpf_set_link_xdp_fd and split it
> > up into multiple functions. The added receive function
> > bpf_netlink_recv added a loop around the recv syscall leading to
> > multiple recv calls. This resulted in all XDP samples in the
> > samples/bpf/ to stop working, since they were stuck in a blocking
> > recv.
> >
> > This commits removes the while (1)-statement.
> >
> > Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
> > functions into a new file")
> > Signed-off-by: Björn Töpel 
> > ---
> >   tools/lib/bpf/netlink.c | 64 -
> >   1 file changed, 31 insertions(+), 33 deletions(-)
> >
> > diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
> > index 469e068dd0c5..0eae1fbf46c6 100644
> > --- a/tools/lib/bpf/netlink.c
> > +++ b/tools/lib/bpf/netlink.c
> > @@ -70,41 +70,39 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int 
> > seq,
> >   char buf[4096];
> >   int len, ret;
> >
> > -while (1) {
> > -len = recv(sock, buf, sizeof(buf), 0);
> > -if (len < 0) {
> > -ret = -errno;
> > +len = recv(sock, buf, sizeof(buf), 0);
> > +if (len < 0) {
> > +ret = -errno;
> > +goto done;
> > +}
> > +
> > +for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> > + nh = NLMSG_NEXT(nh, len)) {
> > +if (nh->nlmsg_pid != nl_pid) {
> > +ret = -LIBBPF_ERRNO__WRNGPID;
> >   goto done;
> >   }
> > -
> > -for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
> > - nh = NLMSG_NEXT(nh, len)) {
> > -  

tools/bpf regression causing samples/bpf/ to hang

2018-09-11 Thread Björn Töpel
Hi Yonghong, I tried to run the XDP samples from the bpf-next tip
today, and was hit by a regression.

Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file") adds a while(1) around the recv call in
bpf_set_link_xdp_fd, making that call get stuck in an infinite
loop.

I simply removed the loop, and that solved my problem (patch below).

However, I don't know if removing the loop would break bpftool for
you. If not, I can submit the patch as a proper one for bpf-next.

Thanks!
Björn

From: =?UTF-8?q?Bj=C3=B6rn=20T=C3=B6pel?= 
Date: Tue, 11 Sep 2018 12:35:44 +0200
Subject: [PATCH] tools/bpf: remove loop around netlink recv

Commit f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file") moved the bpf_set_link_xdp_fd and split it
up into multiple functions. The added receive function
bpf_netlink_recv added a loop around the recv syscall leading to
multiple recv calls. This resulted in all XDP samples in the
samples/bpf/ to stop working, since they were stuck in a blocking
recv.

This commits removes the while (1)-statement.

Fixes: f7010770fbac ("tools/bpf: move bpf/lib netlink related
functions into a new file")
Signed-off-by: Björn Töpel 
---
 tools/lib/bpf/netlink.c | 64 -
 1 file changed, 31 insertions(+), 33 deletions(-)

diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 469e068dd0c5..0eae1fbf46c6 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -70,41 +70,39 @@ static int bpf_netlink_recv(int sock, __u32 nl_pid, int seq,
 char buf[4096];
 int len, ret;

-while (1) {
-len = recv(sock, buf, sizeof(buf), 0);
-if (len < 0) {
-ret = -errno;
+len = recv(sock, buf, sizeof(buf), 0);
+if (len < 0) {
+ret = -errno;
+goto done;
+}
+
+for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
+ nh = NLMSG_NEXT(nh, len)) {
+if (nh->nlmsg_pid != nl_pid) {
+ret = -LIBBPF_ERRNO__WRNGPID;
 goto done;
 }
-
-for (nh = (struct nlmsghdr *)buf; NLMSG_OK(nh, len);
- nh = NLMSG_NEXT(nh, len)) {
-if (nh->nlmsg_pid != nl_pid) {
-ret = -LIBBPF_ERRNO__WRNGPID;
-goto done;
-}
-if (nh->nlmsg_seq != seq) {
-ret = -LIBBPF_ERRNO__INVSEQ;
-goto done;
-}
-switch (nh->nlmsg_type) {
-case NLMSG_ERROR:
-err = (struct nlmsgerr *)NLMSG_DATA(nh);
-if (!err->error)
-continue;
-ret = err->error;
-nla_dump_errormsg(nh);
-goto done;
-case NLMSG_DONE:
-return 0;
-default:
-break;
-}
-if (_fn) {
-ret = _fn(nh, fn, cookie);
-if (ret)
-return ret;
-}
+if (nh->nlmsg_seq != seq) {
+ret = -LIBBPF_ERRNO__INVSEQ;
+goto done;
+}
+switch (nh->nlmsg_type) {
+case NLMSG_ERROR:
+err = (struct nlmsgerr *)NLMSG_DATA(nh);
+if (!err->error)
+continue;
+ret = err->error;
+nla_dump_errormsg(nh);
+goto done;
+case NLMSG_DONE:
+return 0;
+default:
+break;
+}
+if (_fn) {
+ret = _fn(nh, fn, cookie);
+if (ret)
+return ret;
 }
 }
 ret = 0;
-- 
2.17.1


Re: [net-next, PATCH 0/2, v1] net: socionext: add AF_XDP support

2018-09-10 Thread Björn Töpel
On Mon, 10 Sep 2018 at 10:26, Ilias Apalodimas wrote:
>
> This patch series adds AF_XDP support to the socionext netsec driver
>
> - patch [1/2]: Use a different allocation scheme for Rx DMA buffers to prepare
> the driver for AF_XDP support
> - patch [2/2]: Add AF_XDP support without zero-copy
>
> Ilias Apalodimas (2):
>   net: socionext: different approach on DMA
>   net: socionext: add AF_XDP support
>

You should probably rephrase patch #2. You are adding XDP support, and
AF_XDP just follows from that. Nice to see more XDP support!


Björn

>  drivers/net/ethernet/socionext/netsec.c | 444 
> +++-
>  1 file changed, 329 insertions(+), 115 deletions(-)
>
> --
> 2.7.4
>


[PATCH v2 4/4] i40e: disallow changing the number of descriptors when AF_XDP is on

2018-09-07 Thread Björn Töpel
From: Björn Töpel 

When an AF_XDP UMEM is attached to any of the Rx rings, we disallow a
user to change the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.

Signed-off-by: Björn Töpel 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  9 +++-
 .../ethernet/intel/i40e/i40e_txrx_common.h|  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 22 +++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index d7d3974beca2..3cd2c88c72f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -5,7 +5,7 @@
 
 #include "i40e.h"
 #include "i40e_diag.h"
-
+#include "i40e_txrx_common.h"
 #include "i40e_ethtool_stats.h"
 
 #define I40E_PF_STAT(_name, _stat) \
@@ -1493,6 +1493,13 @@ static int i40e_set_ringparam(struct net_device *netdev,
(new_rx_count == vsi->rx_rings[0]->count))
return 0;
 
+   /* If there is an AF_XDP UMEM attached to any of the Rx rings,
+* disallow changing the number of descriptors -- regardless of
+* whether the netdev is running or not.
+*/
+   if (i40e_xsk_any_rx_ring_enabled(vsi))
+   return -EBUSY;
+
while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
timeout--;
if (!timeout)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 8d46acff6f2e..09809dffe399 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -89,5 +89,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 
 void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e4b62e871afc..119f59ec7cc0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -944,3 +944,25 @@ void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
if (xsk_frames)
xsk_umem_complete_tx(umem, xsk_frames);
 }
+
+/**
+ * i40e_xsk_any_rx_ring_enabled - Checks whether any of the Rx rings
+ * has an AF_XDP UMEM attached
+ * @vsi: vsi
+ *
+ * Returns true if any of the Rx rings has an AF_XDP UMEM attached
+ **/
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi)
+{
+   int i;
+
+   if (!vsi->xsk_umems)
+   return false;
+
+   for (i = 0; i < vsi->num_queue_pairs; i++) {
+   if (vsi->xsk_umems[i])
+   return true;
+   }
+
+   return false;
+}
-- 
2.17.1



[PATCH v2 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-07 Thread Björn Töpel
From: Björn Töpel 

NB! The v1 was sent via the bpf-next tree. This time the series is
routed via JeffK's Intel Wired tree to minimize the risk for i40e
merge conflicts.

This series addresses an AF_XDP zero-copy issue where buffers passed
from userspace to the kernel were leaked when the hardware descriptor
ring was torn down.

The patches fix the i40e AF_XDP zero-copy implementation.

Thanks to Jakub Kicinski for pointing this out!

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Similarly on the Tx side: the driver picks buffers off the Tx
ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers must not be leaked. They have to be reused or completed back to
userspace.

The implementation does the following:

* Outstanding Tx descriptors will be passed to the completion
  ring. The Tx code has a back-pressure mechanism in place, so that
  enough empty space in the completion ring is guaranteed.

* Outstanding Rx descriptors are temporarily stored on a stash/reuse
  queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
  come up again, entries from the stash are used to re-populate the
  ring.

* When AF_XDP ZC is enabled, disallow changing the number of hardware
  descriptors via ethtool. Otherwise, the size of the stash/reuse
  queue can grow unbounded.

Going forward, introducing a "zero-copy allocator" analogous to Jesper
Brouer's page pool would be a more robust and reusable solution.
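
For readers piecing the flow together from the individual patches, a
condensed sketch of the driver-side lifecycle follows. It is only a
sketch on top of the reuse-queue API from patch 2/4 (the xsk_reuseq
prepare/swap/free calls plus the _rq fill helpers); the my_ functions
are hypothetical and error handling is elided.

#include <net/xdp_sock.h>

/* Ring/UMEM setup: size the stash after the Rx ring and install it. */
static int my_umem_enable(struct xdp_umem *umem, u32 rx_ring_size)
{
	struct xdp_umem_fq_reuse *reuseq;

	reuseq = xsk_reuseq_prepare(rx_ring_size);
	if (!reuseq)
		return -ENOMEM;

	/* Install the new stash; free the previous one, if any. */
	xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
	return 0;
}

/* Ring teardown: park outstanding Rx handles instead of leaking them. */
static void my_stash_outstanding(struct xdp_umem *umem,
				 const u64 *handles, u32 n)
{
	u32 i;

	for (i = 0; i < n; i++)
		xsk_umem_fq_reuse(umem, handles[i]);
}

/* Ring (re)setup: the _rq helpers drain the stash before falling
 * back to the fill ring proper.
 */
static bool my_alloc_one(struct xdp_umem *umem, u64 *addr)
{
	if (!xsk_umem_peek_addr_rq(umem, addr))
		return false;

	xsk_umem_discard_addr_rq(umem);
	return true;
}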

v1->v2: Address kbuild "undefined symbols" error when building with
!CONFIG_XDP_SOCKETS.

Thanks!
Björn


Björn Töpel (3):
  i40e: clean zero-copy XDP Tx ring on shutdown/reset
  i40e: clean zero-copy XDP Rx ring on shutdown/reset
  i40e: disallow changing the number of descriptors when AF_XDP is on

Jakub Kicinski (1):
  net: xsk: add a simple buffer reuse queue

 .../net/ethernet/intel/i40e/i40e_ethtool.c|   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  21 ++-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   4 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 152 +-
 include/net/xdp_sock.h|  69 
 net/xdp/xdp_umem.c|   2 +
 net/xdp/xsk_queue.c   |  55 +++
 net/xdp/xsk_queue.h   |   3 +
 8 files changed, 299 insertions(+), 16 deletions(-)

-- 
2.17.1



[PATCH v2 3/4] i40e: clean zero-copy XDP Rx ring on shutdown/reset

2018-09-07 Thread Björn Töpel
From: Björn Töpel 

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings come up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast and a slow
allocation path. The "fast allocation" is used from the fast path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   4 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 100 --
 3 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7f85d4ba8b54..740ea58ba938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1355,8 +1355,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
rx_ring->skb = NULL;
}
 
-   if (rx_ring->xsk_umem)
+   if (rx_ring->xsk_umem) {
+   i40e_xsk_clean_rx_ring(rx_ring);
goto skip_free;
+   }
 
/* Free all the Rx ring sk_buffs */
for (i = 0; i < rx_ring->count; i++) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 29c68b29d36f..8d46acff6f2e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,6 +87,7 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 99116277c4d2..e4b62e871afc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -140,6 +140,7 @@ static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, 
struct xdp_umem *umem)
 static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
u16 qid)
 {
+   struct xdp_umem_fq_reuse *reuseq;
bool if_running;
int err;
 
@@ -156,6 +157,12 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, 
struct xdp_umem *umem,
return -EBUSY;
}
 
+   reuseq = xsk_reuseq_prepare(vsi->rx_rings[0]->count);
+   if (!reuseq)
+   return -ENOMEM;
+
+   xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
err = i40e_xsk_umem_dma_map(vsi, umem);
if (err)
return err;
@@ -353,16 +360,46 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring 
*rx_ring,
 }
 
 /**
- * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * i40e_alloc_buffer_slow_zc - Allocates an i40e_rx_buffer
  * @rx_ring: Rx ring
- * @count: The number of buffers to allocate
+ * @bi: Rx buffer to populate
  *
- * This function allocates a number of Rx buffers and places them on
- * the Rx ring.
+ * This function allocates an Rx buffer. The buffer can come from the fill
+ * queue, or via the reuse queue.
  *
  * Returns true for a successful allocation, false otherwise
  **/
-bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *bi)
+{
+   struct xdp_umem *umem = rx_ring->xsk_umem;
+   u64 handle, hr;
+
+   if (!xsk_umem_peek_addr_rq(umem, &handle)) {
+   rx_ring->rx_stats.alloc_page_failed++;
+   return false;
+   }
+
+   handle &= rx_ring->xsk_umem->chunk_mask;
+
+   hr = umem->headroom + XDP_PACKET_HEADROOM;
+
+   bi->dma = xdp_umem_get_dma(umem, handle);
+   bi->dma += hr;
+
+   bi->addr = xdp_umem_get_data(umem, handle);
+   bi->addr += hr;
+
+   bi->handle = handle + umem->headroom;
+
+   xsk_umem_discard_addr_rq(umem);
+   return true;
+}
+
+static __always_inline bool __i40e_alloc_rx_buffers_zc(
+   struct i40e_ring *rx_ring, u16 count,
+   bool alloc(struct i40e_ring *rx_ring,
+  struct i40e_rx_buffer *bi))
 {
u16 ntu = rx_ring->next_to_use;
union i40e_rx_desc *rx_desc;
@@ -372,7 +409,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, 
u16 count)
rx_desc = I40E_RX_DESC(rx_ring, ntu);
bi = &rx_ring->rx_bi[ntu];
do {
-   if (!i40e_alloc_buffer_zc(rx_ring, bi)) {
+   if (!alloc(rx_ring, bi)) {
ok =

[PATCH v2 2/4] net: xsk: add a simple buffer reuse queue

2018-09-07 Thread Björn Töpel
From: Jakub Kicinski 

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

v2: Fixed build issues for !CONFIG_XDP_SOCKETS.

Signed-off-by: Jakub Kicinski 
---
 include/net/xdp_sock.h | 69 ++
 net/xdp/xdp_umem.c |  2 ++
 net/xdp/xsk_queue.c| 55 +
 net/xdp/xsk_queue.h|  3 ++
 4 files changed, 129 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 932ca0dad6f3..70a115bea4f4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -21,6 +21,12 @@ struct xdp_umem_page {
dma_addr_t dma;
 };
 
+struct xdp_umem_fq_reuse {
+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
 struct xdp_umem {
struct xsk_queue *fq;
struct xsk_queue *cq;
@@ -37,6 +43,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -75,6 +82,10 @@ void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
 bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
@@ -85,6 +96,35 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
 {
return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
 }
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   return xsk_umem_peek_addr(umem, addr);
+
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -128,6 +168,21 @@ static inline void xsk_umem_consume_tx_done(struct 
xdp_umem *umem)
 {
 }
 
+static inline struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   return NULL;
+}
+
+static inline struct xdp_umem_fq_reuse *xsk_reuseq_swap(
+   struct xdp_umem *umem,
+   struct xdp_umem_fq_reuse *newq)
+{
+   return NULL;
+}
+static inline void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+}
+
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
return NULL;
@@ -137,6 +192,20 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
 {
return 0;
 }
+
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   return NULL;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b3b632c5aeae..555427b3e0fe 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -165,6 +165,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
 
+   xsk_reuseq_destroy(umem);
+
xdp_umem_unpin_pages(umem);
 
task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 2dc1384d9f27..b66504592d9b 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/overflow.h>
 
 #include "xsk_queue.h"
 
@@ -62,3 +64,56 @@ void xskq_destroy(struct xsk_queue *q)
page_frag_free(q->ring);
kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+   

[PATCH v2 1/4] i40e: clean zero-copy XDP Tx ring on shutdown/reset

2018-09-07 Thread Björn Töpel
From: Björn Töpel 

When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.

The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 17 +++
 .../ethernet/intel/i40e/i40e_txrx_common.h|  2 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 30 +++
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 37bd4e50ccde..7f85d4ba8b54 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -636,13 +636,18 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
unsigned long bi_size;
u16 i;
 
-   /* ring already cleared, nothing to do */
-   if (!tx_ring->tx_bi)
-   return;
+   if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) {
+   i40e_xsk_clean_tx_ring(tx_ring);
+   } else {
+   /* ring already cleared, nothing to do */
+   if (!tx_ring->tx_bi)
+   return;
 
-   /* Free all the Tx ring sk_buffs */
-   for (i = 0; i < tx_ring->count; i++)
-   i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+   /* Free all the Tx ring sk_buffs */
+   for (i = 0; i < tx_ring->count; i++)
+   i40e_unmap_and_free_tx_resource(tx_ring,
+   &tx_ring->tx_bi[i]);
+   }
 
bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
memset(tx_ring->tx_bi, 0, bi_size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index b5afd479a9c5..29c68b29d36f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,4 +87,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2ebfc78bbd09..99116277c4d2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -830,3 +830,33 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 
queue_id)
 
return 0;
 }
+
+/**
+ * i40e_xsk_clean_tx_ring - Clean the XDP Tx ring on shutdown
+ * @tx_ring: XDP Tx ring
+ **/
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
+{
+   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
+   struct xdp_umem *umem = tx_ring->xsk_umem;
+   struct i40e_tx_buffer *tx_bi;
+   u32 xsk_frames = 0;
+
+   while (ntc != ntu) {
+   tx_bi = &tx_ring->tx_bi[ntc];
+
+   if (tx_bi->xdpf)
+   i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
+   else
+   xsk_frames++;
+
+   tx_bi->xdpf = NULL;
+
+   ntc++;
+   if (ntc >= tx_ring->count)
+   ntc = 0;
+   }
+
+   if (xsk_frames)
+   xsk_umem_complete_tx(umem, xsk_frames);
+}
-- 
2.17.1



Re: [PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-05 Thread Björn Töpel
On Wed, 5 Sep 2018 at 19:14, Jakub Kicinski wrote:
>
> On Tue,  4 Sep 2018 20:11:01 +0200, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > This series addresses an AF_XDP zero-copy issue where buffers passed
> > from userspace to the kernel were leaked when the hardware descriptor
> > ring was torn down.
> >
> > The patches fix the i40e AF_XDP zero-copy implementation.
> >
> > Thanks to Jakub Kicinski for pointing this out!
> >
> > Some background for folks that don't know the details: A zero-copy
> > capable driver picks buffers off the fill ring and places them on the
> > hardware Rx ring to be completed at a later point when DMA is
> > complete. Similarly on the Tx side: the driver picks buffers off the Tx
> > ring and places them on the Tx hardware ring.
> >
> > In the typical flow, the Rx buffer will be placed onto an Rx ring
> > (completed to the user), and the Tx buffer will be placed on the
> > completion ring to notify the user that the transfer is done.
> >
> > However, if the driver needs to tear down the hardware rings for some
> > reason (interface goes down, reconfiguration and such), the userspace
> > buffers must not be leaked. They have to be reused or completed back to
> > userspace.
> >
> > The implementation does the following:
> >
> > * Outstanding Tx descriptors will be passed to the completion
> >   ring. The Tx code has a back-pressure mechanism in place, so that
> >   enough empty space in the completion ring is guaranteed.
> >
> > * Outstanding Rx descriptors are temporarily stored on a stash/reuse
> >   queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
> >   come up again, entries from the stash are used to re-populate the
> >   ring.
> >
> > * When AF_XDP ZC is enabled, disallow changing the number of hardware
> >   descriptors via ethtool. Otherwise, the size of the stash/reuse
> >   queue can grow unbounded.
> >
> > Going forward, introducing a "zero-copy allocator" analogous to Jesper
> > Brouer's page pool would be a more robust and reusable solution.
> >
> > Jakub: I've made a minor checkpatch-fix to your RFC, prior to adding it
> > into this series.
>
> Thanks for the fix! :)
>
> Out of curiosity, did checking the reuse queue have a noticeable impact
> in your test (i.e. always using the _rq() helpers)?  You seem to be
> adding an indirect call, would that not be way worse on a retpoline
> kernel?

Do you mean the indirection in __i40e_alloc_rx_buffers_zc (patch #3)?
The indirect call is elided by the __always_inline -- without that
retpoline cost 2.5 Mpps worth of Rx. :-(

I'm only using the _rq helpers in the configuration/slow path, so the
fast-path is unchanged.
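
For context, the trick being referred to looks roughly like the
generic sketch below (not the i40e code): the callback is a parameter
of an __always_inline helper, so the compiler resolves the callee at
every call site and no indirect call survives for a retpoline thunk
to slow down.

#include <linux/compiler.h>
#include <linux/types.h>

static bool alloc_fast(int *slot)
{
	*slot = 1;	/* stand-in for the fast-path allocation */
	return true;
}

static bool alloc_slow(int *slot)
{
	*slot = 2;	/* stand-in for the slow-path allocation */
	return true;
}

static __always_inline bool __fill_ring(int *ring, int count,
					bool alloc(int *slot))
{
	int i;

	for (i = 0; i < count; i++) {
		if (!alloc(&ring[i]))	/* direct call after inlining */
			return false;
	}
	return true;
}

/* Each wrapper instantiates __fill_ring with a constant callback. */
bool my_fill_ring_fast(int *ring, int count)
{
	return __fill_ring(ring, count, alloc_fast);
}

bool my_fill_ring_slow(int *ring, int count)
{
	return __fill_ring(ring, count, alloc_slow);
}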


Björn


[PATCH bpf-next 4/4] i40e: disallow changing the number of descriptors when AF_XDP is on

2018-09-04 Thread Björn Töpel
From: Björn Töpel 

When an AF_XDP UMEM is attached to any of the Rx rings, we disallow a
user to change the number of descriptors via e.g. "ethtool -G IFNAME".

Otherwise, the size of the stash/reuse queue can grow unbounded, which
would result in OOM or leaking userspace buffers.

Signed-off-by: Björn Töpel 
---
 .../net/ethernet/intel/i40e/i40e_ethtool.c|  9 +++-
 .../ethernet/intel/i40e/i40e_txrx_common.h|  1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 22 +++
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c 
b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
index d7d3974beca2..3cd2c88c72f8 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_ethtool.c
@@ -5,7 +5,7 @@
 
 #include "i40e.h"
 #include "i40e_diag.h"
-
+#include "i40e_txrx_common.h"
 #include "i40e_ethtool_stats.h"
 
 #define I40E_PF_STAT(_name, _stat) \
@@ -1493,6 +1493,13 @@ static int i40e_set_ringparam(struct net_device *netdev,
(new_rx_count == vsi->rx_rings[0]->count))
return 0;
 
+   /* If there is an AF_XDP UMEM attached to any of the Rx rings,
+* disallow changing the number of descriptors -- regardless of
+* whether the netdev is running or not.
+*/
+   if (i40e_xsk_any_rx_ring_enabled(vsi))
+   return -EBUSY;
+
while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
timeout--;
if (!timeout)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 8d46acff6f2e..09809dffe399 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -89,5 +89,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
 
 void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index e4b62e871afc..119f59ec7cc0 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -944,3 +944,25 @@ void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
if (xsk_frames)
xsk_umem_complete_tx(umem, xsk_frames);
 }
+
+/**
+ * i40e_xsk_any_rx_ring_enabled - Checks whether any of the Rx rings
+ * has an AF_XDP UMEM attached
+ * @vsi: vsi
+ *
+ * Returns true if any of the Rx rings has an AF_XDP UMEM attached
+ **/
+bool i40e_xsk_any_rx_ring_enabled(struct i40e_vsi *vsi)
+{
+   int i;
+
+   if (!vsi->xsk_umems)
+   return false;
+
+   for (i = 0; i < vsi->num_queue_pairs; i++) {
+   if (vsi->xsk_umems[i])
+   return true;
+   }
+
+   return false;
+}
-- 
2.17.1



[PATCH bpf-next 3/4] i40e: clean zero-copy XDP Rx ring on shutdown/reset

2018-09-04 Thread Björn Töpel
From: Björn Töpel 

Outstanding Rx descriptors are temporarily stored on a stash/reuse
queue. When/if the HW rings come up again, entries from the stash are
used to re-populate the ring.

The latter required some restructuring of the allocation scheme for
the AF_XDP zero-copy implementation. There is now a fast and a slow
allocation path. The "fast allocation" is used from the fast path and
obtains free buffers from the fill ring and the internal recycle
mechanism. The "slow allocation" is only used in ring setup, and
obtains buffers from the fill ring and the stash (if any).

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   4 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   1 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 100 --
 3 files changed, 96 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 7f85d4ba8b54..740ea58ba938 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1355,8 +1355,10 @@ void i40e_clean_rx_ring(struct i40e_ring *rx_ring)
rx_ring->skb = NULL;
}
 
-   if (rx_ring->xsk_umem)
+   if (rx_ring->xsk_umem) {
+   i40e_xsk_clean_rx_ring(rx_ring);
goto skip_free;
+   }
 
/* Free all the Rx ring sk_buffs */
for (i = 0; i < rx_ring->count; i++) {
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 29c68b29d36f..8d46acff6f2e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,6 +87,7 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_rx_ring(struct i40e_ring *rx_ring);
 void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
 
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 99116277c4d2..e4b62e871afc 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -140,6 +140,7 @@ static void i40e_xsk_umem_dma_unmap(struct i40e_vsi *vsi, 
struct xdp_umem *umem)
 static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, struct xdp_umem *umem,
u16 qid)
 {
+   struct xdp_umem_fq_reuse *reuseq;
bool if_running;
int err;
 
@@ -156,6 +157,12 @@ static int i40e_xsk_umem_enable(struct i40e_vsi *vsi, 
struct xdp_umem *umem,
return -EBUSY;
}
 
+   reuseq = xsk_reuseq_prepare(vsi->rx_rings[0]->count);
+   if (!reuseq)
+   return -ENOMEM;
+
+   xsk_reuseq_free(xsk_reuseq_swap(umem, reuseq));
+
err = i40e_xsk_umem_dma_map(vsi, umem);
if (err)
return err;
@@ -353,16 +360,46 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring 
*rx_ring,
 }
 
 /**
- * i40e_alloc_rx_buffers_zc - Allocates a number of Rx buffers
+ * i40e_alloc_buffer_slow_zc - Allocates an i40e_rx_buffer
  * @rx_ring: Rx ring
- * @count: The number of buffers to allocate
+ * @bi: Rx buffer to populate
  *
- * This function allocates a number of Rx buffers and places them on
- * the Rx ring.
+ * This function allocates an Rx buffer. The buffer can come from the fill
+ * queue, or via the reuse queue.
  *
  * Returns true for a successful allocation, false otherwise
  **/
-bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, u16 count)
+static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
+ struct i40e_rx_buffer *bi)
+{
+   struct xdp_umem *umem = rx_ring->xsk_umem;
+   u64 handle, hr;
+
+   if (!xsk_umem_peek_addr_rq(umem, &handle)) {
+   rx_ring->rx_stats.alloc_page_failed++;
+   return false;
+   }
+
+   handle &= rx_ring->xsk_umem->chunk_mask;
+
+   hr = umem->headroom + XDP_PACKET_HEADROOM;
+
+   bi->dma = xdp_umem_get_dma(umem, handle);
+   bi->dma += hr;
+
+   bi->addr = xdp_umem_get_data(umem, handle);
+   bi->addr += hr;
+
+   bi->handle = handle + umem->headroom;
+
+   xsk_umem_discard_addr_rq(umem);
+   return true;
+}
+
+static __always_inline bool __i40e_alloc_rx_buffers_zc(
+   struct i40e_ring *rx_ring, u16 count,
+   bool alloc(struct i40e_ring *rx_ring,
+  struct i40e_rx_buffer *bi))
 {
u16 ntu = rx_ring->next_to_use;
union i40e_rx_desc *rx_desc;
@@ -372,7 +409,7 @@ bool i40e_alloc_rx_buffers_zc(struct i40e_ring *rx_ring, 
u16 count)
rx_desc = I40E_RX_DESC(rx_ring, ntu);
bi = &rx_ring->rx_bi[ntu];
do {
-   if (!i40e_alloc_buffer_zc(rx_ring, bi)) {
+   if (!alloc(rx_ring, bi)) {
ok =

[PATCH bpf-next 1/4] i40e: clean zero-copy XDP Tx ring on shutdown/reset

2018-09-04 Thread Björn Töpel
From: Björn Töpel 

When the zero-copy enabled XDP Tx ring is torn down, due to
configuration changes, outstanding frames on the hardware descriptor
ring are queued on the completion ring.

The completion ring has a back-pressure mechanism that will guarantee
that there is sufficient space on the ring.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 17 +++
 .../ethernet/intel/i40e/i40e_txrx_common.h|  2 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 30 +++
 3 files changed, 43 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c 
b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 37bd4e50ccde..7f85d4ba8b54 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -636,13 +636,18 @@ void i40e_clean_tx_ring(struct i40e_ring *tx_ring)
unsigned long bi_size;
u16 i;
 
-   /* ring already cleared, nothing to do */
-   if (!tx_ring->tx_bi)
-   return;
+   if (ring_is_xdp(tx_ring) && tx_ring->xsk_umem) {
+   i40e_xsk_clean_tx_ring(tx_ring);
+   } else {
+   /* ring already cleared, nothing to do */
+   if (!tx_ring->tx_bi)
+   return;
 
-   /* Free all the Tx ring sk_buffs */
-   for (i = 0; i < tx_ring->count; i++)
-   i40e_unmap_and_free_tx_resource(tx_ring, &tx_ring->tx_bi[i]);
+   /* Free all the Tx ring sk_buffs */
+   for (i = 0; i < tx_ring->count; i++)
+   i40e_unmap_and_free_tx_resource(tx_ring,
+   &tx_ring->tx_bi[i]);
+   }
 
bi_size = sizeof(struct i40e_tx_buffer) * tx_ring->count;
memset(tx_ring->tx_bi, 0, bi_size);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h 
b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index b5afd479a9c5..29c68b29d36f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -87,4 +87,6 @@ static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
}
 }
 
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring);
+
 #endif /* I40E_TXRX_COMMON_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c 
b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 2ebfc78bbd09..99116277c4d2 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -830,3 +830,33 @@ int i40e_xsk_async_xmit(struct net_device *dev, u32 
queue_id)
 
return 0;
 }
+
+/**
+ * i40e_xsk_clean_tx_ring - Clean the XDP Tx ring on shutdown
+ * @tx_ring: XDP Tx ring
+ **/
+void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
+{
+   u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
+   struct xdp_umem *umem = tx_ring->xsk_umem;
+   struct i40e_tx_buffer *tx_bi;
+   u32 xsk_frames = 0;
+
+   while (ntc != ntu) {
+   tx_bi = &tx_ring->tx_bi[ntc];
+
+   if (tx_bi->xdpf)
+   i40e_clean_xdp_tx_buffer(tx_ring, tx_bi);
+   else
+   xsk_frames++;
+
+   tx_bi->xdpf = NULL;
+
+   ntc++;
+   if (ntc >= tx_ring->count)
+   ntc = 0;
+   }
+
+   if (xsk_frames)
+   xsk_umem_complete_tx(umem, xsk_frames);
+}
-- 
2.17.1



[PATCH bpf-next 2/4] net: xsk: add a simple buffer reuse queue

2018-09-04 Thread Björn Töpel
From: Jakub Kicinski 

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when a driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.

Signed-off-by: Jakub Kicinski 
---
 include/net/xdp_sock.h | 43 +
 net/xdp/xdp_umem.c |  2 ++
 net/xdp/xsk_queue.c| 55 ++
 net/xdp/xsk_queue.h|  3 +++
 4 files changed, 103 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 932ca0dad6f3..7b55206da138 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
 #include <net/sock.h>
 
 struct net_device;
+struct xdp_umem_fq_reuse;
 struct xsk_queue;
 
 struct xdp_umem_page {
@@ -37,6 +38,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -139,4 +141,45 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
 }
 #endif /* CONFIG_XDP_SOCKETS */
 
+struct xdp_umem_fq_reuse {
+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   return xsk_umem_peek_addr(umem, addr);
+
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
+
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b3b632c5aeae..555427b3e0fe 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -165,6 +165,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
 
+   xsk_reuseq_destroy(umem);
+
xdp_umem_unpin_pages(umem);
 
task = get_pid_task(umem->pid, PIDTYPE_PID);
diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 2dc1384d9f27..b66504592d9b 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
  * Copyright(c) 2018 Intel Corporation.
  */
 
+#include <linux/log2.h>
 #include <linux/slab.h>
+#include <linux/overflow.h>
 
 #include "xsk_queue.h"
 
@@ -62,3 +64,56 @@ void xskq_destroy(struct xsk_queue *q)
page_frag_free(q->ring);
kfree(q);
 }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+   if (!newq)
+   return NULL;
+   memset(newq, 0, offsetof(typeof(*newq), handles));
+
+   newq->nentries = nentries;
+   return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq)
+{
+   struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+   if (!oldq) {
+   umem->fq_reuse = newq;
+   return NULL;
+   }
+
+   if (newq->nentries < oldq->length)
+   return newq;
+
+   memcpy(newq->handles, oldq->handles,
+  array_size(oldq->length, sizeof(u64)));
+   newq->length = oldq->length;
+
+   umem->fq_reuse = newq;
+   return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+   kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+   xsk_reuseq_free(umem->fq_reuse);
+   umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 82252cccb4e0..bcb5cbb40419 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -258,4 +258,7 @@ void xskq_set_umem(struct xsk_queue *q, u64 size, u64 
chunk_mask);
 struct xsk_queue *xskq_create(u32 nentries, bool umem_queue);
 void xskq_destroy(struct xsk_queue *q_ops);
 
+/* Executed by the core when the 

[PATCH bpf-next 0/4] i40e AF_XDP zero-copy buffer leak fixes

2018-09-04 Thread Björn Töpel
From: Björn Töpel 

This series addresses an AF_XDP zero-copy issue where buffers passed
from userspace to the kernel were leaked when the hardware descriptor
ring was torn down.

The patches fix the i40e AF_XDP zero-copy implementation.

Thanks to Jakub Kicinski for pointing this out!

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Similarly on the Tx side: the driver picks buffers off the Tx
ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), the userspace
buffers must not be leaked. They have to be reused or completed back to
userspace.

The implementation does the following:

* Outstanding Tx descriptors will be passed to the completion
  ring. The Tx code has a back-pressure mechanism in place, so that
  enough empty space in the completion ring is guaranteed.

* Outstanding Rx descriptors are temporarily stored on a stash/reuse
  queue. The reuse queue is based on Jakub's RFC. When/if the HW rings
  come up again, entries from the stash are used to re-populate the
  ring.

* When AF_XDP ZC is enabled, disallow changing the number of hardware
  descriptors via ethtool. Otherwise, the size of the stash/reuse
  queue can grow unbounded.

Going forward, introducing a "zero-copy allocator" analogous to Jesper
Brouer's page pool would be a more robust and reusable solution.

Jakub: I've made a minor checkpatch-fix to your RFC, prior to adding it
into this series.


Thanks!
Björn

Björn Töpel (3):
  i40e: clean zero-copy XDP Tx ring on shutdown/reset
  i40e: clean zero-copy XDP Rx ring on shutdown/reset
  i40e: disallow changing the number of descriptors when AF_XDP is on

Jakub Kicinski (1):
  net: xsk: add a simple buffer reuse queue

 .../net/ethernet/intel/i40e/i40e_ethtool.c|   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |  21 ++-
 .../ethernet/intel/i40e/i40e_txrx_common.h|   4 +
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 152 +-
 include/net/xdp_sock.h|  43 +
 net/xdp/xdp_umem.c|   2 +
 net/xdp/xsk_queue.c   |  55 +++
 net/xdp/xsk_queue.h   |   3 +
 8 files changed, 273 insertions(+), 16 deletions(-)

-- 
2.17.1



Re: [RFC] net: xsk: add a simple buffer reuse queue

2018-08-31 Thread Björn Töpel

On 2018-08-29 21:19, Jakub Kicinski wrote:

XSK UMEM is strongly single producer single consumer so reuse of
frames is challenging.  Add a simple "stash" of FILL packets to
reuse for drivers to optionally make use of.  This is useful
when a driver has to free (ndo_stop) or resize a ring with an active
AF_XDP ZC socket.



I'll take a stab using this in i40e. I have a couple of
comments/thoughts on this RFC, but let me get back when I have an actual
patch in place. :-)


Thanks!
Björn



Signed-off-by: Jakub Kicinski 
---
  include/net/xdp_sock.h | 44 +
  net/xdp/xdp_umem.c |  2 ++
  net/xdp/xsk_queue.c| 56 ++
  net/xdp/xsk_queue.h|  3 +++
  4 files changed, 105 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 6871e4755975..108c1c100de4 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -14,6 +14,7 @@
  #include <net/sock.h>
  
  struct net_device;

+struct xdp_umem_fq_reuse;
  struct xsk_queue;
  
  struct xdp_umem_props {

@@ -41,6 +42,7 @@ struct xdp_umem {
struct page **pgs;
u32 npgs;
struct net_device *dev;
+   struct xdp_umem_fq_reuse *fq_reuse;
u16 queue_id;
bool zc;
spinlock_t xsk_list_lock;
@@ -110,4 +112,46 @@ static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem 
*umem, u64 addr)
return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
  }
  
+struct xdp_umem_fq_reuse {

+   u32 nentries;
+   u32 length;
+   u64 handles[];
+};
+
+/* Following functions are not thread-safe in any way */
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries);
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq);
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq);
+
+/* Reuse-queue aware version of FILL queue helpers */
+static inline u64 *xsk_umem_peek_addr_rq(struct xdp_umem *umem, u64 *addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length) {
+   return xsk_umem_peek_addr(umem, addr);
+   } else {
+   *addr = rq->handles[rq->length - 1];
+   return addr;
+   }
+}
+
+static inline void xsk_umem_discard_addr_rq(struct xdp_umem *umem)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   if (!rq->length)
+   xsk_umem_discard_addr(umem);
+   else
+   rq->length--;
+}
+
+static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
+{
+   struct xdp_umem_fq_reuse *rq = umem->fq_reuse;
+
+   rq->handles[rq->length++] = addr;
+}
+
  #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index e762310c9bee..40303e24c954 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -170,6 +170,8 @@ static void xdp_umem_release(struct xdp_umem *umem)
umem->cq = NULL;
}
  
+	xsk_reuseq_destroy(umem);

+
xdp_umem_unpin_pages(umem);
  
  	task = get_pid_task(umem->pid, PIDTYPE_PID);

diff --git a/net/xdp/xsk_queue.c b/net/xdp/xsk_queue.c
index 6c32e92e98fc..f9ee40a13a9a 100644
--- a/net/xdp/xsk_queue.c
+++ b/net/xdp/xsk_queue.c
@@ -3,7 +3,9 @@
   * Copyright(c) 2018 Intel Corporation.
   */
  
+#include <linux/log2.h>

  #include <linux/slab.h>
+#include <linux/overflow.h>
  
  #include "xsk_queue.h"
  
@@ -61,3 +63,57 @@ void xskq_destroy(struct xsk_queue *q)

page_frag_free(q->ring);
kfree(q);
  }
+
+struct xdp_umem_fq_reuse *xsk_reuseq_prepare(u32 nentries)
+{
+   struct xdp_umem_fq_reuse *newq;
+
+   /* Check for overflow */
+   if (nentries > (u32)roundup_pow_of_two(nentries))
+   return NULL;
+   nentries = roundup_pow_of_two(nentries);
+
+   newq = kvmalloc(struct_size(newq, handles, nentries), GFP_KERNEL);
+   if (!newq)
+   return NULL;
+   memset(newq, 0, offsetof(typeof(*newq), handles));
+
+   newq->nentries = nentries;
+   return newq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_prepare);
+
+struct xdp_umem_fq_reuse *xsk_reuseq_swap(struct xdp_umem *umem,
+ struct xdp_umem_fq_reuse *newq)
+{
+   struct xdp_umem_fq_reuse *oldq = umem->fq_reuse;
+
+   if (!oldq) {
+   umem->fq_reuse = newq;
+   return NULL;
+   }
+
+   if (newq->nentries < oldq->length)
+   return newq;
+
+
+   memcpy(newq->handles, oldq->handles,
+  array_size(oldq->length, sizeof(u64)));
+   newq->length = oldq->length;
+
+   umem->fq_reuse = newq;
+   return oldq;
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_swap);
+
+void xsk_reuseq_free(struct xdp_umem_fq_reuse *rq)
+{
+   kvfree(rq);
+}
+EXPORT_SYMBOL_GPL(xsk_reuseq_free);
+
+void xsk_reuseq_destroy(struct xdp_umem *umem)
+{
+   xsk_reuseq_free(umem->fq_reuse);
+   umem->fq_reuse = NULL;
+}
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 

Re: [PATCH bpf-next] samples/bpf: xdpsock, minor fixes

2018-08-31 Thread Björn Töpel
On Fri, 31 Aug 2018 at 03:04, Prashant Bhole wrote:
>
> - xsks_map size was fixed to 4, changed it to MAX_SOCKS
> - Remove redundant definition of MAX_SOCKS in xdpsock_user.c
> - In dump_stats(), add NULL check for xsks[i]
>

Thanks for the cleanup!

Acked-by: Björn Töpel 

> Signed-off-by: Prashant Bhole 
> ---
>  samples/bpf/xdpsock_kern.c | 2 +-
>  samples/bpf/xdpsock_user.c | 3 +--
>  2 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/samples/bpf/xdpsock_kern.c b/samples/bpf/xdpsock_kern.c
> index d8806c41362e..b8ccd0802b3f 100644
> --- a/samples/bpf/xdpsock_kern.c
> +++ b/samples/bpf/xdpsock_kern.c
> @@ -16,7 +16,7 @@ struct bpf_map_def SEC("maps") xsks_map = {
> .type = BPF_MAP_TYPE_XSKMAP,
> .key_size = sizeof(int),
> .value_size = sizeof(int),
> -   .max_entries = 4,
> +   .max_entries = MAX_SOCKS,
>  };
>
>  struct bpf_map_def SEC("maps") rr_map = {
> diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
> index b3906111bedb..57ecadc58403 100644
> --- a/samples/bpf/xdpsock_user.c
> +++ b/samples/bpf/xdpsock_user.c
> @@ -118,7 +118,6 @@ struct xdpsock {
> unsigned long prev_tx_npkts;
>  };
>
> -#define MAX_SOCKS 4
>  static int num_socks;
>  struct xdpsock *xsks[MAX_SOCKS];
>
> @@ -596,7 +595,7 @@ static void dump_stats(void)
>
> prev_time = now;
>
> -   for (i = 0; i < num_socks; i++) {
> +   for (i = 0; i < num_socks && xsks[i]; i++) {
> char *fmt = "%-15s %'-11.0f %'-11lu\n";
> double rx_pps, tx_pps;
>
> --
> 2.17.1
>
>


Re: [PATCH bpf-next] xsk: remove unnecessary assignment

2018-08-31 Thread Björn Töpel
On Fri, 31 Aug 2018 at 03:02, Prashant Bhole wrote:
>
> When xdp_umem_query() was added, one assignment of bpf.command was
> missed in the cleanup. Remove the assignment statement.
>

Good catch!

Acked-by: Björn Töpel 

> Fixes: 84c6b86875e01a0 ("xsk: don't allow umem replace at stack level")
> Signed-off-by: Prashant Bhole 
> ---
>  net/xdp/xdp_umem.c | 2 --
>  1 file changed, 2 deletions(-)
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index bfe2dbea480b..d179732617dc 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -76,8 +76,6 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct 
> net_device *dev,
> if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit)
> return force_zc ? -EOPNOTSUPP : 0; /* fail or fallback */
>
> -   bpf.command = XDP_QUERY_XSK_UMEM;
> -
> rtnl_lock();
> err = xdp_umem_query(dev, queue_id);
> if (err) {
> --
> 2.17.1
>
>


[PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
From: Björn Töpel 

Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.
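
From the application's point of view, the metadata then sits
immediately in front of the buffer address returned in the Rx
descriptor. A sketch of a consumer, assuming the application knows
how many metadata bytes its XDP program writes (the names and struct
layout below are illustrative, not part of any ABI):

#include <stdint.h>

struct my_meta {
	uint32_t rx_hash;	/* whatever the XDP program stored */
};

static void my_handle_rx_desc(void *umem_area, uint64_t addr,
			      uint32_t len, uint32_t metalen)
{
	uint8_t *data = (uint8_t *)umem_area + addr;
	struct my_meta *meta = (struct my_meta *)(data - metalen);

	/* consume meta->rx_hash, then process data[0..len) */
	(void)meta;
	(void)len;
}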

Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..569048e299df 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-   void *buffer;
+   void *to_buf, *from_buf;
+   u32 metalen;
u64 addr;
int err;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (unlikely(xdp_data_meta_unsupported(xdp))) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -111,6 +121,7 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+   u32 metalen = xdp->data - xdp->data_meta;
u32 len = xdp->data_end - xdp->data;
void *buffer;
u64 addr;
@@ -120,7 +131,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return -EINVAL;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
@@ -128,7 +139,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
addr += xs->umem->headroom;
 
buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   memcpy(buffer, xdp->data_meta, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
-- 
2.17.1
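
As an aside, a hypothetical user-space consumer could pick up the
prepended meta data like this; METALEN is application-defined (the
application's own XDP program wrote it via bpf_xdp_adjust_meta()) and
is not carried in the descriptor:

  #include <string.h>
  #include <linux/if_xdp.h>

  #define METALEN 4 /* assumed: app's XDP program wrote 4 bytes of meta */

  static void process_desc(void *umem_area, const struct xdp_desc *desc)
  {
          unsigned char *data = (unsigned char *)umem_area + desc->addr;
          unsigned int meta;

          /* the meta data sits immediately before the packet data */
          memcpy(&meta, data - METALEN, METALEN);
          /* ... parse the frame at data[0 .. desc->len) using meta ... */
  }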



Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
Den tors 30 aug. 2018 kl 14:37 skrev Daniel Borkmann :
>
> On 08/30/2018 10:09 AM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> > not include XDP meta data in the data buffers copied out to the user
> > application.
> >
> > In this commit, we check if meta data is available, and if so, it is
> > prepended to the frame.
> >
> > Signed-off-by: Björn Töpel 
> > ---
> >  net/xdp/xsk.c | 37 -
> >  1 file changed, 28 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4e937cd7c17d..817e4cee1540 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
> >
> >  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >  {
> > - void *buffer;
> > + void *to_buf, *from_buf;
> > + u32 metalen;
> >   u64 addr;
> >   int err;
> >
> >   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > - len > xs->umem->chunk_size_nohr) {
> > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >   xs->rx_dropped++;
> >   return -ENOSPC;
> >   }
> >
> >   addr += xs->umem->headroom;
> >
> > - buffer = xdp_umem_get_data(xs->umem, addr);
> > - memcpy(buffer, xdp->data, len);
> > + if (xdp_data_meta_unsupported(xdp)) {
>
> Nit: Probably makes sense to wrap with unlikely() since we expect all
> drivers to implement this.
>

I'll send a v2 fixing this...

> > + from_buf = xdp->data;
> > + metalen = 0;
> > + } else {
> > + from_buf = xdp->data_meta;
> > + metalen = xdp->data - xdp->data_meta;
> > + }
> > +
> > + to_buf = xdp_umem_get_data(xs->umem, addr);
> > + memcpy(to_buf, from_buf, len + metalen);
> > + addr += metalen;
> >   err = xskq_produce_batch_desc(xs->rx, addr, len);
> >   if (!err) {
> >   xskq_discard_addr(xs->umem->fq);
> > @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
> >
> >  int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >  {
> > - u32 len = xdp->data_end - xdp->data;
> > - void *buffer;
> > + u32 metalen, len = xdp->data_end - xdp->data;
> > + void *to_buf, *from_buf;
> >   u64 addr;
> >   int err;
> >
> > @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >   return -EINVAL;
> >
> >   if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > - len > xs->umem->chunk_size_nohr) {
> > + len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >   xs->rx_dropped++;
> >   return -ENOSPC;
> >   }
> >
> >   addr += xs->umem->headroom;
> >
> > - buffer = xdp_umem_get_data(xs->umem, addr);
> > - memcpy(buffer, xdp->data, len);
> > + if (xdp_data_meta_unsupported(xdp)) {
> > + from_buf = xdp->data;
> > + metalen = 0;
>
> Note that this condition should be dead code. netif_receive_generic_xdp()
> sets xdp->data_meta to xdp->data, so all good here; the above is never hit.
>

...and this! Thanks for pointing this out!


Björn

> > + } else {
> > + from_buf = xdp->data_meta;
> > + metalen = xdp->data - xdp->data_meta;
> > + }
> > +
> > + to_buf = xdp_umem_get_data(xs->umem, addr);
> > + memcpy(to_buf, from_buf, len + metalen);
> > + addr += metalen;
> >   err = xskq_produce_batch_desc(xs->rx, addr, len);
> >   if (!err) {
> >   xskq_discard_addr(xs->umem->fq);
> >
>


Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-30 Thread Björn Töpel

On 2018-08-29 21:14, Jakub Kicinski wrote:
> On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
>> From: Björn Töpel 
>>
>> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
>> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
>> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
>> queue.
>>
>> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
>>
>> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
>> will allocate a new buffer and copy the zero-copy frame prior passing
>> it to the kernel stack.
>>
>> Signed-off-by: Björn Töpel 
>
> Mm.. I'm surprised you don't run into buffer reuse issues that I had
> when playing with AF_XDP.  What happens in i40e if someone downs the
> interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
>

The UMEM will linger in the driver until the sockets are dead.

> I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
> FWIW.
>

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Analogous for the Tx side; The driver picks buffers off the
Tx ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), what should be
done with the Rx and Tx buffers that have been given to the driver?

So, to frame the problem: What should a driver do when this happens,
so that buffers aren't leaked?

Note that when the UMEM is going down, there's no need to complete
anything, since the sockets are dying/dead already.

This is, as you state, a missing piece in the implementation and needs
to be fixed.

Now on to possible solutions:

1. Complete the buffers back to the user. For Tx, this is probably the
   best way -- just place the buffers onto the completion ring.

   For Rx, we can give buffers back to user space by setting the
   length in the Rx descriptor to zero and putting them on the Rx
   ring. However, one complication here is that we do not have any
   back-pressure mechanism for the Rx side like we have on Tx. If the
   Rx ring(s) is (are) full the kernel will have to leak them or
   implement a retry mechanism (ugly and should be avoided).

   Another option to solve this without needing any retry or leaking
   for Rx is to implement the same back-pressure mechanism that we
   have on the Tx path in the Rx path. In the Tx path, the driver will
   only get a Tx packet to send if there is space for it in the
   completion ring. On Rx, the driver would only take a buffer from the
   fill ring if there is space to put it on the Rx ring. The drawback
   is that this would likely impact performance negatively, since the
   Rx ring would have to be touched one more time (on Tx, the same
   mechanism increased performance, since it made it possible to
   implement the Tx path without any buffering), but it would
   guarantee that all buffers can always be returned to user space,
   making this solution a viable option.

2. Store the buffers internally in the driver, and make sure that they
   are inserted into the "normal flow" again. For Rx that would be
   putting the buffers back into the allocation scheme that the driver
   is using. For Tx, placing the buffers back onto the Tx HW ring
   (plus all the logic for making sure that all corner cases work).

3. Mark the socket(s) as being in an error state, and require the user
   to redo the setup. This is a bit harsh...

For i40e I think #2 for Rx (buffers reside in kernel, return to
allocator) and #1 for Tx (complete to userland).
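
As a concrete illustration of #1 on the Tx side, a minimal sketch of a
teardown helper (an assumed name, not code from this series) that
completes all in-flight buffers back to the completion ring:

  static void i40e_xsk_clean_tx_ring(struct i40e_ring *tx_ring)
  {
          u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
          u32 xsk_frames = 0;

          /* buffers between clean and use are still held by the driver */
          while (ntc != ntu) {
                  xsk_frames++;
                  if (++ntc == tx_ring->count)
                          ntc = 0;
          }

          if (xsk_frames)
                  xsk_umem_complete_tx(tx_ring->xsk_umem, xsk_frames);
  }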

Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
of extending the umem with the "reuse queue". This decision is really
up the driver. Some driver might prefer another scheme, or simply
prefer storing the buffers in another manner.

Looking forward, as both you and Jesper have alluded to, we need a
proper allocator for zero-copy. Then it would be a matter of injecting
the Rx buffers back to the allocator.

I'll send out a patch, so that we don't leak the buffers, but I'd like
to hear your thoughts on what the behavior should be.

And what should the behavior be when the netdev is removed from the
kernel?

And thanks for looking into the code!


Björn


Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock

2018-08-30 Thread Björn Töpel
Den ons 29 aug. 2018 kl 14:44 skrev Jesper Dangaard Brouer :
>
> On Tue, 28 Aug 2018 14:44:35 +0200
> Björn Töpel  wrote:
>
> > From: Björn Töpel 
> >
> > The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
> > mode.
>
> Nice, thanks for adding this.  It allows me to quickly test the
> difference between normal-copy vs zero-copy modes.
> (Kernel bpf-next without RETPOLINE).
>
> AF_XDP RX-drop:
>  Normal-copy mode: rx 13,070,318 pps - 76.5 ns
>  Zero-copy   mode: rx 26,132,328 pps - 38.3 ns
>
> Compare to XDP_DROP:  34,251,464 pps - 29.2 ns
>XDP_DROP + read :  30,756,664 pps - 32.5 ns
>
> The normal-copy mode is surprisingly fast (and it works for every
> driver implementing the regular XDP_REDIRECT action).  It is still
> faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping,
> which was expected given frames travel to a remote CPU before being returned
> (don't think remote CPU reads payload?).  The gap in nanosec is
> actually quite small, thus I'm impressed by the SPSC-queue
> implementation working across these CPUs.
>
>
> AF_XDP layer2-fwd:
>  Normal-copy mode: rx  3,200,885   tx  3,200,892
>  Zero-copy   mode: rx 17,026,300   tx 17,026,269
>
> Compare to XDP_TX: rx 14,529,079   tx 14,529,850  - 68.82 ns
>  XDP_REDIRECT: rx 13,235,785   tx 13,235,784  - 75.55 ns
>
> The copy-mode is slow because it allocates SKBs internally (I do
> wonder if we could speed it up by using ndo_xdp_xmit + disable-BH).
> More interesting is that the zero-copy is faster than XDP_TX and
> XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping
> calls with ZC.
>
> Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns.  The cost of
> going through the xdp_do_redirect_map core is actually quite small :-)
> (I have some micro optimizations that should help ~2ns).
>
>
> AF_XDP TX-only:
>  Normal-copy mode: tx  2,853,461 pps
>  Zero-copy   mode: tx 22,255,311 pps
>
> (There is not XDP mode that does TX to compare against)
>

Kudos for doing the in-depth benchmarking!


Thanks!
Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-30 Thread Björn Töpel
Den ons 29 aug. 2018 kl 18:12 skrev Daniel Borkmann :
>
> On 08/28/2018 02:44 PM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > This patch set introduces zero-copy AF_XDP support for Intel's i40e
> > driver. In the first preparatory patch we also add support for
> > XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> > redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> > set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> > Brouer for reviewing earlier versions of this patch set.
> >
> > The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> > that in the interest of time, to get an AF_XDP zero-copy implementation
> > out there for people to try, some code paths have been copied from the
> > XDP path to the zero-copy path. It is our goal to merge the two paths
> > in later patch sets.
> >
> > In contrast to the implementation from beginning of June, this patch
> > set does not require any extra HW queues for AF_XDP zero-copy
> > TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> > AF_XDP zero-copy TX.
> >
> > Jeff, given that most of changes are in i40e, it is up to you how you
> > would like to route these patches. The set is tagged bpf-next, but
> > if taking it via the Intel driver tree is easier, let us know.
> >
> > We have run some benchmarks on a dual socket system with two Broadwell
> > E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> > cores which gives a total of 28, but only two cores are used in these
> > experiments. One for Tx/Rx and one for the user space application. The
> > memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> > 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> > memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> > NIC is Intel I40E 40Gbit/s using the i40e driver.
> >
> > Below are the results in Mpps of the I40E NIC benchmark runs for 64
> > and 1500 byte packets, generated by a commercial packet generator HW
> > outputting packets at full 40 Gbit/s line rate. The results are with
> > retpoline and all other spectre and meltdown fixes, so these results
> > are not comparable to the ones from the zero-copy patch set in June.
> >
> > AF_XDP performance 64 byte packets.
> > Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> > rxdrop      2.6       8.2       15.0
> > txpush      2.2       -         21.9
> > l2fwd       1.7       2.3       11.3
> >
> > AF_XDP performance 1500 byte packets:
> > Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> > rxdrop      2.0       3.3       3.3
> > l2fwd       1.3       1.7       3.1
> >
> > XDP performance on our system as a base line:
> >
> > 64 byte packets:
> > XDP stats   CPU pps issue-pps
> > XDP-RX CPU  16  18.4M  0
> >
> > 1500 byte packets:
> > XDP stats   CPU pps issue-pps
> > XDP-RX CPU  16  3.3M0
> >
> > The structure of the patch set is as follows:
> >
> > Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> > Patches 2-4: Preparatory patches to common xsk and net code
> > Patches 5-7: Preparatory patches to i40e driver code for RX
> > Patch 8: i40e zero-copy support for RX
> > Patch 9: Preparatory patch to i40e driver code for TX
> > Patch 10: i40e zero-copy support for TX
> > Patch 11: Add flags to sample application to force zero-copy/copy mode
> >
> > We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> > git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
> >
> >
> > Magnus & Björn
> >
> > Björn Töpel (8):
> >   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
> >   xdp: export xdp_rxq_info_unreg_mem_model
> >   xsk: expose xdp_umem_get_{data,dma} to drivers
> >   i40e: added queue pair disable/enable functions
> >   i40e: refactor Rx path for re-use
> >   i40e: move common Rx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Rx support
> >   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
> >
> > Magnus Karlsson (3):
> >   net: add napi_if_scheduled_mark_missed
> >   i40e: move common Tx functions to i40e_txrx_common.h
> >   i40e: add AF_XDP zero-copy Tx support
>
> Thanks for working on this, LGTM! Are you also planning to get ixgbe
> out after that?
>

Yes, the plan is to get ixgbe out as the next driver, but we'll focus
on the existing i40e issues first.


Thanks,
Björn

> For the series:
>
> Acked-by: Daniel Borkmann 
>
> Thanks,
> Daniel


[PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames

2018-08-30 Thread Björn Töpel
From: Björn Töpel 

Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.

Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 37 -
 1 file changed, 28 insertions(+), 9 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..817e4cee1540 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-   void *buffer;
+   void *to_buf, *from_buf;
+   u32 metalen;
u64 addr;
int err;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (xdp_data_meta_unsupported(xdp)) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
@@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
-   u32 len = xdp->data_end - xdp->data;
-   void *buffer;
+   u32 metalen, len = xdp->data_end - xdp->data;
+   void *to_buf, *from_buf;
u64 addr;
int err;
 
@@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
return -EINVAL;
 
if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-   len > xs->umem->chunk_size_nohr) {
+   len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
xs->rx_dropped++;
return -ENOSPC;
}
 
addr += xs->umem->headroom;
 
-   buffer = xdp_umem_get_data(xs->umem, addr);
-   memcpy(buffer, xdp->data, len);
+   if (xdp_data_meta_unsupported(xdp)) {
+   from_buf = xdp->data;
+   metalen = 0;
+   } else {
+   from_buf = xdp->data_meta;
+   metalen = xdp->data - xdp->data_meta;
+   }
+
+   to_buf = xdp_umem_get_data(xs->umem, addr);
+   memcpy(to_buf, from_buf, len + metalen);
+   addr += metalen;
err = xskq_produce_batch_desc(xs->rx, addr, len);
if (!err) {
xskq_discard_addr(xs->umem->fq);
-- 
2.17.1



Re: [PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY

2018-08-28 Thread Björn Töpel
Den tis 28 aug. 2018 kl 16:11 skrev Jesper Dangaard Brouer :
>
> On Tue, 28 Aug 2018 14:44:25 +0200
> Björn Töpel  wrote:
>
> > From: Björn Töpel 
> >
> > This commit adds proper MEM_TYPE_ZERO_COPY support for
> > convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
> > xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
> > MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
> > make sense to implement a more sophisticated thread-safe alloc/free
> > scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
> > required in the fast-path.
>
> This is going to be slow. Especially the dev_alloc_page() call, which
> for small frames is likely going to be slower than the data copy.
> I guess this is a good first step, but I do hope we will circle back and
> optimize this later.  (It would also be quite easy to use
> MEM_TYPE_PAGE_POOL instead to get page recycling in devmap redirect case).
>

Yes, slow. :-( Still, I think this is a good starting point, and then
introduce a page pool in a later performance-oriented series to make XDP
faster for the AF_XDP scenario.

But I'm definitely on your side here; this needs to be addressed -- but
not now IMO.


And thanks for spending time on the series!
Björn

> I would have liked the MEM_TYPE_ZERO_COPY frame to travel one level
> deeper into the redirect-core code.  Allowing devmap to send these
> frame without copy, and allow cpumap to do the dev_alloc_page() call
> (+copy) on the remote CPU.
>
>
> > Signed-off-by: Björn Töpel 
> > ---
> >  include/net/xdp.h |  5 +++--
> >  net/core/xdp.c| 39 +++
> >  2 files changed, 42 insertions(+), 2 deletions(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 76b95256c266..0d5c6fb4b2e2 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
> >   frame->dev_rx = NULL;
> >  }
> >
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
> > +
> >  /* Convert xdp_buff to xdp_frame */
> >  static inline
> >  struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> > @@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
> >   int metasize;
> >   int headroom;
> >
> > - /* TODO: implement clone, copy, use "native" MEM_TYPE */
> >   if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
> > - return NULL;
> > + return xdp_convert_zc_to_xdp_frame(xdp);
> >
> >   /* Assure headroom is available for storing info */
> >   headroom = xdp->data - xdp->data_hard_start;
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 89b6785cef2a..be6cb2f0e722 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >   info->flags = bpf->flags;
> >  }
> >  EXPORT_SYMBOL_GPL(xdp_attachment_setup);
> > +
> > +struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
> > +{
> > + unsigned int metasize, headroom, totsize;
> > + void *addr, *data_to_copy;
> > + struct xdp_frame *xdpf;
> > + struct page *page;
> > +
> > + /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
> > + metasize = xdp_data_meta_unsupported(xdp) ? 0 :
> > +xdp->data - xdp->data_meta;
> > + headroom = xdp->data - xdp->data_hard_start;
> > + totsize = xdp->data_end - xdp->data + metasize;
> > +
> > + if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> > + return NULL;
> > +
> > + page = dev_alloc_page();
> > + if (!page)
> > + return NULL;
> > +
> > + addr = page_to_virt(page);
> > + xdpf = addr;
> > + memset(xdpf, 0, sizeof(*xdpf));
> > +
> > + addr += sizeof(*xdpf);
> > + data_to_copy = metasize ? xdp->data_meta : xdp->data;
> > + memcpy(addr, data_to_copy, totsize);
> > +
> > + xdpf->data = addr + metasize;
> > + xdpf->len = totsize - metasize;
> > + xdpf->headroom = 0;
> > + xdpf->metasize = metasize;
> > + xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
> > +
> > + xdp_return_buff(xdp);
> > + return xdpf;
> > +}
> > +EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [Intel-wired-lan] [PATCH] i40e: report correct statistics when XDP is enabled

2018-08-28 Thread Björn Töpel

On 2018-08-28 19:00, Paul Menzel wrote:

Dear Björn,


On 08/24/18 16:00, Jesper Dangaard Brouer wrote:

On Fri, 24 Aug 2018 13:21:59 +0200
Björn Töpel  wrote:


When XDP is enabled, the driver will report incorrect
statistics. Received frames will be reported as transmitted frames.

This commits fixes the i40e implementation of ndo_get_stats64 (struct


Should you send a v2, then please use singular for *commit*:

This commit ….


net_device_ops), so that iproute2 will report correct statistics
(e.g. when running "ip -stats link show dev eth0") even when XDP is
enabled.


In the future, it’d be great if you could describe your fix in the
commit message too. For example, why the if statement needs to move up.



Thanks for the review, Paul. I'll address your comments, if we'll end up
with V2.


Björn


Reported-by: Jesper Dangaard Brouer 
Fixes: 74608d17fe29 ("i40e: add support for XDP_TX action")


Stable candidate:
  $ git describe --contains 74608d17fe29
  v4.13-rc1~157^2~128^2~13


Signed-off-by: Björn Töpel 


It works for me:

Tested-by: Jesper Dangaard Brouer 

I'm explicitly _not_ ACK'ing the patch, as I think your code changes
below make it harder to follow whether a TX or RX ring is getting
updated. But it is 100% up to the driver maintainers to say if this is
acceptable from a maintenance PoV.


---
  drivers/net/ethernet/intel/i40e/i40e_main.c | 24 +++--
  1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index e40c023cc7b6..7c122dd3faa1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -425,9 +425,9 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
 				  struct rtnl_link_stats64 *stats)
 {
 	struct i40e_netdev_priv *np = netdev_priv(netdev);
-	struct i40e_ring *tx_ring, *rx_ring;
 	struct i40e_vsi *vsi = np->vsi;
 	struct rtnl_link_stats64 *vsi_stats = i40e_get_vsi_stats_struct(vsi);
+	struct i40e_ring *ring;
 	int i;
 
 	if (test_bit(__I40E_VSI_DOWN, vsi->state))
@@ -441,24 +441,26 @@ static void i40e_get_netdev_stats_struct(struct net_device *netdev,
 		u64 bytes, packets;
 		unsigned int start;
 
-		tx_ring = READ_ONCE(vsi->tx_rings[i]);
-		if (!tx_ring)
+		ring = READ_ONCE(vsi->tx_rings[i]);
+		if (!ring)
 			continue;
-		i40e_get_netdev_stats_struct_tx(tx_ring, stats);
+		i40e_get_netdev_stats_struct_tx(ring, stats);
 
-		rx_ring = &tx_ring[1];
+		if (i40e_enabled_xdp_vsi(vsi)) {
+			ring++;
+			i40e_get_netdev_stats_struct_tx(ring, stats);
+		}
 
+		ring++;
 		do {
-			start = u64_stats_fetch_begin_irq(&rx_ring->syncp);
-			packets = rx_ring->stats.packets;
-			bytes   = rx_ring->stats.bytes;
-		} while (u64_stats_fetch_retry_irq(&rx_ring->syncp, start));
+			start   = u64_stats_fetch_begin_irq(&ring->syncp);
+			packets = ring->stats.packets;
+			bytes   = ring->stats.bytes;
+		} while (u64_stats_fetch_retry_irq(&ring->syncp, start));
 
 		stats->rx_packets += packets;
 		stats->rx_bytes   += bytes;
 
-		if (i40e_enabled_xdp_vsi(vsi))
-			i40e_get_netdev_stats_struct_tx(&tx_ring[1], stats);
 	}
 	rcu_read_unlock();
  



Kind regards,

Paul



Re: [PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-28 Thread Björn Töpel
Den tis 28 aug. 2018 kl 14:47 skrev Björn Töpel :
>
> From: Björn Töpel 
>
> This patch set introduces zero-copy AF_XDP support for Intel's i40e
> driver. In the first preparatory patch we also add support for
> XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
> redirect them. This was a ToDo from the first AF_XDP zero-copy patch
> set from early June. Special thanks to Alex Duyck and Jesper Dangaard
> Brouer for reviewing earlier versions of this patch set.
>
> The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
> that in the interest of time, to get an AF_XDP zero-copy implementation
> out there for people to try, some code paths have been copied from the
> XDP path to the zero-copy path. It is our goal to merge the two paths
> in later patch sets.
>
> In contrast to the implementation from beginning of June, this patch
> set does not require any extra HW queues for AF_XDP zero-copy
> TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
> AF_XDP zero-copy TX.
>
> Jeff, given that most of changes are in i40e, it is up to you how you
> would like to route these patches. The set is tagged bpf-next, but
> if taking it via the Intel driver tree is easier, let us know.
>
> We have run some benchmarks on a dual socket system with two Broadwell
> E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
> cores which gives a total of 28, but only two cores are used in these
> experiments. One for Tx/Rx and one for the user space application. The
> memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
> 8192MB and with 8 of those DIMMs in the system we have 64 GB of total
> memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
> NIC is Intel I40E 40Gbit/s using the i40e driver.
>
> Below are the results in Mpps of the I40E NIC benchmark runs for 64
> and 1500 byte packets, generated by a commercial packet generator HW
> outputting packets at full 40 Gbit/s line rate. The results are with
> retpoline and all other spectre and meltdown fixes, so these results
> are not comparable to the ones from the zero-copy patch set in June.
>
> AF_XDP performance 64 byte packets.
> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> rxdrop      2.6       8.2       15.0
> txpush      2.2       -         21.9
> l2fwd       1.7       2.3       11.3
>
> AF_XDP performance 1500 byte packets:
> Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
> rxdrop      2.0       3.3       3.3
> l2fwd       1.3       1.7       3.1
>
> XDP performance on our system as a base line:
>
> 64 byte packets:
> XDP stats   CPU    pps     issue-pps
> XDP-RX CPU  16     18.4M   0
>
> 1500 byte packets:
> XDP stats   CPU    pps     issue-pps
> XDP-RX CPU  16     3.3M    0
>
> The structure of the patch set is as follows:
>
> Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
> Patches 2-4: Preparatory patches to common xsk and net code
> Patches 5-7: Preparatory patches to i40e driver code for RX
> Patch 8: i40e zero-copy support for RX
> Patch 9: Preparatory patch to i40e driver code for TX
> Patch 10: i40e zero-copy support for TX
> Patch 11: Add flags to sample application to force zero-copy/copy mode
>
> We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
> git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")
>
>
> Magnus & Björn
>
> Björn Töpel (8):
>   xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
>   xdp: export xdp_rxq_info_unreg_mem_model
>   xsk: expose xdp_umem_get_{data,dma} to drivers
>   i40e: added queue pair disable/enable functions
>   i40e: refactor Rx path for re-use
>   i40e: move common Rx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Rx support
>   samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
>
> Magnus Karlsson (3):
>   net: add napi_if_scheduled_mark_missed
>   i40e: move common Tx functions to i40e_txrx_common.h
>   i40e: add AF_XDP zero-copy Tx support
>
>  drivers/net/ethernet/intel/i40e/Makefile  |   3 +-
>  drivers/net/ethernet/intel/i40e/i40e.h|  19 +
>  drivers/net/ethernet/intel/i40e/i40e_main.c   | 307 ++-
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 182 ++--
>  drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  20 +-
>  .../ethernet/intel/i40e/i40e_txrx_common.h|  90 ++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c| 834 ++
>  drivers/net/ethernet/intel/i40e/i40e_xsk.h|  25 +
>  include/linux/netdevice.h |  26 +
>  include/net/xdp.h |   6 +-
>  include/net/xdp_sock.h  

[PATCH bpf-next 01/11] xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

This commit adds proper MEM_TYPE_ZERO_COPY support for
convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
make sense to implement a more sophisticated thread-safe alloc/free
scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
required in the fast-path.

Signed-off-by: Björn Töpel 
---
 include/net/xdp.h |  5 +++--
 net/core/xdp.c| 39 +++
 2 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 76b95256c266..0d5c6fb4b2e2 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -91,6 +91,8 @@ static inline void xdp_scrub_frame(struct xdp_frame *frame)
frame->dev_rx = NULL;
 }
 
+struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp);
+
 /* Convert xdp_buff to xdp_frame */
 static inline
 struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
@@ -99,9 +101,8 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
int metasize;
int headroom;
 
-   /* TODO: implement clone, copy, use "native" MEM_TYPE */
if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
-   return NULL;
+   return xdp_convert_zc_to_xdp_frame(xdp);
 
/* Assure headroom is available for storing info */
headroom = xdp->data - xdp->data_hard_start;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 89b6785cef2a..be6cb2f0e722 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -398,3 +398,42 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
info->flags = bpf->flags;
 }
 EXPORT_SYMBOL_GPL(xdp_attachment_setup);
+
+struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
+{
+   unsigned int metasize, headroom, totsize;
+   void *addr, *data_to_copy;
+   struct xdp_frame *xdpf;
+   struct page *page;
+
+   /* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
+   metasize = xdp_data_meta_unsupported(xdp) ? 0 :
+  xdp->data - xdp->data_meta;
+   headroom = xdp->data - xdp->data_hard_start;
+   totsize = xdp->data_end - xdp->data + metasize;
+
+   if (sizeof(*xdpf) + totsize > PAGE_SIZE)
+   return NULL;
+
+   page = dev_alloc_page();
+   if (!page)
+   return NULL;
+
+   addr = page_to_virt(page);
+   xdpf = addr;
+   memset(xdpf, 0, sizeof(*xdpf));
+
+   addr += sizeof(*xdpf);
+   data_to_copy = metasize ? xdp->data_meta : xdp->data;
+   memcpy(addr, data_to_copy, totsize);
+
+   xdpf->data = addr + metasize;
+   xdpf->len = totsize - metasize;
+   xdpf->headroom = 0;
+   xdpf->metasize = metasize;
+   xdpf->mem.type = MEM_TYPE_PAGE_ORDER0;
+
+   xdp_return_buff(xdp);
+   return xdpf;
+}
+EXPORT_SYMBOL_GPL(xdp_convert_zc_to_xdp_frame);
-- 
2.17.1
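
For reference, the resulting order-0 page is laid out as below, with
xdpf->data pointing just past the copied-in meta data:

  +------------------+-----------+-------------+
  | struct xdp_frame | meta data | packet data |
  +------------------+-----------+-------------+
  ^ page_to_virt(page)           ^ xdpf->data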



[PATCH bpf-next 02/11] xdp: export xdp_rxq_info_unreg_mem_model

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
so it can be used from netdev drivers. Also, add additional checks for
the memory type.

Signed-off-by: Björn Töpel 
---
 include/net/xdp.h |  1 +
 net/core/xdp.c| 15 +--
 2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 0d5c6fb4b2e2..0f25b3675c5c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -136,6 +136,7 @@ void xdp_rxq_info_unused(struct xdp_rxq_info *xdp_rxq);
 bool xdp_rxq_info_is_reg(struct xdp_rxq_info *xdp_rxq);
 int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
   enum xdp_mem_type type, void *allocator);
+void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq);
 
 /* Drivers not supporting XDP metadata can use this helper, which
  * rejects any room expansion for metadata as a result.
diff --git a/net/core/xdp.c b/net/core/xdp.c
index be6cb2f0e722..654dbb19707e 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -94,11 +94,21 @@ static void __xdp_mem_allocator_rcu_free(struct rcu_head *rcu)
kfree(xa);
 }
 
-static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
+void xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
 {
struct xdp_mem_allocator *xa;
int id = xdp_rxq->mem.id;
 
+   if (xdp_rxq->reg_state != REG_STATE_REGISTERED) {
+   WARN(1, "Missing register, driver bug");
+   return;
+   }
+
+   if (xdp_rxq->mem.type != MEM_TYPE_PAGE_POOL &&
+   xdp_rxq->mem.type != MEM_TYPE_ZERO_COPY) {
+   return;
+   }
+
if (id == 0)
return;
 
@@ -110,6 +120,7 @@ static void __xdp_rxq_info_unreg_mem_model(struct xdp_rxq_info *xdp_rxq)
 
	mutex_unlock(&mem_id_lock);
 }
+EXPORT_SYMBOL_GPL(xdp_rxq_info_unreg_mem_model);
 
 void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 {
@@ -119,7 +130,7 @@ void xdp_rxq_info_unreg(struct xdp_rxq_info *xdp_rxq)
 
WARN(!(xdp_rxq->reg_state == REG_STATE_REGISTERED), "Driver BUG");
 
-   __xdp_rxq_info_unreg_mem_model(xdp_rxq);
+   xdp_rxq_info_unreg_mem_model(xdp_rxq);
 
xdp_rxq->reg_state = REG_STATE_UNREGISTERED;
xdp_rxq->dev = NULL;
-- 
2.17.1
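
A sketch of the intended driver-side use, under the assumption that a
queue is being switched away from zero-copy (example_disable_zc() is
an assumed name, not code from this series):

  static void example_disable_zc(struct i40e_ring *rx_ring)
  {
          /* drop only the memory-model registration ... */
          xdp_rxq_info_unreg_mem_model(&rx_ring->xdp_rxq);
          /* ... and re-register with the default page_shared model */
          xdp_rxq_info_reg_mem_model(&rx_ring->xdp_rxq,
                                     MEM_TYPE_PAGE_SHARED, NULL);
  }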



[PATCH bpf-next 07/11] i40e: move common Rx functions to i40e_txrx_common.h

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

This patch prepares for the upcoming zero-copy Rx functionality, by
moving/changing linkage of common functions, used both by the regular
path and zero-copy path.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 33 ---
 .../ethernet/intel/i40e/i40e_txrx_common.h| 31 +
 2 files changed, 44 insertions(+), 20 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b5a2cfeb68a5..878fb4b47484 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -8,6 +8,7 @@
 #include "i40e.h"
 #include "i40e_trace.h"
 #include "i40e_prototype.h"
+#include "i40e_txrx_common.h"
 
 static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
u32 td_tag)
@@ -536,8 +537,8 @@ int i40e_add_del_fdir(struct i40e_vsi *vsi,
  * This is used to verify if the FD programming or invalidation
  * requested by SW to the HW is successful or not and take actions accordingly.
  **/
-static void i40e_fd_handle_status(struct i40e_ring *rx_ring,
- union i40e_rx_desc *rx_desc, u8 prog_id)
+void i40e_fd_handle_status(struct i40e_ring *rx_ring,
+  union i40e_rx_desc *rx_desc, u8 prog_id)
 {
struct i40e_pf *pf = rx_ring->vsi->back;
struct pci_dev *pdev = pf->pdev;
@@ -1282,7 +1283,7 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
  *
  * Returns an i40e_rx_buffer to reuse if the cleanup occurred, otherwise NULL.
  **/
-static struct i40e_rx_buffer *i40e_clean_programming_status(
+struct i40e_rx_buffer *i40e_clean_programming_status(
struct i40e_ring *rx_ring,
union i40e_rx_desc *rx_desc,
u64 qw)
@@ -1499,7 +1500,7 @@ int i40e_setup_rx_descriptors(struct i40e_ring *rx_ring)
  * @rx_ring: ring to bump
  * @val: new head index
  **/
-static inline void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
+void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val)
 {
rx_ring->next_to_use = val;
 
@@ -1583,8 +1584,8 @@ static bool i40e_alloc_mapped_page(struct i40e_ring *rx_ring,
  * @skb: packet to send up
  * @vlan_tag: vlan tag for packet
  **/
-static void i40e_receive_skb(struct i40e_ring *rx_ring,
-struct sk_buff *skb, u16 vlan_tag)
+void i40e_receive_skb(struct i40e_ring *rx_ring,
+ struct sk_buff *skb, u16 vlan_tag)
 {
struct i40e_q_vector *q_vector = rx_ring->q_vector;
 
@@ -1811,7 +1812,6 @@ static inline void i40e_rx_hash(struct i40e_ring *ring,
  * order to populate the hash, checksum, VLAN, protocol, and
  * other fields within the skb.
  **/
-static inline
 void i40e_process_skb_fields(struct i40e_ring *rx_ring,
 union i40e_rx_desc *rx_desc, struct sk_buff *skb,
 u8 rx_ptype)
@@ -2204,16 +2204,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
return true;
 }
 
-#define I40E_XDP_PASS  0
-#define I40E_XDP_CONSUMED  BIT(0)
-#define I40E_XDP_TXBIT(1)
-#define I40E_XDP_REDIR BIT(2)
-
 static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
  struct i40e_ring *xdp_ring);
 
-static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
-struct i40e_ring *xdp_ring)
+int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp, struct i40e_ring *xdp_ring)
 {
struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
 
@@ -2298,7 +2292,7 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
  *
  * This function updates the XDP Tx ring tail register.
  **/
-static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
+void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
/* Force memory writes to complete before letting h/w
 * know there are new descriptors to fetch.
@@ -2315,9 +2309,9 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
  *
  * This function updates the Rx ring statistics.
  **/
-static void i40e_update_rx_stats(struct i40e_ring *rx_ring,
-unsigned int total_rx_bytes,
-unsigned int total_rx_packets)
+void i40e_update_rx_stats(struct i40e_ring *rx_ring,
+ unsigned int total_rx_bytes,
+ unsigned int total_rx_packets)
 {
u64_stats_update_begin(&rx_ring->syncp);
rx_ring->stats.packets += total_rx_packets;
@@ -2336,8 +2330,7 @@ static void i40e_update_rx_stats(struct i40e_ring *rx_ring,
  * should be called when a batch of packets has been processed in the
  * napi loop.
  **/
-static void i40e_finalize_xdp_rx(struct i40e_ring *rx_ring,

[PATCH bpf-next 04/11] net: add napi_if_scheduled_mark_missed

2018-08-28 Thread Björn Töpel
From: Magnus Karlsson 

The function napi_if_scheduled_mark_missed is used to check if the
NAPI context is scheduled; if so, set NAPIF_STATE_MISSED and return
true. It is used by the AF_XDP zero-copy i40e Tx code implementation in
order to make sure that irq affinity is honored by the napi context.

Signed-off-by: Magnus Karlsson 
---
 include/linux/netdevice.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..4271f6b4e419 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -535,6 +535,32 @@ static inline void napi_synchronize(const struct 
napi_struct *n)
barrier();
 }
 
+/**
+ * napi_if_scheduled_mark_missed - if napi is running, set the
+ * NAPIF_STATE_MISSED
+ * @n: NAPI context
+ *
+ * If napi is running, set the NAPIF_STATE_MISSED, and return true if
+ * NAPI is scheduled.
+ **/
+static inline bool napi_if_scheduled_mark_missed(struct napi_struct *n)
+{
+   unsigned long val, new;
+
+   do {
+   val = READ_ONCE(n->state);
+   if (val & NAPIF_STATE_DISABLE)
+   return true;
+
+   if (!(val & NAPIF_STATE_SCHED))
+   return false;
+
+   new = val | NAPIF_STATE_MISSED;
+   } while (cmpxchg(&n->state, val, new) != val);
+
+   return true;
+}
+
 enum netdev_queue_state_t {
__QUEUE_STATE_DRV_XOFF,
__QUEUE_STATE_STACK_XOFF,
-- 
2.17.1
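
A sketch of the intended caller; the exact shape of
i40e_xsk_async_xmit() in patch 10 is assumed here, not quoted:

  int i40e_xsk_async_xmit(struct net_device *dev, u32 queue_id)
  {
          struct i40e_netdev_priv *np = netdev_priv(dev);
          struct i40e_vsi *vsi = np->vsi;
          struct i40e_ring *ring;

          if (test_bit(__I40E_VSI_DOWN, vsi->state))
                  return -ENETDOWN;
          if (queue_id >= vsi->num_queue_pairs)
                  return -ENXIO;

          ring = vsi->xdp_rings[queue_id];
          if (!ring->xsk_umem)
                  return -ENXIO;

          /* If NAPI is running, marking it missed makes it run again;
           * otherwise kick an interrupt so NAPI is scheduled on the
           * core that the IRQ affinity points at.
           */
          if (!napi_if_scheduled_mark_missed(&ring->q_vector->napi))
                  i40e_force_wb(vsi, ring->q_vector);

          return 0;
  }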



[PATCH bpf-next 10/11] i40e: add AF_XDP zero-copy Tx support

2018-08-28 Thread Björn Töpel
From: Magnus Karlsson 

This patch adds zero-copy Tx support for AF_XDP sockets. It implements
the ndo_xsk_async_xmit netdev ndo and performs all the Tx logic from a
NAPI context. This means pulling egress packets from the Tx ring,
placing the frames on the NIC HW descriptor ring and completing sent
frames back to the application via the completion ring.

The regular XDP Tx ring is used for AF_XDP as well. The rationale for
this is as follows: XDP_REDIRECT guarantees mutual exclusion between
different NAPI contexts based on CPU id. In other words, a netdev can
XDP_REDIRECT to another netdev with a different NAPI context, since
the operation is bound to a specific core and each core has its own
hardware ring.

As the AF_XDP Tx action is running in the same NAPI context and using
the same ring, it will also be protected from XDP_REDIRECT actions
with the exact same mechanism.

As with AF_XDP Rx, all AF_XDP Tx specific functions are added to
i40e_xsk.c.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |   4 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   6 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 173 
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |   4 +
 4 files changed, 186 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 848eea7c84db..5da7eb0fe4ae 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -3074,6 +3074,9 @@ static int i40e_configure_tx_ring(struct i40e_ring *ring)
i40e_status err = 0;
u32 qtx_ctl = 0;
 
+   if (ring_is_xdp(ring))
+   ring->xsk_umem = i40e_xsk_umem(ring);
+
/* some ATR related tx ring init */
if (vsi->back->flags & I40E_FLAG_FD_ATR_ENABLED) {
ring->atr_sample_rate = vsi->back->atr_sample_rate;
@@ -12185,6 +12188,7 @@ static const struct net_device_ops i40e_netdev_ops = {
.ndo_bridge_setlink = i40e_ndo_bridge_setlink,
.ndo_bpf= i40e_xdp,
.ndo_xdp_xmit   = i40e_xdp_xmit,
+   .ndo_xsk_async_xmit = i40e_xsk_async_xmit,
 };
 
 /**
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 11e201fcb57a..37bd4e50ccde 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2597,7 +2597,11 @@ int i40e_napi_poll(struct napi_struct *napi, int budget)
 * budget and be more aggressive about cleaning up the Tx descriptors.
 */
i40e_for_each_ring(ring, q_vector->tx) {
-   if (!i40e_clean_tx_irq(vsi, ring, budget)) {
+   bool wd = ring->xsk_umem ?
+ i40e_clean_xdp_tx_irq(vsi, ring, budget) :
+ i40e_clean_tx_irq(vsi, ring, budget);
+
+   if (!wd) {
clean_complete = false;
continue;
}
diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index bf502f2307c2..94947a826bc3 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -659,3 +659,176 @@ int i40e_clean_rx_irq_zc(struct i40e_ring *rx_ring, int budget)
return failure ? budget : (int)total_rx_packets;
 }
 
+/**
+ * i40e_xmit_zc - Performs zero-copy Tx AF_XDP
+ * @xdp_ring: XDP Tx ring
+ * @budget: NAPI budget
+ *
+ * Returns true if the work is finished.
+ **/
+static bool i40e_xmit_zc(struct i40e_ring *xdp_ring, unsigned int budget)
+{
+   unsigned int total_packets = 0;
+   struct i40e_tx_buffer *tx_bi;
+   struct i40e_tx_desc *tx_desc;
+   bool work_done = true;
+   dma_addr_t dma;
+   u32 len;
+
+   while (budget-- > 0) {
+   if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
+   xdp_ring->tx_stats.tx_busy++;
+   work_done = false;
+   break;
+   }
+
+   if (!xsk_umem_consume_tx(xdp_ring->xsk_umem, &dma, &len))
+   break;
+
+   dma_sync_single_for_device(xdp_ring->dev, dma, len,
+  DMA_BIDIRECTIONAL);
+
+   tx_bi = &xdp_ring->tx_bi[xdp_ring->next_to_use];
+   tx_bi->bytecount = len;
+
+   tx_desc = I40E_TX_DESC(xdp_ring, xdp_ring->next_to_use);
+   tx_desc->buffer_addr = cpu_to_le64(dma);
+   tx_desc->cmd_type_offset_bsz =
+   build_ctob(I40E_TX_DESC_CMD_ICRC
+  | I40E_TX_DESC_CMD_EOP,
+  0, len, 0);
+   total_packets++;
+
+   xdp_ring->next_to_use++;
+   if (xdp_ring->next_to_use == xdp_ring->count)
+   xdp_ring->next_to_use = 0;
+   }
+
+   if 
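
The completion side the commit message describes (sent frames handed
back via the completion ring) could be sketched as follows; the helper
name is assumed, and the series' actual clean-up path is not quoted
above:

  static void example_complete_xdp_tx(struct i40e_ring *tx_ring,
                                      unsigned int completed)
  {
          /* advance next_to_clean past the descriptors HW finished */
          tx_ring->next_to_clean += completed;
          if (tx_ring->next_to_clean >= tx_ring->count)
                  tx_ring->next_to_clean -= tx_ring->count;

          /* hand the same number of buffers back to user space */
          if (completed)
                  xsk_umem_complete_tx(tx_ring->xsk_umem, completed);
  }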

[PATCH bpf-next 05/11] i40e: added queue pair disable/enable functions

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

Add functions for queue pair enable/disable. Instead of resetting the
whole device, only the affected queue pair is disabled or enabled.

This plumbing is used in a later commit, when zero-copy AF_XDP support
is introduced.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_main.c | 250 
 1 file changed, 250 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index ac685ad4d877..d8b5a6af72bd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -11827,6 +11827,256 @@ static int i40e_xdp_setup(struct i40e_vsi *vsi,
return 0;
 }
 
+/**
+ * i40e_enter_busy_conf - Enters busy config state
+ * @vsi: vsi
+ *
+ * Returns 0 on success, <0 for failure.
+ **/
+static int i40e_enter_busy_conf(struct i40e_vsi *vsi)
+{
+   struct i40e_pf *pf = vsi->back;
+   int timeout = 50;
+
+   while (test_and_set_bit(__I40E_CONFIG_BUSY, pf->state)) {
+   timeout--;
+   if (!timeout)
+   return -EBUSY;
+   usleep_range(1000, 2000);
+   }
+
+   return 0;
+}
+
+/**
+ * i40e_exit_busy_conf - Exits busy config state
+ * @vsi: vsi
+ **/
+static void i40e_exit_busy_conf(struct i40e_vsi *vsi)
+{
+   struct i40e_pf *pf = vsi->back;
+
+   clear_bit(__I40E_CONFIG_BUSY, pf->state);
+}
+
+/**
+ * i40e_queue_pair_reset_stats - Resets all statistics for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_reset_stats(struct i40e_vsi *vsi, int queue_pair)
+{
+   memset(&vsi->rx_rings[queue_pair]->rx_stats, 0,
+	   sizeof(vsi->rx_rings[queue_pair]->rx_stats));
+   memset(&vsi->tx_rings[queue_pair]->stats, 0,
+	   sizeof(vsi->tx_rings[queue_pair]->stats));
+   if (i40e_enabled_xdp_vsi(vsi)) {
+	   memset(&vsi->xdp_rings[queue_pair]->stats, 0,
+		  sizeof(vsi->xdp_rings[queue_pair]->stats));
+   }
+}
+
+/**
+ * i40e_queue_pair_clean_rings - Cleans all the rings of a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ **/
+static void i40e_queue_pair_clean_rings(struct i40e_vsi *vsi, int queue_pair)
+{
+   i40e_clean_tx_ring(vsi->tx_rings[queue_pair]);
+   if (i40e_enabled_xdp_vsi(vsi))
+   i40e_clean_tx_ring(vsi->xdp_rings[queue_pair]);
+   i40e_clean_rx_ring(vsi->rx_rings[queue_pair]);
+}
+
+/**
+ * i40e_queue_pair_toggle_napi - Enables/disables NAPI for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ **/
+static void i40e_queue_pair_toggle_napi(struct i40e_vsi *vsi, int queue_pair,
+   bool enable)
+{
+   struct i40e_ring *rxr = vsi->rx_rings[queue_pair];
+   struct i40e_q_vector *q_vector = rxr->q_vector;
+
+   if (!vsi->netdev)
+   return;
+
+   /* All rings in a qp belong to the same qvector. */
+   if (q_vector->rx.ring || q_vector->tx.ring) {
+   if (enable)
+   napi_enable(&q_vector->napi);
+   else
+   napi_disable(&q_vector->napi);
+   }
+}
+
+/**
+ * i40e_queue_pair_toggle_rings - Enables/disables all rings for a queue pair
+ * @vsi: vsi
+ * @queue_pair: queue pair
+ * @enable: true for enable, false for disable
+ *
+ * Returns 0 on success, <0 on failure.
+ **/
+static int i40e_queue_pair_toggle_rings(struct i40e_vsi *vsi, int queue_pair,
+   bool enable)
+{
+   struct i40e_pf *pf = vsi->back;
+   int pf_q, ret = 0;
+
+   pf_q = vsi->base_queue + queue_pair;
+   ret = i40e_control_wait_tx_q(vsi->seid, pf, pf_q,
+false /*is xdp*/, enable);
+   if (ret) {
+   dev_info(&pf->pdev->dev,
+"VSI seid %d Tx ring %d %sable timeout\n",
+vsi->seid, pf_q, (enable ? "en" : "dis"));
+   return ret;
+   }
+
+   i40e_control_rx_q(pf, pf_q, enable);
+   ret = i40e_pf_rxq_wait(pf, pf_q, enable);
+   if (ret) {
+   dev_info(&pf->pdev->dev,
+"VSI seid %d Rx ring %d %sable timeout\n",
+vsi->seid, pf_q, (enable ? "en" : "dis"));
+   return ret;
+   }
+
+   /* Due to HW errata, on Rx disable only, the register can
+* indicate done before it really is. Needs 50ms to be sure
+*/
+   if (!enable)
+   mdelay(50);
+
+   if (!i40e_enabled_xdp_vsi(vsi))
+   return ret;
+
+   ret = i40e_control_wait_tx_q(vsi->seid, pf,
+pf_q + vsi->alloc_queue_pairs,
+true 
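
Taken together, the helpers above compose into a disable path roughly
like this (a sketch; the IRQ helper and the exact ordering are
assumptions):

  static int i40e_queue_pair_disable(struct i40e_vsi *vsi, int queue_pair)
  {
          int err;

          err = i40e_enter_busy_conf(vsi);
          if (err)
                  return err;

          i40e_queue_pair_disable_irq(vsi, queue_pair); /* assumed helper */
          err = i40e_queue_pair_toggle_rings(vsi, queue_pair, false);
          i40e_queue_pair_toggle_napi(vsi, queue_pair, false);
          i40e_queue_pair_clean_rings(vsi, queue_pair);
          i40e_queue_pair_reset_stats(vsi, queue_pair);

          return err;
  }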

[PATCH bpf-next 03/11] xsk: expose xdp_umem_get_{data,dma} to drivers

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

Move the xdp_umem_get_{data,dma} functions to include/net/xdp_sock.h,
so that the upcoming zero-copy implementation in the Ethernet drivers
can utilize them.

Also, supply some dummy function implementations for
CONFIG_XDP_SOCKETS=n configs.

Signed-off-by: Björn Töpel 
---
 include/net/xdp_sock.h | 43 ++
 net/xdp/xdp_umem.h | 10 --
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 7161856bcf9c..56994ad1ab40 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -79,6 +79,16 @@ void xsk_umem_discard_addr(struct xdp_umem *umem);
 void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries);
 bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma, u32 *len);
 void xsk_umem_consume_tx_done(struct xdp_umem *umem);
+
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+   return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+   return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -98,6 +108,39 @@ static inline bool xsk_is_setup_for_bpf_map(struct xdp_sock *xs)
 {
return false;
 }
+
+static inline u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
+{
+   return NULL;
+}
+
+static inline void xsk_umem_discard_addr(struct xdp_umem *umem)
+{
+}
+
+static inline void xsk_umem_complete_tx(struct xdp_umem *umem, u32 nb_entries)
+{
+}
+
+static inline bool xsk_umem_consume_tx(struct xdp_umem *umem, dma_addr_t *dma,
+  u32 *len)
+{
+   return false;
+}
+
+static inline void xsk_umem_consume_tx_done(struct xdp_umem *umem)
+{
+}
+
+static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
+{
+   return NULL;
+}
+
+static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
+{
+   return 0;
+}
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/net/xdp/xdp_umem.h b/net/xdp/xdp_umem.h
index f11560334f88..c8be1ad3eb88 100644
--- a/net/xdp/xdp_umem.h
+++ b/net/xdp/xdp_umem.h
@@ -8,16 +8,6 @@
 
 #include <net/xdp_sock.h>
 
-static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
-{
-   return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
-}
-
-static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
-{
-   return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
-}
-
 int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
u32 queue_id, u16 flags);
 bool xdp_umem_validate_queues(struct xdp_umem *umem);
-- 
2.17.1
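
A sketch of the intended driver-side use when refilling the HW Rx ring
from the fill queue (the surrounding allocation flow is an assumption):

  static bool example_alloc_rx_buffer(struct xdp_umem *umem,
                                      dma_addr_t *dma, char **data)
  {
          u64 handle;

          if (!xsk_umem_peek_addr(umem, &handle))
                  return false; /* fill queue is empty */

          /* translate the umem address into DMA and kernel addresses */
          handle += umem->headroom;
          *dma = xdp_umem_get_dma(umem, handle);
          *data = xdp_umem_get_data(umem, handle);
          xsk_umem_discard_addr(umem);
          return true;
  }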



[PATCH bpf-next 09/11] i40e: move common Tx functions to i40e_txrx_common.h

2018-08-28 Thread Björn Töpel
From: Magnus Karlsson 

This patch prepares for the upcoming zero-copy Tx functionality, by
moving common functions and refactor chunks of code into re-usable
functions, used both by the regular path and zero-copy path.

Signed-off-by: Magnus Karlsson 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 35 +--
 .../ethernet/intel/i40e/i40e_txrx_common.h| 59 +++
 2 files changed, 61 insertions(+), 33 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 2c4d179ffebf..11e201fcb57a 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -11,16 +11,6 @@
 #include "i40e_txrx_common.h"
 #include "i40e_xsk.h"
 
-static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
-   u32 td_tag)
-{
-   return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
-  ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
-  ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
-  ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
-  ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
-}
-
 #define I40E_TXD_CMD (I40E_TX_DESC_CMD_EOP | I40E_TX_DESC_CMD_RS)
 /**
  * i40e_fdir - Generate a Flow Director descriptor based on fdata
@@ -769,8 +759,6 @@ void i40e_detect_recover_hung(struct i40e_vsi *vsi)
}
 }
 
-#define WB_STRIDE 4
-
 /**
  * i40e_clean_tx_irq - Reclaim resources after transmit completes
  * @vsi: the VSI we care about
@@ -875,27 +863,8 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
i += tx_ring->count;
tx_ring->next_to_clean = i;
-   u64_stats_update_begin(&tx_ring->syncp);
-   tx_ring->stats.bytes += total_bytes;
-   tx_ring->stats.packets += total_packets;
-   u64_stats_update_end(&tx_ring->syncp);
-   tx_ring->q_vector->tx.total_bytes += total_bytes;
-   tx_ring->q_vector->tx.total_packets += total_packets;
-
-   if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
-   /* check to see if there are < 4 descriptors
-* waiting to be written back, then kick the hardware to force
-* them to be written back in case we stay in NAPI.
-* In this mode on X722 we do not enable Interrupt.
-*/
-   unsigned int j = i40e_get_tx_pending(tx_ring, false);
-
-   if (budget &&
-   ((j / WB_STRIDE) == 0) && (j > 0) &&
-   !test_bit(__I40E_VSI_DOWN, vsi->state) &&
-   (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
-   tx_ring->arm_wb = true;
-   }
+   i40e_update_tx_stats(tx_ring, total_packets, total_bytes);
+   i40e_arm_wb(tx_ring, vsi, budget);
 
if (ring_is_xdp(tx_ring))
return !!budget;
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
index 2bd5187fcd66..b5afd479a9c5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
@@ -28,4 +28,63 @@ void i40e_release_rx_desc(struct i40e_ring *rx_ring, u32 val);
 #define I40E_XDP_TXBIT(1)
 #define I40E_XDP_REDIR BIT(2)
 
+/**
+ * build_ctob - Builds the Tx descriptor (cmd, offset and type) qword
+ **/
+static inline __le64 build_ctob(u32 td_cmd, u32 td_offset, unsigned int size,
+   u32 td_tag)
+{
+   return cpu_to_le64(I40E_TX_DESC_DTYPE_DATA |
+  ((u64)td_cmd  << I40E_TXD_QW1_CMD_SHIFT) |
+  ((u64)td_offset << I40E_TXD_QW1_OFFSET_SHIFT) |
+  ((u64)size  << I40E_TXD_QW1_TX_BUF_SZ_SHIFT) |
+  ((u64)td_tag  << I40E_TXD_QW1_L2TAG1_SHIFT));
+}
+
+/**
+ * i40e_update_tx_stats - Update the egress statistics for the Tx ring
+ * @tx_ring: Tx ring to update
+ * @total_packets: total packets sent
+ * @total_bytes: total bytes sent
+ **/
+static inline void i40e_update_tx_stats(struct i40e_ring *tx_ring,
+   unsigned int total_packets,
+   unsigned int total_bytes)
+{
+   u64_stats_update_begin(&tx_ring->syncp);
+   tx_ring->stats.bytes += total_bytes;
+   tx_ring->stats.packets += total_packets;
+   u64_stats_update_end(&tx_ring->syncp);
+   tx_ring->q_vector->tx.total_bytes += total_bytes;
+   tx_ring->q_vector->tx.total_packets += total_packets;
+}
+
+#define WB_STRIDE 4
+
+/**
+ * i40e_arm_wb - (Possibly) arms Tx write-back
+ * @tx_ring: Tx ring to update
+ * @vsi: the VSI
+ * @budget: the NAPI budget left
+ **/
+static inline void i40e_arm_wb(struct i40e_ring *tx_ring,
+  struct i40e_vsi *vsi,
+  int budget)
+{
+   if (tx_ring->flags & I40E_TXR_FLAGS_WB_ON_ITR) {
+           /* check to see if there are < 4 descriptors
+            * waiting to be written back, then kick the hardware to force
+            * them to be written back in case we stay in NAPI.
+            * In this mode on X722 we do not enable Interrupt.
+            */
+           unsigned int j = i40e_get_tx_pending(tx_ring, false);
+
+           if (budget &&
+               ((j / WB_STRIDE) == 0) && (j > 0) &&
+               !test_bit(__I40E_VSI_DOWN, vsi->state) &&
+               (I40E_DESC_UNUSED(tx_ring) != tx_ring->count))
+                   tx_ring->arm_wb = true;
+   }
+}

[PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

The -c/--copy and -z/--zero-copy flags enforce either copy or
zero-copy mode.
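
For example (interface name and queue number below are just examples),
forcing zero-copy or copy mode for the rxdrop benchmark:

  ./xdpsock -i eth0 -q 1 -r -z
  ./xdpsock -i eth0 -q 1 -r -c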

Signed-off-by: Björn Töpel 
---
 samples/bpf/xdpsock_user.c | 12 +++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 4914788b6727..b3906111bedb 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -649,6 +649,8 @@ static struct option long_options[] = {
{"xdp-skb", no_argument, 0, 'S'},
{"xdp-native", no_argument, 0, 'N'},
{"interval", required_argument, 0, 'n'},
+   {"zero-copy", no_argument, 0, 'z'},
+   {"copy", no_argument, 0, 'c'},
{0, 0, 0, 0}
 };
 
@@ -667,6 +669,8 @@ static void usage(const char *prog)
"  -S, --xdp-skb=n  Use XDP skb-mod\n"
"  -N, --xdp-native=n   Enfore XDP native mode\n"
"  -n, --interval=n Specify statistics update interval 
(default 1 sec).\n"
+   "  -z, --zero-copy  Force zero-copy mode.\n"
+   "  -c, --copy   Force copy mode.\n"
"\n";
fprintf(stderr, str, prog);
exit(EXIT_FAILURE);
@@ -679,7 +683,7 @@ static void parse_command_line(int argc, char **argv)
opterr = 0;
 
for (;;) {
-   c = getopt_long(argc, argv, "rtli:q:psSNn:", long_options,
+   c = getopt_long(argc, argv, "rtli:q:psSNn:cz", long_options,
_index);
if (c == -1)
break;
@@ -716,6 +720,12 @@ static void parse_command_line(int argc, char **argv)
case 'n':
opt_interval = atoi(optarg);
break;
+   case 'z':
+   opt_xdp_bind_flags |= XDP_ZEROCOPY;
+   break;
+   case 'c':
+   opt_xdp_bind_flags |= XDP_COPY;
+   break;
default:
usage(basename(argv[0]));
}
-- 
2.17.1



[PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
queue.

All AF_XDP specific functions are added to a new file, i40e_xsk.c.

Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
will allocate a new buffer and copy the zero-copy frame prior to
passing it to the kernel stack.
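
For illustration, the XDP_PASS copy-out works along these lines (a
simplified sketch, not the exact driver code -- the function name is
made up):

  /* Sketch: XDP_PASS for a zero-copy Rx frame. The payload is copied
   * out of the UMEM into a freshly allocated skb, so the UMEM buffer
   * can be recycled immediately afterwards.
   */
  static struct sk_buff *i40e_construct_skb_zc_sketch(struct i40e_ring *rx_ring,
                                                      struct xdp_buff *xdp)
  {
          unsigned int datasize = xdp->data_end - xdp->data;
          struct sk_buff *skb;

          skb = __napi_alloc_skb(&rx_ring->q_vector->napi, datasize,
                                 GFP_ATOMIC | __GFP_NOWARN);
          if (!skb)
                  return NULL;

          memcpy(__skb_put(skb, datasize), xdp->data, datasize);
          return skb;
  }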

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/Makefile|   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h  |  19 +
 drivers/net/ethernet/intel/i40e/i40e_main.c |  53 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c |   9 +-
 drivers/net/ethernet/intel/i40e/i40e_txrx.h |  20 +-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c  | 661 
 drivers/net/ethernet/intel/i40e/i40e_xsk.h  |  21 +
 7 files changed, 775 insertions(+), 11 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 14397e7e9925..50590e8d1fd1 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -22,6 +22,7 @@ i40e-objs := i40e_main.o \
i40e_txrx.o \
i40e_ptp.o  \
i40e_client.o   \
-   i40e_virtchnl_pf.o
+   i40e_virtchnl_pf.o \
+   i40e_xsk.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 7a80652e2500..876cac317e79 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -786,6 +786,11 @@ struct i40e_vsi {
 
/* VSI specific handlers */
irqreturn_t (*irq_handler)(int irq, void *data);
+
+   /* AF_XDP zero-copy */
+   struct xdp_umem **xsk_umems;
+   u16 num_xsk_umems_used;
+   u16 num_xsk_umems;
 } cacheline_internodealigned_in_smp;
 
 struct i40e_netdev_priv {
@@ -1090,6 +1095,20 @@ static inline bool i40e_enabled_xdp_vsi(struct i40e_vsi *vsi)
return !!vsi->xdp_prog;
 }
 
+static inline struct xdp_umem *i40e_xsk_umem(struct i40e_ring *ring)
+{
+   bool xdp_on = i40e_enabled_xdp_vsi(ring->vsi);
+   int qid = ring->queue_index;
+
+   if (ring_is_xdp(ring))
+   qid -= ring->vsi->alloc_queue_pairs;
+
+   if (!ring->vsi->xsk_umems || !ring->vsi->xsk_umems[qid] || !xdp_on)
+   return NULL;
+
+   return ring->vsi->xsk_umems[qid];
+}
+
 int i40e_create_queue_channel(struct i40e_vsi *vsi, struct i40e_channel *ch);
 int i40e_set_bw_limit(struct i40e_vsi *vsi, u16 seid, u64 max_tx_rate);
 int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index d8b5a6af72bd..848eea7c84db 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -9,7 +9,9 @@
 /* Local includes */
 #include "i40e.h"
 #include "i40e_diag.h"
+#include "i40e_xsk.h"
 #include <net/udp_tunnel.h>
+#include <net/xdp_sock.h>
 /* All i40e tracepoints are defined by the include below, which
  * must be included exactly once across the whole kernel with
  * CREATE_TRACE_POINTS defined
@@ -3181,13 +3183,46 @@ static int i40e_configure_rx_ring(struct i40e_ring *ring)
	struct i40e_hw *hw = &vsi->back->hw;
struct i40e_hmc_obj_rxq rx_ctx;
i40e_status err = 0;
+   bool ok;
+   int ret;
 
bitmap_zero(ring->state, __I40E_RING_STATE_NBITS);
 
/* clear the context structure first */
	memset(&rx_ctx, 0, sizeof(rx_ctx));
 
-   ring->rx_buf_len = vsi->rx_buf_len;
+   if (ring->vsi->type == I40E_VSI_MAIN)
+   xdp_rxq_info_unreg_mem_model(&ring->xdp_rxq);
+
+   ring->xsk_umem = i40e_xsk_umem(ring);
+   if (ring->xsk_umem) {
+   ring->rx_buf_len = ring->xsk_umem->chunk_size_nohr -
+  XDP_PACKET_HEADROOM;
+   /* For AF_XDP ZC, we disallow packets to span on
+* multiple buffers, thus letting us skip that
+* handling in the fast-path.
+*/
+   chain_len = 1;
+   ring->zca.free = i40e_zca_free;
+   ret = xdp_rxq_info_reg_mem_model(&ring->xdp_rxq,
+                                    MEM_TYPE_ZERO_COPY,
+                                    &ring->zca);
+   if (ret)
+   return ret;
+   dev_info(&vsi->back->pdev->dev,
+            "Registered XDP mem model MEM_TYPE_ZERO_COPY on Rx ring %d\n",
+            ring->queue_index);
+
+   } else {
+   ring->rx_buf_len = vsi->rx_buf_len;

[PATCH bpf-next 00/11] AF_XDP zero-copy support for i40e

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

This patch set introduces zero-copy AF_XDP support for Intel's i40e
driver. In the first preparatory patch we also add support for
XDP_REDIRECT for zero-copy allocated frames so that XDP programs can
redirect them. This was a ToDo from the first AF_XDP zero-copy patch
set from early June. Special thanks to Alex Duyck and Jesper Dangaard
Brouer for reviewing earlier versions of this patch set.

The i40e zero-copy code is located in its own file i40e_xsk.[ch]. Note
that in the interest of time, to get an AF_XDP zero-copy implementation
out there for people to try, some code paths have been copied from the
XDP path to the zero-copy path. It is our goal to merge the two paths
in later patch sets.

In contrast to the implementation from beginning of June, this patch
set does not require any extra HW queues for AF_XDP zero-copy
TX. Instead, the XDP TX HW queue is used for both XDP_REDIRECT and
AF_XDP zero-copy TX.

Jeff, given that most of changes are in i40e, it is up to you how you
would like to route these patches. The set is tagged bpf-next, but
if taking it via the Intel driver tree is easier, let us know.

We have run some benchmarks on a dual socket system with two Broadwell
E5 2660 @ 2.0 GHz with hyperthreading turned off. Each socket has 14
cores which gives a total of 28, but only two cores are used in these
experiments. One for Tx/Rx and one for the user space application. The
memory is DDR4 @ 2133 MT/s (1067 MHz) and the size of each DIMM is
8192MB and with 8 of those DIMMs in the system we have 64 GB of total
memory. The compiler used is gcc (Ubuntu 7.3.0-16ubuntu3) 7.3.0. The
NIC is Intel I40E 40Gbit/s using the i40e driver.

Below are the results in Mpps of the I40E NIC benchmark runs for 64
and 1500 byte packets, generated by a commercial packet generator HW
outputting packets at full 40 Gbit/s line rate. The results are with
retpoline and all other spectre and meltdown fixes, so these results
are not comparable to the ones from the zero-copy patch set in June.

AF_XDP performance 64 byte packets:
Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
rxdrop      2.6       8.2       15.0
txpush      2.2       -         21.9
l2fwd       1.7       2.3       11.3

AF_XDP performance 1500 byte packets:
Benchmark   XDP_SKB   XDP_DRV   XDP_DRV with zerocopy
rxdrop      2.0       3.3       3.3
l2fwd       1.3       1.7       3.1

XDP performance on our system as a base line:

64 byte packets:
XDP stats   CPU pps issue-pps
XDP-RX CPU  16  18.4M  0

1500 byte packets:
XDP stats   CPU pps issue-pps
XDP-RX CPU  16  3.3M0

The structure of the patch set is as follows:

Patch 1: Add support for XDP_REDIRECT of zero-copy allocated frames
Patches 2-4: Preparatory patches to common xsk and net code
Patches 5-7: Preparatory patches to i40e driver code for RX
Patch 8: i40e zero-copy support for RX
Patch 9: Preparatory patch to i40e driver code for TX
Patch 10: i40e zero-copy support for TX
Patch 11: Add flags to sample application to force zero-copy/copy mode

We based this patch set on bpf-next commit 050cdc6c9501 ("Merge
git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")


Magnus & Björn

Björn Töpel (8):
  xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
  xdp: export xdp_rxq_info_unreg_mem_model
  xsk: expose xdp_umem_get_{data,dma} to drivers
  i40e: added queue pair disable/enable functions
  i40e: refactor Rx path for re-use
  i40e: move common Rx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Rx support
  samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock

Magnus Karlsson (3):
  net: add napi_if_scheduled_mark_missed
  i40e: move common Tx functions to i40e_txrx_common.h
  i40e: add AF_XDP zero-copy Tx support

 drivers/net/ethernet/intel/i40e/Makefile  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h|  19 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   | 307 ++-
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 182 ++--
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |  20 +-
 .../ethernet/intel/i40e/i40e_txrx_common.h|  90 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.c| 834 ++
 drivers/net/ethernet/intel/i40e/i40e_xsk.h|  25 +
 include/linux/netdevice.h |  26 +
 include/net/xdp.h |   6 +-
 include/net/xdp_sock.h|  43 +
 net/core/xdp.c|  54 +-
 net/xdp/xdp_umem.h|  10 -
 samples/bpf/xdpsock_user.c|  12 +-
 14 files changed, 1523 insertions(+), 108 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_txrx_common.h
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_xsk.h

-- 
2.17.1



[PATCH bpf-next 06/11] i40e: refactor Rx path for re-use

2018-08-28 Thread Björn Töpel
From: Björn Töpel 

In this commit, the Rx path is refactored some, as a step towards the
introduction of AF_XDP Rx zero-copy.

The page re-use counter is moved into i40e_reuse_rx_page, instead of
bumping the counter in many places. The Rx buffer page clearing is
moved for better readability. Lastly, functions to update statistics
and bump the XDP Tx ring are introduced.

Signed-off-by: Björn Töpel 
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 111 ++--
 1 file changed, 77 insertions(+), 34 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index b5042d1a63c0..b5a2cfeb68a5 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -1244,6 +1244,11 @@ static void i40e_reuse_rx_page(struct i40e_ring *rx_ring,
new_buff->page  = old_buff->page;
new_buff->page_offset   = old_buff->page_offset;
new_buff->pagecnt_bias  = old_buff->pagecnt_bias;
+
+   rx_ring->rx_stats.page_reuse_count++;
+
+   /* clear contents of buffer_info */
+   old_buff->page = NULL;
 }
 
 /**
@@ -1266,7 +1271,7 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
 }
 
 /**
- * i40e_clean_programming_status - clean the programming status descriptor
+ * i40e_clean_programming_status - try clean the programming status descriptor
  * @rx_ring: the rx ring that has this descriptor
  * @rx_desc: the rx descriptor written back by HW
  * @qw: qword representing status_error_len in CPU ordering
@@ -1275,15 +1280,22 @@ static inline bool i40e_rx_is_programming_status(u64 qw)
  * status being successful or not and take actions accordingly. FCoE should
  * handle its context/filter programming/invalidation status and take actions.
  *
+ * Returns an i40e_rx_buffer to reuse if the cleanup occurred, otherwise NULL.
  **/
-static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
- union i40e_rx_desc *rx_desc,
- u64 qw)
+static struct i40e_rx_buffer *i40e_clean_programming_status(
+   struct i40e_ring *rx_ring,
+   union i40e_rx_desc *rx_desc,
+   u64 qw)
 {
struct i40e_rx_buffer *rx_buffer;
-   u32 ntc = rx_ring->next_to_clean;
+   u32 ntc;
u8 id;
 
+   if (!i40e_rx_is_programming_status(qw))
+   return NULL;
+
+   ntc = rx_ring->next_to_clean;
+
/* fetch, update, and store next to clean */
	rx_buffer = &rx_ring->rx_bi[ntc++];
ntc = (ntc < rx_ring->count) ? ntc : 0;
@@ -1291,18 +1303,13 @@ static void i40e_clean_programming_status(struct i40e_ring *rx_ring,
 
prefetch(I40E_RX_DESC(rx_ring, ntc));
 
-   /* place unused page back on the ring */
-   i40e_reuse_rx_page(rx_ring, rx_buffer);
-   rx_ring->rx_stats.page_reuse_count++;
-
-   /* clear contents of buffer_info */
-   rx_buffer->page = NULL;
-
id = (qw & I40E_RX_PROG_STATUS_DESC_QW1_PROGID_MASK) >>
  I40E_RX_PROG_STATUS_DESC_QW1_PROGID_SHIFT;
 
if (id == I40E_RX_PROG_STATUS_DESC_FD_FILTER_STATUS)
i40e_fd_handle_status(rx_ring, rx_desc, id);
+
+   return rx_buffer;
 }
 
 /**
@@ -2152,7 +2159,6 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
if (i40e_can_reuse_rx_page(rx_buffer)) {
/* hand second half of page back to the ring */
i40e_reuse_rx_page(rx_ring, rx_buffer);
-   rx_ring->rx_stats.page_reuse_count++;
} else {
/* we are not reusing the buffer so unmap it */
dma_unmap_page_attrs(rx_ring->dev, rx_buffer->dma,
@@ -2160,10 +2166,9 @@ static void i40e_put_rx_buffer(struct i40e_ring *rx_ring,
 DMA_FROM_DEVICE, I40E_RX_DMA_ATTR);
__page_frag_cache_drain(rx_buffer->page,
rx_buffer->pagecnt_bias);
+   /* clear contents of buffer_info */
+   rx_buffer->page = NULL;
}
-
-   /* clear contents of buffer_info */
-   rx_buffer->page = NULL;
 }
 
 /**
@@ -2287,6 +2292,12 @@ static void i40e_rx_buffer_flip(struct i40e_ring *rx_ring,
 #endif
 }
 
+/**
+ * i40e_xdp_ring_update_tail - Updates the XDP Tx ring tail register
+ * @xdp_ring: XDP Tx ring
+ *
+ * This function updates the XDP Tx ring tail register.
+ **/
 static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
 {
/* Force memory writes to complete before letting h/w
@@ -2296,6 +2307,49 @@ static inline void i40e_xdp_ring_update_tail(struct i40e_ring *xdp_ring)
writel_relaxed(xdp_ring->next_to_use, xdp_ring->tail);
 }
 
+/**
+ * i40e_update_rx_stats - Update Rx ring statistics
+ * @rx_ring: rx descriptor ring
+ * @total_rx_bytes: number of bytes received
+ * @total_rx_packets: number of packets received

Re: [PATCH bpf] xsk: fix return value of xdp_umem_assign_dev()

2018-08-20 Thread Björn Töpel
Den mån 20 aug. 2018 kl 02:58 skrev Prashant Bhole
:
>
> s/ENOTSUPP/EOPNOTSUPP/ in function xdp_umem_assign_dev().
> This function's return value is directly returned by xsk_bind().
> EOPNOTSUPP is bind()'s possible return value.
>
> Fixes: f734607e819b ("xsk: refactor xdp_umem_assign_dev()")
> Signed-off-by: Prashant Bhole 
> ---
>  net/xdp/xdp_umem.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index 911ca6d3cb5a..bfe2dbea480b 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -74,14 +74,14 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> return 0;
>
> if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit)
> -   return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
> +   return force_zc ? -EOPNOTSUPP : 0; /* fail or fallback */
>
> bpf.command = XDP_QUERY_XSK_UMEM;
>
> rtnl_lock();
> err = xdp_umem_query(dev, queue_id);
> if (err) {
> -   err = err < 0 ? -ENOTSUPP : -EBUSY;
> +   err = err < 0 ? -EOPNOTSUPP : -EBUSY;
> goto err_rtnl_unlock;
> }
>
> --
> 2.17.1
>
>

Acked-by: Björn Töpel 


Re: [PATCH bpf] Revert "xdp: add NULL pointer check in __xdp_return()"

2018-08-14 Thread Björn Töpel
Den fre 10 aug. 2018 kl 18:26 skrev Jakub Kicinski
:
>
> On Fri, 10 Aug 2018 17:16:45 +0200, Björn Töpel wrote:
> > Den fre 10 aug. 2018 kl 16:10 skrev Daniel Borkmann :
> > >
> > > On 08/10/2018 11:28 AM, Björn Töpel wrote:
> > > > From: Björn Töpel 
> > > >
> > > > This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.
> > > >
> > > > The reverted commit adds a WARN to check against NULL entries in the
> > > > mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
> > > > driver) fast path is required to make a paired
> > > > xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
> > > > addition, a driver using a different allocation scheme than the
> > > > default MEM_TYPE_PAGE_SHARED is required to additionally call
> > > > xdp_rxq_info_reg_mem_model.
> > > >
> > > > For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
> > > > that the mem_id_ht rhashtable has a properly inserted allocator id. If
> > > > not, this would be a driver bug. A NULL pointer kernel OOPS is
> > > > preferred to the WARN.
> > > >
> > > > Suggested-by: Jesper Dangaard Brouer 
> > > > Signed-off-by: Björn Töpel 
> > >
> > > Given the last bpf pr went out yesterday night, I've applied this to
> > > bpf-next (worst case we can just route it via stable), thanks!
> >
> > Ah, right! Thanks!
> >
> > bpf-next is OK. (Since this path is currently not used yet by any driver... 
> > :-()
>
> Wasn't this dead code, anyway?  The frame return path is for redirects,
> and one can't convert_to_xdp_frame presently?

Indeed, dead it is. Hmm, I'll remove it as part of the i40e zc submission.


Björn


Re: Error running AF_XDP sample application

2018-08-10 Thread Björn Töpel
Den fre 10 aug. 2018 kl 15:23 skrev Konrad Djimeli :
>
> On 2018-08-10 11:58, Konrad Djimeli wrote:
> > On 2018-08-10 03:51, Jakub Kicinski wrote:
> >> On Thu, 09 Aug 2018 18:18:08 +0200, kdjimeli wrote:
> >>> Hello,
> >>>
> >>> I have been trying to test a sample AF_XDP program, but I have been
> >>> experiencing some issues.
> >>> After building the sample code
> >>> https://github.com/torvalds/linux/tree/master/samples/bpf,
> >>> when running the xdpsock binary, I get the errors
> >>> "libbpf: failed to create map (name: 'xsks_map'): Invalid argument"
> >>> "libbpf: failed to load object './xdpsock_kern.o"
> >>>
> >>> I tried to figure out the cause of the error but all I know is that it
> >>> occurs at line 910 with the function
> >>> call "bpf_prog_load_xattr(_load_attr, , _fd)".
> >>>
> >>> Please I would like to inquire what could be a possible for this error.
> >>
> >> which kernel version are you running?
> >
> > My kernel version is 4.18.0-rc8+. I cloned it from
> > https://github.com/torvalds/linux before building and running.
> >
> > My commit head(git show-ref --head) is at
> > 1236568ee3cbb0d3ac62d0074a29b97ecf34cbbc HEAD
> > 1236568ee3cbb0d3ac62d0074a29b97ecf34cbbc refs/heads/master
> > 1236568ee3cbb0d3ac62d0074a29b97ecf34cbbc refs/remotes/origin/HEAD
> > 1236568ee3cbb0d3ac62d0074a29b97ecf34cbbc refs/remotes/origin/master
> > ...
> >
> >
> > I also applied the patch https://patchwork.ozlabs.org/patch/949884/
> > (samples: bpf: convert xdpsock_user.c to libbpf ), as the error was
> > initially in the form show below:
> >   "failed to create a map: 22 Invalid argument"
> >   "ERROR: load_bpf_file"
> >
> > Thanks
> > Konrad
>
> Also other sample applications that make use of other bpf maps, such as
> BPF_MAP_TYPE_CPUMAP in xdp_redirect_cpu work fine. But the application
> with BPF_MAP_TYPE_XSKMAP fails producing the error mentioned above.
>
> Thanks
> Konrad

Thanks for taking AF_XDP for a spin!

Before I start digging into details; Do you have CONFIG_XDP_SOCKETS=y
in your config? :-)
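
(A quick way to check, assuming a distro-style config under /boot, or
CONFIG_IKCONFIG_PROC enabled:

  grep XDP_SOCKETS /boot/config-$(uname -r)
  zgrep XDP_SOCKETS /proc/config.gz
)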


Björn


Re: [PATCH bpf] Revert "xdp: add NULL pointer check in __xdp_return()"

2018-08-10 Thread Björn Töpel
Den fre 10 aug. 2018 kl 16:10 skrev Daniel Borkmann :
>
> On 08/10/2018 11:28 AM, Björn Töpel wrote:
> > From: Björn Töpel 
> >
> > This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.
> >
> > The reverted commit adds a WARN to check against NULL entries in the
> > mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
> > driver) fast path is required to make a paired
> > xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
> > addition, a driver using a different allocation scheme than the
> > default MEM_TYPE_PAGE_SHARED is required to additionally call
> > xdp_rxq_info_reg_mem_model.
> >
> > For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
> > that the mem_id_ht rhashtable has a properly inserted allocator id. If
> > not, this would be a driver bug. A NULL pointer kernel OOPS is
> > preferred to the WARN.
> >
> > Suggested-by: Jesper Dangaard Brouer 
> > Signed-off-by: Björn Töpel 
>
> Given the last bpf pr went out yesterday night, I've applied this to
> bpf-next (worst case we can just route it via stable), thanks!

Ah, right! Thanks!

bpf-next is OK. (Since this path is currently not used yet by any driver... :-()


Björn


Re: [PATCH bpf] Revert "xdp: add NULL pointer check in __xdp_return()"

2018-08-10 Thread Björn Töpel
Den fre 10 aug. 2018 kl 12:18 skrev Jesper Dangaard Brouer :
>
> On Fri, 10 Aug 2018 11:28:02 +0200
> Björn Töpel  wrote:
>
> > From: Björn Töpel 
> >
> > This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.
> >
> > The reverted commit adds a WARN to check against NULL entries in the
> > mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
> > driver) fast path is required to make a paired
> > xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
> > addition, a driver using a different allocation scheme than the
> > default MEM_TYPE_PAGE_SHARED is required to additionally call
> > xdp_rxq_info_reg_mem_model.
> >
> > For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
> > that the mem_id_ht rhashtable has a properly inserted allocator id. If
> > not, this would be a driver bug. A NULL pointer kernel OOPS is
> > preferred to the WARN.
>
> Acked-by: Jesper Dangaard Brouer 
>
> As a comment says in the code: /* NB! Only valid from an xdp_buff! */
> Which is (currently) guarded by the return/exit in convert_to_xdp_frame().
>
> This means that this code path can only be invoked while the driver is
> still running under the RX NAPI process. Thus, there is no chance that
> the allocator-id is gone (via calling xdp_rxq_info_unreg) for this code
> path.
>
> But I really hope we at somepoint can convert a MEM_TYPE_ZERO_COPY into
> a form of xdp_frame, that can travel further into the redirect-core.
> In which case, we likely need to handle the NULL case (but also need
> other code to handle what to do with the memory backing the frame)
>
> (In my vision here:)
>
> I really dislike that the current Zero-Copy mode steals ALL packets
> when ZC is enabled on a RX-queue.  This is not better than the existing
> bypass solutions, which have ugly ways of re-injecting packets back into
> the network stack.  With the integration with XDP, we have the
> flexibility of selecting the frames that we don't want to be "bypassed"
> into AF_XDP, and want the kernel to process instead. (The most common
> use-case is letting the kernel handle the arptable).  IMHO this is what
> will/would make AF_XDP superior to other bypass solutions.
>
>

Thanks for putting your visions/ideas here! I agree with both of your
last sections, and this is what we're working towards. AF_XDP ZC has
to play nicer with XDP.  The current (well, the soon-to-be-published
[1] ;-)) ZC scheme is just a first step, and should be seen as a
starting point so people can start playing with AF_XDP. Jakub also
mentioned these issues a couple of threads ago, so there are
definitely more people feeling the ZC allocator pains. Mid-term, a
sophisticated/proper and generic (for inter-driver reuse) ZC allocator
is needed: converting xdp_buffs to xdp_frames cheaply for multi-CPU
completion, and hopefully ditto for the XDP_PASS/kernel stack path. But
let's start with something simple that works, and take it from there.
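
For the archives, here is a minimal sketch of the selective flavor
discussed above (in restricted C; the map name, sizes and header
locations follow the samples/bpf conventions and are illustrative
only):

  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include "bpf_helpers.h"
  #include "bpf_endian.h"

  struct bpf_map_def SEC("maps") xsks_map = {
          .type = BPF_MAP_TYPE_XSKMAP,
          .key_size = sizeof(int),
          .value_size = sizeof(int),
          .max_entries = 4,
  };

  SEC("xdp")
  int xdp_selective(struct xdp_md *ctx)
  {
          void *data_end = (void *)(long)ctx->data_end;
          void *data = (void *)(long)ctx->data;
          struct ethhdr *eth = data;

          if (data + sizeof(*eth) > data_end)
                  return XDP_DROP;
          if (eth->h_proto == bpf_htons(ETH_P_ARP))
                  return XDP_PASS;        /* let the kernel handle ARP */

          /* everything else goes to the AF_XDP socket on this queue */
          return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, 0);
  }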

Björn

[1] WIP: https://github.com/bjoto/linux/tree/af-xdp-i40e-zc

> > Suggested-by: Jesper Dangaard Brouer 
> > Signed-off-by: Björn Töpel 
> > ---
> >  net/core/xdp.c | 3 +--
> >  1 file changed, 1 insertion(+), 2 deletions(-)
> >
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 6771f1855b96..9d1f22072d5d 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -345,8 +345,7 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> >   rcu_read_lock();
> >   /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> >   xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > - if (!WARN_ON_ONCE(!xa))
> > - xa->zc_alloc->free(xa->zc_alloc, handle);
> > + xa->zc_alloc->free(xa->zc_alloc, handle);
> >   rcu_read_unlock();
> >   default:
> >   /* Not possible, checked in xdp_rxq_info_reg_mem_model() */
>
>
>
> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


[PATCH bpf] Revert "xdp: add NULL pointer check in __xdp_return()"

2018-08-10 Thread Björn Töpel
From: Björn Töpel 

This reverts commit 36e0f12bbfd3016f495904b35e41c5711707509f.

The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.

For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.

Suggested-by: Jesper Dangaard Brouer 
Signed-off-by: Björn Töpel 
---
 net/core/xdp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/xdp.c b/net/core/xdp.c
index 6771f1855b96..9d1f22072d5d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -345,8 +345,7 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
	rcu_read_lock();
	/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
	xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
-   if (!WARN_ON_ONCE(!xa))
-   xa->zc_alloc->free(xa->zc_alloc, handle);
+   xa->zc_alloc->free(xa->zc_alloc, handle);
rcu_read_unlock();
default:
/* Not possible, checked in xdp_rxq_info_reg_mem_model() */
-- 
2.17.1



Re: [PATCH v8 bpf-next 00/10] veth: Driver XDP

2018-08-06 Thread Björn Töpel

On 2018-08-03 11:45, Jesper Dangaard Brouer wrote:

On Fri,  3 Aug 2018 16:58:08 +0900
Toshiaki Makita  wrote:


This patch set introduces driver XDP for veth.
Basically this is used in conjunction with redirect action of another XDP
program.

   NIC ---> veth===veth
  (XDP) (redirect)(XDP)



I was playing with V7 on my testlab yesterday and I noticed one
fundamental issue.  You are not updating the "ifconfig" stats counters,
when in XDP mode.  This makes receive or send via XDP invisible to
sysadm/management tools.  This for-sure is going to cause confusion...

I took a closer look at other drivers. The ixgbe driver is doing the
right thing.  Driver i40e has a bug, where the RX/TX stats are
swapped (strange!).


Indeed! Thanks for finding/reporting this! I'll have look!


Björn


The mlx5 driver is not updating the regular RX/TX
counters, but A LOT of other ethtool stats counters (which are the ones
I usually monitor when testing).

So, given other drivers also didn't get this right, we need to have a
discussion outside your/this patchset.  Thus, I don't want to
stop/stall this patchset, but this is something we need to fixup in a
followup patchset to other drivers as well.

Thus, I'm acking the patchset, but I request that we do a joint effort
of fixing this as followup patches.

Acked-by: Jesper Dangaard Brouer 



Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-08-02 Thread Björn Töpel
Den ons 1 aug. 2018 kl 22:25 skrev Daniel Borkmann :
>
> On 08/01/2018 04:43 PM, Björn Töpel wrote:
> > Den ons 1 aug. 2018 kl 16:14 skrev Jesper Dangaard Brouer 
> > :
> >> On Mon, 23 Jul 2018 11:41:02 +0200
> >> Björn Töpel  wrote:
> >>
> >>>>>> diff --git a/net/core/xdp.c b/net/core/xdp.c
> >>>>>> index 9d1f220..1c12bc7 100644
> >>>>>> --- a/net/core/xdp.c
> >>>>>> +++ b/net/core/xdp.c
> >>>>>> @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> >>>>>>   rcu_read_lock();
> >>>>>>   /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> >>>>>>   xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> >>>>>> - xa->zc_alloc->free(xa->zc_alloc, handle);
> >>>>>> + if (xa)
> >>>>>> + xa->zc_alloc->free(xa->zc_alloc, handle);
> >>>>> hmm...It is not clear to me the "!xa" case don't have to be handled?
> >>>>
> >>>> Thank you for reviewing!
> >>>>
> >>>> A NULL return is a bug case, e.g. calling this after
> >>>> xdp_rxq_info_unreg() has been used, so I don't think it can be
> >>>> handled at that point. We could make __xdp_return() add a
> >>>> WARN_ON_ONCE(), or return an error code to the driver, but I'm
> >>>> not sure that would be useful information.
> >>>>
> >>>> I might have misunderstood the scenario of MEM_TYPE_ZERO_COPY,
> >>>> because there is no use case of MEM_TYPE_ZERO_COPY yet.
> >>>
> >>> Taehee, again, sorry for the slow response and thanks for patch!
> >>>
> >>> If xa is NULL, the driver has a buggy/broken implementation. What
> >>> would be a proper way of dealing with this? BUG?
> >>
> >> Hmm... I don't like these kind of changes to the hot-path code!
> >>
> >> You might not realize this, but adding BUG() and WARN_ON() to the code
> >> affect performance in ways you might not realize!  These macros gets
> >> compiled and uses an asm instruction called "ud2".  Seeing the "ud2"
> >> instruction causes the CPUs instruction cache prefetcher to stop.
> >> Thus, if some code ends up below this instruction, this will cause more
> >> i-cache-misses.
> >>
> >> I don't know if xa==NULL is even possible, but if it is, then I think
> >> this is a result of a driver mem_reg API usage bug.  And the mem-reg
> >> API is full of WARN's and error messages, exactly to push these kind of
> >> checks out of the fast-path.  There is no need for a BUG() call, as
> >> deref a NULL pointer will case an OOPS, that is easy to read and
> >> understand.
> >
> > Jesper, thanks for having a look! So, you're right that if xa==NULL
> > the driver is "broken/buggy" (as stated earlier!). I agree that
> > OOPSing on a NULL pointer is as good as a BUG!
> >
> > The applied patch adds a WARN_ON_ONCE, and I thought best practice was
> > that a buggy driver shouldn't crash the kernel... What is considered
> > best practices in these scenarios? *I'd* prefer an OOPS instead of
> > WARN_ON_ONCE, to catch that buggy driver. Again, that's me. I thought
> > that most people prefer not crashing, hence the patch. :-)
>
> In that case, lets send a revert for the patch with a proper analysis
> of why it is safe to omit the NULL check which should be placed as a
> comment right near the rhashtable_lookup().
>

I'll do that (as soon as I've double-checked so that I'm not lying)!


Björn

> Thanks,
> Daniel


Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-08-01 Thread Björn Töpel
Den ons 1 aug. 2018 kl 16:14 skrev Jesper Dangaard Brouer :
>
> On Mon, 23 Jul 2018 11:41:02 +0200
> Björn Töpel  wrote:
>
> > > >> diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > >> index 9d1f220..1c12bc7 100644
> > > >> --- a/net/core/xdp.c
> > > >> +++ b/net/core/xdp.c
> > > >> @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> > > >>   rcu_read_lock();
> > > >>   /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> > > >>   xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > > >> - xa->zc_alloc->free(xa->zc_alloc, handle);
> > > >> + if (xa)
> > > >> + xa->zc_alloc->free(xa->zc_alloc, handle);
> > > > hmm...It is not clear to me the "!xa" case don't have to be handled?
> > >
> > > Thank you for reviewing!
> > >
> > > A NULL return is a bug case, e.g. calling this after
> > > xdp_rxq_info_unreg() has been used, so I don't think it can be
> > > handled at that point. We could make __xdp_return() add a
> > > WARN_ON_ONCE(), or return an error code to the driver, but I'm
> > > not sure that would be useful information.
> > >
> > > I might have misunderstood the scenario of MEM_TYPE_ZERO_COPY,
> > > because there is no use case of MEM_TYPE_ZERO_COPY yet.
> > >
> >
> > Taehee, again, sorry for the slow response and thanks for patch!
> >
> > If xa is NULL, the driver has a buggy/broken implementation. What
> > would be a proper way of dealing with this? BUG?
>
> Hmm... I don't like these kind of changes to the hot-path code!
>
> You might not realize this, but adding BUG() and WARN_ON() to the code
> affect performance in ways you might not realize!  These macros gets
> compiled and uses an asm instruction called "ud2".  Seeing the "ud2"
> instruction causes the CPUs instruction cache prefetcher to stop.
> Thus, if some code ends up below this instruction, this will cause more
> i-cache-misses.
>
> I don't know if xa==NULL is even possible, but if it is, then I think
> this is a result of a driver mem_reg API usage bug.  And the mem-reg
> API is full of WARN's and error messages, exactly to push these kind of
> checks out of the fast-path.  There is no need for a BUG() call, as
> deref a NULL pointer will case an OOPS, that is easy to read and
> understand.
>

Jesper, thanks for having a look! So, you're right that if xa==NULL
the driver is "broken/buggy" (as stated earlier!). I agree that
OOPSing on a NULL pointer is as good as a BUG!

The applied patch adds a WARN_ON_ONCE, and I thought best practice was
that a buggy driver shouldn't crash the kernel... What is considered
best practices in these scenarios? *I'd* prefer an OOPS instead of
WARN_ON_ONCE, to catch that buggy driver. Again, that's me. I thought
that most people prefer not crashing, hence the patch. :-)


Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer


Re: [RFC bpf-next 0/6] net: xsk: minor improvements around queue handling

2018-07-31 Thread Björn Töpel
Den tis 31 juli 2018 kl 04:49 skrev Jakub Kicinski
:
>
> On Mon, 30 Jul 2018 14:49:32 +0200, Björn Töpel wrote:
> > Den tors 26 juli 2018 kl 23:44 skrev Jakub Kicinski:
> > >
> > > Hi!
> >
> > Thanks for spending your time on this, Jakub. I'm (temporarily) back
> > for a week, so you can expect faster replies now...
> >
> > > This set tries to make the core take care of error checking for the
> > > drivers.  In particular making sure that the AF_XDP UMEM is not installed
> > > on queues which don't exist (or are disabled) and that changing queue
> > > (AKA ethtool channel) count cannot disable queues with active AF_XDF
> > > zero-copy sockets.
> > >
> > > I'm sending as an RFC because I'm not entirely sure what the desired
> > > behaviour is here.  Is it Okay to install AF_XDP on queues which don't
> > > exist?  I presume not?
> >
> > Your presumption is correct. The idea with the
> > real_num_rx_queues/real_num_tx_queues check in xsk_bind, is to bail
> > out if the queue doesn't exist at bind call. Note that we *didn't* add
> > any code to avoid the bound queue from being removed via set channel
> > (your patch 6). Our idea was that if you remove a queue, the ingress
> > frames would simply stop flowing, and the queue config change checking
> > was "out-of-band".
> >
> > I think I prefer your approach, i.e. not allowing the channels/queues
> > to change if they're bound to an AF_XDP socket. However, your
> > xdp_umem_query used in ethtool only works for ZC enabled drivers, not
> > for the existing non-ZC/copy case. If we'd like to go the route of
> > disabling ethtool_set_channels for an AF_XDP enabled queue this
> > functionality needs to move the query into netdev core, so we have a
> > consistent behavior.
>
> Agreed.  There seems to be no notification for changing the number of
> queues and therefore no very clean way to solve this today.

I'm probably lacking some history here; Has there been any past
efforts in making channels/queues a "kernel object"? Would it make
sense to add notifications for queue changes analogous to netdev
changes?

> The last
> two patches are more of a courtesy to the drivers, to simplify the
> data structure for holding the UMEMs.
>
> I could argue that driver and stack are not really apples to apples.
> Much like Generic XDP, the skb-based AF_XDP is basically a development
> tool and last-resort fallback.

Hmm... I partially agree. Let me think about it a bit more.

> For TX driver will most likely allocate
> separate queues, while skb-based will use the stack's queues.  These
> are actually different queues.  Stack will also fallback to other queue
> in __netdev_pick_tx() if number of queues changes.
>

Yup, ideally the driver will use a dedicated queue. We had some
thoughts on hijacking the skb Tx queue, and route the stack egress
packets elsewhere, but it ended up way too messy.

> But yes, preferably skb-based and ZC should behave the same..
>
> > > Are the AF_XDP queue_ids referring to TX queues
> > > as well as RX queues in case of the driver?  I presume not?
> >
> > We've had a lot of discussions about this internally. Ideally, we'd
> > like to give driver implementors the most freedom, and not enforcing a
> > certain queue scheme for Tx.
>
> You say freedom I hear diverging implementations and per-driver
> checks in user space ;-)
>

Yeah. :-) Well, for the i40e ZC implementation, the Tx queue id was
not equal to Rx queue id, so to answer your question: "Correct, the
queue id refer to Rx."

> Practically speaking unless you take the xmit lock there is little
> chance of reusing stack's TX queues, so you'd have to allocate a
> separate queue one way or the other..  At which point the number of
> stack's TX queues has no bearing on AF_XDP ZC.
>

Yup, you're right.

> > OTOH it makes it weird for the userland
> > application *not* to have the same id, e.g. if a userland application
> > would like to get stats or configure the AF_XDP bound Tx queue --
> > which id is it? Should the Tx queue id  for an xsk be exposed in
> > sysfs?
>
> I'd not go there.
>

Honestly, me neither. I need to think more about how to expose the Tx
queue pulls/knobs for a control plane.

> > If the id is *not* the same, would it be OK to change the number of
> > channels and Tx would continue to operate correctly? A related
> > question; An xsk with *only* Tx, should it be constrained by the
> > number of (enabled) Rx queues?
>
> Good question, are drivers even supposed to care about tx-only/rx-only?
> From driver's perspective rx-only socket will s

Re: [PATCH net-next 0/3] xsk: improvements to RX queue check and replace

2018-07-31 Thread Björn Töpel
Den tis 31 juli 2018 kl 05:46 skrev Jakub Kicinski
:
>
> Hi!
>
> First 3 patches of my recent RFC.  The first one make the check against
> real_num_rx_queues slightly more reliable, while the latter two redefine
> XDP_QUERY_XSK_UMEM slightly to disallow replacing UMEM in the driver at
> the stack level.
>
> I'm not sure where this lays on the bpf vs net trees scale, but there
> should be no conflicts with either tree.
>
> Jakub Kicinski (3):
>   net: update real_num_rx_queues even when !CONFIG_SYSFS
>   xsk: refactor xdp_umem_assign_dev()
>   xsk: don't allow umem replace at stack level
>
>  include/linux/netdevice.h | 10 +++---
>  net/xdp/xdp_umem.c| 70 +++
>  2 files changed, 47 insertions(+), 33 deletions(-)
>
> --
> 2.17.1
>

LGTM!

For the series:
Acked-by: Björn Töpel 


Re: [RFC bpf-next 0/6] net: xsk: minor improvements around queue handling

2018-07-30 Thread Björn Töpel
Den tors 26 juli 2018 kl 23:44 skrev Jakub Kicinski
:
>
> Hi!
>

Thanks for spending your time on this, Jakub. I'm (temporarily) back
for a week, so you can expect faster replies now...

> This set tries to make the core take care of error checking for the
> drivers.  In particular making sure that the AF_XDP UMEM is not installed
> on queues which don't exist (or are disabled) and that changing queue
> (AKA ethtool channel) count cannot disable queues with active AF_XDF
> zero-copy sockets.
>
> I'm sending as an RFC because I'm not entirely sure what the desired
> behaviour is here.  Is it Okay to install AF_XDP on queues which don't
> exist?  I presume not?

Your presumption is correct. The idea with the
real_num_rx_queues/real_num_tx_queues check in xsk_bind is to bail
out if the queue doesn't exist at bind time. Note that we *didn't* add
any code to prevent the bound queue from being removed via set channel
(your patch 6). Our idea was that if you remove a queue, the ingress
frames would simply stop flowing, and the queue config change checking
was "out-of-band".

I think I prefer your approach, i.e. not allowing the channels/queues
to change if they're bound to an AF_XDP socket. However, your
xdp_umem_query used in ethtool only works for ZC enabled drivers, not
for the existing non-ZC/copy case. If we'd like to go the route of
disabling ethtool_set_channels for an AF_XDP enabled queue this
functionality needs to move the query into netdev core, so we have a
consistent behavior.

> Are the AF_XDP queue_ids referring to TX queues
> as well as RX queues in case of the driver?  I presume not?

We've had a lot of discussions about this internally. Ideally, we'd
like to give driver implementors the most freedom, and not enforcing a
certain queue scheme for Tx. OTOH it makes it weird for the userland
application *not* to have the same id, e.g. if a userland application
would like to get stats or configure the AF_XDP bound Tx queue --
which id is it? Should the Tx queue id  for an xsk be exposed in
sysfs? If the id is *not* the same, would it be OK to change the
number of channels and Tx would continue to operate correctly? A
related question; An xsk with *only* Tx, should it be constrained by
the number of (enabled) Rx queues?

I'd be happy to hear some more opinions/thoughts on this...

> Should
> we try to prevent disabling queues which have non zero-copy sockets
> installed as well? :S
>

Yes, the ZC/non-ZC case must be consistent IMO. See comment above.

> Anyway, if any of those patches seem useful and reasonable, please let
> me know I will repost as non-RFC.
>

I definitely think patch 2 and 3 (and probably 1) should go as non-RFC!

Thanks for spotting that we're not holding the rtnl lock when checking
the # queues (patch 4)!


Björn

> Jakub Kicinski (6):
>   net: update real_num_rx_queues even when !CONFIG_SYSFS
>   xsk: refactor xdp_umem_assign_dev()
>   xsk: don't allow umem replace at stack level
>   xsk: don't allow installing UMEM beyond the number of queues
>   ethtool: rename local variable max -> curr
>   ethtool: don't allow disabling queues with umem installed
>
>  include/linux/netdevice.h | 16 +++--
>  net/core/ethtool.c| 19 ++
>  net/xdp/xdp_umem.c| 73 ---
>  3 files changed, 71 insertions(+), 37 deletions(-)
>
> --
> 2.17.1
>


Re: [RFC bpf-next 3/6] xsk: don't allow umem replace at stack level

2018-07-30 Thread Björn Töpel
Den tors 26 juli 2018 kl 23:44 skrev Jakub Kicinski
:
>
> Currently drivers have to check if they already have a umem
> installed for a given queue and return an error if so.  Make
> better use of XDP_QUERY_XSK_UMEM and move this functionality
> to the core.
>
> We need to keep rtnl across the calls now.
>
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 
> ---
>  include/linux/netdevice.h |  7 ---
>  net/xdp/xdp_umem.c| 37 -
>  2 files changed, 32 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 6717dc7e8fbf..a5a34f0fb485 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -872,10 +872,10 @@ struct netdev_bpf {
> struct {
> struct bpf_offloaded_map *offmap;
> };
> -   /* XDP_SETUP_XSK_UMEM */
> +   /* XDP_QUERY_XSK_UMEM, XDP_SETUP_XSK_UMEM */
> struct {
> -   struct xdp_umem *umem;
> -   u16 queue_id;
> +   struct xdp_umem *umem; /* out for query*/
> +   u16 queue_id; /* in for query */
> } xsk;
> };
>  };
> @@ -3566,6 +3566,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct 
> netlink_ext_ack *extack,
>   int fd, u32 flags);
>  u32 __dev_xdp_query(struct net_device *dev, bpf_op_t xdp_op,
> enum bpf_netdev_command cmd);
> +int xdp_umem_query(struct net_device *dev, u16 queue_id);
>
>  int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
>  int dev_forward_skb(struct net_device *dev, struct sk_buff *skb);
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index c199d66b5f3f..911ca6d3cb5a 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -11,6 +11,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>
>  #include "xdp_umem.h"
>  #include "xsk_queue.h"
> @@ -40,6 +42,21 @@ void xdp_del_sk_umem(struct xdp_umem *umem, struct xdp_sock *xs)
> }
>  }
>
> +int xdp_umem_query(struct net_device *dev, u16 queue_id)
> +{
> +   struct netdev_bpf bpf;
> +
> +   ASSERT_RTNL();
> +
> +   memset(&bpf, 0, sizeof(bpf));
> +   bpf.command = XDP_QUERY_XSK_UMEM;
> +   bpf.xsk.queue_id = queue_id;
> +
> +   if (!dev->netdev_ops->ndo_bpf)
> +   return 0;
> +   return dev->netdev_ops->ndo_bpf(dev, &bpf) ?: !!bpf.xsk.umem;
> +}
> +
>  int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> u32 queue_id, u16 flags)
>  {
> @@ -62,28 +79,30 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> bpf.command = XDP_QUERY_XSK_UMEM;
>
> rtnl_lock();
> -   err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> -   rtnl_unlock();
> -
> -   if (err)
> -   return force_zc ? -ENOTSUPP : 0;
> +   err = xdp_umem_query(dev, queue_id);
> +   if (err) {
> +   err = err < 0 ? -ENOTSUPP : -EBUSY;
> +   goto err_rtnl_unlock;
> +   }
>
> bpf.command = XDP_SETUP_XSK_UMEM;
> bpf.xsk.umem = umem;
> bpf.xsk.queue_id = queue_id;
>
> -   rtnl_lock();
> err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> -   rtnl_unlock();
> -
> if (err)
> -   return force_zc ? err : 0; /* fail or fallback */
> +   goto err_rtnl_unlock;
> +   rtnl_unlock();
>
>     dev_hold(dev);
> umem->dev = dev;
> umem->queue_id = queue_id;
> umem->zc = true;
> return 0;
> +
> +err_rtnl_unlock:
> +   rtnl_unlock();
> +   return force_zc ? err : 0; /* fail or fallback */
>  }
>
>  static void xdp_umem_clear_dev(struct xdp_umem *umem)
> --
> 2.17.1
>

Nice!

For a non-RFC version,

Acked-by: Björn Töpel 


Re: [RFC bpf-next 2/6] xsk: refactor xdp_umem_assign_dev()

2018-07-30 Thread Björn Töpel
Den tors 26 juli 2018 kl 23:44 skrev Jakub Kicinski
:
>
> Return early and only take the ref on dev once there is no possibility
> of failing.
>
> Signed-off-by: Jakub Kicinski 
> Reviewed-by: Quentin Monnet 
> ---
>  net/xdp/xdp_umem.c | 49 --
>  1 file changed, 21 insertions(+), 28 deletions(-)
>
> diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
> index f47abb46c587..c199d66b5f3f 100644
> --- a/net/xdp/xdp_umem.c
> +++ b/net/xdp/xdp_umem.c
> @@ -56,41 +56,34 @@ int xdp_umem_assign_dev(struct xdp_umem *umem, struct net_device *dev,
> if (force_copy)
> return 0;
>
> -   dev_hold(dev);
> +   if (!dev->netdev_ops->ndo_bpf || !dev->netdev_ops->ndo_xsk_async_xmit)
> +   return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
>
> -   if (dev->netdev_ops->ndo_bpf && dev->netdev_ops->ndo_xsk_async_xmit) {
> -   bpf.command = XDP_QUERY_XSK_UMEM;
> +   bpf.command = XDP_QUERY_XSK_UMEM;
>
> -   rtnl_lock();
> -   err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> -   rtnl_unlock();
> +   rtnl_lock();
> +   err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> +   rtnl_unlock();
>
> -   if (err) {
> -   dev_put(dev);
> -   return force_zc ? -ENOTSUPP : 0;
> -   }
> +   if (err)
> +   return force_zc ? -ENOTSUPP : 0;
>
> -   bpf.command = XDP_SETUP_XSK_UMEM;
> -   bpf.xsk.umem = umem;
> -   bpf.xsk.queue_id = queue_id;
> +   bpf.command = XDP_SETUP_XSK_UMEM;
> +   bpf.xsk.umem = umem;
> +   bpf.xsk.queue_id = queue_id;
>
> -   rtnl_lock();
> -   err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> -   rtnl_unlock();
> +   rtnl_lock();
> +   err = dev->netdev_ops->ndo_bpf(dev, &bpf);
> +   rtnl_unlock();
>
> -   if (err) {
> -   dev_put(dev);
> -   return force_zc ? err : 0; /* fail or fallback */
> -   }
> -
> -   umem->dev = dev;
> -   umem->queue_id = queue_id;
> -   umem->zc = true;
> -   return 0;
> -   }
> +   if (err)
> +   return force_zc ? err : 0; /* fail or fallback */
>
> -   dev_put(dev);
> -   return force_zc ? -ENOTSUPP : 0; /* fail or fallback */
> +   dev_hold(dev);
> +   umem->dev = dev;
> +   umem->queue_id = queue_id;
> +   umem->zc = true;
> +   return 0;
>  }
>
>  static void xdp_umem_clear_dev(struct xdp_umem *umem)
> --
> 2.17.1
>

Much cleaner! Please spin this w/o the RFC tag.

Acked-by: Björn Töpel 


Björn


Re: [PATCH bpf] net: xsk: don't return frames via the allocator on error

2018-07-30 Thread Björn Töpel
Den lör 28 juli 2018 kl 05:21 skrev Jakub Kicinski
:
>
> xdp_return_buff() is used when frame has been successfully
> handled (transmitted) or if an error occurred during delayed
> processing and there is no way to report it back to
> xdp_do_redirect().
>
> In case of __xsk_rcv_zc() error is propagated all the way
> back to the driver, so there is no need to call
> xdp_return_buff().  Driver will recycle the frame anyway
> after seeing that error happened.
>
> Fixes: 173d3adb6f43 ("xsk: add zero-copy support for Rx")
> Signed-off-by: Jakub Kicinski 
> ---
> Patch could as well be applied to bpf-next, since there are no drivers
> for AF_XDP in tree, yet.  xdp_umem_get_dma() and xdp_umem_get_data() are
> not even exported.  But one could reimplement those...
>
> As I mentioned I think this makes the entire MEM_TYPE_ZERO_COPY allocator
> handling dead code now :(
>

Nice Jakub!

Indeed, as you state, there's no ZC driver implementation yet. So,
either bpf or bpf-next.

Acked-by: Björn Töpel 

>  net/xdp/xsk.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 72335c2e8108..4e937cd7c17d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -84,10 +84,8 @@ static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>  {
> int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len);
>
> -   if (err) {
> -   xdp_return_buff(xdp);
> +   if (err)
> xs->rx_dropped++;
> -   }
>
> return err;
>  }
> --
> 2.17.1
>


Re: [PATCH V3 bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-26 Thread Björn Töpel
Den tors 26 juli 2018 kl 16:18 skrev Taehee Yoo :
>
> rhashtable_lookup() can return NULL. so that NULL pointer
> check routine should be added.
>

Thanks Taehee!

Acked-by: Björn Töpel 

> Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> Acked-by: Martin KaFai Lau 
> Signed-off-by: Taehee Yoo 
> ---
> V3 : reduce code line
> V2 : add WARN_ON_ONCE when xa is NULL.
>  net/core/xdp.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 9d1f220..6771f18 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> rcu_read_lock();
> /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> -   xa->zc_alloc->free(xa->zc_alloc, handle);
> +   if (!WARN_ON_ONCE(!xa))
> +   xa->zc_alloc->free(xa->zc_alloc, handle);
> rcu_read_unlock();
> default:
> /* Not possible, checked in xdp_rxq_info_reg_mem_model() */
> --
> 2.9.3
>


Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-26 Thread Björn Töpel
Den mån 23 juli 2018 kl 21:58 skrev Jakub Kicinski
:
>
> On Mon, 23 Jul 2018 11:39:36 +0200, Björn Töpel wrote:
> > Den fre 20 juli 2018 kl 22:08 skrev Jakub Kicinski:
> > > On Fri, 20 Jul 2018 10:18:21 -0700, Martin KaFai Lau wrote:
> > > > On Sat, Jul 21, 2018 at 01:04:45AM +0900, Taehee Yoo wrote:
> > > > > rhashtable_lookup() can return NULL. so that NULL pointer
> > > > > check routine should be added.
> > > > >
> > > > > Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> > > > > Signed-off-by: Taehee Yoo 
> > > > > ---
> > > > >  net/core/xdp.c | 3 ++-
> > > > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > > > index 9d1f220..1c12bc7 100644
> > > > > --- a/net/core/xdp.c
> > > > > +++ b/net/core/xdp.c
> > > > > @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> > > > > rcu_read_lock();
> > > > > /* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> > > > > xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > > > > -   xa->zc_alloc->free(xa->zc_alloc, handle);
> > > > > +   if (xa)
> > > > > +   xa->zc_alloc->free(xa->zc_alloc, handle);
> > > > hmm...It is not clear to me the "!xa" case don't have to be handled?
> > >
> > > Actually I have a more fundamental question about this interface I've
> > > been meaning to ask.
> > >
> > > IIUC free() can happen on any CPU at any time, when whatever device,
> > > socket or CPU this got redirected to completed the TX.  IOW there may
> > > be multiple producers.  Drivers would need to create spin lock a'la the
> > > a9744f7ca200 ("xsk: fix potential race in SKB TX completion code") fix?
> > >
> >
> > Jakub, apologies for the slow response. I'm still in
> > "holiday/hammock mode", but will be back in a week. :-P
>
> Ah, sorry to interrupt! :)
>

Don't make it a habit! ;-P

> > The idea with the xdp_return_* functions are that an xdp_buff and
> > xdp_frame can have custom allocations schemes. The difference beween
> > struct xdp_buff and struct xdp_frame is lifetime. The xdp_buff
> > lifetime is within the napi context, whereas xdp_frame can have a
> > lifetime longer/outside the napi context. E.g. for a XDP_REDIRECT
> > scenario an xdp_buff is converted to a xdp_frame. The conversion is
> > done in include/net/xdp.h:convert_to_xdp_frame.
> >
> > Currently, the zero-copy MEM_TYPE_ZERO_COPY memtype can *only* be used
> > for xdp_buff, meaning that the lifetime is constrained to a napi
> > context. Further, given an xdp_buff with memtype MEM_TYPE_ZERO_COPY,
> > doing XDP_REDIRECT to a target that is *not* an AF_XDP socket would
> > mean converting the xdp_buff to an xdp_frame. The xdp_frame can then
> > be free'd on any CPU.
> >
> > Note that the xsk_rcv* functions is always called from an napi
> > context, and therefore is using the xdp_return_buff calls.
> >
> > To answer your question -- no, this fix is *not* needed, because the
> > xdp_buff napi constrained, and the xdp_buff will only be free'd on one
> > CPU.
>
> Oh, thanks, I missed the check in convert_to_xdp_frame(), so the only
> frames which can come back via the free path are out of the error path
> in __xsk_rcv_zc()?
>
> That path looks a little surprising too, isn't the expectation that if
> xdp_do_redirect() returns an error the driver retains the ownership of
> the buffer?
>
> static int __xsk_rcv_zc(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> {
> int err = xskq_produce_batch_desc(xs->rx, (u64)xdp->handle, len);
>
> if (err) {
> xdp_return_buff(xdp);
> xs->rx_dropped++;
> }
>
> return err;
> }
>
> This seems to call xdp_return_buff() *and* return an error.
>

Ugh, this is indeed an error. The xdp_return_buff() call should be removed.
Thanks for spotting this!

So, yes, the way to get the buffer back (in ZC) to the driver is via
the error path (recycling) or via the UMEM fill queue.

> > > We need some form of internal kernel circulation which would be MPSC.
> > > I'm currently hacking up the XSK code to tell me whether 

Re: [PATCH V2 bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-26 Thread Björn Töpel
On Thu, 26 Jul 2018 at 04:14, Jakub Kicinski wrote:
>
> On Thu, 26 Jul 2018 00:09:50 +0900, Taehee Yoo wrote:
> > rhashtable_lookup() can return NULL, so a NULL pointer
> > check should be added.
> >
> > Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> > Signed-off-by: Taehee Yoo 
> > ---
> > V2 : add WARN_ON_ONCE when xa is NULL.
> >
> >  net/core/xdp.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 9d1f220..786fdbe 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -345,7 +345,10 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> >  		rcu_read_lock();
> >  		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> >  		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > -		xa->zc_alloc->free(xa->zc_alloc, handle);
> > +		if (!xa)
> > +			WARN_ON_ONCE(1);
>
> nit: is compiler smart enough to figure out the fast path here?
> WARN_ON_ONCE() has the nice side effect of wrapping the condition in
> unlikely().  It could save us both LoC and potentially cycles to do:
>
> 	if (!WARN_ON_ONCE(!xa))
> 		xa->zc_alloc->free(xa->zc_alloc, handle);
>
> Although it admittedly looks a bit awkward.  I'm not sure if we have
> some form of assert (i.e. positive check) in tree :S
>

I'm kind of in favor of this ^^^. Hopefully, Taehee is ok with another spin.

Björn

> > +		else
> > +			xa->zc_alloc->free(xa->zc_alloc, handle);
> >  		rcu_read_unlock();
> >  	default:
> >  		/* Not possible, checked in xdp_rxq_info_reg_mem_model() */


[PATCH bpf] xsk: fix poll/POLLIN premature returns

2018-07-23 Thread Björn Töpel
From: Björn Töpel 

Polling for the ingress queues relies on reading the producer/consumer
pointers of the Rx queue.

Prior to this commit, a cached consumer pointer could be used instead
of the actual consumer pointer, and therefore POLLIN could be reported
prematurely.

This patch makes sure that the non-cached consumer pointer is used
instead.
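
To see why, here is a sketch of xskq_nb_free() (paraphrased from
net/xdp/xsk_queue.h; the in-tree details may differ slightly):

static inline u32 xskq_nb_free(struct xsk_queue *q, u32 producer, u32 dcnt)
{
	/* Compute free entries from the locally cached consumer pointer */
	u32 free_entries = q->nentries - (producer - q->cons_tail);

	/* With dcnt == 1 this early exit is taken almost always, so a
	 * possibly stale cached pointer is what gets reported.
	 */
	if (free_entries >= dcnt)
		return free_entries;

	/* Refresh the local tail pointer from the shared ring */
	q->cons_tail = READ_ONCE(q->ring->consumer);
	return q->nentries - (producer - q->cons_tail);
}

Passing q->nentries as dcnt forces the refresh in every case except
when the cached view already shows a completely empty queue, so
xskq_empty_desc() now compares against the real consumer pointer.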

Reported-by: Qi Zhang 
Tested-by: Qi Zhang 
Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Björn Töpel 
---
 net/xdp/xsk_queue.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 52ecaf770642..8a64b150be54 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -250,7 +250,7 @@ static inline bool xskq_full_desc(struct xsk_queue *q)
 
 static inline bool xskq_empty_desc(struct xsk_queue *q)
 {
-	return xskq_nb_free(q, q->prod_tail, 1) == q->nentries;
+	return xskq_nb_free(q, q->prod_tail, q->nentries) == q->nentries;
 }
 
 void xskq_set_umem(struct xsk_queue *q, struct xdp_umem_props *umem_props);
-- 
2.17.1



Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-23 Thread Björn Töpel
On Sat, 21 Jul 2018 at 14:58, Taehee Yoo wrote:
>
> On 2018-07-21 at 2:18 GMT+09:00, Martin KaFai Lau wrote:
> > On Sat, Jul 21, 2018 at 01:04:45AM +0900, Taehee Yoo wrote:
> >> rhashtable_lookup() can return NULL, so a NULL pointer
> >> check should be added.
> >>
> >> Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> >> Signed-off-by: Taehee Yoo 
> >> ---
> >>  net/core/xdp.c | 3 ++-
> >>  1 file changed, 2 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/core/xdp.c b/net/core/xdp.c
> >> index 9d1f220..1c12bc7 100644
> >> --- a/net/core/xdp.c
> >> +++ b/net/core/xdp.c
> >> @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> >>  		rcu_read_lock();
> >>  		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> >>  		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> >> -		xa->zc_alloc->free(xa->zc_alloc, handle);
> >> +		if (xa)
> >> +			xa->zc_alloc->free(xa->zc_alloc, handle);
> > hmm... it is not clear to me that the "!xa" case doesn't have to be handled?
>
> Thank you for reviewing!
>
> Returning a NULL pointer indicates a bug, such as __xdp_return() being
> called after xdp_rxq_info_unreg(), so I don't think it can be handled
> at that point. We could make __xdp_return() issue a WARN_ON_ONCE(), or
> return an error code to the driver, but I'm not sure that information
> would be useful.
>
> I might have misunderstood the MEM_TYPE_ZERO_COPY scenario, because
> there is no use case for MEM_TYPE_ZERO_COPY yet.
>

Taehee, again, sorry for the slow response, and thanks for the patch!

If xa is NULL, the driver has a buggy/broken implementation. What
would be a proper way of dealing with this? BUG?


Björn

> Thanks!
>
> >
> >>   rcu_read_unlock();
> >>   default:
> >>   /* Not possible, checked in xdp_rxq_info_reg_mem_model() */
> >> --
> >> 2.9.3
> >>


Re: [PATCH bpf] xdp: add NULL pointer check in __xdp_return()

2018-07-23 Thread Björn Töpel
On Fri, 20 Jul 2018 at 22:08, Jakub Kicinski wrote:
>
> On Fri, 20 Jul 2018 10:18:21 -0700, Martin KaFai Lau wrote:
> > On Sat, Jul 21, 2018 at 01:04:45AM +0900, Taehee Yoo wrote:
> > > rhashtable_lookup() can return NULL, so a NULL pointer
> > > check should be added.
> > >
> > > Fixes: 02b55e5657c3 ("xdp: add MEM_TYPE_ZERO_COPY")
> > > Signed-off-by: Taehee Yoo 
> > > ---
> > >  net/core/xdp.c | 3 ++-
> > >  1 file changed, 2 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > > index 9d1f220..1c12bc7 100644
> > > --- a/net/core/xdp.c
> > > +++ b/net/core/xdp.c
> > > @@ -345,7 +345,8 @@ static void __xdp_return(void *data, struct xdp_mem_info *mem, bool napi_direct,
> > >  		rcu_read_lock();
> > >  		/* mem->id is valid, checked in xdp_rxq_info_reg_mem_model() */
> > >  		xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
> > > -		xa->zc_alloc->free(xa->zc_alloc, handle);
> > > +		if (xa)
> > > +			xa->zc_alloc->free(xa->zc_alloc, handle);
> > hmm... it is not clear to me that the "!xa" case doesn't have to be handled?
>
> Actually I have a more fundamental question about this interface I've
> been meaning to ask.
>
> IIUC free() can happen on any CPU at any time, when whatever device,
> socket or CPU this got redirected to completed the TX.  IOW there may
> be multiple producers.  Drivers would need to create a spin lock a la the
> a9744f7ca200 ("xsk: fix potential race in SKB TX completion code") fix?
>

Jakub, apologies for the slow response. I'm still in
"holiday/hammock mode", but will be back in a week. :-P

The idea with the xdp_return_* functions is that an xdp_buff and an
xdp_frame can have custom allocation schemes. The difference between
struct xdp_buff and struct xdp_frame is lifetime. The xdp_buff
lifetime is within the napi context, whereas an xdp_frame can have a
lifetime longer than/outside the napi context. E.g. for an XDP_REDIRECT
scenario an xdp_buff is converted to an xdp_frame. The conversion is
done in include/net/xdp.h:convert_to_xdp_frame.

Currently, the zero-copy MEM_TYPE_ZERO_COPY memtype can *only* be used
for xdp_buff, meaning that the lifetime is constrained to a napi
context. Further, given an xdp_buff with memtype MEM_TYPE_ZERO_COPY,
doing XDP_REDIRECT to a target that is *not* an AF_XDP socket would
mean converting the xdp_buff to an xdp_frame. The xdp_frame can then
be free'd on any CPU.

Note that the xsk_rcv* functions are always called from a napi
context, and therefore use the xdp_return_buff calls.

To answer your question -- no, this fix is *not* needed, because the
xdp_buff is napi constrained, and will only be free'd on one
CPU.
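
For reference, the guard in question sits at the top of
convert_to_xdp_frame() in include/net/xdp.h; roughly like this sketch
(paraphrased, with the rest of the conversion elided):

static inline struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
{
	struct xdp_frame *xdp_frame;

	/* A MEM_TYPE_ZERO_COPY xdp_buff must not outlive the napi
	 * context, so conversion to an xdp_frame is refused for now
	 * (the in-tree TODO: implement clone/copy for this case).
	 */
	if (xdp->rxq->mem.type == MEM_TYPE_ZERO_COPY)
		return NULL;

	/* The regular path places the xdp_frame metadata in the
	 * packet buffer's own headroom and returns it.
	 */
	xdp_frame = xdp->data_hard_start;
	/* ... fill in len, headroom, metasize, mem info ... */
	return xdp_frame;
}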

> We need some form of internal kernel circulation which would be MPSC.
> I'm currently hacking up the XSK code to tell me whether the frame was
> consumed by the correct XSK, and always clone the frame otherwise
> (claiming to be the "traditional" MEM_TYPE_PAGE_ORDER0).
>
> I feel like I'm missing something about the code.  Is redirect of
> ZC/UMEM frame outside the xsk not possible and the only returns we will
> see are from net/xdp/xsk.c?  That would work, but I don't see such a
> check.  Help would be appreciated.
>

Right now, this is the case (refer to the TODO in
convert_to_xdp_frame), i.e. you cannot redirect a ZC/UMEM-allocated
xdp_buff to a target that is not an xsk. This must, obviously, change
so that an xdp_buff (of MEM_TYPE_ZERO_COPY) can be converted to an
xdp_frame. The xdp_frame must be able to be free'd from multiple CPUs,
so a more sophisticated allocation scheme is required here.
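
To make "free'd from multiple CPUs" concrete, here is a deliberately
naive illustration (not from any posted patch; zc_frame_pool and
zc_pool_put are made-up names) of an MPSC return path: any CPU may
give a buffer back, and the single napi context drains the pool. A
real scheme would presumably be lockless.

struct zc_frame_pool {
	spinlock_t lock;	/* taken by the many producers */
	struct list_head free;	/* drained by the single napi consumer */
};

/* Called from whatever CPU completes the TX of an xdp_frame */
static void zc_pool_put(struct zc_frame_pool *pool, struct list_head *buf)
{
	unsigned long flags;

	spin_lock_irqsave(&pool->lock, flags);
	list_add_tail(buf, &pool->free);
	spin_unlock_irqrestore(&pool->lock, flags);
}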

> Also the fact that XSK bufs can't be freed, only completed, adds to the
> pain of implementing AF_XDP; we'd certainly need some form of "give
> back the frame, but I may need it later" SPSC mechanism, otherwise
> driver writers will have a tough time.  Unless, again, I'm missing
> something about the code :)
>

Yup, moving the recycling scheme from driver to "generic" is a good
idea! I need to finish up those i40e zerocopy patches first though...

(...and I'm very excited that you're doing nfp support for AF_XDP!!!)


Björn

> > > rcu_read_unlock();
> > > default:
> > > /* Not possible, checked in xdp_rxq_info_reg_mem_model() */
>


Re: [PATCH 2/3] i40e: split XDP_TX tail and XDP_REDIRECT map flushing

2018-06-26 Thread Björn Töpel
On Tue, 26 Jun 2018 at 18:08, Jesper Dangaard Brouer wrote:
>
> The driver was combining the XDP_TX tail flush and XDP_REDIRECT
> map flushing (xdp_do_flush_map).  This is suboptimal; these two
> flush operations should be kept separate.
>
> It looks like the mistake was copy-pasted from ixgbe.
>
> Fixes: d9314c474d4f ("i40e: add support for XDP_REDIRECT")
> Signed-off-by: Jesper Dangaard Brouer 
> ---
>  drivers/net/ethernet/intel/i40e/i40e_txrx.c |   24 +++-
>  1 file changed, 15 insertions(+), 9 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> index 8ffb7454e67c..c1c027743159 100644
> --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
> @@ -2200,9 +2200,10 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
> return true;
>  }
>
> -#define I40E_XDP_PASS 0
> -#define I40E_XDP_CONSUMED 1
> -#define I40E_XDP_TX 2
> +#define I40E_XDP_PASS		0
> +#define I40E_XDP_CONSUMED	BIT(0)
> +#define I40E_XDP_TX		BIT(1)
> +#define I40E_XDP_REDIR		BIT(2)
>
>  static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
>   struct i40e_ring *xdp_ring);
> @@ -2249,7 +2250,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
> 		break;
> 	case XDP_REDIRECT:
> 		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
> -		result = !err ? I40E_XDP_TX : I40E_XDP_CONSUMED;
> +		result = !err ? I40E_XDP_REDIR : I40E_XDP_CONSUMED;
> 		break;
> 	default:
> 		bpf_warn_invalid_xdp_action(act);
> @@ -2312,7 +2313,8 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> 	unsigned int total_rx_bytes = 0, total_rx_packets = 0;
> 	struct sk_buff *skb = rx_ring->skb;
> 	u16 cleaned_count = I40E_DESC_UNUSED(rx_ring);
> -	bool failure = false, xdp_xmit = false;
> +	unsigned int xdp_xmit = 0;
> +	bool failure = false;
> 	struct xdp_buff xdp;
>
> 	xdp.rxq = &rx_ring->xdp_rxq;
> @@ -2373,8 +2375,10 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> 		}
>
> 		if (IS_ERR(skb)) {
> -			if (PTR_ERR(skb) == -I40E_XDP_TX) {
> -				xdp_xmit = true;
> +			unsigned int xdp_res = -PTR_ERR(skb);
> +
> +			if (xdp_res & (I40E_XDP_TX | I40E_XDP_REDIR)) {
> +				xdp_xmit |= xdp_res;
> 				i40e_rx_buffer_flip(rx_ring, rx_buffer, size);
> 			} else {
> 				rx_buffer->pagecnt_bias++;
> @@ -2428,12 +2432,14 @@ static int i40e_clean_rx_irq(struct i40e_ring *rx_ring, int budget)
> 		total_rx_packets++;
> 	}
>
> -	if (xdp_xmit) {
> +	if (xdp_xmit & I40E_XDP_REDIR)
> +		xdp_do_flush_map();
> +
> +	if (xdp_xmit & I40E_XDP_TX) {
> 		struct i40e_ring *xdp_ring =
> 			rx_ring->vsi->xdp_rings[rx_ring->queue_index];
>
> 		i40e_xdp_ring_update_tail(xdp_ring);
> -		xdp_do_flush_map();
> 	}
>
> rx_ring->skb = skb;
>

Nice! Added intel-wired-lan to Cc.

Acked-by: Björn Töpel 


[PATCH bpf] xsk: re-add queue id check for XDP_SKB path

2018-06-12 Thread Björn Töpel
From: Björn Töpel 

Commit 173d3adb6f43 ("xsk: add zero-copy support for Rx") introduced a
regression on the XDP_SKB receive path, when the queue id checks were
removed. Now, they are back again.

Fixes: 173d3adb6f43 ("xsk: add zero-copy support for Rx")
Reported-by: Qi Zhang 
Signed-off-by: Björn Töpel 
---
 net/xdp/xsk.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 36919a254ba3..3b3410ada097 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -118,6 +118,9 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	u64 addr;
 	int err;
 
+	if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
+		return -EINVAL;
+
 	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
 	    len > xs->umem->chunk_size_nohr) {
 		xs->rx_dropped++;
-- 
2.14.1



[PATCH bpf] xsk: silence warning on memory allocation failure

2018-06-11 Thread Björn Töpel
From: Björn Töpel 

syzkaller reported a warning from xdp_umem_pin_pages():

  WARNING: CPU: 1 PID: 4537 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996
  ...
  __do_kmalloc mm/slab.c:3713 [inline]
  __kmalloc+0x25/0x760 mm/slab.c:3727
  kmalloc_array include/linux/slab.h:634 [inline]
  kcalloc include/linux/slab.h:645 [inline]
  xdp_umem_pin_pages net/xdp/xdp_umem.c:205 [inline]
  xdp_umem_reg net/xdp/xdp_umem.c:318 [inline]
  xdp_umem_create+0x5c9/0x10f0 net/xdp/xdp_umem.c:349
  xsk_setsockopt+0x443/0x550 net/xdp/xsk.c:531
  __sys_setsockopt+0x1bd/0x390 net/socket.c:1935
  __do_sys_setsockopt net/socket.c:1946 [inline]
  __se_sys_setsockopt net/socket.c:1943 [inline]
  __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1943
  do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is a warning about attempting to allocate more than
KMALLOC_MAX_SIZE memory. The request originates from userspace, and if
the request is too big, the kernel is free to deny its allocation. In
this patch, the failed allocation attempt is silenced with
__GFP_NOWARN.
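
As an illustration (hypothetical values and names -- xsk_fd, buf --
not taken from the syzkaller reproducer), a userspace request like
the following would ask xdp_umem_pin_pages() to kcalloc() a
page-pointer array far beyond KMALLOC_MAX_SIZE:

	struct xdp_umem_reg mr = {
		.addr = (__u64)(unsigned long)buf,	/* some mmap'd area */
		.len = 1ULL << 40,	/* absurdly large UMEM */
		.chunk_size = 2048,
		.headroom = 0,
	};

	/* The kernel is free to deny this; with __GFP_NOWARN it now
	 * fails quietly with -ENOMEM instead of splatting a warning.
	 */
	setsockopt(xsk_fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));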

Fixes: c0c77d8fb787 ("xsk: add user memory registration support sockopt")
Reported-by: syzbot+4abadc5d69117b346...@syzkaller.appspotmail.com
Signed-off-by: Björn Töpel 
---
 net/xdp/xdp_umem.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index b9ef487c4618..f47abb46c587 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -204,7 +204,8 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem)
 	long npgs;
 	int err;
 
-	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs), GFP_KERNEL);
+	umem->pgs = kcalloc(umem->npgs, sizeof(*umem->pgs),
+			    GFP_KERNEL | __GFP_NOWARN);
 	if (!umem->pgs)
 		return -ENOMEM;
 
-- 
2.14.1


