Re: [PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-03-04 Thread Magnus Karlsson
On Thu, 29 Feb 2024 at 13:52, wangyunjian  wrote:
>
> > -Original Message-
> > From: Paolo Abeni [mailto:pab...@redhat.com]
> > Sent: Thursday, February 29, 2024 6:43 PM
> > To: wangyunjian ; m...@redhat.com;
> > willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> > bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> > jonathan.le...@gmail.com; da...@davemloft.net
> > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> > 
> > Subject: Re: [PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check 
> > in
> > xp_assign_dev
> >
> > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > DMA mappings are used by physical NICs, but a vNIC may not need
> > > them. So remove the non-zero 'dma_page' check in
> > > xp_assign_dev().
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  net/xdp/xsk_buff_pool.c | 7 ---
> > >  1 file changed, 7 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index
> > > ce60ecd48a4d..a5af75b1f43c 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > > if (err)
> > > goto err_unreg_pool;
> > >
> > > -   if (!pool->dma_pages) {
> > > -   WARN(1, "Driver did not DMA map zero-copy buffers");
> > > -   err = -EINVAL;
> > > -   goto err_unreg_xsk;
> > > -   }
> >
> > This would unconditionally remove an otherwise valid check for most
> > NICs. What about letting the driver declare it won't need a DMA map
> > with a (pool?) flag?
>
> This check is redundant. The NIC's driver determines whether a DMA map
> is required. If the NIC's driver requires the DMA map, it uses the
> xsk_pool_dma_map() function, which initializes the DMA map and performs
> a check itself.

Just to provide some context: I put this check there many years ago to
guard against a zero-copy driver writer forgetting to call
xsk_pool_dma_map() during the implementation phase. A working driver
will always have pool->dma_pages != NULL. If you both think that this
check is too much of a precaution, then I have no problem getting rid
of it. Just thought that a text warning would be nicer than a crash
later.
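
For illustration, the flag idea could look roughly like the sketch
below. This is only a sketch: XSK_POOL_F_NO_DMA_MAP and the pool->flags
field are made-up names for the hypothetical opt-out, not existing
kernel symbols.

```
/* Hypothetical: let a vNIC driver declare that it does not DMA map,
 * so the check stays meaningful for physical NICs.
 */
if (!pool->dma_pages && !(pool->flags & XSK_POOL_F_NO_DMA_MAP)) {
	WARN(1, "Driver did not DMA map zero-copy buffers");
	err = -EINVAL;
	goto err_unreg_xsk;
}
```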

Thanks: Magnus

> Thanks
>
> >
> > Cheers,
> >
> > Paolo
>



Re: [PATCH v2 bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-04-13 Thread Magnus Karlsson
On Tue, Apr 13, 2021 at 3:49 AM Xuan Zhuo  wrote:
>
> On Mon, 12 Apr 2021 16:13:12 +0200, Magnus Karlsson 
>  wrote:
> > On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin  wrote:
> > >
> > > This series is based on the exceptional generic zerocopy xmit logics
> > > initially introduced by Xuan Zhuo. It extends it the way that it
> > > could cover all the sane drivers, not only the ones that are capable
> > > of xmitting skbs with no linear space.
> > >
> > > The first patch is a random while-we-are-here improvement over
> > > full-copy path, and the second is the main course. See the individual
> > > commit messages for the details.
> > >
> > > The original (full-zerocopy) path is still here and still generally
> > > faster, but for now it seems like virtio_net will remain the only
> > > user of it, at least for a considerable period of time.
> > >
> > > From v1 [0]:
> > >  - don't add a whole SMP_CACHE_BYTES because of only two bytes
> > >(NET_IP_ALIGN);
> > >  - switch to zerocopy if the frame is 129 bytes or longer, not 128.
> > >128 still fits in kmalloc-512, while a zerocopy skb is always
> > >kmalloc-1024 -> can potentially be slower on this frame size.
> > >
> > > [0] https://lore.kernel.org/netdev/20210330231528.546284-1-aloba...@pm.me
> > >
> > > Alexander Lobakin (2):
> > >   xsk: speed-up generic full-copy xmit
> >
> > I took both your patches for a spin on my machine and for the first
> > one I do see a small but consistent drop in performance. I thought it
> > would go the other way, but it does not, so let us put this one on the
> > shelf for now.
> >
> > >   xsk: introduce generic almost-zerocopy xmit
> >
> > This one wreaked havoc on my machine ;-). The performance dropped by
> > 75% for packets larger than 128 bytes when the new scheme kicks in.
> > Checking with perf top, it seems that we spend much more time
> > executing the sendmsg syscall. Analyzing some more:
> >
> > $ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
> > interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
> > Attaching 2 probes...
> > calls/sec: 1539509 with your patch compared to
> >
> > calls/sec: 105796 without your patch
> >
> > The application spends a lot more time trying to get the kernel to
> > send new packets, but the kernel replies with "have not completed the
> > outstanding ones, so come back later" = EAGAIN. Seems like the
> > transmission takes longer when the skbs have fragments, but I have not
> > examined this any further. Did you get a speed-up?
>
> Regarding this solution, I actually tested it on my mlx5 network card,
> but the performance was severely degraded, so I did not pursue it
> further. I guess it might have something to do with the physical
> network card. We could try other network cards.

I tried it on a third card and got a 40% degradation, so let us scrap
this idea. It should stay optional, as it is today: the (software)
drivers that benefit from it can turn it on explicitly.
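
For context, the explicit opt-in is the IFF_TX_SKB_NO_LINEAR private
flag discussed later in this archive; a driver that can transmit skbs
with an empty linear area sets it at probe time. A minimal sketch
(virtio_net is the example driver here):

```
/* Sketch: opt the device in to the zerocopy skb path; with this flag
 * set, xsk_build_skb() builds the skb from umem pages instead of
 * copying the data into a linear area.
 */
dev->priv_flags |= IFF_TX_SKB_NO_LINEAR;
```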

> links: https://www.spinics.net/lists/netdev/msg710918.html
>
> Thanks.
>
> >
> > >  net/xdp/xsk.c | 32 ++--
> > >  1 file changed, 22 insertions(+), 10 deletions(-)
> > >
> > > --
> > > Well, this is untested. I currently don't have access to my setup
> > > and am tied up with moving to another country, and as I don't know
> > > for sure when I'll next get back to working on the kernel, I thought
> > > it worth publishing this now -- if any further changes are required
> > > while I'm out of reach, maybe someone could carry on and make
> > > another revision (I'm still here for any questions, comments,
> > > reviews and improvements till the end of this week).
> > > But this *should* work with all the sane drivers. If a particular
> > > one won't handle this, it's likely ill. Any tests are highly
> > > appreciated. Thanks!
> > > --
> > > 2.31.1
> > >
> > >


Re: [PATCH v2 bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-04-12 Thread Magnus Karlsson
On Wed, Mar 31, 2021 at 2:27 PM Alexander Lobakin  wrote:
>
> This series is based on the exceptional generic zerocopy xmit logic
> initially introduced by Xuan Zhuo. It extends it so that it can cover
> all the sane drivers, not only the ones that are capable of xmitting
> skbs with no linear space.
>
> The first patch is a random while-we-are-here improvement over
> full-copy path, and the second is the main course. See the individual
> commit messages for the details.
>
> The original (full-zerocopy) path is still here and still generally
> faster, but for now it seems like virtio_net will remain the only
> user of it, at least for a considerable period of time.
>
> From v1 [0]:
>  - don't add a whole SMP_CACHE_BYTES because of only two bytes
>(NET_IP_ALIGN);
>  - switch to zerocopy if the frame is 129 bytes or longer, not 128.
>128 still fits in kmalloc-512, while a zerocopy skb is always
>kmalloc-1024 -> can potentially be slower on this frame size.
>
> [0] https://lore.kernel.org/netdev/20210330231528.546284-1-aloba...@pm.me
>
> Alexander Lobakin (2):
>   xsk: speed-up generic full-copy xmit

I took both your patches for a spin on my machine and for the first
one I do see a small but consistent drop in performance. I thought it
would go the other way, but it does not, so let us put this one on the
shelf for now.

>   xsk: introduce generic almost-zerocopy xmit

This one wreaked havoc on my machine ;-). The performance dropped by
75% for packets larger than 128 bytes when the new scheme kicks in.
Checking with perf top, it seems that we spend much more time
executing the sendmsg syscall. Analyzing some more:

$ sudo bpftrace -e 'kprobe:__sys_sendto { @calls = @calls + 1; }
interval:s:1 {printf("calls/sec: %d\n", @calls); @calls = 0;}'
Attaching 2 probes...
calls/sec: 1539509 with your patch compared to

calls/sec: 105796 without your patch

The application spends a lot more time trying to get the kernel to
send new packets, but the kernel replies with "have not completed the
outstanding ones, so come back later" = EAGAIN. Seems like the
transmission takes longer when the skbs have fragments, but I have not
examined this any further. Did you get a speed-up?
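
For anyone reproducing this kind of measurement: the syscall pressure
comes from the tx wakeup loop in copy mode, which looks roughly like
the sketch below. kick_tx is an illustrative helper in the spirit of
the xdpsock sample, not a kernel API.

```
#include <errno.h>
#include <stdlib.h>
#include <sys/socket.h>

/* Sketch: wake the kernel so it transmits the queued AF_XDP
 * descriptors. EAGAIN/EBUSY mean "outstanding packets not completed
 * yet, come back later", so the application retries -- which is what
 * inflates the sendto call rate measured with bpftrace above.
 */
static void kick_tx(int xsk_fd)
{
	int ret = sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);

	if (ret >= 0 || errno == EAGAIN || errno == EBUSY || errno == ENOBUFS)
		return;
	exit(EXIT_FAILURE);
}
```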

>  net/xdp/xsk.c | 32 ++--
>  1 file changed, 22 insertions(+), 10 deletions(-)
>
> --
> Well, this is untested. I currently don't have access to my setup
> and am tied up with moving to another country, and as I don't know
> for sure when I'll next get back to working on the kernel, I thought
> it worth publishing this now -- if any further changes are required
> while I'm out of reach, maybe someone could carry on and make
> another revision (I'm still here for any questions, comments,
> reviews and improvements till the end of this week).
> But this *should* work with all the sane drivers. If a particular
> one won't handle this, it's likely ill. Any tests are highly
> appreciated. Thanks!
> --
> 2.31.1
>
>


Re: [PATCH bpf-next 0/2] xsk: introduce generic almost-zerocopy xmit

2021-03-31 Thread Magnus Karlsson
On Wed, Mar 31, 2021 at 1:17 AM Alexander Lobakin  wrote:
>
> This series is based on the exceptional generic zerocopy xmit logic
> initially introduced by Xuan Zhuo. It extends it so that it can cover
> all the sane drivers, not only the ones that are capable of xmitting
> skbs with no linear space.
>
> The first patch is a random while-we-are-here improvement over
> full-copy path, and the second is the main course. See the individual
> commit messages for the details.
>
> The original (full-zerocopy) path is still here and still generally
> faster, but for now it seems like virtio_net will remain the only
> user of it, at least for a considerable period of time.
>
> Alexander Lobakin (2):
>   xsk: speed-up generic full-copy xmit
>   xsk: introduce generic almost-zerocopy xmit
>
>  net/xdp/xsk.c | 33 +++--
>  1 file changed, 23 insertions(+), 10 deletions(-)
>
> --
> Well, this is untested. I currently don't have access to my setup
> and am tied up with moving to another country, and as I don't know
> for sure when I'll next get back to working on the kernel, I thought
> it worth publishing this now -- if any further changes are required
> while I'm out of reach, maybe someone could carry on and make
> another revision (I'm still here for any questions, comments,
> reviews and improvements till the end of this week).
> But this *should* work with all the sane drivers. If a particular
> one won't handle this, it's likely ill.

Thanks Alexander. I will take your patches for a spin on a couple of
NICs and get back to you, though it will be next week due to holidays
where I am based.

> --
> 2.31.1
>
>


Re: [PATCH] net: xdp: fix error return code of xsk_generic_xmit()

2021-03-05 Thread Magnus Karlsson
On Fri, Mar 5, 2021 at 10:28 AM Jia-Ju Bai  wrote:
>
> When err is zero but xskq_prod_reserve() fails, no error return code
> is assigned in xsk_generic_xmit().
> To fix this bug, err is assigned the return value of
> xskq_prod_reserve(), and then err is checked.

This error is ignored by design. This is so that the zero-copy path
and the copy/skb path will return the same value (success in this
case) when the completion ring does not have a spare entry in which we
can put the future completion. The problem lies with the zero-copy
path, which is asynchronous, in contrast to the skb path, which is
synchronous. The zero-copy path cannot return an error when this
happens, as the reservation in the completion ring is performed by the
driver, which might run concurrently on another core with no way to
feed this back to the syscall, and which in any case does not wait for
the driver to execute. Introducing a return value for this condition
right now for the skb case might break existing applications.

Though it would be really good if you could submit a small patch to
bpf-next that adds a comment explaining this to avoid any future
confusion. Something along the lines of: /* The error code of
xskq_prod_reserve is ignored so that skb mode will mimic the same
behavior as zero-copy mode that does not signal an error in this case
as it cannot. */. You could put it right after the if statement.
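
A sketch of what that comment-only change could look like, paraphrasing
the suggested wording (illustrative, not a submitted patch):

```
	spin_lock_irqsave(&xs->pool->cq_lock, flags);
	/* The error from xskq_prod_reserve() is deliberately ignored so
	 * that skb mode mimics zero-copy mode, which cannot signal this
	 * condition: there, the reservation is done asynchronously by
	 * the driver, possibly on another core.
	 */
	if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
		spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
		kfree_skb(skb);
		goto out;
	}
```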

Thank you: Magnus

> The spinlock is only used to protect the call to xskq_prod_reserve().
>
> Reported-by: TOTE Robot 
> Signed-off-by: Jia-Ju Bai 
> ---
>  net/xdp/xsk.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4faabd1ecfd1..f1c1db07dd07 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -484,8 +484,14 @@ static int xsk_generic_xmit(struct sock *sk)
>  * if there is space in it. This avoids having to implement
>  * any buffering in the Tx path.
>  */
> +   if (unlikely(err)) {
> +   kfree_skb(skb);
> +   goto out;
> +   }
> +
> spin_lock_irqsave(&xs->pool->cq_lock, flags);
> -   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> +   err = xskq_prod_reserve(xs->pool->cq);
> +   if (unlikely(err)) {
> spin_unlock_irqrestore(&xs->pool->cq_lock, flags);
> kfree_skb(skb);
> goto out;
> --
> 2.17.1
>


Re: [PATCH v4 bpf-next 5/6] xsk: respect device's headroom and tailroom on generic xmit path

2021-02-16 Thread Magnus Karlsson
On Tue, Feb 16, 2021 at 12:44 PM Alexander Lobakin  wrote:
>
> xsk_generic_xmit() allocates a new skb and then queues it for
> xmitting. The new skb is allocated with a size of just desc->len, so
> it comes to the driver/device with no reserved headroom and/or
> tailroom. Lots of drivers need some headroom (and sometimes tailroom)
> to prepend (and/or append) headers or data, e.g. CPU tags,
> device-specific headers/descriptors (LSO, TLS etc.), and in case of
> no available space skb_cow_head() will reallocate the skb.
> Reallocations are unwanted on the fast path, especially when it comes
> to XDP, so generic XSK xmit should reserve the space declared in
> dev->needed_headroom and dev->needed_tailroom to avoid them.
>
> Note on max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom)):
>
> Usually, output functions reserve LL_RESERVED_SPACE(dev), which
> consists of dev->hard_header_len + dev->needed_headroom, aligned
> by 16.
> However, on XSK xmit the hard header is already present in the chunk,
> so hard_header_len is not needed. But it'd still be better to align
> the data up to a cacheline, while reserving no less than the driver
> requests for headroom. NET_SKB_PAD here is to double-insure there
> will be no reallocations even when the driver advertises no
> needed_headroom but in fact needs it (not so rare a case).
>
> Fixes: 35fcde7f8deb ("xsk: support for Tx")
> Signed-off-by: Alexander Lobakin 

Acked-by: Magnus Karlsson 

> ---
>  net/xdp/xsk.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4faabd1ecfd1..143979ea4165 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -454,12 +454,16 @@ static int xsk_generic_xmit(struct sock *sk)
> struct sk_buff *skb;
> unsigned long flags;
> int err = 0;
> +   u32 hr, tr;
>
> mutex_lock(>mutex);
>
> if (xs->queue_id >= xs->dev->real_num_tx_queues)
> goto out;
>
> +   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
> +   tr = xs->dev->needed_tailroom;
> +
> while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> char *buffer;
> u64 addr;
> @@ -471,11 +475,13 @@ static int xsk_generic_xmit(struct sock *sk)
> }
>
> len = desc.len;
> -   skb = sock_alloc_send_skb(sk, len, 1, &err);
> +   skb = sock_alloc_send_skb(sk, hr + len + tr, 1, &err);
> if (unlikely(!skb))
> goto out;
>
> +   skb_reserve(skb, hr);
> skb_put(skb, len);
> +
> addr = desc.addr;
> buffer = xsk_buff_raw_get_data(xs->pool, addr);
> err = skb_store_bits(skb, 0, buffer, len);
> --
> 2.30.1
>
>


Re: [PATCH v4 bpf-next 6/6] xsk: build skb by page (aka generic zerocopy xmit)

2021-02-16 Thread Magnus Karlsson
On Tue, Feb 16, 2021 at 12:44 PM Alexander Lobakin  wrote:
>
> From: Xuan Zhuo 
>
> This patch constructs the skb based on pages to save memory copy
> overhead.
>
> This is implemented based on IFF_TX_SKB_NO_LINEAR: only network cards
> whose priv_flags support IFF_TX_SKB_NO_LINEAR will use pages to
> construct the skb directly. If this feature is not supported, it is
> still necessary to copy data to construct the skb.
>
>  Performance Testing 
>
> The test environment is Aliyun ECS server.
> Test cmd:
> ```
> xdpsock -i eth0 -t  -S -s 
> ```
>
> Test result data:
>
> size64  512 10241500
> copy1916747 1775988 1600203 1440054
> page1974058 1953655 1945463 1904478
> percent 3.0%10.0%   21.58%  32.3%
>
> Signed-off-by: Xuan Zhuo 
> Reviewed-by: Dust Li 
> [ alobakin:
>  - expand subject to make it clearer;
>  - improve skb->truesize calculation;
>  - reserve some headroom in skb for drivers;
>  - tailroom is not needed as skb is non-linear ]
> Signed-off-by: Alexander Lobakin 

Thank you Alexander!

Acked-by: Magnus Karlsson 

> ---
>  net/xdp/xsk.c | 119 --
>  1 file changed, 95 insertions(+), 24 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 143979ea4165..ff7bd06e1241 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -445,6 +445,96 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> sock_wfree(skb);
>  }
>
> +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> + struct xdp_desc *desc)
> +{
> +   struct xsk_buff_pool *pool = xs->pool;
> +   u32 hr, len, offset, copy, copied;
> +   struct sk_buff *skb;
> +   struct page *page;
> +   void *buffer;
> +   int err, i;
> +   u64 addr;
> +
> +   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(xs->dev->needed_headroom));
> +
> +   skb = sock_alloc_send_skb(&xs->sk, hr, 1, &err);
> +   if (unlikely(!skb))
> +   return ERR_PTR(err);
> +
> +   skb_reserve(skb, hr);
> +
> +   addr = desc->addr;
> +   len = desc->len;
> +
> +   buffer = xsk_buff_raw_get_data(pool, addr);
> +   offset = offset_in_page(buffer);
> +   addr = buffer - pool->addrs;
> +
> +   for (copied = 0, i = 0; copied < len; i++) {
> +   page = pool->umem->pgs[addr >> PAGE_SHIFT];
> +   get_page(page);
> +
> +   copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> +   skb_fill_page_desc(skb, i, page, offset, copy);
> +
> +   copied += copy;
> +   addr += copy;
> +   offset = 0;
> +   }
> +
> +   skb->len += len;
> +   skb->data_len += len;
> +   skb->truesize += pool->unaligned ? len : pool->chunk_size;
> +
> +   refcount_add(skb->truesize, &xs->sk.sk_wmem_alloc);
> +
> +   return skb;
> +}
> +
> +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> +struct xdp_desc *desc)
> +{
> +   struct net_device *dev = xs->dev;
> +   struct sk_buff *skb;
> +
> +   if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> +   skb = xsk_build_skb_zerocopy(xs, desc);
> +   if (IS_ERR(skb))
> +   return skb;
> +   } else {
> +   u32 hr, tr, len;
> +   void *buffer;
> +   int err;
> +
> +   hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> +   tr = dev->needed_tailroom;
> +   len = desc->len;
> +
> +   skb = sock_alloc_send_skb(&xs->sk, hr + len + tr, 1, &err);
> +   if (unlikely(!skb))
> +   return ERR_PTR(err);
> +
> +   skb_reserve(skb, hr);
> +   skb_put(skb, len);
> +
> +   buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> +   err = skb_store_bits(skb, 0, buffer, len);
> +   if (unlikely(err)) {
> +   kfree_skb(skb);
> +   return ERR_PTR(err);
> +   }
> +   }
> +
> +   skb->dev = dev;
> +   skb->priority = xs->sk.sk_priority;
> +   skb->mark = xs->sk.sk_mark;
> +   skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
> +   skb->destructor = xsk_destruct_skb;
> +
> +   return skb;
> +}
> +
>  static int xsk_generic_xmit(struct sock *sk)
>  {
> struct xdp_soc

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-26 Thread Magnus Karlsson
On Mon, Jan 25, 2021 at 4:22 PM Xuan Zhuo  wrote:
>
> On Mon, 25 Jan 2021 14:16:16 +0100, Magnus Karlsson 
>  wrote:
> > On Mon, Jan 25, 2021 at 1:43 PM Xuan Zhuo  
> > wrote:
> > >
> > > On Mon, 25 Jan 2021 08:44:38 +0100, Magnus Karlsson 
> > >  wrote:
> > > > On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson 
> > > > >  wrote:
> > > > > > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin  
> > > > > > wrote:
> > > > > > >
> > > > > > > From: Xuan Zhuo 
> > > > > > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > > > > > >
> > > > > > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson 
> > > > > > > >  wrote:
> > > > > > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > From: Magnus Karlsson 
> > > > > > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > > > > > >
> > > > > > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin 
> > > > > > > > > > >  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > From: Alexander Lobakin 
> > > > > > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +
> > > > > > > > > > > >
> > > > > > > > > > > > > From: Eric Dumazet 
> > > > > > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > > > > > This patch is used to construct skb based on page 
> > > > > > > > > > > > > > > to save memory copy
> > > > > > > > > > > > > > > overhead.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > This function is implemented based on 
> > > > > > > > > > > > > > > IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > > > > > network card priv_flags supports 
> > > > > > > > > > > > > > > IFF_TX_SKB_NO_LINEAR will use page to
> > > > > > > > > > > > > > > directly construct skb. If this feature is not 
> > > > > > > > > > > > > > > supported, it is still
> > > > > > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >  Performance Testing 
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > > > > > Test cmd:
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > > xdpsock -i eth0 -t  -S -s 
> > > > > > > > > > > > > > > ```
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Test result data:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > size64  512 10241500
> > > > > > > > > > > > > > > copy1916747 1775988 1600203 1440054
> > > > > > > > > > > > > > > page1974058 1953655 1945463 1904478
> > > > > > > > > > > > > > > percent 3.0%10.0%   21.58%  32.3%
> > > > > > > > > > > > > > >
> > > > > > >

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-25 Thread Magnus Karlsson
On Mon, Jan 25, 2021 at 1:43 PM Xuan Zhuo  wrote:
>
> On Mon, 25 Jan 2021 08:44:38 +0100, Magnus Karlsson 
>  wrote:
> > On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo  
> > wrote:
> > >
> > > On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson 
> > >  wrote:
> > > > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin  
> > > > wrote:
> > > > >
> > > > > From: Xuan Zhuo 
> > > > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > > > >
> > > > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson 
> > > > > >  wrote:
> > > > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > From: Magnus Karlsson 
> > > > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > > > >
> > > > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin 
> > > > > > > > >  wrote:
> > > > > > > > > >
> > > > > > > > > > From: Alexander Lobakin 
> > > > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +
> > > > > > > > > >
> > > > > > > > > > > From: Eric Dumazet 
> > > > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > > > >
> > > > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > > > This patch is used to construct skb based on page to 
> > > > > > > > > > > > > save memory copy
> > > > > > > > > > > > > overhead.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This function is implemented based on 
> > > > > > > > > > > > > IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR 
> > > > > > > > > > > > > will use page to
> > > > > > > > > > > > > directly construct skb. If this feature is not 
> > > > > > > > > > > > > supported, it is still
> > > > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > > > >
> > > > > > > > > > > > >  Performance Testing 
> > > > > > > > > > > > >
> > > > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > > > Test cmd:
> > > > > > > > > > > > > ```
> > > > > > > > > > > > > xdpsock -i eth0 -t  -S -s 
> > > > > > > > > > > > > ```
> > > > > > > > > > > > >
> > > > > > > > > > > > > Test result data:
> > > > > > > > > > > > >
> > > > > > > > > > > > > size64  512 10241500
> > > > > > > > > > > > > copy1916747 1775988 1600203 1440054
> > > > > > > > > > > > > page1974058 1953655 1945463 1904478
> > > > > > > > > > > > > percent 3.0%10.0%   21.58%  32.3%
> > > > > > > > > > > > >
> > > > > > > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > > > > > > > Reviewed-by: Dust Li 
> > > > > > > > > > > > > ---
> > > > > > > > > > > > >  net/xdp/xsk.c | 104 
> > > > > > > > > > > > > --
> > > > > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > > > > >
> > > > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > &g

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-25 Thread Magnus Karlsson
On Mon, Jan 25, 2021 at 3:27 AM Xuan Zhuo  wrote:
>
> On Fri, 22 Jan 2021 19:37:06 +0100, Magnus Karlsson 
>  wrote:
> > On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin  wrote:
> > >
> > > From: Xuan Zhuo 
> > > Date: Fri, 22 Jan 2021 23:39:15 +0800
> > >
> > > > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson 
> > > >  wrote:
> > > > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin  
> > > > > wrote:
> > > > > >
> > > > > > From: Magnus Karlsson 
> > > > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > > > >
> > > > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > From: Alexander Lobakin 
> > > > > > > > Date: Fri, 22 Jan 2021 11:47:45 +
> > > > > > > >
> > > > > > > > > From: Eric Dumazet 
> > > > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > > > >
> > > > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > > > This patch is used to construct skb based on page to save 
> > > > > > > > > > > memory copy
> > > > > > > > > > > overhead.
> > > > > > > > > > >
> > > > > > > > > > > This function is implemented based on 
> > > > > > > > > > > IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR 
> > > > > > > > > > > will use page to
> > > > > > > > > > > directly construct skb. If this feature is not supported, 
> > > > > > > > > > > it is still
> > > > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > > > >
> > > > > > > > > > >  Performance Testing 
> > > > > > > > > > >
> > > > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > > > Test cmd:
> > > > > > > > > > > ```
> > > > > > > > > > > xdpsock -i eth0 -t  -S -s 
> > > > > > > > > > > ```
> > > > > > > > > > >
> > > > > > > > > > > Test result data:
> > > > > > > > > > >
> > > > > > > > > > > size64  512 10241500
> > > > > > > > > > > copy1916747 1775988 1600203 1440054
> > > > > > > > > > > page1974058 1953655 1945463 1904478
> > > > > > > > > > > percent 3.0%10.0%   21.58%  32.3%
> > > > > > > > > > >
> > > > > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > > > > > Reviewed-by: Dust Li 
> > > > > > > > > > > ---
> > > > > > > > > > >  net/xdp/xsk.c | 104 
> > > > > > > > > > > --
> > > > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > > > >
> > > > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct 
> > > > > > > > > > > sk_buff *skb)
> > > > > > > > > > >   sock_wfree(skb);
> > > > > > > > > > >  }
> > > > > > > > > > >
> > > > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct 
> > > > > > > > > > > xdp_sock *xs,
> > > > > > > > > > > +

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-22 Thread Magnus Karlsson
On Fri, Jan 22, 2021 at 6:26 PM Alexander Lobakin  wrote:
>
> From: Xuan Zhuo 
> Date: Fri, 22 Jan 2021 23:39:15 +0800
>
> > On Fri, 22 Jan 2021 13:55:14 +0100, Magnus Karlsson 
> >  wrote:
> > > On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin  wrote:
> > > >
> > > > From: Magnus Karlsson 
> > > > Date: Fri, 22 Jan 2021 13:18:47 +0100
> > > >
> > > > > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin  
> > > > > wrote:
> > > > > >
> > > > > > From: Alexander Lobakin 
> > > > > > Date: Fri, 22 Jan 2021 11:47:45 +
> > > > > >
> > > > > > > From: Eric Dumazet 
> > > > > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > > > > >
> > > > > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > > > > This patch is used to construct skb based on page to save 
> > > > > > > > > memory copy
> > > > > > > > > overhead.
> > > > > > > > >
> > > > > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. 
> > > > > > > > > Only the
> > > > > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will 
> > > > > > > > > use page to
> > > > > > > > > directly construct skb. If this feature is not supported, it 
> > > > > > > > > is still
> > > > > > > > > necessary to copy data to construct skb.
> > > > > > > > >
> > > > > > > > >  Performance Testing 
> > > > > > > > >
> > > > > > > > > The test environment is Aliyun ECS server.
> > > > > > > > > Test cmd:
> > > > > > > > > ```
> > > > > > > > > xdpsock -i eth0 -t  -S -s 
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > Test result data:
> > > > > > > > >
> > > > > > > > > size64  512 10241500
> > > > > > > > > copy1916747 1775988 1600203 1440054
> > > > > > > > > page1974058 1953655 1945463 1904478
> > > > > > > > > percent 3.0%10.0%   21.58%  32.3%
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > > > Reviewed-by: Dust Li 
> > > > > > > > > ---
> > > > > > > > >  net/xdp/xsk.c | 104 
> > > > > > > > > --
> > > > > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > > > > >
> > > > > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > > > > index 4a83117..38af7f1 100644
> > > > > > > > > --- a/net/xdp/xsk.c
> > > > > > > > > +++ b/net/xdp/xsk.c
> > > > > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct 
> > > > > > > > > sk_buff *skb)
> > > > > > > > >   sock_wfree(skb);
> > > > > > > > >  }
> > > > > > > > >
> > > > > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct 
> > > > > > > > > xdp_sock *xs,
> > > > > > > > > +   struct xdp_desc *desc)
> > > > > > > > > +{
> > > > > > > > > + u32 len, offset, copy, copied;
> > > > > > > > > + struct sk_buff *skb;
> > > > > > > > > + struct page *page;
> > > > > > > > > + void *buffer;
> > > > > > > > > + int err, i;
> > > > > > > > > + u64 addr;
> > > > > > > > > +
> > > > > > > > > + skb = sock_alloc_send_skb(>sk, 0, 1, );
> > > > > > > > > + if (unlikely(!skb))
> > > > > > > > > + return ERR_PTR(err);
> > > > > > > > > +
> > > > &

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-22 Thread Magnus Karlsson
On Fri, Jan 22, 2021 at 1:39 PM Alexander Lobakin  wrote:
>
> From: Magnus Karlsson 
> Date: Fri, 22 Jan 2021 13:18:47 +0100
>
> > On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin  wrote:
> > >
> > > From: Alexander Lobakin 
> > > Date: Fri, 22 Jan 2021 11:47:45 +
> > >
> > > > From: Eric Dumazet 
> > > > Date: Thu, 21 Jan 2021 16:41:33 +0100
> > > >
> > > > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > > > This patch is used to construct skb based on page to save memory 
> > > > > > copy
> > > > > > overhead.
> > > > > >
> > > > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page 
> > > > > > to
> > > > > > directly construct skb. If this feature is not supported, it is 
> > > > > > still
> > > > > > necessary to copy data to construct skb.
> > > > > >
> > > > > >  Performance Testing 
> > > > > >
> > > > > > The test environment is Aliyun ECS server.
> > > > > > Test cmd:
> > > > > > ```
> > > > > > xdpsock -i eth0 -t  -S -s 
> > > > > > ```
> > > > > >
> > > > > > Test result data:
> > > > > >
> > > > > > size64  512 10241500
> > > > > > copy1916747 1775988 1600203 1440054
> > > > > > page1974058 1953655 1945463 1904478
> > > > > > percent 3.0%10.0%   21.58%  32.3%
> > > > > >
> > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > Reviewed-by: Dust Li 
> > > > > > ---
> > > > > >  net/xdp/xsk.c | 104 
> > > > > > --
> > > > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > > > >
> > > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > > index 4a83117..38af7f1 100644
> > > > > > --- a/net/xdp/xsk.c
> > > > > > +++ b/net/xdp/xsk.c
> > > > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff 
> > > > > > *skb)
> > > > > >   sock_wfree(skb);
> > > > > >  }
> > > > > >
> > > > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > > > +   struct xdp_desc *desc)
> > > > > > +{
> > > > > > + u32 len, offset, copy, copied;
> > > > > > + struct sk_buff *skb;
> > > > > > + struct page *page;
> > > > > > + void *buffer;
> > > > > > + int err, i;
> > > > > > + u64 addr;
> > > > > > +
> > > > > > + skb = sock_alloc_send_skb(>sk, 0, 1, );
> > > > > > + if (unlikely(!skb))
> > > > > > + return ERR_PTR(err);
> > > > > > +
> > > > > > + addr = desc->addr;
> > > > > > + len = desc->len;
> > > > > > +
> > > > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > > > + offset = offset_in_page(buffer);
> > > > > > + addr = buffer - xs->pool->addrs;
> > > > > > +
> > > > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > > > +
> > > > > > + get_page(page);
> > > > > > +
> > > > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > > > +
> > > > > > + skb_fill_page_desc(skb, i, page, offset, copy);
> > > > > > +
> > > > > > + copied += copy;
> > > > > > + addr += copy;
> > > > > > + offset = 0;
> > > > > > + }
> > > > > > +
> > > > > > + skb->len += len;
> > > > > > + skb->data_len += len;
> > > > >
> > > > > > + skb->truesize += len;
> > > > >
> > > > > This is n

Re: [PATCH bpf-next v3 3/3] xsk: build skb by page

2021-01-22 Thread Magnus Karlsson
On Fri, Jan 22, 2021 at 12:57 PM Alexander Lobakin  wrote:
>
> From: Alexander Lobakin 
> Date: Fri, 22 Jan 2021 11:47:45 +
>
> > From: Eric Dumazet 
> > Date: Thu, 21 Jan 2021 16:41:33 +0100
> >
> > > On 1/21/21 2:47 PM, Xuan Zhuo wrote:
> > > > This patch is used to construct skb based on page to save memory copy
> > > > overhead.
> > > >
> > > > This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only the
> > > > network card priv_flags supports IFF_TX_SKB_NO_LINEAR will use page to
> > > > directly construct skb. If this feature is not supported, it is still
> > > > necessary to copy data to construct skb.
> > > >
> > > >  Performance Testing 
> > > >
> > > > The test environment is Aliyun ECS server.
> > > > Test cmd:
> > > > ```
> > > > xdpsock -i eth0 -t  -S -s 
> > > > ```
> > > >
> > > > Test result data:
> > > >
> > > > size64  512 10241500
> > > > copy1916747 1775988 1600203 1440054
> > > > page1974058 1953655 1945463 1904478
> > > > percent 3.0%10.0%   21.58%  32.3%
> > > >
> > > > Signed-off-by: Xuan Zhuo 
> > > > Reviewed-by: Dust Li 
> > > > ---
> > > >  net/xdp/xsk.c | 104 
> > > > --
> > > >  1 file changed, 86 insertions(+), 18 deletions(-)
> > > >
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 4a83117..38af7f1 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> > > >   sock_wfree(skb);
> > > >  }
> > > >
> > > > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > > > +   struct xdp_desc *desc)
> > > > +{
> > > > + u32 len, offset, copy, copied;
> > > > + struct sk_buff *skb;
> > > > + struct page *page;
> > > > + void *buffer;
> > > > + int err, i;
> > > > + u64 addr;
> > > > +
> > > > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > > > + if (unlikely(!skb))
> > > > + return ERR_PTR(err);
> > > > +
> > > > + addr = desc->addr;
> > > > + len = desc->len;
> > > > +
> > > > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > > + offset = offset_in_page(buffer);
> > > > + addr = buffer - xs->pool->addrs;
> > > > +
> > > > + for (copied = 0, i = 0; copied < len; i++) {
> > > > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > > > +
> > > > + get_page(page);
> > > > +
> > > > + copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > > > +
> > > > + skb_fill_page_desc(skb, i, page, offset, copy);
> > > > +
> > > > + copied += copy;
> > > > + addr += copy;
> > > > + offset = 0;
> > > > + }
> > > > +
> > > > + skb->len += len;
> > > > + skb->data_len += len;
> > >
> > > > + skb->truesize += len;
> > >
> > > This is not the truesize, unfortunately.
> > >
> > > We need to account for the number of pages, not number of bytes.
> >
> > The easiest solution is:
> >
> >   skb->truesize += PAGE_SIZE * i;
> >
> > i would be equal to skb_shinfo(skb)->nr_frags after exiting the loop.
>
> Oops, pls ignore this. I forgot that XSK buffers are not
> "one per page".
> We need to count the number of pages manually and then do
>
> skb->truesize += PAGE_SIZE * npages;
>
> Right.

There are two possible packet buffer (chunk) sizes in a umem, 2K and
4K on a system with a PAGE_SIZE of 4K. If I remember correctly, and
please correct me if wrong, truesize is used for memory accounting.
But in this code, no kernel memory has been allocated (apart from the
skb). The page is just a part of the umem that has already been
allocated beforehand, and by user-space in this case. So what should
truesize be here? Do we add 0, chunk_size * i, the complicated case of
counting exactly how many 4K pages are used when the chunk_size is 2K
(as two chunks could occupy the same page), or just the upper bound of
PAGE_SIZE * i, which is likely a good approximation in most cases?
Just note that there might be other uses of truesize that I am unaware
of that could impact this choice.
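
To make the "count exactly" option concrete, a rough sketch follows;
xsk_frag_truesize is a made-up helper, not existing kernel code:

```
/* Illustrative only: page memory actually spanned by a frag of 'len'
 * bytes starting 'offset' bytes into its first page. With 2K chunks,
 * two chunks can share one 4K page, so counting the pages touched
 * avoids the over-accounting of the PAGE_SIZE * i upper bound.
 */
static u32 xsk_frag_truesize(u32 offset, u32 len)
{
	u32 npages = DIV_ROUND_UP(offset + len, PAGE_SIZE);

	return npages * PAGE_SIZE;
}
```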

> > > > +
> > > > + refcount_add(len, >sk.sk_wmem_alloc);
> > > > +
> > > > + return skb;
> > > > +}
> > > > +
> >
> > Al
>
> Thanks,
> Al
>


Re: [PATCH net-next v2 3/3] xsk: build skb by page

2021-01-20 Thread Magnus Karlsson
On Wed, Jan 20, 2021 at 9:29 PM Alexander Lobakin  wrote:
>
> From: Xuan Zhuo 
> Date: Wed, 20 Jan 2021 16:30:56 +0800
>
> > This patch constructs the skb based on pages to save memory copy
> > overhead.
> >
> > This is implemented based on IFF_TX_SKB_NO_LINEAR: only network cards
> > whose priv_flags support IFF_TX_SKB_NO_LINEAR will use pages to
> > construct the skb directly. If this feature is not supported, it is
> > still necessary to copy data to construct the skb.
> >
> >  Performance Testing 
> >
> > The test environment is Aliyun ECS server.
> > Test cmd:
> > ```
> > xdpsock -i eth0 -t  -S -s 
> > ```
> >
> > Test result data:
> >
> > size64  512 10241500
> > copy1916747 1775988 1600203 1440054
> > page1974058 1953655 1945463 1904478
> > percent 3.0%10.0%   21.58%  32.3%
> >
> > Signed-off-by: Xuan Zhuo 
> > Reviewed-by: Dust Li 
> > ---
> >  net/xdp/xsk.c | 104 
> > --
> >  1 file changed, 86 insertions(+), 18 deletions(-)
>
> Now I like the result, thanks!
>
> But Patchwork still displays your series incorrectly (messages 0 and 1
> are missing). I'm concerned maintainers may not take it in this form.
> Try passing the folder's name, not folder/*.patch, to git send-email
> when sending, and don't use --in-reply-to when sending a new
> iteration.

Xuan,

Please make the new submission of the patch set a v3 even though you
did not change the code. Just so we can clearly see it is the new
submission.

> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 8037b04..40bac11 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> >   sock_wfree(skb);
> >  }
> >
> > +static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > +   struct xdp_desc *desc)
> > +{
> > + u32 len, offset, copy, copied;
> > + struct sk_buff *skb;
> > + struct page *page;
> > + void *buffer;
> > + int err, i;
> > + u64 addr;
> > +
> > + skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> > + if (unlikely(!skb))
> > + return ERR_PTR(err);
> > +
> > + addr = desc->addr;
> > + len = desc->len;
> > +
> > + buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > + offset = offset_in_page(buffer);
> > + addr = buffer - xs->pool->addrs;
> > +
> > + for (copied = 0, i = 0; copied < len; i++) {
> > + page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> > +
> > + get_page(page);
> > +
> > + copy = min_t(u32, PAGE_SIZE - offset, len - copied);
> > +
> > + skb_fill_page_desc(skb, i, page, offset, copy);
> > +
> > + copied += copy;
> > + addr += copy;
> > + offset = 0;
> > + }
> > +
> > + skb->len += len;
> > + skb->data_len += len;
> > + skb->truesize += len;
> > +
> > + refcount_add(len, &xs->sk.sk_wmem_alloc);
> > +
> > + return skb;
> > +}
> > +
> > +static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > +  struct xdp_desc *desc)
> > +{
> > + struct sk_buff *skb = NULL;
> > +
> > + if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
> > + skb = xsk_build_skb_zerocopy(xs, desc);
> > + if (IS_ERR(skb))
> > + return skb;
> > + } else {
> > + void *buffer;
> > + u32 len;
> > + int err;
> > +
> > + len = desc->len;
> > + skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
> > + if (unlikely(!skb))
> > + return ERR_PTR(err);
> > +
> > + skb_put(skb, len);
> > + buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> > + err = skb_store_bits(skb, 0, buffer, len);
> > + if (unlikely(err)) {
> > + kfree_skb(skb);
> > + return ERR_PTR(err);
> > + }
> > + }
> > +
> > + skb->dev = xs->dev;
> > + skb->priority = xs->sk.sk_priority;
> > + skb->mark = xs->sk.sk_mark;
> > + skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
> > + skb->destructor = xsk_destruct_skb;
> > +
> > + return skb;
> > +}
> > +
> >  static int xsk_generic_xmit(struct sock *sk)
> >  {
> >   struct xdp_sock *xs = xdp_sk(sk);
> > @@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk)
> >   goto out;
> >
> >   while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > - char *buffer;
> > - u64 addr;
> > - u32 len;
> > -
> >   if (max_batch-- == 0) {
> >   err = -EAGAIN;
> >   goto out;
> >   }
> >
> > - len = desc.len;
> > - skb = sock_alloc_send_skb(sk, len, 1, &err);
> > - if (unlikely(!skb))
> > + skb = 

Re: [PATCH bpf-next] xsk: build skb by page

2021-01-18 Thread Magnus Karlsson
On Mon, Jan 18, 2021 at 5:38 PM Alexander Lobakin  wrote:
>
> > From: Magnus Karlsson 
> > Date: Mon, 18 Jan 2021 16:10:40 +0100
> >
> > On Mon, Jan 18, 2021 at 3:47 PM Alexander Lobakin  wrote:
> > >
> > > From: Alexander Lobakin 
> > > Date: Mon, 18 Jan 2021 13:00:17 +
> > >
> > > > From: Yunsheng Lin 
> > > > Date: Mon, 18 Jan 2021 20:40:52 +0800
> > > >
> > > >> On 2021/1/16 10:44, Xuan Zhuo wrote:
> > > >>> This patch is used to construct skb based on page to save memory copy
> > > >>> overhead.
> > > >>>
> > > >>> This has one problem:
> > > >>>
> > > >>> We construct the skb by fill the data page as a frag into the skb. In
> > > >>> this way, the linear space is empty, and the header information is 
> > > >>> also
> > > >>> in the frag, not in the linear space, which is not allowed for some
> > > >>> network cards. For example, Mellanox Technologies MT27710 Family
> > > >>> [ConnectX-4 Lx] will get the following error message:
> > > >>>
> > > >>> mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 
> > > >>> 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
> > > >>> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >>> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >>> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >>> 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
> > > >>> WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
> > > >>> : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
> > > >>> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >>> 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
> > > >>> 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > >>> mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb
> > > >>>
> > > >>> I also tried to use build_skb to construct skb, but because of the
> > > >>> existence of skb_shinfo, it must be behind the linear space, so this
> > > >>> method is not working. We can't put skb_shinfo on desc->addr, it will 
> > > >>> be
> > > >>> exposed to users, this is not safe.
> > > >>>
> > > >>> Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether 
> > > >>> the
> > > >>
> > > >> Does it make sense to use ETHTOOL_TX_COPYBREAK tunable in ethtool to
> > > >> configure if the data is copied or not?
> > > >
> > > > As far as I can grep, only mlx4 supports this, and it has a different
> > > > meaning in that driver.
> > > > So I guess a new netdev_feature would be a better solution.
> > > >
> > > >>> network card supports the header information of the packet in the frag
> > > >>> and not in the linear space.
> > > >>>
> > > >>>  Performance Testing 
> > > >>>
> > > >>> The test environment is Aliyun ECS server.
> > > >>> Test cmd:
> > > >>> ```
> > > >>> xdpsock -i eth0 -t  -S -s 
> > > >>> ```
> > > >>>
> > > >>> Test result data:
> > > >>>
> > > >>> size64  512 10241500
> > > >>> copy1916747 1775988 1600203 1440054
> > > >>> page1974058 1953655 1945463 1904478
> > > >>> percent 3.0%10.0%   21.58%  32.3%
> > > >>>
> > > >>> Signed-off-by: Xuan Zhuo 
> > > >>> Reviewed-by: Dust Li 
> > > >>> ---
> > > >>>  drivers/net/virtio_net.c|   2 +-
> > > >>>  include/linux/netdev_features.h |   5 +-
> > > >>>  net/ethtool/common.c|   1 +
> > > >>>  net/xdp/xsk.c   | 108 
> > > >>> +---
> > > >>>  4 files changed, 97 insertions(+), 19 deletions(-)
> > > >>>
> > > >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c

Re: [PATCH bpf-next] xsk: build skb by page

2021-01-18 Thread Magnus Karlsson
On Mon, Jan 18, 2021 at 3:47 PM Alexander Lobakin  wrote:
>
> From: Alexander Lobakin 
> Date: Mon, 18 Jan 2021 13:00:17 +
>
> > From: Yunsheng Lin 
> > Date: Mon, 18 Jan 2021 20:40:52 +0800
> >
> >> On 2021/1/16 10:44, Xuan Zhuo wrote:
> >>> This patch constructs the skb based on pages to save memory copy
> >>> overhead.
> >>>
> >>> This has one problem:
> >>>
> >>> We construct the skb by filling the data page as a frag into the skb.
> >>> In this way, the linear space is empty, and the header information is
> >>> also in the frag, not in the linear space, which is not allowed for some
> >>> network cards. For example, Mellanox Technologies MT27710 Family
> >>> [ConnectX-4 Lx] will get the following error message:
> >>>
> >>> mlx5_core :3b:00.1 eth1: Error cqe on cqn 0x817, ci 0x8, qn 
> >>> 0x1dbb, opcode 0xd, syndrome 0x1, vendor syndrome 0x68
> >>> : 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 0020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 0030: 00 00 00 00 60 10 68 01 0a 00 1d bb 00 0f 9f d2
> >>> WQE DUMP: WQ size 1024 WQ cur size 0, WQE index 0xf, len: 64
> >>> : 00 00 0f 0a 00 1d bb 03 00 00 00 08 00 00 00 00
> >>> 0010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> 0020: 00 00 00 2b 00 08 00 00 00 00 00 05 9e e3 08 00
> >>> 0030: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >>> mlx5_core :3b:00.1 eth1: ERR CQE on SQ: 0x1dbb
> >>>
> >>> I also tried to use build_skb to construct the skb, but because of
> >>> the existence of skb_shinfo, it must be behind the linear space, so
> >>> this method does not work. We can't put skb_shinfo on desc->addr, as
> >>> it would be exposed to users, which is not safe.
> >>>
> >>> Finally, I added a feature NETIF_F_SKB_NO_LINEAR to identify whether the
> >>
> >> Does it make sense to use ETHTOOL_TX_COPYBREAK tunable in ethtool to
> >> configure if the data is copied or not?
> >
> > As far as I can grep, only mlx4 supports this, and it has a different
> > meaning in that driver.
> > So I guess a new netdev_feature would be a better solution.
> >
> >>> network card supports keeping the packet's header information in a
> >>> frag rather than in the linear space.
> >>>
> >>>  Performance Testing 
> >>>
> >>> The test environment is Aliyun ECS server.
> >>> Test cmd:
> >>> ```
> >>> xdpsock -i eth0 -t  -S -s 
> >>> ```
> >>>
> >>> Test result data:
> >>>
> >>> size64  512 10241500
> >>> copy1916747 1775988 1600203 1440054
> >>> page1974058 1953655 1945463 1904478
> >>> percent 3.0%10.0%   21.58%  32.3%
> >>>
> >>> Signed-off-by: Xuan Zhuo 
> >>> Reviewed-by: Dust Li 
> >>> ---
> >>>  drivers/net/virtio_net.c|   2 +-
> >>>  include/linux/netdev_features.h |   5 +-
> >>>  net/ethtool/common.c|   1 +
> >>>  net/xdp/xsk.c   | 108 
> >>> +---
> >>>  4 files changed, 97 insertions(+), 19 deletions(-)
> >>>
> >>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>> index 4ecccb8..841a331 100644
> >>> --- a/drivers/net/virtio_net.c
> >>> +++ b/drivers/net/virtio_net.c
> >>> @@ -2985,7 +2985,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> >>> /* Set up network device as normal. */
> >>> dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
> >>> dev->netdev_ops = &virtnet_netdev;
> >>> -   dev->features = NETIF_F_HIGHDMA;
> >>> +   dev->features = NETIF_F_HIGHDMA | NETIF_F_SKB_NO_LINEAR;
> >>>
> >>> dev->ethtool_ops = &virtnet_ethtool_ops;
> >>> SET_NETDEV_DEV(dev, >dev);
> >>> diff --git a/include/linux/netdev_features.h 
> >>> b/include/linux/netdev_features.h
> >>> index 934de56..8dd28e2 100644
> >>> --- a/include/linux/netdev_features.h
> >>> +++ b/include/linux/netdev_features.h
> >>> @@ -85,9 +85,11 @@ enum {
> >>>
> >>> NETIF_F_HW_MACSEC_BIT,  /* Offload MACsec operations */
> >>>
> >>> +   NETIF_F_SKB_NO_LINEAR_BIT,  /* Allow skb linear is empty */
> >>> +
> >>> /*
> >>>  * Add your fresh new feature above and remember to update
> >>> -* netdev_features_strings[] in net/core/ethtool.c and maybe
> >>> +* netdev_features_strings[] in net/ethtool/common.c and maybe
> >>>  * some feature mask #defines below. Please also describe it
> >>>  * in Documentation/networking/netdev-features.rst.
> >>>  */
> >>> @@ -157,6 +159,7 @@ enum {
> >>>  #define NETIF_F_GRO_FRAGLIST   __NETIF_F(GRO_FRAGLIST)
> >>>  #define NETIF_F_GSO_FRAGLIST   __NETIF_F(GSO_FRAGLIST)
> >>>  #define NETIF_F_HW_MACSEC  __NETIF_F(HW_MACSEC)
> >>> +#define NETIF_F_SKB_NO_LINEAR  __NETIF_F(SKB_NO_LINEAR)
> >>>
> >>>  /* Finds the next feature with the highest number of the range of start 
> >>> till 0.
> >>>   */
> >>> diff --git a/net/ethtool/common.c b/net/ethtool/common.c
> >>> index 

Re: [PATCH bpf-next] xsk: build skb by page

2020-12-23 Thread Magnus Karlsson
On Wed, Dec 23, 2020 at 9:57 AM Xuan Zhuo  wrote:
>
> This patch constructs the skb from umem pages rather than copying the
> data, to save memory copy overhead.
>
> It takes into account the problem of unaligned addr, and the
> possibility of a frame size greater than a page in the future.

Thanks Xuan for the patch set. Could you please share performance
numbers so we know how much this buys us? Would be good if you could
produce them for 64 bytes, 1500 bytes and something in the middle so
we can judge the benefits of this.

Please note that responses will be delayed this week and next due to
the Christmas and New Years holidays over here.

> Signed-off-by: Xuan Zhuo 
> ---
>  net/xdp/xsk.c | 68 
> ---
>  1 file changed, 51 insertions(+), 17 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index ac4a317..7cab40f 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -430,6 +430,55 @@ static void xsk_destruct_skb(struct sk_buff *skb)
> sock_wfree(skb);
>  }
>
> +static struct sk_buff *xsk_build_skb_bypage(struct xdp_sock *xs, struct 
> xdp_desc *desc)
> +{
> +   char *buffer;
> +   u64 addr;
> +   u32 len, offset, copy, copied;
> +   int err, i;
> +   struct page *page;
> +   struct sk_buff *skb;
> +
> +   skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
> +   if (unlikely(!skb))
> +   return NULL;
> +
> +   addr = desc->addr;
> +   len = desc->len;
> +
> +   buffer = xsk_buff_raw_get_data(xs->pool, addr);
> +   offset = offset_in_page(buffer);
> +   addr = buffer - (char *)xs->pool->addrs;
> +
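> +   /* Attach each umem page backing [addr, addr + len) to the skb as a frag. */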
> +   for (copied = 0, i = 0; copied < len; ++i) {
> +   page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
> +
> +   get_page(page);
> +
> +   copy = min((u32)(PAGE_SIZE - offset), len - copied);
> +
> +   skb_fill_page_desc(skb, i, page, offset, copy);
> +
> +   copied += copy;
> +   addr += copy;
> +   offset = 0;
> +   }
> +
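> +   /* All payload lives in frags; account the bytes on the skb. */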
> +   skb->len += len;
> +   skb->data_len += len;
> +   skb->truesize += len;
> +
> +   refcount_add(len, &xs->sk.sk_wmem_alloc);
> +
> +   skb->dev = xs->dev;
> +   skb->priority = xs->sk.sk_priority;
> +   skb->mark = xs->sk.sk_mark;
> +   skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
> +   skb->destructor = xsk_destruct_skb;
> +
> +   return skb;
> +}
> +
>  static int xsk_generic_xmit(struct sock *sk)
>  {
> struct xdp_sock *xs = xdp_sk(sk);
> @@ -445,40 +494,25 @@ static int xsk_generic_xmit(struct sock *sk)
> goto out;
>
> while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> -   char *buffer;
> -   u64 addr;
> -   u32 len;
> -
> if (max_batch-- == 0) {
> err = -EAGAIN;
> goto out;
> }
>
> -   len = desc.len;
> -   skb = sock_alloc_send_skb(sk, len, 1, &err);
> +   skb = xsk_build_skb_bypage(xs, &desc);
> if (unlikely(!skb))
> goto out;
>
> -   skb_put(skb, len);
> -   addr = desc.addr;
> -   buffer = xsk_buff_raw_get_data(xs->pool, addr);
> -   err = skb_store_bits(skb, 0, buffer, len);
> /* This is the backpressure mechanism for the Tx path.
>  * Reserve space in the completion queue and only proceed
>  * if there is space in it. This avoids having to implement
>  * any buffering in the Tx path.
>  */
> -   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> +   if (xskq_prod_reserve(xs->pool->cq)) {
> kfree_skb(skb);
> goto out;
> }
>
> -   skb->dev = xs->dev;
> -   skb->priority = sk->sk_priority;
> -   skb->mark = sk->sk_mark;
> -   skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> -   skb->destructor = xsk_destruct_skb;
> -
> err = __dev_direct_xmit(skb, xs->queue_id);
> if  (err == NETDEV_TX_BUSY) {
> /* Tell user-space to retry the send */
> --
> 1.8.3.1
>


Re: [PATCH bpf-next] xsk: save the undone skb

2020-12-14 Thread Magnus Karlsson
On Sat, Dec 12, 2020 at 9:47 AM Xuan Zhuo  wrote:
>
> On Fri, 11 Dec 2020 16:32:06 +0100, Magnus Karlsson 
>  wrote:
> > On Fri, Dec 11, 2020 at 2:12 PM Xuan Zhuo  
> > wrote:
> > >
> > > We can save the skb for later. Sending can fail either because
> > > NETDEV_TX_BUSY is returned or because xskq_prod_reserve fails. As
> > > long as the skb has already been successfully generated and
> > > configured, we can keep it and retry it on the next send when we hit
> > > such an exception.
> > >
> > > Especially when NETDEV_TX_BUSY fails, this avoids having to deal with
> > > the problem that xskq_prod_reserve has already been updated.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  include/net/xdp_sock.h |  3 +++
> > >  net/xdp/xsk.c  | 36 +++-
> > >  2 files changed, 30 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index 4f4e93b..fead0c9 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -76,6 +76,9 @@ struct xdp_sock {
> > > struct mutex mutex;
> > > struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> > > struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
> > > +
> > > +   struct sk_buff *skb_undone;
> > > +   bool skb_undone_reserve;
> > >  };
> > >
> > >  #ifdef CONFIG_XDP_SOCKETS
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index e28c682..1051024 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -435,6 +435,19 @@ static int xsk_generic_xmit(struct sock *sk)
> > > if (xs->queue_id >= xs->dev->real_num_tx_queues)
> > > goto out;
> > >
> > > +   if (xs->skb_undone) {
> > > +   if (xs->skb_undone_reserve) {
> > > +   if (xskq_prod_reserve(xs->pool->cq))
> > > +   goto out;
> > > +
> > > +   xs->skb_undone_reserve = false;
> > > +   }
> > > +
> > > +   skb = xs->skb_undone;
> > > +   xs->skb_undone = NULL;
> > > +   goto xmit;
> > > +   }
> > > +
> > > while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> > > char *buffer;
> > > u64 addr;
> > > @@ -454,12 +467,7 @@ static int xsk_generic_xmit(struct sock *sk)
> > > addr = desc.addr;
> > > buffer = xsk_buff_raw_get_data(xs->pool, addr);
> > > err = skb_store_bits(skb, 0, buffer, len);
> > > -   /* This is the backpressure mechanism for the Tx path.
> > > -* Reserve space in the completion queue and only proceed
> > > -* if there is space in it. This avoids having to 
> > > implement
> > > -* any buffering in the Tx path.
> > > -*/
> > > -   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> > > +   if (unlikely(err)) {
> > > kfree_skb(skb);
> > > goto out;
> > > }
> > > @@ -470,12 +478,22 @@ static int xsk_generic_xmit(struct sock *sk)
> > > skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> > > skb->destructor = xsk_destruct_skb;
> > >
> > > +   /* This is the backpressure mechanism for the Tx path.
> > > +* Reserve space in the completion queue and only proceed
> > > +* if there is space in it. This avoids having to 
> > > implement
> > > +* any buffering in the Tx path.
> > > +*/
> > > +   if (xskq_prod_reserve(xs->pool->cq)) {
> > > +   xs->skb_undone_reserve = true;
> > > +   xs->skb_undone = skb;
> > > +   goto out;
> > > +   }
> > > +
> > > +xmit:
> >
> > This will not work in the general case since we cannot guarantee that
> > the application does not replace the packet in the Tx ring before it
> > calls send() again. This is fully legal. I also do not like to
> > introduce state between calls. Much simpler to have it stateless which
> > means less error prone.

Re: [PATCH bpf-next] xsk: save the undone skb

2020-12-11 Thread Magnus Karlsson
On Fri, Dec 11, 2020 at 2:12 PM Xuan Zhuo  wrote:
>
> We can save the skb for later. Sending can fail either because
> NETDEV_TX_BUSY is returned or because xskq_prod_reserve fails. As long
> as the skb has already been successfully generated and configured, we
> can keep it and retry it on the next send when we hit such an
> exception.
>
> Especially when NETDEV_TX_BUSY fails, this avoids having to deal with
> the problem that xskq_prod_reserve has already been updated.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  include/net/xdp_sock.h |  3 +++
>  net/xdp/xsk.c  | 36 +++-
>  2 files changed, 30 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 4f4e93b..fead0c9 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -76,6 +76,9 @@ struct xdp_sock {
> struct mutex mutex;
> struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
> struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
> +
> +   struct sk_buff *skb_undone;
> +   bool skb_undone_reserve;
>  };
>
>  #ifdef CONFIG_XDP_SOCKETS
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index e28c682..1051024 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -435,6 +435,19 @@ static int xsk_generic_xmit(struct sock *sk)
> if (xs->queue_id >= xs->dev->real_num_tx_queues)
> goto out;
>
> +   if (xs->skb_undone) {
> +   if (xs->skb_undone_reserve) {
> +   if (xskq_prod_reserve(xs->pool->cq))
> +   goto out;
> +
> +   xs->skb_undone_reserve = false;
> +   }
> +
> +   skb = xs->skb_undone;
> +   xs->skb_undone = NULL;
> +   goto xmit;
> +   }
> +
> while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
> char *buffer;
> u64 addr;
> @@ -454,12 +467,7 @@ static int xsk_generic_xmit(struct sock *sk)
> addr = desc.addr;
> buffer = xsk_buff_raw_get_data(xs->pool, addr);
> err = skb_store_bits(skb, 0, buffer, len);
> -   /* This is the backpressure mechanism for the Tx path.
> -* Reserve space in the completion queue and only proceed
> -* if there is space in it. This avoids having to implement
> -* any buffering in the Tx path.
> -*/
> -   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
> +   if (unlikely(err)) {
> kfree_skb(skb);
> goto out;
> }
> @@ -470,12 +478,22 @@ static int xsk_generic_xmit(struct sock *sk)
> skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
> skb->destructor = xsk_destruct_skb;
>
> +   /* This is the backpressure mechanism for the Tx path.
> +* Reserve space in the completion queue and only proceed
> +* if there is space in it. This avoids having to implement
> +* any buffering in the Tx path.
> +*/
> +   if (xskq_prod_reserve(xs->pool->cq)) {
> +   xs->skb_undone_reserve = true;
> +   xs->skb_undone = skb;
> +   goto out;
> +   }
> +
> +xmit:

This will not work in the general case since we cannot guarantee that
the application does not replace the packet in the Tx ring before it
calls send() again. This is fully legal. I also do not like to
introduce state between calls. Much simpler to have it stateless which
means less error prone.

On the positive side, I will submit a patch that improves performance
of this transmit function by using the new batch interfaces I
introduced a month ago. With this patch I get a throughput improvement
of between 15 and 25% for the txpush benchmark in xdpsock. This is
much more than you will get from this patch. It also avoids the
problem you are addressing here completely. I will submit the patch
next week after the bug fix in this code has trickled down to
bpf-next. Hope you will like the throughput improvement that it
provides.
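
For the record, here is a sketch of the stateless shape this could take
instead (an illustration only, assuming a hypothetical
xskq_prod_cancel() helper that undoes a completion queue reservation,
and an xsk_build_skb() helper for the skb construction; this is not the
actual patch):

```
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
	/* Backpressure first: without a cq slot, build no skb. */
	if (xskq_prod_reserve(xs->pool->cq)) {
		err = -EAGAIN;
		goto out;
	}

	skb = xsk_build_skb(xs, &desc);		/* assumed helper */
	if (unlikely(!skb)) {
		xskq_prod_cancel(xs->pool->cq);	/* assumed helper */
		goto out;
	}

	err = __dev_direct_xmit(skb, xs->queue_id);
	if (err == NETDEV_TX_BUSY) {
		/* Undo the reservation and drop; user space retries. */
		xskq_prod_cancel(xs->pool->cq);
		skb->destructor = sock_wfree;
		consume_skb(skb);
		err = -EAGAIN;
		goto out;
	}

	sent_frame = true;
}
```

Every iteration either fully succeeds or fully unwinds, so no state
needs to survive to the next call.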

> err = __dev_direct_xmit(skb, xs->queue_id);
> if  (err == NETDEV_TX_BUSY) {
> /* Tell user-space to retry the send */
> -   skb->destructor = sock_wfree;
> -   /* Free skb without triggering the perf drop trace */
> -   consume_skb(skb);
> +   xs->skb_undone = skb;
> err = -EAGAIN;
> goto out;
> }
> --
> 1.8.3.1
>


Re: [PATCH net v2] xsk: Return error code if force_zc is set

2020-12-04 Thread Magnus Karlsson
On Fri, Dec 4, 2020 at 11:18 AM Zhang Changzhong
 wrote:
>
> If force_zc is set, we should exit out with an error, not fall back to
> copy mode.
>
> Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
> Reported-by: Hulk Robot 
> Signed-off-by: Zhang Changzhong 
> ---
>  net/xdp/xsk_buff_pool.c | 1 +
>  1 file changed, 1 insertion(+)

Thank you Changzhong!

Acked-by: Magnus Karlsson 

> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 9287edd..d5adeee 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -175,6 +175,7 @@ static int __xp_assign_dev(struct xsk_buff_pool *pool,
>
> if (!pool->dma_pages) {
> WARN(1, "Driver did not DMA map zero-copy buffers");
> +   err = -EINVAL;
> goto err_unreg_xsk;
> }
> pool->umem->zc = true;
> --
> 2.9.5
>


Re: [PATCH net] xsk: Fix error return code in __xp_assign_dev()

2020-12-04 Thread Magnus Karlsson
On Fri, Dec 4, 2020 at 9:49 AM Zhang Changzhong
 wrote:
>
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
>
> Fixes: 921b68692abb ("xsk: Enable sharing of dma mappings")
> Reported-by: Hulk Robot 
> Signed-off-by: Zhang Changzhong 
> ---
>  net/xdp/xsk_buff_pool.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 9287edd..d5adeee 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -175,6 +175,7 @@ static int __xp_assign_dev(struct xsk_buff_pool *pool,
>
> if (!pool->dma_pages) {
> WARN(1, "Driver did not DMA map zero-copy buffers");
> +   err = -EINVAL;

Good catch! My intention here by not setting err is that it should
fall back to copy mode, which it does. The problem is that the
force_zc flag is disregarded when err is not set (see exit code below)
and your patch fixes that. If force_zc is set, we should exit out with
an error, not fall back. Could you please write about this in your
cover letter and send a v2?

BTW, what is the "Hulk Robot" that is in your Reported-by tag?

Thank you: Magnus

err_unreg_xsk:
        xp_disable_drv_zc(pool);
err_unreg_pool:
        if (!force_zc)
                err = 0; /* fallback to copy mode */
        if (err)
                xsk_clear_pool_at_qid(netdev, queue_id);
        return err;

> goto err_unreg_xsk;
> }
> pool->umem->zc = true;
> --
> 2.9.5
>


Re: [PATCH bpf V3 2/2] xsk: change the tx writeable condition

2020-12-02 Thread Magnus Karlsson
On Tue, Dec 1, 2020 at 2:59 PM Xuan Zhuo  wrote:
>
> Modify the tx writeable condition from "the queue is not full" to "the
> number of outstanding entries in the tx ring is no more than half of
> the ring size". Because the tx queue is only not-full for a very short
> time, the old condition causes a large number of EPOLLOUT events, and
> hence a large number of process wake-ups.

And the Fixes label here should be:

Fixes: 35fcde7f8deb ("xsk: support for Tx")

> Signed-off-by: Xuan Zhuo 
> Acked-by: Magnus Karlsson 
> ---
>  net/xdp/xsk.c   | 16 +---
>  net/xdp/xsk_queue.h |  6 ++
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 9bbfd8a..6250447 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp, u32 len,
> return 0;
>  }
>
> +static bool xsk_tx_writeable(struct xdp_sock *xs)
> +{
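> +   /* Writeable only while at most half of the tx ring is outstanding. */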
> +   if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
> +   return false;
> +
> +   return true;
> +}
> +
>  static bool xsk_is_bound(struct xdp_sock *xs)
>  {
> if (READ_ONCE(xs->state) == XSK_BOUND) {
> @@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
> rcu_read_lock();
> list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> __xskq_cons_release(xs->tx);
> -   xs->sk.sk_write_space(&xs->sk);
> +   if (xsk_tx_writeable(xs))
> +   xs->sk.sk_write_space(&xs->sk);
> }
> rcu_read_unlock();
>  }
> @@ -436,7 +445,8 @@ static int xsk_generic_xmit(struct sock *sk)
>
>  out:
> if (sent_frame)
> -   sk->sk_write_space(sk);
> +   if (xsk_tx_writeable(xs))
> +   sk->sk_write_space(sk);
>
> mutex_unlock(&xs->mutex);
> return err;
> @@ -493,7 +503,7 @@ static __poll_t xsk_poll(struct file *file, struct socket 
> *sock,
>
> if (xs->rx && !xskq_prod_is_empty(xs->rx))
> mask |= EPOLLIN | EPOLLRDNORM;
> -   if (xs->tx && !xskq_cons_is_full(xs->tx))
> +   if (xs->tx && xsk_tx_writeable(xs))
> mask |= EPOLLOUT | EPOLLWRNORM;
>
> return mask;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index cdb9cf3..9e71b9f 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -264,6 +264,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
> q->nentries;
>  }
>
> +static inline u32 xskq_cons_present_entries(struct xsk_queue *q)
> +{
> +   /* No barriers needed since data is not accessed */
> +   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
> +}
> +
>  /* Functions for producers */
>
>  static inline bool xskq_prod_is_full(struct xsk_queue *q)
> --
> 1.8.3.1
>


Re: [PATCH bpf V3 1/2] xsk: replace datagram_poll by sock_poll_wait

2020-12-02 Thread Magnus Karlsson
On Tue, Dec 1, 2020 at 3:00 PM Xuan Zhuo  wrote:
>
> datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
> based on traditional socket state (e.g. sk_wmem_alloc), but this does
> not apply to xsk. So this patch uses sock_poll_wait instead of
> datagram_poll, and the mask is calculated by xsk_poll.
>

Could you please add a Fixes label here as this is a bug fix? It will
help the persons backporting this. Here is the line that goes just
above your Sign-off-by.

Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")

Thanks: Magnus

> Signed-off-by: Xuan Zhuo 
> Acked-by: Magnus Karlsson 
> ---
>  net/xdp/xsk.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index b7b039b..9bbfd8a 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -471,11 +471,13 @@ static int xsk_sendmsg(struct socket *sock, struct 
> msghdr *m, size_t total_len)
>  static __poll_t xsk_poll(struct file *file, struct socket *sock,
>  struct poll_table_struct *wait)
>  {
> -   __poll_t mask = datagram_poll(file, sock, wait);
> +   __poll_t mask = 0;
> struct sock *sk = sock->sk;
> struct xdp_sock *xs = xdp_sk(sk);
> struct xsk_buff_pool *pool;
>
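> +   /* Only register on the wait queue here; the returned mask is
> +    * derived purely from the xsk ring state below. */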
> +   sock_poll_wait(file, sock, wait);
> +
> if (unlikely(!xsk_is_bound(xs)))
> return mask;
>
> --
> 1.8.3.1
>


Re: [PATCH bpf v2 2/2] xsk: change the tx writeable condition

2020-11-27 Thread Magnus Karlsson
On Wed, Nov 25, 2020 at 7:49 AM Xuan Zhuo  wrote:
>
> Modify the tx writeable condition from "the queue is not full" to "the
> number of outstanding entries in the tx ring is no more than half of
> the ring size". Because the tx queue is only not-full for a very short
> time, the old condition causes a large number of EPOLLOUT events, and
> hence a large number of process wake-ups.
>
> Signed-off-by: Xuan Zhuo 

Thank you Xuan!

Acked-by: Magnus Karlsson 

> ---
>  net/xdp/xsk.c   | 16 +---
>  net/xdp/xsk_queue.h |  6 ++
>  2 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 0df8651..22e35e9 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp, u32 len,
> return 0;
>  }
>
> +static bool xsk_tx_writeable(struct xdp_sock *xs)
> +{
> +   if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
> +   return false;
> +
> +   return true;
> +}
> +
>  static bool xsk_is_bound(struct xdp_sock *xs)
>  {
> if (READ_ONCE(xs->state) == XSK_BOUND) {
> @@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
> rcu_read_lock();
> list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> __xskq_cons_release(xs->tx);
> -   xs->sk.sk_write_space(&xs->sk);
> +   if (xsk_tx_writeable(xs))
> +   xs->sk.sk_write_space(&xs->sk);
> }
> rcu_read_unlock();
>  }
> @@ -499,7 +508,8 @@ static int xsk_generic_xmit(struct sock *sk)
>
>  out:
> if (sent_frame)
> -   sk->sk_write_space(sk);
> +   if (xsk_tx_writeable(xs))
> +   sk->sk_write_space(sk);
>
> mutex_unlock(&xs->mutex);
> return err;
> @@ -556,7 +566,7 @@ static __poll_t xsk_poll(struct file *file, struct socket 
> *sock,
>
> if (xs->rx && !xskq_prod_is_empty(xs->rx))
> mask |= EPOLLIN | EPOLLRDNORM;
> -   if (xs->tx && !xskq_cons_is_full(xs->tx))
> +   if (xs->tx && xsk_tx_writeable(xs))
> mask |= EPOLLOUT | EPOLLWRNORM;
>
> return mask;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index b936c46..b655004 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -307,6 +307,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
> q->nentries;
>  }
>
> +static inline __u64 xskq_cons_present_entries(struct xsk_queue *q)
> +{
> +   /* No barriers needed since data is not accessed */
> +   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
> +}
> +
>  /* Functions for producers */
>
>  static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
> --
> 1.8.3.1
>


Re: [PATCH bpf v2 1/2] xsk: replace datagram_poll by sock_poll_wait

2020-11-27 Thread Magnus Karlsson
On Wed, Nov 25, 2020 at 7:49 AM Xuan Zhuo  wrote:
>
> datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
> based on traditional socket state (e.g. sk_wmem_alloc), but this does
> not apply to xsk. So this patch uses sock_poll_wait instead of
> datagram_poll, and the mask is calculated by xsk_poll.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  net/xdp/xsk.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)

Acked-by: Magnus Karlsson 

> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index b014197..0df8651 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -534,11 +534,13 @@ static int xsk_sendmsg(struct socket *sock, struct 
> msghdr *m, size_t total_len)
>  static __poll_t xsk_poll(struct file *file, struct socket *sock,
>  struct poll_table_struct *wait)
>  {
> -   __poll_t mask = datagram_poll(file, sock, wait);
> +   __poll_t mask = 0;
> struct sock *sk = sock->sk;
> struct xdp_sock *xs = xdp_sk(sk);
> struct xsk_buff_pool *pool;
>
> +   sock_poll_wait(file, sock, wait);
> +
> if (unlikely(!xsk_is_bound(xs)))
> return mask;
>
> --
> 1.8.3.1
>


Re: [PATCH 1/3] xsk: replace datagram_poll by sock_poll_wait

2020-11-24 Thread Magnus Karlsson
On Mon, Nov 23, 2020 at 3:11 PM Magnus Karlsson
 wrote:
>
> On Wed, Nov 18, 2020 at 9:26 AM Xuan Zhuo  wrote:
> >
> > datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
> > based on traditional socket state (e.g. sk_wmem_alloc), but this does
> > not apply to xsk. So this patch uses sock_poll_wait instead of
> > datagram_poll, and the mask is calculated by xsk_poll.
> >
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  net/xdp/xsk.c | 4 +++-
> >  1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index cfbec39..7f0353e 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -477,11 +477,13 @@ static int xsk_sendmsg(struct socket *sock, struct 
> > msghdr *m, size_t total_len)
> >  static __poll_t xsk_poll(struct file *file, struct socket *sock,
> >  struct poll_table_struct *wait)
> >  {
> > -   __poll_t mask = datagram_poll(file, sock, wait);
> > +   __poll_t mask = 0;
>
> It would indeed be nice to not execute a number of tests in
> datagram_poll that will never be triggered. It will speed up things
> for sure. But we need to make sure that removing those flags that
> datagram_poll sets do not have any bad effects in the code above this.
> But let us tentatively keep this patch for the next version of the
> patch set. Just need to figure out how to solve your problem in a nice
> way first. See discussion in patch 0/3.
>
> > struct sock *sk = sock->sk;
> > struct xdp_sock *xs = xdp_sk(sk);
> > struct xsk_buff_pool *pool;
> >
> > +   sock_poll_wait(file, sock, wait);
> > +
> > if (unlikely(!xsk_is_bound(xs)))
> > return mask;
> >
> > --
> > 1.8.3.1
> >

The fix looks correct and it will speed things up too as a bonus.
Please include this patch in the v2 as outlined in my answer to 0/3.

Thanks!


Re: [PATCH 0/3] xsk: fix for xsk_poll writeable

2020-11-24 Thread Magnus Karlsson
On Tue, Nov 24, 2020 at 10:01 AM Magnus Karlsson
 wrote:
>
> On Mon, Nov 23, 2020 at 4:21 PM Xuan Zhuo  wrote:
> >
> > On Mon, 23 Nov 2020 15:00:48 +0100, Magnus Karlsson 
> >  wrote:
> > > On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo  
> > > wrote:
> > > >
> > > > I tried to combine cq available and tx writeable, but I found it very 
> > > > difficult.
> > > > Sometimes we pay attention to the status of "available" for both, but 
> > > > sometimes,
> > > > we may only pay attention to one, such as tx writeable, because we can 
> > > > use the
> > > > item of fq to write to tx. And this kind of demand may be constantly 
> > > > changing,
> > > > and it may be necessary to set it every time before entering xsk_poll, 
> > > > so
> > > > setsockopt is not very convenient. I feel even more that using a new 
> > > > event may
> > > > be a better solution, such as EPOLLPRI, I think it can be used here, 
> > > > after all,
> > > > xsk should not have OOB data ^_^.
> > > >
> > > > However, two other problems were discovered during the test:
> > > >
> > > > * The mask returned by datagram_poll always contains EPOLLOUT
> > > > * It is not particularly reasonable to return EPOLLOUT based on tx not 
> > > > full
> > > >
> > > > After fixing these two problems, I found that when the process is 
> > > > awakened by
> > > > EPOLLOUT, the process can always get the item from cq.
> > > >
> > > > Because the number of packets that the network card can send at a time 
> > > > is
> > > > actually limited, suppose this value is "nic_num". Once the number of
> > > > consumed items in the tx queue is greater than nic_num, this means that 
> > > > there
> > > > must also be new recycled items in the cq queue from nic.
> > > >
> > > > In this way, as long as the tx configured by the user is larger, we 
> > > > won't have
> > > > the situation that tx is already in the writeable state but cannot get 
> > > > the item
> > > > from cq.
> > >
> > > I think the overall approach of tying this into poll() instead of
> > > setsockopt() is the right way to go. But we need a more robust
> > > solution. Your patch #3 also breaks backwards compatibility and that
> > > is not allowed. Could you please post some simple code example of what
> > > it is you would like to do in user space? So you would like to wake up
> > > when there are entries in the cq that can be retrieved and the reason
> > > you would like to do this is that you then know you can put some more
> > > entries into the Tx ring and they will get sent as there now are free
> > > slots in the cq. Correct me if wrong. Would an event that wakes you up
> > > when there is both space in the Tx ring and space in the cq work? Is
> > > there a case in which we would like to be woken up when only the Tx
> > > ring is non-full? Maybe there are as it might be beneficial to fill
> > > the Tx and while doing that some entries in the cq has been completed
> > > and away the packets go. But it would be great if you could post some
> > > simple example code, does not need to compile or anything. Can be
> > > pseudo code.
> > >
> > > It would also be good to know if your goal is max throughput, max
> > > burst size, or something else.
> > >
> > > Thanks: Magnus
> > >
> >
> > My goal is max pps; if possible, I also increase the size of buf
> > appropriately to improve throughput, like pktgen.
> >
> > The code is like this (tx and the umem cq are also 1024 entries, and
> > this works with zero copy):
> >
> > ```
> > void send_handler(xsk)
> > {
> >         char buf[22];
> >
> >         while (true) {
> >                 while (true) {
> >                         if (send_buf_to_tx_ring(xsk, buf, sizeof(buf)))
> >                                 break; /* no cq entry or tx is full */
> >                 }
> >
> >                 if (sendto(xsk->fd))
> >                         break;
> >         }
> > }
> >
> >
> > static int loop(int efd, xsk)
> > {
> >         struct epoll_event e[1024];
> >         struct epoll_event ee;
> >         int n, i;
> >
> >         ee.events = EPOLLOUT;
> >         ee.data.ptr = NULL;

Re: [PATCH 2/3] xsk: change the tx writeable condition

2020-11-24 Thread Magnus Karlsson
On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo  wrote:
>
> Modify the tx writeable condition from "the queue is not full" to "the
> number of outstanding entries in the tx ring is no more than half of
> the ring size". Because the tx queue is only not-full for a very short
> time, the old condition causes a large number of EPOLLOUT events, and
> hence a large number of process wake-ups.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  net/xdp/xsk.c   | 20 +---
>  net/xdp/xsk_queue.h |  6 ++
>  2 files changed, 23 insertions(+), 3 deletions(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 7f0353e..bc3d4ece 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -211,6 +211,17 @@ static int __xsk_rcv(struct xdp_sock *xs, struct 
> xdp_buff *xdp, u32 len,
> return 0;
>  }
>
> +static bool xsk_writeable(struct xdp_sock *xs)

Not clear what this function does from the name. How about
xsk_tx_half_free() or maybe xsk_tx_writeable()?

> +{
> +   if (!xs->tx)
> +   return false;

Skip this test as it will slow down the code. It is only needed in one
place below.

> +   if (xskq_cons_left(xs->tx) > xs->tx->nentries / 2)
> +   return false;
> +
> +   return true;
> +}
> +
>  static bool xsk_is_bound(struct xdp_sock *xs)
>  {
> if (READ_ONCE(xs->state) == XSK_BOUND) {
> @@ -296,7 +307,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
> rcu_read_lock();
> list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
> __xskq_cons_release(xs->tx);
> -   xs->sk.sk_write_space(&xs->sk);
> +   if (xsk_writeable(xs))
> +   xs->sk.sk_write_space(&xs->sk);
> }
> rcu_read_unlock();
>  }
> @@ -442,7 +454,8 @@ static int xsk_generic_xmit(struct sock *sk)
>
>  out:
> if (sent_frame)
> -   sk->sk_write_space(sk);
> +   if (xsk_writeable(xs))
> +   sk->sk_write_space(sk);
>
> mutex_unlock(&xs->mutex);
> return err;
> @@ -499,7 +512,8 @@ static __poll_t xsk_poll(struct file *file, struct socket 
> *sock,
>
> if (xs->rx && !xskq_prod_is_empty(xs->rx))
> mask |= EPOLLIN | EPOLLRDNORM;
> -   if (xs->tx && !xskq_cons_is_full(xs->tx))
> +

No reason to introduce a newline here.

> +   if (xsk_writeable(xs))

Add an explicit "xs->tx &&" in the if statement here as we removed the
test in xsk_writeable.

> mask |= EPOLLOUT | EPOLLWRNORM;
>
> return mask;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index cdb9cf3..82a5228 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -264,6 +264,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
> q->nentries;
>  }
>
> +static inline __u64 xskq_cons_left(struct xsk_queue *q)

Let us call this xskq_cons_entries_present() or
xskq_cons_filled_entries(). The word "left" has the connotation that I
still have stuff left to do. While this is kind of true for this case,
it might not be for other cases that can use your function. The
function provides how many (filled) entries that are present in the
ring. Can you come up with a better name as I am not super fond of my
suggestions? It would have been nice to call it xskq_cons_nb_entries()
but there is already such a function that is lazy in nature and that
allows access to the entries.

> +{
> +   /* No barriers needed since data is not accessed */
> +   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
> +}
> +
>  /* Functions for producers */
>
>  static inline bool xskq_prod_is_full(struct xsk_queue *q)
> --
> 1.8.3.1
>


Re: [PATCH 0/3] xsk: fix for xsk_poll writeable

2020-11-24 Thread Magnus Karlsson
On Mon, Nov 23, 2020 at 4:21 PM Xuan Zhuo  wrote:
>
> On Mon, 23 Nov 2020 15:00:48 +0100, Magnus Karlsson 
>  wrote:
> > On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo  
> > wrote:
> > >
> > > I tried to combine cq available and tx writeable, but I found it very 
> > > difficult.
> > > Sometimes we pay attention to the status of "available" for both, but 
> > > sometimes,
> > > we may only pay attention to one, such as tx writeable, because we can 
> > > use the
> > > item of fq to write to tx. And this kind of demand may be constantly 
> > > changing,
> > > and it may be necessary to set it every time before entering xsk_poll, so
> > > setsockopt is not very convenient. I feel even more that using a new 
> > > event may
> > > be a better solution, such as EPOLLPRI, I think it can be used here, 
> > > after all,
> > > xsk should not have OOB data ^_^.
> > >
> > > However, two other problems were discovered during the test:
> > >
> > > * The mask returned by datagram_poll always contains EPOLLOUT
> > > * It is not particularly reasonable to return EPOLLOUT based on tx not 
> > > full
> > >
> > > After fixing these two problems, I found that when the process is 
> > > awakened by
> > > EPOLLOUT, the process can always get the item from cq.
> > >
> > > Because the number of packets that the network card can send at a time is
> > > actually limited, suppose this value is "nic_num". Once the number of
> > > consumed items in the tx queue is greater than nic_num, this means that 
> > > there
> > > must also be new recycled items in the cq queue from nic.
> > >
> > > In this way, as long as the tx configured by the user is larger, we won't 
> > > have
> > > the situation that tx is already in the writeable state but cannot get 
> > > the item
> > > from cq.
> >
> > I think the overall approach of tying this into poll() instead of
> > setsockopt() is the right way to go. But we need a more robust
> > solution. Your patch #3 also breaks backwards compatibility and that
> > is not allowed. Could you please post some simple code example of what
> > it is you would like to do in user space? So you would like to wake up
> > when there are entries in the cq that can be retrieved and the reason
> > you would like to do this is that you then know you can put some more
> > entries into the Tx ring and they will get sent as there now are free
> > slots in the cq. Correct me if wrong. Would an event that wakes you up
> > when there is both space in the Tx ring and space in the cq work? Is
> > there a case in which we would like to be woken up when only the Tx
> > ring is non-full? Maybe there are as it might be beneficial to fill
> > the Tx and while doing that some entries in the cq has been completed
> > and away the packets go. But it would be great if you could post some
> > simple example code, does not need to compile or anything. Can be
> > pseudo code.
> >
> > It would also be good to know if your goal is max throughput, max
> > burst size, or something else.
> >
> > Thanks: Magnus
> >
>
> My goal is max pps; if possible, I also increase the size of buf
> appropriately to improve throughput, like pktgen.
>
> The code is like this (tx and the umem cq are also 1024 entries, and
> this works with zero copy):
>
> ```
> void send_handler(xsk)
> {
>         char buf[22];
>
>         while (true) {
>                 while (true) {
>                         if (send_buf_to_tx_ring(xsk, buf, sizeof(buf)))
>                                 break; /* no cq entry or tx is full */
>                 }
>
>                 if (sendto(xsk->fd))
>                         break;
>         }
> }
>
>
> static int loop(int efd, xsk)
> {
>         struct epoll_event e[1024];
>         struct epoll_event ee;
>         int n, i;
>
>         ee.events = EPOLLOUT;
>         ee.data.ptr = NULL;
>
>         epoll_ctl(efd, EPOLL_CTL_ADD, xsk->fd, &ee);
>
>         while (1) {
>                 n = epoll_wait(efd, e, sizeof(e) / sizeof(e[0]), -1);
>
>                 if (n <= 0)
>                         continue;
>
>                 for (i = 0; i < n; ++i)
>                         send_handler(xsk);
>         }
> }
> ```
>
> 1. Now, since datagram_poll(that determine whether it is write able based on
>sock_write

Re: [PATCH 1/3] xsk: replace datagram_poll by sock_poll_wait

2020-11-23 Thread Magnus Karlsson
On Wed, Nov 18, 2020 at 9:26 AM Xuan Zhuo  wrote:
>
> datagram_poll determines the current socket status (EPOLLIN, EPOLLOUT)
> based on traditional socket state (e.g. sk_wmem_alloc), but this does
> not apply to xsk. So this patch uses sock_poll_wait instead of
> datagram_poll, and the mask is calculated by xsk_poll.
>
> Signed-off-by: Xuan Zhuo 
> ---
>  net/xdp/xsk.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cfbec39..7f0353e 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -477,11 +477,13 @@ static int xsk_sendmsg(struct socket *sock, struct 
> msghdr *m, size_t total_len)
>  static __poll_t xsk_poll(struct file *file, struct socket *sock,
>  struct poll_table_struct *wait)
>  {
> -   __poll_t mask = datagram_poll(file, sock, wait);
> +   __poll_t mask = 0;

It would indeed be nice to not execute a number of tests in
datagram_poll that will never be triggered. It will speed up things
for sure. But we need to make sure that removing those flags that
datagram_poll sets do not have any bad effects in the code above this.
But let us tentatively keep this patch for the next version of the
patch set. Just need to figure out how to solve your problem in a nice
way first. See discussion in patch 0/3.

> struct sock *sk = sock->sk;
> struct xdp_sock *xs = xdp_sk(sk);
> struct xsk_buff_pool *pool;
>
> +   sock_poll_wait(file, sock, wait);
> +
> if (unlikely(!xsk_is_bound(xs)))
> return mask;
>
> --
> 1.8.3.1
>


Re: [PATCH 0/3] xsk: fix for xsk_poll writeable

2020-11-23 Thread Magnus Karlsson
On Wed, Nov 18, 2020 at 9:25 AM Xuan Zhuo  wrote:
>
> I tried to combine "cq has available entries" and "tx is writeable"
> into one condition, but I found it very difficult. Sometimes we care
> about the availability of both, but sometimes we only care about one,
> such as tx writeable, because we can use entries from the fq to write
> to tx. This requirement may also keep changing, and it might have to be
> set every time before entering xsk_poll, so setsockopt is not very
> convenient. I increasingly feel that a new event may be a better
> solution, such as EPOLLPRI; I think it can be used here since, after
> all, xsk should not have OOB data ^_^.
>
> However, two other problems were discovered during the test:
>
> * The mask returned by datagram_poll always contains EPOLLOUT
> * It is not particularly reasonable to return EPOLLOUT based on tx not full
>
> After fixing these two problems, I found that when the process is woken
> up by EPOLLOUT, it can always get an entry from the cq.
>
> Because the number of packets that the network card can send at a time
> is actually limited, suppose this value is "nic_num". Once the number
> of consumed entries in the tx queue exceeds nic_num, there must also be
> newly recycled entries in the cq, returned from the NIC.
>
> In this way, as long as the tx ring configured by the user is large
> enough, we won't hit the situation where tx is already writeable but we
> cannot get an entry from the cq.

I think the overall approach of tying this into poll() instead of
setsockopt() is the right way to go. But we need a more robust
solution. Your patch #3 also breaks backwards compatibility and that
is not allowed. Could you please post some simple code example of what
it is you would like to do in user space? So you would like to wake up
when there are entries in the cq that can be retrieved and the reason
you would like to do this is that you then know you can put some more
entries into the Tx ring and they will get sent as there now are free
slots in the cq. Correct me if wrong. Would an event that wakes you up
when there is both space in the Tx ring and space in the cq work? Is
there a case in which we would like to be woken up when only the Tx
ring is non-full? Maybe there are as it might be beneficial to fill
the Tx and while doing that some entries in the cq has been completed
and away the packets go. But it would be great if you could post some
simple example code, does not need to compile or anything. Can be
pseudo code.

It would also be good to know if your goal is max throughput, max
burst size, or something else.

Thanks: Magnus
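
As a concrete starting point for that discussion, a sketch of the
combined condition (an assumption of how it could look, not a tested
patch):

```
/* Report EPOLLOUT only when both the Tx ring and the completion
 * queue have room, so a woken-up producer can always complete a
 * full descriptor round trip. */
static bool xsk_tx_and_cq_writeable(struct xdp_sock *xs)
{
	return xs->tx && !xskq_cons_is_full(xs->tx) &&
	       xs->pool && !xskq_prod_is_full(xs->pool->cq);
}
```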


> Xuan Zhuo (3):
>   xsk: replace datagram_poll by sock_poll_wait
>   xsk: change the tx writeable condition
>   xsk: set tx/rx the min entries
>
>  include/uapi/linux/if_xdp.h |  2 ++
>  net/xdp/xsk.c   | 26 ++
>  net/xdp/xsk_queue.h |  6 ++
>  3 files changed, 30 insertions(+), 4 deletions(-)
>
> --
> 1.8.3.1
>


Re: memory leak in xdp_umem_create

2020-10-24 Thread Magnus Karlsson
On Fri, Oct 23, 2020 at 6:24 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:f804b315 Merge tag 'linux-watchdog-5.10-rc1' of git://www...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=1797677f90
> kernel config:  https://syzkaller.appspot.com/x/.config?x=504c0405f28172a
> dashboard link: https://syzkaller.appspot.com/bug?extid=eb71df123dc2be2c1456
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=10f2754450
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=12fc4de850
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+eb71df123dc2be2c1...@syzkaller.appspotmail.com
>
> Warning: Permanently added '10.128.10.22' (ECDSA) to the list of known hosts.
> executing program
> executing program
> BUG: memory leak
> unreferenced object 0x888110c1e400 (size 96):
>   comm "syz-executor230", pid 8462, jiffies 4294942469 (age 13.280s)
>   hex dump (first 32 bytes):
> 00 50 e0 00 00 c9 ff ff 00 00 02 00 00 00 00 00  .P..
> 00 00 00 00 00 10 00 00 20 00 00 00 20 00 00 00   ... ...
>   backtrace:
> [] kmalloc include/linux/slab.h:554 [inline]
> [] kzalloc include/linux/slab.h:666 [inline]
> [] xdp_umem_create+0x33/0x630 net/xdp/xdp_umem.c:229
> [<551a05ed>] xsk_setsockopt+0x4ad/0x590 net/xdp/xsk.c:852
> [] __sys_setsockopt+0x1b0/0x360 net/socket.c:2132
> [<76c65982>] __do_sys_setsockopt net/socket.c:2143 [inline]
> [<76c65982>] __se_sys_setsockopt net/socket.c:2140 [inline]
> [<76c65982>] __x64_sys_setsockopt+0x22/0x30 net/socket.c:2140
> [] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> BUG: memory leak
> unreferenced object 0x88810e018f00 (size 256):
>   comm "syz-executor230", pid 8462, jiffies 4294942469 (age 13.280s)
>   hex dump (first 32 bytes):
> 00 00 4f 04 00 ea ff ff 40 00 4f 04 00 ea ff ff  ..O.@.O.
> 80 00 4f 04 00 ea ff ff c0 00 4f 04 00 ea ff ff  ..O...O.
>   backtrace:
> [<257d0c74>] kmalloc_array include/linux/slab.h:594 [inline]
> [<257d0c74>] kcalloc include/linux/slab.h:605 [inline]
> [<257d0c74>] xdp_umem_pin_pages net/xdp/xdp_umem.c:89 [inline]
> [<257d0c74>] xdp_umem_reg net/xdp/xdp_umem.c:207 [inline]
> [<257d0c74>] xdp_umem_create+0x3cc/0x630 net/xdp/xdp_umem.c:240
> [<551a05ed>] xsk_setsockopt+0x4ad/0x590 net/xdp/xsk.c:852
> [] __sys_setsockopt+0x1b0/0x360 net/socket.c:2132
> [<76c65982>] __do_sys_setsockopt net/socket.c:2143 [inline]
> [<76c65982>] __se_sys_setsockopt net/socket.c:2140 [inline]
> [<76c65982>] __x64_sys_setsockopt+0x22/0x30 net/socket.c:2140
> [] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [] entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
>
>
> ---
> This report is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this issue. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this issue, for details see:
> https://goo.gl/tpsmEJ#testing-patches

Thank you Mr Syzbot! I will take a look at this.


Re: general protection fault in xsk_release

2020-09-25 Thread Magnus Karlsson
On Fri, Sep 25, 2020 at 4:47 PM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following issue on:
>
> HEAD commit:3fc826f1 Merge branch 'net-dsa-bcm_sf2-Additional-DT-chang..
> git tree:   net-next
> console output: https://syzkaller.appspot.com/x/log.txt?x=158f800990
> kernel config:  https://syzkaller.appspot.com/x/.config?x=51fb40e67d1e3dec
> dashboard link: https://syzkaller.appspot.com/bug?extid=ddc7b4944bc61da19b81
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=16372c8190
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=100bd2c390
>
> The issue was bisected to:
>
> commit 1c1efc2af158869795d3334a12fed2afd9c51539
> Author: Magnus Karlsson 
> Date:   Fri Aug 28 08:26:17 2020 +
>
> xsk: Create and free buffer pool independently from umem

Thanks. On it.

> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=157a310390
> final oops: https://syzkaller.appspot.com/x/report.txt?x=177a310390
> console output: https://syzkaller.appspot.com/x/log.txt?x=137a310390
>
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+ddc7b4944bc61da19...@syzkaller.appspotmail.com
> Fixes: 1c1efc2af158 ("xsk: Create and free buffer pool independently from 
> umem")
>
> RAX: ffda RBX: 7fff675613c0 RCX: 00443959
> RDX: 0030 RSI: 2000 RDI: 0003
> RBP:  R08: 0001 R09: 01bb
> R10:  R11: 0246 R12: 
> R13: 0007 R14:  R15: 
> general protection fault, probably for non-canonical address 
> 0xdcad:  [#1] PREEMPT SMP KASAN
> KASAN: null-ptr-deref in range [0x0568-0x056f]
> CPU: 0 PID: 6888 Comm: syz-executor811 Not tainted 5.9.0-rc6-syzkaller #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> Google 01/01/2011
> RIP: 0010:dev_put include/linux/netdevice.h:3899 [inline]
> RIP: 0010:xsk_unbind_dev net/xdp/xsk.c:521 [inline]
> RIP: 0010:xsk_release+0x63f/0x7d0 net/xdp/xsk.c:591
> Code: 00 00 48 c7 85 c8 04 00 00 00 00 00 00 e8 29 a0 47 fe 48 8d bb 68 05 00 
> 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 66 
> 01 00 00 48 8b 83 68 05 00 00 65 ff 08 e9 54
> RSP: 0018:c90005707c90 EFLAGS: 00010202
> RAX: dc00 RBX:  RCX: 815b9de2
> RDX: 00ad RSI: 882d2317 RDI: 0568
> RBP: 888091aae000 R08: 0001 R09: 8d0ffaaf
> R10: fbfff1a1ff55 R11:  R12: 888091aae5f8
> R13: 888091aae4c8 R14: dc00 R15: 111012355cb5
> FS:  () GS:8880ae40() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: 004c8b88 CR3: 93ea3000 CR4: 001506f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>  __sock_release+0xcd/0x280 net/socket.c:596
>  sock_close+0x18/0x20 net/socket.c:1277
>  __fput+0x285/0x920 fs/file_table.c:281
>  task_work_run+0xdd/0x190 kernel/task_work.c:141
>  exit_task_work include/linux/task_work.h:25 [inline]
>  do_exit+0xb7d/0x29f0 kernel/exit.c:806
>  do_group_exit+0x125/0x310 kernel/exit.c:903
>  __do_sys_exit_group kernel/exit.c:914 [inline]
>  __se_sys_exit_group kernel/exit.c:912 [inline]
>  __x64_sys_exit_group+0x3a/0x50 kernel/exit.c:912
>  do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> RIP: 0033:0x442588
> Code: Bad RIP value.
> RSP: 002b:7fff67561328 EFLAGS: 0246 ORIG_RAX: 00e7
> RAX: ffda RBX: 0001 RCX: 00442588
> RDX: 0001 RSI: 003c RDI: 0001
> RBP: 004c8b50 R08: 00e7 R09: ffd0
> R10:  R11: 0246 R12: 0001
> R13: 006dd280 R14:  R15: 
> Modules linked in:
> ---[ end trace 9742ad575ae08359 ]---
> RIP: 0010:dev_put include/linux/netdevice.h:3899 [inline]
> RIP: 0010:xsk_unbind_dev net/xdp/xsk.c:521 [inline]
> RIP: 0010:xsk_release+0x63f/0x7d0 net/xdp/xsk.c:591
> Code: 00 00 48 c7 85 c8 04 00 00 00 00 00 00 e8 29 a0 47 fe 48 8d bb 68 05 00 
> 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 66 
> 01 00 00 48 8b 83 68 05 00 00 65 ff 08 e9 54
> RSP: 0018:c90005707c90 EFLAGS: 00010202
> RAX: dc00 RBX: 000

Re: KASAN: use-after-free Write in xp_put_pool

2020-09-02 Thread Magnus Karlsson
On Wed, Sep 2, 2020 at 9:06 AM Daniel Borkmann  wrote:
>
> On 9/2/20 8:57 AM, syzbot wrote:
> > Hello,
> >
> > syzbot found the following issue on:
>
> Magnus/Bjorn, ptal, thanks!

On it as we speak.

> > HEAD commit:dc1a9bf2 octeontx2-pf: Add UDP segmentation offload support
> > git tree:   net-next
> > console output: https://syzkaller.appspot.com/x/log.txt?x=16ff67de90
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=b6856d16f78d8fa9
> > dashboard link: https://syzkaller.appspot.com/bug?extid=5334f62e4d22804e646a
> > compiler:   gcc (GCC) 10.1.0-syz 20200507
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12e9f27990
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=125f3e1e90
> >
> > The issue was bisected to:
> >
> > commit a1132430c2c55af62d13e9fca752d46f14d548b3
> > Author: Magnus Karlsson 
> > Date:   Fri Aug 28 08:26:26 2020 +
> >
> >  xsk: Add shared umem support between devices
> >
> > bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=10a373de90
> > final oops: https://syzkaller.appspot.com/x/report.txt?x=12a373de90
> > console output: https://syzkaller.appspot.com/x/log.txt?x=14a373de90
> >
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+5334f62e4d22804e6...@syzkaller.appspotmail.com
> > Fixes: a1132430c2c5 ("xsk: Add shared umem support between devices")
> >
> > ==
> > BUG: KASAN: use-after-free in instrument_atomic_write 
> > include/linux/instrumented.h:71 [inline]
> > BUG: KASAN: use-after-free in atomic_fetch_sub_release 
> > include/asm-generic/atomic-instrumented.h:220 [inline]
> > BUG: KASAN: use-after-free in refcount_sub_and_test 
> > include/linux/refcount.h:266 [inline]
> > BUG: KASAN: use-after-free in refcount_dec_and_test 
> > include/linux/refcount.h:294 [inline]
> > BUG: KASAN: use-after-free in xp_put_pool+0x2c/0x1e0 
> > net/xdp/xsk_buff_pool.c:262
> > Write of size 4 at addr 8880a6a4d860 by task ksoftirqd/0/9
> >
> > CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.9.0-rc1-syzkaller #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 01/01/2011
> > Call Trace:
> >   __dump_stack lib/dump_stack.c:77 [inline]
> >   dump_stack+0x18f/0x20d lib/dump_stack.c:118
> >   print_address_description.constprop.0.cold+0xae/0x497 
> > mm/kasan/report.c:383
> >   __kasan_report mm/kasan/report.c:513 [inline]
> >   kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
> >   check_memory_region_inline mm/kasan/generic.c:186 [inline]
> >   check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
> >   instrument_atomic_write include/linux/instrumented.h:71 [inline]
> >   atomic_fetch_sub_release include/asm-generic/atomic-instrumented.h:220 
> > [inline]
> >   refcount_sub_and_test include/linux/refcount.h:266 [inline]
> >   refcount_dec_and_test include/linux/refcount.h:294 [inline]
> >   xp_put_pool+0x2c/0x1e0 net/xdp/xsk_buff_pool.c:262
> >   xsk_destruct+0x7d/0xa0 net/xdp/xsk.c:1138
> >   __sk_destruct+0x4b/0x860 net/core/sock.c:1764
> >   rcu_do_batch kernel/rcu/tree.c:2428 [inline]
> >   rcu_core+0x5c7/0x1190 kernel/rcu/tree.c:2656
> >   __do_softirq+0x2de/0xa24 kernel/softirq.c:298
> >   run_ksoftirqd kernel/softirq.c:652 [inline]
> >   run_ksoftirqd+0x89/0x100 kernel/softirq.c:644
> >   smpboot_thread_fn+0x655/0x9e0 kernel/smpboot.c:165
> >   kthread+0x3b5/0x4a0 kernel/kthread.c:292
> >   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
> >
> > Allocated by task 6854:
> >   kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
> >   kasan_set_track mm/kasan/common.c:56 [inline]
> >   __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
> >   kmalloc_node include/linux/slab.h:577 [inline]
> >   kvmalloc_node+0x61/0xf0 mm/util.c:574
> >   kvmalloc include/linux/mm.h:750 [inline]
> >   kvzalloc include/linux/mm.h:758 [inline]
> >   xp_create_and_assign_umem+0x58/0x8d0 net/xdp/xsk_buff_pool.c:54
> >   xsk_bind+0x9a0/0xed0 net/xdp/xsk.c:709
> >   __sys_bind+0x1e9/0x250 net/socket.c:1656
> >   __do_sys_bind net/socket.c:1667 [inline]
> >   __se_sys_bind net/socket.c:1665 [inline]
> >   __x64_sys_bind+0x6f/0xb0 net/socket.c:1665
> >   do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> >   entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >
> > Freed by task 6854:
> >   kasan_save_stack+0x1b/0x

Re: BUG: unable to handle kernel NULL pointer dereference in xsk_poll

2019-09-30 Thread Magnus Karlsson
On Mon, Sep 30, 2019 at 9:17 AM syzbot
 wrote:
>
> Hello,
>
> syzbot found the following crash on:

Thank you Mr Syzkaller. I am on it.

/Magnus

> HEAD commit:a3c0e7b1 Merge tag 'libnvdimm-fixes-5.4-rc1' of git://git...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=14f0543560
> kernel config:  https://syzkaller.appspot.com/x/.config?x=6ffbfa7e4a36190f
> dashboard link: https://syzkaller.appspot.com/bug?extid=a5765ed8cdb1cca4d249
> compiler:   gcc (GCC) 9.0.0 20181231 (experimental)
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1096d83560
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=129f15f360
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a5765ed8cdb1cca4d...@syzkaller.appspotmail.com
>
> IPv6: ADDRCONF(NETDEV_CHANGE): hsr0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device batadv0
> BUG: kernel NULL pointer dereference, address: 
> #PF: supervisor instruction fetch in kernel mode
> #PF: error_code(0x0010) - not-present page
> PGD 99226067 P4D 99226067 PUD 8fa47067 PMD 0
> Oops: 0010 [#1] PREEMPT SMP KASAN
> CPU: 0 PID: 8719 Comm: syz-executor502 Not tainted 5.3.0+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> RIP: 0010:0x0
> Code: Bad RIP value.
> RSP: 0018:88809dd4f848 EFLAGS: 00010246
> RAX:  RBX: 88808c06d740 RCX: 11101180db7c
> RDX: 0002 RSI:  RDI: 88809a190b00
> RBP: 88809dd4f880 R08: 8880921924c0 R09: ed101180db31
> R10: ed101180db30 R11: 88808c06d987 R12: 0002
> R13: 0304 R14: 88809a190b00 R15: 
> FS:  01b27880() GS:8880ae80() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ffd6 CR3: a307a000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
> Call Trace:
>   xsk_poll+0x1e7/0x5a0 net/xdp/xsk.c:430
>   sock_poll+0x15e/0x480 net/socket.c:1256
>   vfs_poll include/linux/poll.h:90 [inline]
>   do_pollfd fs/select.c:859 [inline]
>   do_poll fs/select.c:907 [inline]
>   do_sys_poll+0x63c/0xdd0 fs/select.c:1001
>   __do_sys_ppoll fs/select.c:1101 [inline]
>   __se_sys_ppoll fs/select.c:1081 [inline]
>   __x64_sys_ppoll+0x259/0x310 fs/select.c:1081
>   do_syscall_64+0xfa/0x760 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x441bd9
> Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> ff 0f 83 7b 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:7ffd48824e98 EFLAGS: 0246 ORIG_RAX: 010f
> RAX: ffda RBX: 0003 RCX: 00441bd9
> RDX:  RSI: 0006 RDI: 2040
> RBP: 7ffd48824eb0 R08:  R09: 01bb
> R10:  R11: 0246 R12: 
> R13: 00403170 R14:  R15: 
> Modules linked in:
> CR2: 
> ---[ end trace e262cafe88422aec ]---
> RIP: 0010:0x0
> Code: Bad RIP value.
> RSP: 0018:88809dd4f848 EFLAGS: 00010246
> RAX:  RBX: 88808c06d740 RCX: 11101180db7c
> RDX: 0002 RSI:  RDI: 88809a190b00
> RBP: 88809dd4f880 R08: 8880921924c0 R09: ed101180db31
> R10: ed101180db30 R11: 88808c06d987 R12: 0002
> R13: 0304 R14: 88809a190b00 R15: 
> FS:  01b27880() GS:8880ae80() knlGS:
> CS:  0010 DS:  ES:  CR0: 80050033
> CR2: ffd6 CR3: a307a000 CR4: 001406f0
> DR0:  DR1:  DR2: 
> DR3:  DR6: fffe0ff0 DR7: 0400
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkal...@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches


Re: [PATCH v2 bpf-next] mm: mmap: increase sockets maximum memory size pgoff for 32bits

2019-08-13 Thread Magnus Karlsson
On Mon, Aug 12, 2019 at 2:45 PM Ivan Khoronzhuk
 wrote:
>
> The AF_XDP sockets umem mapping interface uses XDP_UMEM_PGOFF_FILL_RING
> and XDP_UMEM_PGOFF_COMPLETION_RING offsets. The offsets seems like are
> established already and are part of configuration interface.
>
> But for 32-bit systems, while AF_XDP socket configuration, the values
> are to large to pass maximum allowed file size verification.
> The offsets can be tuned ofc, but instead of changing existent
> interface - extend max allowed file size for sockets.

Can you use mmap2() instead that takes a larger offset (2^44) even on
32-bit systems?
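
For illustration, a minimal sketch of the mmap2() route, assuming a
32-bit build where the raw mmap2 syscall is available (its offset
argument is in units of 4096 bytes, so the 2^32-sized AF_XDP ring
offsets still fit in one register):

```
#define _GNU_SOURCE
#include <sys/syscall.h>
#include <sys/mman.h>
#include <unistd.h>
#include <linux/if_xdp.h>

static void *map_fill_ring(int xsk_fd, size_t len)
{
#ifdef SYS_mmap2
	/* XDP_UMEM_PGOFF_FILL_RING is 1ULL << 32; shifted by the page
	 * size it becomes 0x100000 and fits in an unsigned long. */
	unsigned long pgoff = XDP_UMEM_PGOFF_FILL_RING >> 12;

	return (void *)syscall(SYS_mmap2, NULL, len,
			       PROT_READ | PROT_WRITE,
			       MAP_SHARED | MAP_POPULATE, xsk_fd, pgoff);
#else
	/* On 64-bit builds plain mmap() takes the full offset. */
	return mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_SHARED | MAP_POPULATE, xsk_fd,
		    XDP_UMEM_PGOFF_FILL_RING);
#endif
}
```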

/Magnus

> Signed-off-by: Ivan Khoronzhuk 
> ---
>
> Based on bpf-next/master
>
> v2..v1:
> removed not necessarily #ifdev as ULL and UL for 64 has same size
>
>  mm/mmap.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 7e8c3e8ae75f..578f52812361 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -1358,6 +1358,9 @@ static inline u64 file_mmap_size_max(struct file *file, 
> struct inode *inode)
> if (S_ISBLK(inode->i_mode))
> return MAX_LFS_FILESIZE;
>
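> +   /* Sockets (e.g. AF_XDP rings) use large pseudo-offsets when mmapped. */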
> +   if (S_ISSOCK(inode->i_mode))
> +   return MAX_LFS_FILESIZE;
> +
> /* Special "we do even unsigned file positions" case */
> if (file->f_mode & FMODE_UNSIGNED_OFFSET)
> return 0;
> --
> 2.17.1
>


Re: [PATCH bpf v2] xdp: fix race on generic receive path

2019-07-03 Thread Magnus Karlsson
On Wed, Jul 3, 2019 at 2:09 PM Ilya Maximets  wrote:
>
> Unlike driver mode, generic xdp receive could be triggered
> by different threads on different CPU cores at the same time,
> leading to corruption of the fill and rx queues. For example, this
> could happen while sending packets from two processes to the
> first interface of a veth pair while the second end of it is
> open with an AF_XDP socket.
>
> Need to take a lock for each generic receive to avoid race.
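
Roughly, the interleaving being closed here (an illustration, not taken
from the patch):

```
/* CPU0 (process A -> veth)          CPU1 (process B -> veth)
 *
 * xskq_peek_addr(fq, &addr) -> X
 *                                   xskq_peek_addr(fq, &addr) -> X
 * copy frame into X
 *                                   copy frame into X (clobbers CPU0's)
 * produce rx desc for X
 *                                   produce rx desc for X (duplicate)
 */
```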

I measured the performance degradation of rxdrop on my local machine
and it went from 2.19 to 2.08, so roughly a 5% drop. I think we can
live with this in XDP_SKB mode. If we at some later point in time need
to boost performance in this mode, let us look at it then from a
broader perspective and find the lowest-hanging fruit.

Thanks Ilya for this fix.

Acked-by: Magnus Karlsson 

> Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
> Signed-off-by: Ilya Maximets 
> ---
>
> Version 2:
> * spin_lock_irqsave --> spin_lock_bh.
>
>  include/net/xdp_sock.h |  2 ++
>  net/xdp/xsk.c  | 31 ++++++++++++++++++++++---------
>  2 files changed, 24 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index d074b6d60f8a..ac3c047d058c 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -67,6 +67,8 @@ struct xdp_sock {
>  * in the SKB destructor callback.
>  */
> spinlock_t tx_completion_lock;
> +   /* Protects generic receive. */
> +   spinlock_t rx_lock;
> u64 rx_dropped;
>  };
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index a14e8864e4fa..5e0637db92ea 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -123,13 +123,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> u64 addr;
> int err;
>
> -   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> -   return -EINVAL;
> +   spin_lock_bh(&xs->rx_lock);
> +
> +   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
> +   err = -EINVAL;
> +   goto out_unlock;
> +   }
>
> if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> -   xs->rx_dropped++;
> -   return -ENOSPC;
> +   err = -ENOSPC;
> +   goto out_drop;
> }
>
> addr += xs->umem->headroom;
> @@ -138,13 +142,21 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> memcpy(buffer, xdp->data_meta, len + metalen);
> addr += metalen;
> err = xskq_produce_batch_desc(xs->rx, addr, len);
> -   if (!err) {
> -   xskq_discard_addr(xs->umem->fq);
> -   xsk_flush(xs);
> -   return 0;
> -   }
> +   if (err)
> +   goto out_drop;
> +
> +   xskq_discard_addr(xs->umem->fq);
> +   xskq_produce_flush_desc(xs->rx);
>
> +   spin_unlock_bh(&xs->rx_lock);
> +
> +   xs->sk.sk_data_ready(&xs->sk);
> +   return 0;
> +
> +out_drop:
> xs->rx_dropped++;
> +out_unlock:
> +   spin_unlock_bh(&xs->rx_lock);
> return err;
>  }
>
> @@ -765,6 +777,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>
> xs = xdp_sk(sk);
> mutex_init(&xs->mutex);
> +   spin_lock_init(&xs->rx_lock);
> spin_lock_init(&xs->tx_completion_lock);
>
> mutex_lock(&net->xdp.lock);
> --
> 2.17.1
>
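
The v1-to-v2 change from spin_lock_irqsave() to spin_lock_bh() relies
on generic receive running only in process or softirq context, never
in hard-IRQ context, so disabling bottom halves is enough. A minimal
sketch of the two variants for comparison (illustrative only, not code
from the patch):

#include <linux/spinlock.h>

/* v1 variant: also masks hard IRQs; safe from any context, but pays
 * for saving and restoring the interrupt state on every packet. */
static void lock_v1(spinlock_t *lock)
{
	unsigned long flags;

	spin_lock_irqsave(lock, flags);
	/* ... produce into the rx ring ... */
	spin_unlock_irqrestore(lock, flags);
}

/* v2 variant: only disables bottom halves; sufficient when every
 * path taking the lock runs in process or softirq context. */
static void lock_v2(spinlock_t *lock)
{
	spin_lock_bh(lock);
	/* ... produce into the rx ring ... */
	spin_unlock_bh(lock);
}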


Re: [PATCH bpf] xdp: fix race on generic receive path

2019-07-02 Thread Magnus Karlsson
On Tue, Jul 2, 2019 at 4:36 PM Ilya Maximets  wrote:
>
> Unlike driver mode, generic XDP receive can be triggered
> by different threads on different CPU cores at the same time,
> leading to fill and rx queue breakage. For example, this
> can happen while sending packets from two processes to the
> first interface of a veth pair while the other end of the
> pair is open with an AF_XDP socket.
>
> We need to take a lock on each generic receive to avoid the race.

Thanks for this catch, Ilya. Do you have any performance numbers you
could share on the impact of adding this spin lock? The reason I ask
is that if the impact is negligible, then let us just add it. But if
it is too large, we might want to brainstorm about some other
possible solutions.

Thanks: Magnus
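
For reference, numbers like the rxdrop figure quoted in the v2 thread
above would typically come from the in-tree xdpsock sample; a
plausible invocation for generic (XDP_SKB) mode, assuming the option
set of the 2019-era samples/bpf/xdpsock_user.c:

	samples/bpf/xdpsock -i eth0 -q 0 -r -S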

> Fixes: c497176cb2e4 ("xsk: add Rx receive functions and poll support")
> Signed-off-by: Ilya Maximets 
> ---
>  include/net/xdp_sock.h |  2 ++
>  net/xdp/xsk.c  | 32 +++++++++++++++++++++++---------
>  2 files changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index d074b6d60f8a..ac3c047d058c 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -67,6 +67,8 @@ struct xdp_sock {
>  * in the SKB destructor callback.
>  */
> spinlock_t tx_completion_lock;
> +   /* Protects generic receive. */
> +   spinlock_t rx_lock;
> u64 rx_dropped;
>  };
>
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index a14e8864e4fa..19f41d2b670c 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -119,17 +119,22 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  {
> u32 metalen = xdp->data - xdp->data_meta;
> u32 len = xdp->data_end - xdp->data;
> +   unsigned long flags;
> void *buffer;
> u64 addr;
> int err;
>
> -   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index)
> -   return -EINVAL;
> +   spin_lock_irqsave(&xs->rx_lock, flags);
> +
> +   if (xs->dev != xdp->rxq->dev || xs->queue_id != xdp->rxq->queue_index) {
> +   err = -EINVAL;
> +   goto out_unlock;
> +   }
>
> if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> -   xs->rx_dropped++;
> -   return -ENOSPC;
> +   err = -ENOSPC;
> +   goto out_drop;
> }
>
> addr += xs->umem->headroom;
> @@ -138,13 +143,21 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> memcpy(buffer, xdp->data_meta, len + metalen);
> addr += metalen;
> err = xskq_produce_batch_desc(xs->rx, addr, len);
> -   if (!err) {
> -   xskq_discard_addr(xs->umem->fq);
> -   xsk_flush(xs);
> -   return 0;
> -   }
> +   if (err)
> +   goto out_drop;
> +
> +   xskq_discard_addr(xs->umem->fq);
> +   xskq_produce_flush_desc(xs->rx);
>
> +   spin_unlock_irqrestore(&xs->rx_lock, flags);
> +
> +   xs->sk.sk_data_ready(&xs->sk);
> +   return 0;
> +
> +out_drop:
> xs->rx_dropped++;
> +out_unlock:
> +   spin_unlock_irqrestore(&xs->rx_lock, flags);
> return err;
>  }
>
> @@ -765,6 +778,7 @@ static int xsk_create(struct net *net, struct socket *sock, int protocol,
>
> xs = xdp_sk(sk);
> mutex_init(&xs->mutex);
> +   spin_lock_init(&xs->rx_lock);
> spin_lock_init(&xs->tx_completion_lock);
>
> mutex_lock(&net->xdp.lock);
> --
> 2.17.1
>