RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-11 Thread wangyunjian


> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Monday, March 11, 2024 12:01 PM
> To: wangyunjian 
> Cc: Michael S. Tsirkin ; Paolo Abeni ;
> willemdebruijn.ker...@gmail.com; k...@kernel.org; bj...@kernel.org;
> magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Mon, Mar 4, 2024 at 9:45 PM wangyunjian 
> wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Michael S. Tsirkin [mailto:m...@redhat.com]
> > > Sent: Friday, March 1, 2024 7:53 PM
> > > To: wangyunjian 
> > > Cc: Paolo Abeni ;
> > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > da...@davemloft.net; b...@vger.kernel.org; net...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > virtualizat...@lists.linux.dev; xudingke ;
> > > liwei (DT) 
> > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > support
> > >
> > > On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > > > > -Original Message-
> > > > > From: Paolo Abeni [mailto:pab...@redhat.com]
> > > > > Sent: Thursday, February 29, 2024 7:13 PM
> > > > > To: wangyunjian ; m...@redhat.com;
> > > > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > > > da...@davemloft.net
> > > > > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > > > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > > > virtualizat...@lists.linux.dev; xudingke ;
> > > > > liwei (DT) 
> > > > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > > > support
> > > > >
> > > > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > > > }
> > > > > >  }
> > > > > >
> > > > > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > > > > +{
> > > > > > +   struct xsk_buff_pool *pool;
> > > > > > +   u32 i, batch, budget;
> > > > > > +   void *frame;
> > > > > > +
> > > > > > +   if (!ptr_ring_empty(&tfile->tx_ring))
> > > > > > +   return;
> > > > > > +
> > > > > > +   spin_lock(&tfile->pool_lock);
> > > > > > +   pool = tfile->xsk_pool;
> > > > > > +   if (!pool) {
> > > > > > +   spin_unlock(&tfile->pool_lock);
> > > > > > +   return;
> > > > > > +   }
> > > > > > +
> > > > > > +   if (tfile->nb_descs) {
> > > > > > +   xsk_tx_completed(pool, tfile->nb_descs);
> > > > > > +   if (xsk_uses_need_wakeup(pool))
> > > > > > +   xsk_set_tx_need_wakeup(pool);
> > > > > > +   }
> > > > > > +
> > > > > > +   spin_lock(&tfile->tx_ring.producer_lock);
> > > > > > +   budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > > > +
> > > > > > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > > > +   if (!batch) {
> > > > >
> > > > > This branch looks like an unneeded "optimization". The generic
> > > > > loop below should have the same effect with no measurable perf
> > > > > delta - and smaller code.
> > > > > Just remove this.
> > > > >
> > > > > > +   tfile->nb_descs = 0;
> > > > > > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > > > > > +   spin_unlock(&tfile->pool_lock);
> > >

RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-04 Thread wangyunjian



> -Original Message-
> From: Michael S. Tsirkin [mailto:m...@redhat.com]
> Sent: Friday, March 1, 2024 7:53 PM
> To: wangyunjian 
> Cc: Paolo Abeni ; willemdebruijn.ker...@gmail.com;
> jasow...@redhat.com; k...@kernel.org; bj...@kernel.org;
> magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Fri, Mar 01, 2024 at 11:45:52AM +, wangyunjian wrote:
> > > -Original Message-
> > > From: Paolo Abeni [mailto:pab...@redhat.com]
> > > Sent: Thursday, February 29, 2024 7:13 PM
> > > To: wangyunjian ; m...@redhat.com;
> > > willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > > k...@kernel.org; bj...@kernel.org; magnus.karls...@intel.com;
> > > maciej.fijalkow...@intel.com; jonathan.le...@gmail.com;
> > > da...@davemloft.net
> > > Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> > > linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> > > virtualizat...@lists.linux.dev; xudingke ;
> > > liwei (DT) 
> > > Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy
> > > support
> > >
> > > On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > > > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > > > }
> > > >  }
> > > >
> > > > +static void tun_peek_xsk(struct tun_file *tfile)
> > > > +{
> > > > +   struct xsk_buff_pool *pool;
> > > > +   u32 i, batch, budget;
> > > > +   void *frame;
> > > > +
> > > > +   if (!ptr_ring_empty(&tfile->tx_ring))
> > > > +   return;
> > > > +
> > > > +   spin_lock(&tfile->pool_lock);
> > > > +   pool = tfile->xsk_pool;
> > > > +   if (!pool) {
> > > > +   spin_unlock(&tfile->pool_lock);
> > > > +   return;
> > > > +   }
> > > > +
> > > > +   if (tfile->nb_descs) {
> > > > +   xsk_tx_completed(pool, tfile->nb_descs);
> > > > +   if (xsk_uses_need_wakeup(pool))
> > > > +   xsk_set_tx_need_wakeup(pool);
> > > > +   }
> > > > +
> > > > +   spin_lock(&tfile->tx_ring.producer_lock);
> > > > +   budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > > > +
> > > > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > > > +   if (!batch) {
> > >
> > > This branch looks like an unneeded "optimization". The generic loop
> > > below should have the same effect with no measurable perf delta - and
> > > smaller code.
> > > Just remove this.
> > >
> > > > +   tfile->nb_descs = 0;
> > > > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > > > +   spin_unlock(&tfile->pool_lock);
> > > > +   return;
> > > > +   }
> > > > +
> > > > +   tfile->nb_descs = batch;
> > > > +   for (i = 0; i < batch; i++) {
> > > > +   /* Encode the XDP DESC flag into lowest bit for consumer to differ
> > > > +* XDP desc from XDP buffer and sk_buff.
> > > > +*/
> > > > +   frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > > > +   /* The budget must be less than or equal to tx_ring.size,
> > > > +* so enqueuing will not fail.
> > > > +*/
> > > > +   __ptr_ring_produce(&tfile->tx_ring, frame);
> > > > +   }
> > > > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > > > +   spin_unlock(&tfile->pool_lock);
> > >
> > > More related to the general design: it looks wrong. What if
> > > get_rx_bufs() will fail (ENOBUF) after successful peeking? With no
> > > more incoming packets, later peek will return 0 and it looks like
> > > that the half-processed packets will stay in the ring forever???
> > >
> > > I think the 'ring produce' part should be moved into tun_do_read().
> >
Currently, vhost-net obtains a batch of descriptors/sk_buffs from the
ptr_ring, enqueues the batch to the virtqueue's queue, and then consumes
the descriptors/sk_buffs from the virtqueue's queue in sequence. As a
result, TUN does not know whether the batch descriptors have been used up,
and thus does not know when to return the batch descriptors.

RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-04 Thread wangyunjian


> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Monday, March 4, 2024 2:56 PM
> To: wangyunjian 
> Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net; b...@vger.kernel.org;
> net...@vger.kernel.org; linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Wed, Feb 28, 2024 at 7:06 PM Yunjian Wang 
> wrote:
> >
> > This patch set allows TUN to support the AF_XDP Tx zero-copy feature,
> > which can significantly reduce CPU utilization for XDP programs.
> >
> > Since commit fc72d1d54dd9 ("tuntap: XDP transmission"), the pointer
> > ring has been utilized to queue different types of pointers by
> > encoding the type into the lower bits. Therefore, we introduce a new
> > flag, TUN_XDP_DESC_FLAG(0x2UL), which allows us to enqueue XDP
> > descriptors and differentiate them from XDP buffers and sk_buffs.
> > Additionally, a spin lock is added for enabling and disabling
> > operations on the xsk pool.
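
As an aside, the tagging convention described above is easy to model in
userspace. Below is a minimal, runnable sketch of the idea only; the
helper names are made up here, and the flag values follow the text (0x1
being the existing XDP-frame flag in tun.c):

#include <stdint.h>
#include <stdio.h>

#define XDP_FRAME_FLAG 0x1UL	/* existing tun.c convention */
#define XDP_DESC_FLAG  0x2UL	/* the new flag from this patch */
#define FLAG_MASK      0x3UL

/* Tag a descriptor pointer: the low bits are free due to alignment. */
static void *desc_to_ptr(void *desc)
{
	return (void *)((uintptr_t)desc | XDP_DESC_FLAG);
}

static int ptr_is_desc(void *ptr)
{
	return ((uintptr_t)ptr & FLAG_MASK) == XDP_DESC_FLAG;
}

static void *ptr_to_desc(void *ptr)
{
	return (void *)((uintptr_t)ptr & ~FLAG_MASK);
}

int main(void)
{
	long storage;	/* stands in for a struct xdp_desc */
	void *entry = desc_to_ptr(&storage);

	printf("tagged as desc: %d\n", ptr_is_desc(entry));
	printf("round-trip ok:  %d\n", ptr_to_desc(entry) == (void *)&storage);
	return 0;
}
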
> >
> > The performance testing was performed on an Intel E5-2620 2.40GHz
> > machine. Traffic was generated/sent through TUN (testpmd txonly with
> > AF_XDP) to a VM (testpmd rxonly in guest).
> >
> > +------+---------+---------+---------+
> > |      |   copy  |zero-copy| speedup |
> > +------+---------+---------+---------+
> > | UDP  |   Mpps  |   Mpps  |    %    |
> > | 64   |   2.5   |   4.0   |   60%   |
> > | 512  |   2.1   |   3.6   |   71%   |
> > | 1024 |   1.9   |   3.3   |   73%   |
> > +------+---------+---------+---------+
> >
> > Signed-off-by: Yunjian Wang 
> > ---
> >  drivers/net/tun.c  | 177 +++--
> >  drivers/vhost/net.c|   4 +
> >  include/linux/if_tun.h |  32 
> >  3 files changed, 208 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index bc80fc1d576e..7f4ff50b532c 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -63,6 +63,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <net/xdp_sock_drv.h>
> >  #include 
> >  #include 
> >  #include 
> > @@ -86,6 +87,7 @@ static void tun_default_link_ksettings(struct net_device *dev,
> >struct ethtool_link_ksettings *cmd);
> >
> >  #define TUN_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
> > +#define TUN_XDP_BATCH 64
> >
> >  /* TUN device flags */
> >
> > @@ -146,6 +148,9 @@ struct tun_file {
> > struct tun_struct *detached;
> > struct ptr_ring tx_ring;
> > struct xdp_rxq_info xdp_rxq;
> > +   struct xsk_buff_pool *xsk_pool;
> > +   spinlock_t pool_lock;   /* Protects xsk pool enable/disable */
> > +   u32 nb_descs;
> >  };
> >
> >  struct tun_page {
> > @@ -614,6 +619,8 @@ void tun_ptr_free(void *ptr)
> > struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
> >
> > xdp_return_frame(xdpf);
> > +   } else if (tun_is_xdp_desc_frame(ptr)) {
> > +   return;
> > } else {
> > __skb_array_destroy_skb(ptr);
> > }
> > @@ -631,6 +638,37 @@ static void tun_queue_purge(struct tun_file *tfile)
> > skb_queue_purge(&tfile->sk.sk_error_queue);
> >  }
> >
> > +static void tun_set_xsk_pool(struct tun_file *tfile, struct xsk_buff_pool *pool)
> > +{
> > +   if (!pool)
> > +   return;
> > +
> > +   spin_lock(&tfile->pool_lock);
> > +   xsk_pool_set_rxq_info(pool, &tfile->xdp_rxq);
> > +   tfile->xsk_pool = pool;
> > +   spin_unlock(&tfile->pool_lock);
> > +}
> > +
> > +static void tun_clean_xsk_pool(struct tun_file *tfile)
> > +{
> > +   spin_lock(&tfile->pool_lock);
> > +   if (tfile->xsk_pool) {
> > +   void *ptr;
> > +
> > +   while ((ptr = ptr_ring_consume(&tfile->tx_ring)) != NULL)
> > +   tun_ptr_free(ptr);
> > +
> > +   if (tfile->nb_descs) {
> > +   xsk_tx_completed(tfile->xsk_pool, tfile->nb_descs);
> > +   if (xsk_uses_need_wakeup(tfile->xsk_pool))
> > +   xsk_set_tx_need_wakeup(tfile->xsk_pool);

RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-03-01 Thread wangyunjian
> -Original Message-
> From: Paolo Abeni [mailto:pab...@redhat.com]
> Sent: Thursday, February 29, 2024 7:13 PM
> To: wangyunjian ; m...@redhat.com;
> willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net
> Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > }
> >  }
> >
> > +static void tun_peek_xsk(struct tun_file *tfile)
> > +{
> > +   struct xsk_buff_pool *pool;
> > +   u32 i, batch, budget;
> > +   void *frame;
> > +
> > +   if (!ptr_ring_empty(&tfile->tx_ring))
> > +   return;
> > +
> > +   spin_lock(&tfile->pool_lock);
> > +   pool = tfile->xsk_pool;
> > +   if (!pool) {
> > +   spin_unlock(&tfile->pool_lock);
> > +   return;
> > +   }
> > +
> > +   if (tfile->nb_descs) {
> > +   xsk_tx_completed(pool, tfile->nb_descs);
> > +   if (xsk_uses_need_wakeup(pool))
> > +   xsk_set_tx_need_wakeup(pool);
> > +   }
> > +
> > +   spin_lock(&tfile->tx_ring.producer_lock);
> > +   budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > +
> > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > +   if (!batch) {
> 
> This branch looks like an unneeded "optimization". The generic loop below
> should have the same effect with no measurable perf delta - and smaller code.
> Just remove this.
> 
> > +   tfile->nb_descs = 0;
> > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > +   spin_unlock(&tfile->pool_lock);
> > +   return;
> > +   }
> > +
> > +   tfile->nb_descs = batch;
> > +   for (i = 0; i < batch; i++) {
> > +   /* Encode the XDP DESC flag into lowest bit for consumer to differ
> > +* XDP desc from XDP buffer and sk_buff.
> > +*/
> > +   frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > +   /* The budget must be less than or equal to tx_ring.size,
> > +* so enqueuing will not fail.
> > +*/
> > +   __ptr_ring_produce(&tfile->tx_ring, frame);
> > +   }
> > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > +   spin_unlock(&tfile->pool_lock);
> 
> More related to the general design: it looks wrong. What if
> get_rx_bufs() will fail (ENOBUF) after successful peeking? With no more
> incoming packets, later peek will return 0 and it looks like that the
> half-processed packets will stay in the ring forever???
> 
> I think the 'ring produce' part should be moved into tun_do_read().

Currently, vhost-net obtains a batch of descriptors/sk_buffs from the
ptr_ring, enqueues the batch to the virtqueue's queue, and then consumes
the descriptors/sk_buffs from the virtqueue's queue in sequence. As a
result, TUN does not know whether the batch descriptors have been used up,
and thus does not know when to return the batch descriptors.

So I think it's reasonable that, when vhost-net finds the ptr_ring empty,
it calls peek_len to get new xsk descs and to return the previous ones.
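
To make that lifecycle concrete, here is a minimal, runnable userspace
model of the deferred-completion scheme (illustrative only, not the
patch's code): the previous batch is completed back to the pool only on
the next peek, mirroring the xsk_tx_completed() call at the top of
tun_peek_xsk().

#include <stdio.h>

struct pool { int outstanding; int available; };

static int peek_batch(struct pool *p, int budget)
{
	if (p->outstanding) {		/* complete the PREVIOUS batch */
		printf("completed %d descs\n", p->outstanding);
		p->outstanding = 0;
	}
	int batch = p->available < budget ? p->available : budget;
	p->available -= batch;
	p->outstanding = batch;		/* remembered until the next peek */
	return batch;
}

int main(void)
{
	struct pool p = { .outstanding = 0, .available = 5 };

	printf("peek 1 queued %d descs\n", peek_batch(&p, 4));
	/* ... vhost-net consumes them from the ptr_ring here ... */
	printf("peek 2 queued %d descs\n", peek_batch(&p, 4));
	return 0;
}
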

Thanks
> 
> Cheers,
> 
> Paolo



RE: [PATCH net-next v2 2/3] vhost_net: Call peek_len when using xdp

2024-02-29 Thread wangyunjian


> -Original Message-
> From: Paolo Abeni [mailto:pab...@redhat.com]
> Sent: Thursday, February 29, 2024 6:49 PM
> To: wangyunjian ; m...@redhat.com;
> willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net
> Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 2/3] vhost_net: Call peek_len when using xdp
> 
> On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > If TUN supports AF_XDP TX zero-copy, the XDP program will enqueue
> > packets to the XDP ring and wake up the vhost worker. This requires
> > the vhost worker to call peek_len(), which can be used to consume XDP
> > descriptors.
> >
> > Signed-off-by: Yunjian Wang 
> > ---
> >  drivers/vhost/net.c | 17 -
> >  1 file changed, 12 insertions(+), 5 deletions(-)
> >
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c index
> > f2ed7167c848..077e74421558 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -207,6 +207,11 @@ static int vhost_net_buf_peek_len(void *ptr)
> > return __skb_array_len_with_tag(ptr);
> >  }
> >
> > +static bool vhost_sock_xdp(struct socket *sock)
> > +{
> > +   return sock_flag(sock->sk, SOCK_XDP);
> > +}
> > +
> >  static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
> >  {
> > struct vhost_net_buf *rxq = &nvq->rxq;
> > @@ -214,6 +219,13 @@ static int vhost_net_buf_peek(struct vhost_net_virtqueue *nvq)
> > if (!vhost_net_buf_is_empty(rxq))
> > goto out;
> >
> > +   if (ptr_ring_empty(nvq->rx_ring)) {
> > +   struct socket *sock = vhost_vq_get_backend(&nvq->vq);
> > +   /* Call peek_len to consume XSK descriptors, when using xdp */
> > +   if (vhost_sock_xdp(sock) && sock->ops->peek_len)
> > +   sock->ops->peek_len(sock);
> 
> This really looks like a socket API misuse. Why can't you use ptr-ring
> primitives to consume XSK descriptors? peek_len could be constified some
> day, this code will prevent such (good) thing.

Thank you for your suggestion. I will consider that with Patch 3/3.

> 
> Cheers,
> 
> Paolo



RE: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support

2024-02-29 Thread wangyunjian
> -Original Message-
> From: Paolo Abeni [mailto:pab...@redhat.com]
> Sent: Thursday, February 29, 2024 7:13 PM
> To: wangyunjian ; m...@redhat.com;
> willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net
> Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 3/3] tun: AF_XDP Tx zero-copy support
> 
> On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > @@ -2661,6 +2776,54 @@ static int tun_ptr_peek_len(void *ptr)
> > }
> >  }
> >
> > +static void tun_peek_xsk(struct tun_file *tfile)
> > +{
> > +   struct xsk_buff_pool *pool;
> > +   u32 i, batch, budget;
> > +   void *frame;
> > +
> > +   if (!ptr_ring_empty(&tfile->tx_ring))
> > +   return;
> > +
> > +   spin_lock(&tfile->pool_lock);
> > +   pool = tfile->xsk_pool;
> > +   if (!pool) {
> > +   spin_unlock(&tfile->pool_lock);
> > +   return;
> > +   }
> > +
> > +   if (tfile->nb_descs) {
> > +   xsk_tx_completed(pool, tfile->nb_descs);
> > +   if (xsk_uses_need_wakeup(pool))
> > +   xsk_set_tx_need_wakeup(pool);
> > +   }
> > +
> > +   spin_lock(&tfile->tx_ring.producer_lock);
> > +   budget = min_t(u32, tfile->tx_ring.size, TUN_XDP_BATCH);
> > +
> > +   batch = xsk_tx_peek_release_desc_batch(pool, budget);
> > +   if (!batch) {
> 
> This branch looks like an unneeded "optimization". The generic loop below
> should have the same effect with no measurable perf delta - and smaller code.
> Just remove this.

OK, I will update it, thanks.

> 
> > +   tfile->nb_descs = 0;
> > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > +   spin_unlock(&tfile->pool_lock);
> > +   return;
> > +   }
> > +
> > +   tfile->nb_descs = batch;
> > +   for (i = 0; i < batch; i++) {
> > +   /* Encode the XDP DESC flag into lowest bit for consumer to differ
> > +* XDP desc from XDP buffer and sk_buff.
> > +*/
> > +   frame = tun_xdp_desc_to_ptr(&pool->tx_descs[i]);
> > +   /* The budget must be less than or equal to tx_ring.size,
> > +* so enqueuing will not fail.
> > +*/
> > +   __ptr_ring_produce(&tfile->tx_ring, frame);
> > +   }
> > +   spin_unlock(&tfile->tx_ring.producer_lock);
> > +   spin_unlock(&tfile->pool_lock);
> 
> More related to the general design: it looks wrong. What if
> get_rx_bufs() will fail (ENOBUF) after successful peeking? With no more
> incoming packets, later peek will return 0 and it looks like that the
> half-processed packets will stay in the ring forever???

The vhost_net_rx_peek_head_len function obtains the packet length but does
not consume the packet. The packet stays in the ring, so a later peek will
find it again.
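
The peek/consume distinction is the crux here. A minimal, runnable model
of the two operations (illustrative; the kernel's ptr_ring is more
involved):

#include <stdio.h>

#define RING_SIZE 4
static void *ring[RING_SIZE];
static unsigned int head, tail;

static void *ring_peek(void)	/* look at the head, leave it in place */
{
	return head == tail ? NULL : ring[head % RING_SIZE];
}

static void *ring_consume(void)	/* remove the head */
{
	void *p = ring_peek();
	if (p)
		head++;
	return p;
}

static void ring_produce(void *p)
{
	ring[tail++ % RING_SIZE] = p;
}

int main(void)
{
	int pkt = 42;

	ring_produce(&pkt);
	printf("peek:    %p\n", ring_peek());	/* entry still queued */
	printf("peek:    %p\n", ring_peek());	/* same entry again */
	printf("consume: %p\n", ring_consume());
	printf("peek:    %p\n", ring_peek());	/* NULL: ring now empty */
	return 0;
}
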

> 
> I think the 'ring produce' part should be moved into tun_do_read().

Thank you for your suggestion. I will consider that.

> 
> Cheers,
> 
> Paolo



RE: [PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-29 Thread wangyunjian
> -Original Message-
> From: Paolo Abeni [mailto:pab...@redhat.com]
> Sent: Thursday, February 29, 2024 6:43 PM
> To: wangyunjian ; m...@redhat.com;
> willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> bj...@kernel.org; magnus.karls...@intel.com; maciej.fijalkow...@intel.com;
> jonathan.le...@gmail.com; da...@davemloft.net
> Cc: b...@vger.kernel.org; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke ; liwei (DT)
> 
> Subject: Re: [PATCH net-next v2 1/3] xsk: Remove non-zero 'dma_page' check in
> xp_assign_dev
> 
> On Wed, 2024-02-28 at 19:05 +0800, Yunjian Wang wrote:
> > Now dma mappings are used by the physical NICs. However, a vNIC may
> > not need them. So remove the non-zero 'dma_page' check in
> > xp_assign_dev.
> >
> > Signed-off-by: Yunjian Wang 
> > ---
> >  net/xdp/xsk_buff_pool.c | 7 ---
> >  1 file changed, 7 deletions(-)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index ce60ecd48a4d..a5af75b1f43c 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > if (err)
> > goto err_unreg_pool;
> >
> > -   if (!pool->dma_pages) {
> > -   WARN(1, "Driver did not DMA map zero-copy buffers");
> > -   err = -EINVAL;
> > -   goto err_unreg_xsk;
> > -   }
> 
> This would unconditionally remove an otherwise valid check for most NICs.
> What about letting the driver declare it won't need a DMA map with a
> (pool?) flag?

This check is redundant. The NIC's driver determines whether a DMA map is
required. If the driver requires the DMA map, it calls the xsk_pool_dma_map
function, which initializes the DMA map and performs this check itself.
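
For reference, a hedged sketch of the driver-side pattern being referred
to; the function and variable names below are illustrative and only
xsk_pool_dma_map() is the actual API:

/* Sketch: a physical NIC that needs DMA maps the pool itself during
 * XSK setup, and a mapping failure already aborts zero-copy there,
 * which is why the extra check in xp_assign_dev() is redundant.
 */
static int my_enable_xsk_pool(struct net_device *netdev,
			      struct xsk_buff_pool *pool, u16 qid)
{
	struct device *dma_dev = netdev->dev.parent; /* DMA-capable device */
	int err;

	err = xsk_pool_dma_map(pool, dma_dev, 0);
	if (err)
		return err;	/* no DMA map -> no zero-copy */

	/* ... attach the mapped pool to hardware queue 'qid' ... */
	return 0;
}
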

Thanks

> 
> Cheers,
> 
> Paolo



RE: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-21 Thread wangyunjian
> -Original Message-
> From: Xuan Zhuo [mailto:xuanz...@linux.alibaba.com]
> Sent: Wednesday, February 21, 2024 5:53 PM
> To: wangyunjian 
> Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> ; wangyunjian ;
> m...@redhat.com; willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com
> Subject: Re: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in
> xp_assign_dev
> 
> On Wed, 24 Jan 2024 17:37:38 +0800, Yunjian Wang
>  wrote:
> > Now dma mappings are used by the physical NICs. However, a vNIC may
> > not need them. So remove the non-zero 'dma_page' check in
> > xp_assign_dev.
> 
> Could you tell me which NIC can work with AF_XDP without DMA?

TUN will support AF_XDP Tx zero-copy, which does not require DMA mappings.

Thanks

> 
> Thanks.
> 
> 
> >
> > Signed-off-by: Yunjian Wang 
> > ---
> >  net/xdp/xsk_buff_pool.c | 7 ---
> >  1 file changed, 7 deletions(-)
> >
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index 28711cc44ced..939b6e7b59ff 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > if (err)
> > goto err_unreg_pool;
> >
> > -   if (!pool->dma_pages) {
> > -   WARN(1, "Driver did not DMA map zero-copy buffers");
> > -   err = -EINVAL;
> > -   goto err_unreg_xsk;
> > -   }
> > pool->umem->zc = true;
> > return 0;
> >
> > -err_unreg_xsk:
> > -   xp_disable_drv_zc(pool);
> >  err_unreg_pool:
> > if (!force_zc)
> > err = 0; /* fallback to copy mode */
> > --
> > 2.33.0
> >
> >



RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread wangyunjian
> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Monday, January 29, 2024 11:03 AM
> To: wangyunjian 
> Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke 
> Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> 
> On Thu, Jan 25, 2024 at 8:54 PM wangyunjian 
> wrote:
> >
> >
> >
> > > -Original Message-
> > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > Sent: Thursday, January 25, 2024 12:49 PM
> > > To: wangyunjian 
> > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > 
> > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > >
> > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > 
> > > wrote:
> > > >
> > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > >
> > > > This patch tries to address this by:
> > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > - When the tun supports AF_XDP Rx zero-copy, the vq's array may be empty.
> > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > - add tun_put_user_desc function to copy the Rx data to VM
> > >
> > > Code explains themselves, let's explain why you need to do this.
> > >
> > > 1) why you want to use peek_len
> > > 2) for "vq's array", what does it mean?
> > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > > RX?)
> >
> > OK, I agree and will use TX zerocopy instead of RX zerocopy. I meant RX
> > zerocopy from the view of vhost-net.
> 
> Ok.
> 
> >
> > >
> > > A big question is how could you handle GSO packets from
> userspace/guests?
> >
> > For now, by disabling the VM's TSO and csum features.
> 
> Btw, how could you do that?

By setting network backend-specific options:
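
For example (illustrative QEMU virtio-net properties; these are assumed
here and are not the original message content, which was lost in
archiving):

-device virtio-net-pci,netdev=net0,csum=off,gso=off,guest_tso4=off,guest_tso6=off,guest_ecn=off,guest_ufo=off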





Thanks

> 
> Thanks
> 



RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-29 Thread wangyunjian
> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Monday, January 29, 2024 11:05 AM
> To: wangyunjian 
> Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke 
> Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> 
> On Sat, Jan 27, 2024 at 5:34 PM wangyunjian 
> wrote:
> >
> > > > -Original Message-
> > > > From: Jason Wang [mailto:jasow...@redhat.com]
> > > > Sent: Thursday, January 25, 2024 12:49 PM
> > > > To: wangyunjian 
> > > > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com;
> > > > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com;
> > > > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > > > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > > > 
> > > > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> > > >
> > > > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> > > 
> > > > wrote:
> > > > >
> > > > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > > > drivers, which can reduce CPU utilization on the xdp program.
> > > > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > > > >
> > > > > This patch tries to address this by:
> > > > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > > > - When the tun supports AF_XDP Rx zero-copy, the vq's array may be
> > > > > empty.
> > > > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > > > - add tun_put_user_desc function to copy the Rx data to VM
> > > >
> > > > Code explains themselves, let's explain why you need to do this.
> > > >
> > > > 1) why you want to use peek_len
> > > > 2) for "vq's array", what does it mean?
> > > > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so
> > > > I guess you meant TX zerocopy instead of RX (as I don't see codes
> > > > for
> > > > RX?)
> > >
> > > OK, I agree and will use TX zerocopy instead of RX zerocopy. I meant RX
> > > zerocopy from the view of vhost-net.
> > >
> > > >
> > > > A big question is how could you handle GSO packets from
> userspace/guests?
> > >
> > > For now, by disabling the VM's TSO and csum features. XDP does not
> > > support GSO packets.
> > > However, this feature can be added once XDP supports it in the future.
> > >
> > > >
> > > > >
> > > > > Signed-off-by: Yunjian Wang 
> > > > > ---
> > > > >  drivers/net/tun.c   | 165 +++-
> > > > >  drivers/vhost/net.c |  18 +++--
> > > > >  2 files changed, 176 insertions(+), 7 deletions(-)
> >
> > [...]
> >
> > > > >
> > > > >  static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
> > > > >  {
> > > > > +   struct socket *sock = sk->sk_socket;
> > > > > struct sk_buff *head;
> > > > > int len = 0;
> > > > > unsigned long flags;
> > > > >
> > > > > -   if (rvq->rx_ring)
> > > > > -   return vhost_net_buf_peek(rvq);
> > > > > +   if (rvq->rx_ring) {
> > > > > +   len = vhost_net_buf_peek(rvq);
> > > > > +   if (likely(len))
> > > > > +   return len;
> > > > > +   }
> > > > > +
> > > > > +   if (sock->ops->peek_len)
> > > > > +   return sock->ops->peek_len(sock);
> > > >
> > > > What prevents you from reusing the ptr_ring here? Then you don't
> > > > need the above tricks.
> > >
> > > Thank you for your suggestion. I will consider how to reuse the ptr_ring.
> >
> > If ptr_ring is used to transfer xdp_descs, there is a problem: After
> > some xdp_descs are obtained through xsk_tx_peek_desc(), the descs may
> > fail to be added to ptr_ring. However, no API is available to
> > implement the rollback function.
> 
> I don't understand, this issue seems to exist in the physical NIC as well?
> 
> We get more descriptors than the free slots in the NIC ring.
> 
> How did other NICs solve this issue?

Currently, physical NICs such as i40e, ice, ixgbe, igc, and mlx5 obtain the
available NIC descriptors first and then retrieve the same number of xsk
descriptors for processing.
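
In sketch form, that pattern looks roughly like the following (the my_*
names are stand-ins, not code from any of those drivers): the budget is
capped by the free hardware Tx slots first, so every descriptor the pool
releases is guaranteed a slot and no rollback is ever needed.

static void my_xsk_xmit(struct my_tx_ring *txr, struct xsk_buff_pool *pool)
{
	/* never ask for more descriptors than we can post */
	u32 budget = min_t(u32, my_free_tx_slots(txr), MY_XSK_BATCH);
	u32 nb_pkts = xsk_tx_peek_release_desc_batch(pool, budget);
	u32 i;

	for (i = 0; i < nb_pkts; i++)
		my_post_tx_desc(txr, &pool->tx_descs[i]); /* cannot fail */

	if (nb_pkts)
		my_kick_hardware(txr);
}
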

Thanks

> 
> Thanks
> 
> >
> > Thanks
> >
> > >
> > > >
> > > > Thanks
> > > >
> > > >
> > > > >
> > > > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > > > head = skb_peek(&sk->sk_receive_queue);
> > > > > --
> > > > > 2.33.0
> > > > >
> >



RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-27 Thread wangyunjian
> > -Original Message-
> > From: Jason Wang [mailto:jasow...@redhat.com]
> > Sent: Thursday, January 25, 2024 12:49 PM
> > To: wangyunjian 
> > Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> > da...@davemloft.net; magnus.karls...@intel.com;
> > net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > 
> > Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> >
> > On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang
> 
> > wrote:
> > >
> > > Now the zero-copy feature of AF_XDP socket is supported by some
> > > drivers, which can reduce CPU utilization on the xdp program.
> > > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> > >
> > > This patch tries to address this by:
> > > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > > - When the tun supports AF_XDP Rx zero-copy, the vq's array may be empty.
> > > So add a check for empty vq's array in vhost_net_buf_produce().
> > > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > > - add tun_put_user_desc function to copy the Rx data to VM
> >
> > Code explains themselves, let's explain why you need to do this.
> >
> > 1) why you want to use peek_len
> > 2) for "vq's array", what does it mean?
> > 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I
> > guess you meant TX zerocopy instead of RX (as I don't see codes for
> > RX?)
> 
> OK, I agree and will use TX zerocopy instead of RX zerocopy. I meant RX
> zerocopy from the view of vhost-net.
> 
> >
> > A big question is how could you handle GSO packets from userspace/guests?
> 
> For now, by disabling the VM's TSO and csum features. XDP does not
> support GSO packets.
> However, this feature can be added once XDP supports it in the future.
> 
> >
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  drivers/net/tun.c   | 165 +++-
> > >  drivers/vhost/net.c |  18 +++--
> > >  2 files changed, 176 insertions(+), 7 deletions(-)

[...]

> > >
> > >  static int peek_head_len(struct vhost_net_virtqueue *rvq, struct sock *sk)
> > >  {
> > > +   struct socket *sock = sk->sk_socket;
> > > struct sk_buff *head;
> > > int len = 0;
> > > unsigned long flags;
> > >
> > > -   if (rvq->rx_ring)
> > > -   return vhost_net_buf_peek(rvq);
> > > +   if (rvq->rx_ring) {
> > > +   len = vhost_net_buf_peek(rvq);
> > > +   if (likely(len))
> > > +   return len;
> > > +   }
> > > +
> > > +   if (sock->ops->peek_len)
> > > +   return sock->ops->peek_len(sock);
> >
> > What prevents you from reusing the ptr_ring here? Then you don't need
> > the above tricks.
> 
> Thank you for your suggestion. I will consider how to reuse the ptr_ring.

If ptr_ring is used to transfer xdp_descs, there is a problem: after some
xdp_descs are obtained through xsk_tx_peek_desc(), the descs may fail to
be added to the ptr_ring, and no API is available to roll the peek back.
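
In sketch form, the problem looks like this (encode() stands in for the
pointer-tagging helper; illustrative, not proposed code):

struct xdp_desc desc;

while (xsk_tx_peek_desc(pool, &desc)) {	/* advances the xsk Tx queue */
	if (ptr_ring_produce(&tfile->tx_ring, encode(&desc))) {
		/* The ring is full, but the descriptor was already taken
		 * from the xsk queue, and there is no "unpeek" API to
		 * hand it back, so it would be lost here.
		 */
		break;
	}
}
xsk_tx_release(pool);
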

Thanks

> 
> >
> > Thanks
> >
> >
> > >
> > > spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
> > > head = skb_peek(&sk->sk_receive_queue);
> > > --
> > > 2.33.0
> > >



RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-25 Thread wangyunjian


> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Thursday, January 25, 2024 12:49 PM
> To: wangyunjian 
> Cc: m...@redhat.com; willemdebruijn.ker...@gmail.com; k...@kernel.org;
> da...@davemloft.net; magnus.karls...@intel.com; net...@vger.kernel.org;
> linux-kernel@vger.kernel.org; k...@vger.kernel.org;
> virtualizat...@lists.linux.dev; xudingke 
> Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> 
> On Wed, Jan 24, 2024 at 5:38 PM Yunjian Wang 
> wrote:
> >
> > Now the zero-copy feature of AF_XDP socket is supported by some
> > drivers, which can reduce CPU utilization on the xdp program.
> > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> >
> > This patch tries to address this by:
> > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > - When the tun supports AF_XDP Rx zero-copy, the vq's array may be empty.
> > So add a check for empty vq's array in vhost_net_buf_produce().
> > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > - add tun_put_user_desc function to copy the Rx data to VM
> 
> Code explains themselves, let's explain why you need to do this.
> 
> 1) why you want to use peek_len
> 2) for "vq's array", what does it mean?
> 3) from the view of TUN/TAP tun_put_user_desc() is the TX path, so I guess you
> meant TX zerocopy instead of RX (as I don't see codes for
> RX?)

OK, I agree and will use TX zerocopy instead of RX zerocopy. I meant RX
zerocopy from the view of vhost-net.

> 
> A big question is how could you handle GSO packets from userspace/guests?

For now, by disabling the VM's TSO and csum features. XDP does not support
GSO packets. However, this feature can be added once XDP supports it in
the future.

> 
> >
> > Signed-off-by: Yunjian Wang 
> > ---
> >  drivers/net/tun.c   | 165 +++-
> >  drivers/vhost/net.c |  18 +++--
> >  2 files changed, 176 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index afa5497f7c35..248b0f8e07d1 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -77,6 +77,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <net/xdp_sock_drv.h>
> >
> >  #include 
> >  #include 
> > @@ -145,6 +146,10 @@ struct tun_file {
> > struct tun_struct *detached;
> > struct ptr_ring tx_ring;
> > struct xdp_rxq_info xdp_rxq;
> > +   struct xdp_desc desc;
> > +   /* protects xsk pool */
> > +   spinlock_t pool_lock;
> > +   struct xsk_buff_pool *pool;
> >  };
> >
> >  struct tun_page {
> > @@ -208,6 +213,8 @@ struct tun_struct {
> > struct bpf_prog __rcu *xdp_prog;
> > struct tun_prog __rcu *steering_prog;
> > struct tun_prog __rcu *filter_prog;
> > +   /* tracks AF_XDP ZC enabled queues */
> > +   unsigned long *af_xdp_zc_qps;
> > struct ethtool_link_ksettings link_ksettings;
> > /* init args */
> > struct file *file;
> > @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun,
> > struct file *file,
> >
> > tfile->queue_index = tun->numqueues;
> > tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
> > +   tfile->desc.len = 0;
> > +   tfile->pool = NULL;
> >
> > if (tfile->detached) {
> > /* Re-attach detached tfile, updating XDP queue_index */
> > @@ -989,6 +998,13 @@ static int tun_net_init(struct net_device *dev)
> > return err;
> > }
> >
> > +   tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, GFP_KERNEL);
> > +   if (!tun->af_xdp_zc_qps) {
> > +   security_tun_dev_free_security(tun->security);
> > +   free_percpu(dev->tstats);
> > +   return -ENOMEM;
> > +   }
> > +
> > tun_flow_init(tun);
> >
> > dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> > @@ -1009,6 +1025,7 @@ static int tun_net_init(struct net_device *dev)
> > tun_flow_uninit(tun);
> > security_tun_dev_free_security(tun->security);
> > free_percpu(dev->tstats);
> > +   bitmap_free(tun->af_xdp_zc_qps);
> > return err;
> > }
> > return 0;
> > @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
> > return 0;
>

RE: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support

2024-01-25 Thread wangyunjian
> -Original Message-
> From: Willem de Bruijn [mailto:willemdebruijn.ker...@gmail.com]
> Sent: Thursday, January 25, 2024 3:05 AM
> To: wangyunjian ; m...@redhat.com;
> willemdebruijn.ker...@gmail.com; jasow...@redhat.com; k...@kernel.org;
> da...@davemloft.net; magnus.karls...@intel.com
> Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> ; wangyunjian 
> Subject: Re: [PATCH net-next 2/2] tun: AF_XDP Rx zero-copy support
> 
> Yunjian Wang wrote:
> > Now the zero-copy feature of AF_XDP socket is supported by some
> > drivers, which can reduce CPU utilization on the xdp program.
> > This patch set allows tun to support AF_XDP Rx zero-copy feature.
> >
> > This patch tries to address this by:
> > - Use peek_len to consume a xsk->desc and get xsk->desc length.
> > - When the tun supports AF_XDP Rx zero-copy, the vq's array may be empty.
> > So add a check for empty vq's array in vhost_net_buf_produce().
> > - add XDP_SETUP_XSK_POOL and ndo_xsk_wakeup callback support
> > - add tun_put_user_desc function to copy the Rx data to VM
> >
> > Signed-off-by: Yunjian Wang 
> 
> I don't fully understand the higher level design of this feature yet.

This feature reduces memory copies for the XDP program.

> 
> But some initial comments at the code level.
> 
> > ---
> >  drivers/net/tun.c   | 165 +++-
> >  drivers/vhost/net.c |  18 +++--
> >  2 files changed, 176 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> > index afa5497f7c35..248b0f8e07d1 100644
> > --- a/drivers/net/tun.c
> > +++ b/drivers/net/tun.c
> > @@ -77,6 +77,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <net/xdp_sock_drv.h>
> >
> >  #include 
> >  #include 
> > @@ -145,6 +146,10 @@ struct tun_file {
> > struct tun_struct *detached;
> > struct ptr_ring tx_ring;
> > struct xdp_rxq_info xdp_rxq;
> > +   struct xdp_desc desc;
> > +   /* protects xsk pool */
> > +   spinlock_t pool_lock;
> > +   struct xsk_buff_pool *pool;
> >  };
> >
> >  struct tun_page {
> > @@ -208,6 +213,8 @@ struct tun_struct {
> > struct bpf_prog __rcu *xdp_prog;
> > struct tun_prog __rcu *steering_prog;
> > struct tun_prog __rcu *filter_prog;
> > +   /* tracks AF_XDP ZC enabled queues */
> > +   unsigned long *af_xdp_zc_qps;
> > struct ethtool_link_ksettings link_ksettings;
> > /* init args */
> > struct file *file;
> > @@ -795,6 +802,8 @@ static int tun_attach(struct tun_struct *tun, struct file *file,
> >
> > tfile->queue_index = tun->numqueues;
> > tfile->socket.sk->sk_shutdown &= ~RCV_SHUTDOWN;
> > +   tfile->desc.len = 0;
> > +   tfile->pool = NULL;
> >
> > if (tfile->detached) {
> > /* Re-attach detached tfile, updating XDP queue_index */
> > @@ -989,6 +998,13 @@ static int tun_net_init(struct net_device *dev)
> > return err;
> > }
> >
> > +   tun->af_xdp_zc_qps = bitmap_zalloc(MAX_TAP_QUEUES, GFP_KERNEL);
> > +   if (!tun->af_xdp_zc_qps) {
> > +   security_tun_dev_free_security(tun->security);
> > +   free_percpu(dev->tstats);
> > +   return -ENOMEM;
> > +   }
> > +
> > tun_flow_init(tun);
> >
> > dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
> > @@ -1009,6 +1025,7 @@ static int tun_net_init(struct net_device *dev)
> > tun_flow_uninit(tun);
> > security_tun_dev_free_security(tun->security);
> > free_percpu(dev->tstats);
> > +   bitmap_free(tun->af_xdp_zc_qps);
> 
> Please release state in inverse order of acquire.

Will do.

> 
> > return err;
> > }
> > return 0;
> > @@ -1222,11 +1239,77 @@ static int tun_xdp_set(struct net_device *dev, struct bpf_prog *prog,
> > return 0;
> >  }
> >
> > +static int tun_xsk_pool_enable(struct net_device *netdev,
> > +  struct xsk_buff_pool *pool,
> > +  u16 qid)
> > +{
> > +   struct tun_struct *tun = netdev_priv(netdev);
> > +   struct tun_file *tfile;
> > +   unsigned long flags;
> > +
> > +   rcu_read_lock();
> > +   tfile = rtnl_dereference(tun->tfiles[qid]);
> > +   if (!tfile) {
> > +   rcu_read_unlock();
> > 

RE: [PATCH net] virtio_net: fix return value check in receive_mergeable()

2017-12-04 Thread wangyunjian


> -Original Message-
> From: Jason Wang [mailto:jasow...@redhat.com]
> Sent: Monday, December 04, 2017 3:10 PM
> To: wangyunjian <wangyunj...@huawei.com>; linux-ker...@vger.kernel.org
> Cc: m...@redhat.com; caihe <ca...@huawei.com>
> Subject: Re: [PATCH net] virtio_net: fix return value check in
> receive_mergeable()
> 
> 
> 
> On 2017-12-04 14:02, wangyunjian wrote:
> > From: Yunjian Wang <wangyunj...@huawei.com>
> >
> > The function virtqueue_get_buf_ctx() could return NULL, so the return
> > value 'buf' needs to be checked against NULL, not the value 'ctx'.
> >
> > Signed-off-by: Yunjian Wang <wangyunj...@huawei.com>
> > ---
> >   drivers/net/virtio_net.c | 2 +-
> >   1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 19a985e..559b215 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -756,7 +756,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
> > int num_skb_frags;
> >
> > buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
> > -   if (unlikely(!ctx)) {
> > +   if (unlikely(!buf)) {
> > pr_debug("%s: rx error: %d buffers out of %d missing\n",
> >  dev->name, num_buf,
> >  virtio16_to_cpu(vi->vdev,
> 
> Hi:
> 
> The patch looks good but I'm not sure this is a real fix since we add
> data and ctx simultaneously in virtqueue_add_inbuf_ctx().
> 
> Any bad result you see without this patch?
> 

No problem has been found yet; I found this while reviewing the code.

Thanks

> Thanks


[PATCH net] virtio_net: fix return value check in receive_mergeable()

2017-12-03 Thread wangyunjian
From: Yunjian Wang 

The function virtqueue_get_buf_ctx() could return NULL, so the return
value 'buf' needs to be checked against NULL, not the value 'ctx'.

Signed-off-by: Yunjian Wang 
---
 drivers/net/virtio_net.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 19a985e..559b215 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -756,7 +756,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
int num_skb_frags;
 
buf = virtqueue_get_buf_ctx(rq->vq, &len, &ctx);
-   if (unlikely(!ctx)) {
+   if (unlikely(!buf)) {
pr_debug("%s: rx error: %d buffers out of %d missing\n",
 dev->name, num_buf,
 virtio16_to_cpu(vi->vdev,
-- 
1.8.3.1




RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

2016-11-30 Thread wangyunjian
>-Original Message-
>From: Jason Wang [mailto:jasow...@redhat.com] 
>Sent: Thursday, December 01, 2016 11:37 AM
>To: Michael S. Tsirkin
>Cc: wangyunjian; net...@vger.kernel.org; linux-kernel@vger.kernel.org; caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when 
>meet errors
>
>
>
>On 2016-12-01 11:27, Michael S. Tsirkin wrote:
>> On Thu, Dec 01, 2016 at 11:26:21AM +0800, Jason Wang wrote:
>>> >
>>> >
>>> >On 2016-12-01 11:21, Michael S. Tsirkin wrote:
>>>> > >On Thu, Dec 01, 2016 at 02:48:59AM +, wangyunjian wrote:
>>>>>> > > > >-Original Message-
>>>>>> > > > >From: Michael S. Tsirkin [mailto:m...@redhat.com]
>>>>>> > > > >Sent: Wednesday, November 30, 2016 9:41 PM
>>>>>> > > > >To: wangyunjian
>>>>>> > > > >Cc:jasow...@redhat.com;net...@vger.kernel.org;linux-kernel@
>>>>>> > > > >vger.kernel.org; caihe
>>>>>> > > > >Subject: Re: [PATCH net] vhost_net: don't continue to call 
>>>>>> > > > >the recvmsg when meet errors
>>>>>> > > > >
>>>>>> > > > >On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>>>>>>> > > > > >When we meet an error(err=-EBADFD) recvmsg,
>>>>>> > > > >How do you get EBADFD? Won't vhost_net_rx_peek_head_len 
>>>>>> > > > >return 0 in this case, breaking the loop?
>>>>> > > >We started many guest VMs while attaching/detaching some virtio-net 
>>>>> > > >nics for loop.
>>>>> > > >The soft lockup might happened. The err is -EBADFD.
>>>>> > > >
>>>> > >OK, I'd like to figure out what happened here. why don't we get 0 
>>>> > >when we peek at the head?
>>>> > >
>>>> > >EBADFD is from here:
>>>> > >  struct tun_struct *tun = __tun_get(tfile); ...
>>>> > >  if (!tun)
>>>> > >  return -EBADFD;
>>>> > >
>>>> > >but then:
>>>> > >static int tun_peek_len(struct socket *sock) {
>>>> > >
>>>> > >...
>>>> > >
>>>> > >  struct tun_struct *tun; ...
>>>> > >  tun = __tun_get(tfile);
>>>> > >  if (!tun)
>>>> > >  return 0;
>>>> > >
>>>> > >
>>>> > >so peek len should return 0.
>>>> > >
>>>> > >then while will exit:
>>>> > >  while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) ...
>>>> > >
>>> >
>>> >Consider this case: user do ip link del link tap0 before recvmsg() 
>>> >but after
>>> >tun_peek_len() ?
>> Sure, this can happen, but I think we'll just exit on the next loop, 
>> won't we?
>>
>
>Right, this is the only case I can imagine for -EBADFD, let's wait for the
>author for the steps.
>

Thanks, I understand it doesn't happen in the latest kernel version.
My problem happened using kernel version 3.10.0-xx.

The peek len won't return 0 there:

static int peek_head_len(struct sock *sk)
{
struct sk_buff *head;
int len = 0;
unsigned long flags;

spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
head = skb_peek(&sk->sk_receive_queue);
if (likely(head)) {
len = head->len;
if (skb_vlan_tag_present(head))
len += VLAN_HLEN;
}

spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
return len;
}


RE: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors

2016-11-30 Thread wangyunjian

>-Original Message-
>From: Michael S. Tsirkin [mailto:m...@redhat.com] 
>Sent: Wednesday, November 30, 2016 9:41 PM
>To: wangyunjian
>Cc: jasow...@redhat.com; net...@vger.kernel.org; linux-kernel@vger.kernel.org; 
>caihe
>Subject: Re: [PATCH net] vhost_net: don't continue to call the recvmsg when 
>meet errors
>
>On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
>> When we meet an error(err=-EBADFD) recvmsg,
>
>How do you get EBADFD? Won't vhost_net_rx_peek_head_len
>return 0 in this case, breaking the loop?

We started many guest VMs while attaching/detaching some virtio-net NICs in
a loop. A soft lockup might then happen; the err is -EBADFD.

message log:
kernel:[609608.510180] BUG: soft lockup - CPU#18 stuck for 23s! [vhost-60898:126093]
call trace:
[]vhost_get_vq_desc+0x1e7/0x984 [vhost]
[]handle_rx+0x226/0x810 [vhost_net]
[]handle_rx_net+0x15/0x20 [vhost_net]
[]vhost_worker+0xfb/0x1e0 [vhost]
[]? vhost_dev_reset_owner+0x50/0x50 [vhost]
[]kthread+0xcf/0xe0
[]? kthread_create_on_node+0x140/0x140
[]ret_from_fork+0x58/0x90
[]? kthread_create_on_node+0x140/0x140
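
For context, the receive loop in handle_rx() at that time had roughly this
shape (heavily simplified sketch, shown with the modern recvmsg signature;
see drivers/vhost/net.c for the real code):

while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
	headcount = get_rx_bufs(vq, vq->heads, vhost_len, &in,
				vq_log, &log,
				likely(mergeable) ? UIO_MAXIOV : 1);
	/* ... */
	err = sock->ops->recvmsg(sock, &msg, sock_len,
				 MSG_DONTWAIT | MSG_TRUNC);
	if (unlikely(err != sock_len)) {
		vhost_discard_vq_desc(vq, headcount);
		continue;	/* a persistent error such as -EBADFD
				 * keeps the worker spinning here */
	}
	/* ... */
}
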

>> the error handling in vhost
>> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>> 
>> Signed-off-by: Yunjian Wang <wangyunj...@huawei.com>
>> ---
>>  drivers/vhost/net.c | 3 +++
>>  1 file changed, 3 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 5dc128a..edc470b 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>>  pr_debug("Discarded rx packet: "
>>   " len %d, expected %zd\n", err, sock_len);
>>  vhost_discard_vq_desc(vq, headcount);
>> +/* Don't continue to do, when meet errors. */
>> +if (err < 0)
>> +goto out;
>
>You might get e.g. EAGAIN and I think you need to retry
>in this case.
>
>>  continue;
>>  }
>>  /* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
>> -- 
>> 1.9.5.msysgit.1
>>

