Re: [GIT PULL] virtio: features, fixes, cleanups

2024-05-22 Thread Xuan Zhuo
On Wed, 22 May 2024 07:38:01 -0400, "Michael S. Tsirkin"  
wrote:
> On Wed, May 22, 2024 at 06:22:45PM +0800, Xuan Zhuo wrote:
> > On Wed, 22 May 2024 06:03:01 -0400, "Michael S. Tsirkin"  
> > wrote:
> > > Things to note here:
> > >
> > > - the new Marvell OCTEON DPU driver is not here: latest v4 keeps causing
> > >   build failures on mips. I deferred the pull hoping to get it in
> > >   and I might merge a new version post rc1
> > >   (supposed to be ok for new drivers as they can't cause regressions),
> > >   but we'll see.
> > > - there are also a couple bugfixes under review, to be merged after rc1
> > > - I merged a trivial patch (removing a comment) that also got
> > >   merged through net.
> > >   git handles this just fine and it did not seem worth it
> > >   rebasing to drop it.
> > > - there is a trivial conflict in the header file. Shouldn't be any
> > >   trouble to resolve, but fyi the resolution by Stephen is here
> > >   diff --cc drivers/virtio/virtio_mem.c
> > >   index e8355f55a8f7,6d4dfbc53a66..
> > >   --- a/drivers/virtio/virtio_mem.c
> > >   +++ b/drivers/virtio/virtio_mem.c
> > >   @@@ -21,7 -21,7 +21,8 @@@
> > > #include 
> > > #include 
> > > #include 
> > >+#include 
> > >   + #include 
> > >   Also see it here:
> > >   https://lore.kernel.org/all/20240423145947.14217...@canb.auug.org.au/
> > >
> > >
> > >
> > > The following changes since commit 
> > > 18daea77cca626f590fb140fc11e3a43c5d41354:
> > >
> > >   Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm 
> > > (2024-04-30 12:40:41 -0700)
> > >
> > > are available in the Git repository at:
> > >
> > >   https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git 
> > > tags/for_linus
> > >
> > > for you to fetch changes up to 0b8dbbdcf2e42273fbac9b752919e2e5b2abac21:
> > >
> > >   Merge tag 'for_linus' into vhost (2024-05-12 08:15:28 -0400)
> > >
> > > 
> > > virtio: features, fixes, cleanups
> > >
> > > Several new features here:
> > >
> > > - virtio-net is finally supported in vduse.
> > >
> > > - Virtio (balloon and mem) interaction with suspend is improved
> > >
> > > - vhost-scsi now handles signals better/faster.
> > >
> > > - virtio-net now supports premapped mode by default,
> > >   opening the door for all kinds of zero copy tricks.
> > >
> > > Fixes, cleanups all over the place.
> > >
> > > Signed-off-by: Michael S. Tsirkin 
> > >
> > > 
> > > Christophe JAILLET (1):
> > >   vhost-vdpa: Remove usage of the deprecated ida_simple_xx() API
> > >
> > > David Hildenbrand (1):
> > >   virtio-mem: support suspend+resume
> > >
> > > David Stevens (2):
> > >   virtio_balloon: Give the balloon its own wakeup source
> > >   virtio_balloon: Treat stats requests as wakeup events
> > >
> > > Eugenio Pérez (2):
> > >   MAINTAINERS: add Eugenio Pérez as reviewer
> > >   MAINTAINERS: add Eugenio Pérez as reviewer
> > >
> > > Jiri Pirko (1):
> > >   virtio: delete vq in vp_find_vqs_msix() when request_irq() fails
> > >
> > > Krzysztof Kozlowski (24):
> > >   virtio: balloon: drop owner assignment
> > >   virtio: input: drop owner assignment
> > >   virtio: mem: drop owner assignment
> > >   um: virt-pci: drop owner assignment
> > >   virtio_blk: drop owner assignment
> > >   bluetooth: virtio: drop owner assignment
> > >   hwrng: virtio: drop owner assignment
> > >   virtio_console: drop owner assignment
> > >   crypto: virtio - drop owner assignment
> > >   firmware: arm_scmi: virtio: drop owner assignment
> > >   gpio: virtio: drop owner assignment
> > >   drm/virtio: drop owner assignment
> > >   iommu: virtio: drop owner assignment
> > >   misc: nsm: drop owner assignment
> > >   net: caif: virtio: drop owner assignment
> > >   net: virtio: drop owner assignment
> > >   net: 9p: virtio: drop owner assignment
> > >   vsock/virtio: drop owner assignment

system wq to flush dev for TMFs
>   vhost: Remove vhost_vq_flush
>   vhost_scsi: Handle vhost_vq_work_queue failures for TMFs
>   vhost: Use virtqueue mutex for swapping worker
>   vhost: Release worker mutex during flushes
>   vhost_task: Handle SIGKILL by flushing work and exiting
>   kernel: Remove signal hacks for vhost_tasks
>
> Uwe Kleine-König (1):
>   virtio-mmio: Convert to platform remove callback returning void
>
> Xuan Zhuo (7):
>   virtio_ring: introduce dma map api for page
>   virtio_ring: enable premapped mode whatever use_dma_api
>   virtio_net: replace private by pp struct inside page
>   virtio_net: big mode support premapped
>   virtio_net: enable premapped by default
>   virtio_net: rx remove premapped failover code
>   virtio_net: remove the misleading comment

Hi Michael,

As we discussed here:


http://lore.kernel.org/all/CACGkMEuyeJ9mMgYnnB42=hw6umNuo=agn7VBqBqYPd7GN=+3...@mail.gmail.com

This patch set has been abandoned.

And you missed this one:


https://lore.kernel.org/all/20240424091533.86949-1-xuanz...@linux.alibaba.com/

Thanks.

>
> Yuxue Liu (2):
>   vp_vdpa: Fix return value check vp_vdpa_request_irq
>   vp_vdpa: don't allocate unused msix vectors
>
> Zhu Lingshan (1):
>   MAINTAINERS: apply maintainer role of Intel vDPA driver
>
>  MAINTAINERS   |  10 +-
>  arch/um/drivers/virt-pci.c|   1 -
>  drivers/block/virtio_blk.c|   1 -
>  drivers/bluetooth/virtio_bt.c |   1 -
>  drivers/char/hw_random/virtio-rng.c   |   1 -
>  drivers/char/virtio_console.c |   2 -
>  drivers/crypto/virtio/virtio_crypto_core.c|   1 -
>  drivers/firmware/arm_scmi/virtio.c|   1 -
>  drivers/gpio/gpio-virtio.c|   1 -
>  drivers/gpu/drm/virtio/virtgpu_drv.c  |   1 -
>  drivers/iommu/virtio-iommu.c  |   1 -
>  drivers/misc/nsm.c|   1 -
>  drivers/net/caif/caif_virtio.c|   1 -
>  drivers/net/virtio_net.c  | 248 
> +-
>  drivers/net/wireless/virtual/mac80211_hwsim.c |   1 -
>  drivers/nvdimm/virtio_pmem.c  |   1 -
>  drivers/rpmsg/virtio_rpmsg_bus.c  |   1 -
>  drivers/scsi/virtio_scsi.c|   1 -
>  drivers/vdpa/vdpa.c   |   2 +-
>  drivers/vdpa/vdpa_user/vduse_dev.c|  24 ++-
>  drivers/vdpa/virtio_pci/vp_vdpa.c |  27 ++-
>  drivers/vhost/scsi.c  |  70 +---
>  drivers/vhost/vdpa.c  |   6 +-
>  drivers/vhost/vhost.c | 130 ++
>  drivers/vhost/vhost.h |   3 +-
>  drivers/virtio/virtio_balloon.c   |  85 +
>  drivers/virtio/virtio_input.c |   1 -
>  drivers/virtio/virtio_mem.c   |  69 ++-
>  drivers/virtio/virtio_mmio.c  |   6 +-
>  drivers/virtio/virtio_pci_common.c|   4 +-
>  drivers/virtio/virtio_ring.c  |  59 +-
>  fs/coredump.c |   4 +-
>  fs/fuse/virtio_fs.c   |   1 -
>  include/linux/sched/vhost_task.h  |   3 +-
>  include/linux/virtio.h|   7 +
>  include/uapi/linux/virtio_mem.h   |   2 +
>  kernel/exit.c |   5 +-
>  kernel/signal.c   |   4 +-
>  kernel/vhost_task.c   |  53 --
>  net/9p/trans_virtio.c |   1 -
>  net/vmw_vsock/virtio_transport.c  |   1 -
>  sound/virtio/virtio_card.c|   1 -
>  42 files changed, 578 insertions(+), 265 deletions(-)
>



Re: [PATCH net-next] virtio_net: Fix error code in __virtnet_get_hw_stats()

2024-05-12 Thread Xuan Zhuo
On Fri, 10 May 2024 15:50:45 +0300, Dan Carpenter  
wrote:
> The virtnet_send_command_reply() function returns true on success or
> false on failure.  The "ok" variable is true/false depending on whether
> it succeeds or not.  It's up to the caller to translate the true/false
> into -EINVAL on failure or zero for success.
>
> The bug is that __virtnet_get_hw_stats() returns false for both
> errors and success.  It's not a bug, but it is confusing that the caller
> virtnet_get_hw_stats() uses an "ok" variable to store negative error
> codes.
>
> Fix the bug and clean things up so that it's clear that
> __virtnet_get_hw_stats() returns zero on success or negative error codes
> on failure.
>
> Fixes: 941168f8b40e ("virtio_net: support device stats")
> Signed-off-by: Dan Carpenter 

That confused me too.

Reviewed-by: Xuan Zhuo 

Thanks.


> ---
>  drivers/net/virtio_net.c | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 218a446c4c27..4fc0fcdad259 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4016,7 +4016,7 @@ static int __virtnet_get_hw_stats(struct virtnet_info 
> *vi,
>   &sgs_out, &sgs_in);
>
>   if (!ok)
> - return ok;
> + return -EINVAL;
>
>   for (p = reply; p - reply < res_size; p += le16_to_cpu(hdr->size)) {
>   hdr = p;
> @@ -4053,7 +4053,7 @@ static int virtnet_get_hw_stats(struct virtnet_info *vi,
>   struct virtio_net_ctrl_queue_stats *req;
>   bool enable_cvq;
>   void *reply;
> - int ok;
> + int err;
>
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_DEVICE_STATS))
>   return 0;
> @@ -4100,12 +4100,12 @@ static int virtnet_get_hw_stats(struct virtnet_info 
> *vi,
>   if (enable_cvq)
>   virtnet_make_stat_req(vi, ctx, req, vi->max_queue_pairs * 2, 
> &j);
>
> - ok = __virtnet_get_hw_stats(vi, ctx, req, sizeof(*req) * j, reply, 
> res_size);
> + err = __virtnet_get_hw_stats(vi, ctx, req, sizeof(*req) * j, reply, 
> res_size);
>
>   kfree(req);
>   kfree(reply);
>
> - return ok;
> + return err;
>  }
>
>  static void virtnet_get_strings(struct net_device *dev, u32 stringset, u8 
> *data)
>



Re: [PATCH net v4] virtio_net: Do not send RSS key if it is not supported

2024-04-07 Thread Xuan Zhuo
On Wed,  3 Apr 2024 08:43:12 -0700, Breno Leitao  wrote:
> There is a bug when setting the RSS options in virtio_net that can break
> the whole machine, getting the kernel into an infinite loop.
>
> Running the following command in any QEMU virtual machine with virtionet
> will reproduce this problem:
>
> # ethtool -X eth0  hfunc toeplitz
>
> This is how the problem happens:
>
> 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
>
> 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
>
> 3) virtnet_commit_rss_command() populates 4 entries for the rss
> scatter-gather
>
> 4) Since the command above does not have a key, then the last
> > scatter-gather entry will be zeroed, since rss_key_size == 0.
> sg_buf_size = vi->rss_key_size;
>
> > 5) This buffer is passed to QEMU, but QEMU is not happy with a buffer
> > with zero length, and does the following in virtqueue_map_desc() (a QEMU
> > function):
>
>   if (!sz) {
>   virtio_error(vdev, "virtio: zero sized buffers are not allowed");
>
> > 6) virtio_error() (also a QEMU function) sets the device as broken
>
> vdev->broken = true;
>
> > 7) QEMU bails out, and does not respond to this crazy kernel.
>
> 8) The kernel is waiting for the response to come back (function
> virtnet_send_command())
>
> > 9) The kernel waits, doing the following:
>
> >   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>!virtqueue_is_broken(vi->cvq))
> cpu_relax();
>
> > 10) None of the conditions above ever becomes true, so the kernel
> > loops here forever. Keep in mind that virtqueue_is_broken() does
> > not look at the QEMU-side `vdev->broken`, so it never realizes that the
> > virtio device is broken on the QEMU side.
>
> Fix it by not sending RSS commands if the feature is not available in
> the device.
>
> Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> Cc: sta...@vger.kernel.org
> Cc: qemu-de...@nongnu.org
> Signed-off-by: Breno Leitao 
> Reviewed-by: Heng Qi 

Reviewed-by: Xuan Zhuo 

> ---
> Changelog:
>
> V2:
>   * Moved from creating a valid packet, by rejecting the request
> completely.
> V3:
>   * Got some good feedback from and Xuan Zhuo and Heng Qi, and reworked
> the rejection path.
> V4:
>   * Added a comment in an "if" clause, as suggested by Michael S. Tsirkin.
>
> ---
>  drivers/net/virtio_net.c | 26 ++
>  1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..115c3c5414f2 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3807,6 +3807,7 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   struct netlink_ext_ack *extack)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
> + bool update = false;
>   int i;
>
>   if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
> @@ -3814,13 +3815,28 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   return -EOPNOTSUPP;
>
>   if (rxfh->indir) {
> + if (!vi->has_rss)
> + return -EOPNOTSUPP;
> +
>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> + update = true;
>   }
> - if (rxfh->key)
> +
> + if (rxfh->key) {
> + /* If either _F_HASH_REPORT or _F_RSS are negotiated, the
> +  * device provides hash calculation capabilities, that is,
> +  * hash_key is configured.
> +  */
> + if (!vi->has_rss && !vi->has_rss_hash_report)
> + return -EOPNOTSUPP;
> +
>   memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
> + update = true;
> + }
>
> - virtnet_commit_rss_command(vi);
> + if (update)
> + virtnet_commit_rss_command(vi);
>
>   return 0;
>  }
> @@ -4729,13 +4745,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>   vi->has_rss_hash_report = true;
>
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>   vi->has_rss = true;
>
> - if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
> + }
> +
> + if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
>
> --
> 2.43.0
>



Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-27 Thread Xuan Zhuo
On Wed, 27 Mar 2024 06:51:25 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Wed, Mar 27, 2024 at 09:37:43AM +0800, Xuan Zhuo wrote:
> > On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> > > Do not set virtnet_info->rss_indir_table_size if RSS is not available
> > > for the device.
> > >
> > > Currently, rss_indir_table_size is set if either has_rss or
> > > has_rss_hash_report is available, but, it should only be set if has_rss
> > > is set.
> > >
> > > On the virtnet_set_rxfh(), return an invalid command if the request has
> > > indirection table set, but virtnet does not support RSS.
> > >
> > > Suggested-by: Heng Qi 
> > > Signed-off-by: Breno Leitao 
> > > ---
> > >  drivers/net/virtio_net.c | 9 +++--
> > >  1 file changed, 7 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index c22d1118a133..c640fdf28fc5 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
> > >   rxfh->hfunc != ETH_RSS_HASH_TOP)
> > >   return -EOPNOTSUPP;
> > >
> > > + if (rxfh->indir && !vi->has_rss)
> > > + return -EINVAL;
> > > +
> > >   if (rxfh->indir) {
> >
> > Put !vi->has_rss here?
>
> I am not sure I understand the suggestion. Where do you suggest we have
> !vi->has_rss?
>
> If we got this far, we either have:
>
> a) rxfh->indir set and vi->has_rss is also set
> b) rxfh->indir not set. (vi->has_rss could be set or not).


This function does two tasks.
1. update indir table
2. update rss key

#1 only for has_rss
#2 for has_rss or has_rss_hash_report


So I would code:

bool update = false;

if (rxfh->indir) {
        if (!vi->has_rss)
                return -EINVAL;

        for (i = 0; i < vi->rss_indir_table_size; ++i)
                vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];

        update = true;
}

if (rxfh->key) {
        if (!vi->has_rss && !vi->has_rss_hash_report)
                return -EINVAL;

        memcpy(vi->ctrl->rss.key, rxfh->key, vi->rss_key_size);
        update = true;
}

if (update)
        virtnet_commit_rss_command(vi);

Thanks.


>
> Thanks for the review,
> Breno



Re: [PATCH net v2 1/2] virtio_net: Do not set rss_indir if RSS is not supported

2024-03-26 Thread Xuan Zhuo
On Tue, 26 Mar 2024 08:19:08 -0700, Breno Leitao  wrote:
> Do not set virtnet_info->rss_indir_table_size if RSS is not available
> for the device.
>
> Currently, rss_indir_table_size is set if either has_rss or
> has_rss_hash_report is available, but, it should only be set if has_rss
> is set.
>
> On the virtnet_set_rxfh(), return an invalid command if the request has
> indirection table set, but virtnet does not support RSS.
>
> Suggested-by: Heng Qi 
> Signed-off-by: Breno Leitao 
> ---
>  drivers/net/virtio_net.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c22d1118a133..c640fdf28fc5 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3813,6 +3813,9 @@ static int virtnet_set_rxfh(struct net_device *dev,
>   rxfh->hfunc != ETH_RSS_HASH_TOP)
>   return -EOPNOTSUPP;
>
> + if (rxfh->indir && !vi->has_rss)
> + return -EINVAL;
> +
>   if (rxfh->indir) {

Put !vi->has_rss here?

Thanks.


>   for (i = 0; i < vi->rss_indir_table_size; ++i)
>   vi->ctrl->rss.indirection_table[i] = rxfh->indir[i];
> @@ -4729,13 +4732,15 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (virtio_has_feature(vdev, VIRTIO_NET_F_HASH_REPORT))
>   vi->has_rss_hash_report = true;
>
> - if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS))
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_RSS)) {
>   vi->has_rss = true;
>
> - if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_indir_table_size =
>   virtio_cread16(vdev, offsetof(struct virtio_net_config,
>   rss_max_indirection_table_length));
> + }
> +
> + if (vi->has_rss || vi->has_rss_hash_report) {
>   vi->rss_key_size =
>   virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> rss_max_key_size));
>
> --
> 2.43.0
>



Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Xuan Zhuo
On Mon, 25 Mar 2024 04:26:08 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Mon, Mar 25, 2024 at 01:57:53PM +0800, Xuan Zhuo wrote:
> > On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> > > Hello Xuan,
> > >
> > > On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > > > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  
> > > > wrote:
> > >
> > > > > 4) Since the command above does not have a key, then the last
> > > > >scatter-gather entry will be zeroed, since rss_key_size == 0.
> > > > > sg_buf_size = vi->rss_key_size;
> > > >
> > > >
> > > >
> > > > if (vi->has_rss || vi->has_rss_hash_report) {
> > > > vi->rss_indir_table_size =
> > > > virtio_cread16(vdev, offsetof(struct 
> > > > virtio_net_config,
> > > > rss_max_indirection_table_length));
> > > > vi->rss_key_size =
> > > > virtio_cread8(vdev, offsetof(struct 
> > > > virtio_net_config, rss_max_key_size));
> > > >
> > > > vi->rss_hash_types_supported =
> > > > virtio_cread32(vdev, offsetof(struct 
> > > > virtio_net_config, supported_hash_types));
> > > > vi->rss_hash_types_supported &=
> > > > ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> > > >   VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> > > >   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> > > >
> > > > dev->hw_features |= NETIF_F_RXHASH;
> > > > }
> > > >
> > > >
> > > > vi->rss_key_size is initialized here, I wonder if there is something
> > > > wrong?
> > >
> > > Not really, the code above is never executed (in my machines). This is
> > > because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
> > >
> > > Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> > > VIRTIO_NET_F_HASH_REPORT features.
> > >
> > > Also, when I run `ethtool -x`, I got:
> > >
> > >   # ethtool  -x eth0
> > >   RX flow hash indirection table for eth0 with 1 RX ring(s):
> > >   Operation not supported
> > >   RSS hash key:
> > >   Operation not supported
> > >   RSS hash function:
> > >   toeplitz: on
> > >   xor: off
> > >   crc32: off
> >
> >
> > The spec says:
> > Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
> > supports only one pair of virtqueues, it MUST support at least one of
> > commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
> > parameters:
> >
> > If the device offers VIRTIO_NET_F_RSS, it MUST support
> > VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.
> >
> > Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
> > per 5.1.6.5.6.4.
> >
> >
> > So if we have neither `vi->has_rss` nor `vi->has_rss_hash_report`,
> > we should return from virtnet_set_rxfh() directly.
>
> Makes sense. Although it is not clear to me how vi->has_rss_hash_report
> is related here, I am convinced that we shouldn't do any RSS
> operation if the device doesn't have the RSS feature, i.e., vi->has_rss
> is false.
>
> That said, I am thinking about something like this. How does it sound?
>
>   diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>   index 5a7700b103f8..8c1ad7361cf2 100644
>   --- a/drivers/net/virtio_net.c
>   +++ b/drivers/net/virtio_net.c
>   @@ -3780,6 +3780,9 @@ static int virtnet_set_rxfh(struct net_device 
> *dev,
>   struct virtnet_info *vi = netdev_priv(dev);
>   int i;
>
>   +   if (!vi->has_rss)
>   +   return -EOPNOTSUPP;
>   +

Should we check has_rss_hash_report?

@Heng Qi

Could you help us?

Thanks.


>   if (rxfh->hfunc != ETH_RSS_HASH_NO_CHANGE &&
>   rxfh->hfunc != ETH_RSS_HASH_TOP)
>   return -EOPNOTSUPP;
>
> Thanks!



Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-25 Thread Xuan Zhuo
On Fri, 22 Mar 2024 03:21:21 -0700, Breno Leitao  wrote:
> Hello Xuan,
>
> On Fri, Mar 22, 2024 at 10:00:22AM +0800, Xuan Zhuo wrote:
> > On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  wrote:
>
> > > 4) Since the command above does not have a key, then the last
> > >scatter-gather entry will be zeroed, since rss_key_size == 0.
> > > sg_buf_size = vi->rss_key_size;
> >
> >
> >
> > if (vi->has_rss || vi->has_rss_hash_report) {
> > vi->rss_indir_table_size =
> > virtio_cread16(vdev, offsetof(struct virtio_net_config,
> > rss_max_indirection_table_length));
> > vi->rss_key_size =
> > virtio_cread8(vdev, offsetof(struct virtio_net_config, 
> > rss_max_key_size));
> >
> > vi->rss_hash_types_supported =
> > virtio_cread32(vdev, offsetof(struct virtio_net_config, 
> > supported_hash_types));
> > vi->rss_hash_types_supported &=
> > ~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
> >   VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
> >   VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);
> >
> > dev->hw_features |= NETIF_F_RXHASH;
> > }
> >
> >
> > vi->rss_key_size is initialized here, I wonder if there is something wrong?
>
> Not really, the code above is never executed (in my machines). This is
> because `vi->has_rss` and `vi->has_rss_hash_report` are both unset.
>
> Looking further, vdev does not have the VIRTIO_NET_F_RSS and
> VIRTIO_NET_F_HASH_REPORT features.
>
> Also, when I run `ethtool -x`, I got:
>
>   # ethtool  -x eth0
>   RX flow hash indirection table for eth0 with 1 RX ring(s):
>   Operation not supported
>   RSS hash key:
>   Operation not supported
>   RSS hash function:
>   toeplitz: on
>   xor: off
>   crc32: off


The spec says:
Note that if the device offers VIRTIO_NET_F_HASH_REPORT, even if it
supports only one pair of virtqueues, it MUST support at least one of
commands of VIRTIO_NET_CTRL_MQ class to configure reported hash
parameters:

If the device offers VIRTIO_NET_F_RSS, it MUST support
VIRTIO_NET_CTRL_MQ_RSS_CONFIG command per 5.1.6.5.7.1.

Otherwise the device MUST support VIRTIO_NET_CTRL_MQ_HASH_CONFIG command
per 5.1.6.5.6.4.


So if we have neither `vi->has_rss` nor `vi->has_rss_hash_report`,
we should return from virtnet_set_rxfh() directly.

Thanks.



Re: [PATCH] virtio_net: Do not send RSS key if it is not supported

2024-03-21 Thread Xuan Zhuo
On Thu, 21 Mar 2024 09:54:30 -0700, Breno Leitao  wrote:
> There is a bug when setting the RSS options in virtio_net that can break
> the whole machine, getting the kernel into an infinite loop.
>
> Running the following command in any QEMU virtual machine with virtionet
> will reproduce this problem:
>
>   # ethtool -X eth0  hfunc toeplitz
>
> This is how the problem happens:
>
> 1) ethtool_set_rxfh() calls virtnet_set_rxfh()
>
> 2) virtnet_set_rxfh() calls virtnet_commit_rss_command()
>
> 3) virtnet_commit_rss_command() populates 4 entries for the rss
>scatter-gather
>
> 4) Since the command above does not have a key, then the last
> >scatter-gather entry will be zeroed, since rss_key_size == 0.
> sg_buf_size = vi->rss_key_size;



if (vi->has_rss || vi->has_rss_hash_report) {
vi->rss_indir_table_size =
virtio_cread16(vdev, offsetof(struct virtio_net_config,
rss_max_indirection_table_length));
vi->rss_key_size =
virtio_cread8(vdev, offsetof(struct virtio_net_config, 
rss_max_key_size));

vi->rss_hash_types_supported =
virtio_cread32(vdev, offsetof(struct virtio_net_config, 
supported_hash_types));
vi->rss_hash_types_supported &=
~(VIRTIO_NET_RSS_HASH_TYPE_IP_EX |
  VIRTIO_NET_RSS_HASH_TYPE_TCP_EX |
  VIRTIO_NET_RSS_HASH_TYPE_UDP_EX);

dev->hw_features |= NETIF_F_RXHASH;
}


vi->rss_key_size is initialized here, I wonder if there is something wrong?

Thanks.


>
> > 5) This buffer is passed to QEMU, but QEMU is not happy with a buffer
> >    with zero length, and does the following in virtqueue_map_desc() (a QEMU
> >    function):
>
>   if (!sz) {
>   virtio_error(vdev, "virtio: zero sized buffers are not allowed");
>
> > 6) virtio_error() (also a QEMU function) sets the device as broken
>
>   vdev->broken = true;
>
> > 7) QEMU bails out, and does not respond to this crazy kernel.
>
> 8) The kernel is waiting for the response to come back (function
>virtnet_send_command())
>
> > 9) The kernel waits, doing the following:
>
> >   while (!virtqueue_get_buf(vi->cvq, &tmp) &&
>  !virtqueue_is_broken(vi->cvq))
>   cpu_relax();
>
> > 10) None of the conditions above ever becomes true, so the kernel
> > loops here forever. Keep in mind that virtqueue_is_broken() does
> > not look at the QEMU-side `vdev->broken`, so it never realizes that the
> > virtio device is broken on the QEMU side.
>
> > Fix it by not sending the RSS key scatter-gather entry if the key is not set.
>
> Fixes: c7114b1249fa ("drivers/net/virtio_net: Added basic RSS support.")
> Signed-off-by: Breno Leitao 
> Cc: sta...@vger.kernel.org
> Cc: qemu-de...@nongnu.org
> ---
>  drivers/net/virtio_net.c | 16 +---
>  1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d7ce4a1011ea..5a7700b103f8 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3041,11 +3041,16 @@ static int virtnet_set_ringparam(struct net_device 
> *dev,
>  static bool virtnet_commit_rss_command(struct virtnet_info *vi)
>  {
>   struct net_device *dev = vi->dev;
> + int has_key = vi->rss_key_size;
>   struct scatterlist sgs[4];
>   unsigned int sg_buf_size;
> + int nents = 3;
> +
> + if (has_key)
> + nents += 1;
>
>   /* prepare sgs */
> - sg_init_table(sgs, 4);
> + sg_init_table(sgs, nents);
>
>   sg_buf_size = offsetof(struct virtio_net_ctrl_rss, indirection_table);
> >   sg_set_buf(&sgs[0], &vi->ctrl->rss, sg_buf_size);
> @@ -3057,8 +3062,13 @@ static bool virtnet_commit_rss_command(struct 
> virtnet_info *vi)
>   - offsetof(struct virtio_net_ctrl_rss, max_tx_vq);
> >   sg_set_buf(&sgs[2], &vi->ctrl->rss.max_tx_vq, sg_buf_size);
>
> - sg_buf_size = vi->rss_key_size;
> > - sg_set_buf(&sgs[3], vi->ctrl->rss.key, sg_buf_size);
> + if (has_key) {
> + /* Only populate if key is available, otherwise
> +  * populating a buffer with zero size breaks virtio
> +  */
> + sg_buf_size = vi->rss_key_size;
> > + sg_set_buf(&sgs[3], vi->ctrl->rss.key, sg_buf_size);
> + }
>
>   if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_MQ,
> vi->has_rss ? VIRTIO_NET_CTRL_MQ_RSS_CONFIG
> --
> 2.43.0
>



Re: [PATCH net-next v5] virtio_net: Support RX hash XDP hint

2024-02-22 Thread Xuan Zhuo
On Fri, 09 Feb 2024 13:57:25 +0100, Paolo Abeni  wrote:
> On Fri, 2024-02-09 at 18:39 +0800, Liang Chen wrote:
> > On Wed, Feb 7, 2024 at 10:27 PM Paolo Abeni  wrote:
> > >
> > > On Wed, 2024-02-07 at 10:54 +0800, Liang Chen wrote:
> > > > On Tue, Feb 6, 2024 at 6:44 PM Paolo Abeni  wrote:
> > > > >
> > > > > On Sat, 2024-02-03 at 10:56 +0800, Liang Chen wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:20 AM Jesper Dangaard Brouer 
> > > > > >  wrote:
> > > > > > > On 02/02/2024 13.11, Liang Chen wrote:
> > > > > [...]
> > > > > > > > @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct 
> > > > > > > > xdp_buff *xdp)
> > > > > > > >   }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff 
> > > > > > > > *virtnet_xdp,
> > > > > > > > +  struct net_device *dev,
> > > > > > > > +  struct 
> > > > > > > > virtio_net_hdr_v1_hash *hdr_hash)
> > > > > > > > +{
> > > > > > > > + if (dev->features & NETIF_F_RXHASH) {
> > > > > > > > + virtnet_xdp->hash_value = hdr_hash->hash_value;
> > > > > > > > + virtnet_xdp->hash_report = hdr_hash->hash_report;
> > > > > > > > + }
> > > > > > > > +}
> > > > > > > > +
> > > > > > >
> > > > > > > Would it be possible to store a pointer to hdr_hash in 
> > > > > > > virtnet_xdp_buff,
> > > > > > > with the purpose of delaying extracting this, until and only if 
> > > > > > > XDP
> > > > > > > bpf_prog calls the kfunc?
> > > > > > >
> > > > > >
> > > > > > That seems to be the way v1 works,
> > > > > > https://lore.kernel.org/all/20240122102256.261374-1-liangchen.li...@gmail.com/
> > > > > > . But it was pointed out that the inline header may be overwritten 
> > > > > > by
> > > > > > the xdp prog, so the hash is copied out to maintain its integrity.
> > > > >
> > > > > Why? isn't XDP supposed to get write access only to the pkt
> > > > > contents/buffer?
> > > > >
> > > >
> > > > Normally, an XDP program accesses only the packet data. However,
> > > > there's also an XDP RX Metadata area, referenced by the data_meta
> > > > pointer. This pointer can be adjusted with bpf_xdp_adjust_meta to
> > > > point somewhere ahead of the data buffer, thereby granting the XDP
> > > > program access to the virtio header located immediately before the
> > >
> > > AFAICS bpf_xdp_adjust_meta() does not allow moving the meta_data before
> > > xdp->data_hard_start:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/net/core/filter.c#L4210
> > >
> > > and virtio net set such field after the virtio_net_hdr:
> > >
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1218
> > > https://elixir.bootlin.com/linux/latest/source/drivers/net/virtio_net.c#L1420
> > >
> > > I don't see how the virtio hdr could be touched? Possibly even more
> > > important: if such thing is possible, I think is should be somewhat
> > > denied (for the same reason an H/W nic should prevent XDP from
> > > modifying its own buffer descriptor).
> >
> > Thank you for highlighting this concern. The header layout differs
> > slightly between small and mergeable mode. Taking 'mergeable mode' as
> > an example, after calling xdp_prepare_buff the layout of xdp_buff
> > would be as depicted in the diagram below,
> >
> > buf
> >  |
> >  v
> > +--------------+--------------+---------+
> > | xdp headroom | virtio header| packet  |
> > | (256 bytes)  | (20 bytes)   | content |
> > +--------------+--------------+---------+
> > ^                             ^
> > |                             |
> > data_hard_start               data
> >                               data_meta
> >
> > If 'bpf_xdp_adjust_meta' repositions the 'data_meta' pointer a little
> > towards 'data_hard_start', it would point to the inline header, thus
> > potentially allowing the XDP program to access the inline header.
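As a rough illustration of that concern, here is a minimal BPF-C sketch (assuming
the mergeable-mode layout in the diagram above; the program name and the 12-byte
delta are illustrative only) that moves data_meta back into the inline virtio
header region:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int poke_vnet_hdr(struct xdp_md *ctx)
{
	/* Grow the metadata area by 12 bytes towards data_hard_start.
	 * With ~276 bytes in front of data this is accepted by the kernel,
	 * and the new meta area overlaps the tail of the 20-byte inline
	 * virtio header.
	 */
	if (bpf_xdp_adjust_meta(ctx, -12))
		return XDP_PASS;

	void *meta = (void *)(long)ctx->data_meta;
	void *data = (void *)(long)ctx->data;

	if (meta + 12 > data)	/* bounds check for the verifier */
		return XDP_PASS;

	/* Any write here clobbers the inline virtio header, which is why
	 * the driver has to copy the header out before running the program.
	 */
	__builtin_memset(meta, 0, 12);
	return XDP_PASS;
}

Since the adjusted data_meta stays above data_hard_start, the kernel allows the
move, so the inline header really is reachable (and writable) from the program.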
>
> I see. That layout was completely unexpected to me.
>
> AFAICS the virtio_net driver tries to avoid accessing/using the
> virtio_net_hdr after the XDP program execution, so nothing tragic
> should happen.
>
> @Michael, @Jason, I guess the above is like that by design? Isn't it a
> bit fragile?

Yes. We handle it carefully. That brings some trouble; we hope to put the
virtio-net header into the vring descriptor like other NICs do. But that is a big project.

I think this patch is OK; it can be merged to net-next first.

Thanks.
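As an aside, once the driver wires up xmo_rx_hash, an XDP program consumes the
hint through the generic XDP RX-metadata kfunc bpf_xdp_metadata_rx_hash(). A
minimal consumer sketch (program and section names are illustrative; vmlinux.h
is the BTF header generated with bpftool):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Kfunc provided by the kernel's XDP RX-metadata support. */
extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
				    enum xdp_rss_hash_type *rss_type) __ksym;

SEC("xdp")
int read_rx_hash(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type rss_type;
	__u32 hash;

	/* Returns 0 on success, a negative errno (e.g. -ENODATA) otherwise. */
	if (!bpf_xdp_metadata_rx_hash(ctx, &hash, &rss_type))
		bpf_printk("rx hash 0x%x type %d", hash, rss_type);

	return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";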


>
> Thanks!
>
> Paolo
>



Re: RE: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-21 Thread Xuan Zhuo
On Wed, 21 Feb 2024 11:37:22 +, wangyunjian  wrote:
> > -Original Message-
> > From: Xuan Zhuo [mailto:xuanz...@linux.alibaba.com]
> > Sent: Wednesday, February 21, 2024 5:53 PM
> > To: wangyunjian 
> > Cc: net...@vger.kernel.org; linux-kernel@vger.kernel.org;
> > k...@vger.kernel.org; virtualizat...@lists.linux.dev; xudingke
> > ; wangyunjian ;
> > m...@redhat.com; willemdebruijn.ker...@gmail.com; jasow...@redhat.com;
> > k...@kernel.org; da...@davemloft.net; magnus.karls...@intel.com
> > Subject: Re: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in
> > xp_assign_dev
> >
> > On Wed, 24 Jan 2024 17:37:38 +0800, Yunjian Wang
> >  wrote:
> > > Now DMA mappings are used by the physical NICs. However, the vNIC may
> > > not need them. So remove the non-zero 'dma_page' check in
> > > xp_assign_dev.
> >
> > Could you tell me which NIC can work with AF_XDP without DMA?
>
> TUN will support AF_XDP Tx zero-copy, which does not require DMA mappings.


Great. Though I do not know how it works, I think a new option or feature
would be better.

Thanks.


>
> Thanks
>
> >
> > Thanks.
> >
> >
> > >
> > > Signed-off-by: Yunjian Wang 
> > > ---
> > >  net/xdp/xsk_buff_pool.c | 7 ---
> > >  1 file changed, 7 deletions(-)
> > >
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c index
> > > 28711cc44ced..939b6e7b59ff 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
> > >   if (err)
> > >   goto err_unreg_pool;
> > >
> > > - if (!pool->dma_pages) {
> > > - WARN(1, "Driver did not DMA map zero-copy buffers");
> > > - err = -EINVAL;
> > > - goto err_unreg_xsk;
> > > - }
> > >   pool->umem->zc = true;
> > >   return 0;
> > >
> > > -err_unreg_xsk:
> > > - xp_disable_drv_zc(pool);
> > >  err_unreg_pool:
> > >   if (!force_zc)
> > >   err = 0; /* fallback to copy mode */
> > > --
> > > 2.33.0
> > >
> > >



Re: [PATCH net-next 1/2] xsk: Remove non-zero 'dma_page' check in xp_assign_dev

2024-02-21 Thread Xuan Zhuo
On Wed, 24 Jan 2024 17:37:38 +0800, Yunjian Wang  wrote:
> Now DMA mappings are used by the physical NICs. However, the vNIC
> may not need them. So remove the non-zero 'dma_page' check in
> xp_assign_dev.

Could you tell me which NIC can work with AF_XDP without DMA?

Thanks.


>
> Signed-off-by: Yunjian Wang 
> ---
>  net/xdp/xsk_buff_pool.c | 7 ---
>  1 file changed, 7 deletions(-)
>
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 28711cc44ced..939b6e7b59ff 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -219,16 +219,9 @@ int xp_assign_dev(struct xsk_buff_pool *pool,
>   if (err)
>   goto err_unreg_pool;
>
> - if (!pool->dma_pages) {
> - WARN(1, "Driver did not DMA map zero-copy buffers");
> - err = -EINVAL;
> - goto err_unreg_xsk;
> - }
>   pool->umem->zc = true;
>   return 0;
>
> -err_unreg_xsk:
> - xp_disable_drv_zc(pool);
>  err_unreg_pool:
>   if (!force_zc)
>   err = 0; /* fallback to copy mode */
> --
> 2.33.0
>
>



Re: [PATCH v4] virtio_net: Support RX hash XDP hint

2024-02-02 Thread Xuan Zhuo
On Fri, 2 Feb 2024 17:25:02 +0800, Liang Chen  wrote:
> On Thu, Feb 1, 2024 at 1:37 PM Jason Wang  wrote:
> >
> > On Wed, Jan 31, 2024 at 11:55 AM Liang Chen  
> > wrote:
> > >
> > > The RSS hash report is a feature that's part of the virtio specification.
> > > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> > > (still a work in progress as per [1]) support this feature. While the
> > > capability to obtain the RSS hash has been enabled in the normal path,
> > > it's currently missing in the XDP path. Therefore, we are introducing
> > > XDP hints through kfuncs to allow XDP programs to access the RSS hash.
> > >
> > > 1.
> > > https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
> > >
> > > Signed-off-by: Liang Chen 
> > > Reviewed-by: Xuan Zhuo 
> >
> > Acked-by: Jason Wang 
> >
> > Thanks
>
> I've just realized that I forgot to include net...@vger.kernel.org in
> the cc list. Would it be advisable for me to resend the patch with
> net...@vger.kernel.org in the cc list to ensure it receives the
> necessary attention for potential merging? Thanks!

Did you use ./scripts/get_maintainer.pl?

Please resend it.

Thanks.


>
> Thanks,
> Liang
> >



Re: [PATCH v3] virtio_net: Support RX hash XDP hint

2024-01-30 Thread Xuan Zhuo
On Thu, 25 Jan 2024 18:19:12 +0800, Liang Chen  
wrote:
> The RSS hash report is a feature that's part of the virtio specification.
> Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> (still a work in progress as per [1]) support this feature. While the
> capability to obtain the RSS hash has been enabled in the normal path,
> it's currently missing in the XDP path. Therefore, we are introducing
> XDP hints through kfuncs to allow XDP programs to access the RSS hash.
>
> 1.
> https://lore.kernel.org/all/20231015141644.260646-1-akihiko.od...@daynix.com/#r
>
> Signed-off-by: Liang Chen 
> ---
>  drivers/net/virtio_net.c | 98 +++-
>  1 file changed, 86 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d7ce4a1011ea..0c845f2223da 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -349,6 +349,12 @@ struct virtio_net_common_hdr {
>   };
>  };
>
> +struct virtnet_xdp_buff {
> + struct xdp_buff xdp;
> + u32 hash_value;
> + u16 hash_report;
> +};
> +
>  static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
>
>  static bool is_xdp_frame(void *ptr)
> @@ -1033,6 +1039,16 @@ static void put_xdp_frags(struct xdp_buff *xdp)
>   }
>  }
>
> +static void virtnet_xdp_save_rx_hash(struct virtnet_xdp_buff *virtnet_xdp,
> +  struct net_device *dev,
> +  struct virtio_net_hdr_v1_hash *hdr_hash)
> +{
> + if (dev->features & NETIF_F_RXHASH) {
> + virtnet_xdp->hash_value = __le32_to_cpu(hdr_hash->hash_value);
> + virtnet_xdp->hash_report = __le16_to_cpu(hdr_hash->hash_report);

Could we move the __leXX_to_cpu conversions into virtnet_xdp_rx_hash()?

Otherwise this looks good to me.

Reviewed-by: Xuan Zhuo 

Thanks.


> + }
> +}
> +
>  static int virtnet_xdp_handler(struct bpf_prog *xdp_prog, struct xdp_buff 
> *xdp,
>  struct net_device *dev,
>  unsigned int *xdp_xmit,
> @@ -1199,9 +1215,10 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   unsigned int headroom = vi->hdr_len + header_offset;
>   struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
>   struct page *page = virt_to_head_page(buf);
> + struct virtnet_xdp_buff virtnet_xdp;
>   struct page *xdp_page;
> + struct xdp_buff *xdp;
>   unsigned int buflen;
> - struct xdp_buff xdp;
>   struct sk_buff *skb;
>   unsigned int metasize = 0;
>   u32 act;
> @@ -1233,17 +1250,20 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   page = xdp_page;
>   }
>
> > - xdp_init_buff(&xdp, buflen, &rq->xdp_rxq);
> > - xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len,
> > + xdp = &virtnet_xdp.xdp;
> > + xdp_init_buff(xdp, buflen, &rq->xdp_rxq);
> > + xdp_prepare_buff(xdp, buf + VIRTNET_RX_PAD + vi->hdr_len,
> >xdp_headroom, len, true);
> >
> > - act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats);
> > + virtnet_xdp_save_rx_hash(&virtnet_xdp, dev, (void *)hdr);
> +
> + act = virtnet_xdp_handler(xdp_prog, xdp, dev, xdp_xmit, stats);
>
>   switch (act) {
>   case XDP_PASS:
>   /* Recalculate length in case bpf program changed it */
> - len = xdp.data_end - xdp.data;
> - metasize = xdp.data - xdp.data_meta;
> + len = xdp->data_end - xdp->data;
> + metasize = xdp->data - xdp->data_meta;
>   break;
>
>   case XDP_TX:
> @@ -1254,7 +1274,7 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   goto err_xdp;
>   }
>
> - skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len);
> + skb = virtnet_build_skb(buf, buflen, xdp->data - buf, len);
>   if (unlikely(!skb))
>   goto err;
>
> @@ -1591,10 +1611,11 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   int num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>   struct page *page = virt_to_head_page(buf);
>   int offset = buf - page_address(page);
> + struct virtnet_xdp_buff virtnet_xdp;
>   unsigned int xdp_frags_truesz = 0;
>   struct sk_buff *head_skb;
>   unsigned int frame_sz;
> - struct xdp_buff xdp;
> + struct xdp_buff *xdp;
>   void *data;
>   u32 act;
>   int err;
> @@ -1604,16 +1625,19 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   if (unlikely(!da

Re: [PATCH v2 2/3] virtio_net: Add missing virtio header in skb for XDP_PASS

2024-01-24 Thread Xuan Zhuo
On Thu, 25 Jan 2024 11:48:18 +0800, Jason Wang  wrote:
> On Wed, Jan 24, 2024 at 5:16 PM Xuan Zhuo  wrote:
> >
> > On Wed, 24 Jan 2024 16:57:20 +0800, Liang Chen  
> > wrote:
> > > For the XDP_PASS scenario of the XDP path, the skb constructed with
> > > xdp_buff does not include the virtio header. Adding the virtio header
> > > information back when creating the skb.
> > >
> > > Signed-off-by: Liang Chen 
> > > ---
> > >  drivers/net/virtio_net.c | 6 ++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index b56828804e5f..2de46eb4c661 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -1270,6 +1270,9 @@ static struct sk_buff *receive_small_xdp(struct 
> > > net_device *dev,
> > >   if (unlikely(!skb))
> > >   goto err;
> > >
> > > + /* Store the original virtio header for subsequent use by the 
> > > driver. */
> > > + memcpy(skb_vnet_common_hdr(skb), &virtnet_xdp.hdr, vi->hdr_len);
> >
> > About this, a spec is waiting for voting.
> >
>
> A pointer?

Sorry.

http://lore.kernel.org/all/1704263702-50528-1-git-send-email-hen...@linux.alibaba.com


>
> > This may change the logic of the csum offset and so on.
>
>  Btw, doesn't it need a new feature bit?

Yes.

But this patch set should not include that.
We may do a similar thing later, but the commit log will
include more info about the spec.

Thanks.


>
> Thanks
>
> >
> > Please do not do this.
> >
> > Thanks.
> >
> >
> > > +
> > >   if (metasize)
> > >   skb_metadata_set(skb, metasize);
> > >
> > > @@ -1635,6 +1638,9 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> > > net_device *dev,
> > >   head_skb = build_skb_from_xdp_buff(dev, vi, xdp, 
> > > xdp_frags_truesz);
> > >   if (unlikely(!head_skb))
> > >   break;
> > > + /* Store the original virtio header for subsequent use by 
> > > the driver. */
> > > + memcpy(skb_vnet_common_hdr(head_skb), &virtnet_xdp.hdr,
> > > vi->hdr_len);
> > > +
> > >   return head_skb;
> > >
> > >   case XDP_TX:
> > > --
> > > 2.40.1
> > >
> >
>



Re: [PATCH v2 2/3] virtio_net: Add missing virtio header in skb for XDP_PASS

2024-01-24 Thread Xuan Zhuo
On Wed, 24 Jan 2024 16:57:20 +0800, Liang Chen  
wrote:
> For the XDP_PASS scenario of the XDP path, the skb constructed with
> xdp_buff does not include the virtio header. Adding the virtio header
> information back when creating the skb.
>
> Signed-off-by: Liang Chen 
> ---
>  drivers/net/virtio_net.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b56828804e5f..2de46eb4c661 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -1270,6 +1270,9 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   if (unlikely(!skb))
>   goto err;
>
> + /* Store the original virtio header for subsequent use by the driver. */
> + memcpy(skb_vnet_common_hdr(skb), &virtnet_xdp.hdr, vi->hdr_len);

About this, a spec is waiting for voting.

This may change the logic of the csum offset and so on.

Please do not do this.

Thanks.


> +
>   if (metasize)
>   skb_metadata_set(skb, metasize);
>
> @@ -1635,6 +1638,9 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   head_skb = build_skb_from_xdp_buff(dev, vi, xdp, 
> xdp_frags_truesz);
>   if (unlikely(!head_skb))
>   break;
> + /* Store the original virtio header for subsequent use by the 
> driver. */
> + memcpy(skb_vnet_common_hdr(head_skb), &virtnet_xdp.hdr,
> vi->hdr_len);
> +
>   return head_skb;
>
>   case XDP_TX:
> --
> 2.40.1
>



Re: [PATCH v2 1/3] virtio_net: Preserve virtio header before XDP program execution

2024-01-24 Thread Xuan Zhuo
On Wed, 24 Jan 2024 16:57:19 +0800, Liang Chen  
wrote:
> The xdp program may overwrite the inline virtio header. To ensure the
> integrity of the virtio header, it is saved in a data structure that
> wraps both the xdp_buff and the header before running the xdp program.
>
> Signed-off-by: Liang Chen 
> ---
>  drivers/net/virtio_net.c | 43 +---
>  1 file changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index d7ce4a1011ea..b56828804e5f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -349,6 +349,11 @@ struct virtio_net_common_hdr {
>   };
>  };
>
> +struct virtnet_xdp_buff {
> + struct xdp_buff xdp;
> + struct virtio_net_common_hdr hdr;

Not all fields of the hdr are useful; we can save just the useful ones.

> +};
> +
>  static void virtnet_sq_free_unused_buf(struct virtqueue *vq, void *buf);
>
>  static bool is_xdp_frame(void *ptr)
> @@ -1199,9 +1204,10 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   unsigned int headroom = vi->hdr_len + header_offset;
>   struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
>   struct page *page = virt_to_head_page(buf);
> + struct virtnet_xdp_buff virtnet_xdp;
>   struct page *xdp_page;
> + struct xdp_buff *xdp;
>   unsigned int buflen;
> - struct xdp_buff xdp;
>   struct sk_buff *skb;
>   unsigned int metasize = 0;
>   u32 act;
> @@ -1233,17 +1239,23 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   page = xdp_page;
>   }
>
> - xdp_init_buff(&xdp, buflen, &rq->xdp_rxq);
> - xdp_prepare_buff(&xdp, buf + VIRTNET_RX_PAD + vi->hdr_len,
> + xdp = &virtnet_xdp.xdp;
> + xdp_init_buff(xdp, buflen, &rq->xdp_rxq);
> + xdp_prepare_buff(xdp, buf + VIRTNET_RX_PAD + vi->hdr_len,
>xdp_headroom, len, true);
>
> - act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats);
> + /* Copy out the virtio header, as it may be overwritten by the
> +  * xdp program.
> +  */
> + memcpy(&virtnet_xdp.hdr, hdr, vi->hdr_len);

Can we put this into virtnet_xdp_handler()?

And only do that when the hash feature is negotiated.

> +
> + act = virtnet_xdp_handler(xdp_prog, xdp, dev, xdp_xmit, stats);
>
>   switch (act) {
>   case XDP_PASS:
>   /* Recalculate length in case bpf program changed it */
> - len = xdp.data_end - xdp.data;
> - metasize = xdp.data - xdp.data_meta;
> + len = xdp->data_end - xdp->data;
> + metasize = xdp->data - xdp->data_meta;
>   break;
>
>   case XDP_TX:
> @@ -1254,7 +1266,7 @@ static struct sk_buff *receive_small_xdp(struct 
> net_device *dev,
>   goto err_xdp;
>   }
>
> - skb = virtnet_build_skb(buf, buflen, xdp.data - buf, len);
> + skb = virtnet_build_skb(buf, buflen, xdp->data - buf, len);
>   if (unlikely(!skb))
>   goto err;
>
> @@ -1591,10 +1603,11 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   int num_buf = virtio16_to_cpu(vi->vdev, hdr->num_buffers);
>   struct page *page = virt_to_head_page(buf);
>   int offset = buf - page_address(page);
> + struct virtnet_xdp_buff virtnet_xdp;
>   unsigned int xdp_frags_truesz = 0;
>   struct sk_buff *head_skb;
>   unsigned int frame_sz;
> - struct xdp_buff xdp;
> + struct xdp_buff *xdp;
>   void *data;
>   u32 act;
>   int err;
> @@ -1604,16 +1617,22 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   if (unlikely(!data))
>   goto err_xdp;
>
> - err = virtnet_build_xdp_buff_mrg(dev, vi, rq, &xdp, data, len, frame_sz,
> + xdp = &virtnet_xdp.xdp;
> + err = virtnet_build_xdp_buff_mrg(dev, vi, rq, xdp, data, len, frame_sz,
>&num_buf, &xdp_frags_truesz, stats);
>   if (unlikely(err))
>   goto err_xdp;
>
> - act = virtnet_xdp_handler(xdp_prog, &xdp, dev, xdp_xmit, stats);
> + /* Copy out the virtio header, as it may be overwritten by the
> +  * xdp program.
> +  */
> + memcpy(&virtnet_xdp.hdr, hdr, vi->hdr_len);
> +
> + act = virtnet_xdp_handler(xdp_prog, xdp, dev, xdp_xmit, stats);
>
>   switch (act) {
>   case XDP_PASS:
> - head_skb = build_skb_from_xdp_buff(dev, vi, &xdp,
> xdp_frags_truesz);
> + head_skb = build_skb_from_xdp_buff(dev, vi, xdp, 
> xdp_frags_truesz);
>   if (unlikely(!head_skb))
>   break;
>   return head_skb;
> @@ -1626,7 +1645,7 @@ static struct sk_buff *receive_mergeable_xdp(struct 
> net_device *dev,
>   break;
>   }
>
> - put_xdp_frags(&xdp);
> + put_xdp_frags(xdp);
>
>  err_xdp:
>   put_page(page);
> --
> 2.40.1
>



Re: [PATCH] virtio_net: Support RX hash XDP hint

2024-01-23 Thread Xuan Zhuo
On Wed, 24 Jan 2024 10:04:51 +0800, Liang Chen  
wrote:
> On Mon, Jan 22, 2024 at 7:10 PM Heng Qi  wrote:
> >
> > Hi Liang Chen,
> >
> > On 2024/1/22 at 6:22 PM, Liang Chen wrote:
> > > The RSS hash report is a feature that's part of the virtio specification.
> > > Currently, virtio backends like qemu, vdpa (mlx5), and potentially vhost
> > > (still a work in progress as per [1]) support this feature. While the
> > > capability to obtain the RSS hash has been enabled in the normal path,
> > > it's currently missing in the XDP path. Therefore, we are introducing XDP
> > > hints through kfuncs to allow XDP programs to access the RSS hash.
> > >
> > > Signed-off-by: Liang Chen 
> > > ---
> > >   drivers/net/virtio_net.c | 56 
> > >   1 file changed, 56 insertions(+)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index d7ce4a1011ea..1463a4709e3c 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -4579,6 +4579,60 @@ static void virtnet_set_big_packets(struct 
> > > virtnet_info *vi, const int mtu)
> > >   }
> > >   }
> > >
> > > +static int virtnet_xdp_rx_hash(const struct xdp_md *_ctx, u32 *hash,
> > > +enum xdp_rss_hash_type *rss_type)
> > > +{
> > > + const struct xdp_buff *xdp = (void *)_ctx;
> > > + struct virtio_net_hdr_v1_hash *hdr_hash;
> > > + struct virtnet_info *vi;
> > > +
> > > + if (!(xdp->rxq->dev->features & NETIF_F_RXHASH))
> >
> > I think 'vi->has_rss_hash_report' should be used here.
> > NETIF_F_RXHASH cannot guarantee that the hash report feature is negotiated,
> > and accessing hash_report and hash_value is unsafe at this time.
> >
>
> My understanding is that rxhash in dev->features is turned on only if
> the corresponding hw feature is set. We will double check on this.
>
> > > + return -ENODATA;
> > > +
> > > + vi = netdev_priv(xdp->rxq->dev);
> > > + hdr_hash = (struct virtio_net_hdr_v1_hash *)(xdp->data - 
> > > vi->hdr_len);
> >
> > If the virtio-net-hdr is overridden by the XDP prog, how can this be done
> > correctly and as expected?
> >
>
> Yeah, thanks for bringing up this concern. We are actively working on
> a fix of this issue(possibly with a wrapper structure of xdp_buff).

Is there some place to save the hash before running the XDP program?

Thanks.


>
> Thanks,
> Liang
>
> > Thanks,
> > Heng
> >
> > > +
> > > + switch (__le16_to_cpu(hdr_hash->hash_report)) {
> > > + case VIRTIO_NET_HASH_REPORT_TCPv4:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV4_TCP;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_UDPv4:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV4_UDP;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_TCPv6:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_UDPv6:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_TCPv6_EX:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV6_TCP_EX;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_UDPv6_EX:
> > > + *rss_type = XDP_RSS_TYPE_L4_IPV6_UDP_EX;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_IPv4:
> > > + *rss_type = XDP_RSS_TYPE_L3_IPV4;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_IPv6:
> > > + *rss_type = XDP_RSS_TYPE_L3_IPV6;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_IPv6_EX:
> > > + *rss_type = XDP_RSS_TYPE_L3_IPV6_EX;
> > > + break;
> > > + case VIRTIO_NET_HASH_REPORT_NONE:
> > > + default:
> > > + *rss_type = XDP_RSS_TYPE_NONE;
> > > + }
> > > +
> > > + *hash = __le32_to_cpu(hdr_hash->hash_value);
> > > + return 0;
> > > +}
> > > +
> > > +static const struct xdp_metadata_ops virtnet_xdp_metadata_ops = {
> > > + .xmo_rx_hash= virtnet_xdp_rx_hash,
> > > +};
> > > +
> > >   static int virtnet_probe(struct virtio_device *vdev)
> > >   {
> > >   int i, err = -ENOMEM;
> > > @@ -4613,6 +4667,8 @@ static int virtnet_probe(struct virtio_device *vdev)
> > >   dev->ethtool_ops = _ethtool_ops;
> > >   SET_NETDEV_DEV(dev, >dev);
> > >
> > > + dev->xdp_metadata_ops = _xdp_metadata_ops;
> > > +
> > >   /* Do we support "hardware" checksums? */
> > >   if (virtio_has_feature(vdev, VIRTIO_NET_F_CSUM)) {
> > >   /* This opens up the world of extra features. */
> >
>



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-06 Thread Xuan Zhuo
On Tue, 5 Dec 2023 20:39:28 +0100, Eric Dumazet  wrote:
> On Tue, Dec 5, 2023 at 3:11 AM Xuan Zhuo  wrote:
> >
> > On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> > > On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> > > >
> > > > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > > > which will be called every time a tcp data packet is sent, received, and
> > > > acked.
> > > > tcp_data_send: called after a data packet is sent.
> > > > tcp_data_recv: called after a data packet is received.
> > > > tcp_data_acked: called after a valid ack packet is processed (some sent
> > > > data are acknowledged).
> > > >
> > > > We use these callbacks for fine-grained tcp monitoring, which collects
> > > > and analyses every tcp request/response event information. The whole
> > > > system has been described in SIGMOD'18 (see
> > > > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > > > achieve this with bpf, we require hooks for data events that call bpf
> > > > prog (1) when any data packet is sent/received/acked, and (2) after
> > > > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > > > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > > > Besides, these tracepoints help to debug tcp when data send/recv/acked.
> > >
> > > This I do not understand.
> > >
> > > >
> > > > Though kretprobe/fexit can also be used to collect this information,
> > > > they will not work if the kernel functions get inlined. Considering the
> > > > stability, we prefer tracepoint as the solution.
> > >
> > > I dunno, this seems quite weak to me. I see many patches coming to add
> > > tracing in the stack, but no patches fixing any issues.
> >
> >
> > We have implemented a mechanism to split the request and response from the 
> > TCP
> > connection using these hooks, which can handle various protocols such as
> > HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
> > information about each request and response, including the amount of data
> > uploaded, the time taken by the server to handle the request, and the time 
> > taken
> > for the client to receive the response. This mechanism has been running
> > internally for many years and has proven to be very useful.
> >
> > One of the main benefits of this mechanism is that it helps in locating the
> > source of any issues or problems that may arise. For example, if there is a
> > problem with the network, the application, or the machine, we can use this
> > mechanism to identify and isolate the issue.
> >
> > TCP has long been a challenge when it comes to tracking the transmission of 
> > data
> > on the network. The application can only confirm that it has sent a certain
> > amount of data to the kernel, but it has limited visibility into whether the
> > client has actually received this data. Our mechanism addresses this issue 
> > by
> > providing insights into the amount of data received by the client and the 
> > time
> > it was received. Furthermore, we can also detect any packet loss or delays
> > caused by the server.
> >
> > https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg
> >
> > So, we do not want to add some tracepoint to do some unknow debug.
> > We have a clear goal. debugging is just an incidental capability.
> >
>
> We have powerful mechanisms in the stack already that ordinary (no
> privilege requested) applications can readily use.
>
> We have been using them for a while.
>
> If existing mechanisms are missing something you need, please expand them.
>
> For reference, start looking at tcp_get_timestamping_opt_stats() history.
>
> Sender side can for instance get precise timestamps.
>
> Combinations of these timestamps reveal different parts of the overall
> network latency,
>
> T0: sendmsg() enters TCP
> T1: first byte enters qdisc
> T2: first byte sent to the NIC
> T3: first byte ACKed in TCP
> T4: last byte sent to the NIC
> T5: last byte ACKed
> T1 - T0: how long the first byte was blocked in the TCP layer ("Head
> of Line Blocking" latency).
> T2 - T1: how long the first byte was blocked in the Linux traffic
> shaping layer (known as QDisc).
> T3 - T2: the network ‘distance’ (propagation delay + current queuing
> delay along the network path and at the receiver).
> T5 - T2: how fast the sent chunk was delivered.
> Message Size / (T5 - T0): goodput (from application’s perspective)
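
For reference, a minimal sketch (not from this thread) of requesting these
timestamps with SO_TIMESTAMPING on a connected TCP socket. The flag names come
from linux/net_tstamp.h; the helper name is made up and error handling is
trimmed:

```
/* Minimal sketch, assuming a connected TCP socket fd. */
#include <sys/socket.h>
#include <linux/net_tstamp.h>

static int enable_tx_timestamps(int fd)
{
	unsigned int flags =
		SOF_TIMESTAMPING_SOFTWARE |	/* report software stamps   */
		SOF_TIMESTAMPING_TX_SCHED |	/* ~T1: enters the qdisc    */
		SOF_TIMESTAMPING_TX_SOFTWARE |	/* ~T2/T4: handed to the NIC */
		SOF_TIMESTAMPING_TX_ACK |	/* ~T3/T5: data fully ACKed */
		SOF_TIMESTAMPING_OPT_ID;	/* tag stamps per sendmsg() */

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

The timestamps themselves come back as SCM_TIMESTAMPING control messages read
from the socket error queue (recvmsg() with MSG_ERRQUEUE), which is how the
T0..T5 points above are correlated.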


The key point is that using our mechanism, the application does not need to be
modified.

As long as the app's network protocol is request-response, we can trace the tcp
connection at any time to analyze requests and responses, and record the start
and end times of each request and response, along with TTL and other
information.

Thanks.



Re: [PATCH net-next] tcp: add tracepoints for data send/recv/acked

2023-12-04 Thread Xuan Zhuo
On Mon, 4 Dec 2023 13:28:21 +0100, Eric Dumazet  wrote:
> On Mon, Dec 4, 2023 at 12:43 PM Philo Lu  wrote:
> >
> > Add 3 tracepoints, namely tcp_data_send/tcp_data_recv/tcp_data_acked,
> > which will be called every time a tcp data packet is sent, received, and
> > acked.
> > tcp_data_send: called after a data packet is sent.
> > tcp_data_recv: called after a data packet is received.
> > tcp_data_acked: called after a valid ack packet is processed (some sent
> > data are acknowledged).
> >
> > We use these callbacks for fine-grained tcp monitoring, which collects
> > and analyses every tcp request/response event information. The whole
> > system has been described in SIGMOD'18 (see
> > https://dl.acm.org/doi/pdf/10.1145/3183713.3190659 for details). To
> > achieve this with bpf, we require hooks for data events that call bpf
> > prog (1) when any data packet is sent/received/acked, and (2) after
> > critical tcp state variables have been updated (e.g., snd_una, snd_nxt,
> > rcv_nxt). However, existing bpf hooks cannot meet our requirements.
> > Besides, these tracepoints help to debug tcp when data send/recv/acked.
>
> This I do not understand.
>
> >
> > Though kretprobe/fexit can also be used to collect this information,
> > they will not work if the kernel functions get inlined. Considering
> > stability, we prefer tracepoints as the solution.
>
> I dunno, this seems quite weak to me. I see many patches coming to add
> tracing in the stack, but no patches fixing any issues.


We have implemented a mechanism to split the request and response from the TCP
connection using these "hooks", which can handle various protocols such as
HTTP, HTTPS, Redis, and MySQL. This mechanism allows us to record important
information about each request and response, including the amount of data
uploaded, the time taken by the server to handle the request, and the time taken
for the client to receive the response. This mechanism has been running
internally for many years and has proven to be very useful.

One of the main benefits of this mechanism is that it helps in locating the
source of any issues or problems that may arise. For example, if there is a
problem with the network, the application, or the machine, we can use this
mechanism to identify and isolate the issue.

TCP has long been a challenge when it comes to tracking the transmission of data
on the network. The application can only confirm that it has sent a certain
amount of data to the kernel, but it has limited visibility into whether the
client has actually received this data. Our mechanism addresses this issue by
providing insights into the amount of data received by the client and the time
it was received. Furthermore, we can also detect any packet loss or delays
caused by the server.

https://help-static-aliyun-doc.aliyuncs.com/assets/img/zh-CN/7912288961/9732df025beny.svg

So, we do not want to add tracepoints just for some unknown debugging.
We have a clear goal; debugging is just an incidental capability.

Thanks.


>
> It really looks like : We do not know how TCP stack works, we do not
> know if there is any issue,
> let us add trace points to help us to make forward progress in our analysis.
>
> These tracepoints will not tell how many segments/bytes were
> sent/acked/received, I really do not see
> how we will avoid adding in the future more stuff, forcing the
> compiler to save more state
> just in case the tracepoint needs the info.
>
> The argument of "add minimal info", so that we can silently add more
> stuff in the future "for free" is not something I buy.
>
> I very much prefer that you make sure the stuff you need is not
> inlined, so that standard kprobe/kretprobe facility can be used.
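
As a concrete illustration of the kprobe route suggested above (a sketch only,
not part of this series): a libbpf-style program can attach to a non-inlined
function such as tcp_rcv_established() and read the state variables mentioned
in the patch description. The map name, event struct, and the choice of fields
below are made up for illustration:

```
// SPDX-License-Identifier: GPL-2.0
/* Sketch of the kprobe alternative; names and event layout are illustrative. */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

struct tcp_evt {
	__u32 snd_una;
	__u32 snd_nxt;
	__u32 rcv_nxt;
};

struct {
	__uint(type, BPF_MAP_TYPE_RINGBUF);
	__uint(max_entries, 1 << 20);
} tcp_evts SEC(".maps");

SEC("kprobe/tcp_rcv_established")
int BPF_KPROBE(on_tcp_rcv, struct sock *sk)
{
	struct tcp_sock *tp = (struct tcp_sock *)sk;
	struct tcp_evt *e;

	e = bpf_ringbuf_reserve(&tcp_evts, sizeof(*e), 0);
	if (!e)
		return 0;

	/* Snapshot the state variables the series wants to observe. */
	e->snd_una = BPF_CORE_READ(tp, snd_una);
	e->snd_nxt = BPF_CORE_READ(tp, snd_nxt);
	e->rcv_nxt = BPF_CORE_READ(tp, rcv_nxt);
	bpf_ringbuf_submit(e, 0);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

This only keeps working as long as the probed symbol stays out of line, which
is exactly the caveat raised in the patch description.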



Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-11-30 Thread Xuan Zhuo
On Thu, 30 Nov 2023 13:30:40 +0800, Zhu Yanjun  wrote:
>
> On 2023/11/30 10:34, Xuan Zhuo wrote:
> > On Wed, 29 Nov 2023 23:29:10 +0800, Zhu Yanjun  wrote:
> >> On 2023/11/29 23:22, Zhu Yanjun wrote:
> >>> On 2023/11/29 22:59, Michael S. Tsirkin wrote:
> >>>> On Wed, Nov 29, 2023 at 10:50:57PM +0800, Zhu Yanjun wrote:
> >>>>> On 2023/5/26 13:46, Liang Chen wrote:
> >>>> what made you respond to a patch from May, now?
> >>> I want to apply page_pool to our virtio_net. This virtio_net works on
> >>> our device.
> >>>
> >>> I want to verify whether page_pool on virtio_net with our device can
> >>> improve the performance or not.
> >>>
> >>> And I found that ethtool is wrong.
> >>>
> >>> I use virtio_net on our device. I found that page member variable in
> >>> rq is not used in recv path.
> >>>
> >>> When virtio_net is modprobe, I checked page member variable in rq with
> >>> kprobe or crash tool.  page member variable in rq is always NULL.
> >>>
> >>> But sg in recv path is used.
> >>>
> >>> So how to use page member variable in rq? If page member variable in
> >>> rq is always NULL, can we remove it?
> >>>
> >>> BTW, I use ping and iperf tool to make tests with virtio_net. In the
> >>> tests, page member variable in rq is always NULL.
> >>
> >> And I replaced page member variable in rq with page_pool, but the
> >> statistics of page_pool are always 0.
> >>
> >> It is interesting that page_pool member variable in rq is not used in
> >> ping and iperf tests.
> >>
> >> I am not sure what tests can make page member variable not NULL. ^_^
> > Do you mean rq->pages?
> >
> > That is for big mode.
>
> Hi, Xuan
>
> Got it. What is big mode? Do you mean big packet size? I run iperf with
> the packet size 2^23.
>
> The rq->pages is still NULL.
>
> It is interesting.

You may need to check the code of virtnet_probe().

Thanks.
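
For context, "big mode" is chosen at probe time. A rough, paraphrased sketch of
that decision follows (the helper name is made up and the exact conditions
differ by kernel version; see virtnet_probe() in drivers/net/virtio_net.c for
the real logic):

```
/* Paraphrased sketch only; field and feature names follow virtio_net.c,
 * but this is not the exact upstream code.
 */
static void virtnet_guess_rx_mode(struct virtio_device *vdev,
				  struct virtnet_info *vi)
{
	/* Guest TSO/UFO means receive buffers can exceed one page. */
	if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6) ||
	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_ECN) ||
	    virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_UFO))
		vi->big_packets = true;

	if (virtio_has_feature(vdev, VIRTIO_NET_F_MRG_RXBUF))
		vi->mergeable_rx_bufs = true;

	/* rq->pages is only refilled on the big-mode receive path
	 * (big_packets && !mergeable_rx_bufs). With mergeable buffers
	 * negotiated (the common case) it stays NULL, which matches the
	 * observation above.
	 */
}
```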


>
> Zhu Yanjun
>
>
> >
> > Thanks.
> >
> >
> >> Best Regards,
> >>
> >> Zhu Yanjun
> >>
> >>
> >>> It is interesting.
> >>>
> >>> Zhu Yanjun
> >>>



Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-11-29 Thread Xuan Zhuo
On Wed, 29 Nov 2023 23:29:10 +0800, Zhu Yanjun  wrote:
>
> On 2023/11/29 23:22, Zhu Yanjun wrote:
> >
> > On 2023/11/29 22:59, Michael S. Tsirkin wrote:
> >> On Wed, Nov 29, 2023 at 10:50:57PM +0800, Zhu Yanjun wrote:
> >> On 2023/5/26 13:46, Liang Chen wrote:
> >>
> >> what made you respond to a patch from May, now?
> >
> > I want to apply page_pool to our virtio_net. This virtio_net works on
> > our device.
> >
> > I want to verify whether page_pool on virtio_net with our device can
> > improve the performance or not.
> >
> > And I found that ethtool is wrong.
> >
> > I use virtio_net on our device. I found that page member variable in
> > rq is not used in recv path.
> >
> > When virtio_net is modprobe, I checked page member variable in rq with
> > kprobe or crash tool.  page member variable in rq is always NULL.
> >
> > But sg in recv path is used.
> >
> > So how to use page member variable in rq? If page member variable in
> > rq is always NULL, can we remove it?
> >
> > BTW, I use ping and iperf tool to make tests with virtio_net. In the
> > tests, page member variable in rq is always NULL.
>
>
> And I replaced page member variable in rq with page_pool, but the
> statistics of page_pool are always 0.
>
> It is interesting that page_pool member variable in rq is not used in
> ping and iperf tests.
>
> I am not sure what tests can make page member variable not NULL. ^_^

Do you mean rq->pages?

That is for big mode.

Thanks.


>
> Best Regards,
>
> Zhu Yanjun
>
>
> >
> > It is interesting.
> >
> > Zhu Yanjun
> >
> >>



[PATCH net-next v2 3/3] xsk: build skb by page

2021-01-20 Thread Xuan Zhuo
This patch is used to construct skb based on page to save memory copy
overhead.

This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when the
network card's priv_flags support IFF_TX_SKB_NO_LINEAR will pages be used to
construct the skb directly. If this feature is not supported, it is still
necessary to copy data to construct the skb.

 Performance Testing 

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s 
```

Test result data:

size        64       512      1024      1500
copy     1916747   1775988   1600203   1440054
page     1974058   1953655   1945463   1904478
percent     3.0%     10.0%    21.58%     32.3%

Signed-off-by: Xuan Zhuo 
Reviewed-by: Dust Li 
---
 net/xdp/xsk.c | 104 --
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 8037b04..40bac11 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+ struct xdp_desc *desc)
+{
+   u32 len, offset, copy, copied;
+   struct sk_buff *skb;
+   struct page *page;
+   void *buffer;
+   int err, i;
+   u64 addr;
+
+   skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   addr = desc->addr;
+   len = desc->len;
+
+   buffer = xsk_buff_raw_get_data(xs->pool, addr);
+   offset = offset_in_page(buffer);
+   addr = buffer - xs->pool->addrs;
+
+   for (copied = 0, i = 0; copied < len; i++) {
+   page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
+
+   get_page(page);
+
+   copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+
+   skb_fill_page_desc(skb, i, page, offset, copy);
+
+   copied += copy;
+   addr += copy;
+   offset = 0;
+   }
+
+   skb->len += len;
+   skb->data_len += len;
+   skb->truesize += len;
+
+   refcount_add(len, &xs->sk.sk_wmem_alloc);
+
+   return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+struct xdp_desc *desc)
+{
+   struct sk_buff *skb = NULL;
+
+   if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+   skb = xsk_build_skb_zerocopy(xs, desc);
+   if (IS_ERR(skb))
+   return skb;
+   } else {
+   void *buffer;
+   u32 len;
+   int err;
+
+   len = desc->len;
+   skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   skb_put(skb, len);
+   buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+   err = skb_store_bits(skb, 0, buffer, len);
+   if (unlikely(err)) {
+   kfree_skb(skb);
+   return ERR_PTR(err);
+   }
+   }
+
+   skb->dev = xs->dev;
+   skb->priority = xs->sk.sk_priority;
+   skb->mark = xs->sk.sk_mark;
+   skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+   skb->destructor = xsk_destruct_skb;
+
+   return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
struct xdp_sock *xs = xdp_sk(sk);
@@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk)
goto out;
 
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-   char *buffer;
-   u64 addr;
-   u32 len;
-
if (max_batch-- == 0) {
err = -EAGAIN;
goto out;
}
 
-   len = desc.len;
-   skb = sock_alloc_send_skb(sk, len, 1, &err);
-   if (unlikely(!skb))
+   skb = xsk_build_skb(xs, &desc);
+   if (IS_ERR(skb)) {
+   err = PTR_ERR(skb);
goto out;
+   }
 
-   skb_put(skb, len);
-   addr = desc.addr;
-   buffer = xsk_buff_raw_get_data(xs->pool, addr);
-   err = skb_store_bits(skb, 0, buffer, len);
/* This is the backpressure mechanism for the Tx path.
 * Reserve space in the completion queue and only proceed
 * if there is space in it. This avoids having to implement
 * any buffering in the Tx path.
 */
spin_lock_irqsave(&xs->pool->cq_lock, flags);
-   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+   if (xskq_prod_reserve(xs->pool->cq)) {
   

[PATCH net-next v2 3/3] xsk: build skb by page

2021-01-19 Thread Xuan Zhuo
This patch is used to construct skb based on page to save memory copy
overhead.

This function is implemented based on IFF_TX_SKB_NO_LINEAR. Only when the
network card's priv_flags support IFF_TX_SKB_NO_LINEAR will pages be used to
construct the skb directly. If this feature is not supported, it is still
necessary to copy data to construct the skb.

 Performance Testing 

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s 
```

Test result data:

size        64       512      1024      1500
copy     1916747   1775988   1600203   1440054
page     1974058   1953655   1945463   1904478
percent     3.0%     10.0%    21.58%     32.3%

Signed-off-by: Xuan Zhuo 
Reviewed-by: Dust Li 
---
 net/xdp/xsk.c | 104 --
 1 file changed, 86 insertions(+), 18 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 8037b04..817a3a5 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -430,6 +430,87 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
+ struct xdp_desc *desc)
+{
+   u32 len, offset, copy, copied;
+   struct sk_buff *skb;
+   struct page *page;
+   char *buffer;
+   int err, i;
+   u64 addr;
+
+   skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   addr = desc->addr;
+   len = desc->len;
+
+   buffer = xsk_buff_raw_get_data(xs->pool, addr);
+   offset = offset_in_page(buffer);
+   addr = buffer - (char *)xs->pool->addrs;
+
+   for (copied = 0, i = 0; copied < len; i++) {
+   page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
+
+   get_page(page);
+
+   copy = min_t(u32, PAGE_SIZE - offset, len - copied);
+
+   skb_fill_page_desc(skb, i, page, offset, copy);
+
+   copied += copy;
+   addr += copy;
+   offset = 0;
+   }
+
+   skb->len += len;
+   skb->data_len += len;
+   skb->truesize += len;
+
+   refcount_add(len, &xs->sk.sk_wmem_alloc);
+
+   return skb;
+}
+
+static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
+struct xdp_desc *desc)
+{
+   struct sk_buff *skb = NULL;
+
+   if (xs->dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
+   skb = xsk_build_skb_zerocopy(xs, desc);
+   if (IS_ERR(skb))
+   return skb;
+   } else {
+   char *buffer;
+   u32 len;
+   int err;
+
+   len = desc->len;
+   skb = sock_alloc_send_skb(&xs->sk, len, 1, &err);
+   if (unlikely(!skb))
+   return ERR_PTR(err);
+
+   skb_put(skb, len);
+   buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
+   err = skb_store_bits(skb, 0, buffer, len);
+   if (unlikely(err)) {
+   kfree_skb(skb);
+   return ERR_PTR(err);
+   }
+   }
+
+   skb->dev = xs->dev;
+   skb->priority = xs->sk.sk_priority;
+   skb->mark = xs->sk.sk_mark;
+   skb_shinfo(skb)->destructor_arg = (void *)(long)desc->addr;
+   skb->destructor = xsk_destruct_skb;
+
+   return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
struct xdp_sock *xs = xdp_sk(sk);
@@ -446,43 +527,30 @@ static int xsk_generic_xmit(struct sock *sk)
goto out;
 
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-   char *buffer;
-   u64 addr;
-   u32 len;
-
if (max_batch-- == 0) {
err = -EAGAIN;
goto out;
}
 
-   len = desc.len;
-   skb = sock_alloc_send_skb(sk, len, 1, &err);
-   if (unlikely(!skb))
+   skb = xsk_build_skb(xs, &desc);
+   if (IS_ERR(skb)) {
+   err = PTR_ERR(skb);
goto out;
+   }
 
-   skb_put(skb, len);
-   addr = desc.addr;
-   buffer = xsk_buff_raw_get_data(xs->pool, addr);
-   err = skb_store_bits(skb, 0, buffer, len);
/* This is the backpressure mechanism for the Tx path.
 * Reserve space in the completion queue and only proceed
 * if there is space in it. This avoids having to implement
 * any buffering in the Tx path.
 */
spin_lock_irqsave(&xs->pool->cq_lock, flags);
-   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+   if (xskq_prod_reserve(xs->pool->cq)) {
   

[PATCH net-next v2 2/3] virtio-net: support IFF_TX_SKB_NO_LINEAR

2021-01-19 Thread Xuan Zhuo
Virtio net supports the case where the skb linear space is empty, so add
priv_flags.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ba8e637..f2ff6c3 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2972,7 +2972,8 @@ static int virtnet_probe(struct virtio_device *vdev)
return -ENOMEM;
 
/* Set up network device as normal. */
-   dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE;
+   dev->priv_flags |= IFF_UNICAST_FLT | IFF_LIVE_ADDR_CHANGE |
+  IFF_TX_SKB_NO_LINEAR;
dev->netdev_ops = _netdev;
dev->features = NETIF_F_HIGHDMA;
 
-- 
1.8.3.1



[PATCH netdev 1/5] xsk: support get page for drv

2021-01-05 Thread Xuan Zhuo
For some drivers, such as virtio-net, we do not configure dma when
binding xsk. We will get the page when sending.

This patch adds a field need_dma to the xsk pool setup request. If
the device does not use dma, this value should be set to false.

And a function xsk_buff_raw_get_page is added to get the page based on
addr in drv.

Signed-off-by: Xuan Zhuo 
---
 include/linux/netdevice.h   |  1 +
 include/net/xdp_sock_drv.h  | 10 ++
 include/net/xsk_buff_pool.h |  1 +
 net/xdp/xsk_buff_pool.c | 10 +-
 4 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7bf1679..b8baef9 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -915,6 +915,7 @@ struct netdev_bpf {
struct {
struct xsk_buff_pool *pool;
u16 queue_id;
+   bool need_dma;
} xsk;
};
 };
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 4e295541..e9c7e25 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -100,6 +100,11 @@ static inline void *xsk_buff_raw_get_data(struct 
xsk_buff_pool *pool, u64 addr)
return xp_raw_get_data(pool, addr);
 }
 
+static inline struct page *xsk_buff_raw_get_page(struct xsk_buff_pool *pool, 
u64 addr)
+{
+   return xp_raw_get_page(pool, addr);
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct 
xsk_buff_pool *pool)
 {
struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -232,6 +237,11 @@ static inline void *xsk_buff_raw_get_data(struct 
xsk_buff_pool *pool, u64 addr)
return NULL;
 }
 
+static inline struct page *xsk_buff_raw_get_page(struct xsk_buff_pool *pool, 
u64 addr)
+{
+   return NULL;
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct 
xsk_buff_pool *pool)
 {
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 01755b8..54e461d 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -103,6 +103,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device 
*dev,
 bool xp_can_alloc(struct xsk_buff_pool *pool, u32 count);
 void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr);
 dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr);
+struct page *xp_raw_get_page(struct xsk_buff_pool *pool, u64 addr);
 static inline dma_addr_t xp_get_dma(struct xdp_buff_xsk *xskb)
 {
return xskb->dma;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 67a4494..9bb058f 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -167,12 +167,13 @@ static int __xp_assign_dev(struct xsk_buff_pool *pool,
bpf.command = XDP_SETUP_XSK_POOL;
bpf.xsk.pool = pool;
bpf.xsk.queue_id = queue_id;
+   bpf.xsk.need_dma = true;
 
err = netdev->netdev_ops->ndo_bpf(netdev, &bpf);
if (err)
goto err_unreg_pool;
 
-   if (!pool->dma_pages) {
+   if (bpf.xsk.need_dma && !pool->dma_pages) {
WARN(1, "Driver did not DMA map zero-copy buffers");
err = -EINVAL;
goto err_unreg_xsk;
@@ -536,6 +537,13 @@ void *xp_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 }
 EXPORT_SYMBOL(xp_raw_get_data);
 
+struct page *xp_raw_get_page(struct xsk_buff_pool *pool, u64 addr)
+{
+   addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
+   return pool->umem->pgs[addr >> PAGE_SHIFT];
+}
+EXPORT_SYMBOL(xp_raw_get_page);
+
 dma_addr_t xp_raw_get_dma(struct xsk_buff_pool *pool, u64 addr)
 {
addr = pool->unaligned ? xp_unaligned_add_offset_to_addr(addr) : addr;
-- 
1.8.3.1



[PATCH netdev 5/5] virtio-net, xsk: virtio-net support xsk zero copy tx

2021-01-05 Thread Xuan Zhuo
Virtio net supports xdp sockets.

The module param "napi_tx" must be enabled to use this feature.

In fact, various virtio implementations have some problems:
1. The tx interrupt may be lost
2. The tx interrupt may have a relatively large delay

This brings us to several questions:

1. Wakeup wakes up a tx interrupt or directly starts a napi on the
   current cpu, which will cause a delay in sending packets.
2. When the tx ring is full, the tx interrupt may be lost or delayed,
   resulting in untimely recovery.

I choose to send part of the data directly during wakeup. If the sending
has not been completed, I will start a napi to complete the subsequent
sending work.

Since the possible delay or loss of tx interrupt occurs when the tx ring
is full, I added a timer to solve this problem.

The performance of udp sending based on virtio net + xsk is 6 times that
of ordinary kernel udp send.

* xsk_check_timeout: when the dev is full or all xsk.hdr entries are used,
  start a timer to check whether xsk.hdr becomes available again. The unit
  is us.
* xsk_num_max: the maximum number of xsk.hdr entries
* xsk_num_percent: the maximum number of hdr entries as a percentage of the
  virtio ring size. The real xsk hdr num is the min of xsk_num_max and this
  percentage of the virtio ring size
* xsk_budget: the budget for one xsk run

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 437 ++-
 1 file changed, 434 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index e744dce..76319e7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -22,10 +22,21 @@
 #include 
 #include 
 #include 
+#include 
 
 static int napi_weight = NAPI_POLL_WEIGHT;
 module_param(napi_weight, int, 0444);
 
+static int xsk_check_timeout = 100;
+static int xsk_num_max   = 1024;
+static int xsk_num_percent   = 80;
+static int xsk_budget= 128;
+
+module_param(xsk_check_timeout, int, 0644);
+module_param(xsk_num_max,   int, 0644);
+module_param(xsk_num_percent,   int, 0644);
+module_param(xsk_budget,int, 0644);
+
 static bool csum = true, gso = true, napi_tx = true;
 module_param(csum, bool, 0444);
 module_param(gso, bool, 0444);
@@ -110,6 +121,9 @@ struct virtnet_xsk_hdr {
u32 len;
 };
 
+#define VIRTNET_STATE_XSK_WAKEUP BIT(0)
+#define VIRTNET_STATE_XSK_TIMER BIT(1)
+
 #define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m)
 #define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m)
 
@@ -149,6 +163,32 @@ struct send_queue {
struct virtnet_sq_stats stats;
 
struct napi_struct napi;
+
+   struct {
+   struct xsk_buff_pool   __rcu *pool;
+   struct virtnet_xsk_hdr __rcu *hdr;
+
+   unsigned long  state;
+   u64hdr_con;
+   u64hdr_pro;
+   u64hdr_n;
+   struct xdp_desclast_desc;
+   bool   wait_slot;
+   /* tx interrupt issues
+*   1. that may be lost
+*   2. that too slow, 200/s or delay 10ms
+*
+* timer for:
+* 1. recycle the desc.(no check for performance, see below)
+* 2. check whether the nic ring is available when the nic ring is full
+*
+* Here, the regular check is performed for dev full. The
+* application layer must ensure that the number of cq is
+* sufficient, otherwise there may be insufficient cq in use.
+*
+*/
+   struct hrtimer  timer;
+   } xsk;
 };
 
 /* Internal representation of a receive virtqueue */
@@ -267,6 +307,8 @@ static void __free_old_xmit_ptr(struct send_queue *sq, bool 
in_napi,
bool xsk_wakeup,
unsigned int *_packets, unsigned int *_bytes);
 static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi);
+static int virtnet_xsk_run(struct send_queue *sq,
+  struct xsk_buff_pool *pool, int budget);
 
 static bool is_xdp_frame(void *ptr)
 {
@@ -1439,6 +1481,40 @@ static int virtnet_receive(struct receive_queue *rq, int 
budget,
return stats.packets;
 }
 
+static void virt_xsk_complete(struct send_queue *sq, u32 num, bool xsk_wakeup)
+{
+   struct xsk_buff_pool *pool;
+   int n;
+
+   rcu_read_lock();
+
+   WRITE_ONCE(sq->xsk.hdr_pro, sq->xsk.hdr_pro + num);
+
+   pool = rcu_dereference(sq->xsk.pool);
+   if (!pool) {
+   if (sq->xsk.hdr_pro - sq->xsk.hdr_con == sq->xsk.hdr_n) {
+   kfree(sq->xsk.hdr);
+   rcu_assign_pointer(sq->xsk.hdr, NULL);
+   }
+   rcu_read_unlock();
+   return;
+   }
+
+   xsk_tx_completed(pool, num);
+
+   rcu_read_unlock();
+
+   if 

[PATCH netdev 4/5] xsk, virtio-net: prepare for support xsk

2021-01-05 Thread Xuan Zhuo
Split function free_old_xmit_skbs, add sub-function __free_old_xmit_ptr,
which is convenient to call with other statistical information, and
supports the parameter 'xsk_wakeup' required for processing xsk.

Use netif stop check as a function virtnet_sq_stop_check, which will be
used when adding xsk support.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 95 ++--
 1 file changed, 52 insertions(+), 43 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index df38a9f..e744dce 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -263,6 +263,11 @@ struct padded_vnet_hdr {
char padding[4];
 };
 
+static void __free_old_xmit_ptr(struct send_queue *sq, bool in_napi,
+   bool xsk_wakeup,
+   unsigned int *_packets, unsigned int *_bytes);
+static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi);
+
 static bool is_xdp_frame(void *ptr)
 {
return (unsigned long)ptr & VIRTIO_XDP_FLAG;
@@ -376,6 +381,37 @@ static void skb_xmit_done(struct virtqueue *vq)
netif_wake_subqueue(vi->dev, vq2txq(vq));
 }
 
+static void virtnet_sq_stop_check(struct send_queue *sq, bool in_napi)
+{
+   struct virtnet_info *vi = sq->vq->vdev->priv;
+   struct net_device *dev = vi->dev;
+   int qnum = sq - vi->sq;
+
+   /* If running out of space, stop queue to avoid getting packets that we
+* are then unable to transmit.
+* An alternative would be to force queuing layer to requeue the skb by
+* returning NETDEV_TX_BUSY. However, NETDEV_TX_BUSY should not be
+* returned in a normal path of operation: it means that driver is not
+* maintaining the TX queue stop/start state properly, and causes
+* the stack to do a non-trivial amount of useless work.
+* Since most packets only take 1 or 2 ring slots, stopping the queue
+* early means 16 slots are typically wasted.
+*/
+
+   if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
+   netif_stop_subqueue(dev, qnum);
+   if (!sq->napi.weight &&
+   unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
+   /* More just got used, free them then recheck. */
+   free_old_xmit_skbs(sq, in_napi);
+   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) {
+   netif_start_subqueue(dev, qnum);
+   virtqueue_disable_cb(sq->vq);
+   }
+   }
+   }
+}
+
 #define MRG_CTX_HEADER_SHIFT 22
 static void *mergeable_len_to_ctx(unsigned int truesize,
  unsigned int headroom)
@@ -543,13 +579,11 @@ static int virtnet_xdp_xmit(struct net_device *dev,
struct receive_queue *rq = vi->rq;
struct bpf_prog *xdp_prog;
struct send_queue *sq;
-   unsigned int len;
int packets = 0;
int bytes = 0;
int drops = 0;
int kicks = 0;
int ret, err;
-   void *ptr;
int i;
 
/* Only allow ndo_xdp_xmit if XDP is loaded on dev, as this
@@ -567,24 +601,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
goto out;
}
 
-   /* Free up any pending old buffers before queueing new ones. */
-   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
-   if (likely(is_xdp_frame(ptr))) {
-   struct virtnet_xdp_type *xtype;
-   struct xdp_frame *frame;
-
-   xtype = ptr_to_xtype(ptr);
-   frame = xtype_got_ptr(xtype);
-   bytes += frame->len;
-   xdp_return_frame(frame);
-   } else {
-   struct sk_buff *skb = ptr;
-
-   bytes += skb->len;
-   napi_consume_skb(skb, false);
-   }
-   packets++;
-   }
+   __free_old_xmit_ptr(sq, false, true, &packets, &bytes);
 
for (i = 0; i < n; i++) {
struct xdp_frame *xdpf = frames[i];
@@ -1422,7 +1439,9 @@ static int virtnet_receive(struct receive_queue *rq, int 
budget,
return stats.packets;
 }
 
-static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
+static void __free_old_xmit_ptr(struct send_queue *sq, bool in_napi,
+   bool xsk_wakeup,
+   unsigned int *_packets, unsigned int *_bytes)
 {
unsigned int packets = 0;
unsigned int bytes = 0;
@@ -1456,6 +1475,17 @@ static void free_old_xmit_skbs(struct send_queue *sq, 
bool in_napi)
packets++;
}
 
+   *_packets = packets;
+   *_bytes = bytes;
+}
+
+static void free_old_xmit_skbs(struct send_queue *sq, bool in_n

[PATCH netdev 3/5] virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx

2021-01-05 Thread Xuan Zhuo
When xsk is supported, a new kind of ptr will be recovered during the
process of freeing the old ptr. In order to distinguish between ctx sent
by XDP_TX and ctx sent by xsk, a struct is added here to distinguish
between these two situations. virtnet_xdp_type.type is used to
distinguish the different ctx, and virtnet_xdp_type.offset is used to record
the offset between the "true ctx" and virtnet_xdp_type.

The newly added virtnet_xsk_hdr will be used for xsk.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 77 ++--
 1 file changed, 62 insertions(+), 15 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f2349b8..df38a9f 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -94,6 +94,22 @@ struct virtnet_rq_stats {
u64 kicks;
 };
 
+enum {
+   XDP_TYPE_XSK,
+   XDP_TYPE_TX,
+};
+
+struct virtnet_xdp_type {
+   int offset:24;
+   unsigned type:8;
+};
+
+struct virtnet_xsk_hdr {
+   struct virtnet_xdp_type type;
+   struct virtio_net_hdr_mrg_rxbuf hdr;
+   u32 len;
+};
+
 #define VIRTNET_SQ_STAT(m) offsetof(struct virtnet_sq_stats, m)
 #define VIRTNET_RQ_STAT(m) offsetof(struct virtnet_rq_stats, m)
 
@@ -252,14 +268,19 @@ static bool is_xdp_frame(void *ptr)
return (unsigned long)ptr & VIRTIO_XDP_FLAG;
 }
 
-static void *xdp_to_ptr(struct xdp_frame *ptr)
+static void *xdp_to_ptr(struct virtnet_xdp_type *ptr)
 {
return (void *)((unsigned long)ptr | VIRTIO_XDP_FLAG);
 }
 
-static struct xdp_frame *ptr_to_xdp(void *ptr)
+static struct virtnet_xdp_type *ptr_to_xtype(void *ptr)
 {
-   return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
+   return (struct virtnet_xdp_type *)((unsigned long)ptr & 
~VIRTIO_XDP_FLAG);
+}
+
+static void *xtype_got_ptr(struct virtnet_xdp_type *xdptype)
+{
+   return (char *)xdptype + xdptype->offset;
 }
 
 /* Converting between virtqueue no. and kernel tx/rx queue no.
@@ -460,11 +481,16 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
   struct xdp_frame *xdpf)
 {
struct virtio_net_hdr_mrg_rxbuf *hdr;
+   struct virtnet_xdp_type *xdptype;
int err;
 
-   if (unlikely(xdpf->headroom < vi->hdr_len))
+   if (unlikely(xdpf->headroom < vi->hdr_len + sizeof(*xdptype)))
return -EOVERFLOW;
 
+   xdptype = (struct virtnet_xdp_type *)(xdpf + 1);
+   xdptype->offset = (char *)xdpf - (char *)xdptype;
+   xdptype->type = XDP_TYPE_TX;
+
/* Make room for virtqueue hdr (also change xdpf->headroom?) */
xdpf->data -= vi->hdr_len;
/* Zero header and leave csum up to XDP layers */
@@ -474,7 +500,7 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
 
sg_init_one(sq->sg, xdpf->data, xdpf->len);
 
-   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdpf),
+   err = virtqueue_add_outbuf(sq->vq, sq->sg, 1, xdp_to_ptr(xdptype),
   GFP_ATOMIC);
if (unlikely(err))
return -ENOSPC; /* Caller handle free/refcnt */
@@ -544,8 +570,11 @@ static int virtnet_xdp_xmit(struct net_device *dev,
/* Free up any pending old buffers before queueing new ones. */
while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
if (likely(is_xdp_frame(ptr))) {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
+   struct virtnet_xdp_type *xtype;
+   struct xdp_frame *frame;
 
+   xtype = ptr_to_xtype(ptr);
+   frame = xtype_got_ptr(xtype);
bytes += frame->len;
xdp_return_frame(frame);
} else {
@@ -1395,24 +1424,34 @@ static int virtnet_receive(struct receive_queue *rq, 
int budget,
 
 static void free_old_xmit_skbs(struct send_queue *sq, bool in_napi)
 {
-   unsigned int len;
unsigned int packets = 0;
unsigned int bytes = 0;
-   void *ptr;
+   unsigned int len;
+   struct virtnet_xdp_type *xtype;
+   struct xdp_frame*frame;
+   struct virtnet_xsk_hdr  *xskhdr;
+   struct sk_buff  *skb;
+   void*ptr;
 
while ((ptr = virtqueue_get_buf(sq->vq, )) != NULL) {
if (likely(!is_xdp_frame(ptr))) {
-   struct sk_buff *skb = ptr;
+   skb = ptr;
 
pr_debug("Sent skb %p\n", skb);
 
bytes += skb->len;
napi_consume_skb(skb, in_napi);
} else {
-   struct xdp_frame *frame = ptr_to_xdp(ptr);
+   xtype = ptr_to_xtype(ptr);
 
-   bytes += frame->len;
-   xdp_return_frame(frame);
+  

[PATCH netdev 2/5] virtio-net: support XDP_TX when not more queues

2021-01-05 Thread Xuan Zhuo
The number of queues implemented by many virtio backends is limited,
especially since some machines have a large number of CPUs. In this case, it
is often impossible to allocate a separate queue for XDP_TX.

This patch allows XDP_TX to run under a lock when there are not enough queues.

Signed-off-by: Xuan Zhuo 
---
 drivers/net/virtio_net.c | 42 --
 1 file changed, 32 insertions(+), 10 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f65eea6..f2349b8 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -194,6 +194,7 @@ struct virtnet_info {
 
/* # of XDP queue pairs currently used by the driver */
u16 xdp_queue_pairs;
+   bool xdp_enable;
 
/* I like... big packets and I cannot lie! */
bool big_packets;
@@ -481,14 +482,34 @@ static int __virtnet_xdp_xmit_one(struct virtnet_info *vi,
return 0;
 }
 
-static struct send_queue *virtnet_xdp_sq(struct virtnet_info *vi)
+static struct send_queue *virtnet_get_xdp_sq(struct virtnet_info *vi)
 {
unsigned int qp;
+   struct netdev_queue *txq;
+
+   if (vi->curr_queue_pairs > nr_cpu_ids) {
+   qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + 
smp_processor_id();
+   } else {
+   qp = smp_processor_id() % vi->curr_queue_pairs;
+   txq = netdev_get_tx_queue(vi->dev, qp);
+   __netif_tx_lock(txq, raw_smp_processor_id());
+   }
 
-   qp = vi->curr_queue_pairs - vi->xdp_queue_pairs + smp_processor_id();
return &vi->sq[qp];
 }
 
+static void virtnet_put_xdp_sq(struct virtnet_info *vi)
+{
+   unsigned int qp;
+   struct netdev_queue *txq;
+
+   if (vi->curr_queue_pairs <= nr_cpu_ids) {
+   qp = smp_processor_id() % vi->curr_queue_pairs;
+   txq = netdev_get_tx_queue(vi->dev, qp);
+   __netif_tx_unlock(txq);
+   }
+}
+
 static int virtnet_xdp_xmit(struct net_device *dev,
int n, struct xdp_frame **frames, u32 flags)
 {
@@ -512,7 +533,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
if (!xdp_prog)
return -ENXIO;
 
-   sq = virtnet_xdp_sq(vi);
+   sq = virtnet_get_xdp_sq(vi);
 
if (unlikely(flags & ~XDP_XMIT_FLAGS_MASK)) {
ret = -EINVAL;
@@ -560,12 +581,13 @@ static int virtnet_xdp_xmit(struct net_device *dev,
sq->stats.kicks += kicks;
u64_stats_update_end(&sq->stats.syncp);
 
+   virtnet_put_xdp_sq(vi);
return ret;
 }
 
 static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
 {
-   return vi->xdp_queue_pairs ? VIRTIO_XDP_HEADROOM : 0;
+   return vi->xdp_enable ? VIRTIO_XDP_HEADROOM : 0;
 }
 
 /* We copy the packet for XDP in the following cases:
@@ -1457,12 +1479,13 @@ static int virtnet_poll(struct napi_struct *napi, int 
budget)
xdp_do_flush();
 
if (xdp_xmit & VIRTIO_XDP_TX) {
-   sq = virtnet_xdp_sq(vi);
+   sq = virtnet_get_xdp_sq(vi);
if (virtqueue_kick_prepare(sq->vq) && virtqueue_notify(sq->vq)) 
{
u64_stats_update_begin(&sq->stats.syncp);
sq->stats.kicks++;
u64_stats_update_end(&sq->stats.syncp);
}
+   virtnet_put_xdp_sq(vi);
}
 
return received;
@@ -2415,10 +2438,7 @@ static int virtnet_xdp_set(struct net_device *dev, 
struct bpf_prog *prog,
 
/* XDP requires extra queues for XDP_TX */
if (curr_qp + xdp_qp > vi->max_queue_pairs) {
-   NL_SET_ERR_MSG_MOD(extack, "Too few free TX rings available");
-   netdev_warn(dev, "request %i queues but max is %i\n",
-   curr_qp + xdp_qp, vi->max_queue_pairs);
-   return -ENOMEM;
+   xdp_qp = 0;
}
 
old_prog = rtnl_dereference(vi->rq[0].xdp_prog);
@@ -2451,12 +2471,14 @@ static int virtnet_xdp_set(struct net_device *dev, 
struct bpf_prog *prog,
netif_set_real_num_rx_queues(dev, curr_qp + xdp_qp);
vi->xdp_queue_pairs = xdp_qp;
 
+   vi->xdp_enable = false;
if (prog) {
for (i = 0; i < vi->max_queue_pairs; i++) {
rcu_assign_pointer(vi->rq[i].xdp_prog, prog);
if (i == 0 && !old_prog)
virtnet_clear_guest_offloads(vi);
}
+   vi->xdp_enable = true;
}
 
for (i = 0; i < vi->max_queue_pairs; i++) {
@@ -2524,7 +2546,7 @@ static int virtnet_set_features(struct net_device *dev,
int err;
 
if ((dev->features ^ features) & NETIF_F_LRO) {
-   if (vi->xdp_queue_pairs)
+   if (vi->xdp_enable)
return -EBUSY;
 
if (features & NETIF_F_LRO)
-- 
1.8.3.1



[PATCH netdev 0/5] virtio-net support xdp socket zero copy xmit

2021-01-05 Thread Xuan Zhuo
The first patch made some adjustments to xsk.

The second patch itself can be used as an independent patch to solve the problem
that XDP may fail to load when the number of queues is insufficient.

The third to last patch implements support for xsk in virtio-net.

A practical problem with virtio is that tx interrupts are not very reliable.
There will always be some missing or delayed tx interrupts. So I added a
timer to solve this problem. Of course, considering performance, the timer
only triggers when the ring of the network card is full.

Regarding the issue of virtio-net supporting xsk's zero copy rx, I am also
developing it, but I found that the modification may be relatively large, so I
consider this patch set to be separated from the code related to xsk zero copy
rx.

Xuan Zhuo (5):
  xsk: support get page for drv
  virtio-net: support XDP_TX when not more queues
  virtio-net, xsk: distinguish XDP_TX and XSK XMIT ctx
  xsk, virtio-net: prepare for support xsk
  virtio-net, xsk: virtio-net support xsk zero copy tx

 drivers/net/virtio_net.c| 643 +++-
 include/linux/netdevice.h   |   1 +
 include/net/xdp_sock_drv.h  |  10 +
 include/net/xsk_buff_pool.h |   1 +
 net/xdp/xsk_buff_pool.c |  10 +-
 5 files changed, 597 insertions(+), 68 deletions(-)

--
1.8.3.1



[PATCH bpf-next] xsk: build skb by page

2020-12-29 Thread Xuan Zhuo
This patch is used to construct skb based on page to save memory copy
overhead.

This takes into account the problem of unaligned addr, and the
possibility of a frame size greater than a page in the future.

The test environment is Aliyun ECS server.
Test cmd:
```
xdpsock -i eth0 -t  -S -s 
```

Test result data:

size        64       512      1024      1500
copy     1916747   1775988   1600203   1440054
page     1974058   1953655   1945463   1904478
percent     3.0%     10.0%    21.58%     32.3%

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c | 68 ---
 1 file changed, 51 insertions(+), 17 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ac4a317..7cab40f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -430,6 +430,55 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_bypage(struct xdp_sock *xs, struct 
xdp_desc *desc)
+{
+   char *buffer;
+   u64 addr;
+   u32 len, offset, copy, copied;
+   int err, i;
+   struct page *page;
+   struct sk_buff *skb;
+
+   skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
+   if (unlikely(!skb))
+   return NULL;
+
+   addr = desc->addr;
+   len = desc->len;
+
+   buffer = xsk_buff_raw_get_data(xs->pool, addr);
+   offset = offset_in_page(buffer);
+   addr = buffer - (char *)xs->pool->addrs;
+
+   for (copied = 0, i = 0; copied < len; ++i) {
+   page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
+
+   get_page(page);
+
+   copy = min((u32)(PAGE_SIZE - offset), len - copied);
+
+   skb_fill_page_desc(skb, i, page, offset, copy);
+
+   copied += copy;
+   addr += copy;
+   offset = 0;
+   }
+
+   skb->len += len;
+   skb->data_len += len;
+   skb->truesize += len;
+
+   refcount_add(len, &xs->sk.sk_wmem_alloc);
+
+   skb->dev = xs->dev;
+   skb->priority = xs->sk.sk_priority;
+   skb->mark = xs->sk.sk_mark;
+   skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
+   skb->destructor = xsk_destruct_skb;
+
+   return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
struct xdp_sock *xs = xdp_sk(sk);
@@ -445,40 +494,25 @@ static int xsk_generic_xmit(struct sock *sk)
goto out;
 
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-   char *buffer;
-   u64 addr;
-   u32 len;
-
if (max_batch-- == 0) {
err = -EAGAIN;
goto out;
}
 
-   len = desc.len;
-   skb = sock_alloc_send_skb(sk, len, 1, &err);
+   skb = xsk_build_skb_bypage(xs, &desc);
if (unlikely(!skb))
goto out;
 
-   skb_put(skb, len);
-   addr = desc.addr;
-   buffer = xsk_buff_raw_get_data(xs->pool, addr);
-   err = skb_store_bits(skb, 0, buffer, len);
/* This is the backpressure mechanism for the Tx path.
 * Reserve space in the completion queue and only proceed
 * if there is space in it. This avoids having to implement
 * any buffering in the Tx path.
 */
-   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+   if (xskq_prod_reserve(xs->pool->cq)) {
kfree_skb(skb);
goto out;
}
 
-   skb->dev = xs->dev;
-   skb->priority = sk->sk_priority;
-   skb->mark = sk->sk_mark;
-   skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
-   skb->destructor = xsk_destruct_skb;
-
err = __dev_direct_xmit(skb, xs->queue_id);
if  (err == NETDEV_TX_BUSY) {
/* Tell user-space to retry the send */
-- 
1.8.3.1



[PATCH bpf-next] xsk: build skb by page

2020-12-23 Thread Xuan Zhuo
This patch is used to construct skb based on page to save memory copy
overhead.

This takes into account the problem of unaligned addr, and the
possibility of a frame size greater than a page in the future.

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c | 68 ---
 1 file changed, 51 insertions(+), 17 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ac4a317..7cab40f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -430,6 +430,55 @@ static void xsk_destruct_skb(struct sk_buff *skb)
sock_wfree(skb);
 }
 
+static struct sk_buff *xsk_build_skb_bypage(struct xdp_sock *xs, struct 
xdp_desc *desc)
+{
+   char *buffer;
+   u64 addr;
+   u32 len, offset, copy, copied;
+   int err, i;
+   struct page *page;
+   struct sk_buff *skb;
+
+   skb = sock_alloc_send_skb(&xs->sk, 0, 1, &err);
+   if (unlikely(!skb))
+   return NULL;
+
+   addr = desc->addr;
+   len = desc->len;
+
+   buffer = xsk_buff_raw_get_data(xs->pool, addr);
+   offset = offset_in_page(buffer);
+   addr = buffer - (char *)xs->pool->addrs;
+
+   for (copied = 0, i = 0; copied < len; ++i) {
+   page = xs->pool->umem->pgs[addr >> PAGE_SHIFT];
+
+   get_page(page);
+
+   copy = min((u32)(PAGE_SIZE - offset), len - copied);
+
+   skb_fill_page_desc(skb, i, page, offset, copy);
+
+   copied += copy;
+   addr += copy;
+   offset = 0;
+   }
+
+   skb->len += len;
+   skb->data_len += len;
+   skb->truesize += len;
+
+   refcount_add(len, &xs->sk.sk_wmem_alloc);
+
+   skb->dev = xs->dev;
+   skb->priority = xs->sk.sk_priority;
+   skb->mark = xs->sk.sk_mark;
+   skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
+   skb->destructor = xsk_destruct_skb;
+
+   return skb;
+}
+
 static int xsk_generic_xmit(struct sock *sk)
 {
struct xdp_sock *xs = xdp_sk(sk);
@@ -445,40 +494,25 @@ static int xsk_generic_xmit(struct sock *sk)
goto out;
 
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
-   char *buffer;
-   u64 addr;
-   u32 len;
-
if (max_batch-- == 0) {
err = -EAGAIN;
goto out;
}
 
-   len = desc.len;
-   skb = sock_alloc_send_skb(sk, len, 1, &err);
+   skb = xsk_build_skb_bypage(xs, &desc);
if (unlikely(!skb))
goto out;
 
-   skb_put(skb, len);
-   addr = desc.addr;
-   buffer = xsk_buff_raw_get_data(xs->pool, addr);
-   err = skb_store_bits(skb, 0, buffer, len);
/* This is the backpressure mechanism for the Tx path.
 * Reserve space in the completion queue and only proceed
 * if there is space in it. This avoids having to implement
 * any buffering in the Tx path.
 */
-   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+   if (xskq_prod_reserve(xs->pool->cq)) {
kfree_skb(skb);
goto out;
}
 
-   skb->dev = xs->dev;
-   skb->priority = sk->sk_priority;
-   skb->mark = sk->sk_mark;
-   skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
-   skb->destructor = xsk_destruct_skb;
-
err = __dev_direct_xmit(skb, xs->queue_id);
if  (err == NETDEV_TX_BUSY) {
/* Tell user-space to retry the send */
-- 
1.8.3.1



[PATCH bpf-next] xsk: save the undone skb

2020-12-11 Thread Xuan Zhuo
We can save the undone skb when sending fails with NETDEV_TX_BUSY or when
xskq_prod_reserve fails. As long as the skb has been successfully generated
and configured, we can keep it around when we encounter such exceptions
later.

In particular, when NETDEV_TX_BUSY fails, there is no longer a need to deal
with the problem that xskq_prod_reserve has already been updated.

Signed-off-by: Xuan Zhuo 
---
 include/net/xdp_sock.h |  3 +++
 net/xdp/xsk.c  | 36 +++-
 2 files changed, 30 insertions(+), 9 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 4f4e93b..fead0c9 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -76,6 +76,9 @@ struct xdp_sock {
struct mutex mutex;
struct xsk_queue *fq_tmp; /* Only as tmp storage before bind */
struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
+
+   struct sk_buff *skb_undone;
+   bool skb_undone_reserve;
 };
 
 #ifdef CONFIG_XDP_SOCKETS
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index e28c682..1051024 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -435,6 +435,19 @@ static int xsk_generic_xmit(struct sock *sk)
if (xs->queue_id >= xs->dev->real_num_tx_queues)
goto out;
 
+   if (xs->skb_undone) {
+   if (xs->skb_undone_reserve) {
+   if (xskq_prod_reserve(xs->pool->cq))
+   goto out;
+
+   xs->skb_undone_reserve = false;
+   }
+
+   skb = xs->skb_undone;
+   xs->skb_undone = NULL;
+   goto xmit;
+   }
+
while (xskq_cons_peek_desc(xs->tx, &desc, xs->pool)) {
char *buffer;
u64 addr;
@@ -454,12 +467,7 @@ static int xsk_generic_xmit(struct sock *sk)
addr = desc.addr;
buffer = xsk_buff_raw_get_data(xs->pool, addr);
err = skb_store_bits(skb, 0, buffer, len);
-   /* This is the backpressure mechanism for the Tx path.
-* Reserve space in the completion queue and only proceed
-* if there is space in it. This avoids having to implement
-* any buffering in the Tx path.
-*/
-   if (unlikely(err) || xskq_prod_reserve(xs->pool->cq)) {
+   if (unlikely(err)) {
kfree_skb(skb);
goto out;
}
@@ -470,12 +478,22 @@ static int xsk_generic_xmit(struct sock *sk)
skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
skb->destructor = xsk_destruct_skb;
 
+   /* This is the backpressure mechanism for the Tx path.
+* Reserve space in the completion queue and only proceed
+* if there is space in it. This avoids having to implement
+* any buffering in the Tx path.
+*/
+   if (xskq_prod_reserve(xs->pool->cq)) {
+   xs->skb_undone_reserve = true;
+   xs->skb_undone = skb;
+   goto out;
+   }
+
+xmit:
err = __dev_direct_xmit(skb, xs->queue_id);
if  (err == NETDEV_TX_BUSY) {
/* Tell user-space to retry the send */
-   skb->destructor = sock_wfree;
-   /* Free skb without triggering the perf drop trace */
-   consume_skb(skb);
+   xs->skb_undone = skb;
err = -EAGAIN;
goto out;
}
-- 
1.8.3.1



[PATCH bpf V3 2/2] xsk: change the tx writeable condition

2020-12-01 Thread Xuan Zhuo
Modify the tx writeable condition from "the queue is not full" to "the
number of entries outstanding in the tx queue is less than half of the
total number of entries". Because "tx queue not full" holds only for a very
short time, it causes a large number of EPOLLOUT events and a large number
of process wakeups.

Signed-off-by: Xuan Zhuo 
Acked-by: Magnus Karlsson 
---
 net/xdp/xsk.c   | 16 +---
 net/xdp/xsk_queue.h |  6 ++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 9bbfd8a..6250447 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp, u32 len,
return 0;
 }
 
+static bool xsk_tx_writeable(struct xdp_sock *xs)
+{
+   if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
+   return false;
+
+   return true;
+}
+
 static bool xsk_is_bound(struct xdp_sock *xs)
 {
if (READ_ONCE(xs->state) == XSK_BOUND) {
@@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
rcu_read_lock();
list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
__xskq_cons_release(xs->tx);
-   xs->sk.sk_write_space(&xs->sk);
+   if (xsk_tx_writeable(xs))
+   xs->sk.sk_write_space(&xs->sk);
}
rcu_read_unlock();
 }
@@ -436,7 +445,8 @@ static int xsk_generic_xmit(struct sock *sk)
 
 out:
if (sent_frame)
-   sk->sk_write_space(sk);
+   if (xsk_tx_writeable(xs))
+   sk->sk_write_space(sk);
 
mutex_unlock(&xs->mutex);
return err;
@@ -493,7 +503,7 @@ static __poll_t xsk_poll(struct file *file, struct socket 
*sock,
 
if (xs->rx && !xskq_prod_is_empty(xs->rx))
mask |= EPOLLIN | EPOLLRDNORM;
-   if (xs->tx && !xskq_cons_is_full(xs->tx))
+   if (xs->tx && xsk_tx_writeable(xs))
mask |= EPOLLOUT | EPOLLWRNORM;
 
return mask;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index cdb9cf3..9e71b9f 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -264,6 +264,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
q->nentries;
 }
 
+static inline u32 xskq_cons_present_entries(struct xsk_queue *q)
+{
+   /* No barriers needed since data is not accessed */
+   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
+}
+
 /* Functions for producers */
 
 static inline bool xskq_prod_is_full(struct xsk_queue *q)
-- 
1.8.3.1



[PATCH bpf V3 1/2] xsk: replace datagram_poll by sock_poll_wait

2020-12-01 Thread Xuan Zhuo
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT)
based on the traditional socket information (eg: sk_wmem_alloc), but
this does not apply to xsk. So this patch uses sock_poll_wait instead of
datagram_poll, and the mask is calculated by xsk_poll.

Signed-off-by: Xuan Zhuo 
Acked-by: Magnus Karlsson 
---
 net/xdp/xsk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b7b039b..9bbfd8a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -471,11 +471,13 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr 
*m, size_t total_len)
 static __poll_t xsk_poll(struct file *file, struct socket *sock,
 struct poll_table_struct *wait)
 {
-   __poll_t mask = datagram_poll(file, sock, wait);
+   __poll_t mask = 0;
struct sock *sk = sock->sk;
struct xdp_sock *xs = xdp_sk(sk);
struct xsk_buff_pool *pool;
 
+   sock_poll_wait(file, sock, wait);
+
if (unlikely(!xsk_is_bound(xs)))
return mask;
 
-- 
1.8.3.1



[PATCH bpf V3 0/2] xsk: fix for xsk_poll writeable

2020-12-01 Thread Xuan Zhuo


V2:
   #2 patch made some changes following magnus' opinions.

V3:
   Regarding the function xskq_cons_present_entries, I think Daniel is right;
   I have modified it.

Xuan Zhuo (2):
  xsk: replace datagram_poll by sock_poll_wait
  xsk: change the tx writeable condition

 net/xdp/xsk.c   | 20 
 net/xdp/xsk_queue.h |  6 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

--
1.8.3.1



[PATCH bpf v2 0/2] xsk: fix for xsk_poll writeable

2020-11-24 Thread Xuan Zhuo
Thanks very much to Magnus.

V2:
   #2 patch made some changes following magnus' opinions.

Xuan Zhuo (2):
  xsk: replace datagram_poll by sock_poll_wait
  xsk: change the tx writeable condition

 net/xdp/xsk.c   | 20 
 net/xdp/xsk_queue.h |  6 ++
 2 files changed, 22 insertions(+), 4 deletions(-)

--
1.8.3.1



[PATCH bpf v2 2/2] xsk: change the tx writeable condition

2020-11-24 Thread Xuan Zhuo
Modify the tx writeable condition from "the queue is not full" to "the
number of entries outstanding in the tx queue is less than half of the
total number of entries". Because "tx queue not full" holds only for a very
short time, it causes a large number of EPOLLOUT events and a large number
of process wakeups.

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c   | 16 +---
 net/xdp/xsk_queue.h |  6 ++
 2 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0df8651..22e35e9 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -211,6 +211,14 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp, u32 len,
return 0;
 }
 
+static bool xsk_tx_writeable(struct xdp_sock *xs)
+{
+   if (xskq_cons_present_entries(xs->tx) > xs->tx->nentries / 2)
+   return false;
+
+   return true;
+}
+
 static bool xsk_is_bound(struct xdp_sock *xs)
 {
if (READ_ONCE(xs->state) == XSK_BOUND) {
@@ -296,7 +304,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
rcu_read_lock();
list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
__xskq_cons_release(xs->tx);
-   xs->sk.sk_write_space(&xs->sk);
+   if (xsk_tx_writeable(xs))
+   xs->sk.sk_write_space(&xs->sk);
}
rcu_read_unlock();
 }
@@ -499,7 +508,8 @@ static int xsk_generic_xmit(struct sock *sk)
 
 out:
if (sent_frame)
-   sk->sk_write_space(sk);
+   if (xsk_tx_writeable(xs))
+   sk->sk_write_space(sk);
 
mutex_unlock(&xs->mutex);
return err;
@@ -556,7 +566,7 @@ static __poll_t xsk_poll(struct file *file, struct socket 
*sock,
 
if (xs->rx && !xskq_prod_is_empty(xs->rx))
mask |= EPOLLIN | EPOLLRDNORM;
-   if (xs->tx && !xskq_cons_is_full(xs->tx))
+   if (xs->tx && xsk_tx_writeable(xs))
mask |= EPOLLOUT | EPOLLWRNORM;
 
return mask;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index b936c46..b655004 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -307,6 +307,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
q->nentries;
 }
 
+static inline __u64 xskq_cons_present_entries(struct xsk_queue *q)
+{
+   /* No barriers needed since data is not accessed */
+   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
+}
+
 /* Functions for producers */
 
 static inline u32 xskq_prod_nb_free(struct xsk_queue *q, u32 max)
-- 
1.8.3.1



[PATCH bpf v2 1/2] xsk: replace datagram_poll by sock_poll_wait

2020-11-24 Thread Xuan Zhuo
datagram_poll will judge the current socket status (EPOLLIN, EPOLLOUT)
based on the traditional socket information (eg: sk_wmem_alloc), but
this does not apply to xsk. So this patch uses sock_poll_wait instead of
datagram_poll, and the mask is calculated by xsk_poll.

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index b014197..0df8651 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -534,11 +534,13 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr 
*m, size_t total_len)
 static __poll_t xsk_poll(struct file *file, struct socket *sock,
 struct poll_table_struct *wait)
 {
-   __poll_t mask = datagram_poll(file, sock, wait);
+   __poll_t mask = 0;
struct sock *sk = sock->sk;
struct xdp_sock *xs = xdp_sk(sk);
struct xsk_buff_pool *pool;
 
+   sock_poll_wait(file, sock, wait);
+
if (unlikely(!xsk_is_bound(xs)))
return mask;
 
-- 
1.8.3.1



[PATCH 2/3] xsk: change the tx writeable condition

2020-11-18 Thread Xuan Zhuo
Modify the tx writeable condition from "the queue is not full" to "the
number of entries remaining in the tx queue is less than half of the
total number of entries". Because "tx queue not full" holds only for a very
short time, it causes a large number of EPOLLOUT events and a large number
of process wakeups.

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c   | 20 +---
 net/xdp/xsk_queue.h |  6 ++
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7f0353e..bc3d4ece 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -211,6 +211,17 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff 
*xdp, u32 len,
return 0;
 }
 
+static bool xsk_writeable(struct xdp_sock *xs)
+{
+   if (!xs->tx)
+   return false;
+
+   if (xskq_cons_left(xs->tx) > xs->tx->nentries / 2)
+   return false;
+
+   return true;
+}
+
 static bool xsk_is_bound(struct xdp_sock *xs)
 {
if (READ_ONCE(xs->state) == XSK_BOUND) {
@@ -296,7 +307,8 @@ void xsk_tx_release(struct xsk_buff_pool *pool)
rcu_read_lock();
list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
__xskq_cons_release(xs->tx);
-   xs->sk.sk_write_space(&xs->sk);
+   if (xsk_writeable(xs))
+   xs->sk.sk_write_space(&xs->sk);
}
rcu_read_unlock();
 }
@@ -442,7 +454,8 @@ static int xsk_generic_xmit(struct sock *sk)
 
 out:
if (sent_frame)
-   sk->sk_write_space(sk);
+   if (xsk_writeable(xs))
+   sk->sk_write_space(sk);
 
mutex_unlock(&xs->mutex);
return err;
@@ -499,7 +512,8 @@ static __poll_t xsk_poll(struct file *file, struct socket 
*sock,
 
if (xs->rx && !xskq_prod_is_empty(xs->rx))
mask |= EPOLLIN | EPOLLRDNORM;
-   if (xs->tx && !xskq_cons_is_full(xs->tx))
+
+   if (xsk_writeable(xs))
mask |= EPOLLOUT | EPOLLWRNORM;
 
return mask;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index cdb9cf3..82a5228 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -264,6 +264,12 @@ static inline bool xskq_cons_is_full(struct xsk_queue *q)
q->nentries;
 }
 
+static inline __u64 xskq_cons_left(struct xsk_queue *q)
+{
+   /* No barriers needed since data is not accessed */
+   return READ_ONCE(q->ring->producer) - READ_ONCE(q->ring->consumer);
+}
+
 /* Functions for producers */
 
 static inline bool xskq_prod_is_full(struct xsk_queue *q)
-- 
1.8.3.1



[PATCH 3/3] xsk: set tx/rx the min entries

2020-11-18 Thread Xuan Zhuo
We expect the number of tx entries to be more than twice the number of packets
the network card can send at a time, so that once fewer than half of the tx
entries remain outstanding, it is guaranteed that the cq already contains
recycled items that can be reused.

At the same time, a minimum rx size ensures that rx does not drop packets
because the ring cannot accept all the packets delivered by the network card
at once.

Of course, 1024 is only an estimate; the number of packets each network card
sends at a time may differ.
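
(Userspace sketch, assumed setup code rather than part of the patch: ring
sizes below XDP_RXTX_RING_MIN_ENTRIES now fail with -EINVAL, so request at
least 1024 entries, still a power of two as before.)

#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283
#endif

static int set_ring_sizes(int xsk_fd)
{
	int entries = 2048;	/* >= XDP_RXTX_RING_MIN_ENTRIES (1024) */

	/* a value such as 512 would now be rejected with EINVAL */
	if (setsockopt(xsk_fd, SOL_XDP, XDP_TX_RING, &entries, sizeof(entries)))
		return -1;
	return setsockopt(xsk_fd, SOL_XDP, XDP_RX_RING, &entries, sizeof(entries));
}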

Signed-off-by: Xuan Zhuo 
---
 include/uapi/linux/if_xdp.h | 2 ++
 net/xdp/xsk.c   | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a809..d55ba79 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -64,6 +64,8 @@ struct xdp_mmap_offsets {
 #define XDP_STATISTICS 7
 #define XDP_OPTIONS8
 
+#define XDP_RXTX_RING_MIN_ENTRIES   1024
+
 struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
__u64 len; /* Length of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index bc3d4ece..e62c795 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -831,6 +831,8 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
return -EINVAL;
if (copy_from_sockptr(&entries, optval, sizeof(entries)))
return -EFAULT;
+   if (entries < XDP_RXTX_RING_MIN_ENTRIES)
+   return -EINVAL;
 
mutex_lock(&xs->mutex);
if (xs->state != XSK_READY) {
-- 
1.8.3.1



[PATCH 1/3] xsk: replace datagram_poll by sock_poll_wait

2020-11-18 Thread Xuan Zhuo
datagram_poll() judges the current socket status (EPOLLIN, EPOLLOUT) based on
traditional socket state such as sk_wmem_alloc, which does not apply to xsk.
So this patch uses sock_poll_wait() instead of datagram_poll(), and lets
xsk_poll() calculate the mask itself.

Signed-off-by: Xuan Zhuo 
---
 net/xdp/xsk.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec39..7f0353e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -477,11 +477,13 @@ static int xsk_sendmsg(struct socket *sock, struct msghdr *m, size_t total_len)
 static __poll_t xsk_poll(struct file *file, struct socket *sock,
 struct poll_table_struct *wait)
 {
-   __poll_t mask = datagram_poll(file, sock, wait);
+   __poll_t mask = 0;
struct sock *sk = sock->sk;
struct xdp_sock *xs = xdp_sk(sk);
struct xsk_buff_pool *pool;
 
+   sock_poll_wait(file, sock, wait);
+
if (unlikely(!xsk_is_bound(xs)))
return mask;
 
-- 
1.8.3.1



[PATCH 0/3] xsk: fix for xsk_poll writeable

2020-11-18 Thread Xuan Zhuo
I tried to combine "cq has entries available" and "tx is writeable" into one
condition, but found it very difficult. Sometimes we care about the
availability of both, but sometimes only about one, such as tx writeable,
because items from the fq can be used to write to tx. This requirement can
change constantly, and it might have to be reconfigured before every call to
xsk_poll, so setsockopt is not very convenient. I increasingly feel that a new
event may be a better solution, for example EPOLLPRI; I think it could be used
here, since after all xsk should not have OOB data ^_^.

However, two other problems were discovered during testing:

* The mask returned by datagram_poll always contains EPOLLOUT
* Returning EPOLLOUT merely because tx is not full is not particularly reasonable

After fixing these two problems, I found that whenever the process is woken by
EPOLLOUT, it can always get an item from the cq.

The number of packets the network card can send at a time is limited; call
this value "nic_num". Once the number of consumed items in the tx queue
exceeds nic_num, there must also be newly recycled items in the cq from the
nic.

In this way, as long as the user configures a large enough tx ring, we never
hit the situation where tx is already writeable but no item can be obtained
from the cq.
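
(Hypothetical application loop for illustration; reclaim_completions() and
fill_tx_ring() are placeholder helpers for the application's own ring
accessors. Once poll() reports POLLOUT under the new condition, the driver has
consumed at least half the tx ring, so with tx larger than twice nic_num the
cq is guaranteed to hold recycled frames.)

#include <poll.h>
#include <sys/socket.h>

void reclaim_completions(void);	/* drain recycled frames from the cq */
void fill_tx_ring(void);	/* produce new descriptors into tx */

static void tx_loop(int xsk_fd)
{
	struct pollfd pfd = { .fd = xsk_fd, .events = POLLOUT };

	for (;;) {
		if (poll(&pfd, 1, -1) <= 0)
			continue;

		reclaim_completions();
		fill_tx_ring();
		/* kick the kernel so it starts sending the new descriptors */
		sendto(xsk_fd, NULL, 0, MSG_DONTWAIT, NULL, 0);
	}
}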

Xuan Zhuo (3):
  xsk: replace datagram_poll by sock_poll_wait
  xsk: change the tx writeable condition
  xsk: set tx/rx the min entries

 include/uapi/linux/if_xdp.h |  2 ++
 net/xdp/xsk.c   | 26 ++
 net/xdp/xsk_queue.h |  6 ++
 3 files changed, 30 insertions(+), 4 deletions(-)

--
1.8.3.1



[PATCH] xsk: add cq event

2020-11-16 Thread Xuan Zhuo
When we have written all available cq items to tx, we have to wait for a new
poll event indicating that the socket is writable. But writability is
currently triggered by tx not being full, and in fact, even when tx is not
full, the user may not be able to get a cq item, because the buffer may still
be held by the network card. In this case we need to know when the cq becomes
available, so this patch adds a socket option: when the user enables it via
setsockopt, a readable event is generated for every xsk bound to this umem
whenever the cq has entries available.

I can't find a better description for this event; I think 'readable' also
fits, even though it differs from the 'readable' of new data. In any case,
the overhead of xsk checking whether cq or rx is readable is small.
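
(Userspace sketch of how the new option would be enabled, assuming the
XDP_CQ_EVENT definition from this patch is visible to the application;
afterwards completion-ring activity is reported as a readable event on every
socket bound to the umem.)

#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef SOL_XDP
#define SOL_XDP 283
#endif

static int enable_cq_event(int xsk_fd)
{
	int one = 1;

	/* XDP_CQ_EVENT is the socket option added by this patch */
	return setsockopt(xsk_fd, SOL_XDP, XDP_CQ_EVENT, &one, sizeof(one));
}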

Signed-off-by: Xuan Zhuo 
---
 include/net/xdp_sock.h  |  1 +
 include/uapi/linux/if_xdp.h |  1 +
 net/xdp/xsk.c   | 28 
 3 files changed, 30 insertions(+)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 1a9559c..faf5b1a 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -49,6 +49,7 @@ struct xdp_sock {
struct xsk_buff_pool *pool;
u16 queue_id;
bool zc;
+   bool cq_event;
enum {
XSK_READY = 0,
XSK_BOUND,
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a809..2dba3cb 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING   6
 #define XDP_STATISTICS 7
 #define XDP_OPTIONS8
+#define XDP_CQ_EVENT   9
 
 struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index cfbec39..0c53403 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -285,7 +285,16 @@ void __xsk_map_flush(void)
 
 void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries)
 {
+   struct xdp_sock *xs;
+
xskq_prod_submit_n(pool->cq, nb_entries);
+
+   rcu_read_lock();
+   list_for_each_entry_rcu(xs, &pool->xsk_tx_list, tx_list) {
+   if (xs->cq_event)
+   sock_def_readable(&xs->sk);
+   }
+   rcu_read_unlock();
 }
 EXPORT_SYMBOL(xsk_tx_completed);
 
@@ -495,6 +504,9 @@ static __poll_t xsk_poll(struct file *file, struct socket *sock,
__xsk_sendmsg(sk);
}
 
+   if (xs->cq_event && pool->cq && !xskq_prod_is_empty(pool->cq))
+   mask |= EPOLLIN | EPOLLRDNORM;
+
if (xs->rx && !xskq_prod_is_empty(xs->rx))
mask |= EPOLLIN | EPOLLRDNORM;
if (xs->tx && !xskq_cons_is_full(xs->tx))
@@ -882,6 +894,22 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
mutex_unlock(&xs->mutex);
return err;
}
+   case XDP_CQ_EVENT:
+   {
+   int cq_event;
+
+   if (optlen < sizeof(cq_event))
+   return -EINVAL;
+   if (copy_from_sockptr(&cq_event, optval, sizeof(cq_event)))
+   return -EFAULT;
+
+   if (cq_event)
+   xs->cq_event = true;
+   else
+   xs->cq_event = false;
+
+   return 0;
+   }
default:
break;
}
-- 
1.8.3.1