Re: [PATCH net-next 03/13] virtio_ring: packed: harden dma unmap for indirect

2024-09-12 Thread Michael S. Tsirkin
On Thu, Sep 12, 2024 at 02:55:38PM +0800, Xuan Zhuo wrote:
> On Wed, 11 Sep 2024 07:28:36 -0400, "Michael S. Tsirkin"  
> wrote:
> > As gcc luckily noted:
> >
> > On Tue, Aug 20, 2024 at 03:33:20PM +0800, Xuan Zhuo wrote:
> > > @@ -1617,23 +1617,24 @@ static void detach_buf_packed(struct 
> > > vring_virtqueue *vq,
> > >   }
> > >
> > >   if (vq->indirect) {
> > > + struct vring_desc_extra *extra;
> > >   u32 len;
> > >
> > >   /* Free the indirect table, if any, now that it's unmapped. */
> > > - desc = state->indir_desc;
> > > - if (!desc)
> >
> > desc is no longer initialized here
> 
> 
> Will fix.
> 
> 
> >
> > > + extra = state->indir;
> > > + if (!extra)
> > >   return;
> > >
> > >   if (vring_need_unmap_buffer(vq)) {
> > >   len = vq->packed.desc_extra[id].len;
> > >   for (i = 0; i < len / sizeof(struct vring_packed_desc);
> > >   i++)
> > > - vring_unmap_desc_packed(vq, &desc[i]);
> > > + vring_unmap_extra_packed(vq, &extra[i]);
> > >   }
> > >   kfree(desc);
> >
> >
> > but freed here
> >
> > > - state->indir_desc = NULL;
> > > + state->indir = NULL;
> > >   } else if (ctx) {
> > > - *ctx = state->indir_desc;
> > > + *ctx = state->indir;
> > >   }
> > >  }
> >
> >
> > It seems unlikely this was always 0 on all paths with even
> > a small amount of stress, so now I question how this was tested.
> > Besides, do not ignore compiler warnings, and do not tweak code
> > to just make compiler shut up - they are your friend.
> 
> I agree.
> 
> Normally I do this with make W=12, but there are too many messages,
> so I missed this.
> 
>   make W=12 drivers/net/virtio_net.o drivers/virtio/virtio_ring.o
> 
> Without W=12, I did not get any warning message.
> How do you get the warning quickly?
> 
> Thanks.


If you stress test this for a long enough time, and with
debug enabled, you will see a crash.


> >
> > >
> > > --
> > > 2.32.0.3.g01195cf9f
> >




Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-09-11 Thread Michael S. Tsirkin
Thanks a lot!
Could you retest Xuan Zhuo's original patch just to make sure it does
not fix the issue?

On Wed, Sep 11, 2024 at 03:18:55PM +0100, Darren Kenny wrote:
> For the record, I got a chance to test these changes and confirmed that
> they resolved the issue for me when applied on 6.11-rc7.
> 
> Tested-by: Darren Kenny 
> 
> Thanks,
> 
> Darren.
> 
> PS - I'll try to get to looking at the other potential fix when I have time.
> 
> On Tuesday, 2024-09-10 at 08:12:06 -04, Michael S. Tsirkin wrote:
> > On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> >> Regression: 
> >> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> >> 
> >> I still think that the patch can fix the problem, I hope Darren can 
> >> re-test it
> >> or give me more info.
> >> 
> >> 
> >> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> >> 
> >> If that can not work or Darren can not reply in time, Michael you can try 
> >> this
> >> patch set.
> >
> > Just making sure netdev maintainers see this, this patch is for net.
> >
> >> Thanks.
> >> 
> >> Xuan Zhuo (3):
> >>   Revert "virtio_net: rx remove premapped failover code"
> >>   Revert "virtio_net: big mode skip the unmap check"
> >>   virtio_net: disable premapped mode by default
> >> 
> >>  drivers/net/virtio_net.c | 95 +++-
> >>  1 file changed, 46 insertions(+), 49 deletions(-)
> >> 
> >> --
> >> 2.32.0.3.g01195cf9f




Re: [PATCH net-next 03/13] virtio_ring: packed: harden dma unmap for indirect

2024-09-11 Thread Michael S. Tsirkin
As gcc luckily noted:

On Tue, Aug 20, 2024 at 03:33:20PM +0800, Xuan Zhuo wrote:
> @@ -1617,23 +1617,24 @@ static void detach_buf_packed(struct vring_virtqueue 
> *vq,
>   }
>  
>   if (vq->indirect) {
> + struct vring_desc_extra *extra;
>   u32 len;
>  
>   /* Free the indirect table, if any, now that it's unmapped. */
> - desc = state->indir_desc;
> - if (!desc)

desc is no longer initialized here

> + extra = state->indir;
> + if (!extra)
>   return;
>  
>   if (vring_need_unmap_buffer(vq)) {
>   len = vq->packed.desc_extra[id].len;
>   for (i = 0; i < len / sizeof(struct vring_packed_desc);
>   i++)
> - vring_unmap_desc_packed(vq, &desc[i]);
> + vring_unmap_extra_packed(vq, &extra[i]);
>   }
>   kfree(desc);


but freed here

> - state->indir_desc = NULL;
> + state->indir = NULL;
>   } else if (ctx) {
> - *ctx = state->indir_desc;
> + *ctx = state->indir;
>   }
>  }


It seems unlikely this was always 0 on all paths with even
a small amount of stress, so now I question how this was tested.
Besides, do not ignore compiler warnings, and do not tweak code
to just make compiler shut up - they are your friend.

>  
> -- 
> 2.32.0.3.g01195cf9f
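
For reference, a minimal sketch of how this cleanup path would presumably need
to look once desc is gone, with kfree() applied to the pointer that is actually
looked up (this just follows the hunk quoted above; it is not the posted patch):

	if (vq->indirect) {
		struct vring_desc_extra *extra;
		u32 len;

		/* Free the indirect table, if any, now that it's unmapped. */
		extra = state->indir;
		if (!extra)
			return;

		if (vring_need_unmap_buffer(vq)) {
			len = vq->packed.desc_extra[id].len;
			for (i = 0; i < len / sizeof(struct vring_packed_desc); i++)
				vring_unmap_extra_packed(vq, &extra[i]);
		}

		kfree(extra);	/* was kfree(desc); desc no longer exists here */
		state->indir = NULL;
	} else if (ctx) {
		*ctx = state->indir;
	}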




Re: [PATCH net-next 02/13] virtio_ring: split: harden dma unmap for indirect

2024-09-11 Thread Michael S. Tsirkin
On Wed, Sep 11, 2024 at 11:46:30AM +0800, Jason Wang wrote:
> On Tue, Aug 20, 2024 at 3:33 PM Xuan Zhuo  wrote:
> >
> > 1. this commit hardens dma unmap for indirect
> 
> I think we need to explain why we need such hardening.


Yes, please be more specific. Recording the same state in two
places is just a source of bugs, not hardening.




Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-09-10 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> Regression: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> 
> I still think that the patch can fix the problem, I hope Darren can re-test it
> or give me more info.
> 
> 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> If that can not work or Darren can not reply in time, Michael you can try this
> patch set.

Just making sure netdev maintainers see this, this patch is for net.

> Thanks.
> 
> Xuan Zhuo (3):
>   Revert "virtio_net: rx remove premapped failover code"
>   Revert "virtio_net: big mode skip the unmap check"
>   virtio_net: disable premapped mode by default
> 
>  drivers/net/virtio_net.c | 95 +++-
>  1 file changed, 46 insertions(+), 49 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-08 Thread Michael S. Tsirkin
On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> leads to regression on VM with the sysctl value of:
> 
> - net.core.high_order_alloc_disable=1
> 
> which could see reliable crashes or scp failure (scp a file 100M in size
> to VM):
> 
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur. In this case, if an overflow is possible, I adjust
> the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> the first buffer of the frag is affected.
> 
> Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> use_dma_api")
> Reported-by: "Si-Wei Liu" 
> Closes: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> Signed-off-by: Xuan Zhuo 


BTW why isn't it needed if we revert f9dac92ba908?

> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c6af18948092..e5286a6da863 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
>   void *buf, *head;
>   dma_addr_t addr;
>  
> - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> - return NULL;
> -
>   head = page_address(alloc_frag->page);
>  
>   dma = head;
> @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>   len = SKB_DATA_ALIGN(len) +
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> + return -ENOMEM;
> +
>   buf = virtnet_rq_alloc(rq, len, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>*/
>   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>  
> + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> + return -ENOMEM;
> +
> + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> alloc_frag->size)
> + len -= sizeof(struct virtnet_rq_dma);
> +
>   buf = virtnet_rq_alloc(rq, len + room, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-09-08 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:
> Regression: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> 
> I still think that the patch can fix the problem, I hope Darren can re-test it
> or give me more info.
> 
> 
> http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com
> 
> If that can not work or Darren can not reply in time, Michael you can try this
> patch set.


Acked-by: Michael S. Tsirkin 
Tested-by: Takero Funaki 


> Thanks.
> 
> Xuan Zhuo (3):
>   Revert "virtio_net: rx remove premapped failover code"
>   Revert "virtio_net: big mode skip the unmap check"
>   virtio_net: disable premapped mode by default
> 
>  drivers/net/virtio_net.c | 95 +++-
>  1 file changed, 46 insertions(+), 49 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f
> 




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-08 Thread Michael S. Tsirkin
On Sat, Sep 07, 2024 at 12:16:24PM +0900, Takero Funaki wrote:
> 2024年9月6日(金) 18:55 Michael S. Tsirkin :
> >
> > On Fri, Sep 06, 2024 at 05:46:02PM +0800, Xuan Zhuo wrote:
> > > On Fri, 6 Sep 2024 05:44:27 -0400, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Fri, Sep 06, 2024 at 05:25:36PM +0800, Xuan Zhuo wrote:
> > > > > On Fri, 6 Sep 2024 05:08:56 -0400, "Michael S. Tsirkin" 
> > > > >  wrote:
> > > > > > On Fri, Sep 06, 2024 at 04:53:38PM +0800, Xuan Zhuo wrote:
> > > > > > > On Fri, 6 Sep 2024 04:43:29 -0400, "Michael S. Tsirkin" 
> > > > > > >  wrote:
> > > > > > > > On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> > > > > > > > > leads to regression on VM with the sysctl value of:
> > > > > > > > >
> > > > > > > > > - net.core.high_order_alloc_disable=1
> > > > > > > > >
> > > > > > > > > which could see reliable crashes or scp failure (scp a file 
> > > > > > > > > 100M in size
> > > > > > > > > to VM):
> > > > > > > > >
> > > > > > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the 
> > > > > > > > > beginning
> > > > > > > > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > > > > > > > everything is fine. However, if the frag is only one page and 
> > > > > > > > > the
> > > > > > > > > total size of the buffer and virtnet_rq_dma is larger than 
> > > > > > > > > one page, an
> > > > > > > > > overflow may occur. In this case, if an overflow is possible, 
> > > > > > > > > I adjust
> > > > > > > > > the buffer size. If net.core.high_order_alloc_disable=1, the 
> > > > > > > > > maximum
> > > > > > > > > buffer size is 4096 - 16. If 
> > > > > > > > > net.core.high_order_alloc_disable=0, only
> > > > > > > > > the first buffer of the frag is affected.
> > > > > > > > >
> > > > > > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode 
> > > > > > > > > whatever use_dma_api")
> > > > > > > > > Reported-by: "Si-Wei Liu" 
> > > > > > > > > Closes: 
> > > > > > > > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > >
> > > > > > > >
> > > > > > > > Guys where are we going with this? We have a crasher right now,
> > > > > > > > if this is not fixed ASAP I'd have to revert a ton of
> > > > > > > > work Xuan Zhuo just did.
> > > > > > >
> > > > > > > I think this patch can fix it and I tested it.
> > > > > > > But Darren said this patch did not work.
> > > > > > > I need more info about the crash that Darren encountered.
> > > > > > >
> > > > > > > Thanks.
> > > > > >
> > > > > > So what are we doing? Revert the whole pile for now?
> > > > > > Seems to be a bit of a pity, but maybe that's the best we can do
> > > > > > for this release.
> > > > >
> > > > > @Jason Could you review this?
> > > > >
> > > > > I think this problem is clear, though I do not know why it did not 
> > > > > work
> > > > > for Darren.
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > > > No regressions is a hard rule. If we can't figure out the regression
> > > > now, we should revert and you can try again for the next release.
> > >
> > > I see. I think I fixed it.
> > >
> > > Hope Darren can reply before you post the revert patches.
> > >
> > > Thanks.
> > >
> >
> > It's very rushed anyway. I posted the reverts, but as RFC for now.
> > You should post a debugging patch for Darren to help you figure
> > out what is going on.
> >
> >
> 
> Hello,
> 
> My issue [1], which bisected to the commit f9dac92ba908, was resolved
> after applying the patch on v6.11-rc6.
> [1] https://bugzilla.kernel.org/show_bug.cgi?id=219154
> 
> In my case, random crashes occur when receiving large data under heavy
> memory/IO load. Although the crash details differ, the memory
> corruption during data transfers is consistent.
> 
> If Darren is unable to confirm the fix, would it be possible to
> consider merging this patch to close [1] instead?
> 
> Thanks.


Could you also test
https://lore.kernel.org/all/cover.1725616135.git@redhat.com/

please?




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-06 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 05:46:02PM +0800, Xuan Zhuo wrote:
> On Fri, 6 Sep 2024 05:44:27 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Fri, Sep 06, 2024 at 05:25:36PM +0800, Xuan Zhuo wrote:
> > > On Fri, 6 Sep 2024 05:08:56 -0400, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Fri, Sep 06, 2024 at 04:53:38PM +0800, Xuan Zhuo wrote:
> > > > > On Fri, 6 Sep 2024 04:43:29 -0400, "Michael S. Tsirkin" 
> > > > >  wrote:
> > > > > > On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> > > > > > > leads to regression on VM with the sysctl value of:
> > > > > > >
> > > > > > > - net.core.high_order_alloc_disable=1
> > > > > > >
> > > > > > > which could see reliable crashes or scp failure (scp a file 100M 
> > > > > > > in size
> > > > > > > to VM):
> > > > > > >
> > > > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the 
> > > > > > > beginning
> > > > > > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > > > > > everything is fine. However, if the frag is only one page and the
> > > > > > > total size of the buffer and virtnet_rq_dma is larger than one 
> > > > > > > page, an
> > > > > > > overflow may occur. In this case, if an overflow is possible, I 
> > > > > > > adjust
> > > > > > > the buffer size. If net.core.high_order_alloc_disable=1, the 
> > > > > > > maximum
> > > > > > > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, 
> > > > > > > only
> > > > > > > the first buffer of the frag is affected.
> > > > > > >
> > > > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > > > > > > use_dma_api")
> > > > > > > Reported-by: "Si-Wei Liu" 
> > > > > > > Closes: 
> > > > > > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > >
> > > > > >
> > > > > > Guys where are we going with this? We have a crasher right now,
> > > > > > if this is not fixed ASAP I'd have to revert a ton of
> > > > > > work Xuan Zhuo just did.
> > > > >
> > > > > I think this patch can fix it and I tested it.
> > > > > But Darren said this patch did not work.
> > > > > I need more info about the crash that Darren encountered.
> > > > >
> > > > > Thanks.
> > > >
> > > > So what are we doing? Revert the whole pile for now?
> > > > Seems to be a bit of a pity, but maybe that's the best we can do
> > > > for this release.
> > >
> > > @Jason Could you review this?
> > >
> > > I think this problem is clear, though I do not know why it did not work
> > > for Darren.
> > >
> > > Thanks.
> > >
> >
> > No regressions is a hard rule. If we can't figure out the regression
> > now, we should revert and you can try again for the next release.
> 
> I see. I think I fixed it.
> 
> Hope Darren can reply before you post the revert patches.
> 
> Thanks.
> 

It's very rushed anyway. I posted the reverts, but as RFC for now.
You should post a debugging patch for Darren to help you figure
out what is going on.


> >
> >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > > > ---
> > > > > > >  drivers/net/virtio_net.c | 12 +---
> > > > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > > > >
> > > > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > > > index c6af18948092..e5286a6da863 100644
> > > > > > > --- a/drivers/net/virtio_net.c
> > > > > > > +++ b/drivers/net/virtio_net.c
> > > > > > > @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct 
> > > > > > > receive_queue *rq, u32 size, gfp_t gfp)
> > > > > > >   void *buf, *head;
> > > > > > >   dm

Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-06 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 05:25:36PM +0800, Xuan Zhuo wrote:
> On Fri, 6 Sep 2024 05:08:56 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Fri, Sep 06, 2024 at 04:53:38PM +0800, Xuan Zhuo wrote:
> > > On Fri, 6 Sep 2024 04:43:29 -0400, "Michael S. Tsirkin"  
> > > wrote:
> > > > On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> > > > > leads to regression on VM with the sysctl value of:
> > > > >
> > > > > - net.core.high_order_alloc_disable=1
> > > > >
> > > > > which could see reliable crashes or scp failure (scp a file 100M in 
> > > > > size
> > > > > to VM):
> > > > >
> > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the 
> > > > > beginning
> > > > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > > > everything is fine. However, if the frag is only one page and the
> > > > > total size of the buffer and virtnet_rq_dma is larger than one page, 
> > > > > an
> > > > > overflow may occur. In this case, if an overflow is possible, I adjust
> > > > > the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> > > > > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> > > > > the first buffer of the frag is affected.
> > > > >
> > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > > > > use_dma_api")
> > > > > Reported-by: "Si-Wei Liu" 
> > > > > Closes: 
> > > > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > > > > Signed-off-by: Xuan Zhuo 
> > > >
> > > >
> > > > Guys where are we going with this? We have a crasher right now,
> > > > if this is not fixed ASAP I'd have to revert a ton of
> > > > work Xuan Zhuo just did.
> > >
> > > I think this patch can fix it and I tested it.
> > > But Darren said this patch did not work.
> > > I need more info about the crash that Darren encountered.
> > >
> > > Thanks.
> >
> > So what are we doing? Revert the whole pile for now?
> > Seems to be a bit of a pity, but maybe that's the best we can do
> > for this release.
> 
> @Jason Could you review this?
> 
> I think this problem is clear, though I do not know why it did not work
> for Darren.
> 
> Thanks.
> 

No regressions is a hard rule. If we can't figure out the regression
now, we should revert and you can try again for the next release.


> >
> >
> > > >
> > > >
> > > > > ---
> > > > >  drivers/net/virtio_net.c | 12 +---
> > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index c6af18948092..e5286a6da863 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct 
> > > > > receive_queue *rq, u32 size, gfp_t gfp)
> > > > >   void *buf, *head;
> > > > >   dma_addr_t addr;
> > > > >
> > > > > - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> > > > > - return NULL;
> > > > > -
> > > > >   head = page_address(alloc_frag->page);
> > > > >
> > > > >   dma = head;
> > > > > @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct 
> > > > > virtnet_info *vi, struct receive_queue *rq,
> > > > >   len = SKB_DATA_ALIGN(len) +
> > > > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > >
> > > > > + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> > > > > + return -ENOMEM;
> > > > > +
> > > > >   buf = virtnet_rq_alloc(rq, len, gfp);
> > > > >   if (unlikely(!buf))
> > > > >   return -ENOMEM;
> > > > > @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct 
> > > > > virtnet_info *vi,
> > > > >*/
> > > > >   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> > > > >
> > > > > + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, 
> > > > > gfp)))
> > > > > + return -ENOMEM;
> > > > > +
> > > > > + if (!alloc_frag->offset && len + room + sizeof(struct 
> > > > > virtnet_rq_dma) > alloc_frag->size)
> > > > > + len -= sizeof(struct virtnet_rq_dma);
> > > > > +
> > > > >   buf = virtnet_rq_alloc(rq, len + room, gfp);
> > > > >   if (unlikely(!buf))
> > > > >   return -ENOMEM;
> > > > > --
> > > > > 2.32.0.3.g01195cf9f
> > > >
> >




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-06 Thread Michael S. Tsirkin
On Fri, Sep 06, 2024 at 04:53:38PM +0800, Xuan Zhuo wrote:
> On Fri, 6 Sep 2024 04:43:29 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> > > leads to regression on VM with the sysctl value of:
> > >
> > > - net.core.high_order_alloc_disable=1
> > >
> > > which could see reliable crashes or scp failure (scp a file 100M in size
> > > to VM):
> > >
> > > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > everything is fine. However, if the frag is only one page and the
> > > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > > overflow may occur. In this case, if an overflow is possible, I adjust
> > > the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> > > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> > > the first buffer of the frag is affected.
> > >
> > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > > use_dma_api")
> > > Reported-by: "Si-Wei Liu" 
> > > Closes: 
> > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > > Signed-off-by: Xuan Zhuo 
> >
> >
> > Guys where are we going with this? We have a crasher right now,
> > if this is not fixed ASAP I'd have to revert a ton of
> > work Xuan Zhuo just did.
> 
> I think this patch can fix it and I tested it.
> But Darren said this patch did not work.
> I need more info about the crash that Darren encountered.
> 
> Thanks.

So what are we doing? Revert the whole pile for now?
Seems to be a bit of a pity, but maybe that's the best we can do
for this release.


> >
> >
> > > ---
> > >  drivers/net/virtio_net.c | 12 +---
> > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index c6af18948092..e5286a6da863 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue 
> > > *rq, u32 size, gfp_t gfp)
> > >   void *buf, *head;
> > >   dma_addr_t addr;
> > >
> > > - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> > > - return NULL;
> > > -
> > >   head = page_address(alloc_frag->page);
> > >
> > >   dma = head;
> > > @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info 
> > > *vi, struct receive_queue *rq,
> > >   len = SKB_DATA_ALIGN(len) +
> > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > >
> > > + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> > > + return -ENOMEM;
> > > +
> > >   buf = virtnet_rq_alloc(rq, len, gfp);
> > >   if (unlikely(!buf))
> > >   return -ENOMEM;
> > > @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct 
> > > virtnet_info *vi,
> > >*/
> > >   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> > >
> > > + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> > > + return -ENOMEM;
> > > +
> > > + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> > > alloc_frag->size)
> > > + len -= sizeof(struct virtnet_rq_dma);
> > > +
> > >   buf = virtnet_rq_alloc(rq, len + room, gfp);
> > >   if (unlikely(!buf))
> > >   return -ENOMEM;
> > > --
> > > 2.32.0.3.g01195cf9f
> >




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-09-06 Thread Michael S. Tsirkin
On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> leads to regression on VM with the sysctl value of:
> 
> - net.core.high_order_alloc_disable=1
> 
> which could see reliable crashes or scp failure (scp a file 100M in size
> to VM):
> 
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur. In this case, if an overflow is possible, I adjust
> the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> the first buffer of the frag is affected.
> 
> Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> use_dma_api")
> Reported-by: "Si-Wei Liu" 
> Closes: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> Signed-off-by: Xuan Zhuo 


Guys where are we going with this? We have a crasher right now,
if this is not fixed ASAP I'd have to revert a ton of
work Xuan Zhuo just did.


> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c6af18948092..e5286a6da863 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
>   void *buf, *head;
>   dma_addr_t addr;
>  
> - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> - return NULL;
> -
>   head = page_address(alloc_frag->page);
>  
>   dma = head;
> @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>   len = SKB_DATA_ALIGN(len) +
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> + return -ENOMEM;
> +
>   buf = virtnet_rq_alloc(rq, len, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>*/
>   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>  
> + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> + return -ENOMEM;
> +
> + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> alloc_frag->size)
> + len -= sizeof(struct virtnet_rq_dma);
> +
>   buf = virtnet_rq_alloc(rq, len + room, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-29 Thread Michael S. Tsirkin
On Thu, Aug 29, 2024 at 03:38:07PM +0800, Xuan Zhuo wrote:
> On Thu, 29 Aug 2024 03:35:58 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Thu, Aug 29, 2024 at 03:26:00PM +0800, Xuan Zhuo wrote:
> > > On Thu, 29 Aug 2024 12:51:31 +0800, Jason Wang  
> > > wrote:
> > > > On Wed, Aug 28, 2024 at 7:21 PM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > On Tue, 27 Aug 2024 11:38:45 +0800, Jason Wang  
> > > > > wrote:
> > > > > > On Tue, Aug 20, 2024 at 3:19 PM Xuan Zhuo 
> > > > > >  wrote:
> > > > > > >
> > > > > > > leads to regression on VM with the sysctl value of:
> > > > > > >
> > > > > > > - net.core.high_order_alloc_disable=1
> > > > > > >
> > > > > > > which could see reliable crashes or scp failure (scp a file 100M 
> > > > > > > in size
> > > > > > > to VM):
> > > > > > >
> > > > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the 
> > > > > > > beginning
> > > > > > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > > > > > everything is fine. However, if the frag is only one page and the
> > > > > > > total size of the buffer and virtnet_rq_dma is larger than one 
> > > > > > > page, an
> > > > > > > overflow may occur. In this case, if an overflow is possible, I 
> > > > > > > adjust
> > > > > > > the buffer size. If net.core.high_order_alloc_disable=1, the 
> > > > > > > maximum
> > > > > > > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, 
> > > > > > > only
> > > > > > > the first buffer of the frag is affected.
> >
> > I don't exactly get it, when you say "only the first buffer of the frag
> > is affected" what do you mean? Affected how?
> 
> 
> I should say that if the frag is 32k, that is safe.
> It is only when the frag is 4k that it is not safe.
> 
> Thanks.


It looks like nothing changes when net.core.high_order_alloc_disable=0
(which is the default) but maybe I am missing something.
It is worth testing performance with mtu set to 4k, just to make sure
we did not miss anything.

> 
> >
> > > > > >
> > > > > > I wonder instead of trying to make use of headroom, would it be
> > > > > > simpler if we allocate dedicated arrays of virtnet_rq_dma?
> > > > >
> > > > > Sorry for the late reply. My mailbox was full, so I missed the reply 
> > > > > to this
> > > > > thread. Thanks to Si-Wei for reminding me.
> > > > >
> > > > > If the virtnet_rq_dma is at the headroom, we can get the 
> > > > > virtnet_rq_dma by buf.
> > > > >
> > > > > struct page *page = virt_to_head_page(buf);
> > > > >
> > > > > head = page_address(page);
> > > > >
> > > > > If we use a dedicated array, then we need pass the virtnet_rq_dma 
> > > > > pointer to
> > > > > virtio core, the array has the same size with the rx ring.
> > > > >
> > > > > The virtnet_rq_dma will be:
> > > > >
> > > > > struct virtnet_rq_dma {
> > > > > dma_addr_t addr;
> > > > > u32 ref;
> > > > > u16 len;
> > > > > u16 need_sync;
> > > > > +   void *buf;
> > > > > };
> > > > >
> > > > > That will be simpler.
> > > >
> > > > I'm not sure I understand here, did you mean using a dedicated array is 
> > > > simpler?
> > >
> > > I found the old version(that used a dedicated array):
> > >
> > > http://lore.kernel.org/all/20230710034237.12391-11-xuanz...@linux.alibaba.com
> > >
> > > If you think that is ok, I can port a new version based that.
> > >
> > > Thanks.
> > >
> >
> > That one got a bunch of comments that likely still apply.
> > And this looks like a much bigger change than what this
> > patch proposes.
> >
> > > >
> > > > >
> > > > > >
> > > > > > Btw, I see it has a need_sync, I wonder if it can help for 
> > > > > >

Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-29 Thread Michael S. Tsirkin
On Thu, Aug 29, 2024 at 03:26:00PM +0800, Xuan Zhuo wrote:
> On Thu, 29 Aug 2024 12:51:31 +0800, Jason Wang  wrote:
> > On Wed, Aug 28, 2024 at 7:21 PM Xuan Zhuo  
> > wrote:
> > >
> > > On Tue, 27 Aug 2024 11:38:45 +0800, Jason Wang  
> > > wrote:
> > > > On Tue, Aug 20, 2024 at 3:19 PM Xuan Zhuo  
> > > > wrote:
> > > > >
> > > > > leads to regression on VM with the sysctl value of:
> > > > >
> > > > > - net.core.high_order_alloc_disable=1
> > > > >
> > > > > which could see reliable crashes or scp failure (scp a file 100M in 
> > > > > size
> > > > > to VM):
> > > > >
> > > > > The issue is that the virtnet_rq_dma takes up 16 bytes at the 
> > > > > beginning
> > > > > of a new frag. When the frag size is larger than PAGE_SIZE,
> > > > > everything is fine. However, if the frag is only one page and the
> > > > > total size of the buffer and virtnet_rq_dma is larger than one page, 
> > > > > an
> > > > > overflow may occur. In this case, if an overflow is possible, I adjust
> > > > > the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> > > > > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> > > > > the first buffer of the frag is affected.

I don't exactly get it, when you say "only the first buffer of the frag
is affected" what do you mean? Affected how?

> > > >
> > > > I wonder instead of trying to make use of headroom, would it be
> > > > simpler if we allocate dedicated arrays of virtnet_rq_dma?
> > >
> > > Sorry for the late reply. My mailbox was full, so I missed the reply to 
> > > this
> > > thread. Thanks to Si-Wei for reminding me.
> > >
> > > If the virtnet_rq_dma is at the headroom, we can get the virtnet_rq_dma 
> > > by buf.
> > >
> > > struct page *page = virt_to_head_page(buf);
> > >
> > > head = page_address(page);
> > >
> > > If we use a dedicated array, then we need pass the virtnet_rq_dma pointer 
> > > to
> > > virtio core, the array has the same size with the rx ring.
> > >
> > > The virtnet_rq_dma will be:
> > >
> > > struct virtnet_rq_dma {
> > > dma_addr_t addr;
> > > u32 ref;
> > > u16 len;
> > > u16 need_sync;
> > > +   void *buf;
> > > };
> > >
> > > That will be simpler.
> >
> > I'm not sure I understand here, did you mean using a dedicated array is 
> > simpler?
> 
> I found the old version(that used a dedicated array):
> 
> http://lore.kernel.org/all/20230710034237.12391-11-xuanz...@linux.alibaba.com
> 
> If you think that is ok, I can port a new version based that.
> 
> Thanks.
> 

That one got a bunch of comments that likely still apply.
And this looks like a much bigger change than what this
patch proposes.
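
For reference, the headroom lookup described above boils down to something like
this (the helper name is made up for illustration; the two statements inside
are the ones quoted in the thread):

	/* The per-buffer DMA metadata lives at the start of the page frag the
	 * buffer was carved from, so it can be recovered from the buffer
	 * pointer alone, with no dedicated array.
	 */
	static struct virtnet_rq_dma *buf_to_rq_dma(void *buf)
	{
		struct page *page = virt_to_head_page(buf);

		return (struct virtnet_rq_dma *)page_address(page);
	}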

> >
> > >
> > > >
> > > > Btw, I see it has a need_sync, I wonder if it can help for performance
> > > > or not? If not, any reason to keep that?
> > >
> > > I think yes, we can skip the cpu sync when we do not need it.
> >
> > I meant it looks to me the needs_sync is not necessary in the
> > structure as we can call need_sync() any time if we had dma addr.
> >
> > Thanks
> >
> > >
> > > Thanks.
> > >
> > >
> > > >
> > > > >
> > > > > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > > > > use_dma_api")
> > > > > Reported-by: "Si-Wei Liu" 
> > > > > Closes: 
> > > > > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > > > > Signed-off-by: Xuan Zhuo 
> > > > > ---
> > > > >  drivers/net/virtio_net.c | 12 +---
> > > > >  1 file changed, 9 insertions(+), 3 deletions(-)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index c6af18948092..e5286a6da863 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct 
> > > > > receive_queue *rq, u32 size, gfp_t gfp)
> > > > > void *buf, *head;
> > > > > dma_addr_t addr;
> > > > >
> > > > > -   if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> > > > > -   return NULL;
> > > > > -
> > > > > head = page_address(alloc_frag->page);
> > > > >
> > > > > dma = head;
> > > > > @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct 
> > > > > virtnet_info *vi, struct receive_queue *rq,
> > > > > len = SKB_DATA_ALIGN(len) +
> > > > >   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > >
> > > > > +   if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, 
> > > > > gfp)))
> > > > > +   return -ENOMEM;
> > > > > +
> > > > > buf = virtnet_rq_alloc(rq, len, gfp);
> > > > > if (unlikely(!buf))
> > > > > return -ENOMEM;
> > > > > @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct 
> > > > > virtnet_info *vi,
> > > > >  */
> > > > > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> > > > >
> > > > > +   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, 
> > > > > gf

Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-20 Thread Michael S. Tsirkin
On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> leads to regression on VM with the sysctl value of:
> 
> - net.core.high_order_alloc_disable=1
> 
> which could see reliable crashes or scp failure (scp a file 100M in size
> to VM):
> 
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur. In this case, if an overflow is possible, I adjust
> the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> the first buffer of the frag is affected.
> 
> Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> use_dma_api")
> Reported-by: "Si-Wei Liu" 
> Closes: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c6af18948092..e5286a6da863 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
>   void *buf, *head;
>   dma_addr_t addr;
>  
> - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> - return NULL;
> -
>   head = page_address(alloc_frag->page);
>  
>   dma = head;

From an API POV, I don't like it that virtnet_rq_alloc relies on
the caller to invoke skb_page_frag_refill.
It's better to pass in the length to refill.
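
Something along these lines, purely as a sketch (the wrapper name is made up
and the body only shows the intended split; it is not from any posted patch):

	/* Sketch: have the allocator take the length and do the refill itself,
	 * so a caller cannot pass a refill size that differs from the
	 * allocation size.
	 */
	static void *virtnet_rq_alloc_buf(struct receive_queue *rq, u32 len, gfp_t gfp)
	{
		struct page_frag *alloc_frag = &rq->alloc_frag;

		if (unlikely(!skb_page_frag_refill(len, alloc_frag, gfp)))
			return NULL;

		return virtnet_rq_alloc(rq, len, gfp);	/* existing helper */
	}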

> @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>   len = SKB_DATA_ALIGN(len) +
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> + return -ENOMEM;
> +
>   buf = virtnet_rq_alloc(rq, len, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>*/
>   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>  
> + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> + return -ENOMEM;
> +
> + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> alloc_frag->size)
> + len -= sizeof(struct virtnet_rq_dma);
> +
>   buf = virtnet_rq_alloc(rq, len + room, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> -- 
> 2.32.0.3.g01195cf9f
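
To make the failure mode above concrete, a small worked example, assuming
PAGE_SIZE == 4096 and sizeof(struct virtnet_rq_dma) == 16 as stated in the
commit message (illustrative only):

	/*
	 * The virtnet_rq_dma metadata sits at the start of a fresh page frag:
	 *
	 *   frag (one 4096-byte page, net.core.high_order_alloc_disable=1)
	 *   +---------------------+---------------------------------------+
	 *   | virtnet_rq_dma (16) | buffer: len + room bytes              |
	 *   +---------------------+---------------------------------------+
	 *
	 * If len + room + 16 > 4096, the buffer runs past the end of the page,
	 * which is the overflow being fixed.  Hence the adjustment quoted above
	 * for add_recvbuf_mergeable():
	 *
	 *   if (!alloc_frag->offset &&
	 *       len + room + sizeof(struct virtnet_rq_dma) > alloc_frag->size)
	 *           len -= sizeof(struct virtnet_rq_dma);
	 *
	 * With a 32k frag (high_order_alloc_disable=0, the default) only the
	 * first buffer in the frag starts right behind the metadata, so only
	 * that buffer can be affected.
	 */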




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-20 Thread Michael S. Tsirkin
On Tue, Aug 20, 2024 at 12:44:46PM -0700, Si-Wei Liu wrote:
> 
> 
> On 8/20/2024 12:19 AM, Xuan Zhuo wrote:
> > leads to regression on VM with the sysctl value of:
> > 
> > - net.core.high_order_alloc_disable=1
> > 
> > which could see reliable crashes or scp failure (scp a file 100M in size
> > to VM):
> > 
> > The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> > of a new frag. When the frag size is larger than PAGE_SIZE,
> > everything is fine. However, if the frag is only one page and the
> > total size of the buffer and virtnet_rq_dma is larger than one page, an
> > overflow may occur. In this case, if an overflow is possible, I adjust
> > the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> > buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> > the first buffer of the frag is affected.
> > 
> > Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> > use_dma_api")
> > Reported-by: "Si-Wei Liu" 
> > Closes: 
> > http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> > Signed-off-by: Xuan Zhuo 
> > ---
> >   drivers/net/virtio_net.c | 12 +---
> >   1 file changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c6af18948092..e5286a6da863 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> > u32 size, gfp_t gfp)
> > void *buf, *head;
> > dma_addr_t addr;
> > -   if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> > -   return NULL;
> > -
> > head = page_address(alloc_frag->page);
> > dma = head;
> > @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> > struct receive_queue *rq,
> > len = SKB_DATA_ALIGN(len) +
> >   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > +   if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> > +   return -ENOMEM;
> > +
> Do you want to document the assumption that the small packet case won't end up
> crossing the page frag boundary, unlike the mergeable case? Adding a comment
> block to explain it, or a WARN_ON() check against the potential overflow, would
> work for me.
> 
> > buf = virtnet_rq_alloc(rq, len, gfp);
> > if (unlikely(!buf))
> > return -ENOMEM;
> > @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> > *vi,
> >  */
> > len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> > +   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> > +   return -ENOMEM;
> > +
> > +   if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> > alloc_frag->size)
> > +   len -= sizeof(struct virtnet_rq_dma);
> > +
> This could address my previous concern about possibly regressing every buffer
> size for the mergeable case, thanks. Though I still don't get why we carve up
> a small chunk from the page_frag for storing the virtnet_rq_dma metadata; this
> would cause a perf regression on a certain MTU size

4Kbyte MTU exactly?

> that happens to end up with
> one more base page (and an extra descriptor as well) to be allocated
> compared to the previous code without the extra virtnet_rq_dma content. How
> hard would it be to allocate a dedicated struct to store the related
> information without affecting the (size of) datapath pages?
> 
> FWIW, from a code review perspective, I've looked up the past
> conversations but didn't see that a comprehensive benchmark was done before
> removing the old code and making premap the sole default mode. Granted, this
> would reduce the footprint of additional code and the associated maintenance
> cost immediately, but I would assume at least there should have been
> thorough performance runs upfront to guarantee no regression is seen with
> every possible use case, or the negative effect is comparatively negligible
> even though there's slight regression in some limited case. If that kind of
> perf measurement hadn't been done before getting accepted/merged, I think at
> least it should allow both modes to coexist for a while such that every user
> could gauge the performance effect.
> 
> Thanks,
> -Siwei
> 
> > buf = virtnet_rq_alloc(rq, len + room, gfp);
> > if (unlikely(!buf))
> > return -ENOMEM;




Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-20 Thread Michael S. Tsirkin
On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:
> leads to regression on VM with the sysctl value of:
> 
> - net.core.high_order_alloc_disable=1




> which could see reliable crashes or scp failure (scp a file 100M in size
> to VM):
> 
> The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
> of a new frag. When the frag size is larger than PAGE_SIZE,
> everything is fine. However, if the frag is only one page and the
> total size of the buffer and virtnet_rq_dma is larger than one page, an
> overflow may occur. In this case, if an overflow is possible, I adjust
> the buffer size. If net.core.high_order_alloc_disable=1, the maximum
> buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
> the first buffer of the frag is affected.
> 
> Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever 
> use_dma_api")
> Reported-by: "Si-Wei Liu" 
> Closes: 
> http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
> Signed-off-by: Xuan Zhuo 


Darren, could you pls test and confirm?

> ---
>  drivers/net/virtio_net.c | 12 +---
>  1 file changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index c6af18948092..e5286a6da863 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, 
> u32 size, gfp_t gfp)
>   void *buf, *head;
>   dma_addr_t addr;
>  
> - if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
> - return NULL;
> -
>   head = page_address(alloc_frag->page);
>  
>   dma = head;
> @@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
> struct receive_queue *rq,
>   len = SKB_DATA_ALIGN(len) +
> SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
>  
> + if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
> + return -ENOMEM;
> +
>   buf = virtnet_rq_alloc(rq, len, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> @@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info 
> *vi,
>*/
>   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
>  
> + if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
> + return -ENOMEM;
> +
> + if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
> alloc_frag->size)
> + len -= sizeof(struct virtnet_rq_dma);
> +
>   buf = virtnet_rq_alloc(rq, len + room, gfp);
>   if (unlikely(!buf))
>   return -ENOMEM;
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net] virtio_net: move netdev_tx_reset_queue() call before RX napi enable

2024-08-14 Thread Michael S. Tsirkin
On Wed, Aug 14, 2024 at 02:25:00PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> During suspend/resume the following BUG was hit:
> [ cut here ]
> kernel BUG at lib/dynamic_queue_limits.c:99!
> Internal error: Oops - BUG: 0 [#1] SMP ARM
> Modules linked in: bluetooth ecdh_generic ecc libaes
> CPU: 1 PID: 1282 Comm: rtcwake Not tainted
> 6.10.0-rc3-00732-gc8bd1f7f3e61 #15240
> Hardware name: Generic DT based system
> PC is at dql_completed+0x270/0x2cc
> LR is at __free_old_xmit+0x120/0x198
> pc : []    lr : []    psr: 8013
> ...
> Flags: Nzcv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
> Control: 10c5387d  Table: 43a4406a  DAC: 0051
> ...
> Process rtcwake (pid: 1282, stack limit = 0xfbc21278)
> Stack: (0xe0805e80 to 0xe0806000)
> ...
> Call trace:
>   dql_completed from __free_old_xmit+0x120/0x198
>   __free_old_xmit from free_old_xmit+0x44/0xe4
>   free_old_xmit from virtnet_poll_tx+0x88/0x1b4
>   virtnet_poll_tx from __napi_poll+0x2c/0x1d4
>   __napi_poll from net_rx_action+0x140/0x2b4
>   net_rx_action from handle_softirqs+0x11c/0x350
>   handle_softirqs from call_with_stack+0x18/0x20
>   call_with_stack from do_softirq+0x48/0x50
>   do_softirq from __local_bh_enable_ip+0xa0/0xa4
>   __local_bh_enable_ip from virtnet_open+0xd4/0x21c
>   virtnet_open from virtnet_restore+0x94/0x120
>   virtnet_restore from virtio_device_restore+0x110/0x1f4
>   virtio_device_restore from dpm_run_callback+0x3c/0x100
>   dpm_run_callback from device_resume+0x12c/0x2a8
>   device_resume from dpm_resume+0x12c/0x1e0
>   dpm_resume from dpm_resume_end+0xc/0x18
>   dpm_resume_end from suspend_devices_and_enter+0x1f0/0x72c
>   suspend_devices_and_enter from pm_suspend+0x270/0x2a0
>   pm_suspend from state_store+0x68/0xc8
>   state_store from kernfs_fop_write_iter+0x10c/0x1cc
>   kernfs_fop_write_iter from vfs_write+0x2b0/0x3dc
>   vfs_write from ksys_write+0x5c/0xd4
>   ksys_write from ret_fast_syscall+0x0/0x54
> Exception stack(0xe8bf1fa8 to 0xe8bf1ff0)
> ...
> ---[ end trace  ]---
> 
> After virtnet_napi_enable() is called, the following path is hit:
>   __napi_poll()
> -> virtnet_poll()
>   -> virtnet_poll_cleantx()
> -> netif_tx_wake_queue()
> 
> That wakes the TX queue and allows skbs to be submitted and accounted by
> BQL counters.
> 
> Then netdev_tx_reset_queue() is called that resets BQL counters and
> eventually leads to the BUG in dql_completed().
> 
> Move the netdev_tx_reset_queue() call, which resets the BQL counters, before
> RX napi enable to avoid the issue.
> 
> Reported-by: Marek Szyprowski 
> Closes: 
> https://lore.kernel.org/netdev/e632e378-d019-4de7-8f13-07c572ab3...@samsung.com/
> Fixes: c8bd1f7f3e61 ("virtio_net: add support for Byte Queue Limits")
> Tested-by: Marek Szyprowski 
> Signed-off-by: Jiri Pirko 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 3f10c72743e9..c6af18948092 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2867,8 +2867,8 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
>   if (err < 0)
>   goto err_xdp_reg_mem_model;
>  
> - virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
>   netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, qp_index));
> + virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
>   virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
>  
>   return 0;
> -- 
> 2.45.2
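
In short, the ordering constraint the fix enforces is (sketch, using the calls
from the diff above):

	/*
	 * netdev_tx_reset_queue(txq);   resets the BQL/dql counters
	 * virtnet_napi_enable(rq);      RX napi may run virtnet_poll_cleantx()
	 *                               -> netif_tx_wake_queue(), after which
	 *                               skbs start being accounted by BQL
	 * virtnet_napi_tx_enable(sq);
	 *
	 * Resetting the counters after the TX queue may already have been woken
	 * is what trips the BUG in dql_completed().
	 */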




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-08-14 Thread Michael S. Tsirkin
On Wed, Aug 14, 2024 at 10:17:15AM +0200, Jiri Pirko wrote:
> >diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >index 3f10c72743e9..c6af18948092 100644
> >--- a/drivers/net/virtio_net.c
> >+++ b/drivers/net/virtio_net.c
> >@@ -2867,8 +2867,8 @@ static int virtnet_enable_queue_pair(struct 
> >virtnet_info *vi, int qp_index)
> > if (err < 0)
> > goto err_xdp_reg_mem_model;
> > 
> >-virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> > netdev_tx_reset_queue(netdev_get_tx_queue(vi->dev, qp_index));
> >+virtnet_napi_enable(vi->rq[qp_index].vq, &vi->rq[qp_index].napi);
> > virtnet_napi_tx_enable(vi, vi->sq[qp_index].vq, &vi->sq[qp_index].napi);
> 
> Hmm, I have to look at this a bit more. I think this might be an accidental
> fix. The thing is, napi can be triggered even when it is disabled:
> 
>->__local_bh_enable_ip()
>  -> net_rx_action()
>-> __napi_poll()
> 
> Here __napi_poll() calls napi_is_scheduled() and calls virtnet_poll_tx()
> in case napi is scheduled. napi_is_scheduled() checks NAPI_STATE_SCHED
> bit in napi state.
> 
> However, this bit is set previously by netif_napi_add_weight().

It's actually set in napi_disable too, isn't it?

> 
> >
> > > ...
> >
> >Best regards
> >-- 
> >Marek Szyprowski, PhD
> >Samsung R&D Institute Poland
> >
> 
> 
> > 
> > return 0;
> >
> >
> >Will submit the patch in a jiff. Thanks!
> >
> >
> >
> >>
> >>Best regards
> >>-- 
> >>Marek Szyprowski, PhD
> >>Samsung R&D Institute Poland
> >>




Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api

2024-08-14 Thread Michael S. Tsirkin
On Tue, Aug 13, 2024 at 08:39:53PM -0700, Si-Wei Liu wrote:
> Hi Michael,
> 
> I'll look for someone else from Oracle to help you on this, as the relevant
> team already did verify internally that reverting all 4 patches from this
> series could help address the regression. Just reverting one single commit
> won't help.
> 
>   9719f039d328 virtio_net: remove the misleading comment
>   defd28aa5acb virtio_net: rx remove premapped failover code
>   a377ae542d8d virtio_net: big mode skip the unmap check
>   f9dac92ba908 virtio_ring: enable premapped mode whatever use_dma_api
> 
> In case I fail to get someone to help, could you work with Darren (cc'ed)
> directly? He could reach out to the corresponding team in Oracle to help
> with testing.
> 
> Thanks,
> -Siwei
> 

OK, I posted an untested revert for your testing:

Message-ID: <20240511031404.30903-1-xuanz...@linux.alibaba.com>



> On 8/13/2024 12:46 PM, Michael S. Tsirkin wrote:
> > Want to post a patchset to revert?
> > 




Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api

2024-08-13 Thread Michael S. Tsirkin
On Tue, Aug 13, 2024 at 12:28:41PM -0700, Si-Wei Liu wrote:
> 
> Turning on the below commit to unconditionally enable premapped
> virtio-net:
> 
> commit f9dac92ba9081062a6477ee015bd3b8c5914efc4
> Author: Xuan Zhuo 
> Date:   Sat May 11 11:14:01 2024 +0800
> 
> leads to regression on VM with no ACCESS_PLATFORM, and with the sysctl value
> of:
> 
> - net.core.high_order_alloc_disable=1
> 
> which could see reliable crashes or scp failure (scp a file 100M in size to
> VM):
> 
> [  332.079333] __vm_enough_memory: pid: 18440, comm: sshd, bytes:
> 5285790347661783040 not enough memory for the allocation
> [  332.079651] [ cut here ]
> [  332.079655] kernel BUG at mm/mmap.c:3514!
> [  332.080095] invalid opcode:  [#1] PREEMPT SMP NOPTI
> [  332.080826] CPU: 18 PID: 18440 Comm: sshd Kdump: loaded Not tainted
> 6.10.0-2.x86_64 #2
> [  332.081514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
> 1.16.0-4.module+el8.9.0+90173+a3f3e83a 04/01/2014
> [  332.082451] RIP: 0010:exit_mmap+0x3a1/0x3b0
> [  332.082871] Code: be 01 00 00 00 48 89 df e8 0c 94 fe ff eb d7 be 01 00
> 00 00 48 89 df e8 5d 98 fe ff eb be 31 f6 48 89 df e8 31 99 fe ff eb a8 <0f>
> 0b e8 68 bc ae 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90
> [  332.084230] RSP: 0018:9988b1c8f948 EFLAGS: 00010293
> [  332.084635] RAX: 0406 RBX: 8d47583e7380 RCX:
> 
> [  332.085171] RDX:  RSI:  RDI:
> 
> [  332.085699] RBP: 008f R08:  R09:
> 
> [  332.086233] R10:  R11:  R12:
> 8d47583e7430
> [  332.086761] R13: 8d47583e73c0 R14: 0406 R15:
> 000495ae650dda58
> [  332.087300] FS:  7ff443899980() GS:8df1c570()
> knlGS:
> [  332.087888] CS:  0010 DS:  ES:  CR0: 80050033
> [  332.088334] CR2: 55a42d30b730 CR3: 0102e956a004 CR4:
> 00770ef0
> [  332.088867] PKRU: 5554
> [  332.089114] Call Trace:
> [  332.089349] 
> [  332.089556]  ? die+0x36/0x90
> [  332.089818]  ? do_trap+0xed/0x110
> [  332.090110]  ? exit_mmap+0x3a1/0x3b0
> [  332.090411]  ? do_error_trap+0x6a/0xa0
> [  332.090722]  ? exit_mmap+0x3a1/0x3b0
> [  332.091029]  ? exc_invalid_op+0x50/0x80
> [  332.091348]  ? exit_mmap+0x3a1/0x3b0
> [  332.091648]  ? asm_exc_invalid_op+0x1a/0x20
> [  332.091998]  ? exit_mmap+0x3a1/0x3b0
> [  332.092299]  ? exit_mmap+0x1d6/0x3b0
> [  332.092604] __mmput+0x3e/0x130
> [  332.092882] dup_mm.constprop.0+0x10c/0x110
> [  332.093226] copy_process+0xbd0/0x1570
> [  332.093539] kernel_clone+0xbf/0x430
> [  332.093838]  ? syscall_exit_work+0x103/0x130
> [  332.094197] __do_sys_clone+0x66/0xa0
> [  332.094506]  do_syscall_64+0x8c/0x1d0
> [  332.094814]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.095198]  ? audit_reset_context+0x232/0x310
> [  332.095558]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.095936]  ? syscall_exit_work+0x103/0x130
> [  332.096288]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.096668]  ? syscall_exit_to_user_mode+0x7d/0x220
> [  332.097059]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.097436]  ? do_syscall_64+0xba/0x1d0
> [  332.097752]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.098137]  ? syscall_exit_to_user_mode+0x7d/0x220
> [  332.098525]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.098903]  ? do_syscall_64+0xba/0x1d0
> [  332.099227]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.099606]  ? __audit_filter_op+0xbe/0x140
> [  332.099943]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.100328]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.100706]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.101089]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.101468]  ? wp_page_reuse+0x8e/0xb0
> [  332.101779]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.102163]  ? do_wp_page+0xe6/0x470
> [  332.102465]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.102843]  ? __handle_mm_fault+0x5ff/0x720
> [  332.103197]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.103574]  ? __count_memcg_events+0x4d/0xd0
> [  332.103938]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.104323]  ? count_memcg_events.constprop.0+0x26/0x50
> [  332.104729]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.105114]  ? handle_mm_fault+0xae/0x320
> [  332.105442]  ? srso_alias_return_thunk+0x5/0xfbef5
> [  332.105820]  ? do_user_addr_fault+0x31f/0x6c0
> [  332.106181]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> [  332.106576] RIP: 0033:0x7ff43f8f9a73
> [  332.106876] Code: db 0f 85 28 01 00 00 64 4c 8b 0c 25 10 00 00 00 45 31
> c0 4d 8d 91 d0 02 00 00 31 d2 31 f6 bf 11
> 00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 b9 00 00 00 41 89 c5
> 85 c0 0f 85 c6 00 00
> [  332.108163] RSP: 002b:7ffc690909b0 EFLAGS: 0246 ORIG_RAX:
> 0038
> [  332.108719] RAX: ffda RBX:  RCX:
> 7ff43f8f9a73
> [  332.109253] RDX:  RSI: 0

Re: [Patch net] vsock: fix recursive ->recvmsg calls

2024-08-12 Thread Michael S. Tsirkin
On Sun, Aug 11, 2024 at 07:21:53PM -0700, Cong Wang wrote:
> From: Cong Wang 
> 
> After a vsock socket has been added to a BPF sockmap, its prot->recvmsg
> has been replaced with vsock_bpf_recvmsg(). Thus the following
> recursion could happen:
> 
> vsock_bpf_recvmsg()
>  -> __vsock_recvmsg()
>   -> vsock_connectible_recvmsg()
>-> prot->recvmsg()
> -> vsock_bpf_recvmsg() again
> 
> We need to fix it by calling the original ->recvmsg() without any BPF
> sockmap logic in __vsock_recvmsg().
> 
> Fixes: 634f1a7110b4 ("vsock: support sockmap")
> Reported-by: syzbot+bdb4bd87b5e22058e...@syzkaller.appspotmail.com
> Tested-by: syzbot+bdb4bd87b5e22058e...@syzkaller.appspotmail.com
> Cc: Bobby Eshleman 
> Cc: Michael S. Tsirkin 
> Cc: Stefano Garzarella 
> Signed-off-by: Cong Wang 

Acked-by: Michael S. Tsirkin 

> ---
>  include/net/af_vsock.h|  4 
>  net/vmw_vsock/af_vsock.c  | 50 +++
>  net/vmw_vsock/vsock_bpf.c |  4 ++--
>  3 files changed, 35 insertions(+), 23 deletions(-)
> 
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index 535701efc1e5..24d970f7a4fa 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -230,8 +230,12 @@ struct vsock_tap {
>  int vsock_add_tap(struct vsock_tap *vt);
>  int vsock_remove_tap(struct vsock_tap *vt);
>  void vsock_deliver_tap(struct sk_buff *build_skb(void *opaque), void 
> *opaque);
> +int __vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, 
> size_t len,
> + int flags);
>  int vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, 
> size_t len,
> int flags);
> +int __vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> +   size_t len, int flags);
>  int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>   size_t len, int flags);
>  
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 4b040285aa78..0ff9b2dd86ba 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1270,25 +1270,28 @@ static int vsock_dgram_connect(struct socket *sock,
>   return err;
>  }
>  
> +int __vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
> +   size_t len, int flags)
> +{
> + struct sock *sk = sock->sk;
> + struct vsock_sock *vsk = vsock_sk(sk);
> +
> + return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> +}
> +
>  int vsock_dgram_recvmsg(struct socket *sock, struct msghdr *msg,
>   size_t len, int flags)
>  {
>  #ifdef CONFIG_BPF_SYSCALL
> + struct sock *sk = sock->sk;
>   const struct proto *prot;
> -#endif
> - struct vsock_sock *vsk;
> - struct sock *sk;
>  
> - sk = sock->sk;
> - vsk = vsock_sk(sk);
> -
> -#ifdef CONFIG_BPF_SYSCALL
>   prot = READ_ONCE(sk->sk_prot);
>   if (prot != &vsock_proto)
>   return prot->recvmsg(sk, msg, len, flags, NULL);
>  #endif
>  
> - return vsk->transport->dgram_dequeue(vsk, msg, len, flags);
> + return __vsock_dgram_recvmsg(sock, msg, len, flags);
>  }
>  EXPORT_SYMBOL_GPL(vsock_dgram_recvmsg);
>  
> @@ -2174,15 +2177,12 @@ static int __vsock_seqpacket_recvmsg(struct sock *sk, 
> struct msghdr *msg,
>  }
>  
>  int
> -vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t 
> len,
> -   int flags)
> +__vsock_connectible_recvmsg(struct socket *sock, struct msghdr *msg, size_t 
> len,
> + int flags)
>  {
>   struct sock *sk;
>   struct vsock_sock *vsk;
>   const struct vsock_transport *transport;
> -#ifdef CONFIG_BPF_SYSCALL
> - const struct proto *prot;
> -#endif
>   int err;
>  
>   sk = sock->sk;
> @@ -2233,14 +2233,6 @@ vsock_connectible_recvmsg(struct socket *sock, struct 
> msghdr *msg, size_t len,
>   goto out;
>   }
>  
> -#ifdef CONFIG_BPF_SYSCALL
> - prot = READ_ONCE(sk->sk_prot);
> - if (prot != &vsock_proto) {
> - release_sock(sk);
> - return prot->recvmsg(sk, msg, len, flags, NULL);
> - }
> -#endif
> -
>   if (sk->sk_type == SOCK_STREAM)
>   err = __vsock_stream_recvmsg(sk, msg, len, flags);
>   else
> @@ -2250,6 +2242,22 @@ vsock_connectible_recvmsg(struct socket *sock, struct 
> msghdr *msg, size_t len,
>   release_sock(sk);
>   return err;
>  }
> +
> +int
> +vsock_connectible_recvmsg(struct sock

Re: [PATCH net-next V6 0/4] virtio-net: synchronize op/admin state

2024-08-07 Thread Michael S. Tsirkin
On Tue, Aug 06, 2024 at 10:22:20AM +0800, Jason Wang wrote:
> Hi All:
> 
> This series tries to synchronize the operstate with the admin state
> which allows the lower virtio-net to propagate the link status to the
> upper devices like macvlan.
> 
> This is done by toggling carrier during ndo_open/stop while doing
> other necessary serialization about the carrier settings during probe.
> 
> While at it, also fix a race between probe and ndo_set_features as we
> didn't initialize the guest offload setting under rtnl lock.


Acked-by: Michael S. Tsirkin 

> Changes since V5:
> 
> - Fix several typos
> - Include a new patch to synchronize probe with ndo_set_features
> 
> Changes since V4:
> 
> - do not update settings during ndo_open()
> - do not try to cancel config notification during probe() as the core
>   makes sure the config notification won't be triggered before probe is
>   done.
> - Tweak several comments.
> 
> Changes since V3:
> 
> - when the driver tries to enable the config interrupt, check for a pending
>   interrupt and execute the notification change callback if necessary
> - do not unconditionally trigger the config space read
> - do not set LINK_UP flag in ndo_open/close but depend on the
>   notification change
> - disable config change notification until ndo_open()
> - read the link status under the rtnl_lock() to prevent a race with
>   ndo_open()
> 
> Changes since V2:
> 
> - introduce config_driver_disabled and helpers
> - schedule config change work unconditionally
> 
> Thanks
> 
> Jason Wang (4):
>   virtio: rename virtio_config_enabled to virtio_config_core_enabled
>   virtio: allow driver to disable the configure change notification
>   virtio-net: synchronize operstate with admin state on up/down
>   virtio-net: synchronize probe with ndo_set_features
> 
>  drivers/net/virtio_net.c | 78 +---
>  drivers/virtio/virtio.c  | 59 +++---
>  include/linux/virtio.h   | 11 --
>  3 files changed, 105 insertions(+), 43 deletions(-)
> 
> -- 
> 2.31.1




Re: [PATCH net-next V6 4/4] virtio-net: synchronize probe with ndo_set_features

2024-08-06 Thread Michael S. Tsirkin
On Tue, Aug 06, 2024 at 10:22:24AM +0800, Jason Wang wrote:
> We calculate guest offloads during probe without the protection of
> rtnl_lock. This leads to a race between probe and ndo_set_features. Fix
> this by moving the calculation under the rtnl_lock.
> 
> Signed-off-by: Jason Wang 

Fixes tag pls?

> ---
>  drivers/net/virtio_net.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index fc5196ca8d51..1d86aa07c871 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -6596,6 +6596,11 @@ static int virtnet_probe(struct virtio_device *vdev)
>   netif_carrier_on(dev);
>   }
>  
> + for (i = 0; i < ARRAY_SIZE(guest_offloads); i++)
> + if (virtio_has_feature(vi->vdev, guest_offloads[i]))
> + set_bit(guest_offloads[i], &vi->guest_offloads);
> + vi->guest_offloads_capable = vi->guest_offloads;
> +
>   rtnl_unlock();
>  
>   err = virtnet_cpu_notif_add(vi);
> @@ -6604,11 +6609,6 @@ static int virtnet_probe(struct virtio_device *vdev)
>   goto free_unregister_netdev;
>   }
>  
> - for (i = 0; i < ARRAY_SIZE(guest_offloads); i++)
> - if (virtio_has_feature(vi->vdev, guest_offloads[i]))
> - set_bit(guest_offloads[i], &vi->guest_offloads);
> - vi->guest_offloads_capable = vi->guest_offloads;
> -
>   pr_debug("virtnet: registered device %s with %d RX and TX vq's\n",
>dev->name, max_queue_pairs);
>  
> -- 
> 2.31.1




Re: [PATCH net v3 2/2] virtio-net: unbreak vq resizing when coalescing is not negotiated

2024-08-01 Thread Michael S. Tsirkin
On Thu, Aug 01, 2024 at 08:27:39PM +0800, Heng Qi wrote:
> Don't break the resize action if the vq coalescing feature
> named VIRTIO_NET_F_VQ_NOTF_COAL is not negotiated.
> 
> Fixes: f61fe5f081cf ("virtio-net: fix the vq coalescing setting for vq 
> resize")
> Signed-off-by: Heng Qi 
> Reviewed-by: Xuan Zhuo 
> Acked-by: Eugenio Pé rez 
> Acked-by: Jason Wang 
> ---
> v2->v3:
>   - Break out the feature check and the fix into separate patches.
> 
> v1->v2:
>   - Rephrase the subject.
>   - Put the feature check inside the virtnet_send_{r,t}x_ctrl_coal_vq_cmd().
> 
>  drivers/net/virtio_net.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b1176be8fcfd..2b566d893ea3 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3749,7 +3749,7 @@ static int virtnet_set_ringparam(struct net_device *dev,
>   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, i,
>  
> vi->intr_coal_tx.max_usecs,
>  
> vi->intr_coal_tx.max_packets);
> - if (err)
> + if (err && err != -EOPNOTSUPP)
>   return err;
>   }
>  
> @@ -3764,7 +3764,7 @@ static int virtnet_set_ringparam(struct net_device *dev,
>  
> vi->intr_coal_rx.max_usecs,
>  
> vi->intr_coal_rx.max_packets);
>   mutex_unlock(&vi->rq[i].dim_lock);
> - if (err)
> + if (err && err != -EOPNOTSUPP)
>   return err;


This needs a comment.
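
Something like the below would do -- the wording is mine, just to illustrate
the kind of comment I have in mind (not a tested patch):

	err = virtnet_send_rx_ctrl_coal_vq_cmd(vi, i,
					       vi->intr_coal_rx.max_usecs,
					       vi->intr_coal_rx.max_packets);
	mutex_unlock(&vi->rq[i].dim_lock);
	/* -EOPNOTSUPP only means VIRTIO_NET_F_VQ_NOTF_COAL was not
	 * negotiated, so there are no per-queue coalescing parameters
	 * to restore; the resize itself should still go ahead.
	 */
	if (err && err != -EOPNOTSUPP)
		return err;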


>   }
>   }
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net v2] virtio-net: unbreak vq resizing when coalescing is not negotiated

2024-07-31 Thread Michael S. Tsirkin
On Thu, Aug 01, 2024 at 02:07:43PM +0800, Heng Qi wrote:
> On Wed, 31 Jul 2024 08:46:42 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Wed, Jul 31, 2024 at 08:25:23PM +0800, Heng Qi wrote:
> > > On Wed, 31 Jul 2024 08:14:43 -0400, "Michael S. Tsirkin" 
> > >  wrote:
> > > > On Wed, Jul 31, 2024 at 08:07:17PM +0800, Heng Qi wrote:
> > > > > >From the virtio spec:
> > > > > 
> > > > >   The driver MUST have negotiated the VIRTIO_NET_F_VQ_NOTF_COAL
> > > > >   feature when issuing commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET
> > > > >   and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET.
> > > > > 
> > > > > The driver must not send vq notification coalescing commands if
> > > > > VIRTIO_NET_F_VQ_NOTF_COAL is not negotiated. This limitation of course
> > > > > applies to vq resize.
> > > > > 
> > > > > Fixes: f61fe5f081cf ("virtio-net: fix the vq coalescing setting for 
> > > > > vq resize")
> > > > > Signed-off-by: Heng Qi 
> > > > > Reviewed-by: Xuan Zhuo 
> > > > > Acked-by: Eugenio Pé rez 
> > > > > Acked-by: Jason Wang 
> > > > > ---
> > > > > v1->v2:
> > > > >  - Rephrase the subject.
> > > > >  - Put the feature check inside the 
> > > > > virtnet_send_{r,t}x_ctrl_coal_vq_cmd().
> > > > > 
> > > > >  drivers/net/virtio_net.c | 10 --
> > > > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index 0383a3e136d6..2b566d893ea3 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -3658,6 +3658,9 @@ static int 
> > > > > virtnet_send_rx_ctrl_coal_vq_cmd(struct virtnet_info *vi,
> > > > >  {
> > > > >   int err;
> > > > >  
> > > > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > >   err = virtnet_send_ctrl_coal_vq_cmd(vi, rxq2vq(queue),
> > > > >   max_usecs, max_packets);
> > > > >   if (err)
> > > > > @@ -3675,6 +3678,9 @@ static int 
> > > > > virtnet_send_tx_ctrl_coal_vq_cmd(struct virtnet_info *vi,
> > > > >  {
> > > > >   int err;
> > > > >  
> > > > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> > > > > + return -EOPNOTSUPP;
> > > > > +
> > > > >   err = virtnet_send_ctrl_coal_vq_cmd(vi, txq2vq(queue),
> > > > >   max_usecs, max_packets);
> > > > >   if (err)
> > > > > @@ -3743,7 +3749,7 @@ static int virtnet_set_ringparam(struct 
> > > > > net_device *dev,
> > > > >   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, i,
> > > > >  
> > > > > vi->intr_coal_tx.max_usecs,
> > > > >  
> > > > > vi->intr_coal_tx.max_packets);
> > > > > - if (err)
> > > > > + if (err && err != -EOPNOTSUPP)
> > > > >   return err;
> > > > >   }
> > > > >
> > > > 
> > > > 
> > > > So far so good.
> > > >   
> > > > > @@ -3758,7 +3764,7 @@ static int virtnet_set_ringparam(struct 
> > > > > net_device *dev,
> > > > >  
> > > > > vi->intr_coal_rx.max_usecs,
> > > > >  
> > > > > vi->intr_coal_rx.max_packets);
> > > > >   mutex_unlock(&vi->rq[i].dim_lock);
> > > > > - if (err)
> > > > > + if (err && err != -EOPNOTSUPP)
> > > > >   return err;
> > > > >   }
> > > > >   }
> > > > 
> > > > I don't get this one. If resize is not supported,
> > > 
> > > Here means that the *dim feature* is not supported, not the *resize* 
> > > feature.
> > > 
> > > > we pretend it was successful? Why?
> > > 
> > > During a resize, if the dim feature is not supported, the driver does not
> > > need to try to recover any coalescing values, since the device does not 
> > > have
> > > these parameters.
> > > Therefore, the resize should continue without interruption.
> > > 
> > > Thanks.
> > 
> > 
> > you mean it's a separate bugfix?
> 
> Right.
> 
> Don't break resize when coalescing is not negotiated.
> 
> Thanks.

Let's make this a separate patch then, please.

> > 
> > > > 
> > > > > -- 
> > > > > 2.32.0.3.g01195cf9f
> > > > 
> > 




Re: [PATCH net v2] virtio-net: unbreak vq resizing when coalescing is not negotiated

2024-07-31 Thread Michael S. Tsirkin
On Wed, Jul 31, 2024 at 08:25:23PM +0800, Heng Qi wrote:
> On Wed, 31 Jul 2024 08:14:43 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Wed, Jul 31, 2024 at 08:07:17PM +0800, Heng Qi wrote:
> > > >From the virtio spec:
> > > 
> > >   The driver MUST have negotiated the VIRTIO_NET_F_VQ_NOTF_COAL
> > >   feature when issuing commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET
> > >   and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET.
> > > 
> > > The driver must not send vq notification coalescing commands if
> > > VIRTIO_NET_F_VQ_NOTF_COAL is not negotiated. This limitation of course
> > > applies to vq resize.
> > > 
> > > Fixes: f61fe5f081cf ("virtio-net: fix the vq coalescing setting for vq 
> > > resize")
> > > Signed-off-by: Heng Qi 
> > > Reviewed-by: Xuan Zhuo 
> > > Acked-by: Eugenio Pé rez 
> > > Acked-by: Jason Wang 
> > > ---
> > > v1->v2:
> > >  - Rephrase the subject.
> > >  - Put the feature check inside the 
> > > virtnet_send_{r,t}x_ctrl_coal_vq_cmd().
> > > 
> > >  drivers/net/virtio_net.c | 10 --
> > >  1 file changed, 8 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 0383a3e136d6..2b566d893ea3 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -3658,6 +3658,9 @@ static int virtnet_send_rx_ctrl_coal_vq_cmd(struct 
> > > virtnet_info *vi,
> > >  {
> > >   int err;
> > >  
> > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> > > + return -EOPNOTSUPP;
> > > +
> > >   err = virtnet_send_ctrl_coal_vq_cmd(vi, rxq2vq(queue),
> > >   max_usecs, max_packets);
> > >   if (err)
> > > @@ -3675,6 +3678,9 @@ static int virtnet_send_tx_ctrl_coal_vq_cmd(struct 
> > > virtnet_info *vi,
> > >  {
> > >   int err;
> > >  
> > > + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> > > + return -EOPNOTSUPP;
> > > +
> > >   err = virtnet_send_ctrl_coal_vq_cmd(vi, txq2vq(queue),
> > >   max_usecs, max_packets);
> > >   if (err)
> > > @@ -3743,7 +3749,7 @@ static int virtnet_set_ringparam(struct net_device 
> > > *dev,
> > >   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, i,
> > >  
> > > vi->intr_coal_tx.max_usecs,
> > >  
> > > vi->intr_coal_tx.max_packets);
> > > - if (err)
> > > + if (err && err != -EOPNOTSUPP)
> > >   return err;
> > >   }
> > >
> > 
> > 
> > So far so good.
> >   
> > > @@ -3758,7 +3764,7 @@ static int virtnet_set_ringparam(struct net_device 
> > > *dev,
> > >  
> > > vi->intr_coal_rx.max_usecs,
> > >  
> > > vi->intr_coal_rx.max_packets);
> > >   mutex_unlock(&vi->rq[i].dim_lock);
> > > - if (err)
> > > + if (err && err != -EOPNOTSUPP)
> > >   return err;
> > >   }
> > >   }
> > 
> > I don't get this one. If resize is not supported,
> 
> Here means that the *dim feature* is not supported, not the *resize* feature.
> 
> > we pretend it was successful? Why?
> 
> During a resize, if the dim feature is not supported, the driver does not
> need to try to recover any coalescing values, since the device does not have
> these parameters.
> Therefore, the resize should continue without interruption.
> 
> Thanks.


you mean it's a separate bugfix?

> > 
> > > -- 
> > > 2.32.0.3.g01195cf9f
> > 




Re: [PATCH net v2] virtio-net: unbreak vq resizing when coalescing is not negotiated

2024-07-31 Thread Michael S. Tsirkin
On Wed, Jul 31, 2024 at 08:07:17PM +0800, Heng Qi wrote:
> >From the virtio spec:
> 
>   The driver MUST have negotiated the VIRTIO_NET_F_VQ_NOTF_COAL
>   feature when issuing commands VIRTIO_NET_CTRL_NOTF_COAL_VQ_SET
>   and VIRTIO_NET_CTRL_NOTF_COAL_VQ_GET.
> 
> The driver must not send vq notification coalescing commands if
> VIRTIO_NET_F_VQ_NOTF_COAL is not negotiated. This limitation of course
> applies to vq resize.
> 
> Fixes: f61fe5f081cf ("virtio-net: fix the vq coalescing setting for vq 
> resize")
> Signed-off-by: Heng Qi 
> Reviewed-by: Xuan Zhuo 
> Acked-by: Eugenio Pé rez 
> Acked-by: Jason Wang 
> ---
> v1->v2:
>  - Rephrase the subject.
>  - Put the feature check inside the virtnet_send_{r,t}x_ctrl_coal_vq_cmd().
> 
>  drivers/net/virtio_net.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 0383a3e136d6..2b566d893ea3 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3658,6 +3658,9 @@ static int virtnet_send_rx_ctrl_coal_vq_cmd(struct 
> virtnet_info *vi,
>  {
>   int err;
>  
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> + return -EOPNOTSUPP;
> +
>   err = virtnet_send_ctrl_coal_vq_cmd(vi, rxq2vq(queue),
>   max_usecs, max_packets);
>   if (err)
> @@ -3675,6 +3678,9 @@ static int virtnet_send_tx_ctrl_coal_vq_cmd(struct 
> virtnet_info *vi,
>  {
>   int err;
>  
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_VQ_NOTF_COAL))
> + return -EOPNOTSUPP;
> +
>   err = virtnet_send_ctrl_coal_vq_cmd(vi, txq2vq(queue),
>   max_usecs, max_packets);
>   if (err)
> @@ -3743,7 +3749,7 @@ static int virtnet_set_ringparam(struct net_device *dev,
>   err = virtnet_send_tx_ctrl_coal_vq_cmd(vi, i,
>  
> vi->intr_coal_tx.max_usecs,
>  
> vi->intr_coal_tx.max_packets);
> - if (err)
> + if (err && err != -EOPNOTSUPP)
>   return err;
>   }
>


So far so good.
  
> @@ -3758,7 +3764,7 @@ static int virtnet_set_ringparam(struct net_device *dev,
>  
> vi->intr_coal_rx.max_usecs,
>  
> vi->intr_coal_rx.max_packets);
>   mutex_unlock(&vi->rq[i].dim_lock);
> - if (err)
> + if (err && err != -EOPNOTSUPP)
>   return err;
>   }
>   }

I don't get this one. If resize is not supported, we pretend it
was successful? Why?

> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next] net: virtio: fix virtnet_sq_free_stats initialization

2024-07-12 Thread Michael S. Tsirkin
On Fri, Jul 12, 2024 at 09:03:30AM +0100, Jean-Philippe Brucker wrote:
> Commit c8bd1f7f3e61 ("virtio_net: add support for Byte Queue Limits")
> added two new fields to struct virtnet_sq_free_stats, but commit
> 23c81a20b998 ("net: virtio: unify code to init stats") accidentally
> removed their initialization. In the worst case this can trigger the BUG
> at lib/dynamic_queue_limits.c:99 because dql_completed() receives a
> random value as count. Initialize the whole structure.
> 
> Fixes: 23c81a20b998 ("net: virtio: unify code to init stats")
> Reported-by: Aishwarya TCV 
> Signed-off-by: Jean-Philippe Brucker 


Acked-by: Michael S. Tsirkin 

> ---
> Both these patches are still in next so it might be possible to fix it
> up directly.

I'd be fine with squashing but I don't think it's done in net-next.

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 10d8674eec5d2..f014802522e0f 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -530,7 +530,7 @@ static void __free_old_xmit(struct send_queue *sq, struct 
> netdev_queue *txq,
>   unsigned int len;
>   void *ptr;
>  
> - stats->bytes = stats->packets = 0;
> + memset(stats, 0, sizeof(*stats));
>  
>   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
>   if (!is_xdp_frame(ptr)) {
> 
> base-commit: 3fe121b622825ff8cc995a1e6b026181c48188db
> -- 
> 2.45.2




Re: [PATCH net-next v8 00/10] virtio-net: support AF_XDP zero copy

2024-07-09 Thread Michael S. Tsirkin
On Mon, Jul 08, 2024 at 07:25:27PM +0800, Xuan Zhuo wrote:
> v8:
> 1. virtnet_add_recvbuf_xsk() always return err, when encounters error
> 
> v7:
> 1. some small fixes
> 
> v6:
> 1. start from supporting the rx zerocopy
> 
> v5:
> 1. fix the comments of last version
> 
> http://lore.kernel.org/all/2024064147.31320-1-xuanz...@linux.alibaba.com
> v4:
> 1. remove the commits that introduce the independent directory
> 2. remove the supporting for the rx merge mode (for limit 15
>commits of net-next). Let's start with the small mode.
> 3. merge some commits and remove some not important commits


Series:

Acked-by: Michael S. Tsirkin 

> ## AF_XDP
> 
> XDP socket (AF_XDP) is an excellent kernel-bypass network framework. The zero
> copy feature of xsk (XDP socket) needs to be supported by the driver, and the
> performance of zero copy is very good. mlx5 and Intel ixgbe already support
> this feature; this patch set allows virtio-net to support xsk's zerocopy xmit
> feature.
> 
> At present, we have completed some preparation:
> 
> 1. vq-reset (virtio spec and kernel code)
> 2. virtio-core premapped dma
> 3. virtio-net xdp refactor
> 
> So it is time for Virtio-Net to complete the support for the XDP Socket
> Zerocopy.
> 
> Virtio-net cannot increase the number of queues at will, so xsk shares the
> queue with the kernel.
> 
> On the other hand, virtio-net does not support generating an interrupt from
> the driver manually, so when we wake up tx xmit we use some tricks. If the CPU
> that ran TX NAPI last time is a different CPU, use an IPI to wake up NAPI on
> the remote CPU. If it is the local CPU, then we wake up NAPI directly.
> 
> This patch set includes some refactoring of virtio-net to let it support
> AF_XDP.
> 
> ## Run & Test
> 
> Because there are too many commits, the work of virtio-net supporting af-xdp
> is split into an rx part and a tx part. This patch set is for the rx part.
> 
> So the flag NETDEV_XDP_ACT_XSK_ZEROCOPY is not added; if someone wants to
> test af-xdp rx, the flag needs to be added locally.
> 
> ## performance
> 
> ENV: Qemu with vhost-user(polling mode).
> Host CPU: Intel(R) Xeon(R) Platinum 8163 CPU @ 2.50GHz
> 
> ### virtio PMD in guest with testpmd
> 
> testpmd> show port stats all
> 
>   NIC statistics for port 0 
>  RX-packets: 19531092064 RX-missed: 0 RX-bytes: 1093741155584
>  RX-errors: 0
>  RX-nombuf: 0
>  TX-packets: 595992 TX-errors: 0 TX-bytes: 371030645664
> 
> 
>  Throughput (since last show)
>  Rx-pps:   8861574 Rx-bps:  3969985208
>  Tx-pps:   8861493 Tx-bps:  3969962736
>  
> 
> ### AF_XDP PMD in guest with testpmd
> 
> testpmd> show port stats all
> 
>    NIC statistics for port 0  
>   RX-packets: 68152727   RX-missed: 0  RX-bytes:  3816552712
>   RX-errors: 0
>   RX-nombuf:  0
>   TX-packets: 68114967   TX-errors: 33216  TX-bytes:  3814438152
> 
>   Throughput (since last show)
>   Rx-pps:  6333196  Rx-bps:   2837272088
>   Tx-pps:  6333227  Tx-bps:   2837285936
>   
> 
> But AF_XDP consumes more CPU for tx and rx napi(100% and 86%).
> 
> Please review.
> 
> Thanks.
> 
> v3
> 1. virtio introduces helpers for virtio-net sq using premapped dma
> 2. xsk has more complete support for merge mode
> 3. fix some problems
> 
> v2
> 1. wakeup uses the way of GVE. No send ipi to wakeup napi on remote cpu.
> 2. remove rcu. Because we synchronize all operat, so the rcu is not 
> needed.
> 3. split the commit "move to virtio_net.h" in last patch set. Just move 
> the
>struct/api to header when we use them.
> 4. add comments for some code
> 
> v1:
> 1. remove two virtio commits. Push this patchset to net-next
> 2. squash "virtio_net: virtnet_poll_tx support rescheduled" to xsk: 
> support tx
> 3. fix some warnings
> 
> 
> 
> 
> 
> 
> 
> 
> Xuan Zhuo (10):
>   virtio_net: replace VIRTIO_XDP_HEADROOM by XDP_PACKET_HEADROOM
>   virtio_net: separate virtnet_rx_resize()
>   virtio_net: separate virtnet_tx_resize()
>   virtio_net: separate receive_buf
>   virtio_net: separate receive_mergeable
>   virtio_net: xsk: bind/unbind xsk for rx
>   virtio_net: xsk: support wakeup
>   virtio_net: xsk: rx: support fill with xsk buffer
>   virtio_net: xsk: rx: support recv small mode
>   virtio_net: xsk: rx: support recv merge mode
> 
>  drivers/net/virtio_net.c | 770 ++-
>  1 file changed, 676 insertions(+), 94 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH] virtio_net: Use u64_stats_fetch_begin() for stats fetch

2024-07-09 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 10:55:29AM +0800, Li RongQing wrote:
> This place is fetching the stats, so u64_stats_fetch_begin
> and u64_stats_fetch_retry should be used
> 
> Fixes: 6208799553a8 ("virtio-net: support rx netdim")
> Signed-off-by: Li RongQing 

So I dropped this from my tree, if you think it's
still necessary, pls resubmit to net-next.

Thanks!

> ---
>  drivers/net/virtio_net.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d1..b669e73 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2332,16 +2332,18 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
>  static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> receive_queue *rq)
>  {
>   struct dim_sample cur_sample = {};
> + unsigned int start;
>  
>   if (!rq->packets_in_napi)
>   return;
>  
> - u64_stats_update_begin(&rq->stats.syncp);
> - dim_update_sample(rq->calls,
> -   u64_stats_read(&rq->stats.packets),
> -   u64_stats_read(&rq->stats.bytes),
> -   &cur_sample);
> - u64_stats_update_end(&rq->stats.syncp);
> + do {
> + start = u64_stats_fetch_begin(&rq->stats.syncp);
> + dim_update_sample(rq->calls,
> + u64_stats_read(&rq->stats.packets),
> + u64_stats_read(&rq->stats.bytes),
> + &cur_sample);
> + } while (u64_stats_fetch_retry(&rq->stats.syncp, start));
>  
>   net_dim(&rq->dim, cur_sample);
>   rq->packets_in_napi = 0;
> -- 
> 2.9.4




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-26 Thread Michael S. Tsirkin
On Wed, Jun 26, 2024 at 10:43:11AM +0200, Jiri Pirko wrote:
> Wed, Jun 26, 2024 at 10:08:14AM CEST, m...@redhat.com wrote:
> >On Wed, Jun 26, 2024 at 09:52:58AM +0200, Jiri Pirko wrote:
> >> Thu, Jun 20, 2024 at 12:31:34PM CEST, hen...@linux.alibaba.com wrote:
> >> >On Thu, 20 Jun 2024 06:11:40 -0400, "Michael S. Tsirkin" 
> >> > wrote:
> >> >> On Thu, Jun 20, 2024 at 06:10:51AM -0400, Michael S. Tsirkin wrote:
> >> >> > On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> >> >> > > On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang 
> >> >> > >  wrote:
> >> >> > > > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  
> >> >> > > > wrote:
> >> >> > > > >
> >> >> > > > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi 
> >> >> > > > >  wrote:
> >> >> > > > > >
> >> >> > > > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> >> >> > > > > >  wrote:
> >> >> > > > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> >> >> > > > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> >> >> > > > > > > > virtnet_info *vi)
> >> >> > > > > > > >
> >> >> > > > > > > > /* Parameters for control virtqueue, if any */
> >> >> > > > > > > > if (vi->has_cvq) {
> >> >> > > > > > > > -   callbacks[total_vqs - 1] = NULL;
> >> >> > > > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> >> >> > > > > > > > names[total_vqs - 1] = "control";
> >> >> > > > > > > > }
> >> >> > > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > If the # of MSIX vectors is exactly for data path VQs,
> >> >> > > > > > > this will cause irq sharing between VQs which will degrade
> >> >> > > > > > > performance significantly.
> >> >> > > > > > >
> >> >> > > > >
> >> >> > > > > Why do we need to care about buggy management? I think libvirt 
> >> >> > > > > has
> >> >> > > > > been teached to use 2N+2 since the introduction of the 
> >> >> > > > > multiqueue[1].
> >> >> > > > 
> >> >> > > > And Qemu can calculate it correctly automatically since:
> >> >> > > > 
> >> >> > > > commit 51a81a2118df0c70988f00d61647da9e298483a4
> >> >> > > > Author: Jason Wang 
> >> >> > > > Date:   Mon Mar 8 12:49:19 2021 +0800
> >> >> > > > 
> >> >> > > > virtio-net: calculating proper msix vectors on init
> >> >> > > > 
> >> >> > > > Currently, the default msix vectors for virtio-net-pci is 3 
> >> >> > > > which is
> >> >> > > > obvious not suitable for multiqueue guest, so we depends on 
> >> >> > > > the user
> >> >> > > > or management tools to pass a correct vectors parameter. In 
> >> >> > > > fact, we
> >> >> > > > can simplifying this by calculating the number of vectors on 
> >> >> > > > realize.
> >> >> > > > 
> >> >> > > > Consider we have N queues, the number of vectors needed is 
> >> >> > > > 2*N + 2
> >> >> > > > (#queue pairs + plus one config interrupt and control vq). We 
> >> >> > > > didn't
> >> >> > > > check whether or not host support control vq because it was 
> >> >> > > > added
> >> >> > > > unconditionally by qemu to avoid breaking legacy guests such 
> >> >> > > > as Minix.
> >> >> > > > 
> >> >> > > > Reviewed-by: Philippe Mathieu-Daudé 

Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-26 Thread Michael S. Tsirkin
On Wed, Jun 26, 2024 at 09:52:58AM +0200, Jiri Pirko wrote:
> Thu, Jun 20, 2024 at 12:31:34PM CEST, hen...@linux.alibaba.com wrote:
> >On Thu, 20 Jun 2024 06:11:40 -0400, "Michael S. Tsirkin"  
> >wrote:
> >> On Thu, Jun 20, 2024 at 06:10:51AM -0400, Michael S. Tsirkin wrote:
> >> > On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> >> > > On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang  
> >> > > wrote:
> >> > > > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  
> >> > > > wrote:
> >> > > > >
> >> > > > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi  
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> >> > > > > >  wrote:
> >> > > > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> >> > > > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> >> > > > > > > > virtnet_info *vi)
> >> > > > > > > >
> >> > > > > > > > /* Parameters for control virtqueue, if any */
> >> > > > > > > > if (vi->has_cvq) {
> >> > > > > > > > -   callbacks[total_vqs - 1] = NULL;
> >> > > > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> >> > > > > > > > names[total_vqs - 1] = "control";
> >> > > > > > > > }
> >> > > > > > > >
> >> > > > > > >
> >> > > > > > > If the # of MSIX vectors is exactly for data path VQs,
> >> > > > > > > this will cause irq sharing between VQs which will degrade
> >> > > > > > > performance significantly.
> >> > > > > > >
> >> > > > >
> >> > > > > Why do we need to care about buggy management? I think libvirt has
> >> > > > > been teached to use 2N+2 since the introduction of the 
> >> > > > > multiqueue[1].
> >> > > > 
> >> > > > And Qemu can calculate it correctly automatically since:
> >> > > > 
> >> > > > commit 51a81a2118df0c70988f00d61647da9e298483a4
> >> > > > Author: Jason Wang 
> >> > > > Date:   Mon Mar 8 12:49:19 2021 +0800
> >> > > > 
> >> > > > virtio-net: calculating proper msix vectors on init
> >> > > > 
> >> > > > Currently, the default msix vectors for virtio-net-pci is 3 
> >> > > > which is
> >> > > > obvious not suitable for multiqueue guest, so we depends on the 
> >> > > > user
> >> > > > or management tools to pass a correct vectors parameter. In 
> >> > > > fact, we
> >> > > > can simplifying this by calculating the number of vectors on 
> >> > > > realize.
> >> > > > 
> >> > > > Consider we have N queues, the number of vectors needed is 2*N + 
> >> > > > 2
> >> > > > (#queue pairs + plus one config interrupt and control vq). We 
> >> > > > didn't
> >> > > > check whether or not host support control vq because it was added
> >> > > > unconditionally by qemu to avoid breaking legacy guests such as 
> >> > > > Minix.
> >> > > > 
> >> > > > Reviewed-by: Philippe Mathieu-Daudé 
> >> > > > Reviewed-by: Stefano Garzarella 
> >> > > > Reviewed-by: Stefan Hajnoczi 
> >> > > > Signed-off-by: Jason Wang 
> >> > > 
> >> > > Yes, devices designed according to the spec need to reserve an 
> >> > > interrupt
> >> > > vector for ctrlq. So, Michael, do we want to be compatible with buggy 
> >> > > devices?
> >> > > 
> >> > > Thanks.
> >> > 
> >> > These aren't buggy, the spec allows this. So don't fail, but
> >> > I'm fine with using polling if not enough vectors.
> >> 
> >> sharing with config interrupt is easier code-wise though, FWIW -
> >> we don't need to maintain two code

Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-25 Thread Michael S. Tsirkin
On Tue, Jun 25, 2024 at 09:27:24AM +0800, Jason Wang wrote:
> On Thu, Jun 20, 2024 at 6:12 PM Michael S. Tsirkin  wrote:
> >
> > On Thu, Jun 20, 2024 at 06:10:51AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> > > > On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang  
> > > > wrote:
> > > > > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> > > > > > >  wrote:
> > > > > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > > > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> > > > > > > > > virtnet_info *vi)
> > > > > > > > >
> > > > > > > > > /* Parameters for control virtqueue, if any */
> > > > > > > > > if (vi->has_cvq) {
> > > > > > > > > -   callbacks[total_vqs - 1] = NULL;
> > > > > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> > > > > > > > > names[total_vqs - 1] = "control";
> > > > > > > > > }
> > > > > > > > >
> > > > > > > >
> > > > > > > > If the # of MSIX vectors is exactly for data path VQs,
> > > > > > > > this will cause irq sharing between VQs which will degrade
> > > > > > > > performance significantly.
> > > > > > > >
> > > > > >
> > > > > > Why do we need to care about buggy management? I think libvirt has
> > > > > > been teached to use 2N+2 since the introduction of the 
> > > > > > multiqueue[1].
> > > > >
> > > > > And Qemu can calculate it correctly automatically since:
> > > > >
> > > > > commit 51a81a2118df0c70988f00d61647da9e298483a4
> > > > > Author: Jason Wang 
> > > > > Date:   Mon Mar 8 12:49:19 2021 +0800
> > > > >
> > > > > virtio-net: calculating proper msix vectors on init
> > > > >
> > > > > Currently, the default msix vectors for virtio-net-pci is 3 which 
> > > > > is
> > > > > obvious not suitable for multiqueue guest, so we depends on the 
> > > > > user
> > > > > or management tools to pass a correct vectors parameter. In fact, 
> > > > > we
> > > > > can simplifying this by calculating the number of vectors on 
> > > > > realize.
> > > > >
> > > > > Consider we have N queues, the number of vectors needed is 2*N + 2
> > > > > (#queue pairs + plus one config interrupt and control vq). We 
> > > > > didn't
> > > > > check whether or not host support control vq because it was added
> > > > > unconditionally by qemu to avoid breaking legacy guests such as 
> > > > > Minix.
> > > > >
> > > > > Reviewed-by: Philippe Mathieu-Daudé 
> > > > > Reviewed-by: Stefano Garzarella 
> > > > > Reviewed-by: Stefan Hajnoczi 
> > > > > Signed-off-by: Jason Wang 
> > > >
> > > > Yes, devices designed according to the spec need to reserve an interrupt
> > > > vector for ctrlq. So, Michael, do we want to be compatible with buggy 
> > > > devices?
> > > >
> > > > Thanks.
> > >
> > > These aren't buggy, the spec allows this.
> 
> So it doesn't differ from the case when we are lacking sufficient msix
> vectors in the case of multiqueue. In that case we just fallback to
> share one msix for all queues and another for config and we don't
> bother at that time.

sharing queues is exactly "bothering".

> Any reason to bother now?
> 
> Thanks

This patch can make datapath slower for such configs by switching to
sharing msix for a control path benefit. Not a tradeoff many users want
to make.


> > >  So don't fail, but
> > > I'm fine with using polling if not enough vectors.
> >
> > sharing with config interrupt is easier code-wise though, FWIW -
> > we don't need to maintain two code-paths.
> >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > > > So no, you can not just do it unconditionally.
> > > > > > > >
> > > > > > > > The correct fix probably requires virtio core/API extensions.
> > > > > > >
> > > > > > > If the introduction of cvq irq causes interrupts to become 
> > > > > > > shared, then
> > > > > > > ctrlq need to fall back to polling mode and keep the status quo.
> > > > > >
> > > > > > Having to path sounds a burden.
> > > > > >
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > [1] https://www.linux-kvm.org/page/Multiqueue
> > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > MST
> > > > > > > >
> > > > > > >
> > > > >
> >




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-24 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> If the device does not respond to a request for a long time,
> then control vq polling elevates CPU utilization, a problem that
> exacerbates with more command requests.
> 
> Enabling control vq's irq is advantageous for the guest, and
> this still doesn't support concurrent requests.
> 
> Suggested-by: Jason Wang 
> Signed-off-by: Heng Qi 
> ---
>  drivers/net/virtio_net.c | 22 +-
>  1 file changed, 13 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index b45f58a902e3..ed10084997d3 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -372,6 +372,8 @@ struct virtio_net_ctrl_rss {
>  struct control_buf {
>   struct virtio_net_ctrl_hdr hdr;
>   virtio_net_ctrl_ack status;
> + /* Wait for the device to complete the cvq request. */
> + struct completion completion;
>  };
>  
>  struct virtnet_info {
> @@ -664,6 +666,13 @@ static bool virtqueue_napi_complete(struct napi_struct 
> *napi,
>   return false;
>  }
>  
> +static void virtnet_cvq_done(struct virtqueue *cvq)
> +{
> + struct virtnet_info *vi = cvq->vdev->priv;
> +
> + complete(&vi->ctrl->completion);
> +}
> +
>  static void skb_xmit_done(struct virtqueue *vq)
>  {
>   struct virtnet_info *vi = vq->vdev->priv;

> @@ -2724,14 +2733,8 @@ static bool virtnet_send_command_reply(struct 
> virtnet_info *vi,
>   if (unlikely(!virtqueue_kick(vi->cvq)))
>   goto unlock;
>  
> - /* Spin for a response, the kick causes an ioport write, trapping
> -  * into the hypervisor, so the request should be handled immediately.
> -  */
> - while (!virtqueue_get_buf(vi->cvq, &tmp) &&
> -!virtqueue_is_broken(vi->cvq)) {
> - cond_resched();
> - cpu_relax();
> - }
> + wait_for_completion(&ctrl->completion);
> + virtqueue_get_buf(vi->cvq, &tmp);
>  
>  unlock:
>   ok = ctrl->status == VIRTIO_NET_OK;

Hmm no this is not a good idea, code should be robust in case
of spurious interrupts.

Also surprise removal is now broken ...
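
Something along these lines would be more robust -- a rough sketch only,
reusing the completion this patch adds (untested, and locking around ctrl is
left out of the picture):

	static int virtnet_cvq_wait(struct virtnet_info *vi)
	{
		unsigned int tmp;

		for (;;) {
			/* What matters is the buffer coming back, not the
			 * interrupt itself.
			 */
			if (virtqueue_get_buf(vi->cvq, &tmp))
				return 0;
			/* Surprise removal / broken device: give up. */
			if (virtqueue_is_broken(vi->cvq))
				return -EIO;
			/* A spurious completion simply loops back to the
			 * checks above.
			 */
			wait_for_completion(&vi->ctrl->completion);
		}
	}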


> @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  
>   /* Parameters for control virtqueue, if any */
>   if (vi->has_cvq) {
> - callbacks[total_vqs - 1] = NULL;
> + callbacks[total_vqs - 1] = virtnet_cvq_done;
>   names[total_vqs - 1] = "control";
>   }
>  
> @@ -5832,6 +5835,7 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (vi->has_rss || vi->has_rss_hash_report)
>   virtnet_init_default_rss(vi);
>  
> + init_completion(&vi->ctrl->completion);
>   enable_rx_mode_work(vi);
>  
>   /* serialize netdev register + virtio_device_ready() with ndo_open() */
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net 0/2] virtio_net: fix race on control_buf

2024-06-23 Thread Michael S. Tsirkin
On Tue, May 28, 2024 at 03:52:24PM +0800, Heng Qi wrote:
> Patch 1 did a simple rename, leaving 'ret' for patch 2.
> Patch 2 fixed a race between reading the device response and the
> new command submission.


Acked-by: Michael S. Tsirkin 


> Heng Qi (2):
>   virtio_net: rename ret to err
>   virtio_net: fix missing lock protection on control_buf access
> 
>  drivers/net/virtio_net.c | 12 +++-
>  1 file changed, 7 insertions(+), 5 deletions(-)
> 
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-21 Thread Michael S. Tsirkin
On Fri, Jun 21, 2024 at 03:41:46PM +0800, Xuan Zhuo wrote:
> On Thu, 20 Jun 2024 06:11:40 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Thu, Jun 20, 2024 at 06:10:51AM -0400, Michael S. Tsirkin wrote:
> > > On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> > > > On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang  
> > > > wrote:
> > > > > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  
> > > > > wrote:
> > > > > >
> > > > > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi  
> > > > > > wrote:
> > > > > > >
> > > > > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> > > > > > >  wrote:
> > > > > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > > > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> > > > > > > > > virtnet_info *vi)
> > > > > > > > >
> > > > > > > > > /* Parameters for control virtqueue, if any */
> > > > > > > > > if (vi->has_cvq) {
> > > > > > > > > -   callbacks[total_vqs - 1] = NULL;
> > > > > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> > > > > > > > > names[total_vqs - 1] = "control";
> > > > > > > > > }
> > > > > > > > >
> > > > > > > >
> > > > > > > > If the # of MSIX vectors is exactly for data path VQs,
> > > > > > > > this will cause irq sharing between VQs which will degrade
> > > > > > > > performance significantly.
> > > > > > > >
> > > > > >
> > > > > > Why do we need to care about buggy management? I think libvirt has
> > > > > > been teached to use 2N+2 since the introduction of the 
> > > > > > multiqueue[1].
> > > > >
> > > > > And Qemu can calculate it correctly automatically since:
> > > > >
> > > > > commit 51a81a2118df0c70988f00d61647da9e298483a4
> > > > > Author: Jason Wang 
> > > > > Date:   Mon Mar 8 12:49:19 2021 +0800
> > > > >
> > > > > virtio-net: calculating proper msix vectors on init
> > > > >
> > > > > Currently, the default msix vectors for virtio-net-pci is 3 which 
> > > > > is
> > > > > obvious not suitable for multiqueue guest, so we depends on the 
> > > > > user
> > > > > or management tools to pass a correct vectors parameter. In fact, 
> > > > > we
> > > > > can simplifying this by calculating the number of vectors on 
> > > > > realize.
> > > > >
> > > > > Consider we have N queues, the number of vectors needed is 2*N + 2
> > > > > (#queue pairs + plus one config interrupt and control vq). We 
> > > > > didn't
> > > > > check whether or not host support control vq because it was added
> > > > > unconditionally by qemu to avoid breaking legacy guests such as 
> > > > > Minix.
> > > > >
> > > > > Reviewed-by: Philippe Mathieu-Daudé 
> > > > > Reviewed-by: Stefano Garzarella 
> > > > > Reviewed-by: Stefan Hajnoczi 
> > > > > Signed-off-by: Jason Wang 
> > > >
> > > > Yes, devices designed according to the spec need to reserve an interrupt
> > > > vector for ctrlq. So, Michael, do we want to be compatible with buggy 
> > > > devices?
> > > >
> > > > Thanks.
> > >
> > > These aren't buggy, the spec allows this. So don't fail, but
> > > I'm fine with using polling if not enough vectors.
> >
> > sharing with config interrupt is easier code-wise though, FWIW -
> > we don't need to maintain two code-paths.
> 
> 
> If we do that, we should introduce a new helper, not add a new function to
> find_vqs().
> 
> if ->share_irq_with_config:
>   ->share_irq_with_config(..., vq_idx)
> 
> Thanks.

Generally, I'd like to avoid something very narrow and single-use
in the API. Just telling virtio core that a vq is slow-path
sounds generic enough, and we should be able to extend that
later to support sharing between VQs.
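
To make that concrete -- a purely illustrative sketch, where the structure and
field names below are made up and not an existing virtio API:

	/* The driver only states "this vq is slow path"; the core/transport
	 * decides whether that means sharing an irq with the config vector,
	 * sharing with another slow vq, or falling back to polling.
	 */
	struct vq_info {
		vq_callback_t *callback;
		const char *name;
		bool slow_path;	/* ok to share / skip a dedicated vector */
	};

	/* virtio-net would then simply mark the control vq: */
	info[total_vqs - 1].callback  = virtnet_cvq_done;
	info[total_vqs - 1].name      = "control";
	info[total_vqs - 1].slow_path = true;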

> 
> >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > > > So no, you can not just do it unconditionally.
> > > > > > > >
> > > > > > > > The correct fix probably requires virtio core/API extensions.
> > > > > > >
> > > > > > > If the introduction of cvq irq causes interrupts to become 
> > > > > > > shared, then
> > > > > > > ctrlq need to fall back to polling mode and keep the status quo.
> > > > > >
> > > > > > Having to path sounds a burden.
> > > > > >
> > > > > > >
> > > > > > > Thanks.
> > > > > > >
> > > > > >
> > > > > >
> > > > > > Thanks
> > > > > >
> > > > > > [1] https://www.linux-kvm.org/page/Multiqueue
> > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > MST
> > > > > > > >
> > > > > > >
> > > > >
> >




Re: [PATCH][net-next,v2] virtio_net: Remove u64_stats_update_begin()/end() for stats fetch

2024-06-21 Thread Michael S. Tsirkin
On Fri, Jun 21, 2024 at 05:45:52PM +0800, Li RongQing wrote:
> This place is fetching the stats, so u64_stats_update_begin()/end()
> should not be used; the fetcher of the stats is in the same context
> as the updater of the stats, so no protection is needed
> 
> Suggested-by: Jakub Kicinski 
> Signed-off-by: Li RongQing 


I like the added comment, makes things clearer.

Acked-by: Michael S. Tsirkin 


> ---
>  drivers/net/virtio_net.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d1..6604339 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2336,12 +2336,12 @@ static void virtnet_rx_dim_update(struct virtnet_info 
> *vi, struct receive_queue
>   if (!rq->packets_in_napi)
>   return;
>  
> - u64_stats_update_begin(&rq->stats.syncp);
> + /* Don't need protection when fetching stats, since fetcher and
> +  * updater of the stats are in same context */
>   dim_update_sample(rq->calls,
> u64_stats_read(&rq->stats.packets),
> u64_stats_read(&rq->stats.bytes),
> &cur_sample);
> - u64_stats_update_end(&rq->stats.syncp);
>  
>   net_dim(&rq->dim, cur_sample);
>   rq->packets_in_napi = 0;
> -- 
> 2.9.4




Re: [PATCH] virtio_net: Use u64_stats_fetch_begin() for stats fetch

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 07:09:08AM -0700, Jakub Kicinski wrote:
> On Wed, 19 Jun 2024 10:55:29 +0800 Li RongQing wrote:
> > This place is fetching the stats, so u64_stats_fetch_begin
> > and u64_stats_fetch_retry should be used
> > 
> > Fixes: 6208799553a8 ("virtio-net: support rx netdim")
> > Signed-off-by: Li RongQing 
> > ---
> >  drivers/net/virtio_net.c | 14 --
> >  1 file changed, 8 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 61a57d1..b669e73 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -2332,16 +2332,18 @@ static void virtnet_poll_cleantx(struct 
> > receive_queue *rq)
> >  static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> > receive_queue *rq)
> >  {
> > struct dim_sample cur_sample = {};
> > +   unsigned int start;
> >  
> > if (!rq->packets_in_napi)
> > return;
> >  
> > -   u64_stats_update_begin(&rq->stats.syncp);
> > -   dim_update_sample(rq->calls,
> > - u64_stats_read(&rq->stats.packets),
> > - u64_stats_read(&rq->stats.bytes),
> > - &cur_sample);
> > -   u64_stats_update_end(&rq->stats.syncp);
> > +   do {
> > +   start = u64_stats_fetch_begin(&rq->stats.syncp);
> > +   dim_update_sample(rq->calls,
> > +   u64_stats_read(&rq->stats.packets),
> > +   u64_stats_read(&rq->stats.bytes),
> > +   &cur_sample);
> > +   } while (u64_stats_fetch_retry(&rq->stats.syncp, start));
> 
> Did you by any chance use an automated tool of any sort to find this
> issue or generate the fix?
> 
> I don't think this is actually necessary here, you're in the same
> context as the updater of the stats, you don't need any protection.
> You can remove u64_stats_update_begin() / end() (in net-next, there's
> no bug).
> 
> I won't comment on implications of calling dim_update_sample() in 
> a loop.

I didn't realize there are any - it seems to be idempotent, no?


> Please make sure you answer my "did you use a tool" question, I'm
> really curious.
> -- 
> pw-bot: cr




Re: [PATCH net-next v6 10/10] virtio_net: xsk: rx: free the unused xsk buffer

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 12:46:24PM +0200, Paolo Abeni wrote:
> On Tue, 2024-06-18 at 15:56 +0800, Xuan Zhuo wrote:
> > Release the xsk buffer, when the queue is releasing or the queue is
> > resizing.
> > 
> > Signed-off-by: Xuan Zhuo 
> > ---
> >  drivers/net/virtio_net.c | 5 +
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index cfa106aa8039..33695b86bd99 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -967,6 +967,11 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue 
> > *vq, void *buf)
> >  
> > rq = &vi->rq[i];
> >  
> > +   if (rq->xsk.pool) {
> > +   xsk_buff_free((struct xdp_buff *)buf);
> > +   return;
> > +   }
> > +
> > if (!vi->big_packets || vi->mergeable_rx_bufs)
> > virtnet_rq_unmap(rq, buf, 0);
> 
> 
> I'm under the impression this should be squashed in a previous patch,
> likely "virtio_net: xsk: bind/unbind xsk for rx"
> 
> Thanks,
> 
> Paolo


agreed, looks weird.




Re: [PATCH 2/2] virtio_net: fixing XDP for fully checksummed packets handling

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 06:38:49PM +0800, Heng Qi wrote:
> On Thu, 20 Jun 2024 18:27:16 +0800, Heng Qi  wrote:
> > On Thu, 20 Jun 2024 06:19:01 -0400, "Michael S. Tsirkin"  
> > wrote:
> > > On Thu, Jun 20, 2024 at 05:28:48PM +0800, Heng Qi wrote:
> > > > On Thu, 20 Jun 2024 16:33:35 +0800, Jason Wang  
> > > > wrote:
> > > > > On Tue, Jun 18, 2024 at 11:17 AM Heng Qi  
> > > > > wrote:
> > > > > >
> > > > > > On Tue, 18 Jun 2024 11:10:26 +0800, Jason Wang 
> > > > > >  wrote:
> > > > > > > On Mon, Jun 17, 2024 at 9:15 PM Heng Qi 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > The XDP program can't correctly handle partially checksummed
> > > > > > > > packets, but works fine with fully checksummed packets.
> > > > > > >
> > > > > > > Not sure this is ture, if I was not wrong, XDP can try to 
> > > > > > > calculate checksum.
> > > > > >
> > > > > > XDP's interface serves a full checksum,
> > > > > 
> > > > > What do you mean by "serve" here? I mean, XDP can calculate the
> > > > > checksum and fill it in the packet by itself.
> > > > > 
> > > > 
> > > > Yes, XDP can parse and calculate checksums for all packets.
> > > > However, the bpf_csum_diff and bpf_l4_csum_replace APIs provided by XDP 
> > > > assume
> > > > that the packets being processed are fully checksummed packets. That is,
> > > > after the XDP program modified the packets, the incremental checksum 
> > > > can be
> > > > calculated (for example, samples/bpf/tcbpf1_kern.c, 
> > > > samples/bpf/test_lwt_bpf.c).
> > > > 
> > > > Therefore, partially checksummed packets cannot be processed normally 
> > > > in these
> > > > examples and need to be discarded.
> > > > 
> > > > > > and this is why we disabled the
> > > > > > offloading of VIRTIO_NET_F_GUEST_CSUM when loading XDP.
> > > > > 
> > > > > If we trust the device to disable VIRTIO_NET_F_GUEST_CSUM, any reason
> > > > > to check VIRTIO_NET_HDR_F_NEEDS_CSUM again in the receive path?
> > > > 
> > > > There doesn't seem to be a mandatory constraint in the spec that 
> > > > devices that
> > > > haven't negotiated VIRTIO_NET_F_GUEST_CSUM cannot set NEEDS_CSUM bit, 
> > > > so I check this.
> > > > 
> > > > Thanks.
> > > 
> > > The spec says:
> > > 
> > > \item If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
> > >   VIRTIO_NET_HDR_F_NEEDS_CSUM bit in \field{flags} can be
> > >   set: if so, the packet checksum at offset \field{csum_offset} 
> > >   from \field{csum_start} and any preceding checksums
> > >   have been validated.  The checksum on the packet is incomplete and
> > >   if bit VIRTIO_NET_HDR_F_RSC_INFO is not set in \field{flags},
> > >   then \field{csum_start} and \field{csum_offset} indicate how to 
> > > calculate it
> > >   (see Packet Transmission point 1).
> > > 
> > > 
> > > So yes, NEEDS_CSUM without VIRTIO_NET_F_GUEST_CSUM is at best undefined.
> > > Please do not try to use it unless VIRTIO_NET_F_GUEST_CSUM is set.
> > 
> > I've seen it before, but thought something like
> >  "The device MUST NOT set the NEEDS_CSUM bit if GUEST_CSUM has not been 
> > negotiated"
> > would be clearer.
> > 
> > Furthermore, it is still possible for a malicious device to set the bit.
> 
> Hint:
> We previously checked and used DATA_VALID and NEEDS_CSUM bits, but never 
> checked
> to see if GUEST_CSUM was negotiated.


That would be out of spec then. Might be too late to fix.
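
The flexible variant would look roughly like this in the receive path -- a
sketch only, not a tested patch:

	/* Only trust the device's checksum hints if VIRTIO_NET_F_GUEST_CSUM
	 * was actually negotiated; otherwise treat the packet as not
	 * checksummed at all.
	 */
	if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM)) {
		if (hdr->hdr.flags & VIRTIO_NET_HDR_F_DATA_VALID)
			skb->ip_summed = CHECKSUM_UNNECESSARY;
	} else {
		skb->ip_summed = CHECKSUM_NONE;
	}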

> > 
> > Thanks.
> > 
> > > 
> > > And if you want to be flexible, ignore it unless VIRTIO_NET_F_GUEST_CSUM
> > > has been negotiated.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > > > 
> > > > > >
> > > > > > Thanks.
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > >
> > > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > > > If the
> &

Re: [PATCH 2/2] virtio_net: fixing XDP for fully checksummed packets handling

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 05:28:48PM +0800, Heng Qi wrote:
> On Thu, 20 Jun 2024 16:33:35 +0800, Jason Wang  wrote:
> > On Tue, Jun 18, 2024 at 11:17 AM Heng Qi  wrote:
> > >
> > > On Tue, 18 Jun 2024 11:10:26 +0800, Jason Wang  
> > > wrote:
> > > > On Mon, Jun 17, 2024 at 9:15 PM Heng Qi  
> > > > wrote:
> > > > >
> > > > > The XDP program can't correctly handle partially checksummed
> > > > > packets, but works fine with fully checksummed packets.
> > > >
> > > > Not sure this is ture, if I was not wrong, XDP can try to calculate 
> > > > checksum.
> > >
> > > XDP's interface serves a full checksum,
> > 
> > What do you mean by "serve" here? I mean, XDP can calculate the
> > checksum and fill it in the packet by itself.
> > 
> 
> Yes, XDP can parse and calculate checksums for all packets.
> However, the bpf_csum_diff and bpf_l4_csum_replace APIs provided by XDP assume
> that the packets being processed are fully checksummed packets. That is,
> after the XDP program modified the packets, the incremental checksum can be
> calculated (for example, samples/bpf/tcbpf1_kern.c, 
> samples/bpf/test_lwt_bpf.c).
> 
> Therefore, partially checksummed packets cannot be processed normally in these
> examples and need to be discarded.
> 
> > > and this is why we disabled the
> > > offloading of VIRTIO_NET_F_GUEST_CSUM when loading XDP.
> > 
> > If we trust the device to disable VIRTIO_NET_F_GUEST_CSUM, any reason
> > to check VIRTIO_NET_HDR_F_NEEDS_CSUM again in the receive path?
> 
> There doesn't seem to be a mandatory constraint in the spec that devices that
> haven't negotiated VIRTIO_NET_F_GUEST_CSUM cannot set NEEDS_CSUM bit, so I 
> check this.
> 
> Thanks.

The spec says:

\item If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
  VIRTIO_NET_HDR_F_NEEDS_CSUM bit in \field{flags} can be
  set: if so, the packet checksum at offset \field{csum_offset} 
  from \field{csum_start} and any preceding checksums
  have been validated.  The checksum on the packet is incomplete and
  if bit VIRTIO_NET_HDR_F_RSC_INFO is not set in \field{flags},
  then \field{csum_start} and \field{csum_offset} indicate how to calculate it
  (see Packet Transmission point 1).


So yes, NEEDS_CSUM without VIRTIO_NET_F_GUEST_CSUM is at best undefined.
Please do not try to use it unless VIRTIO_NET_F_GUEST_CSUM is set.

And if you want to be flexible, ignore it unless VIRTIO_NET_F_GUEST_CSUM
has been negotiated.
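
To make the flexible option concrete, a minimal sketch (hypothetical helper, not from any posted patch):

/* Hedged sketch: honour VIRTIO_NET_HDR_F_NEEDS_CSUM only when
 * VIRTIO_NET_F_GUEST_CSUM was negotiated; otherwise treat the bit as
 * if it were not set (or drop the packet, per the stricter reading above).
 */
static bool virtnet_needs_csum(struct virtnet_info *vi,
			       const struct virtio_net_hdr *hdr)
{
	if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_GUEST_CSUM))
		return false;
	return hdr->flags & VIRTIO_NET_HDR_F_NEEDS_CSUM;
}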





> > 
> > >
> > > Thanks.
> > 
> > Thanks
> > 
> > >
> > > >
> > > > Thanks
> > > >
> > > > > If the
> > > > > device has already validated fully checksummed packets, then
> > > > > the driver doesn't need to re-validate them, saving CPU resources.
> > > > >
> > > > > Additionally, the driver does not drop all partially checksummed
> > > > > packets when VIRTIO_NET_F_GUEST_CSUM is not negotiated. This is
> > > > > not a bug, as the driver has always done this.
> > > > >
> > > > > Fixes: 436c9453a1ac ("virtio-net: keep vnet header zeroed after 
> > > > > processing XDP")
> > > > > Signed-off-by: Heng Qi 
> > > > > ---
> > > > >  drivers/net/virtio_net.c | 20 +++-
> > > > >  1 file changed, 19 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index aa70a7ed8072..ea10db9a09fa 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -1360,6 +1360,10 @@ static struct sk_buff 
> > > > > *receive_small_xdp(struct net_device *dev,
> > > > > if (unlikely(hdr->hdr.gso_type))
> > > > > goto err_xdp;
> > > > >
> > > > > +   /* Partially checksummed packets must be dropped. */
> > > > > +   if (unlikely(hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM))
> > > > > +   goto err_xdp;
> > > > > +
> > > > > buflen = SKB_DATA_ALIGN(GOOD_PACKET_LEN + headroom) +
> > > > > SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > > > >
> > > > > @@ -1677,6 +1681,10 @@ static void *mergeable_xdp_get_buf(struct 
> > > > > virtnet_info *vi,
> > > > > if (unlikely(hdr->hdr.gso_type))
> > > > > return NULL;
> > > > >
> > > > > +   /* Partially checksummed packets must be dropped. */
> > > > > +   if (unlikely(hdr->hdr.flags & VIRTIO_NET_HDR_F_NEEDS_CSUM))
> > > > > +   return NULL;
> > > > > +
> > > > > /* Now XDP core assumes frag size is PAGE_SIZE, but buffers
> > > > >  * with headroom may add hole in truesize, which
> > > > >  * make their length exceed PAGE_SIZE. So we disabled the
> > > > > @@ -1943,6 +1951,7 @@ static void receive_buf(struct virtnet_info 
> > > > > *vi, struct receive_queue *rq,
> > > > > struct net_device *dev = vi->dev;
> > > > > struct sk_buff *skb;
> > > > > struct virtio_net_common_hdr *hdr;
> > > > > +   u8 flags;
> > > > >
> > > > > if (unlikely(len < vi->hdr_len + ETH_HLEN)) {
> > > > > pr_debug("%s: short packet

Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 06:10:51AM -0400, Michael S. Tsirkin wrote:
> On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> > On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang  wrote:
> > > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  wrote:
> > > >
> > > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi  
> > > > wrote:
> > > > >
> > > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> > > > >  wrote:
> > > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> > > > > > > virtnet_info *vi)
> > > > > > >
> > > > > > > /* Parameters for control virtqueue, if any */
> > > > > > > if (vi->has_cvq) {
> > > > > > > -   callbacks[total_vqs - 1] = NULL;
> > > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> > > > > > > names[total_vqs - 1] = "control";
> > > > > > > }
> > > > > > >
> > > > > >
> > > > > > If the # of MSIX vectors is exactly for data path VQs,
> > > > > > this will cause irq sharing between VQs which will degrade
> > > > > > performance significantly.
> > > > > >
> > > >
> > > > Why do we need to care about buggy management? I think libvirt has
> > > > been taught to use 2N+2 since the introduction of the multiqueue[1].
> > > 
> > > And Qemu can calculate it correctly automatically since:
> > > 
> > > commit 51a81a2118df0c70988f00d61647da9e298483a4
> > > Author: Jason Wang 
> > > Date:   Mon Mar 8 12:49:19 2021 +0800
> > > 
> > > virtio-net: calculating proper msix vectors on init
> > > 
> > > Currently, the default msix vectors for virtio-net-pci is 3 which is
> > > obvious not suitable for multiqueue guest, so we depends on the user
> > > or management tools to pass a correct vectors parameter. In fact, we
> > > can simplifying this by calculating the number of vectors on realize.
> > > 
> > > Consider we have N queues, the number of vectors needed is 2*N + 2
> > > (#queue pairs + plus one config interrupt and control vq). We didn't
> > > check whether or not host support control vq because it was added
> > > unconditionally by qemu to avoid breaking legacy guests such as Minix.
> > > 
> > > Reviewed-by: Philippe Mathieu-Daudé 
> > > Reviewed-by: Stefano Garzarella 
> > > Reviewed-by: Stefan Hajnoczi 
> > > Signed-off-by: Jason Wang 
> > 
> > Yes, devices designed according to the spec need to reserve an interrupt
> > vector for ctrlq. So, Michael, do we want to be compatible with buggy 
> > devices?
> > 
> > Thanks.
> 
> These aren't buggy, the spec allows this. So don't fail, but
> I'm fine with using polling if not enough vectors.

Sharing with the config interrupt is easier code-wise though, FWIW -
we don't need to maintain two code-paths.

> > > 
> > > Thanks
> > > 
> > > >
> > > > > > So no, you can not just do it unconditionally.
> > > > > >
> > > > > > The correct fix probably requires virtio core/API extensions.
> > > > >
> > > > > If the introduction of cvq irq causes interrupts to become shared, 
> > > > > then
> > > > > ctrlq need to fall back to polling mode and keep the status quo.
> > > >
> > > > Having two paths sounds like a burden.
> > > >
> > > > >
> > > > > Thanks.
> > > > >
> > > >
> > > >
> > > > Thanks
> > > >
> > > > [1] https://www.linux-kvm.org/page/Multiqueue
> > > >
> > > > > >
> > > > > > --
> > > > > > MST
> > > > > >
> > > > >
> > > 




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 05:53:15PM +0800, Heng Qi wrote:
> On Thu, 20 Jun 2024 16:26:05 +0800, Jason Wang  wrote:
> > On Thu, Jun 20, 2024 at 4:21 PM Jason Wang  wrote:
> > >
> > > On Thu, Jun 20, 2024 at 3:35 PM Heng Qi  wrote:
> > > >
> > > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> > > >  wrote:
> > > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct 
> > > > > > virtnet_info *vi)
> > > > > >
> > > > > > /* Parameters for control virtqueue, if any */
> > > > > > if (vi->has_cvq) {
> > > > > > -   callbacks[total_vqs - 1] = NULL;
> > > > > > +   callbacks[total_vqs - 1] = virtnet_cvq_done;
> > > > > > names[total_vqs - 1] = "control";
> > > > > > }
> > > > > >
> > > > >
> > > > > If the # of MSIX vectors is exactly for data path VQs,
> > > > > this will cause irq sharing between VQs which will degrade
> > > > > performance significantly.
> > > > >
> > >
> > > Why do we need to care about buggy management? I think libvirt has
> > > been taught to use 2N+2 since the introduction of the multiqueue[1].
> > 
> > And Qemu can calculate it correctly automatically since:
> > 
> > commit 51a81a2118df0c70988f00d61647da9e298483a4
> > Author: Jason Wang 
> > Date:   Mon Mar 8 12:49:19 2021 +0800
> > 
> > virtio-net: calculating proper msix vectors on init
> > 
> > Currently, the default msix vectors for virtio-net-pci is 3 which is
> > obvious not suitable for multiqueue guest, so we depends on the user
> > or management tools to pass a correct vectors parameter. In fact, we
> > can simplifying this by calculating the number of vectors on realize.
> > 
> > Consider we have N queues, the number of vectors needed is 2*N + 2
> > (#queue pairs + plus one config interrupt and control vq). We didn't
> > check whether or not host support control vq because it was added
> > unconditionally by qemu to avoid breaking legacy guests such as Minix.
> > 
> > Reviewed-by: Philippe Mathieu-Daudé 
> > Reviewed-by: Stefano Garzarella 
> > Reviewed-by: Stefan Hajnoczi 
> > Signed-off-by: Jason Wang 
> 
> Yes, devices designed according to the spec need to reserve an interrupt
> vector for ctrlq. So, Michael, do we want to be compatible with buggy devices?
> 
> Thanks.

These aren't buggy, the spec allows this. So don't fail, but
I'm fine with using polling if not enough vectors.
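
A minimal sketch of that fallback, based on the hunk quoted above (the
have_cvq_vector flag is hypothetical; a NULL callback simply keeps ctrlq
in polling mode):

	/* Parameters for control virtqueue, if any.  Hedged sketch:
	 * request an interrupt for ctrlq only when a spare MSI-X vector
	 * is actually available, otherwise keep polling.
	 */
	if (vi->has_cvq) {
		callbacks[total_vqs - 1] = have_cvq_vector ?
					   virtnet_cvq_done : NULL;
		names[total_vqs - 1] = "control";
	}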

> > 
> > Thanks
> > 
> > >
> > > > > So no, you can not just do it unconditionally.
> > > > >
> > > > > The correct fix probably requires virtio core/API extensions.
> > > >
> > > > If the introduction of cvq irq causes interrupts to become shared, then
> > > > ctrlq need to fall back to polling mode and keep the status quo.
> > >
> > > Having two paths sounds like a burden.
> > >
> > > >
> > > > Thanks.
> > > >
> > >
> > >
> > > Thanks
> > >
> > > [1] https://www.linux-kvm.org/page/Multiqueue
> > >
> > > > >
> > > > > --
> > > > > MST
> > > > >
> > > >
> > 




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 05:38:22PM +0800, Heng Qi wrote:
> On Thu, 20 Jun 2024 04:32:15 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Thu, Jun 20, 2024 at 03:29:15PM +0800, Heng Qi wrote:
> > > On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin" 
> > >  wrote:
> > > > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct virtnet_info 
> > > > > *vi)
> > > > >  
> > > > >   /* Parameters for control virtqueue, if any */
> > > > >   if (vi->has_cvq) {
> > > > > - callbacks[total_vqs - 1] = NULL;
> > > > > + callbacks[total_vqs - 1] = virtnet_cvq_done;
> > > > >   names[total_vqs - 1] = "control";
> > > > >   }
> > > > >  
> > > > 
> > > > If the # of MSIX vectors is exactly for data path VQs,
> > > > this will cause irq sharing between VQs which will degrade
> > > > performance significantly.
> > > > 
> > > > So no, you can not just do it unconditionally.
> > > > 
> > > > The correct fix probably requires virtio core/API extensions.
> > > 
> > > If the introduction of cvq irq causes interrupts to become shared, then
> > > ctrlq need to fall back to polling mode and keep the status quo.
> > > 
> > > Thanks.
> > 
> > I don't see that in the code.
> > 
> > I guess we'll need more info in find vqs about what can and what can't 
> > share irqs?
> 
> I mean we should add fallback code, for example if allocating interrupt for 
> ctrlq
> fails, we should clear the callback of ctrlq.

I have no idea how you plan to do that. Interrupts are allocated in
the virtio core; callbacks are enabled in drivers.

> > Sharing between ctrl vq and config irq can also be an option.
> > 
> 
> Not sure if this violates the spec. In the spec, used buffer notification and
> configuration change notification are clearly defined - ctrlq is a virtqueue
> and used buffer notification should be used.
> 
> Thanks.

It is up to the driver to choose which MSI-X vector will trigger.
Nothing says the same vector can't be reused.
Whether devices made assumptions based on current driver
behaviour is another matter.
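
For illustration, at the virtio-pci level nothing stops the driver from
programming the same index for both. A hedged sketch only; mdev,
shared_vec and cvq_index are hypothetical names:

	/* reuse one MSI-X vector for config changes and for the control
	 * vq's used-buffer notifications */
	vp_modern_config_vector(mdev, shared_vec);
	vp_modern_queue_vector(mdev, cvq_index, shared_vec);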


> > 
> > 
> > 
> > > > 
> > > > -- 
> > > > MST
> > > > 
> > 
> > 




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-20 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 03:29:15PM +0800, Heng Qi wrote:
> On Wed, 19 Jun 2024 17:19:12 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> > > @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
> > >  
> > >   /* Parameters for control virtqueue, if any */
> > >   if (vi->has_cvq) {
> > > - callbacks[total_vqs - 1] = NULL;
> > > + callbacks[total_vqs - 1] = virtnet_cvq_done;
> > >   names[total_vqs - 1] = "control";
> > >   }
> > >  
> > 
> > If the # of MSIX vectors is exactly for data path VQs,
> > this will cause irq sharing between VQs which will degrade
> > performance significantly.
> > 
> > So no, you can not just do it unconditionally.
> > 
> > The correct fix probably requires virtio core/API extensions.
> 
> If the introduction of cvq irq causes interrupts to become shared, then
> ctrlq need to fall back to polling mode and keep the status quo.
> 
> Thanks.

I don't see that in the code.

I guess we'll need more info in find_vqs about what can and what can't share 
irqs?
Sharing between ctrl vq and config irq can also be an option.




> > 
> > -- 
> > MST
> > 




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-06-20 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 12:09:45PM +0200, Jiri Pirko wrote:
> Wed, Jun 19, 2024 at 10:23:25AM CEST, m...@redhat.com wrote:
> >On Wed, Jun 19, 2024 at 10:05:41AM +0200, Jiri Pirko wrote:
> >> >Oh. Right of course. Worth a comment maybe? Just to make sure
> >> >we remember not to call __free_old_xmit twice in a row
> >> >without reinitializing stats.
> >> >Or move the initialization into __free_old_xmit to make it
> >> >self-contained ..
> >> 
> >> Well, the initialization happens in the caller by {0}, Wouldn't
> >> memset in __free_old_xmit() add an extra overhead? IDK.
> >
> >
> >Well if I did the below the binary is a bit smaller.
> >
> >If you have to respin you can include it.
> >If not I can submit separately.
> 
> Please send it separately. It should be a separate patch.
> 
> Thanks!
> 
> >
> >
> >
> >
> >virtio-net: cleanup __free_old_xmit
> >
> >Two call sites of __free_old_xmit zero-initialize stats,
> >doing it inside __free_old_xmit seems to make compiler's
> >job a bit easier:
> >
> >$ size /tmp/orig/virtio_net.o 
> >   text    data     bss     dec     hex filename
> >  65857    3892     100   69849   110d9 /tmp/orig/virtio_net.o
> >$ size /tmp/new/virtio_net.o 
> >   text    data     bss     dec     hex filename
> >  65760    3892     100   69752   11078 /tmp/new/virtio_net.o
> >
> >Couldn't measure any performance impact, unsurprisingly.
> >
> >Signed-off-by: Michael S. Tsirkin 
> >
> >---
> >
> >diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >index 283b34d50296..c2ce8de340f7 100644
> >--- a/drivers/net/virtio_net.c
> >+++ b/drivers/net/virtio_net.c
> >@@ -383,6 +383,8 @@ static void __free_old_xmit(struct send_queue *sq, bool 
> >in_napi,
> > unsigned int len;
> > void *ptr;
> > 
> >+stats->bytes = stats->packets = 0;
> 
> Memset perhaps?

Generates the same code and I find it less readable -
virtio generally opts for explicit initialization.
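
The two spellings in question, for reference (per the above, they generate
the same stores for this two-field struct; illustration only):

	stats->bytes = stats->packets = 0;	/* explicit initialization */
	memset(stats, 0, sizeof(*stats));	/* memset alternative */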

> >+
> > while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> > ++stats->packets;
> > 
> >@@ -828,7 +830,7 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue 
> >*vq, void *buf)
> > 
> > static void free_old_xmit(struct send_queue *sq, bool in_napi)
> > {
> >-struct virtnet_sq_free_stats stats = {0};
> >+struct virtnet_sq_free_stats stats;
> > 
> > __free_old_xmit(sq, in_napi, &stats);
> > 
> >@@ -979,7 +981,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> > int n, struct xdp_frame **frames, u32 flags)
> > {
> > struct virtnet_info *vi = netdev_priv(dev);
> >-struct virtnet_sq_free_stats stats = {0};
> >+struct virtnet_sq_free_stats stats;
> > struct receive_queue *rq = vi->rq;
> > struct bpf_prog *xdp_prog;
> > struct send_queue *sq;
> >




Re: [PATCH net-next v4 2/5] virtio_net: enable irq for the control vq

2024-06-19 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 12:19:05AM +0800, Heng Qi wrote:
> @@ -5312,7 +5315,7 @@ static int virtnet_find_vqs(struct virtnet_info *vi)
>  
>   /* Parameters for control virtqueue, if any */
>   if (vi->has_cvq) {
> - callbacks[total_vqs - 1] = NULL;
> + callbacks[total_vqs - 1] = virtnet_cvq_done;
>   names[total_vqs - 1] = "control";
>   }
>  

If the # of MSIX vectors is exactly for data path VQs,
this will cause irq sharing between VQs which will degrade
performance significantly.

So no, you can not just do it unconditionally.

The correct fix probably requires virtio core/API extensions.
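
To make the arithmetic concrete (illustrative sketch only, helper name
hypothetical): with the 2*N + 2 budget below, a host that exposes only
2*N + 1 vectors leaves nothing dedicated for ctrlq, so an unconditional
callback ends up sharing an irq with a data vq.

static unsigned int virtnet_msix_vectors_wanted(unsigned int queue_pairs)
{
	/* one RX and one TX vector per queue pair, plus one for config
	 * changes and one for the control vq */
	return 2 * queue_pairs + 2;
}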

-- 
MST




Re: [PATCH net-next v4 0/5] virtio_net: enable the irq for ctrlq

2024-06-19 Thread Michael S. Tsirkin
On Thu, Jun 20, 2024 at 12:19:03AM +0800, Heng Qi wrote:
> Ctrlq in polling mode may cause the virtual machine to hang and
> occupy additional CPU resources. Enabling the irq for ctrlq
> alleviates this problem and allows commands to be requested
> concurrently.

Any patch that is supposed to be a performance improvement
has to come with actual before/after testing results, not
vague "may cause".



> Changelog
> =
> v3->v4:
>   - Turn off the switch before flush the get_cvq work.
>   - Add interrupt suppression.
> 
> v2->v3:
>   - Use the completion for dim cmds.
> 
> v1->v2:
>   - Refactor the patch 1 and rephase the commit log.
> 
> Heng Qi (5):
>   virtio_net: passing control_buf explicitly
>   virtio_net: enable irq for the control vq
>   virtio_net: change the command token to completion
>   virtio_net: refactor command sending and response handling
>   virtio_net: improve dim command request efficiency
> 
>  drivers/net/virtio_net.c | 309 ---
>  1 file changed, 260 insertions(+), 49 deletions(-)
> 
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH] virtio_net: Use u64_stats_fetch_begin() for stats fetch

2024-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 10:55:29AM +0800, Li RongQing wrote:
> This place is fetching the stats, so u64_stats_fetch_begin
> and u64_stats_fetch_retry should be used
> 
> Fixes: 6208799553a8 ("virtio-net: support rx netdim")
> Signed-off-by: Li RongQing 

Acked-by: Michael S. Tsirkin 
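
For reference, a minimal sketch of the seqcount pairing this restores
(illustrative, not part of the patch): writers use update_begin/end,
readers must use fetch_begin/retry.

	/* writer side, datapath */
	u64_stats_update_begin(&rq->stats.syncp);
	u64_stats_inc(&rq->stats.packets);
	u64_stats_update_end(&rq->stats.syncp);

	/* reader side, e.g. sampling for net_dim */
	unsigned int start;
	u64 packets;

	do {
		start = u64_stats_fetch_begin(&rq->stats.syncp);
		packets = u64_stats_read(&rq->stats.packets);
	} while (u64_stats_fetch_retry(&rq->stats.syncp, start));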

> ---
>  drivers/net/virtio_net.c | 14 --
>  1 file changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d1..b669e73 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2332,16 +2332,18 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
>  static void virtnet_rx_dim_update(struct virtnet_info *vi, struct 
> receive_queue *rq)
>  {
>   struct dim_sample cur_sample = {};
> + unsigned int start;
>  
>   if (!rq->packets_in_napi)
>   return;
>  
> - u64_stats_update_begin(&rq->stats.syncp);
> - dim_update_sample(rq->calls,
> -   u64_stats_read(&rq->stats.packets),
> -   u64_stats_read(&rq->stats.bytes),
> -   &cur_sample);
> - u64_stats_update_end(&rq->stats.syncp);
> + do {
> + start = u64_stats_fetch_begin(&rq->stats.syncp);
> + dim_update_sample(rq->calls,
> + u64_stats_read(&rq->stats.packets),
> + u64_stats_read(&rq->stats.bytes),
> + &cur_sample);
> + } while (u64_stats_fetch_retry(&rq->stats.syncp, start));
>  
>   net_dim(&rq->dim, cur_sample);
>   rq->packets_in_napi = 0;
> -- 
> 2.9.4




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 10:05:41AM +0200, Jiri Pirko wrote:
> >Oh. Right of course. Worth a comment maybe? Just to make sure
> >we remember not to call __free_old_xmit twice in a row
> >without reinitializing stats.
> >Or move the initialization into __free_old_xmit to make it
> >self-contained ..
> 
> Well, the initialization happens in the caller by {0}, Wouldn't
> memset in __free_old_xmit() add an extra overhead? IDK.


Well if I did the below the binary is a bit smaller.

If you have to respin you can include it.
If not I can submit separately.




virtio-net: cleanup __free_old_xmit

Two call sites of __free_old_xmit zero-initialize stats,
doing it inside __free_old_xmit seems to make compiler's
job a bit easier:

$ size /tmp/orig/virtio_net.o 
   text    data     bss     dec     hex filename
  65857    3892     100   69849   110d9 /tmp/orig/virtio_net.o
$ size /tmp/new/virtio_net.o 
   text    data     bss     dec     hex filename
  65760    3892     100   69752   11078 /tmp/new/virtio_net.o

Couldn't measure any performance impact, unsurprisingly.

Signed-off-by: Michael S. Tsirkin 

---

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 283b34d50296..c2ce8de340f7 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -383,6 +383,8 @@ static void __free_old_xmit(struct send_queue *sq, bool 
in_napi,
unsigned int len;
void *ptr;
 
+   stats->bytes = stats->packets = 0;
+
while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
++stats->packets;
 
@@ -828,7 +830,7 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue *vq, 
void *buf)
 
 static void free_old_xmit(struct send_queue *sq, bool in_napi)
 {
-   struct virtnet_sq_free_stats stats = {0};
+   struct virtnet_sq_free_stats stats;
 
__free_old_xmit(sq, in_napi, &stats);
 
@@ -979,7 +981,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
int n, struct xdp_frame **frames, u32 flags)
 {
struct virtnet_info *vi = netdev_priv(dev);
-   struct virtnet_sq_free_stats stats = {0};
+   struct virtnet_sq_free_stats stats;
struct receive_queue *rq = vi->rq;
struct bpf_prog *xdp_prog;
struct send_queue *sq;




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 10:05:41AM +0200, Jiri Pirko wrote:
> Wed, Jun 19, 2024 at 09:26:22AM CEST, m...@redhat.com wrote:
> >On Wed, Jun 19, 2024 at 07:45:16AM +0200, Jiri Pirko wrote:
> >> Tue, Jun 18, 2024 at 08:18:12PM CEST, m...@redhat.com wrote:
> >> >This looks like a sensible way to do this.
> >> >Yet something to improve:
> >> >
> >> >
> >> >On Tue, Jun 18, 2024 at 04:44:56PM +0200, Jiri Pirko wrote:
> >> >> From: Jiri Pirko 
> >> >> 
> >> 
> >> [...]
> >> 
> >> 
> >> >> +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue 
> >> >> *txq,
> >> >> +   bool in_napi, struct virtnet_sq_free_stats 
> >> >> *stats)
> >> >>  {
> >> >> unsigned int len;
> >> >> void *ptr;
> >> >>  
> >> >> while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> >> >> -   ++stats->packets;
> >> >> -
> >> >> if (!is_xdp_frame(ptr)) {
> >> >> -   struct sk_buff *skb = ptr;
> >> >> +   struct sk_buff *skb = ptr_to_skb(ptr);
> >> >>  
> >> >> pr_debug("Sent skb %p\n", skb);
> >> >>  
> >> >> -   stats->bytes += skb->len;
> >> >> +   if (is_orphan_skb(ptr)) {
> >> >> +   stats->packets++;
> >> >> +   stats->bytes += skb->len;
> >> >> +   } else {
> >> >> +   stats->napi_packets++;
> >> >> +   stats->napi_bytes += skb->len;
> >> >> +   }
> >> >> napi_consume_skb(skb, in_napi);
> >> >> } else {
> >> >> struct xdp_frame *frame = ptr_to_xdp(ptr);
> >> >>  
> >> >> +   stats->packets++;
> >> >> stats->bytes += xdp_get_frame_len(frame);
> >> >> xdp_return_frame(frame);
> >> >> }
> >> >> }
> >> >> +   netdev_tx_completed_queue(txq, stats->napi_packets, 
> >> >> stats->napi_bytes);
> >> >
> >> >Are you sure it's right? You are completing larger and larger
> >> >number of bytes and packets each time.
> >> 
> >> Not sure I get you. __free_old_xmit() is always called with stats
> >> zeroed. So this is just sum-up of one queue completion run.
> >> I don't see how this could become "larger and larger number" as you
> >> describe.
> >
> >Oh. Right of course. Worth a comment maybe? Just to make sure
> >we remember not to call __free_old_xmit twice in a row
> >without reinitializing stats.
> >Or move the initialization into __free_old_xmit to make it
> >self-contained ..
> 
> Well, the initialization happens in the caller by {0}, Wouldn't
> memset in __free_old_xmit() add an extra overhead? IDK.
> Perhaps a small comment in __free_old_xmit() would do better.
> 
> One way or another, I think this is parallel to this patchset. Will
> handle it separatelly if you don't mind.


Okay.


Acked-by: Michael S. Tsirkin 


> >WDYT?
> >
> >> 
> >> >
> >> >For example as won't this eventually trigger this inside dql_completed:
> >> >
> >> >BUG_ON(count > num_queued - dql->num_completed);
> >> 
> >> Nope, I don't see how we can hit it. Do not complete anything else
> >> in addition to what was started in xmit(). Am I missing something?
> >> 
> >> 
> >> >
> >> >?
> >> >
> >> >
> >> >If I am right the perf testing has to be redone with this fixed ...
> >> >
> >> >
> >> >>  }
> >> >>  
> >> 
> >> [...]
> >




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-06-19 Thread Michael S. Tsirkin
On Wed, Jun 19, 2024 at 07:45:16AM +0200, Jiri Pirko wrote:
> Tue, Jun 18, 2024 at 08:18:12PM CEST, m...@redhat.com wrote:
> >This looks like a sensible way to do this.
> >Yet something to improve:
> >
> >
> >On Tue, Jun 18, 2024 at 04:44:56PM +0200, Jiri Pirko wrote:
> >> From: Jiri Pirko 
> >> 
> 
> [...]
> 
> 
> >> +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue 
> >> *txq,
> >> +  bool in_napi, struct virtnet_sq_free_stats *stats)
> >>  {
> >>unsigned int len;
> >>void *ptr;
> >>  
> >>while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> >> -  ++stats->packets;
> >> -
> >>if (!is_xdp_frame(ptr)) {
> >> -  struct sk_buff *skb = ptr;
> >> +  struct sk_buff *skb = ptr_to_skb(ptr);
> >>  
> >>pr_debug("Sent skb %p\n", skb);
> >>  
> >> -  stats->bytes += skb->len;
> >> +  if (is_orphan_skb(ptr)) {
> >> +  stats->packets++;
> >> +  stats->bytes += skb->len;
> >> +  } else {
> >> +  stats->napi_packets++;
> >> +  stats->napi_bytes += skb->len;
> >> +  }
> >>napi_consume_skb(skb, in_napi);
> >>} else {
> >>struct xdp_frame *frame = ptr_to_xdp(ptr);
> >>  
> >> +  stats->packets++;
> >>stats->bytes += xdp_get_frame_len(frame);
> >>xdp_return_frame(frame);
> >>}
> >>}
> >> +  netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);
> >
> >Are you sure it's right? You are completing larger and larger
> >number of bytes and packets each time.
> 
> Not sure I get you. __free_old_xmit() is always called with stats
> zeroed. So this is just sum-up of one queue completion run.
> I don't see how this could become "larger and larger number" as you
> describe.

Oh. Right of course. Worth a comment maybe? Just to make sure
we remember not to call __free_old_xmit twice in a row
without reinitializing stats.
Or move the initialization into __free_old_xmit to make it
self-contained ..
WDYT?
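
A sketch of that self-contained variant, for concreteness (hypothetical;
only the zeroing is the point here):

static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
			    bool in_napi, struct virtnet_sq_free_stats *stats)
{
	/* callers no longer need to pre-zero the stats */
	*stats = (struct virtnet_sq_free_stats){};

	/* ... reclaim completed buffers and accumulate into *stats ... */
}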

> 
> >
> >For example as won't this eventually trigger this inside dql_completed:
> >
> >BUG_ON(count > num_queued - dql->num_completed);
> 
> Nope, I don't see how we can hit it. Do not complete anything else
> in addition to what was started in xmit(). Am I missing something?
> 
> 
> >
> >?
> >
> >
> >If I am right the perf testing has to be redone with this fixed ...
> >
> >
> >>  }
> >>  
> 
> [...]




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-18 Thread Michael S. Tsirkin
On Tue, Jun 18, 2024 at 08:52:38AM +0800, Jason Wang wrote:
> On Mon, Jun 17, 2024 at 5:30 PM Jiri Pirko  wrote:
> >
> > Mon, Jun 17, 2024 at 03:44:55AM CEST, jasow...@redhat.com wrote:
> > >On Mon, Jun 10, 2024 at 10:19 PM Michael S. Tsirkin  
> > >wrote:
> > >>
> > >> On Fri, Jun 07, 2024 at 01:30:34PM +0200, Jiri Pirko wrote:
> > >> > Fri, Jun 07, 2024 at 12:23:37PM CEST, m...@redhat.com wrote:
> > >> > >On Fri, Jun 07, 2024 at 11:57:37AM +0200, Jiri Pirko wrote:
> > >> > >> >True. Personally, I would like to just drop orphan mode. But I'm 
> > >> > >> >not
> > >> > >> >sure others are happy with this.
> > >> > >>
> > >> > >> How about to do it other way around. I will take a stab at sending 
> > >> > >> patch
> > >> > >> removing it. If anyone is against and has solid data to prove orphan
> > >> > >> mode is needed, let them provide those.
> > >> > >
> > >> > >Break it with no warning and see if anyone complains?
> > >> >
> > >> > This is not what I suggested at all.
> > >> >
> > >> > >No, this is not how we handle userspace compatibility, normally.
> > >> >
> > >> > Sure.
> > >> >
> > >> > Again:
> > >> >
> > >> > I would send orphan removal patch containing:
> > >> > 1) no module options removal. Warn if someone sets it up
> > >> > 2) module option to disable napi is ignored
> > >> > 3) orphan mode is removed from code
> > >> >
> > >> > There is no breakage. Only, hypothetically, a performance downgrade in some
> > >> > hypothetical use case nobody knows of.
> > >>
> > >> Performance is why people use virtio. It's as much a breakage as any
> > >> other bug. The main difference is, with other types of breakage, they
> > >> are typically binary and we can not tolerate them at all.  A tiny,
> > >> negligeable performance regression might be tolarable if it brings
> > >> other benefits. I very much doubt avoiding interrupts is
> > >> negligeable though. And making code simpler isn't a big benefit,
> > >> users do not care.
> > >
> > >It's not just making code simpler. As discussed in the past, it also
> > >fixes real bugs.
> > >
> > >>
> > >> > My point was, if someone presents
> > >> > solid data to prove orphan is needed during the patch review, let's 
> > >> > toss
> > >> > out the patch.
> > >> >
> > >> > Makes sense?
> > >>
> > >> It's not hypothetical - if anything, it's hypothetical that performance
> > >> does not regress.  And we just got a report from users that see a
> > >> regression without.  So, not really.
> > >
> > >Probably, but do we need to define a bar here? Looking at git history,
> > >we didn't ask a full benchmark for a lot of commits that may touch
> >
> > Moreover, there is no "benchmark" to run anyway, is there?
> 
> Yes, so my point is to have some agreement on
> 
> 1) what kind of test needs to be run for a patch like this.
> 2) what numbers are ok or not
> 
> Thanks

That's a million-dollar question, and that difficulty is why we don't change
behaviour drastically for users without a fallback even if
we think we did a bunch of testing.



> >
> >
> > >performance.
> > >
> > >Thanks
> > >
> > >>
> > >> >
> > >> > >
> > >> > >--
> > >> > >MST
> > >> > >
> > >>
> > >
> >




Re: [PATCH net-next v3] virtio_net: add support for Byte Queue Limits

2024-06-18 Thread Michael S. Tsirkin
This looks like a sensible way to do this.
Yet something to improve:


On Tue, Jun 18, 2024 at 04:44:56PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Add support for Byte Queue Limits (BQL).
> 
> Tested on qemu emulated virtio_net device with 1, 2 and 4 queues.
> Tested with fq_codel and pfifo_fast. Super netperf with 50 threads is
> running in background. Netperf TCP_RR results:
> 
> NOBQL FQC 1q:  159.56  159.33  158.50  154.31agv: 157.925
> NOBQL FQC 2q:  184.64  184.96  174.73  174.15agv: 179.62
> NOBQL FQC 4q:  994.46  441.96  416.50  499.56agv: 588.12
> NOBQL PFF 1q:  148.68  148.92  145.95  149.48agv: 148.2575
> NOBQL PFF 2q:  171.86  171.20  170.42  169.42agv: 170.725
> NOBQL PFF 4q: 1505.23 1137.23 2488.70 3507.99agv: 2159.7875
>   BQL FQC 1q: 1332.80 1297.97 1351.41 1147.57agv: 1282.4375
>   BQL FQC 2q:  768.30  817.72  864.43  974.40agv: 856.2125
>   BQL FQC 4q:  945.66  942.68  878.51  822.82agv: 897.4175
>   BQL PFF 1q:  149.69  151.49  149.40  147.47agv: 149.5125
>   BQL PFF 2q: 2059.32  798.74 1844.12  381.80agv: 1270.995
>   BQL PFF 4q: 1871.98 4420.02 4916.59 13268.16   agv: 6119.1875
> 
> Signed-off-by: Jiri Pirko 
> ---
> v2->v3:
> - fixed the switch from/to orphan mode while skbs are yet to be
>   completed by using the second least significant bit in virtqueue
>   token pointer to indicate skb is orphan. Don't account orphan
>   skbs in completion.
> - reorganized parallel skb/xdp free stats accounting to napi/others.
> - fixed kick condition check in orphan mode
> v1->v2:
> - moved netdev_tx_completed_queue() call into __free_old_xmit(),
>   propagate use_napi flag to __free_old_xmit() and only call
>   netdev_tx_completed_queue() in case it is true
> - added forgotten call to netdev_tx_reset_queue()
> - fixed stats for xdp packets
> - fixed bql accounting when __free_old_xmit() is called from xdp path
> - handle the !use_napi case in start_xmit() kick section
> ---
>  drivers/net/virtio_net.c | 81 
>  1 file changed, 57 insertions(+), 24 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d134544..9f9b86874173 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -47,7 +47,8 @@ module_param(napi_tx, bool, 0644);
>  #define VIRTIO_XDP_TXBIT(0)
>  #define VIRTIO_XDP_REDIR BIT(1)
>  
> -#define VIRTIO_XDP_FLAG  BIT(0)
> +#define VIRTIO_XDP_FLAG  BIT(0)
> +#define VIRTIO_ORPHAN_FLAG   BIT(1)
>  
>  /* RX packet size EWMA. The average packet size is used to determine the 
> packet
>   * buffer size when refilling RX rings. As the entire RX ring may be refilled
> @@ -85,6 +86,8 @@ struct virtnet_stat_desc {
>  struct virtnet_sq_free_stats {
>   u64 packets;
>   u64 bytes;
> + u64 napi_packets;
> + u64 napi_bytes;
>  };
>  
>  struct virtnet_sq_stats {
> @@ -506,29 +509,50 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>  }
>  
> -static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> - struct virtnet_sq_free_stats *stats)
> +static bool is_orphan_skb(void *ptr)
> +{
> + return (unsigned long)ptr & VIRTIO_ORPHAN_FLAG;
> +}
> +
> +static void *skb_to_ptr(struct sk_buff *skb, bool orphan)
> +{
> + return (void *)((unsigned long)skb | (orphan ? VIRTIO_ORPHAN_FLAG : 0));
> +}
> +
> +static struct sk_buff *ptr_to_skb(void *ptr)
> +{
> + return (struct sk_buff *)((unsigned long)ptr & ~VIRTIO_ORPHAN_FLAG);
> +}
> +
> +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> + bool in_napi, struct virtnet_sq_free_stats *stats)
>  {
>   unsigned int len;
>   void *ptr;
>  
>   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - ++stats->packets;
> -
>   if (!is_xdp_frame(ptr)) {
> - struct sk_buff *skb = ptr;
> + struct sk_buff *skb = ptr_to_skb(ptr);
>  
>   pr_debug("Sent skb %p\n", skb);
>  
> - stats->bytes += skb->len;
> + if (is_orphan_skb(ptr)) {
> + stats->packets++;
> + stats->bytes += skb->len;
> + } else {
> + stats->napi_packets++;
> + stats->napi_bytes += skb->len;
> + }
>   napi_consume_skb(skb, in_napi);
>   } else {
>   struct xdp_frame *frame = ptr_to_xdp(ptr);
>  
> + stats->packets++;
>   stats->bytes += xdp_get_frame_len(frame);
>   xdp_return_frame(frame);
>   }
>   }
> + netdev_tx_completed_queue(txq, stats->napi_packets, stats->napi_bytes);

Are you sure it's right? You are compl

Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-17 Thread Michael S. Tsirkin
On Mon, Jun 17, 2024 at 11:30:36AM +0200, Jiri Pirko wrote:
> Mon, Jun 17, 2024 at 03:44:55AM CEST, jasow...@redhat.com wrote:
> >On Mon, Jun 10, 2024 at 10:19 PM Michael S. Tsirkin  wrote:
> >>
> >> On Fri, Jun 07, 2024 at 01:30:34PM +0200, Jiri Pirko wrote:
> >> > Fri, Jun 07, 2024 at 12:23:37PM CEST, m...@redhat.com wrote:
> >> > >On Fri, Jun 07, 2024 at 11:57:37AM +0200, Jiri Pirko wrote:
> >> > >> >True. Personally, I would like to just drop orphan mode. But I'm not
> >> > >> >sure others are happy with this.
> >> > >>
> >> > >> How about to do it other way around. I will take a stab at sending 
> >> > >> patch
> >> > >> removing it. If anyone is against and has solid data to prove orphan
> >> > >> mode is needed, let them provide those.
> >> > >
> >> > >Break it with no warning and see if anyone complains?
> >> >
> >> > This is not what I suggested at all.
> >> >
> >> > >No, this is not how we handle userspace compatibility, normally.
> >> >
> >> > Sure.
> >> >
> >> > Again:
> >> >
> >> > I would send orphan removal patch containing:
> >> > 1) no module options removal. Warn if someone sets it up
> >> > 2) module option to disable napi is ignored
> >> > 3) orphan mode is removed from code
> >> >
> >> > There is no breakage. Only, hypothetically, a performance downgrade in some
> >> > hypothetical use case nobody knows of.
> >>
> >> Performance is why people use virtio. It's as much a breakage as any
> >> other bug. The main difference is, with other types of breakage, they
> >> are typically binary and we can not tolerate them at all.  A tiny,
> >> negligible performance regression might be tolerable if it brings
> >> other benefits. I very much doubt avoiding interrupts is
> >> negligible though. And making code simpler isn't a big benefit,
> >> users do not care.
> >
> >It's not just making code simpler. As discussed in the past, it also
> >fixes real bugs.
> >
> >>
> >> > My point was, if someone presents
> >> > solid data to prove orphan is needed during the patch review, let's toss
> >> > out the patch.
> >> >
> >> > Makes sense?
> >>
> >> It's not hypothetical - if anything, it's hypothetical that performance
> >> does not regress.  And we just got a report from users that see a
> >> regression without.  So, not really.
> >
> >Probably, but do we need to define a bar here? Looking at git history,
> >we didn't ask a full benchmark for a lot of commits that may touch

It's patently obvious that not getting interrupts is better than
getting interrupts. The onus of proof would be on people who claim
otherwise.


> Moreover, there is no "benchmark" to run anyway, is there?
> 

Tough.  Talk to users that report regressions.



> >performance.
> >
> >Thanks
> >
> >>
> >> >
> >> > >
> >> > >--
> >> > >MST
> >> > >
> >>
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-17 Thread Michael S. Tsirkin
On Mon, Jun 17, 2024 at 09:44:55AM +0800, Jason Wang wrote:
> Probably, but do we need to define a bar here? Looking at git history,
> we didn't ask a full benchmark for a lot of commits that may touch
> performance.

There's no "may" here - we even got a report from a real user.

-- 
MST




Re: [PATCH net-next v2] virtio_net: add support for Byte Queue Limits

2024-06-17 Thread Michael S. Tsirkin
On Wed, Jun 12, 2024 at 07:08:51PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Add support for Byte Queue Limits (BQL).
> 
> Tested on qemu emulated virtio_net device with 1, 2 and 4 queues.
> Tested with fq_codel and pfifo_fast. Super netperf with 50 threads is
> running in background. Netperf TCP_RR results:
> 
> NOBQL FQC 1q:  159.56  159.33  158.50  154.31agv: 157.925
> NOBQL FQC 2q:  184.64  184.96  174.73  174.15agv: 179.62
> NOBQL FQC 4q:  994.46  441.96  416.50  499.56agv: 588.12
> NOBQL PFF 1q:  148.68  148.92  145.95  149.48agv: 148.2575
> NOBQL PFF 2q:  171.86  171.20  170.42  169.42agv: 170.725
> NOBQL PFF 4q: 1505.23 1137.23 2488.70 3507.99agv: 2159.7875
>   BQL FQC 1q: 1332.80 1297.97 1351.41 1147.57agv: 1282.4375
>   BQL FQC 2q:  768.30  817.72  864.43  974.40agv: 856.2125
>   BQL FQC 4q:  945.66  942.68  878.51  822.82agv: 897.4175
>   BQL PFF 1q:  149.69  151.49  149.40  147.47agv: 149.5125
>   BQL PFF 2q: 2059.32  798.74 1844.12  381.80agv: 1270.995
>   BQL PFF 4q: 1871.98 4420.02 4916.59 13268.16   agv: 6119.1875
> 
> Signed-off-by: Jiri Pirko 
> ---
> v1->v2:
> - moved netdev_tx_completed_queue() call into __free_old_xmit(),
>   propagate use_napi flag to __free_old_xmit() and only call
>   netdev_tx_completed_queue() in case it is true
> - added forgotten call to netdev_tx_reset_queue()
> - fixed stats for xdp packets
> - fixed bql accounting when __free_old_xmit() is called from xdp path
> - handle the !use_napi case in start_xmit() kick section
> ---
>  drivers/net/virtio_net.c | 50 +---
>  1 file changed, 32 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d134544..5863c663ccab 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -84,7 +84,9 @@ struct virtnet_stat_desc {
>  
>  struct virtnet_sq_free_stats {
>   u64 packets;
> + u64 xdp_packets;
>   u64 bytes;
> + u64 xdp_bytes;
>  };
>  
>  struct virtnet_sq_stats {
> @@ -506,29 +508,33 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>  }
>  
> -static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> + bool in_napi, bool use_napi,
>   struct virtnet_sq_free_stats *stats)
>  {
>   unsigned int len;
>   void *ptr;
>  
>   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - ++stats->packets;
> -
>   if (!is_xdp_frame(ptr)) {
>   struct sk_buff *skb = ptr;
>  
>   pr_debug("Sent skb %p\n", skb);
>  
> + stats->packets++;
>   stats->bytes += skb->len;
>   napi_consume_skb(skb, in_napi);
>   } else {
>   struct xdp_frame *frame = ptr_to_xdp(ptr);
>  
> - stats->bytes += xdp_get_frame_len(frame);
> + stats->xdp_packets++;
> + stats->xdp_bytes += xdp_get_frame_len(frame);
>   xdp_return_frame(frame);
>   }
>   }
> + if (use_napi)
> + netdev_tx_completed_queue(txq, stats->packets, stats->bytes);
> +
>  }
>  
>  /* Converting between virtqueue no. and kernel tx/rx queue no.
> @@ -955,21 +961,22 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue 
> *vq, void *buf)
>   virtnet_rq_free_buf(vi, rq, buf);
>  }
>  
> -static void free_old_xmit(struct send_queue *sq, bool in_napi)
> +static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> +   bool in_napi, bool use_napi)
>  {
>   struct virtnet_sq_free_stats stats = {0};
>  
> - __free_old_xmit(sq, in_napi, &stats);
> + __free_old_xmit(sq, txq, in_napi, use_napi, &stats);
>  
>   /* Avoid overhead when no packets have been processed
>* happens when called speculatively from start_xmit.
>*/
> - if (!stats.packets)
> + if (!stats.packets && !stats.xdp_packets)
>   return;
>  
>   u64_stats_update_begin(&sq->stats.syncp);
> - u64_stats_add(&sq->stats.bytes, stats.bytes);
> - u64_stats_add(&sq->stats.packets, stats.packets);
> + u64_stats_add(&sq->stats.bytes, stats.bytes + stats.xdp_bytes);
> + u64_stats_add(&sq->stats.packets, stats.packets + stats.xdp_packets);
>   u64_stats_update_end(&sq->stats.syncp);
>  }
>  
> @@ -1003,7 +1010,9 @@ static void check_sq_full_and_disable(struct 
> virtnet_info *vi,
>* early means 16 slots are typically wasted.
>*/
>   if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> - netif_stop_subqueue(dev, qnum);
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> +
> + netif_tx_stop_queue(txq);
>  

Re: [PATCH net-next v4 11/15] virtio_net: xsk: tx: support xmit xsk buffer

2024-06-12 Thread Michael S. Tsirkin
On Wed, Jun 12, 2024 at 04:25:05PM -0700, Jakub Kicinski wrote:
> On Tue, 11 Jun 2024 19:41:43 +0800 Xuan Zhuo wrote:
> > @@ -534,10 +534,13 @@ enum virtnet_xmit_type {
> > VIRTNET_XMIT_TYPE_SKB,
> > VIRTNET_XMIT_TYPE_XDP,
> > VIRTNET_XMIT_TYPE_DMA,
> > +   VIRTNET_XMIT_TYPE_XSK,
> 
> Again, would be great to avoid the transient warning (if it can be done
> cleanly):
> 
> drivers/net/virtio_net.c:5806:9: warning: enumeration value 
> ‘VIRTNET_XMIT_TYPE_XSK’ not handled in switch [-Wswitch]
>  5806 | switch (virtnet_xmit_ptr_strip(&buf)) {
>   | ^~


Yeah, just squashing is usually enough.

-- 
MST




Re: [PATCH net-next v4 09/15] virtio_net: xsk: bind/unbind xsk

2024-06-12 Thread Michael S. Tsirkin
On Tue, Jun 11, 2024 at 07:41:41PM +0800, Xuan Zhuo wrote:
> This patch implement the logic of bind/unbind xsk pool to sq and rq.
> 
> Signed-off-by: Xuan Zhuo 

I'd just squash with previous patch. This one is hard to review in
isolation.
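
For context when reading it on its own, a rough sketch of where these entry
points usually get dispatched from (hypothetical; trimmed to the xsk case,
and virtnet_xsk_pool_disable is assumed from the rest of the series):

static int virtnet_xdp(struct net_device *dev, struct netdev_bpf *xdp)
{
	switch (xdp->command) {
	case XDP_SETUP_XSK_POOL:
		if (xdp->xsk.pool)
			return virtnet_xsk_pool_enable(dev, xdp->xsk.pool,
						       xdp->xsk.queue_id);
		return virtnet_xsk_pool_disable(dev, xdp->xsk.queue_id);
	default:
		return -EINVAL;
	}
}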

> ---
>  drivers/net/virtio_net.c | 199 +++
>  1 file changed, 199 insertions(+)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4968ab7eb5a4..c82a0691632c 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -26,6 +26,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -57,6 +58,8 @@ DECLARE_EWMA(pkt_len, 0, 64)
>  
>  #define VIRTNET_DRIVER_VERSION "1.0.0"
>  
> +static struct virtio_net_hdr_mrg_rxbuf xsk_hdr;
> +
>  static const unsigned long guest_offloads[] = {
>   VIRTIO_NET_F_GUEST_TSO4,
>   VIRTIO_NET_F_GUEST_TSO6,
> @@ -320,6 +323,12 @@ struct send_queue {
>   bool premapped;
>  
>   struct virtnet_sq_dma_info dmainfo;
> +
> + struct {
> + struct xsk_buff_pool *pool;
> +
> + dma_addr_t hdr_dma_address;
> + } xsk;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -371,6 +380,13 @@ struct receive_queue {
>  
>   /* Record the last dma info to free after new pages is allocated. */
>   struct virtnet_rq_dma *last_dma;
> +
> + struct {
> + struct xsk_buff_pool *pool;
> +
> + /* xdp rxq used by xsk */
> + struct xdp_rxq_info xdp_rxq;
> + } xsk;
>  };
>  
>  /* This structure can contain rss message with maximum settings for 
> indirection table and keysize
> @@ -5168,6 +5184,187 @@ static int virtnet_restore_guest_offloads(struct 
> virtnet_info *vi)
>   return virtnet_set_guest_offloads(vi, offloads);
>  }
>  
> +static int virtnet_rq_bind_xsk_pool(struct virtnet_info *vi, struct 
> receive_queue *rq,
> + struct xsk_buff_pool *pool)
> +{
> + int err, qindex;
> +
> + qindex = rq - vi->rq;
> +
> + if (pool) {
> + err = xdp_rxq_info_reg(&rq->xsk.xdp_rxq, vi->dev, qindex, 
> rq->napi.napi_id);
> + if (err < 0)
> + return err;
> +
> + err = xdp_rxq_info_reg_mem_model(&rq->xsk.xdp_rxq,
> +  MEM_TYPE_XSK_BUFF_POOL, NULL);
> + if (err < 0) {
> + xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
> + return err;
> + }
> +
> + xsk_pool_set_rxq_info(pool, &rq->xsk.xdp_rxq);
> + }
> +
> + virtnet_rx_pause(vi, rq);
> +
> + err = virtqueue_reset(rq->vq, virtnet_rq_unmap_free_buf);
> + if (err) {
> + netdev_err(vi->dev, "reset rx fail: rx queue index: %d err: 
> %d\n", qindex, err);
> +
> + pool = NULL;
> + }
> +
> + if (!pool)
> + xdp_rxq_info_unreg(&rq->xsk.xdp_rxq);
> +
> + rq->xsk.pool = pool;
> +
> + virtnet_rx_resume(vi, rq);
> +
> + return err;
> +}
> +
> +static int virtnet_sq_bind_xsk_pool(struct virtnet_info *vi,
> + struct send_queue *sq,
> + struct xsk_buff_pool *pool)
> +{
> + int err, qindex;
> +
> + qindex = sq - vi->sq;
> +
> + virtnet_tx_pause(vi, sq);
> +
> + err = virtqueue_reset(sq->vq, virtnet_sq_free_unused_buf);
> + if (err)
> + netdev_err(vi->dev, "reset tx fail: tx queue index: %d err: 
> %d\n", qindex, err);
> + else
> + err = virtnet_sq_set_premapped(sq, !!pool);
> +
> + if (err)
> + pool = NULL;
> +
> + sq->xsk.pool = pool;
> +
> + virtnet_tx_resume(vi, sq);
> +
> + return err;
> +}
> +
> +static int virtnet_xsk_pool_enable(struct net_device *dev,
> +struct xsk_buff_pool *pool,
> +u16 qid)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + struct receive_queue *rq;
> + struct send_queue *sq;
> + struct device *dma_dev;
> + dma_addr_t hdr_dma;
> + int err;
> +
> + /* In big_packets mode, xdp cannot work, so there is no need to
> +  * initialize xsk of rq.
> +  *
> +  * Support for small mode firstly.
> +  */
> + if (vi->big_packets)
> + return -ENOENT;
> +
> + if (qid >= vi->curr_queue_pairs)
> + return -EINVAL;
> +
> + sq = &vi->sq[qid];
> + rq = &vi->rq[qid];
> +
> + /* xsk tx zerocopy depend on the tx napi.
> +  *
> +  * All xsk packets are actually consumed and sent out from the xsk tx
> +  * queue under the tx napi mechanism.
> +  */
> + if (!sq->napi.weight)
> + return -EPERM;
> +
> + /* For the xsk, the tx and rx should have the same device. But
> +  * vq->dma_dev allows every vq has the respective dma dev. So I check
> +  * th

Re: [PATCH net-next v4 08/15] virtio_net: sq support premapped mode

2024-06-12 Thread Michael S. Tsirkin
On Tue, Jun 11, 2024 at 07:41:40PM +0800, Xuan Zhuo wrote:
> If the xsk is enabling, the xsk tx will share the send queue.
> But the xsk requires that the send queue use the premapped mode.
> So the send queue must support premapped mode when it is bound to
> af-xdp.
> 
> * virtnet_sq_set_premapped(sq, true) is used to enable premapped mode.
> 
> In this mode, the driver will record the dma info when skb or xdp
> frame is sent.
> 
> Currently, the SQ premapped mode is operational only with af-xdp. In
> this mode, af-xdp, the kernel stack, and xdp tx/redirect will share
> the same SQ. Af-xdp independently manages its DMA. The kernel stack
> and xdp tx/redirect utilize this DMA metadata to manage the DMA
> info.
> 
> If the indirect descriptor feature be supported, the volume of DMA
> details we need to maintain becomes quite substantial. Here, we have
> a cap on the amount of DMA info we manage.
> 
> If the kernel stack and xdp tx/redirect attempt to use more
> descriptors, virtnet_add_outbuf() will return an -ENOMEM error. But
> the af-xdp can work continually.
> 
> * virtnet_sq_set_premapped(sq, false) is used to disable premapped mode.
> 
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/net/virtio_net.c | 219 ++-
>  1 file changed, 215 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index e84a4624549b..4968ab7eb5a4 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -25,6 +25,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  static int napi_weight = NAPI_POLL_WEIGHT;
>  module_param(napi_weight, int, 0444);
> @@ -276,6 +277,25 @@ struct virtnet_rq_dma {
>   u16 need_sync;
>  };
>  
> +struct virtnet_sq_dma {
> + union {
> + struct virtnet_sq_dma *next;

Maybe:

struct llist_node node;   

I'd like to avoid growing our own singly linked list
implementation if at all possible.
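
A minimal sketch of that shape (hypothetical, just to illustrate the
suggestion; the free list head would become a struct llist_head, and the
data/next union from the patch is left out here):

#include <linux/llist.h>

struct virtnet_sq_dma {
	struct llist_node node;
	dma_addr_t addr;
	u32 len;
	u8 num;
};

	/* return a reclaimed entry to the per-queue free list */
	llist_add(&p->node, &sq->dmainfo.free);

	/* take one when mapping a new buffer (single consumer) */
	struct llist_node *n = llist_del_first(&sq->dmainfo.free);
	struct virtnet_sq_dma *d = n ? llist_entry(n, struct virtnet_sq_dma, node) : NULL;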



> + void *data;
> + };
> + dma_addr_t addr;
> + u32 len;
> + u8 num;
> +};
> +
> +struct virtnet_sq_dma_info {
> + /* record for kfree */
> + void *p;
> +
> + u32 free_num;
> +
> + struct virtnet_sq_dma *free;
> +};
> +
>  /* Internal representation of a send virtqueue */
>  struct send_queue {
>   /* Virtqueue associated with this send _queue */
> @@ -295,6 +315,11 @@ struct send_queue {
>  
>   /* Record whether sq is in reset state. */
>   bool reset;
> +
> + /* SQ is premapped mode or not. */
> + bool premapped;
> +
> + struct virtnet_sq_dma_info dmainfo;
>  };
>  
>  /* Internal representation of a receive virtqueue */
> @@ -492,9 +517,11 @@ static void virtnet_sq_free_unused_buf(struct virtqueue 
> *vq, void *buf);
>  enum virtnet_xmit_type {
>   VIRTNET_XMIT_TYPE_SKB,
>   VIRTNET_XMIT_TYPE_XDP,
> + VIRTNET_XMIT_TYPE_DMA,
>  };
>  
> -#define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | 
> VIRTNET_XMIT_TYPE_XDP)
> +#define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | 
> VIRTNET_XMIT_TYPE_XDP \
> + | VIRTNET_XMIT_TYPE_DMA)
>  
>  static enum virtnet_xmit_type virtnet_xmit_ptr_strip(void **ptr)
>  {
> @@ -510,12 +537,178 @@ static void *virtnet_xmit_ptr_mix(void *ptr, enum 
> virtnet_xmit_type type)
>   return (void *)((unsigned long)ptr | type);
>  }
>  
> +static void virtnet_sq_unmap(struct send_queue *sq, void **data)
> +{
> + struct virtnet_sq_dma *head, *tail, *p;
> + int i;
> +
> + head = *data;
> +
> + p = head;
> +
> + for (i = 0; i < head->num; ++i) {
> + virtqueue_dma_unmap_page_attrs(sq->vq, p->addr, p->len,
> +DMA_TO_DEVICE, 0);
> + tail = p;
> + p = p->next;
> + }
> +
> + *data = tail->data;
> +
> + tail->next = sq->dmainfo.free;
> + sq->dmainfo.free = head;
> + sq->dmainfo.free_num += head->num;
> +}
> +
> +static void *virtnet_dma_chain_update(struct send_queue *sq,
> +   struct virtnet_sq_dma *head,
> +   struct virtnet_sq_dma *tail,
> +   u8 num, void *data)
> +{
> + sq->dmainfo.free = tail->next;
> + sq->dmainfo.free_num -= num;
> + head->num = num;
> +
> + tail->data = data;
> +
> + return virtnet_xmit_ptr_mix(head, VIRTNET_XMIT_TYPE_DMA);
> +}
> +
> +static struct virtnet_sq_dma *virtnet_sq_map_sg(struct send_queue *sq, int 
> num, void *data)
> +{
> + struct virtnet_sq_dma *head, *tail, *p;
> + struct scatterlist *sg;
> + dma_addr_t addr;
> + int i, err;
> +
> + if (num > sq->dmainfo.free_num)
> + return NULL;
> +
> + head = sq->dmainfo.free;
> + p = head;
> +
> + tail = NULL;
> +
> + for (i = 0; i < num; ++i) {
> + sg = &sq->sg[i];
> +
> + addr = virtqueue_dma_map_page_attrs(sq->vq, sg_page(s

Re: [PATCH net-next v2] virtio_net: add support for Byte Queue Limits

2024-06-12 Thread Michael S. Tsirkin
On Wed, Jun 12, 2024 at 07:08:51PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Add support for Byte Queue Limits (BQL).
> 
> Tested on qemu emulated virtio_net device with 1, 2 and 4 queues.
> Tested with fq_codel and pfifo_fast. Super netperf with 50 threads is
> running in background. Netperf TCP_RR results:
> 
> NOBQL FQC 1q:  159.56  159.33  158.50  154.31agv: 157.925
> NOBQL FQC 2q:  184.64  184.96  174.73  174.15agv: 179.62
> NOBQL FQC 4q:  994.46  441.96  416.50  499.56agv: 588.12
> NOBQL PFF 1q:  148.68  148.92  145.95  149.48agv: 148.2575
> NOBQL PFF 2q:  171.86  171.20  170.42  169.42agv: 170.725
> NOBQL PFF 4q: 1505.23 1137.23 2488.70 3507.99agv: 2159.7875
>   BQL FQC 1q: 1332.80 1297.97 1351.41 1147.57agv: 1282.4375
>   BQL FQC 2q:  768.30  817.72  864.43  974.40agv: 856.2125
>   BQL FQC 4q:  945.66  942.68  878.51  822.82agv: 897.4175
>   BQL PFF 1q:  149.69  151.49  149.40  147.47agv: 149.5125
>   BQL PFF 2q: 2059.32  798.74 1844.12  381.80agv: 1270.995
>   BQL PFF 4q: 1871.98 4420.02 4916.59 13268.16   agv: 6119.1875
> 
> Signed-off-by: Jiri Pirko 

I see you now support both napi and non-napi. Thanks a lot Jiri!
Just coming out of a national holiday here, as usual with a backlog - pls allow
until Monday to review. Thanks!

> ---
> v1->v2:
> - moved netdev_tx_completed_queue() call into __free_old_xmit(),
>   propagate use_napi flag to __free_old_xmit() and only call
>   netdev_tx_completed_queue() in case it is true
> - added forgotten call to netdev_tx_reset_queue()
> - fixed stats for xdp packets
> - fixed bql accounting when __free_old_xmit() is called from xdp path
> - handle the !use_napi case in start_xmit() kick section
> ---
>  drivers/net/virtio_net.c | 50 +---
>  1 file changed, 32 insertions(+), 18 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 61a57d134544..5863c663ccab 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -84,7 +84,9 @@ struct virtnet_stat_desc {
>  
>  struct virtnet_sq_free_stats {
>   u64 packets;
> + u64 xdp_packets;
>   u64 bytes;
> + u64 xdp_bytes;
>  };
>  
>  struct virtnet_sq_stats {
> @@ -506,29 +508,33 @@ static struct xdp_frame *ptr_to_xdp(void *ptr)
>   return (struct xdp_frame *)((unsigned long)ptr & ~VIRTIO_XDP_FLAG);
>  }
>  
> -static void __free_old_xmit(struct send_queue *sq, bool in_napi,
> +static void __free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> + bool in_napi, bool use_napi,
>   struct virtnet_sq_free_stats *stats)
>  {
>   unsigned int len;
>   void *ptr;
>  
>   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - ++stats->packets;
> -
>   if (!is_xdp_frame(ptr)) {
>   struct sk_buff *skb = ptr;
>  
>   pr_debug("Sent skb %p\n", skb);
>  
> + stats->packets++;
>   stats->bytes += skb->len;
>   napi_consume_skb(skb, in_napi);
>   } else {
>   struct xdp_frame *frame = ptr_to_xdp(ptr);
>  
> - stats->bytes += xdp_get_frame_len(frame);
> + stats->xdp_packets++;
> + stats->xdp_bytes += xdp_get_frame_len(frame);
>   xdp_return_frame(frame);
>   }
>   }
> + if (use_napi)
> + netdev_tx_completed_queue(txq, stats->packets, stats->bytes);
> +
>  }
>  
>  /* Converting between virtqueue no. and kernel tx/rx queue no.
> @@ -955,21 +961,22 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue 
> *vq, void *buf)
>   virtnet_rq_free_buf(vi, rq, buf);
>  }
>  
> -static void free_old_xmit(struct send_queue *sq, bool in_napi)
> +static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> +   bool in_napi, bool use_napi)
>  {
>   struct virtnet_sq_free_stats stats = {0};
>  
> - __free_old_xmit(sq, in_napi, &stats);
> + __free_old_xmit(sq, txq, in_napi, use_napi, &stats);
>  
>   /* Avoid overhead when no packets have been processed
>* happens when called speculatively from start_xmit.
>*/
> - if (!stats.packets)
> + if (!stats.packets && !stats.xdp_packets)
>   return;
>  
>   u64_stats_update_begin(&sq->stats.syncp);
> - u64_stats_add(&sq->stats.bytes, stats.bytes);
> - u64_stats_add(&sq->stats.packets, stats.packets);
> + u64_stats_add(&sq->stats.bytes, stats.bytes + stats.xdp_bytes);
> + u64_stats_add(&sq->stats.packets, stats.packets + stats.xdp_packets);
>   u64_stats_update_end(&sq->stats.syncp);
>  }
>  
> @@ -1003,7 +1010,9 @@ static void check_sq_full_and_disable(struct 
> virtnet_info *vi,
>* early means 16 slots are typically wasted.
>*/
>   if (sq->vq->num_free < 2+MAX_SKB_FRAGS)

Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-10 Thread Michael S. Tsirkin
On Fri, Jun 07, 2024 at 01:30:34PM +0200, Jiri Pirko wrote:
> Fri, Jun 07, 2024 at 12:23:37PM CEST, m...@redhat.com wrote:
> >On Fri, Jun 07, 2024 at 11:57:37AM +0200, Jiri Pirko wrote:
> >> >True. Personally, I would like to just drop orphan mode. But I'm not
> >> >sure others are happy with this.
> >> 
> >> How about to do it other way around. I will take a stab at sending patch
> >> removing it. If anyone is against and has solid data to prove orphan
> >> mode is needed, let them provide those.
> >
> >Break it with no warning and see if anyone complains?
> 
> This is not what I suggested at all.
> 
> >No, this is not how we handle userspace compatibility, normally.
> 
> Sure.
> 
> Again:
> 
> I would send orphan removal patch containing:
> 1) no module options removal. Warn if someone sets it up
> 2) module option to disable napi is ignored
> 3) orphan mode is removed from code
> 
> There is no breakage. Only, hypothetically, a performance downgrade in some
> hypothetical use case nobody knows of.

Performance is why people use virtio. It's as much a breakage as any
other bug. The main difference is, with other types of breakage, they
are typically binary and we can not tolerate them at all.  A tiny,
negligible performance regression might be tolerable if it brings
other benefits. I very much doubt avoiding interrupts is
negligible though. And making code simpler isn't a big benefit,
users do not care.

> My point was, if someone presents
> solid data to prove orphan is needed during the patch review, let's toss
> out the patch.
> 
> Makes sense?

It's not hypothetical - if anything, it's hypothetical that performance
does not regress.  And we just got a report from users that see a
regression without it.  So, not really.

> 
> >
> >-- 
> >MST
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-07 Thread Michael S. Tsirkin
On Fri, Jun 07, 2024 at 11:57:37AM +0200, Jiri Pirko wrote:
> >True. Personally, I would like to just drop orphan mode. But I'm not
> >sure others are happy with this.
> 
> How about to do it other way around. I will take a stab at sending patch
> removing it. If anyone is against and has solid data to prove orphan
> mode is needed, let them provide those.

Break it with no warning and see if anyone complains?
No, this is not how we handle userspace compatibility, normally.

-- 
MST




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-06 Thread Michael S. Tsirkin
On Fri, Jun 07, 2024 at 08:39:20AM +0200, Jiri Pirko wrote:
> Fri, Jun 07, 2024 at 08:25:19AM CEST, jasow...@redhat.com wrote:
> >On Thu, Jun 6, 2024 at 9:45 PM Jiri Pirko  wrote:
> >>
> >> Thu, Jun 06, 2024 at 09:56:50AM CEST, jasow...@redhat.com wrote:
> >> >On Thu, Jun 6, 2024 at 2:05 PM Michael S. Tsirkin  wrote:
> >> >>
> >> >> On Thu, Jun 06, 2024 at 12:25:15PM +0800, Jason Wang wrote:
> >> >> > > If the codes of orphan mode don't have an impact when you enable
> >> >> > > napi_tx mode, please keep it if you can.
> >> >> >
> >> >> > For example, it complicates BQL implementation.
> >> >> >
> >> >> > Thanks
> >> >>
> >> >> I very much doubt sending interrupts to a VM can
> >> >> *on all benchmarks* compete with not sending interrupts.
> >> >
> >> >It should not differ too much from the physical NIC. We can have one
> >> >more round of benchmarks to see the difference.
> >> >
> >> >But if NAPI mode needs to win all of the benchmarks in order to get
> >> >rid of orphan, that would be very difficult. Considering various bugs
> >> >will be fixed by dropping skb_orphan(), it would be sufficient if most
> >> >of the benchmark doesn't show obvious differences.
> >> >
> >> >Looking at git history, there're commits that removes skb_orphan(), for 
> >> >example:
> >> >
> >> >commit 8112ec3b8722680251aecdcc23dfd81aa7af6340
> >> >Author: Eric Dumazet 
> >> >Date:   Fri Sep 28 07:53:26 2012 +
> >> >
> >> >mlx4: dont orphan skbs in mlx4_en_xmit()
> >> >
> >> >After commit e22979d96a55d (mlx4_en: Moving to Interrupts for TX
> >> >completions) we no longer need to orphan skbs in mlx4_en_xmit()
> >> >since skb wont stay a long time in TX ring before their release.
> >> >
> >> >Orphaning skbs in ndo_start_xmit() should be avoided as much as
> >> >possible, since it breaks TCP Small Queue or other flow control
> >> >mechanisms (per socket limits)
> >> >
> >> >Signed-off-by: Eric Dumazet 
> >> >Acked-by: Yevgeny Petrilin 
> >> >Cc: Or Gerlitz 
> >> >Signed-off-by: David S. Miller 
> >> >
> >> >>
> >> >> So yea, it's great if napi and hardware are advanced enough
> >> >> that the default can be changed, since this way virtio
> >> >> is closer to a regular nic and more of the standard
> >> >> infrastructure can be used.
> >> >>
> >> >> But dropping it will go against *no breaking userspace* rule.
> >> >> Complicated? Tough.
> >> >
> >> >I don't know what kind of userspace is broken by this. Or why it is
> >> >not broken since the day we enable NAPI mode by default.
> >>
> >> There is a module option that explicitly allows user to set
> >> napi_tx=false
> >> or
> >> napi_weight=0
> >>
> >> So if you remove this option or ignore it, both breaks the user
> >> expectation.
> >
> >We can keep them, but I wonder what's the expectation of the user
> >here? The only thing so far I can imagine is the performance
> >difference.
> 
> True.
> 
> >
> >> I personally would vote for this breakage. To carry ancient
> >> things like this one forever does not make sense to me.
> >
> >Exactly.
> >
> >> While at it,
> >> let's remove all virtio net module params. Thoughts?
> >
> >I tend to
> >
> >1) drop the orphan mode, but we can have some benchmarks first
> 
> Any idea which? That would be really tricky to find the ones where
> orphan mode makes difference I assume.

Exactly. We are kind of stuck with it I think.
I would just do this:

void orphan_destructor(struct sk_buff *skb)
{
}

skb_orphan(skb);
skb->destructor = orphan_destructor;
/* skip BQL */
return;


and then later
/* skip BQL accounting if we orphaned on xmit path */
if (skb->destructor == orphan_destructor)
return;



Hmm?
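
For illustration, here is a slightly fuller, self-contained sketch of that
idea. The function names such as virtnet_orphan_destructor() are hypothetical;
only skb_orphan() and the skb->destructor field are the real kernel API, and
this is a sketch of the proposal above, not the driver's actual code:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* No-op destructor whose address marks "orphaned on xmit, skip BQL". */
static void virtnet_orphan_destructor(struct sk_buff *skb)
{
}

/* xmit path: orphan the skb and tag it so it bypasses BQL accounting. */
static void virtnet_orphan_on_xmit(struct sk_buff *skb)
{
	skb_orphan(skb);	/* runs and clears any old destructor, drops skb->sk */
	skb->destructor = virtnet_orphan_destructor;
	/* intentionally no netdev_tx_sent_queue() for this skb */
}

/* completion path: only skbs that went through BQL on xmit are accounted. */
static bool virtnet_skb_counts_for_bql(const struct sk_buff *skb)
{
	return skb->destructor != virtnet_orphan_destructor;
}

The marker lets the completion path skip netdev_tx_completed_queue()
accounting for any skb that never went through netdev_tx_sent_queue().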


> 
> >2) keep the module parameters
> 
> and ignore them, correct? Perhaps a warning would be good.
> 
> 
> >
> >Thanks
> >
> >>
> >>
> >>
> >> >
> >> >Thanks
> >> >
> >> >>
> >> >> --
> >> >> MST
> >> >>
> >> >
> >>
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-06 Thread Michael S. Tsirkin
On Fri, Jun 07, 2024 at 02:22:31PM +0800, Jason Wang wrote:
> On Thu, Jun 6, 2024 at 9:41 PM Jiri Pirko  wrote:
> >
> > Thu, Jun 06, 2024 at 06:25:15AM CEST, jasow...@redhat.com wrote:
> > >On Thu, Jun 6, 2024 at 10:59 AM Jason Xing  
> > >wrote:
> > >>
> > >> Hello Jason,
> > >>
> > >> On Thu, Jun 6, 2024 at 8:21 AM Jason Wang  wrote:
> > >> >
> > >> > On Wed, Jun 5, 2024 at 7:51 PM Heng Qi  
> > >> > wrote:
> > >> > >
> > >> > > On Wed, 5 Jun 2024 13:30:51 +0200, Jiri Pirko  
> > >> > > wrote:
> > >> > > > Mon, May 20, 2024 at 02:48:15PM CEST, j...@resnulli.us wrote:
> > >> > > > >Fri, May 10, 2024 at 09:11:16AM CEST, hen...@linux.alibaba.com 
> > >> > > > >wrote:
> > >> > > > >>On Thu,  9 May 2024 13:46:15 +0200, Jiri Pirko 
> > >> > > > >> wrote:
> > >> > > > >>> From: Jiri Pirko 
> > >> > > > >>>
> > >> > > > >>> Add support for Byte Queue Limits (BQL).
> > >> > > > >>
> > >> > > > >>Historically both Jason and Michael have attempted to support BQL
> > >> > > > >>for virtio-net, for example:
> > >> > > > >>
> > >> > > > >>https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> > >> > > > >>
> > >> > > > >>These discussions focus primarily on:
> > >> > > > >>
> > >> > > > >>1. BQL is based on napi tx. Therefore, the transfer of 
> > >> > > > >>statistical information
> > >> > > > >>needs to rely on the judgment of use_napi. When the napi mode is 
> > >> > > > >>switched to
> > >> > > > >>orphan, some statistical information will be lost, resulting in 
> > >> > > > >>temporary
> > >> > > > >>inaccuracy in BQL.
> > >> > > > >>
> > >> > > > >>2. If tx dim is supported, orphan mode may be removed and tx irq 
> > >> > > > >>will be more
> > >> > > > >>reasonable. This provides good support for BQL.
> > >> > > > >
> > >> > > > >But when the device does not support dim, the orphan mode is still
> > >> > > > >needed, isn't it?
> > >> > > >
> > >> > > > Heng, is my assuption correct here? Thanks!
> > >> > > >
> > >> > >
> > >> > > Maybe, according to our cloud data, napi_tx=on works better than 
> > >> > > orphan mode in
> > >> > > most scenarios. Although orphan mode performs better in specific 
> > >> > > benchmarks,
> > >> >
> > >> > For example pktgen (I meant even if the orphan mode can break pktgen,
> > >> > it can finish when there's a new packet that needs to be sent after
> > >> > pktgen is completed).
> > >> >
> > >> > > perf of napi_tx can be enhanced through tx dim. Then, there is no 
> > >> > > reason not to
> > >> > > support dim for devices that want the best performance.
> > >> >
> > >> > Ideally, if we can drop orphan mode, everything would be simplified.
> > >>
> > >> Please please don't do this. Orphan mode still has its merits. In some
> > >> cases which can hardly be reproduced in production, we still choose to
> > >> turn off the napi_tx mode because the delay of freeing a skb could
> > >> cause lower performance in the tx path,
> > >
> > >Well, it's probably just a side effect and it depends on how to define
> > >performance here.
> > >
> > >> which is, I know, surely
> > >> designed on purpose.
> > >
> > >I don't think so and no modern NIC uses that. It breaks a lot of things.
> > >
> > >>
> > >> If the codes of orphan mode don't have an impact when you enable
> > >> napi_tx mode, please keep it if you can.
> > >
> > >For example, it complicates BQL implementation.
> >
> > Well, bql could be disabled when napi is not used. It is just a matter
> > of one "if" in the xmit path.
> 
> Maybe, care to post a patch?
> 
> The trick part is, a skb is queued when BQL is enabled but sent when
> BQL is disabled as discussed here:
> 
> https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> 
> Thanks

Yes of course. Or we can stick a dummy value in skb->destructor after we
orphan, maybe that's easier.


> >
> >
> > >
> > >Thanks
> > >
> > >>
> > >> Thank you.
> > >>
> > >
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-06 Thread Michael S. Tsirkin
On Thu, Jun 06, 2024 at 07:42:35PM +0800, Jason Xing wrote:
> On Thu, Jun 6, 2024 at 12:25 PM Jason Wang  wrote:
> >
> > On Thu, Jun 6, 2024 at 10:59 AM Jason Xing  
> > wrote:
> > >
> > > Hello Jason,
> > >
> > > On Thu, Jun 6, 2024 at 8:21 AM Jason Wang  wrote:
> > > >
> > > > On Wed, Jun 5, 2024 at 7:51 PM Heng Qi  wrote:
> > > > >
> > > > > On Wed, 5 Jun 2024 13:30:51 +0200, Jiri Pirko  
> > > > > wrote:
> > > > > > Mon, May 20, 2024 at 02:48:15PM CEST, j...@resnulli.us wrote:
> > > > > > >Fri, May 10, 2024 at 09:11:16AM CEST, hen...@linux.alibaba.com 
> > > > > > >wrote:
> > > > > > >>On Thu,  9 May 2024 13:46:15 +0200, Jiri Pirko  
> > > > > > >>wrote:
> > > > > > >>> From: Jiri Pirko 
> > > > > > >>>
> > > > > > >>> Add support for Byte Queue Limits (BQL).
> > > > > > >>
> > > > > > >>Historically both Jason and Michael have attempted to support BQL
> > > > > > >>for virtio-net, for example:
> > > > > > >>
> > > > > > >>https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> > > > > > >>
> > > > > > >>These discussions focus primarily on:
> > > > > > >>
> > > > > > >>1. BQL is based on napi tx. Therefore, the transfer of 
> > > > > > >>statistical information
> > > > > > >>needs to rely on the judgment of use_napi. When the napi mode is 
> > > > > > >>switched to
> > > > > > >>orphan, some statistical information will be lost, resulting in 
> > > > > > >>temporary
> > > > > > >>inaccuracy in BQL.
> > > > > > >>
> > > > > > >>2. If tx dim is supported, orphan mode may be removed and tx irq 
> > > > > > >>will be more
> > > > > > >>reasonable. This provides good support for BQL.
> > > > > > >
> > > > > > >But when the device does not support dim, the orphan mode is still
> > > > > > >needed, isn't it?
> > > > > >
> > > > > > Heng, is my assuption correct here? Thanks!
> > > > > >
> > > > >
> > > > > Maybe, according to our cloud data, napi_tx=on works better than 
> > > > > orphan mode in
> > > > > most scenarios. Although orphan mode performs better in specific 
> > > > > benchmarks,
> > > >
> > > > For example pktgen (I meant even if the orphan mode can break pktgen,
> > > > it can finish when there's a new packet that needs to be sent after
> > > > pktgen is completed).
> > > >
> > > > > perf of napi_tx can be enhanced through tx dim. Then, there is no 
> > > > > reason not to
> > > > > support dim for devices that want the best performance.
> > > >
> > > > Ideally, if we can drop orphan mode, everything would be simplified.
> > >
> > > Please please don't do this. Orphan mode still has its merits. In some
> > > cases which can hardly be reproduced in production, we still choose to
> > > turn off the napi_tx mode because the delay of freeing a skb could
> > > cause lower performance in the tx path,
> >
> > Well, it's probably just a side effect and it depends on how to define
> > performance here.
> 
> Yes.
> 
> >
> > > which is, I know, surely
> > > designed on purpose.
> >
> > I don't think so and no modern NIC uses that. It breaks a lot of things.
> 
> To avoid confusion, I meant napi_tx mode can delay/slow down the speed
> in the tx path and no modern nic uses skb_orphan().

Clearly it's been designed for software NICs and when the
cost of interrupts is very high.

> I think I will have some time to test BQL in virtio_net.
> 
> Thanks,
> Jason




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-06-05 Thread Michael S. Tsirkin
On Thu, Jun 06, 2024 at 12:25:15PM +0800, Jason Wang wrote:
> > If the codes of orphan mode don't have an impact when you enable
> > napi_tx mode, please keep it if you can.
> 
> For example, it complicates BQL implementation.
> 
> Thanks

I very much doubt sending interrupts to a VM can
*on all benchmarks* compete with not sending interrupts.

So yea, it's great if napi and hardware are advanced enough
that the default can be changed, since this way virtio
is closer to a regular nic and more of the standard
infrastructure can be used.

But dropping it will go against *no breaking userspace* rule.
Complicated? Tough.

-- 
MST




Re: [PATCH net-next v2 12/12] virtio_net: refactor the xmit type

2024-06-02 Thread Michael S. Tsirkin
On Thu, May 30, 2024 at 07:24:06PM +0800, Xuan Zhuo wrote:
> +enum virtnet_xmit_type {
> + VIRTNET_XMIT_TYPE_SKB,
> + VIRTNET_XMIT_TYPE_XDP,
> +};
> +
> +#define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | 
> VIRTNET_XMIT_TYPE_XDP)

No idea how this has any chance to work.
Was this tested, even?
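
Presumably the objection is the arithmetic of the mask itself; interpreting
the series' intent this way is an assumption, but a minimal sketch of what
the posted definitions evaluate to with plain C enum numbering:

enum virtnet_xmit_type {
	VIRTNET_XMIT_TYPE_SKB,		/* = 0 */
	VIRTNET_XMIT_TYPE_XDP,		/* = 1 */
};

#define VIRTNET_XMIT_TYPE_MASK (VIRTNET_XMIT_TYPE_SKB | VIRTNET_XMIT_TYPE_XDP)
/* 0 | 1 == 1: the SKB type contributes no bit at all, so a pointer tagged
 * as SKB looks identical to an untagged pointer and (ptr & MASK) cannot
 * distinguish the two cases.  Flag-style type encodings normally need
 * distinct non-zero bits, e.g. 0x1 and 0x2.
 */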

-- 
MST




Re: [PATCH net-next v2 00/12] virtnet_net: prepare for af-xdp

2024-06-02 Thread Michael S. Tsirkin
On Sat, Jun 01, 2024 at 09:01:29AM +0800, Xuan Zhuo wrote:
> On Thu, 30 May 2024 07:53:17 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Thu, May 30, 2024 at 07:23:54PM +0800, Xuan Zhuo wrote:
> > > This patch set prepares for supporting af-xdp zerocopy.
> > > There is no feature change in this patch set.
> > > I just want to reduce the patch num of the final patch set,
> > > so I split the patch set.
> > >
> > > Thanks.
> > >
> > > v2:
> > > 1. Add five commits. That provides some helper for sq to support 
> > > premapped
> > >mode. And the last one refactors distinguishing xmit types.
> > >
> > > v1:
> > > 1. resend for the new net-next merge window
> > >
> >
> >
> > It's great that you are working on this but
> > I'd like to see the actual use of this first.
> 
> I want to finish this work quickly. I don't have a particular preference for
> whether to use a separate directory; as an engineer, I think it makes sense. I
> don't want to keep dwelling on this issue. I also hope that as a maintainer, 
> you
> can help me complete this work as soon as possible. You should know that I 
> have
> been working on this for about three years now.
> 
> I can completely follow your suggestion regarding splitting the directory.
> However, there will still be many patches, so I hope that these patches in 
> this
> patch set can be merged first.
> 
>virtio_net: separate virtnet_rx_resize()
>virtio_net: separate virtnet_tx_resize()
>virtio_net: separate receive_mergeable
>virtio_net: separate receive_buf
>virtio_net: refactor the xmit type
> 
> I will try to compress the subsequent patch sets, hoping to reduce them to 
> about 15.
> 
> Thanks.


You can also post an RFC even if it's bigger than 15. If I see the use
I can start merging some of the patches.

> 
> >
> > >
> > > Xuan Zhuo (12):
> > >   virtio_net: independent directory
> > >   virtio_net: move core structures to virtio_net.h
> > >   virtio_net: add prefix virtnet to all struct inside virtio_net.h
> > >   virtio_net: separate virtnet_rx_resize()
> > >   virtio_net: separate virtnet_tx_resize()
> > >   virtio_net: separate receive_mergeable
> > >   virtio_net: separate receive_buf
> > >   virtio_ring: introduce vring_need_unmap_buffer
> > >   virtio_ring: introduce dma map api for page
> > >   virtio_ring: introduce virtqueue_dma_map_sg_attrs
> > >   virtio_ring: virtqueue_set_dma_premapped() support to disable
> > >   virtio_net: refactor the xmit type
> > >
> > >  MAINTAINERS   |   2 +-
> > >  drivers/net/Kconfig   |   9 +-
> > >  drivers/net/Makefile  |   2 +-
> > >  drivers/net/virtio/Kconfig|  12 +
> > >  drivers/net/virtio/Makefile   |   8 +
> > >  drivers/net/virtio/virtnet.h  | 248 
> > >  .../{virtio_net.c => virtio/virtnet_main.c}   | 596 +++---
> > >  drivers/virtio/virtio_ring.c  | 118 +++-
> > >  include/linux/virtio.h|  12 +-
> > >  9 files changed, 606 insertions(+), 401 deletions(-)
> > >  create mode 100644 drivers/net/virtio/Kconfig
> > >  create mode 100644 drivers/net/virtio/Makefile
> > >  create mode 100644 drivers/net/virtio/virtnet.h
> > >  rename drivers/net/{virtio_net.c => virtio/virtnet_main.c} (93%)
> > >
> > > --
> > > 2.32.0.3.g01195cf9f
> >




Re: [PATCH net-next v2 00/12] virtnet_net: prepare for af-xdp

2024-05-30 Thread Michael S. Tsirkin
On Thu, May 30, 2024 at 07:23:54PM +0800, Xuan Zhuo wrote:
> This patch set prepares for supporting af-xdp zerocopy.
> There is no feature change in this patch set.
> I just want to reduce the patch num of the final patch set,
> so I split the patch set.
> 
> Thanks.
> 
> v2:
> 1. Add five commits. That provides some helper for sq to support premapped
>mode. And the last one refactors distinguishing xmit types.
> 
> v1:
> 1. resend for the new net-next merge window
> 


It's great that you are working on this but
I'd like to see the actual use of this first.

> 
> Xuan Zhuo (12):
>   virtio_net: independent directory
>   virtio_net: move core structures to virtio_net.h
>   virtio_net: add prefix virtnet to all struct inside virtio_net.h
>   virtio_net: separate virtnet_rx_resize()
>   virtio_net: separate virtnet_tx_resize()
>   virtio_net: separate receive_mergeable
>   virtio_net: separate receive_buf
>   virtio_ring: introduce vring_need_unmap_buffer
>   virtio_ring: introduce dma map api for page
>   virtio_ring: introduce virtqueue_dma_map_sg_attrs
>   virtio_ring: virtqueue_set_dma_premapped() support to disable
>   virtio_net: refactor the xmit type
> 
>  MAINTAINERS   |   2 +-
>  drivers/net/Kconfig   |   9 +-
>  drivers/net/Makefile  |   2 +-
>  drivers/net/virtio/Kconfig|  12 +
>  drivers/net/virtio/Makefile   |   8 +
>  drivers/net/virtio/virtnet.h  | 248 
>  .../{virtio_net.c => virtio/virtnet_main.c}   | 596 +++---
>  drivers/virtio/virtio_ring.c  | 118 +++-
>  include/linux/virtio.h|  12 +-
>  9 files changed, 606 insertions(+), 401 deletions(-)
>  create mode 100644 drivers/net/virtio/Kconfig
>  create mode 100644 drivers/net/virtio/Makefile
>  create mode 100644 drivers/net/virtio/virtnet.h
>  rename drivers/net/{virtio_net.c => virtio/virtnet_main.c} (93%)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH net v3 1/2] virtio_net: fix possible dim status unrecoverable

2024-05-30 Thread Michael S. Tsirkin
On Tue, May 28, 2024 at 09:41:15PM +0800, Heng Qi wrote:
> When the dim worker is scheduled, if it no longer needs to issue
> commands, dim may not be able to return to the working state later.
> 
> For example, the following single queue scenario:
>   1. The dim worker of rxq0 is scheduled, and the dim status is
>  changed to DIM_APPLY_NEW_PROFILE;
>   2. dim is disabled or parameters have not been modified;
>   3. virtnet_rx_dim_work exits directly;
> 
> Then, even if net_dim is invoked again, it cannot work because the
> state is not restored to DIM_START_MEASURE.
> 
> Fixes: 6208799553a8 ("virtio-net: support rx netdim")
> Signed-off-by: Heng Qi 
> Reviewed-by: Jiri Pirko 


Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4a802c0ea2cb..4f828a9e5889 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4417,9 +4417,9 @@ static void virtnet_rx_dim_work(struct work_struct 
> *work)
>   if (err)
>   pr_debug("%s: Failed to send dim parameters on rxq%d\n",
>dev->name, qnum);
> - dim->state = DIM_START_MEASURE;
>   }
>  out:
> + dim->state = DIM_START_MEASURE;
>   mutex_unlock(&rq->dim_lock);
>  }
>  
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net v3 2/2] virtio_net: fix a spurious deadlock issue

2024-05-30 Thread Michael S. Tsirkin
On Tue, May 28, 2024 at 09:41:16PM +0800, Heng Qi wrote:
> When the following snippet is run, lockdep will report a deadlock[1].
> 
>   /* Acquire all queues dim_locks */
>   for (i = 0; i < vi->max_queue_pairs; i++)
>   mutex_lock(&vi->rq[i].dim_lock);
> 
> There's no deadlock here because the vq locks are always taken
> in the same order, but lockdep can not figure it out. So refactoring
> the code to alleviate the problem.
> 
> [1]
> 
> WARNING: possible recursive locking detected
> 6.9.0-rc7+ #319 Not tainted
> 
> ethtool/962 is trying to acquire lock:
> 
> but task is already holding lock:
> 
> other info that might help us debug this:
> Possible unsafe locking scenario:
> 
>   CPU0
>   
>  lock(&vi->rq[i].dim_lock);
>  lock(&vi->rq[i].dim_lock);
> 
> *** DEADLOCK ***
> 
>  May be due to missing lock nesting notation
> 
> 3 locks held by ethtool/962:
>  #0: 82dbaab0 (cb_lock){}-{3:3}, at: genl_rcv+0x19/0x40
>  #1: 82dad0a8 (rtnl_mutex){+.+.}-{3:3}, at:
>   ethnl_default_set_doit+0xbe/0x1e0
> 
> stack backtrace:
> CPU: 6 PID: 962 Comm: ethtool Not tainted 6.9.0-rc7+ #319
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
>  rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> Call Trace:
>  
>  dump_stack_lvl+0x79/0xb0
>  check_deadlock+0x130/0x220
>  __lock_acquire+0x861/0x990
>  lock_acquire.part.0+0x72/0x1d0
>  ? lock_acquire+0xf8/0x130
>  __mutex_lock+0x71/0xd50
>  virtnet_set_coalesce+0x151/0x190
>  __ethnl_set_coalesce.isra.0+0x3f8/0x4d0
>  ethnl_set_coalesce+0x34/0x90
>  ethnl_default_set_doit+0xdd/0x1e0
>  genl_family_rcv_msg_doit+0xdc/0x130
>  genl_family_rcv_msg+0x154/0x230
>  ? __pfx_ethnl_default_set_doit+0x10/0x10
>  genl_rcv_msg+0x4b/0xa0
>  ? __pfx_genl_rcv_msg+0x10/0x10
>  netlink_rcv_skb+0x5a/0x110
>  genl_rcv+0x28/0x40
>  netlink_unicast+0x1af/0x280
>  netlink_sendmsg+0x20e/0x460
>  __sys_sendto+0x1fe/0x210
>  ? find_held_lock+0x2b/0x80
>  ? do_user_addr_fault+0x3a2/0x8a0
>  ? __lock_release+0x5e/0x160
>  ? do_user_addr_fault+0x3a2/0x8a0
>  ? lock_release+0x72/0x140
>  ? do_user_addr_fault+0x3a7/0x8a0
>  __x64_sys_sendto+0x29/0x30
>  do_syscall_64+0x78/0x180
>  entry_SYSCALL_64_after_hwframe+0x76/0x7e
> 
> Fixes: 4d4ac2ececd3 ("virtio_net: Add a lock for per queue RX coalesce")
> Signed-off-by: Heng Qi 


Acked-by: Michael S. Tsirkin 


> ---
>  drivers/net/virtio_net.c | 36 
>  1 file changed, 16 insertions(+), 20 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4f828a9e5889..ecb5203d0372 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -4257,7 +4257,6 @@ static int virtnet_send_rx_notf_coal_cmds(struct 
> virtnet_info *vi,
>   struct virtio_net_ctrl_coal_rx *coal_rx __free(kfree) = NULL;
>   bool rx_ctrl_dim_on = !!ec->use_adaptive_rx_coalesce;
>   struct scatterlist sgs_rx;
> - int ret = 0;
>   int i;
>  
>   if (rx_ctrl_dim_on && !virtio_has_feature(vi->vdev, 
> VIRTIO_NET_F_VQ_NOTF_COAL))
> @@ -4267,27 +4266,27 @@ static int virtnet_send_rx_notf_coal_cmds(struct 
> virtnet_info *vi,
>  ec->rx_max_coalesced_frames != 
> vi->intr_coal_rx.max_packets))
>   return -EINVAL;
>  
> - /* Acquire all queues dim_locks */
> - for (i = 0; i < vi->max_queue_pairs; i++)
> - mutex_lock(&vi->rq[i].dim_lock);
> -
>   if (rx_ctrl_dim_on && !vi->rx_dim_enabled) {
>   vi->rx_dim_enabled = true;
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> + mutex_lock(&vi->rq[i].dim_lock);
>   vi->rq[i].dim_enabled = true;
> - goto unlock;
> + mutex_unlock(&vi->rq[i].dim_lock);
> + }
> + return 0;
>   }
>  
>   coal_rx = kzalloc(sizeof(*coal_rx), GFP_KERNEL);
> - if (!coal_rx) {
> - ret = -ENOMEM;
> - goto unlock;
> - }
> + if (!coal_rx)
> + return -ENOMEM;
>  
>   if (!rx_ctrl_dim_on && vi->rx_dim_enabled) {
>   vi->rx_dim_enabled = false;
> - for (i = 0; i < vi->max_queue_pairs; i++)
> + for (i = 0; i < vi->max_queue_pairs; i++) {
> +  

Re: [PATCH net-next v1 0/7] virtnet_net: prepare for af-xdp

2024-05-30 Thread Michael S. Tsirkin
On Thu, May 30, 2024 at 03:26:42PM +0800, Xuan Zhuo wrote:
> This patch set prepares for supporting af-xdp zerocopy.
> There is no feature change in this patch set.
> I just want to reduce the patch num of the final patch set,
> so I split the patch set.
> 
> #1-#3 add independent directory for virtio-net
> #4-#7 do some refactor, the sub-functions will be used by the subsequent 
> commits
> 
> Thanks.
> 
> v1:
> 1. resend for the new net-next merge window

What I said at the time is

I am fine adding xsk in a new file or just adding in same file working 
on a split later.

Given this was a year ago and all we keep seeing is "prepare" patches,
I am inclined to say do it in the reverse order: add
af-xdp first, then do the split when it's clear there is not
a lot of code sharing going on.


> 
> Xuan Zhuo (7):
>   virtio_net: independent directory
>   virtio_net: move core structures to virtio_net.h
>   virtio_net: add prefix virtnet to all struct inside virtio_net.h
>   virtio_net: separate virtnet_rx_resize()
>   virtio_net: separate virtnet_tx_resize()
>   virtio_net: separate receive_mergeable
>   virtio_net: separate receive_buf
> 
>  MAINTAINERS   |   2 +-
>  drivers/net/Kconfig   |   9 +-
>  drivers/net/Makefile  |   2 +-
>  drivers/net/virtio/Kconfig|  12 +
>  drivers/net/virtio/Makefile   |   8 +
>  drivers/net/virtio/virtnet.h  | 248 
>  .../{virtio_net.c => virtio/virtnet_main.c}   | 536 ++
>  7 files changed, 454 insertions(+), 363 deletions(-)
>  create mode 100644 drivers/net/virtio/Kconfig
>  create mode 100644 drivers/net/virtio/Makefile
>  create mode 100644 drivers/net/virtio/virtnet.h
>  rename drivers/net/{virtio_net.c => virtio/virtnet_main.c} (94%)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH net 2/2] virtio_net: fix missing lock protection on control_buf access

2024-05-28 Thread Michael S. Tsirkin
On Wed, May 29, 2024 at 12:01:45AM +0800, Heng Qi wrote:
> On Tue, 28 May 2024 11:46:28 -0400, "Michael S. Tsirkin"  
> wrote:
> > On Tue, May 28, 2024 at 03:52:26PM +0800, Heng Qi wrote:
> > > Refactored the handling of control_buf to be within the cvq_lock
> > > critical section, mitigating race conditions between reading device
> > > responses and new command submissions.
> > > 
> > > Fixes: 6f45ab3e0409 ("virtio_net: Add a lock for the command VQ.")
> > > Signed-off-by: Heng Qi 
> > 
> > 
> > I don't get what this changes. The status can change immediately
> > after you drop the mutex, can it not? What exactly are the
> > race conditions you are worried about?
> 
> See the following case:
> 
> 1. Command A is acknowledged and successfully executed by the device.
> 2. After releasing the mutex (mutex_unlock), process P1 gets preempted before
>it can read vi->ctrl->status, *which should be VIRTIO_NET_OK*.
> 3. A new command B (like the DIM command) is issued.
> 4. Post vi->ctrl->status being set to VIRTIO_NET_ERR by
>virtnet_send_command_reply(), process P2 gets preempted.
> 5. Process P1 resumes, reads *vi->ctrl->status as VIRTIO_NET_ERR*, and reports
>this error back for Command A. <-- Race causes incorrect results to be 
> read.
> 
> Thanks.


Why is it important that P1 gets VIRTIO_NET_OK?
After all it is no longer the state.

> > 
> > > ---
> > >  drivers/net/virtio_net.c | 4 +++-
> > >  1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 6b0512a628e0..3d8407d9e3d2 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -2686,6 +2686,7 @@ static bool virtnet_send_command_reply(struct 
> > > virtnet_info *vi, u8 class, u8 cmd
> > >  {
> > >   struct scatterlist *sgs[5], hdr, stat;
> > >   u32 out_num = 0, tmp, in_num = 0;
> > > + bool ret;
> > >   int err;
> > >  
> > >   /* Caller should know better */
> > > @@ -2731,8 +2732,9 @@ static bool virtnet_send_command_reply(struct 
> > > virtnet_info *vi, u8 class, u8 cmd
> > >   }
> > >  
> > >  unlock:
> > > + ret = vi->ctrl->status == VIRTIO_NET_OK;
> > >   mutex_unlock(&vi->cvq_lock);
> > > - return vi->ctrl->status == VIRTIO_NET_OK;
> > > + return ret;
> > >  }
> > >  
> > >  static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 
> > > cmd,
> > > -- 
> > > 2.32.0.3.g01195cf9f
> > 




Re: [PATCH net 2/2] virtio_net: fix missing lock protection on control_buf access

2024-05-28 Thread Michael S. Tsirkin
On Tue, May 28, 2024 at 03:52:26PM +0800, Heng Qi wrote:
> Refactored the handling of control_buf to be within the cvq_lock
> critical section, mitigating race conditions between reading device
> responses and new command submissions.
> 
> Fixes: 6f45ab3e0409 ("virtio_net: Add a lock for the command VQ.")
> Signed-off-by: Heng Qi 


I don't get what this changes. The status can change immediately
after you drop the mutex, can it not? What exactly are the
race conditions you are worried about?

> ---
>  drivers/net/virtio_net.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 6b0512a628e0..3d8407d9e3d2 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2686,6 +2686,7 @@ static bool virtnet_send_command_reply(struct 
> virtnet_info *vi, u8 class, u8 cmd
>  {
>   struct scatterlist *sgs[5], hdr, stat;
>   u32 out_num = 0, tmp, in_num = 0;
> + bool ret;
>   int err;
>  
>   /* Caller should know better */
> @@ -2731,8 +2732,9 @@ static bool virtnet_send_command_reply(struct 
> virtnet_info *vi, u8 class, u8 cmd
>   }
>  
>  unlock:
> + ret = vi->ctrl->status == VIRTIO_NET_OK;
>   mutex_unlock(&vi->cvq_lock);
> - return vi->ctrl->status == VIRTIO_NET_OK;
> + return ret;
>  }
>  
>  static bool virtnet_send_command(struct virtnet_info *vi, u8 class, u8 cmd,
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net 2/2] Revert "virtio_net: Add a lock for per queue RX coalesce"

2024-05-22 Thread Michael S. Tsirkin
On Wed, May 22, 2024 at 04:52:19PM +0800, Heng Qi wrote:
> On Wed, 22 May 2024 10:15:46 +0200, Jiri Pirko  wrote:
> > Wed, May 22, 2024 at 05:45:48AM CEST, hen...@linux.alibaba.com wrote:
> > >This reverts commit 4d4ac2ececd3c42a08dd32a6e3a4aaf25f7efe44.
> > 
> > This commit does not exist in -net or -net-next.
> 
> It definitely exists in net-next :):
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git/commit/?id=4d4ac2ececd3c42a08dd32a6e3a4aaf25f7efe44
> 
> > 
> > >
> > >When the following snippet is run, lockdep will complain[1].
> > >
> > >  /* Acquire all queues dim_locks */
> > >  for (i = 0; i < vi->max_queue_pairs; i++)
> > > mutex_lock(&vi->rq[i].dim_lock);
> > >
> > >At the same time, too many queues will cause lockdep to be more irritable,
> > >which can be alleviated by using mutex_lock_nested(), however, there are
> > >still new warning when the number of queues exceeds MAX_LOCKDEP_SUBCLASSES.
> > >So I would like to gently revert this commit, although it brings
> > >unsynchronization that is not so concerned:

It's really hard to read this explanation.

I think what you mean is:

When the following snippet is run, lockdep will report a deadlock[1].

  /* Acquire all queues dim_locks */
  for (i = 0; i < vi->max_queue_pairs; i++)
  mutex_lock(&vi->rq[i].dim_lock);

There's no deadlock here because the vq locks
are always taken in the same order, but lockdep can not figure it
out, and we can not make each lock a separate class because
there can be more than MAX_LOCKDEP_SUBCLASSES of vqs.

However, dropping the lock is harmless.
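
To make the subclass limitation concrete, a minimal sketch, assuming
MAX_LOCKDEP_SUBCLASSES is still 8 as in current lockdep headers; the lock
array below is illustrative, not the driver's real data structures:

#include <linux/lockdep.h>
#include <linux/mutex.h>

#define NUM_DIM_LOCKS 16		/* more queues than lockdep subclasses */

static struct mutex dim_lock[NUM_DIM_LOCKS];	/* assume mutex_init() at probe */

static void illustrate_subclass_limit(void)
{
	int i;

	/* mutex_lock_nested() only silences lockdep while the subclass stays
	 * below MAX_LOCKDEP_SUBCLASSES (8); with 16 locks the "grab them all"
	 * loop still warns even though the acquisition order never changes.
	 */
	for (i = 0; i < NUM_DIM_LOCKS; i++)
		mutex_lock_nested(&dim_lock[i], i);

	for (i = NUM_DIM_LOCKS - 1; i >= 0; i--)
		mutex_unlock(&dim_lock[i]);
}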



> > >  1. When dim is enabled, rx_dim_work may modify the coalescing parameters.
> > > Users may read dirty coalescing parameters if querying.


... anyway?

> > >  2. When dim is switched from enabled to disabled, a spurious dim worker
> > > maybe scheduled, but this can be handled correctly by rx_dim_work.

may be -> is?
How is this handled exactly?

> > >
> > >[1]
> > >
> > >WARNING: possible recursive locking detected
> > >6.9.0-rc7+ #319 Not tainted
> > >
> > >ethtool/962 is trying to acquire lock:
> > >
> > >but task is already holding lock:
> > >
> > >other info that might help us debug this:
> > >Possible unsafe locking scenario:
> > >
> > >  CPU0
> > >  
> > > lock(&vi->rq[i].dim_lock);
> > > lock(&vi->rq[i].dim_lock);
> > >
> > >*** DEADLOCK ***
> > >
> > > May be due to missing lock nesting notation
> > >
> > >3 locks held by ethtool/962:
> > > #0: 82dbaab0 (cb_lock){}-{3:3}, at: genl_rcv+0x19/0x40
> > > #1: 82dad0a8 (rtnl_mutex){+.+.}-{3:3}, at:
> > >   ethnl_default_set_doit+0xbe/0x1e0
> > >
> > >stack backtrace:
> > >CPU: 6 PID: 962 Comm: ethtool Not tainted 6.9.0-rc7+ #319
> > >Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > >  rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
> > >Call Trace:
> > > 
> > > dump_stack_lvl+0x79/0xb0
> > > check_deadlock+0x130/0x220
> > > __lock_acquire+0x861/0x990
> > > lock_acquire.part.0+0x72/0x1d0
> > > ? lock_acquire+0xf8/0x130
> > > __mutex_lock+0x71/0xd50
> > > virtnet_set_coalesce+0x151/0x190
> > > __ethnl_set_coalesce.isra.0+0x3f8/0x4d0
> > > ethnl_set_coalesce+0x34/0x90
> > > ethnl_default_set_doit+0xdd/0x1e0
> > > genl_family_rcv_msg_doit+0xdc/0x130
> > > genl_family_rcv_msg+0x154/0x230
> > > ? __pfx_ethnl_default_set_doit+0x10/0x10
> > > genl_rcv_msg+0x4b/0xa0
> > > ? __pfx_genl_rcv_msg+0x10/0x10
> > > netlink_rcv_skb+0x5a/0x110
> > > genl_rcv+0x28/0x40
> > > netlink_unicast+0x1af/0x280
> > > netlink_sendmsg+0x20e/0x460
> > > __sys_sendto+0x1fe/0x210
> > > ? find_held_lock+0x2b/0x80
> > > ? do_user_addr_fault+0x3a2/0x8a0
> > > ? __lock_release+0x5e/0x160
> > > ? do_user_addr_fault+0x3a2/0x8a0
> > > ? lock_release+0x72/0x140
> > > ? do_user_addr_fault+0x3a7/0x8a0
> > > __x64_sys_sendto+0x29/0x30
> > > do_syscall_64+0x78/0x180
> > > entry_SYSCALL_64_after_hwframe+0x76/0x7e
> > >
> > >Signed-off-by: Heng Qi 
> > 
> > Fixes tag missing.
> 
> IIUC,
> 
>   "This reverts commit 4d4ac2ececd3c42a08dd32a6e3a4aaf25f7efe44."
> 
> has provided a traceback way, which is not fixing other patches,
> but fixing itself. So we do not need fixes tag.
> 
> Thanks.

Providing the subject of the reverted commit is helpful.
Adding:

Fixes: 4d4ac2ececd3 ("virtio_net: Add a lock for per queue RX coalesce")

is a standard way to do that.




> > 
> > 
> > >---
> > > drivers/net/virtio_net.c | 53 +---
> > > 1 file changed, 12 insertions(+), 41 deletions(-)
> > >
> > >diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > >index 1cad06cef230..e4a1dff2a64a 100644
> > >--- a/drivers/net/virtio_net.c
> > >+++ b/drivers/net/virtio_net.c
> > >@@ -316,9 +316,6 @@ struct receive_queue {
> > >   /* Is dynamic interru

Re: [PATCH net-next] virtio-net: synchronize operstate with admin state on up/down

2024-05-21 Thread Michael S. Tsirkin
On Mon, May 20, 2024 at 09:03:02AM +0800, Jason Wang wrote:
> This patch synchronizes operstate with admin state per RFC2863.
> 
> This is done by trying to toggle the carrier upon open/close and
> synchronizing with the config change work. This allows propagating status
> correctly to stacked devices like:
> 
> ip link add link enp0s3 macvlan0 type macvlan
> ip link set link enp0s3 down
> ip link show
> 
> Before this patch:
> 
> 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> DEFAULT group default qlen 1000
> link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> ..
> 5: macvlan0@enp0s3:  mtu 1500 qdisc 
> noqueue state UP mode DEFAULT group default qlen 1000
> link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> 
> After this patch:
> 
> 3: enp0s3:  mtu 1500 qdisc pfifo_fast state DOWN mode 
> DEFAULT group default qlen 1000
> link/ether 00:00:05:00:00:09 brd ff:ff:ff:ff:ff:ff
> ...
> 5: macvlan0@enp0s3:  mtu 1500 qdisc 
> noqueue state LOWERLAYERDOWN mode DEFAULT group default qlen 1000
> link/ether b2:a9:c5:04:da:53 brd ff:ff:ff:ff:ff:ff
> 
> Cc: Venkat Venkatsubra 
> Cc: Gia-Khanh Nguyen 
> Signed-off-by: Jason Wang 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 94 +++-
>  1 file changed, 63 insertions(+), 31 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 4e1a0fc0d555..24d880a5023d 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -433,6 +433,12 @@ struct virtnet_info {
>   /* The lock to synchronize the access to refill_enabled */
>   spinlock_t refill_lock;
>  
> + /* Is config change enabled? */
> + bool config_change_enabled;
> +
> + /* The lock to synchronize the access to config_change_enabled */
> + spinlock_t config_change_lock;
> +
>   /* Work struct for config space updates */
>   struct work_struct config_work;
>  
> @@ -623,6 +629,20 @@ static void disable_delayed_refill(struct virtnet_info 
> *vi)
>   spin_unlock_bh(&vi->refill_lock);
>  }
>  
> +static void enable_config_change(struct virtnet_info *vi)
> +{
> + spin_lock_irq(&vi->config_change_lock);
> + vi->config_change_enabled = true;
> + spin_unlock_irq(&vi->config_change_lock);
> +}
> +
> +static void disable_config_change(struct virtnet_info *vi)
> +{
> + spin_lock_irq(&vi->config_change_lock);
> + vi->config_change_enabled = false;
> + spin_unlock_irq(&vi->config_change_lock);
> +}
> +
>  static void enable_rx_mode_work(struct virtnet_info *vi)
>  {
>   rtnl_lock();
> @@ -2421,6 +2441,25 @@ static int virtnet_enable_queue_pair(struct 
> virtnet_info *vi, int qp_index)
>   return err;
>  }
>  
> +static void virtnet_update_settings(struct virtnet_info *vi)
> +{
> + u32 speed;
> + u8 duplex;
> +
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_SPEED_DUPLEX))
> + return;
> +
> + virtio_cread_le(vi->vdev, struct virtio_net_config, speed, &speed);
> +
> + if (ethtool_validate_speed(speed))
> + vi->speed = speed;
> +
> + virtio_cread_le(vi->vdev, struct virtio_net_config, duplex, &duplex);
> +
> + if (ethtool_validate_duplex(duplex))
> + vi->duplex = duplex;
> +}
> +
>  static int virtnet_open(struct net_device *dev)
>  {
>   struct virtnet_info *vi = netdev_priv(dev);
> @@ -2439,6 +2478,18 @@ static int virtnet_open(struct net_device *dev)
>   goto err_enable_qp;
>   }
>  
> + /* Assume link up if device can't report link status,
> +otherwise get link status from config. */
> + netif_carrier_off(dev);
> + if (virtio_has_feature(vi->vdev, VIRTIO_NET_F_STATUS)) {
> + enable_config_change(vi);
> + schedule_work(&vi->config_work);
> + } else {
> + vi->status = VIRTIO_NET_S_LINK_UP;
> + virtnet_update_settings(vi);
> + netif_carrier_on(dev);
> + }
> +
>   return 0;
>  
>  err_enable_qp:
> @@ -2875,12 +2926,19 @@ static int virtnet_close(struct net_device *dev)
>   disable_delayed_refill(vi);
>   /* Make sure refill_work doesn't re-enable napi! */
>   cancel_delayed_work_sync(&vi->refill);
> + /* Make sure config notification doesn't schedule config work */
> + disable_config_change(vi);
> + /* Make sure status updating is cancelled */
> + cancel_work_sync(&vi->config_work);
>  
>   for (i = 0; i < vi-&

Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-16 Thread Michael S. Tsirkin
On Thu, May 16, 2024 at 05:25:20PM +0200, Jiri Pirko wrote:
> 
> >I'd expect a regression, if any, to be in a streaming benchmark.
> 
> Can you elaborate?

BQL does two things that can hurt throughput:
- limits the amount of data in the queue - can limit bandwidth
  if we now get queue underruns
- adds CPU overhead - can limit bandwidth if CPU bound

So checking the result of a streaming benchmark seems important.
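
For reference, the per-packet work in question is the pair of BQL hooks a
driver adds: netdev_tx_sent_queue() on transmit and
netdev_tx_completed_queue() on completion. The wrapper functions below are
only an illustrative sketch; just the netdev_tx_*() calls are the real API:

#include <linux/netdevice.h>

/* xmit path: account bytes handed to the device; this can also stop the
 * queue once the in-flight byte count reaches the current BQL limit,
 * which is where a streaming workload could see underruns.
 */
static void example_bql_on_xmit(struct netdev_queue *txq, unsigned int bytes)
{
	netdev_tx_sent_queue(txq, bytes);
}

/* completion path: report what the device finished; this updates the
 * dynamic limit and may restart the queue.
 */
static void example_bql_on_complete(struct netdev_queue *txq,
				    unsigned int pkts, unsigned int bytes)
{
	netdev_tx_completed_queue(txq, pkts, bytes);
}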


-- 
MST




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-16 Thread Michael S. Tsirkin
On Thu, May 16, 2024 at 12:54:58PM +0200, Jiri Pirko wrote:
> Thu, May 16, 2024 at 06:48:38AM CEST, jasow...@redhat.com wrote:
> >On Wed, May 15, 2024 at 8:54 PM Jiri Pirko  wrote:
> >>
> >> Wed, May 15, 2024 at 12:12:51PM CEST, j...@resnulli.us wrote:
> >> >Wed, May 15, 2024 at 10:20:04AM CEST, m...@redhat.com wrote:
> >> >>On Wed, May 15, 2024 at 09:34:08AM +0200, Jiri Pirko wrote:
> >> >>> Fri, May 10, 2024 at 01:27:08PM CEST, m...@redhat.com wrote:
> >> >>> >On Fri, May 10, 2024 at 01:11:49PM +0200, Jiri Pirko wrote:
> >> >>> >> Fri, May 10, 2024 at 12:52:52PM CEST, m...@redhat.com wrote:
> >> >>> >> >On Fri, May 10, 2024 at 12:37:15PM +0200, Jiri Pirko wrote:
> >> >>> >> >> Thu, May 09, 2024 at 04:28:12PM CEST, m...@redhat.com wrote:
> >> >>> >> >> >On Thu, May 09, 2024 at 03:31:56PM +0200, Jiri Pirko wrote:
> >> >>> >> >> >> Thu, May 09, 2024 at 02:41:39PM CEST, m...@redhat.com wrote:
> >> >>> >> >> >> >On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> >> >>> >> >> >> >> From: Jiri Pirko 
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> Add support for Byte Queue Limits (BQL).
> >> >>> >> >> >> >>
> >> >>> >> >> >> >> Signed-off-by: Jiri Pirko 
> >> >>> >> >> >> >
> >> >>> >> >> >> >Can we get more detail on the benefits you observe etc?
> >> >>> >> >> >> >Thanks!
> >> >>> >> >> >>
> >> >>> >> >> >> More info about the BQL in general is here:
> >> >>> >> >> >> https://lwn.net/Articles/469652/
> >> >>> >> >> >
> >> >>> >> >> >I know about BQL in general. We discussed BQL for virtio in the 
> >> >>> >> >> >past
> >> >>> >> >> >mostly I got the feedback from net core maintainers that it 
> >> >>> >> >> >likely won't
> >> >>> >> >> >benefit virtio.
> >> >>> >> >>
> >> >>> >> >> Do you have some link to that, or is it this thread:
> >> >>> >> >> https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> >> >>> >> >
> >> >>> >> >
> >> >>> >> >A quick search on lore turned up this, for example:
> >> >>> >> >https://lore.kernel.org/all/a11eee78-b2a1-3dbc-4821-b5f4bfaae...@gmail.com/
> >> >>> >>
> >> >>> >> Says:
> >> >>> >> "Note that NIC with many TX queues make BQL almost useless, only 
> >> >>> >> adding extra
> >> >>> >>  overhead."
> >> >>> >>
> >> >>> >> But virtio can have one tx queue, I guess that could be quite common
> >> >>> >> configuration in lot of deployments.
> >> >>> >
> >> >>> >Not sure we should worry about performance for these though.
> >> >>> >What I am saying is this should come with some benchmarking
> >> >>> >results.
> >> >>>
> >> >>> I did some measurements with VDPA, backed by ConnectX6dx NIC, single
> >> >>> queue pair:
> >> >>>
> >> >>> super_netperf 200 -H $ip -l 45 -t TCP_STREAM &
> >> >>> nice -n 20 netperf -H $ip -l 10 -t TCP_RR
> >> >>>
> >> >>> RR result with no bql:
> >> >>> 29.95
> >> >>> 32.74
> >> >>> 28.77
> >> >>>
> >> >>> RR result with bql:
> >> >>> 222.98
> >> >>> 159.81
> >> >>> 197.88
> >> >>>
> >> >>
> >>Okay. And on the other hand, any measurable degradation with
> >> >>multiqueue and when testing throughput?
> >> >
> >> >With multiqueue it depends if the flows hits the same queue or not. If
> >> >they do, the same results will likely be shown.
> >>
> >> RR 1q, w/o bql:
> >> 29.95
> >> 32.74
> >> 28.77
> >>
> >> RR 1q, with bql:
> >> 222.98
> >> 159.81
> >> 197.88
> >>
> >> RR 4q, w/o bql:
> >> 355.82
> >> 364.58
> >> 233.47
> >>
> >> RR 4q, with bql:
> >> 371.19
> >> 255.93
> >> 337.77
> >>
> >> So answer to your question is: "no measurable degradation with 4
> >> queues".
> >
> >Thanks but I think we also need benchmarks in cases other than vDPA.
> >For example, a simple virtualization setup.
> 
> For virtualization setup, I get this:
> 
> VIRT RR 1q, w/o bql:
> 49.18
> 49.75
> 50.07
> 
> VIRT RR 1q, with bql:
> 51.33
> 47.88
> 40.40
> 
> No measurable/significant difference.

Seems the results became much noisier? Also,
I'd expect a regression, if any, to be in a streaming benchmark.

-- 
MST




Re: [PATCH v3] virtio_net: Fix missed rtnl_unlock

2024-05-15 Thread Michael S. Tsirkin
On Wed, May 15, 2024 at 11:31:25AM -0500, Daniel Jurgens wrote:
> The rtnl_lock would stay locked if allocating promisc_allmulti failed.
> Also changed the allocation to GFP_KERNEL.
> 
> Fixes: ff7c7d9f5261 ("virtio_net: Remove command data from control_buf")
> Reported-by: Eric Dumazet 
> Link: 
> https://lore.kernel.org/netdev/cann89ilazvaucvhpm6rpjj0owra_ofnx7fhc8d60gv-65ad...@mail.gmail.com/
> Signed-off-by: Daniel Jurgens 

Acked-by: Michael S. Tsirkin 

net tree I presume.

> ---
> v3:
>   - Changed the promisc_allmulti alloc to GFP_KERNEL
> v2:
>   - Added fixes tag.
> ---
>  drivers/net/virtio_net.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 19a9b50646c7..4e1a0fc0d555 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2902,14 +2902,14 @@ static void virtnet_rx_mode_work(struct work_struct 
> *work)
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>   return;
>  
> - rtnl_lock();
> -
> - promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_ATOMIC);
> + promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_KERNEL);
>   if (!promisc_allmulti) {
>   dev_warn(&dev->dev, "Failed to set RX mode, no memory.\n");
>   return;
>   }
>  
> + rtnl_lock();
> +
>   *promisc_allmulti = !!(dev->flags & IFF_PROMISC);
>   sg_init_one(sg, promisc_allmulti, sizeof(*promisc_allmulti));
>  
> -- 
> 2.45.0




Re: [PATCH v3] virtio_net: Fix missed rtnl_unlock

2024-05-15 Thread Michael S. Tsirkin
On Wed, May 15, 2024 at 11:31:25AM -0500, Daniel Jurgens wrote:
> The rtnl_lock would stay locked if allocating promisc_allmulti failed.
> Also changed the allocation to GFP_KERNEL.
> 
> Fixes: ff7c7d9f5261 ("virtio_net: Remove command data from control_buf")
> Reported-by: Eric Dumazet 
> Link: 
> https://lore.kernel.org/netdev/cann89ilazvaucvhpm6rpjj0owra_ofnx7fhc8d60gv-65ad...@mail.gmail.com/
> Signed-off-by: Daniel Jurgens 


Acked-by: Michael S. Tsirkin 

> ---
> v3:
>   - Changed the promisc_allmulti alloc to GFP_KERNEL
> v2:
>   - Added fixes tag.
> ---
>  drivers/net/virtio_net.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 19a9b50646c7..4e1a0fc0d555 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2902,14 +2902,14 @@ static void virtnet_rx_mode_work(struct work_struct 
> *work)
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>   return;
>  
> - rtnl_lock();
> -
> - promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_ATOMIC);
> + promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_KERNEL);
>   if (!promisc_allmulti) {
>   dev_warn(&dev->dev, "Failed to set RX mode, no memory.\n");
>   return;
>   }
>  
> + rtnl_lock();
> +
>   *promisc_allmulti = !!(dev->flags & IFF_PROMISC);
>   sg_init_one(sg, promisc_allmulti, sizeof(*promisc_allmulti));
>  
> -- 
> 2.45.0




Re: [PATCH] virtio_net: Fix missed rtnl_unlock

2024-05-15 Thread Michael S. Tsirkin
On Wed, May 15, 2024 at 09:31:20AM -0500, Daniel Jurgens wrote:
> The rtnl_lock would stay locked if allocating promisc_allmulti failed.
> 
> Reported-by: Eric Dumazet 
> Link: 
> https://lore.kernel.org/netdev/cann89ilazvaucvhpm6rpjj0owra_ofnx7fhc8d60gv-65ad...@mail.gmail.com/
> Signed-off-by: Daniel Jurgens 


Fixes: tag?

> ---
>  drivers/net/virtio_net.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 19a9b50646c7..e2b7488f375e 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2902,14 +2902,14 @@ static void virtnet_rx_mode_work(struct work_struct 
> *work)
>   if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_CTRL_RX))
>   return;
>  
> - rtnl_lock();
> -
>   promisc_allmulti = kzalloc(sizeof(*promisc_allmulti), GFP_ATOMIC);
>   if (!promisc_allmulti) {
>   dev_warn(&dev->dev, "Failed to set RX mode, no memory.\n");
>   return;
>   }
>  
> + rtnl_lock();
> +
>   *promisc_allmulti = !!(dev->flags & IFF_PROMISC);
>   sg_init_one(sg, promisc_allmulti, sizeof(*promisc_allmulti));
>  
> -- 
> 2.45.0




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-15 Thread Michael S. Tsirkin
On Wed, May 15, 2024 at 09:34:08AM +0200, Jiri Pirko wrote:
> Fri, May 10, 2024 at 01:27:08PM CEST, m...@redhat.com wrote:
> >On Fri, May 10, 2024 at 01:11:49PM +0200, Jiri Pirko wrote:
> >> Fri, May 10, 2024 at 12:52:52PM CEST, m...@redhat.com wrote:
> >> >On Fri, May 10, 2024 at 12:37:15PM +0200, Jiri Pirko wrote:
> >> >> Thu, May 09, 2024 at 04:28:12PM CEST, m...@redhat.com wrote:
> >> >> >On Thu, May 09, 2024 at 03:31:56PM +0200, Jiri Pirko wrote:
> >> >> >> Thu, May 09, 2024 at 02:41:39PM CEST, m...@redhat.com wrote:
> >> >> >> >On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> >> >> >> >> From: Jiri Pirko 
> >> >> >> >> 
> >> >> >> >> Add support for Byte Queue Limits (BQL).
> >> >> >> >> 
> >> >> >> >> Signed-off-by: Jiri Pirko 
> >> >> >> >
> >> >> >> >Can we get more detail on the benefits you observe etc?
> >> >> >> >Thanks!
> >> >> >> 
> >> >> >> More info about the BQL in general is here:
> >> >> >> https://lwn.net/Articles/469652/
> >> >> >
> >> >> >I know about BQL in general. We discussed BQL for virtio in the past
> >> >> >mostly I got the feedback from net core maintainers that it likely 
> >> >> >won't
> >> >> >benefit virtio.
> >> >> 
> >> >> Do you have some link to that, or is it this thread:
> >> >> https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> >> >
> >> >
> >> >A quick search on lore turned up this, for example:
> >> >https://lore.kernel.org/all/a11eee78-b2a1-3dbc-4821-b5f4bfaae...@gmail.com/
> >> 
> >> Says:
> >> "Note that NIC with many TX queues make BQL almost useless, only adding 
> >> extra
> >>  overhead."
> >> 
> >> But virtio can have one tx queue, I guess that could be quite common
> >> configuration in lot of deployments.
> >
> >Not sure we should worry about performance for these though.
> >What I am saying is this should come with some benchmarking
> >results.
> 
> I did some measurements with VDPA, backed by ConnectX6dx NIC, single
> queue pair:
> 
> super_netperf 200 -H $ip -l 45 -t TCP_STREAM &
> nice -n 20 netperf -H $ip -l 10 -t TCP_RR
> 
> RR result with no bql:
> 29.95
> 32.74
> 28.77
> 
> RR result with bql:
> 222.98
> 159.81
> 197.88
> 

Okay. And on the other hand, any measurable degradation with
multiqueue and when testing throughput?


> 
> >
> >
> >> 
> >> >
> >> >
> >> >
> >> >
> >> >> I don't see why virtio should be any different from other
> >> >> drivers/devices that benefit from bql. HOL blocking is the same here are
> >> >> everywhere.
> >> >> 
> >> >> >
> >> >> >So I'm asking, what kind of benefit do you observe?
> >> >> 
> >> >> I don't have measurements at hand, will attach them to v2.
> >> >> 
> >> >> Thanks!
> >> >> 
> >> >> >
> >> >> >-- 
> >> >> >MST
> >> >> >
> >> >
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-10 Thread Michael S. Tsirkin
On Fri, May 10, 2024 at 01:11:49PM +0200, Jiri Pirko wrote:
> Fri, May 10, 2024 at 12:52:52PM CEST, m...@redhat.com wrote:
> >On Fri, May 10, 2024 at 12:37:15PM +0200, Jiri Pirko wrote:
> >> Thu, May 09, 2024 at 04:28:12PM CEST, m...@redhat.com wrote:
> >> >On Thu, May 09, 2024 at 03:31:56PM +0200, Jiri Pirko wrote:
> >> >> Thu, May 09, 2024 at 02:41:39PM CEST, m...@redhat.com wrote:
> >> >> >On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> >> >> >> From: Jiri Pirko 
> >> >> >> 
> >> >> >> Add support for Byte Queue Limits (BQL).
> >> >> >> 
> >> >> >> Signed-off-by: Jiri Pirko 
> >> >> >
> >> >> >Can we get more detail on the benefits you observe etc?
> >> >> >Thanks!
> >> >> 
> >> >> More info about the BQL in general is here:
> >> >> https://lwn.net/Articles/469652/
> >> >
> >> >I know about BQL in general. We discussed BQL for virtio in the past
> >> >mostly I got the feedback from net core maintainers that it likely won't
> >> >benefit virtio.
> >> 
> >> Do you have some link to that, or is it this thread:
> >> https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/
> >
> >
> >A quick search on lore turned up this, for example:
> >https://lore.kernel.org/all/a11eee78-b2a1-3dbc-4821-b5f4bfaae...@gmail.com/
> 
> Says:
> "Note that NIC with many TX queues make BQL almost useless, only adding extra
>  overhead."
> 
> But virtio can have one tx queue, I guess that could be quite common
> configuration in lot of deployments.

Not sure we should worry about performance for these though.
What I am saying is this should come with some benchmarking
results.


> 
> >
> >
> >
> >
> >> I don't see why virtio should be any different from other
> >> drivers/devices that benefit from bql. HOL blocking is the same here are
> >> everywhere.
> >> 
> >> >
> >> >So I'm asking, what kind of benefit do you observe?
> >> 
> >> I don't have measurements at hand, will attach them to v2.
> >> 
> >> Thanks!
> >> 
> >> >
> >> >-- 
> >> >MST
> >> >
> >




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-10 Thread Michael S. Tsirkin
On Fri, May 10, 2024 at 12:37:15PM +0200, Jiri Pirko wrote:
> Thu, May 09, 2024 at 04:28:12PM CEST, m...@redhat.com wrote:
> >On Thu, May 09, 2024 at 03:31:56PM +0200, Jiri Pirko wrote:
> >> Thu, May 09, 2024 at 02:41:39PM CEST, m...@redhat.com wrote:
> >> >On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> >> >> From: Jiri Pirko 
> >> >> 
> >> >> Add support for Byte Queue Limits (BQL).
> >> >> 
> >> >> Signed-off-by: Jiri Pirko 
> >> >
> >> >Can we get more detail on the benefits you observe etc?
> >> >Thanks!
> >> 
> >> More info about the BQL in general is here:
> >> https://lwn.net/Articles/469652/
> >
> >I know about BQL in general. We discussed BQL for virtio in the past
> >mostly I got the feedback from net core maintainers that it likely won't
> >benefit virtio.
> 
> Do you have some link to that, or is it this thread:
> https://lore.kernel.org/netdev/21384cb5-99a6-7431-1039-b356521e1...@redhat.com/


A quick search on lore turned up this, for example:
https://lore.kernel.org/all/a11eee78-b2a1-3dbc-4821-b5f4bfaae...@gmail.com/




> I don't see why virtio should be any different from other
> drivers/devices that benefit from bql. HOL blocking is the same here as
> everywhere.
> 
> >
> >So I'm asking, what kind of benefit do you observe?
> 
> I don't have measurements at hand, will attach them to v2.
> 
> Thanks!
> 
> >
> >-- 
> >MST
> >




Re: [PATCH] virtio_net: Fix memory leak in virtnet_rx_mod_work

2024-05-10 Thread Michael S. Tsirkin
On Thu, May 09, 2024 at 01:36:34PM -0500, Daniel Jurgens wrote:
> The pointer declaration was missing the __free(kfree).
> 
> Fixes: ff7c7d9f5261 ("virtio_net: Remove command data from control_buf")
> Reported-by: Jens Axboe 
> Closes: 
> https://lore.kernel.org/netdev/0674ca1b-020f-4f93-94d0-104964566...@kernel.dk/
> Signed-off-by: Daniel Jurgens 

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index df6121c38a1b..42da535913ed 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -2884,7 +2884,6 @@ static int virtnet_set_queues(struct virtnet_info *vi, 
> u16 queue_pairs)
>  
>  static int virtnet_close(struct net_device *dev)
>  {
> - u8 *promisc_allmulti  __free(kfree) = NULL;
>   struct virtnet_info *vi = netdev_priv(dev);
>   int i;
>  
> @@ -2905,11 +2904,11 @@ static void virtnet_rx_mode_work(struct work_struct 
> *work)
>  {
>   struct virtnet_info *vi =
>   container_of(work, struct virtnet_info, rx_mode_work);
> + u8 *promisc_allmulti  __free(kfree) = NULL;
>   struct net_device *dev = vi->dev;
>   struct scatterlist sg[2];
>   struct virtio_net_ctrl_mac *mac_data;
>   struct netdev_hw_addr *ha;
> - u8 *promisc_allmulti;
>   int uc_count;
>   int mc_count;
>   void *buf;
> -- 
> 2.45.0
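
For context, the fix relies on the kernel's scope-based cleanup helpers: a
pointer declared with __free(kfree) is handed to kfree() automatically when it
goes out of scope, so the annotated declaration has to live in the function
that actually owns the allocation. A minimal, self-contained sketch of the
pattern (illustrative only, not the exact virtio_net code):

	#include <linux/cleanup.h>
	#include <linux/slab.h>

	static int example_alloc_and_use(void)
	{
		/* freed automatically when the variable goes out of scope */
		u8 *buf __free(kfree) = NULL;

		buf = kzalloc(16, GFP_KERNEL);
		if (!buf)
			return -ENOMEM;

		/* use buf; no explicit kfree() is needed on any return path */
		return 0;
	}

The leak came from the plain, unannotated pointer in virtnet_rx_mode_work()
while the annotated declaration sat unused in virtnet_close(); moving the
__free(kfree) declaration to the function doing the allocation restores the
automatic free.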




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-09 Thread Michael S. Tsirkin
On Thu, May 09, 2024 at 03:31:56PM +0200, Jiri Pirko wrote:
> Thu, May 09, 2024 at 02:41:39PM CEST, m...@redhat.com wrote:
> >On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> >> From: Jiri Pirko 
> >> 
> >> Add support for Byte Queue Limits (BQL).
> >> 
> >> Signed-off-by: Jiri Pirko 
> >
> >Can we get more detail on the benefits you observe etc?
> >Thanks!
> 
> More info about the BQL in general is here:
> https://lwn.net/Articles/469652/

I know about BQL in general. We discussed BQL for virtio in the past;
mostly I got the feedback from net core maintainers that it likely won't
benefit virtio.

So I'm asking, what kind of benefit do you observe?

-- 
MST




Re: [patch net-next] virtio_net: add support for Byte Queue Limits

2024-05-09 Thread Michael S. Tsirkin
On Thu, May 09, 2024 at 01:46:15PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> Add support for Byte Queue Limits (BQL).
> 
> Signed-off-by: Jiri Pirko 

Can we get more detail on the benefits you observe etc?
Thanks!

> ---
>  drivers/net/virtio_net.c | 33 -
>  1 file changed, 20 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 218a446c4c27..c53d6dc6d332 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -84,7 +84,9 @@ struct virtnet_stat_desc {
>  
>  struct virtnet_sq_free_stats {
>   u64 packets;
> + u64 xdp_packets;
>   u64 bytes;
> + u64 xdp_bytes;
>  };
>  
>  struct virtnet_sq_stats {
> @@ -512,19 +514,19 @@ static void __free_old_xmit(struct send_queue *sq, bool 
> in_napi,
>   void *ptr;
>  
>   while ((ptr = virtqueue_get_buf(sq->vq, &len)) != NULL) {
> - ++stats->packets;
> -
>   if (!is_xdp_frame(ptr)) {
>   struct sk_buff *skb = ptr;
>  
>   pr_debug("Sent skb %p\n", skb);
>  
> + stats->packets++;
>   stats->bytes += skb->len;
>   napi_consume_skb(skb, in_napi);
>   } else {
>   struct xdp_frame *frame = ptr_to_xdp(ptr);
>  
> - stats->bytes += xdp_get_frame_len(frame);
> + stats->xdp_packets++;
> + stats->xdp_bytes += xdp_get_frame_len(frame);
>   xdp_return_frame(frame);
>   }
>   }
> @@ -965,7 +967,8 @@ static void virtnet_rq_unmap_free_buf(struct virtqueue 
> *vq, void *buf)
>   virtnet_rq_free_buf(vi, rq, buf);
>  }
>  
> -static void free_old_xmit(struct send_queue *sq, bool in_napi)
> +static void free_old_xmit(struct send_queue *sq, struct netdev_queue *txq,
> +   bool in_napi)
>  {
>   struct virtnet_sq_free_stats stats = {0};
>  
> @@ -974,9 +977,11 @@ static void free_old_xmit(struct send_queue *sq, bool 
> in_napi)
>   /* Avoid overhead when no packets have been processed
>* happens when called speculatively from start_xmit.
>*/
> - if (!stats.packets)
> + if (!stats.packets && !stats.xdp_packets)
>   return;
>  
> + netdev_tx_completed_queue(txq, stats.packets, stats.bytes);
> +
>   u64_stats_update_begin(&sq->stats.syncp);
>   u64_stats_add(&sq->stats.bytes, stats.bytes);
>   u64_stats_add(&sq->stats.packets, stats.packets);
> @@ -1013,13 +1018,15 @@ static void check_sq_full_and_disable(struct 
> virtnet_info *vi,
>* early means 16 slots are typically wasted.
>*/
>   if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
> - netif_stop_subqueue(dev, qnum);
> + struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> +
> + netif_tx_stop_queue(txq);
>   if (use_napi) {
>   if (unlikely(!virtqueue_enable_cb_delayed(sq->vq)))
>   virtqueue_napi_schedule(&sq->napi, sq->vq);
>   } else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
>   /* More just got used, free them then recheck. */
> - free_old_xmit(sq, false);
> + free_old_xmit(sq, txq, false);
>   if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {
>   netif_start_subqueue(dev, qnum);
>   virtqueue_disable_cb(sq->vq);
> @@ -2319,7 +2326,7 @@ static void virtnet_poll_cleantx(struct receive_queue 
> *rq)
>  
>   do {
>   virtqueue_disable_cb(sq->vq);
> - free_old_xmit(sq, true);
> + free_old_xmit(sq, txq, true);
>   } while (unlikely(!virtqueue_enable_cb_delayed(sq->vq)));
>  
>   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
> @@ -2471,7 +2478,7 @@ static int virtnet_poll_tx(struct napi_struct *napi, 
> int budget)
>   txq = netdev_get_tx_queue(vi->dev, index);
>   __netif_tx_lock(txq, raw_smp_processor_id());
>   virtqueue_disable_cb(sq->vq);
> - free_old_xmit(sq, true);
> + free_old_xmit(sq, txq, true);
>  
>   if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS)
>   netif_tx_wake_queue(txq);
> @@ -2553,7 +2560,7 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>   struct send_queue *sq = &vi->sq[qnum];
>   int err;
>   struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> - bool kick = !netdev_xmit_more();
> + bool xmit_more = netdev_xmit_more();
>   bool use_napi = sq->napi.weight;
>  
>   /* Free up any pending old buffers before queueing new ones. */
> @@ -2561,9 +2568,9 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, 
> struct net_device *dev)
>   if (use_napi)
>   virtqueue_disa
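
For context, the completion side shown above feeds BQL through
netdev_tx_completed_queue(); BQL only works when that is paired with
netdev_tx_sent_queue() on the transmit path (the start_xmit() hunk is cut off
above, so the pairing there is an assumption). A minimal sketch of the standard
accounting pattern using the core netdev helpers (txq, pkts and bytes are
placeholders, not the exact virtio_net changes):

	#include <linux/netdevice.h>
	#include <linux/skbuff.h>

	/* xmit path: account bytes handed to the device before kicking it */
	static void example_tx_account(struct netdev_queue *txq, struct sk_buff *skb)
	{
		netdev_tx_sent_queue(txq, skb->len);
	}

	/* completion path: report what was reclaimed so BQL can size the queue */
	static void example_tx_complete(struct netdev_queue *txq,
					unsigned int pkts, unsigned int bytes)
	{
		netdev_tx_completed_queue(txq, pkts, bytes);
	}

	/* queue teardown/reset: drop the accumulated BQL state */
	static void example_tx_reset(struct netdev_queue *txq)
	{
		netdev_tx_reset_queue(txq);
	}

With these in place the stack limits how many bytes may sit in the tx ring,
which is the HOL-blocking argument made elsewhere in this thread.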

Re: [patch net-next v4 0/6] selftests: virtio_net: introduce initial testing infrastructure

2024-04-22 Thread Michael S. Tsirkin
On Thu, Apr 18, 2024 at 06:08:24PM +0200, Jiri Pirko wrote:
> From: Jiri Pirko 
> 
> This patchset aims at introducing very basic initial infrastructure
> for virtio_net testing, namely it focuses on virtio feature testing.
> 
> The first patch adds support for debugfs for virtio devices, allowing
> the user to filter features in order to pretend to be a driver that is
> not capable of the filtered feature.


virtio things:

Acked-by: Michael S. Tsirkin 

> Example:
> $ cat /sys/bus/virtio/devices/virtio0/features
> 111001010101110010001000
> $ echo "5" >/sys/kernel/debug/virtio/virtio0/filter_feature_add
> $ cat /sys/kernel/debug/virtio/virtio0/filter_features
> 5
> $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/unbind
> $ echo "virtio0" > /sys/bus/virtio/drivers/virtio_net/bind
> $ cat /sys/bus/virtio/devices/virtio0/features
> 11110101110010001000
> 
> Leverage that in the last patch that lays ground for virtio_net
> selftests testing, including very basic F_MAC feature test.
> 
> To run this, do:
> $ make -C tools/testing/selftests/ TARGETS=drivers/net/virtio_net/ run_tests
> 
> It is assumed, as with a lot of other selftests in the net group,
> that there are netdevices connected back-to-back. In this case,
> two virtio_net devices connected back to back. If you use "tap" qemu
> netdevice type, to configure this loop on a hypervisor, one may use
> this script:
> #!/bin/bash
> 
> DEV1="$1"
> DEV2="$2"
> 
> sudo tc qdisc add dev $DEV1 clsact
> sudo tc qdisc add dev $DEV2 clsact
> sudo tc filter add dev $DEV1 ingress protocol all pref 1 matchall action 
> mirred egress redirect dev $DEV2
> sudo tc filter add dev $DEV2 ingress protocol all pref 1 matchall action 
> mirred egress redirect dev $DEV1
> sudo ip link set $DEV1 up
> sudo ip link set $DEV2 up
> 
> Another possibility is to use virtme-ng like this:
> $ vng --network=loop
> or directly:
> $ vng --network=loop -- make -C tools/testing/selftests/ 
> TARGETS=drivers/net/virtio_net/ run_tests
> 
> "loop" network type will take care of creating two "hubport" qemu netdevs
> putting them into a single hub.
> 
> To do it manually with qemu, pass following command line options:
> -nic hubport,hubid=1,id=nd0,model=virtio-net-pci
> -nic hubport,hubid=1,id=nd1,model=virtio-net-pci
> 
> ---
> v3->v4:
> - addressed comments from Petr and Benjamin, more or less cosmetical
>   issues. See individual patches changelog for details.
> - extended cover letter by vng usage
> v2->v3:
> - added forgotten kdoc entry in patch #1.
> v1->v2:
> - addressed comments from Jakub and Benjamin, see individual
>   patches #3, #5 and #6 for details.
> 
> Jiri Pirko (6):
>   virtio: add debugfs infrastructure to allow to debug virtio features
>   selftests: forwarding: move initial root check to the beginning
>   selftests: forwarding: add ability to assemble NETIFS array by driver
> name
>   selftests: forwarding: add check_driver() helper
>   selftests: forwarding: add wait_for_dev() helper
>   selftests: virtio_net: add initial tests
> 
>  MAINTAINERS   |   1 +
>  drivers/virtio/Kconfig|   9 ++
>  drivers/virtio/Makefile   |   1 +
>  drivers/virtio/virtio.c   |   8 ++
>  drivers/virtio/virtio_debug.c | 109 +++
>  include/linux/virtio.h|  35 +
>  tools/testing/selftests/Makefile  |   1 +
>  .../selftests/drivers/net/virtio_net/Makefile |  15 ++
>  .../drivers/net/virtio_net/basic_features.sh  | 131 ++
>  .../selftests/drivers/net/virtio_net/config   |   2 +
>  .../net/virtio_net/virtio_net_common.sh   |  99 +
>  tools/testing/selftests/net/forwarding/lib.sh |  70 +-
>  12 files changed, 477 insertions(+), 4 deletions(-)
>  create mode 100644 drivers/virtio/virtio_debug.c
>  create mode 100644 tools/testing/selftests/drivers/net/virtio_net/Makefile
>  create mode 100755 
> tools/testing/selftests/drivers/net/virtio_net/basic_features.sh
>  create mode 100644 tools/testing/selftests/drivers/net/virtio_net/config
>  create mode 100644 
> tools/testing/selftests/drivers/net/virtio_net/virtio_net_common.sh
> 
> -- 
> 2.44.0




Re: [PATCH RESEND net-next v7 0/4] ethtool: provide the dim profile fine-tuning channel

2024-04-22 Thread Michael S. Tsirkin
On Mon, Apr 15, 2024 at 09:38:03PM +0800, Heng Qi wrote:
> The NetDIM library provides excellent acceleration for many modern
> network cards. However, the default profiles of DIM limit its maximum
> capabilities on different NICs, so providing a way in which the NIC can
> be custom configured is necessary.
> 
> Currently, interaction with the driver is still based on the commonly
> used "ethtool -C".
> 
> Since the profile now exists in netdevice, adding a function similar
> to net_dim_get_rx_moderation_dev() with netdevice as argument is
> nice, but this would be better along with cleaning up the rest of
> the drivers, which we can get to very soon after this set.
> 
> Please review, thank you very much!


Acked-by: Michael S. Tsirkin 

> Changelog
> =
> v6->v7:
>   - A new wrapper struct pointer is used in struct net_device.
>   - Add IS_ENABLED(CONFIG_DIMLIB) to avoid compiler warnings.
>   - Profile fields changed from u16 to u32.
> 
> v5->v6:
>   - Place the profile in netdevice to bypass the driver.
> The interaction code of ethtool <-> kernel has not changed at all,
> only the interaction part of kernel <-> driver has changed.
> 
> v4->v5:
>   - Update some snippets from Kuba, Thanks.
> 
> v3->v4:
>   - Some tiny updates and patch 1 only add a new comment.
> 
> v2->v3:
>   - Break up the attributes to avoid the use of raw c structs.
>   - Use per-device profile instead of global profile in the driver.
> 
> v1->v2:
>   - Use ethtool tool instead of net-sysfs
> 
> Heng Qi (4):
>   linux/dim: move useful macros to .h file
>   ethtool: provide customized dim profile management
>   virtio-net: refactor dim initialization/destruction
>   virtio-net: support dim profile fine-tuning
> 
> Heng Qi (4):
>   linux/dim: move useful macros to .h file
>   ethtool: provide customized dim profile management
>   virtio-net: refactor dim initialization/destruction
>   virtio-net: support dim profile fine-tuning
> 
>  Documentation/netlink/specs/ethtool.yaml |  33 +++
>  Documentation/networking/ethtool-netlink.rst |   8 +
>  drivers/net/virtio_net.c |  46 +++--
>  include/linux/dim.h  |  13 ++
>  include/linux/ethtool.h  |  11 +-
>  include/linux/netdevice.h|  24 +++
>  include/uapi/linux/ethtool_netlink.h |  24 +++
>  lib/dim/net_dim.c|  10 +-
>  net/core/dev.c   |  83 
>  net/ethtool/coalesce.c   | 201 ++-
>  10 files changed, 430 insertions(+), 23 deletions(-)
> 
> -- 
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next v5 0/9] virtio-net: support device stats

2024-04-22 Thread Michael S. Tsirkin
On Mon, Mar 18, 2024 at 07:05:53PM +0800, Xuan Zhuo wrote:
> As the spec:
> 
> https://github.com/oasis-tcs/virtio-spec/commit/42f389989823039724f95bbbd243291ab0064f82
> 
> The virtio net supports to get device stats.
> 
> Please review.

series:

Acked-by: Michael S. Tsirkin 

I think you can now repost for net-next.


> Thanks.
> 
> v5:
> 1. Fix some small problems in last version
> 2. Not report stats that will be reported by netlink
> 3. remove "_queue" from  ethtool -S
> 
> v4:
> 1. Support per-queue statistics API
> 2. Fix some small problems in last version
> 
> v3:
> 1. rebase net-next
> 
> v2:
> 1. fix the usage of the leXX_to_cpu()
> 2. add comment to the structure virtnet_stats_map
> 
> v1:
> 1. fix some definitions of the marco and the struct
> 
> 
> 
> 
> 
> 
> Xuan Zhuo (9):
>   virtio_net: introduce device stats feature and structures
>   virtio_net: virtnet_send_command supports command-specific-result
>   virtio_net: remove "_queue" from ethtool -S
>   virtio_net: support device stats
>   virtio_net: stats map include driver stats
>   virtio_net: add the total stats field
>   virtio_net: rename stat tx_timeout to timeout
>   netdev: add queue stats
>   virtio-net: support queue stat
> 
>  Documentation/netlink/specs/netdev.yaml | 104 
>  drivers/net/virtio_net.c| 755 +---
>  include/net/netdev_queues.h |  27 +
>  include/uapi/linux/netdev.h |  19 +
>  include/uapi/linux/virtio_net.h | 143 +
>  net/core/netdev-genl.c  |  23 +-
>  tools/include/uapi/linux/netdev.h   |  19 +
>  7 files changed, 1013 insertions(+), 77 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next 2/4] virtio_net: Remove command data from control_buf

2024-03-28 Thread Michael S. Tsirkin
On Thu, Mar 28, 2024 at 01:35:16PM +, Simon Horman wrote:
> On Mon, Mar 25, 2024 at 04:49:09PM -0500, Daniel Jurgens wrote:
> > Allocate memory for the data when it's used. Ideally the could be on the
> > stack, but we can't DMA stack memory. With this change only the header
> > and status memory are shared between commands, which will allow using a
> > tighter lock than RTNL.
> > 
> > Signed-off-by: Daniel Jurgens 
> > Reviewed-by: Jiri Pirko 
> 
> ...
> 
> > @@ -3893,10 +3925,16 @@ static int virtnet_restore_up(struct virtio_device 
> > *vdev)
> >  
> >  static int virtnet_set_guest_offloads(struct virtnet_info *vi, u64 
> > offloads)
> >  {
> > +   u64 *_offloads __free(kfree) = NULL;
> > struct scatterlist sg;
> > -   vi->ctrl->offloads = cpu_to_virtio64(vi->vdev, offloads);
> >  
> > -   sg_init_one(&sg, &vi->ctrl->offloads, sizeof(vi->ctrl->offloads));
> > +   _offloads = kzalloc(sizeof(*_offloads), GFP_KERNEL);
> > +   if (!_offloads)
> > +   return -ENOMEM;
> > +
> > +   *_offloads = cpu_to_virtio64(vi->vdev, offloads);
> 
> Hi Daniel,
> 
> There is a type mismatch between *_offloads and cpu_to_virtio64
> which is flagged by Sparse as follows:
> 
>  .../virtio_net.c:3978:20: warning: incorrect type in assignment (different 
> base types)
>  .../virtio_net.c:3978:20:expected unsigned long long [usertype]
>  .../virtio_net.c:3978:20:got restricted __virtio64
> 
> I think this can be addressed by changing the type of *_offloads to
> __virtio64 *.


Yes pls, endian-ness is easier to get right 1st time than fix
afterwards.

> > +
> > +   sg_init_one(&sg, _offloads, sizeof(*_offloads));
> >  
> > if (!virtnet_send_command(vi, VIRTIO_NET_CTRL_GUEST_OFFLOADS,
> >   VIRTIO_NET_CTRL_GUEST_OFFLOADS_SET, &sg)) {
> > -- 
> > 2.42.0
> > 
> > 
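
For reference, the warning comes from assigning the __virtio64 result of
cpu_to_virtio64() to a plain u64; the bitwise annotation exists precisely so
Sparse can catch endianness mix-ups. A sketch of the corrected snippet along
the lines Simon suggests (same variable names as the patch above):

	__virtio64 *_offloads __free(kfree) = NULL;
	struct scatterlist sg;

	_offloads = kzalloc(sizeof(*_offloads), GFP_KERNEL);
	if (!_offloads)
		return -ENOMEM;

	/* stored in device (virtio) endianness; the type now matches */
	*_offloads = cpu_to_virtio64(vi->vdev, offloads);

	sg_init_one(&sg, _offloads, sizeof(*_offloads));

This keeps the on-the-wire representation explicit and silences the
"different base types" warning without any cast.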




Re: [PATCH vhost v4 02/10] virtio_ring: packed: remove double check of the unmap ops

2024-03-26 Thread Michael S. Tsirkin
On Thu, Mar 21, 2024 at 04:20:09PM +0800, Xuan Zhuo wrote:
> On Thu, 21 Mar 2024 13:57:06 +0800, Jason Wang  wrote:
> > On Tue, Mar 12, 2024 at 11:36 AM Xuan Zhuo  
> > wrote:
> > >
> > > In the functions vring_unmap_extra_packed and vring_unmap_desc_packed,
> > > multiple checks are made whether unmap is performed and whether it is
> > > INDIRECT.
> > >
> > > These two functions are usually called in a loop, and we should put the
> > > check outside the loop.
> > >
> > > And we unmap the descs with VRING_DESC_F_INDIRECT on the same path as
> > > other descs, which makes things more complex. If we distinguish the
> > > descs with VRING_DESC_F_INDIRECT before unmap, things will be clearer.
> > >
> > > 1. only one desc of the desc table is used, we do not need the loop
> > > 2. the called unmap api is difference from the other desc
> > > 3. the vq->premapped is not needed to check
> > > 4. the vq->indirect is not needed to check
> > > 5. the state->indir_desc must not be null
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > ---
> > >  drivers/virtio/virtio_ring.c | 78 ++--
> > >  1 file changed, 40 insertions(+), 38 deletions(-)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index c2779e34aac7..0dfbd17e5a87 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -1214,6 +1214,7 @@ static u16 packed_last_used(u16 last_used_idx)
> > > return last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR));
> > >  }
> > >
> > > +/* caller must check vring_need_unmap_buffer() */
> > >  static void vring_unmap_extra_packed(const struct vring_virtqueue *vq,
> > >  const struct vring_desc_extra *extra)
> > >  {
> > > @@ -1221,33 +1222,18 @@ static void vring_unmap_extra_packed(const struct 
> > > vring_virtqueue *vq,
> > >
> > > flags = extra->flags;
> > >
> > > -   if (flags & VRING_DESC_F_INDIRECT) {
> > > -   if (!vq->use_dma_api)
> > > -   return;
> > > -
> > > -   dma_unmap_single(vring_dma_dev(vq),
> > > -extra->addr, extra->len,
> > > -(flags & VRING_DESC_F_WRITE) ?
> > > -DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > > -   } else {
> > > -   if (!vring_need_unmap_buffer(vq))
> > > -   return;
> > > -
> > > -   dma_unmap_page(vring_dma_dev(vq),
> > > -  extra->addr, extra->len,
> > > -  (flags & VRING_DESC_F_WRITE) ?
> > > -  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > > -   }
> > > +   dma_unmap_page(vring_dma_dev(vq),
> > > +  extra->addr, extra->len,
> > > +  (flags & VRING_DESC_F_WRITE) ?
> > > +  DMA_FROM_DEVICE : DMA_TO_DEVICE);
> > >  }
> > >
> > > +/* caller must check vring_need_unmap_buffer() */
> > >  static void vring_unmap_desc_packed(const struct vring_virtqueue *vq,
> > > const struct vring_packed_desc *desc)
> > >  {
> > > u16 flags;
> > >
> > > -   if (!vring_need_unmap_buffer(vq))
> > > -   return;
> > > -
> > > flags = le16_to_cpu(desc->flags);
> > >
> > > dma_unmap_page(vring_dma_dev(vq),
> > > @@ -1323,7 +1309,7 @@ static int virtqueue_add_indirect_packed(struct 
> > > vring_virtqueue *vq,
> > > total_sg * sizeof(struct vring_packed_desc),
> > > DMA_TO_DEVICE);
> > > if (vring_mapping_error(vq, addr)) {
> > > -   if (vq->premapped)
> > > +   if (!vring_need_unmap_buffer(vq))
> > > goto free_desc;
> > >
> > > goto unmap_release;
> > > @@ -1338,10 +1324,11 @@ static int virtqueue_add_indirect_packed(struct 
> > > vring_virtqueue *vq,
> > > vq->packed.desc_extra[id].addr = addr;
> > > vq->packed.desc_extra[id].len = total_sg *
> > > sizeof(struct vring_packed_desc);
> > > -   vq->packed.desc_extra[id].flags = VRING_DESC_F_INDIRECT |
> > > - 
> > > vq->packed.avail_used_flags;
> > > }
> > >
> > > +   vq->packed.desc_extra[id].flags = VRING_DESC_F_INDIRECT |
> > > +   vq->packed.avail_used_flags;
> > > +
> > > /*
> > >  * A driver MUST NOT make the first descriptor in the list
> > >  * available before all subsequent descriptors comprising
> > > @@ -1382,6 +1369,8 @@ static int virtqueue_add_indirect_packed(struct 
> > > vring_virtqueue *vq,
> > >  unmap_release:
> > > err_idx = i;
> > >
> > > +   WARN_ON(!vring_need_unmap_buffer(vq));
> > > +
> > > for (i = 0; i < err_idx; i++)
> > > vring_unmap_desc_packed(vq, &des

Re: [PATCH vhost v4 00/10] virtio: drivers maintain dma info for premapped vq

2024-03-18 Thread Michael S. Tsirkin
On Tue, Mar 12, 2024 at 11:35:47AM +0800, Xuan Zhuo wrote:
> As discussed:
> 
> http://lore.kernel.org/all/cacgkmevq0no8qgc46u4mgsmtud44fd_cflcpavmj3rhyqrz...@mail.gmail.com
> 
> If the virtio is premapped mode, the driver should manage the dma info by 
> self.
> So the virtio core should not store the dma info. We can release the memory 
> used
> to store the dma info.
> 
> For the virtio-net xmit queue, if virtio-net maintains the dma info,
> it must allocate too much memory (19 * queue_size per queue), so
> we do not plan to make virtio-net maintain the dma info by default. The
> virtio-net xmit queue only maintains the dma info when premapped mode is
> enabled (such as when AF_XDP is enabled).

This landed when the merge window was already open, so I'm deferring it
to the next merge window, just to be safe. Jason, can you review please?

> So this patch set tries to do:
> 
> 1. make the virtio core not store the dma info
> - But if the desc_extra has no dma info, we face a new question:
>   it is hard to get the dma info of a desc with the indirect flag.
>   For split mode, that is easy from the desc, but for packed mode,
>   it is hard to get the dma info from the desc. And to harden
>   the dma unmap safely, we should store the dma info of the indirect
>   descs when the virtio core does not store the buffer dma info.
> 
>   So I introduce the "structure the indirect desc table" to
>   allocate space to store dma info of the desc table.
> 
> +struct vring_split_desc_indir {
> +   dma_addr_t addr;/* Descriptor Array DMA addr. 
> */
> +   u32 len;/* Descriptor Array length. */
> +   u32 num;
> +   struct vring_desc desc[];
> +};
> 
>   The follow patches to this:
>  * virtio_ring: packed: structure the indirect desc table
>  * virtio_ring: split: structure the indirect desc table
> 
> - On the other side, in the unmap handling, we mix the indirect descs with
>   other descs. That makes things too complex. I found that if we distinguish
>   the descs with VRING_DESC_F_INDIRECT before unmap, things will be
>   clearer.
> 
>   The follow patches do this.
>  * virtio_ring: packed: remove double check of the unmap ops
>  * virtio_ring: split: structure the indirect desc table
> 
> 2. make the virtio core to enable premapped mode by find_vqs() params
> - Because the find_vqs() will try to allocate memory for the dma info.
>   If we set the premapped mode after find_vqs() and release the
>   dma info, that is odd.
> 
> 
> Please review.
> 
> Thanks
> 
> v4:
> 1. virtio-net xmit queue does not enable premapped mode by default
> 
> v3:
> 1. fix the conflict with the vp_modern_create_avq().
> 
> v2:
> 1. change the dma item of virtio-net, every item have MAX_SKB_FRAGS + 2 
> addr + len pairs.
> 2. introduce virtnet_sq_free_stats for __free_old_xmit
> 
> v1:
> 1. rename transport_vq_config to vq_transport_config
> 2. virtio-net set dma meta number to (ring-size + 1)(MAX_SKB_FRGAS +2)
> 3. introduce virtqueue_dma_map_sg_attrs
> 4. separate vring_create_virtqueue to an independent commit
> 
> Xuan Zhuo (10):
>   virtio_ring: introduce vring_need_unmap_buffer
>   virtio_ring: packed: remove double check of the unmap ops
>   virtio_ring: packed: structure the indirect desc table
>   virtio_ring: split: remove double check of the unmap ops
>   virtio_ring: split: structure the indirect desc table
>   virtio_ring: no store dma info when unmap is not needed
>   virtio: find_vqs: add new parameter premapped
>   virtio_ring: export premapped to driver by struct virtqueue
>   virtio_net: set premapped mode by find_vqs()
>   virtio_ring: virtqueue_set_dma_premapped support disable
> 
>  drivers/net/virtio_net.c  |  57 +++--
>  drivers/virtio/virtio_ring.c  | 436 +-
>  include/linux/virtio.h|   3 +-
>  include/linux/virtio_config.h |  17 +-
>  4 files changed, 307 insertions(+), 206 deletions(-)
> 
> --
> 2.32.0.3.g01195cf9f




Re: [PATCH net-next 0/5] virtio-net: sq support premapped mode

2024-02-22 Thread Michael S. Tsirkin
On Tue, Jan 16, 2024 at 03:59:19PM +0800, Xuan Zhuo wrote:
> This is the second part of virtio-net support AF_XDP zero copy.

My understanding is, there's going to be another version of all
this work?

-- 
MST




Re: [PATCH net-next 1/5] virtio_ring: introduce virtqueue_get_buf_ctx_dma()

2024-02-22 Thread Michael S. Tsirkin
On Tue, Jan 16, 2024 at 03:59:20PM +0800, Xuan Zhuo wrote:
> Introduce virtqueue_get_buf_ctx_dma() to collect the dma info when
> getting a buf from the virtio core in premapped mode.
> 
> If the virtio queue is in premapped mode, the virtio-net send buf may
> have many descs. Every desc dma address needs to be unmapped. So here we
> introduce a new helper to collect the dma addresses of the buffer from
> the virtio core.
> 
> Because BAD_RING() is called (which may set vq->broken), the
> "const" qualifier of vq is removed.
> 
> Signed-off-by: Xuan Zhuo 
> ---
>  drivers/virtio/virtio_ring.c | 174 +--
>  include/linux/virtio.h   |  16 
>  2 files changed, 142 insertions(+), 48 deletions(-)
> 
> diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> index 49299b1f9ec7..82f72428605b 100644
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -362,6 +362,45 @@ static struct device *vring_dma_dev(const struct 
> vring_virtqueue *vq)
>   return vq->dma_dev;
>  }
>  
> +/*
> + * use_dma_api premapped -> do_unmap
> + *  1. false   falsefalse
> + *  2. truefalsetrue
> + *  3. truetrue false
> + *
> + * Only #3, we should return the DMA info to the driver.

no idea what this table is supposed to mean.
Instead of this, just add comments near each
piece of code explaining it.
E.g. something like (guessing at the reason, please replace
with the real one):

/* if premapping is not supported, no need to unmap */
if (!vq->premapped)
return false;

and so on


> + * Return:
> + * true: the virtio core must unmap the desc
> + * false: the virtio core skip the desc unmap
> + */
> +static bool vring_need_unmap(struct vring_virtqueue *vq,
> +  struct virtio_dma_head *dma,
> +  dma_addr_t addr, unsigned int length)
> +{
> + if (vq->do_unmap)
> + return true;
> +
> + if (!vq->premapped)
> + return false;
> +
> + if (!dma)
> + return false;
> +
> + if (unlikely(dma->next >= dma->num)) {
> + BAD_RING(vq, "premapped vq: collect dma overflow: %pad %u\n",
> +  &addr, length);
> + return false;
> + }
> +
> + dma->items[dma->next].addr = addr;
> + dma->items[dma->next].length = length;
> +
> + ++dma->next;
> +
> + return false;
> +}
> +
>  /* Map one sg entry. */
>  static int vring_map_one_sg(const struct vring_virtqueue *vq, struct 
> scatterlist *sg,
>   enum dma_data_direction direction, dma_addr_t *addr)
> @@ -440,12 +479,14 @@ static void virtqueue_init(struct vring_virtqueue *vq, 
> u32 num)
>   * Split ring specific functions - *_split().
>   */
>  
> -static void vring_unmap_one_split_indirect(const struct vring_virtqueue *vq,
> -const struct vring_desc *desc)
> +static void vring_unmap_one_split_indirect(struct vring_virtqueue *vq,
> +const struct vring_desc *desc,
> +struct virtio_dma_head *dma)
>  {
>   u16 flags;
>  
> - if (!vq->do_unmap)
> + if (!vring_need_unmap(vq, dma, virtio64_to_cpu(vq->vq.vdev, desc->addr),
> +   virtio32_to_cpu(vq->vq.vdev, desc->len)))
>   return;
>  
>   flags = virtio16_to_cpu(vq->vq.vdev, desc->flags);
> @@ -457,8 +498,8 @@ static void vring_unmap_one_split_indirect(const struct 
> vring_virtqueue *vq,
>  DMA_FROM_DEVICE : DMA_TO_DEVICE);
>  }
>  
> -static unsigned int vring_unmap_one_split(const struct vring_virtqueue *vq,
> -   unsigned int i)
> +static unsigned int vring_unmap_one_split(struct vring_virtqueue *vq,
> +   unsigned int i, struct 
> virtio_dma_head *dma)
>  {
>   struct vring_desc_extra *extra = vq->split.desc_extra;
>   u16 flags;
> @@ -474,17 +515,16 @@ static unsigned int vring_unmap_one_split(const struct 
> vring_virtqueue *vq,
>extra[i].len,
>(flags & VRING_DESC_F_WRITE) ?
>DMA_FROM_DEVICE : DMA_TO_DEVICE);
> - } else {
> - if (!vq->do_unmap)
> - goto out;
> -
> - dma_unmap_page(vring_dma_dev(vq),
> -extra[i].addr,
> -extra[i].len,
> -(flags & VRING_DESC_F_WRITE) ?
> -DMA_FROM_DEVICE : DMA_TO_DEVICE);
> + goto out;
>   }
>  
> + if (!vring_need_unmap(vq, dma, extra[i].addr, extra[i].len))
> + goto out;
> +
> + dma_unmap_page(vring_dma_dev(vq), extra[i].addr, extra[i].len,
> +(flags & VRING_DESC_F_WRITE) ?
> +DMA_FROM_DEVICE : DMA_TO_DEVICE);
>

Re: [PATCH net-next] virtio_net: Add TX stop and wake counters

2024-02-20 Thread Michael S. Tsirkin
On Tue, Feb 20, 2024 at 06:02:46PM +, Dan Jurgens wrote:
> > From: Daniel Jurgens 
> > Sent: Wednesday, February 7, 2024 2:59 PM
> > To: Michael S. Tsirkin 
> > Cc: Jason Wang ; Jakub Kicinski
> > ; Jason Xing ;
> > netdev@vger.kernel.org; xuanz...@linux.alibaba.com;
> > virtualizat...@lists.linux.dev; da...@davemloft.net;
> > eduma...@google.com; ab...@redhat.com; Parav Pandit
> > 
> > Subject: RE: [PATCH net-next] virtio_net: Add TX stop and wake counters
> > 
> > 
> > > From: Michael S. Tsirkin 
> > > Sent: Wednesday, February 7, 2024 2:19 PM
> > > To: Daniel Jurgens 
> > > Subject: Re: [PATCH net-next] virtio_net: Add TX stop and wake
> > > counters
> > >
> > > On Wed, Feb 07, 2024 at 07:38:16PM +, Daniel Jurgens wrote:
> > > > > From: Michael S. Tsirkin 
> > > > > Sent: Sunday, February 4, 2024 6:40 AM
> > > > > To: Jason Wang 
> > > > > Cc: Jakub Kicinski ; Jason Xing
> > > > > ; Daniel Jurgens ;
> > > > > netdev@vger.kernel.org; xuanz...@linux.alibaba.com;
> > > > > virtualizat...@lists.linux.dev; da...@davemloft.net;
> > > > > eduma...@google.com; ab...@redhat.com; Parav Pandit
> > > > > 
> > > > > Subject: Re: [PATCH net-next] virtio_net: Add TX stop and wake
> > > > > counters
> > > > >
> > > > > On Sun, Feb 04, 2024 at 09:20:18AM +0800, Jason Wang wrote:
> > > > > > On Sat, Feb 3, 2024 at 12:01 AM Jakub Kicinski 
> > > wrote:
> > > > > > >
> > > > > > > On Fri, 2 Feb 2024 14:52:59 +0800 Jason Xing wrote:
> > > > > > > > > Can you say more? I'm curious what's your use case.
> > > > > > > >
> > > > > > > > I'm not working at Nvidia, so my point of view may differ
> > > > > > > > from
> > > theirs.
> > > > > > > > From what I can tell is that those two counters help me
> > > > > > > > narrow down the range if I have to diagnose/debug some issues.
> > > > > > >
> > > > > > > right, i'm asking to collect useful debugging tricks, nothing
> > > > > > > against the patch itself :)
> > > > > > >
> > > > > > > > 1) I sometimes notice that if some irq is held too long
> > > > > > > > (say, one simple case: output of printk printed to the
> > > > > > > > console), those two counters can reflect the issue.
> > > > > > > > 2) Similarly in virtio net, recently I traced such counters
> > > > > > > > the current kernel does not have and it turned out that one
> > > > > > > > of the output queues in the backend behaves badly.
> > > > > > > > ...
> > > > > > > >
> > > > > > > > Stop/wake queue counters may not show directly the root
> > > > > > > > cause of the issue, but help us 'guess' to some extent.
> > > > > > >
> > > > > > > I'm surprised you say you can detect stall-related issues with 
> > > > > > > this.
> > > > > > > I guess virtio doesn't have BQL support, which makes it special.
> > > > > >
> > > > > > Yes, virtio-net has a legacy orphan mode, this is something that
> > > > > > needs to be dropped in the future. This would make BQL much more
> > > > > > easier to be implemented.
> > > > >
> > > > >
> > > > > It's not that we can't implement BQL, it's that it does not seem
> > > > > to be benefitial - has been discussed many times.
> > > > >
> > > > > > > Normal HW drivers with BQL almost never stop the queue by
> > > themselves.
> > > > > > > I mean - if they do, and BQL is active, then the system is
> > > > > > > probably misconfigured (queue is too short). This is what we
> > > > > > > use at Meta to detect stalls in drivers with BQL:
> > > > > > >
> > > > > > > https://lore.kernel.org/all/20240131102150.728960-3-leitao@deb
> > > > > > > ia
> > > > > > > n.or
> > > > > > > g/
> > > > > > >
> > > > > > > Daniel, I think this may be a good enough excuse to add
> > > > > > > per-queue stats to the netdev genl family, if you're up for
> > > > > > > that. LMK if you want more info, otherwise I guess ethtool -S
> > > > > > > is fine
> > > for now.
> > > > > > >
> > > > > >
> > > > > > Thanks
> > > >
> > > > Michael,
> > > > Are you OK with this patch? Unless I missed it I didn't see a
> > > > response
> > > from you in our conversation the day I sent it.
> > > >
> > >
> > > I thought what is proposed is adding some support for these stats to core?
> > > Did I misunderstand?
> > >
> > 
> > That's a much bigger change, and going that route I think we'd still need to
> > count them at the driver level. I said I could potentially take that on as a
> > background
> > project. But would prefer to go with ethtool -S for now.
> 
> Michael, are you a NACK on this? Jakub seemed OK with it, Jason also thinks 
> it's useful, and it's low risk. 


Not too bad ... Jakub can you confirm though?

> > 
> > > --
> > > MST
> 
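
For context, the counters being discussed are plain per-queue statistics bumped
wherever the driver stops or wakes its tx queue, exported via ethtool -S. A
minimal sketch of the pattern (illustrative; the stats field names "stop" and
"wake" are assumptions, not the exact patch):

	/* tx ring is (nearly) full: stop the queue and count it */
	if (sq->vq->num_free < 2 + MAX_SKB_FRAGS) {
		netif_stop_subqueue(dev, qnum);
		u64_stats_update_begin(&sq->stats.syncp);
		u64_stats_inc(&sq->stats.stop);
		u64_stats_update_end(&sq->stats.syncp);
	}

	/* completions made room again: wake the queue and count it */
	if (sq->vq->num_free >= 2 + MAX_SKB_FRAGS) {
		netif_tx_wake_queue(txq);
		u64_stats_update_begin(&sq->stats.syncp);
		u64_stats_inc(&sq->stats.wake);
		u64_stats_update_end(&sq->stats.syncp);
	}

Counting at these two points is what lets the rate and ratio of stops to wakes
hint at backend stalls, as described in the thread.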




Re: [PATCH net-next] virtio_net: Add TX stop and wake counters

2024-02-07 Thread Michael S. Tsirkin
On Mon, Feb 05, 2024 at 09:45:38AM +0800, Jason Wang wrote:
> On Sun, Feb 4, 2024 at 8:39 PM Michael S. Tsirkin  wrote:
> >
> > On Sun, Feb 04, 2024 at 09:20:18AM +0800, Jason Wang wrote:
> > > On Sat, Feb 3, 2024 at 12:01 AM Jakub Kicinski  wrote:
> > > >
> > > > On Fri, 2 Feb 2024 14:52:59 +0800 Jason Xing wrote:
> > > > > > Can you say more? I'm curious what's your use case.
> > > > >
> > > > > I'm not working at Nvidia, so my point of view may differ from theirs.
> > > > > From what I can tell is that those two counters help me narrow down
> > > > > the range if I have to diagnose/debug some issues.
> > > >
> > > > right, i'm asking to collect useful debugging tricks, nothing against
> > > > the patch itself :)
> > > >
> > > > > 1) I sometimes notice that if some irq is held too long (say, one
> > > > > simple case: output of printk printed to the console), those two
> > > > > counters can reflect the issue.
> > > > > 2) Similarly in virtio net, recently I traced such counters the
> > > > > current kernel does not have and it turned out that one of the output
> > > > > queues in the backend behaves badly.
> > > > > ...
> > > > >
> > > > > Stop/wake queue counters may not show directly the root cause of the
> > > > > issue, but help us 'guess' to some extent.
> > > >
> > > > I'm surprised you say you can detect stall-related issues with this.
> > > > I guess virtio doesn't have BQL support, which makes it special.
> > >
> > > Yes, virtio-net has a legacy orphan mode, this is something that needs
> > > to be dropped in the future. This would make BQL much easier to
> > > implement.
> >
> >
> > It's not that we can't implement BQL,
> 
> Well, I don't say we can't, I say it's not easy as we need to deal
> with the switching between two modes[1]. If we just have one mode like
> TX interrupt, we don't need to care about that.
> 
> > it's that it does not seem to
> > be beneficial - has been discussed many times.
> 
> Virtio doesn't differ from other NIC too much, for example gve supports bql.
> 
> 1) There's no numbers in [1]
> 2) We only benchmark vhost-net but not others, for example, vhost-user
> and hardware implementations
> 3) We don't have interrupt coalescing in 2018 but now we have with DIM

Only works well with hardware virtio cards though.

> Thanks
> 
> [1] https://lore.kernel.org/netdev/20181205225323.12555-1-...@redhat.com/
> 

So fundamentally, someone needs to show benefit and no serious
regressions to add BQL at this point. I doubt it's easily practical.
Hacks like the ones wireless does to boost buffer sizes might be necessary.

> >
> > > > Normal HW drivers with BQL almost never stop the queue by themselves.
> > > > I mean - if they do, and BQL is active, then the system is probably
> > > > misconfigured (queue is too short). This is what we use at Meta to
> > > > detect stalls in drivers with BQL:
> > > >
> > > > https://lore.kernel.org/all/20240131102150.728960-3-lei...@debian.org/
> > > >
> > > > Daniel, I think this may be a good enough excuse to add per-queue stats
> > > > to the netdev genl family, if you're up for that. LMK if you want more
> > > > info, otherwise I guess ethtool -S is fine for now.
> > > >
> > >
> > > Thanks
> >






Re: [PATCH net-next] virtio_net: Add TX stop and wake counters

2024-02-07 Thread Michael S. Tsirkin
On Wed, Feb 07, 2024 at 07:38:16PM +, Daniel Jurgens wrote:
> > From: Michael S. Tsirkin 
> > Sent: Sunday, February 4, 2024 6:40 AM
> > To: Jason Wang 
> > Cc: Jakub Kicinski ; Jason Xing
> > ; Daniel Jurgens ;
> > netdev@vger.kernel.org; xuanz...@linux.alibaba.com;
> > virtualizat...@lists.linux.dev; da...@davemloft.net;
> > eduma...@google.com; ab...@redhat.com; Parav Pandit
> > 
> > Subject: Re: [PATCH net-next] virtio_net: Add TX stop and wake counters
> > 
> > On Sun, Feb 04, 2024 at 09:20:18AM +0800, Jason Wang wrote:
> > > On Sat, Feb 3, 2024 at 12:01 AM Jakub Kicinski  wrote:
> > > >
> > > > On Fri, 2 Feb 2024 14:52:59 +0800 Jason Xing wrote:
> > > > > > Can you say more? I'm curious what's your use case.
> > > > >
> > > > > I'm not working at Nvidia, so my point of view may differ from theirs.
> > > > > From what I can tell is that those two counters help me narrow
> > > > > down the range if I have to diagnose/debug some issues.
> > > >
> > > > right, i'm asking to collect useful debugging tricks, nothing
> > > > against the patch itself :)
> > > >
> > > > > 1) I sometimes notice that if some irq is held too long (say, one
> > > > > simple case: output of printk printed to the console), those two
> > > > > counters can reflect the issue.
> > > > > 2) Similarly in virtio net, recently I traced such counters the
> > > > > current kernel does not have and it turned out that one of the
> > > > > output queues in the backend behaves badly.
> > > > > ...
> > > > >
> > > > > Stop/wake queue counters may not show directly the root cause of
> > > > > the issue, but help us 'guess' to some extent.
> > > >
> > > > I'm surprised you say you can detect stall-related issues with this.
> > > > I guess virtio doesn't have BQL support, which makes it special.
> > >
> > > Yes, virtio-net has a legacy orphan mode, this is something that needs
> > > to be dropped in the future. This would make BQL much easier to
> > > implement.
> > 
> > 
> > It's not that we can't implement BQL, it's that it does not seem to be
> > beneficial - has been discussed many times.
> > 
> > > > Normal HW drivers with BQL almost never stop the queue by themselves.
> > > > I mean - if they do, and BQL is active, then the system is probably
> > > > misconfigured (queue is too short). This is what we use at Meta to
> > > > detect stalls in drivers with BQL:
> > > >
> > > > https://lore.kernel.org/all/20240131102150.728960-3-lei...@debian.or
> > > > g/
> > > >
> > > > Daniel, I think this may be a good enough excuse to add per-queue
> > > > stats to the netdev genl family, if you're up for that. LMK if you
> > > > want more info, otherwise I guess ethtool -S is fine for now.
> > > >
> > >
> > > Thanks
> 
> Michael,
>   Are you OK with this patch? Unless I missed it I didn't see a response 
> from you in our conversation the day I sent it.
> 

I thought what is proposed is adding some support for these stats to core?
Did I misunderstand?

-- 
MST



