Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Eric W. Biederman
michael.chris...@oracle.com writes:

> On 5/29/23 2:35 PM, Mike Christie wrote:
>>> Hmm... If we use CLONE_THREAD the exiting vhost_worker() will auto-reap
>>> itself,
>> Oh wait, are you saying that when we get auto-reaped then we would do the
>> last fput and call the file_operations->release function right? We
>> actually set task_struct->files = NULL for the vhost_task task_struct, so
>> I think we call release a little sooner than you think.
>> 
>> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task
>> task_struct that gets created works like kthreads where it doesn't do a
>> CLONE_FILES and it doesn't do a dup_fd.
>> 
>> So when we do de_thread() -> zap_other_threads(), that will kill all the
>> threads in the group right? So when they exit, it will call our release
>> function since we don't have a refcount on ourselves.
>> 
>
> Just to make sure I'm on the same page now.
>
> In the past thread, when we were discussing the patch below and you guys
> were saying that it doesn't really ignore SIGKILL because we will hit the
> SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit,
> that was on purpose.
>
> Instead of a signal to tell me when to exit, I was using the parent
> exiting and doing the last fput on the vhost device's file, which calls
> our release function. That then allowed the vhost code to use the
> vhost_task to handle the outstanding IOs and then do
> vhost_task_should_stop to tell the vhost_task to exit when the outstanding
> IO had completed.
>
> On the vhost side of things it's really nice, because all the shutdown
> paths work the same and we don't need rcu/locking in the main IO path to
> handle the vhost_task exiting while we are using it.

The code below does nothing for exec.

You really need to call get_signal to handle SIGSTOP/freeze/process exit.

A variant on my coredump proposal looks like it can handle exec as well.
It isn't pretty but it looks good enough for right now.

If you could test the patch I posted upthread, I think it is something
imperfect but good enough for now.

Eric


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Eric W. Biederman
michael.chris...@oracle.com writes:

> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>> On 05/27, Eric W. Biederman wrote:
>>>
>>> Looking forward I don't see not asking the worker threads to stop
>>> for the coredump right now causing any problems in the future.
>>> So I think we can use this to resolve the coredump issue I spotted.
>> 
>> But we have almost the same problem with exec.
>> 
>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>> .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below and
> clean up the task_struct. However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.
>
> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

Looking at the vhost code from before commit 6e890c5d5021 ("vhost: use
vhost_tasks for worker threads"), exec effectively breaks the connection
with the vhost device.  Aka the vhost device permanently keeps a
connection to the mm that opened it.

Frankly I think the vhost code will misbehave in fascinating ways if you
actually try and use it after exec.
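
To make the mm point concrete: the device pins the mm of the task that takes
ownership (the VHOST_SET_OWNER ioctl), so after exec the process is running on
a new mm that the device is not bound to.  A minimal sketch of that ownership
pattern, using hypothetical names rather than the actual vhost code:

#include <linux/errno.h>
#include <linux/sched.h>
#include <linux/sched/mm.h>	/* get_task_mm(), mmput() */

/* Hypothetical device structure, for illustration only. */
struct example_dev {
	struct mm_struct *mm;	/* mm of the task that took ownership */
};

/* Sketch of a SET_OWNER-style ioctl handler. */
static int example_set_owner(struct example_dev *dev)
{
	if (dev->mm)
		return -EBUSY;			/* already owned */

	dev->mm = get_task_mm(current);		/* pin the opener's mm */
	return dev->mm ? 0 : -EINVAL;
}

/* Sketch of the matching release path. */
static void example_reset_owner(struct example_dev *dev)
{
	if (dev->mm)
		mmput(dev->mm);
	dev->mm = NULL;
}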

> I would normally say it probably wants to use the vhost device still. However,
> I don't think this comes up so just not hanging might be ok.

Yes, we just need to not hang, and to fail as gracefully as possible after exec.

>> And even O_CLOEXEC won't help, do_close_on_exec() is called after 
>> de_thread().
>> 
>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a
> copy_process right? That type of thing is not needed. I can add a check in
> vhost_task_create for this so new code doesn't try to do it. I don't think
> it will come up that some code vhost is using will call
> kernel_thread/copy_process directly since those calls are so rare and the
> functions are not exported to modules.

Oleg was referring to the fact that during exec de_thread comes before
do_close_on_exec.

>> If we want CLONE_THREAD, I think vhost_worker() should exit after
>> get_signal() returns SIGKILL. Perhaps it should "disable"
>> vhost_work_queue() somehow and flush the pending works on ->work_list
>> before exit, I dunno. But imo it should not wait for the final fput().

In the long term I agree.  Exiting more or less immediately and leaving
the work to vhost_dev_flush looks like where the code needs to go.

For right now, just to get us something before the final -rc, I believe the
changes below with some testing are good enough.  We can handle exec just
like the coredump.  I have made those tests specific to the vhost case for
now so that there is no danger of triggering any other funnies.

If someone does not set O_CLOEXEC or performs file descriptor passing we
might get a weird thread with an old mm.

I have also added in the needed changes to change a few additional
PF_IO_WORKER tests to PF_USER_WORKER.

Mike is there any chance you can test the change below?

From: "Eric W. Biederman" 
Date: Mon, 29 May 2023 19:13:59 -0500
Subject: [PATCH] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

When switching from kthreads to vhost_tasks two bugs were added:
1. The vhost worker tasks now show up as processes, so scripts doing
   ps or ps a would now incorrectly detect the vhost task as another
   process.
2. kthreads disabled freezing by setting PF_NOFREEZE, but vhost tasks
   didn't disable it or add support for the freezer.

To fix both bugs, this switches the vhost task to be a thread in the
process that does the VHOST_SET_OWNER ioctl, and has vhost_worker call
get_signal to support SIGKILL/SIGSTOP and freeze signals. Note that
SIGKILL/STOP support is required because CLONE_THREAD requires
CLONE_SIGHAND, which requires those 2 signals to be supported.

This is a modified version of the patch written by Mike Christie, which was
a modified version of a patch originally written by Linus.

Much of what depended upon PF_IO_WORKER now depends on PF_USER_WORKER,
including ignoring signals, setting up the register state, and having
get_signal return instead of calling do_group_exit.

Tidied up the vhost_task abstraction so that the definition of
vhost_task only needs to be visible inside of vhost_task.c.  Making
it easier to review the code and tell what needs to be done where.
As part of this the main loop has been moved from vhost_worker into
vhost_task_fn.  vhost_worker now returns true if work was done.
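
For orientation, a sketch of what such a narrowed interface could look like
(not the actual header; the names follow the vhost_task_* calls discussed in
this thread, and vhost_task_start plus the bool worker callback are
assumptions based on the description above):

struct vhost_task;		/* definition stays private to vhost_task.c */

struct vhost_task *vhost_task_create(bool (*fn)(void *), void *arg,
				     const char *name);
void vhost_task_start(struct vhost_task *vtsk);
void vhost_task_stop(struct vhost_task *vtsk);
bool vhost_task_should_stop(struct vhost_task *vtsk);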

The main loop has been updated to call get_signal which handles
SIGSTOP, freezing, and collects the message that tells the thread to
exit as part of process exit.  This collection clears
__fatal_signal_pending.  This collection is not guaranteed to
clear signal_pending() so clear that explicitly so the schedule()
sleeps.
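
As a reading aid, a rough sketch of the loop shape that paragraph describes
(hypothetical names, simplified; it is not the patch itself):

#include <linux/sched.h>
#include <linux/sched/signal.h>

struct example_worker { int dummy; };	/* stand-in worker context */

/* Stand-in for the work-running function; returns true if it ran work. */
static bool example_run_work(struct example_worker *worker)
{
	return false;
}

static int example_task_fn(void *data)
{
	struct example_worker *worker = data;
	bool exiting = false;

	for (;;) {
		bool did_work;

		if (!exiting && signal_pending(current)) {
			struct ksignal ksig;

			/* Handles SIGSTOP and the freezer, and collects the
			 * notification that tells the thread to exit as part
			 * of process exit. */
			exiting = get_signal(&ksig);
			/* get_signal() clears __fatal_signal_pending() but is
			 * not guaranteed to clear signal_pending(), so clear
			 * it explicitly so that schedule() below can sleep. */
			if (exiting)
				clear_thread_flag(TIF_SIGPENDING);
		}

		did_work = example_run_work(worker);
		if (!did_work) {
			set_current_state(TASK_INTERRUPTIBLE);
			schedule();
			__set_current_state(TASK_RUNNING);
		}
	}

	return 0;	/* not reached in this sketch */
}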

For now the vhost thread continues to exist 

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-29 Thread Liang Chen
On Mon, May 29, 2023 at 5:55 PM Michael S. Tsirkin  wrote:
>
> On Mon, May 29, 2023 at 03:27:56PM +0800, Liang Chen wrote:
> > On Sun, May 28, 2023 at 2:20 PM Michael S. Tsirkin  wrote:
> > >
> > > On Fri, May 26, 2023 at 01:46:18PM +0800, Liang Chen wrote:
> > > > The implementation at the moment uses one page per packet in both the
> > > > normal and XDP path. In addition, introducing a module parameter to
> > > > enable or disable the usage of page pool (disabled by default).
> > > >
> > > > In single-core vm testing environments, it gives a modest performance
> > > > gain in the normal path.
> > > >   Upstream codebase: 47.5 Gbits/sec
> > > >   Upstream codebase + page_pool support: 50.2 Gbits/sec
> > > >
> > > > In multi-core vm testing environments, the most significant performance
> > > > gain is observed in XDP cpumap:
> > > >   Upstream codebase: 1.38 Gbits/sec
> > > >   Upstream codebase + page_pool support: 9.74 Gbits/sec
> > > >
> > > > With this foundation, we can further integrate page pool fragmentation
> > > > and DMA map/unmap support.
> > > >
> > > > Signed-off-by: Liang Chen 
> > >
> > > Why off by default?
> > > I am guessing it sometimes has performance costs too?
> > >
> > >
> > > What happens if we use page pool for big mode too?
> > > The less modes we have the better...
> > >
> > >
> >
> > Sure, now I believe it makes sense to enable it by default. When the
> > packet size is very small, it reduces the likelihood of skb
> > coalescing. But such cases are rare.
>
> small packets are rare? These workloads are easy to create actually.
> Pls try and include benchmark with small packet size.
>

Sure, Thanks!
> > The usage of page pool for big mode is being evaluated now. Thanks!
> >
> > > > ---
> > > >  drivers/net/virtio_net.c | 188 ++-
> > > >  1 file changed, 146 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index c5dca0d92e64..99c0ca0c1781 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -31,6 +31,9 @@ module_param(csum, bool, 0444);
> > > >  module_param(gso, bool, 0444);
> > > >  module_param(napi_tx, bool, 0644);
> > > >
> > > > +static bool page_pool_enabled;
> > > > +module_param(page_pool_enabled, bool, 0400);
> > > > +
> > > >  /* FIXME: MTU in config. */
> > > >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > > >  #define GOOD_COPY_LEN	128
> > > > @@ -159,6 +162,9 @@ struct receive_queue {
> > > >   /* Chain pages by the private ptr. */
> > > >   struct page *pages;
> > > >
> > > > + /* Page pool */
> > > > + struct page_pool *page_pool;
> > > > +
> > > >   /* Average packet length for mergeable receive buffers. */
> > > >   struct ewma_pkt_len mrg_avg_pkt_len;
> > > >
> > > > @@ -459,6 +465,14 @@ static struct sk_buff *virtnet_build_skb(void *buf, unsigned int buflen,
> > > >   return skb;
> > > >  }
> > > >
> > > > +static void virtnet_put_page(struct receive_queue *rq, struct page *page)
> > > > +{
> > > > + if (rq->page_pool)
> > > > + page_pool_put_full_page(rq->page_pool, page, true);
> > > > + else
> > > > + put_page(page);
> > > > +}
> > > > +
> > > >  /* Called from bottom half context */
> > > >  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > > >  struct receive_queue *rq,
> > > > @@ -555,7 +569,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > > >   hdr = skb_vnet_hdr(skb);
> > > >   memcpy(hdr, hdr_p, hdr_len);
> > > >   if (page_to_free)
> > > > - put_page(page_to_free);
> > > > + virtnet_put_page(rq, page_to_free);
> > > >
> > > >   return skb;
> > > >  }
> > > > @@ -802,7 +816,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> > > >   return ret;
> > > >  }
> > > >
> > > > -static void put_xdp_frags(struct xdp_buff *xdp)
> > > > +static void put_xdp_frags(struct xdp_buff *xdp, struct receive_queue *rq)
> > > >  {
> > > >   struct skb_shared_info *shinfo;
> > > >   struct page *xdp_page;
> > > > @@ -812,7 +826,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > >   shinfo = xdp_get_shared_info_from_buff(xdp);
> > > >   for (i = 0; i < shinfo->nr_frags; i++) {
> > > >   xdp_page = skb_frag_page(&shinfo->frags[i]);
> > > > - put_page(xdp_page);
> > > > + virtnet_put_page(rq, xdp_page);
> > > >   }
> > > >   }
> > > >  }
> > > > @@ -903,7 +917,11 @@ static struct page *xdp_linearize_page(struct receive_queue *rq,
> > > >   if (page_off + *len + tailroom > PAGE_SIZE)
> > > >   return NULL;
> > > >
> > > > - page = alloc_page(GFP_ATOMIC);
> > > > + if (rq->page_pool)
> > > > + page = 

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 2:35 PM, Mike Christie wrote:
>> Hmm... If we use CLONE_THREAD the exiting vhost_worker() will auto-reap
>> itself,
> Oh wait, are you saying that when we get auto-reaped then we would do the
> last fput and call the file_operations->release function right? We actually
> set task_struct->files = NULL for the vhost_task task_struct, so I think we
> call release a little sooner than you think.
> 
> vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task
> task_struct that gets created works like kthreads where it doesn't do a
> CLONE_FILES and it doesn't do a dup_fd.
> 
> So when we do de_thread() -> zap_other_threads(), that will kill all the
> threads in the group right? So when they exit, it will call our release
> function since we don't have a refcount on ourselves.
> 

Just to make sure I'm on the same page now.

In the past thread, when we were discussing the patch below and you guys were
saying that it doesn't really ignore SIGKILL because we will hit the
SIGNAL_GROUP_EXIT/group_exec_task checks and the parent is going to exit, that
was on purpose.

Instead of a signal to tell me when to exit, I was using the parent exiting
and doing the last fput on the vhost device's file which calls our release
function. That then allowed the vhost code to use the vhost_task to handle the
outstanding IOs and then do vhost_task_should_stop to tell the vhost_task to
exit when the outstanding IO had completed.

On the vhost side of things it's really nice, because all the shutdown paths
work the same and we don't need rcu/locking in the main IO path to handle the
vhost_task exiting while we are using it.


diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 537cbf9a2ade..e0f5ac90a228 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -29,7 +29,6 @@ struct kernel_clone_args {
u32 io_thread:1;
u32 user_worker:1;
u32 no_files:1;
-   u32 ignore_signals:1;
unsigned long stack;
unsigned long stack_size;
unsigned long tls;
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..fd2970b598b2 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -2336,8 +2336,15 @@ __latent_entropy struct task_struct *copy_process(
p->flags &= ~PF_KTHREAD;
if (args->kthread)
p->flags |= PF_KTHREAD;
-   if (args->user_worker)
+   if (args->user_worker) {
+   /*
+* User workers are similar to io_threads but they do not
+* support signals and cleanup is driven via another kernel
+* interface so even SIGKILL is blocked.
+*/
p->flags |= PF_USER_WORKER;
+   siginitsetinv(&p->blocked, 0);
+   }
if (args->io_thread) {
/*
 * Mark us an IO worker, and block any signal that isn't
@@ -2517,8 +2524,8 @@ __latent_entropy struct task_struct *copy_process(
if (retval)
goto bad_fork_cleanup_io;
 
-   if (args->ignore_signals)
-   ignore_signals(p);
+   if (args->user_worker)
+   p->flags |= PF_NOFREEZE;
 
stackleak_task_init(p);
 
@@ -2860,7 +2867,6 @@ struct task_struct *create_io_thread(int (*fn)(void *), void *arg, int node)
.fn = fn,
.fn_arg = arg,
.io_thread  = 1,
-   .user_worker= 1,
};
 
return copy_process(NULL, 0, node, &args);
diff --git a/kernel/signal.c b/kernel/signal.c
index 8f6330f0e9ca..bc7e26072437 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -995,6 +995,19 @@ static inline bool wants_signal(int sig, struct task_struct *p)
return task_curr(p) || !task_sigpending(p);
 }
 
+static void try_set_pending_sigkill(struct task_struct *t)
+{
+   /*
+* User workers don't support signals and their exit is driven through
+* their kernel layer, so by default block even SIGKILL.
+*/
+   if (sigismember(&t->blocked, SIGKILL))
+   return;
+
+   sigaddset(&t->pending.signal, SIGKILL);
+   signal_wake_up(t, 1);
+}
+
 static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
 {
struct signal_struct *signal = p->signal;
@@ -1055,8 +1068,7 @@ static void complete_signal(int sig, struct task_struct *p, enum pid_type type)
t = p;
do {
task_clear_jobctl_pending(t, JOBCTL_PENDING_MASK);
-   sigaddset(&t->pending.signal, SIGKILL);
-   signal_wake_up(t, 1);
+   try_set_pending_sigkill(t);
} while_each_thread(p, t);
return;
}
@@ -1373,8 +1385,7 @@ int zap_other_threads(struct task_struct *p)
/* Don't bother with already dead threads */
if (t->exit_state)

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Mike Christie
On 5/29/23 12:46 PM, Oleg Nesterov wrote:
> Mike, sorry, I don't understand your email.
> 
> Just in case, let me remind I know nothing about drivers/vhost/
> 

No problem. I get it. I don't know the signal/thread code so it's
one of those things where I'm also learning as I go.

> On 05/29, michael.chris...@oracle.com wrote:
>>
>> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
>>> On 05/27, Eric W. Biederman wrote:

 Looking forward I don't see not asking the worker threads to stop
 for the coredump right now causing any problems in the future.
 So I think we can use this to resolve the coredump issue I spotted.
>>>
>>> But we have almost the same problem with exec.
>>>
>>> Execing thread will wait for vhost_worker() while vhost_worker will wait for
>>> .release -> vhost_task_stop().
>>
>> For this type of case, what is the goal or correct behavior in the end?
>>
>> When get_signal returns true we can code things like you mention below
> 
> and you have mentioned in the next email that you have already coded something
> like this, so perhaps we can delay the further discussions until you send the
> new code?
> 

Ok. Let me post that. You guys and the vhost devs can argue about if it's
too ugly to merge :)


>> and
>> clean up the task_struct.
> 
> Hmm... If we use CLONE_THREAD the exiting vhost_worker() will auto-reap
> itself,

Oh wait, are you saying that when we get auto-reaped then we would do the last
fput and call the file_operations->release function right? We actually set
task_struct->files = NULL for the vhost_task task_struct, so I think we call
release a little sooner than you think.

vhost_task_create() sets kernel_clone_args->no_files, so the vhost_task
task_struct that gets created works like kthreads where it doesn't do a
CLONE_FILES and it doesn't do a dup_fd.

So when we do de_thread() -> zap_other_threads(), that will kill all the
threads in the group right? So when they exit, it will call our release
function since we don't have a refcount on ourselves.


> 
>> However, we now have a non-functioning vhost device
>> open and just sitting around taking up memory and it can't do any IO.
> 
> can't comment, see above.
> 
>> For this type of case, do we expect just not to crash/hang, or was this new
>> exec'd thread suppose to be able to use the vhost device?
> 
> I just tried to point out that (unless I missed something) there are more
> corner cases, not just coredump.

Ok. I see.

> 
>>> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>>
>> You mean the vhost_task's task/thread doing a function that does a
>> copy_process right?
> 
> I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
> userspace.

I think we were talking about different things. I was saying that when the
vhost layer does vhost_task_create() -> copy_process(), the
kernel_clone_args.fn is vhost_task_fn() -> vhost_worker(). I thought you were
concerned about vhost_worker() or some function it calls, calling
copy_process() with CLONE_FILES.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Mike Christie
On 5/29/23 12:54 PM, Oleg Nesterov wrote:
> On 05/29, Oleg Nesterov wrote:
>>
>> Mike, sorry, I don't understand your email.
>>
>> Just in case, let me remind I know nothing about drivers/vhost/
> 
> And... this is slightly off-topic but you didn't answer my previous
> question and I am just curious. Do you agree that, even if we forget
> about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
> it can race with vhost_work_queue() ?

You saw the reply where I wrote I was waiting for the vhost devs to
reply because it's their code, right? Just wanted to make sure you know
I'm not ignoring your mails. The info is very valuable to me since I don't
know the signal code.

- I think you are right about using a continue after schedule (see the
sketch below).
- The work fn is stable. It's set up once and never changes.
- I have no idea why we do the __set_current_state(TASK_RUNNING). As
far as I can tell the work functions do not change the task state and
that might have been a leftover from something else. However, the
vhost devs might know of some case.
- For the barrier use, I think you are right, but I couldn't trigger
an issue even when I hacked up different timings. So I was hoping the
vhost devs know of something else in there.
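
For readers without the vhost code open, a minimal, generic sketch of the
queue/worker handshake these points are about (hypothetical names, not the
vhost implementation).  The worker marks itself sleepy before re-checking the
list and goes back around the loop after schedule(), so every condition is
re-evaluated after a wakeup:

#include <linux/llist.h>
#include <linux/sched.h>

/* Hypothetical worker context, for illustration only. */
struct example_worker {
	struct llist_head	work_list;
	struct task_struct	*task;
};

/* Queue side: publish the work, then wake the worker. */
static void example_queue(struct example_worker *w, struct llist_node *work)
{
	llist_add(work, &w->work_list);	/* visible before the wakeup */
	wake_up_process(w->task);	/* pairs with the state change below */
}

/* Worker side: one iteration of the main loop. */
static void example_worker_iteration(struct example_worker *w)
{
	struct llist_node *node;

	/* Set the sleeping state *before* the emptiness check, so a wakeup
	 * that lands between the check and schedule() is not lost. */
	set_current_state(TASK_INTERRUPTIBLE);

	node = llist_del_all(&w->work_list);
	if (!node) {
		schedule();
		return;		/* i.e. "continue": re-check everything */
	}

	__set_current_state(TASK_RUNNING);
	/* ... run the work items hanging off @node ... */
}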




[PATCH v1] virtio-pci: Improve code style for including header files

2023-05-29 Thread Feng Liu via Virtualization
Fix code style related to including header files. Include header files
before declaring macro definitions to avoid conflicts.

Signed-off-by: Feng Liu 
Reviewed-by: Jiri Pirko 
---
 drivers/virtio/virtio_pci_modern.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
index d6bb68ba84e5..b21a489e0086 100644
--- a/drivers/virtio/virtio_pci_modern.c
+++ b/drivers/virtio/virtio_pci_modern.c
@@ -15,9 +15,10 @@
  */
 
 #include <linux/delay.h>
+#include "virtio_pci_common.h"
+
 #define VIRTIO_PCI_NO_LEGACY
 #define VIRTIO_RING_NO_LEGACY
-#include "virtio_pci_common.h"
 
 static u64 vp_get_features(struct virtio_device *vdev)
 {
-- 
2.37.1 (Apple Git-137.1)



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
On 05/29, Oleg Nesterov wrote:
>
> Mike, sorry, I don't understand your email.
>
> Just in case, let me remind I know nothing about drivers/vhost/

And... this is slightly off-topic but you didn't answer my previous
question and I am just curious. Do you agree that, even if we forget
about CLONE_THREAD/signals, vhost_worker() needs fixes anyway because
it can race with vhost_work_queue() ?

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
Mike, sorry, I don't understand your email.

Just in case, let me remind I know nothing about drivers/vhost/

On 05/29, michael.chris...@oracle.com wrote:
>
> On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> > On 05/27, Eric W. Biederman wrote:
> >>
> >> Looking forward I don't see not asking the worker threads to stop
> >> for the coredump right now causing any problems in the future.
> >> So I think we can use this to resolve the coredump issue I spotted.
> >
> > But we have almost the same problem with exec.
> >
> > Execing thread will wait for vhost_worker() while vhost_worker will wait for
> > .release -> vhost_task_stop().
>
> For this type of case, what is the goal or correct behavior in the end?
>
> When get_signal returns true we can code things like you mention below

and you have mentioned in the next email that you have already coded something
like this, so perhaps we can delay the further discussions until you send the
new code?

> and
> clean up the task_struct.

Hmm... If we use CLONE_THREAD the exiting vhost_worker() will auto-reap itself,

> However, we now have a non-functioning vhost device
> open and just sitting around taking up memory and it can't do any IO.

can't comment, see above.

> For this type of case, do we expect just not to crash/hang, or was this new
> exec'd thread suppose to be able to use the vhost device?

I just tried to point out that (unless I missed something) there are more
corner cases, not just coredump.

> > Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...
>
> You mean the vhost_task's task/thread doing a function that does a
> copy_process right?

I meant that the vhost_task's sub-thread can do sys_clone(CLONE_FILES) from
userspace. Yes, this implies copy_process() but I still can't understand you.

> That type of thing is not needed.

Do you mean that userspace should never do this? But this doesn't matter, the
kernel should handle this case anyway.

Or what?

In short let me repeat that I don't understand you and - of course! - quite
possibly I missed something.

Oleg.



Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().

Just an FYI, I coded this. It's pre-RFC quality. It's invasive. If we want to
go this route then we have to do a temp hack for 6.4, or revert and then do
this route for 6.5 or 6.6 (vhost devs will need time to review and we need
time to fully test).


Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread michael . christie
On 5/29/23 6:19 AM, Oleg Nesterov wrote:
> On 05/27, Eric W. Biederman wrote:
>>
>> Looking forward I don't see not asking the worker threads to stop
>> for the coredump right now causing any problems in the future.
>> So I think we can use this to resolve the coredump issue I spotted.
> 
> But we have almost the same problem with exec.
> 
> Execing thread will wait for vhost_worker() while vhost_worker will wait for
> .release -> vhost_task_stop().

For this type of case, what is the goal or correct behavior in the end?

When get_signal returns true we can code things like you mention below and
clean up the task_struct. However, we now have a non-functioning vhost device
open and just sitting around taking up memory and it can't do any IO.

For this type of case, do we expect just not to crash/hang, or was this new
exec'd thread suppose to be able to use the vhost device?

I would normally say it probably wants to use the vhost device still. However,
I don't think this comes up, so just not hanging might be ok. Before 6.4-rc1
we ignored signals, so it would have worked, in case we are concerned about a
possible regression if this were a common thing.



> 
> And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().
> 
> Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

You mean the vhost_task's task/thread doing a function that does a
copy_process right? That type of thing is not needed. I can add a check in
vhost_task_create for this so new code doesn't try to do it. I don't think it
will come up that some code vhost is using will call
kernel_thread/copy_process directly since those calls are so rare and the
functions are not exported to modules.


> 
> If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
> returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
> flush the pending works on ->work_list before exit, I dunno. But imo it should
> not wait for the final fput().
> 
> Oleg.
> 



Re: [PATCH] scsi: virtio_scsi: Remove a useless function call

2023-05-29 Thread Stefan Hajnoczi
On Mon, May 29, 2023 at 09:35:08AM +0200, Christophe JAILLET wrote:
> 'inq_result' is known to be NULL. There is no point calling kfree().
> 
> Signed-off-by: Christophe JAILLET 
> ---
>  drivers/scsi/virtio_scsi.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi 



Re: [PATCH] scsi: virtio_scsi: Remove a useless function call

2023-05-29 Thread Paolo Bonzini

On 5/29/23 09:35, Christophe JAILLET wrote:

'inq_result' is known to be NULL. There is no point calling kfree().

Signed-off-by: Christophe JAILLET 
---
  drivers/scsi/virtio_scsi.c | 4 +---
  1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 58498da9869a..bd5633667d01 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -338,10 +338,8 @@ static int virtscsi_rescan_hotunplug(struct virtio_scsi *vscsi)
 	int result, inquiry_len, inq_result_len = 256;
 	char *inq_result = kmalloc(inq_result_len, GFP_KERNEL);
 
-	if (!inq_result) {
-		kfree(inq_result);
+	if (!inq_result)
 		return -ENOMEM;
-	}
 
 	shost_for_each_device(sdev, shost) {
 		inquiry_len = sdev->inquiry_len ? sdev->inquiry_len : 36;


Reviewed-by: Paolo Bonzini 



Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 09:13:09PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 5/29/2023 8:04 PM, Michael S. Tsirkin wrote:
> > On Mon, May 29, 2023 at 06:41:54PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 5/29/2023 6:12 PM, Michael S. Tsirkin wrote:
> > > 
> > >  On Mon, May 29, 2023 at 04:07:42PM +0800, Zhu, Lingshan wrote:
> > > 
> > > 
> > >  On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:
> > > 
> > >  On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan 
> > > wrote:
> > > 
> > >  On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:
> > > 
> > >  On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu 
> > > Lingshan wrote:
> > > 
> > >  Current virtio-net only probes a device with 
> > > VIRITO_ID_NET == 1.
> > > 
> > >  For a modern-transtional virtio-net device which 
> > > has a transtional
> > >  device id 0x1000 and acts as a modern device, 
> > > current virtio-pci
> > >  modern driver will assign the sub-device-id to 
> > > its mdev->id.device,
> > >  which may not be 0x1, this sub-device-id is up 
> > > to the vendor.
> > > 
> > >  That means virtio-net driver doesn't probe a 
> > > modern-transitonal
> > >  virtio-net with a sub-device-id other than 0x1, 
> > > which is a bug.
> > > 
> > >  No, the bug is in the device. Legacy linux drivers 
> > > always looked at
> > >  sub device id (other OSes might differ). So it makes 
> > > no sense
> > >  for a transitional device to have sub-device-id 
> > > other than 0x1.
> > >  Don't have time to look at spec but I think you will 
> > > find it there.
> > > 
> > >  That is true for a software emulated transitional device,
> > >  because there is only "generation" of instance in the 
> > > hypervisor,
> > >  that allowing it to ensure its sub-device-id always be 
> > > 0x01,
> > >  and it fits VIRTIO_ID_NET.
> > > 
> > >  However, a vendor may produce multiple generations of 
> > > transitional
> > >  hardware. The sub-device-id is up to the vendor, and it 
> > > is the
> > >  only way to for a driver to identify a device, other IDs 
> > > are all
> > >  fixed as 0x1af4, 0x1000 and 0x8086 for Intel.
> > > 
> > >  That is one of the issues with legacy virtio, yes.
> > > 
> > > 
> > > 
> > > 
> > >  So the sub-device-id has to be unique and differ from 
> > > others, can not always
> > >  be 0x01.
> > > 
> > >  If you are trying to build a device and want to create a 
> > > safe way to
> > >  identify it without breaking legacy drivers, then
> > >  VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like 
> > > this.
> > >  For example you can have:
> > > 
> > >  struct virtio_pci_vndr_data {
> > >   u8 cap_vndr;/* Generic PCI field: 
> > > PCI_CAP_ID_VNDR */
> > >   u8 cap_next;/* Generic PCI field: next ptr. */
> > >   u8 cap_len; /* Generic PCI field: capability 
> > > length */
> > >   u8 cfg_type;/* Identifies the structure. */
> > >   u16 vendor_id;  /* Identifies the vendor-specific 
> > > format. */
> > >   u16 device_generation;  /* Device generation */
> > >  };
> > > 
> > >  This can be a solution for sure.
> > > 
> > >  I propose this fix, all changes are for 
> > > modern-transitional devices in
> > >  modern
> > >  code path, not for legacy nor legacy-transitional.
> > > 
> > >  Thanks
> > > 
> > >  But what good is this fix? If you just want the modern 
> > > driver to bind
> > >  and ignore legacy just create a modern device, you can play
> > >  with subsystem id and vendor to your heart's content then.
> > > 
> > >  Not sure who but there are some use-cases require
> > >  transnational devices than modern devices,
> > >  I don't like this neither.
> > > 
> > >  If you are using transitional then presumably you want
> > >  legacy drives to bind, they will not bind if subsystem device
> > >  id changes.
> > > 
> > >  well actually it is a transitional device and act as a
> > >  modern device by default, so modern driver will probe.
> > > 
> > >  I think this fix is common and easy, just let virtio-net
> > >  probe transitional device id 0x1000 just like it probes
> > >  modern device id 0x1. This is a once for all fix.
> > > 
> > >  This fix only affects 

Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Zhu, Lingshan




On 5/29/2023 8:04 PM, Michael S. Tsirkin wrote:

On Mon, May 29, 2023 at 06:41:54PM +0800, Zhu, Lingshan wrote:


On 5/29/2023 6:12 PM, Michael S. Tsirkin wrote:

 On Mon, May 29, 2023 at 04:07:42PM +0800, Zhu, Lingshan wrote:


 On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:

 On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:

 On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:

 On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan 
wrote:

 Current virtio-net only probes a device with 
VIRITO_ID_NET == 1.

 For a modern-transtional virtio-net device which has a 
transtional
 device id 0x1000 and acts as a modern device, current 
virtio-pci
 modern driver will assign the sub-device-id to its 
mdev->id.device,
 which may not be 0x1, this sub-device-id is up to the 
vendor.

 That means virtio-net driver doesn't probe a 
modern-transitonal
 virtio-net with a sub-device-id other than 0x1, which 
is a bug.

 No, the bug is in the device. Legacy linux drivers always 
looked at
 sub device id (other OSes might differ). So it makes no 
sense
 for a transitional device to have sub-device-id other than 
0x1.
 Don't have time to look at spec but I think you will find 
it there.

 That is true for a software emulated transitional device,
 because there is only "generation" of instance in the 
hypervisor,
 that allowing it to ensure its sub-device-id always be 0x01,
 and it fits VIRTIO_ID_NET.

 However, a vendor may produce multiple generations of 
transitional
 hardware. The sub-device-id is up to the vendor, and it is the
 only way to for a driver to identify a device, other IDs are 
all
 fixed as 0x1af4, 0x1000 and 0x8086 for Intel.

 That is one of the issues with legacy virtio, yes.




 So the sub-device-id has to be unique and differ from others, 
can not always
 be 0x01.

 If you are trying to build a device and want to create a safe way 
to
 identify it without breaking legacy drivers, then
 VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
 For example you can have:

 struct virtio_pci_vndr_data {
  u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
  u8 cap_next;/* Generic PCI field: next ptr. */
  u8 cap_len; /* Generic PCI field: capability length */
  u8 cfg_type;/* Identifies the structure. */
  u16 vendor_id;  /* Identifies the vendor-specific format. 
*/
  u16 device_generation;  /* Device generation */
 };

 This can be a solution for sure.

 I propose this fix, all changes are for modern-transitional 
devices in
 modern
 code path, not for legacy nor legacy-transitional.

 Thanks

 But what good is this fix? If you just want the modern driver to 
bind
 and ignore legacy just create a modern device, you can play
 with subsystem id and vendor to your heart's content then.

 Not sure who but there are some use-cases require
 transnational devices than modern devices,
 I don't like this neither.

 If you are using transitional then presumably you want
 legacy drives to bind, they will not bind if subsystem device
 id changes.

 well actually it is a transitional device and act as a
 modern device by default, so modern driver will probe.

 I think this fix is common and easy, just let virtio-net
 probe transitional device id 0x1000 just like it probes
 modern device id 0x1. This is a once for all fix.

 This fix only affects modern-transitional devices in modern code path,
 legacy is untouched.

 Thanks

 The point of having transitional as opposed to modern is to allow
 legacy drivers. If you don't need legacy just use a non transitional
 device.

 Your device is out of spec:
 Transitional devices MUST have the PCI Subsystem Device ID
 matching the Virtio Device ID, as indicated in section \ref{sec:Device Types}.

OK, thanks for pointing this out. Since the spec says so, I assume
transitional is almost legacy.

However the spec also says:
Transitional Device: a device supporting both drivers conforming to this
specification, and allowing legacy drivers.

The transitional devices have their own device id, like 0x1000 indicates it is
a 

Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 06:41:54PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 5/29/2023 6:12 PM, Michael S. Tsirkin wrote:
> 
> On Mon, May 29, 2023 at 04:07:42PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:
> 
> On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:
> 
> On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:
> 
> On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan 
> wrote:
> 
> Current virtio-net only probes a device with 
> VIRITO_ID_NET == 1.
> 
> For a modern-transtional virtio-net device which has 
> a transtional
> device id 0x1000 and acts as a modern device, current 
> virtio-pci
> modern driver will assign the sub-device-id to its 
> mdev->id.device,
> which may not be 0x1, this sub-device-id is up to the 
> vendor.
> 
> That means virtio-net driver doesn't probe a 
> modern-transitonal
> virtio-net with a sub-device-id other than 0x1, which 
> is a bug.
> 
> No, the bug is in the device. Legacy linux drivers always 
> looked at
> sub device id (other OSes might differ). So it makes no 
> sense
> for a transitional device to have sub-device-id other 
> than 0x1.
> Don't have time to look at spec but I think you will find 
> it there.
> 
> That is true for a software emulated transitional device,
> because there is only "generation" of instance in the 
> hypervisor,
> that allowing it to ensure its sub-device-id always be 0x01,
> and it fits VIRTIO_ID_NET.
> 
> However, a vendor may produce multiple generations of 
> transitional
> hardware. The sub-device-id is up to the vendor, and it is the
> only way to for a driver to identify a device, other IDs are 
> all
> fixed as 0x1af4, 0x1000 and 0x8086 for Intel.
> 
> That is one of the issues with legacy virtio, yes.
> 
> 
> 
> 
> So the sub-device-id has to be unique and differ from others, 
> can not always
> be 0x01.
> 
> If you are trying to build a device and want to create a safe way 
> to
> identify it without breaking legacy drivers, then
> VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
> For example you can have:
> 
> struct virtio_pci_vndr_data {
>  u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
>  u8 cap_next;/* Generic PCI field: next ptr. */
>  u8 cap_len; /* Generic PCI field: capability length 
> */
>  u8 cfg_type;/* Identifies the structure. */
>  u16 vendor_id;  /* Identifies the vendor-specific 
> format. */
>  u16 device_generation;  /* Device generation */
> };
> 
> This can be a solution for sure.
> 
> I propose this fix, all changes are for modern-transitional 
> devices in
> modern
> code path, not for legacy nor legacy-transitional.
> 
> Thanks
> 
> But what good is this fix? If you just want the modern driver to 
> bind
> and ignore legacy just create a modern device, you can play
> with subsystem id and vendor to your heart's content then.
> 
> Not sure who but there are some use-cases require
> transnational devices than modern devices,
> I don't like this neither.
> 
> If you are using transitional then presumably you want
> legacy drives to bind, they will not bind if subsystem device
> id changes.
> 
> well actually it is a transitional device and act as a
> modern device by default, so modern driver will probe.
> 
> I think this fix is common and easy, just let virtio-net
> probe transitional device id 0x1000 just like it probes
> modern device id 0x1. This is a once for all fix.
> 
> This fix only affects modern-transitional devices in modern code path,
> legacy is untouched.
> 
> Thanks
> 
> The point of having transitional as opposed to modern is to allow
> legacy drivers. If you don't need legacy just use a non transitional
> device.
> 
> Your device is out of spec:
> Transitional devices MUST have the PCI Subsystem Device ID
> matching the Virtio Device ID, as indicated in section 
> \ref{sec:Device Types}.
> 
> OK, thanks for point this out. Since the spec says so, I assume transitional 
> is
> almost legacy.
> 
> However the spec also says:
> Transitional Device a device supporting both drivers 

Re: [PATCH 3/3] fork, vhost: Use CLONE_THREAD to fix freezer/ps regression

2023-05-29 Thread Oleg Nesterov
On 05/27, Eric W. Biederman wrote:
>
> Looking forward I don't see not asking the worker threads to stop
> for the coredump right now causing any problems in the future.
> So I think we can use this to resolve the coredump issue I spotted.

But we have almost the same problem with exec.

Execing thread will wait for vhost_worker() while vhost_worker will wait for
.release -> vhost_task_stop().

And even O_CLOEXEC won't help, do_close_on_exec() is called after de_thread().

Or suppose that vhost_worker's sub-thread forks a child with CLONE_FILES...

If we want CLONE_THREAD, I think vhost_worker() should exit after get_signal()
returns SIGKILL. Perhaps it should "disable" vhost_work_queue() somehow and
flush the pending works on ->work_list before exit, I dunno. But imo it should
not wait for the final fput().

Oleg.



Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Zhu, Lingshan



On 5/29/2023 6:12 PM, Michael S. Tsirkin wrote:

On Mon, May 29, 2023 at 04:07:42PM +0800, Zhu, Lingshan wrote:


On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:

On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:

On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:

On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan wrote:

Current virtio-net only probes a device with VIRTIO_ID_NET == 1.

For a modern-transitional virtio-net device which has a transitional
device id 0x1000 and acts as a modern device, the current virtio-pci
modern driver will assign the sub-device-id to its mdev->id.device,
which may not be 0x1; this sub-device-id is up to the vendor.

That means the virtio-net driver doesn't probe a modern-transitional
virtio-net with a sub-device-id other than 0x1, which is a bug.

No, the bug is in the device. Legacy linux drivers always looked at
sub device id (other OSes might differ). So it makes no sense
for a transitional device to have sub-device-id other than 0x1.
Don't have time to look at spec but I think you will find it there.

That is true for a software emulated transitional device,
because there is only "generation" of instance in the hypervisor,
that allowing it to ensure its sub-device-id always be 0x01,
and it fits VIRTIO_ID_NET.

However, a vendor may produce multiple generations of transitional
hardware. The sub-device-id is up to the vendor, and it is the
only way to for a driver to identify a device, other IDs are all
fixed as 0x1af4, 0x1000 and 0x8086 for Intel.

That is one of the issues with legacy virtio, yes.




So the sub-device-id has to be unique and differ from others, can not always
be 0x01.

If you are trying to build a device and want to create a safe way to
identify it without breaking legacy drivers, then
VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
For example you can have:

struct virtio_pci_vndr_data {
  u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
  u8 cap_next;/* Generic PCI field: next ptr. */
  u8 cap_len; /* Generic PCI field: capability length */
  u8 cfg_type;/* Identifies the structure. */
  u16 vendor_id;  /* Identifies the vendor-specific format. */
  u16 device_generation;  /* Device generation */
};

This can be a solution for sure.

I propose this fix, all changes are for modern-transitional devices in
modern
code path, not for legacy nor legacy-transitional.

Thanks

But what good is this fix? If you just want the modern driver to bind
and ignore legacy just create a modern device, you can play
with subsystem id and vendor to your heart's content then.

Not sure who but there are some use-cases require
transnational devices than modern devices,
I don't like this neither.

If you are using transitional then presumably you want
legacy drives to bind, they will not bind if subsystem device
id changes.

well actually it is a transitional device and act as a
modern device by default, so modern driver will probe.

I think this fix is common and easy, just let virtio-net
probe transitional device id 0x1000 just like it probes
modern device id 0x1. This is a once for all fix.

This fix only affects modern-transitional devices in modern code path,
legacy is untouched.

Thanks

The point of having transitional as opposed to modern is to allow
legacy drivers. If you don't need legacy just use a non transitional
device.

Your device is out of spec:
 Transitional devices MUST have the PCI Subsystem Device ID
 matching the Virtio Device ID, as indicated in section \ref{sec:Device Types}.

OK, thanks for pointing this out. Since the spec says so, I assume
transitional is almost legacy.

However the spec also says:
Transitional Device: a device supporting both drivers conforming to this
specification, and allowing legacy drivers.

The transitional devices have their own device id, like 0x1000 indicates
it is a network device.

Then why does the sub-device-id have to be 0x1 in the spec? Is it because we
have the driver first?


Thanks




So you will have to explain why the setup you are describing
makes any sense at all before we consider this a fix.






Other types of devices also have similar issues, like virtio-blk.

I propose to fix this problem of the modern-transitional device
with this solution, all in the modern code path:
1) assign the device id to mdev->id.device
2) add transitional device ids in the virtio-net (and others) probe table.

Comments are welcome!

Thanks!

Signed-off-by: Zhu Lingshan
---
drivers/net/virtio_net.c   | 1 +
drivers/virtio/virtio_pci_modern_dev.c | 2 +-
2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 56ca1d270304..6b45d8602a6b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4250,6 +4250,7 @@ static __maybe_unused int virtnet_restore(struct 
virtio_device *vdev)
static struct virtio_device_id id_table[] = {
 

Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 04:07:42PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:
> > On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:
> > > 
> > > On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:
> > > > On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan wrote:
> > > > > Current virtio-net only probes a device with VIRITO_ID_NET == 1.
> > > > > 
> > > > > For a modern-transtional virtio-net device which has a transtional
> > > > > device id 0x1000 and acts as a modern device, current virtio-pci
> > > > > modern driver will assign the sub-device-id to its mdev->id.device,
> > > > > which may not be 0x1, this sub-device-id is up to the vendor.
> > > > > 
> > > > > That means virtio-net driver doesn't probe a modern-transitonal
> > > > > virtio-net with a sub-device-id other than 0x1, which is a bug.
> > > > No, the bug is in the device. Legacy linux drivers always looked at
> > > > sub device id (other OSes might differ). So it makes no sense
> > > > for a transitional device to have sub-device-id other than 0x1.
> > > > Don't have time to look at spec but I think you will find it there.
> > > That is true for a software emulated transitional device,
> > > because there is only "generation" of instance in the hypervisor,
> > > that allowing it to ensure its sub-device-id always be 0x01,
> > > and it fits VIRTIO_ID_NET.
> > > 
> > > However, a vendor may produce multiple generations of transitional
> > > hardware. The sub-device-id is up to the vendor, and it is the
> > > only way to for a driver to identify a device, other IDs are all
> > > fixed as 0x1af4, 0x1000 and 0x8086 for Intel.
> > That is one of the issues with legacy virtio, yes.
> > 
> > 
> > 
> > > So the sub-device-id has to be unique and differ from others, can not 
> > > always
> > > be 0x01.
> > 
> > If you are trying to build a device and want to create a safe way to
> > identify it without breaking legacy drivers, then
> > VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
> > For example you can have:
> > 
> > struct virtio_pci_vndr_data {
> >  u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
> >  u8 cap_next;/* Generic PCI field: next ptr. */
> >  u8 cap_len; /* Generic PCI field: capability length */
> >  u8 cfg_type;/* Identifies the structure. */
> >  u16 vendor_id;  /* Identifies the vendor-specific format. */
> >  u16 device_generation;  /* Device generation */
> > };
> This can be a solution for sure.
> > 
> > > I propose this fix, all changes are for modern-transitional devices in
> > > modern
> > > code path, not for legacy nor legacy-transitional.
> > > 
> > > Thanks
> > But what good is this fix? If you just want the modern driver to bind
> > and ignore legacy just create a modern device, you can play
> > with subsystem id and vendor to your heart's content then.
> Not sure who but there are some use-cases require
> transnational devices than modern devices,
> I don't like this neither.
> > 
> > If you are using transitional then presumably you want
> > legacy drives to bind, they will not bind if subsystem device
> > id changes.
> well actually it is a transitional device and act as a
> modern device by default, so modern driver will probe.
> 
> I think this fix is common and easy, just let virtio-net
> probe transitional device id 0x1000 just like it probes
> modern device id 0x1. This is a once for all fix.
> 
> This fix only affects modern-transitional devices in modern code path,
> legacy is untouched.
> 
> Thanks

The point of having transitional as opposed to modern is to allow
legacy drivers. If you don't need legacy, just use a non-transitional
device.

Your device is out of spec:
Transitional devices MUST have the PCI Subsystem Device ID
matching the Virtio Device ID, as indicated in section \ref{sec:Device Types}.

So you will have to explain why the setup you are describing
makes any sense at all before we consider this a fix.



> > 
> > 
> > > > 
> > > > > Other types of devices also have similar issues, like virito-blk.
> > > > > 
> > > > > I propose to fix this problem of modern-transitonal device
> > > > > whith this solution, all in the modern code path:
> > > > > 1) assign the device id to mdev->id.device
> > > > > 2) add transitional device ids in the virtio-net(and others) probe 
> > > > > table.
> > > > > 
> > > > > Comments are welcome!
> > > > > 
> > > > > Thanks!
> > > > > 
> > > > > Signed-off-by: Zhu Lingshan 
> > > > > ---
> > > > >drivers/net/virtio_net.c   | 1 +
> > > > >drivers/virtio/virtio_pci_modern_dev.c | 2 +-
> > > > >2 files changed, 2 insertions(+), 1 deletion(-)
> > > > > 
> > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > > index 56ca1d270304..6b45d8602a6b 100644
> > > > > --- a/drivers/net/virtio_net.c
> > > > > +++ b/drivers/net/virtio_net.c
> > > > > @@ -4250,6 +4250,7 @@ static 

Re: [PATCH] virtio_ring: validate used buffer length

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 09:18:10AM +0800, Jason Wang wrote:
> On Sun, May 28, 2023 at 3:57 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, May 26, 2023 at 02:30:41PM +0800, Jason Wang wrote:
> > > This patch validate
> >
> > validates
> >
> > > the used buffer length provided by the device
> > > before trying to use it.
> >
> > before returning it to caller
> >
> > > This is done by remembering the in buffer
> > > length in a dedicated array during virtqueue_add(), then we can fail
> > > the virtqueue_get_buf() when we find the device is trying to give us a
> > > used buffer length which is greater than we stored before.
> >
> > than what we stored
> >
> > >
> > > This validation is disable
> >
> > disabled
> >
> > > by default via module parameter to unbreak
> > > some existing devices since some legacy devices are known to report
> > > buggy used length.
> > >
> > > Signed-off-by: Jason Wang 
> >
> > First I'm not merging this without more data about
> > what is known to be broken and what is known to work well
> > in the commit log. And how exactly do things work if used length
> > is wrong?
> 
> Assuming the device is malicious, it would be very hard to answer.
> Auditing and fuzzing won't cover every case. Instead of trying to seek
> the answer, we can simply make sure the used in buffer length is
> validated then we know we're fine or not.

To restate the question, you said above "some legacy devices are known
to report buggy used length". If they report buggy length then how
can things work?

> > Second what's wrong with dma_desc_extra that we already maintain?
> > Third motivation - it's part and parcel of the hardening effort yes?
> 
> They are different. dma_desc_extra is for a descriptor ring, but this
> is for a used ring. Technically we can go back to iterate on the
> descriptor ring for a legal used in buffer length. But it will have
> worse performance.

I don't really understand. We already iterate when we unmap -
all that is necessary is to subtract it from used length, if at
the end of the process it is >0 then we know used length is too
large.
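
A rough sketch of the alternative being suggested (hypothetical structures,
not the vring internals or the proposed patch): while walking the descriptor
chain that is already traversed for unmapping, charge the device-reported used
length against the device-writable buffers; any remainder means the device
reported more than it was given.

#include <linux/minmax.h>
#include <linux/types.h>

/* Hypothetical per-descriptor bookkeeping, for illustration only. */
struct example_desc_extra {
	u32 len;	/* length of this buffer */
	u16 next;	/* next descriptor in the chain, or EXAMPLE_END */
	bool write;	/* device-writable? */
};

#define EXAMPLE_END	0xffff

static bool example_used_len_ok(const struct example_desc_extra *extra,
				u16 head, u32 used_len)
{
	u16 i = head;

	while (i != EXAMPLE_END) {
		if (extra[i].write)
			used_len -= min(used_len, extra[i].len);
		i = extra[i].next;
	}

	/* Anything left over was never handed to the device. */
	return used_len == 0;
}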


> > I'd like to know the fate of VIRTIO_HARDEN_NOTIFICATION before
> > we do more hardening. If it's irrevocably broken let's rip it out?
> 
> So the plan is
> 
> 1) finish used ring validation (this had been proposed, merged and
> reverted before notification hardening)
> 2) do notification hardening on top.
> 
> So let's leave it as is and I will do a rework after we finalize the
> used ring validation.
> 
> Thanks
> 
> >
> >
> > > ---
> > > Changes since V4:
> > > - drop the flat for driver to suppress the check
> > > - validation is disabled by default
> > > - don't do validation for legacy device
> > > - rebase and support virtqueue resize
> > > ---
> > >  drivers/virtio/virtio_ring.c | 75 
> > >  1 file changed, 75 insertions(+)
> > >
> > > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > > index 143f380baa1c..5b151605aaf8 100644
> > > --- a/drivers/virtio/virtio_ring.c
> > > +++ b/drivers/virtio/virtio_ring.c
> > > @@ -15,6 +15,9 @@
> > >  #include 
> > >  #include 
> > >
> > > +static bool force_used_validation = false;
> > > +module_param(force_used_validation, bool, 0444);
> > > +
> > >  #ifdef DEBUG
> > >  /* For development, we want to crash whenever the ring is screwed. */
> > >  #define BAD_RING(_vq, fmt, args...)  \
> > > @@ -105,6 +108,9 @@ struct vring_virtqueue_split {
> > >   struct vring_desc_state_split *desc_state;
> > >   struct vring_desc_extra *desc_extra;
> > >
> > > + /* Maximum in buffer length, NULL means no used validation */
> > > + u32 *buflen;
> > > +
> > >   /* DMA address and size information */
> > >   dma_addr_t queue_dma_addr;
> > >   size_t queue_size_in_bytes;
> > > @@ -145,6 +151,9 @@ struct vring_virtqueue_packed {
> > >   struct vring_desc_state_packed *desc_state;
> > >   struct vring_desc_extra *desc_extra;
> > >
> > > + /* Maximum in buffer length, NULL means no used validation */
> > > + u32 *buflen;
> > > +
> > >   /* DMA address and size information */
> > >   dma_addr_t ring_dma_addr;
> > >   dma_addr_t driver_event_dma_addr;
> > > @@ -552,6 +561,7 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >   unsigned int i, n, avail, descs_used, prev, err_idx;
> > >   int head;
> > >   bool indirect;
> > > + u32 buflen = 0;
> > >
> > >   START_USE(vq);
> > >
> > > @@ -635,6 +645,7 @@ static inline int virtqueue_add_split(struct 
> > > virtqueue *_vq,
> > >VRING_DESC_F_NEXT |
> > >VRING_DESC_F_WRITE,
> > >indirect);
> > > + buflen += sg->length;
> > >   }
> > >   }
> > >   /* Last one doesn't continue. */
> > > @@ -675,6 +686,10 

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 03:27:56PM +0800, Liang Chen wrote:
> On Sun, May 28, 2023 at 2:20 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, May 26, 2023 at 01:46:18PM +0800, Liang Chen wrote:
> > > The implementation at the moment uses one page per packet in both the
> > > normal and XDP path. In addition, introducing a module parameter to enable
> > > or disable the usage of page pool (disabled by default).
> > >
> > > In single-core vm testing environments, it gives a modest performance gain
> > > in the normal path.
> > >   Upstream codebase: 47.5 Gbits/sec
> > >   Upstream codebase + page_pool support: 50.2 Gbits/sec
> > >
> > > In multi-core vm testing environments, the most significant performance
> > > gain is observed in XDP cpumap:
> > >   Upstream codebase: 1.38 Gbits/sec
> > >   Upstream codebase + page_pool support: 9.74 Gbits/sec
> > >
> > > With this foundation, we can further integrate page pool fragmentation and
> > > DMA map/unmap support.
> > >
> > > Signed-off-by: Liang Chen 
> >
> > Why off by default?
> > I am guessing it sometimes has performance costs too?
> >
> >
> > What happens if we use page pool for big mode too?
> > The less modes we have the better...
> >
> >
> 
> Sure, now I believe it makes sense to enable it by default. When the
> packet size is very small, it reduces the likelihood of skb
> coalescing. But such cases are rare.

small packets are rare? These workloads are easy to create actually.
Pls try and include benchmark with small packet size.

> The usage of page pool for big mode is being evaluated now. Thanks!
> 
> > > ---
> > >  drivers/net/virtio_net.c | 188 ++-
> > >  1 file changed, 146 insertions(+), 42 deletions(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index c5dca0d92e64..99c0ca0c1781 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -31,6 +31,9 @@ module_param(csum, bool, 0444);
> > >  module_param(gso, bool, 0444);
> > >  module_param(napi_tx, bool, 0644);
> > >
> > > +static bool page_pool_enabled;
> > > +module_param(page_pool_enabled, bool, 0400);
> > > +
> > >  /* FIXME: MTU in config. */
> > >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > >  #define GOOD_COPY_LEN128
> > > @@ -159,6 +162,9 @@ struct receive_queue {
> > >   /* Chain pages by the private ptr. */
> > >   struct page *pages;
> > >
> > > + /* Page pool */
> > > + struct page_pool *page_pool;
> > > +
> > >   /* Average packet length for mergeable receive buffers. */
> > >   struct ewma_pkt_len mrg_avg_pkt_len;
> > >
> > > @@ -459,6 +465,14 @@ static struct sk_buff *virtnet_build_skb(void *buf, 
> > > unsigned int buflen,
> > >   return skb;
> > >  }
> > >
> > > +static void virtnet_put_page(struct receive_queue *rq, struct page *page)
> > > +{
> > > + if (rq->page_pool)
> > > + page_pool_put_full_page(rq->page_pool, page, true);
> > > + else
> > > + put_page(page);
> > > +}
> > > +
> > >  /* Called from bottom half context */
> > >  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > >  struct receive_queue *rq,
> > > @@ -555,7 +569,7 @@ static struct sk_buff *page_to_skb(struct 
> > > virtnet_info *vi,
> > >   hdr = skb_vnet_hdr(skb);
> > >   memcpy(hdr, hdr_p, hdr_len);
> > >   if (page_to_free)
> > > - put_page(page_to_free);
> > > + virtnet_put_page(rq, page_to_free);
> > >
> > >   return skb;
> > >  }
> > > @@ -802,7 +816,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> > >   return ret;
> > >  }
> > >
> > > -static void put_xdp_frags(struct xdp_buff *xdp)
> > > +static void put_xdp_frags(struct xdp_buff *xdp, struct receive_queue *rq)
> > >  {
> > >   struct skb_shared_info *shinfo;
> > >   struct page *xdp_page;
> > > @@ -812,7 +826,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > >   shinfo = xdp_get_shared_info_from_buff(xdp);
> > >   for (i = 0; i < shinfo->nr_frags; i++) {
> > >   xdp_page = skb_frag_page(&shinfo->frags[i]);
> > > - put_page(xdp_page);
> > > + virtnet_put_page(rq, xdp_page);
> > >   }
> > >   }
> > >  }
> > > @@ -903,7 +917,11 @@ static struct page *xdp_linearize_page(struct 
> > > receive_queue *rq,
> > >   if (page_off + *len + tailroom > PAGE_SIZE)
> > >   return NULL;
> > >
> > > - page = alloc_page(GFP_ATOMIC);
> > > + if (rq->page_pool)
> > > + page = page_pool_dev_alloc_pages(rq->page_pool);
> > > + else
> > > + page = alloc_page(GFP_ATOMIC);
> > > +
> > >   if (!page)
> > >   return NULL;
> > >
> > > @@ -926,21 +944,24 @@ static struct page *xdp_linearize_page(struct 
> > > receive_queue *rq,
> > >* is sending packet larger than the MTU.
> > >*/
> > >   

Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Zhu, Lingshan




On 5/29/2023 2:38 PM, Michael S. Tsirkin wrote:

On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:


On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:

On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan wrote:

Current virtio-net only probes a device with VIRTIO_ID_NET == 1.

For a modern-transitional virtio-net device which has a transitional
device id 0x1000 and acts as a modern device, the current virtio-pci
modern driver will assign the sub-device-id to its mdev->id.device,
which may not be 0x1; this sub-device-id is up to the vendor.

That means the virtio-net driver doesn't probe a modern-transitional
virtio-net device with a sub-device-id other than 0x1, which is a bug.

No, the bug is in the device. Legacy linux drivers always looked at
sub device id (other OSes might differ). So it makes no sense
for a transitional device to have sub-device-id other than 0x1.
Don't have time to look at spec but I think you will find it there.

That is true for a software-emulated transitional device,
because there is only one "generation" of the device instance in the
hypervisor, which allows it to ensure its sub-device-id is always 0x01,
and it fits VIRTIO_ID_NET.

However, a vendor may produce multiple generations of transitional
hardware. The sub-device-id is up to the vendor, and it is the
only way for a driver to identify a device; the other IDs are all
fixed as 0x1af4, 0x1000 and 0x8086 for Intel.

That is one of the issues with legacy virtio, yes.




So the sub-device-id has to be unique and differ from others; it cannot
always be 0x01.


If you are trying to build a device and want to create a safe way to
identify it without breaking legacy drivers, then
VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
For example you can have:

struct virtio_pci_vndr_data {
 u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
 u8 cap_next;/* Generic PCI field: next ptr. */
 u8 cap_len; /* Generic PCI field: capability length */
 u8 cfg_type;/* Identifies the structure. */
 u16 vendor_id;  /* Identifies the vendor-specific format. */
 u16 device_generation;  /* Device generation */
};
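
A driver would then just walk the standard vendor capability list to
find it. Rough untested sketch (the helper name and the MY_* constants
are made up, only the struct layout above is assumed):

	static int my_read_device_generation(struct pci_dev *pci_dev, u16 *gen)
	{
		int pos;

		for (pos = pci_find_capability(pci_dev, PCI_CAP_ID_VNDR);
		     pos > 0;
		     pos = pci_find_next_capability(pci_dev, pos, PCI_CAP_ID_VNDR)) {
			u8 cfg_type;
			u16 vendor_id;

			pci_read_config_byte(pci_dev, pos +
				offsetof(struct virtio_pci_vndr_data, cfg_type),
				&cfg_type);
			pci_read_config_word(pci_dev, pos +
				offsetof(struct virtio_pci_vndr_data, vendor_id),
				&vendor_id);
			if (cfg_type != MY_CFG_TYPE || vendor_id != MY_VENDOR_ID)
				continue;

			pci_read_config_word(pci_dev, pos +
				offsetof(struct virtio_pci_vndr_data, device_generation),
				gen);
			return 0;
		}

		return -ENODEV;
	}

Legacy drivers never look at this capability, so it cannot confuse them.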

This can be a solution for sure.



I propose this fix; all changes are for modern-transitional devices in the
modern code path, not for legacy nor legacy-transitional.

Thanks

But what good is this fix? If you just want the modern driver to bind
and ignore legacy just create a modern device, you can play
with subsystem id and vendor to your heart's content then.

I'm not sure who exactly, but there are some use cases that require
transitional devices rather than modern devices.
I don't like this either.


If you are using transitional then presumably you want
legacy drivers to bind; they will not bind if the subsystem device
id changes.

Well, actually it is a transitional device that acts as a
modern device by default, so the modern driver will probe it.

I think this fix is simple and generic: just let virtio-net
probe the transitional device id 0x1000 just like it probes
the modern device id 0x1. This is a once-for-all fix.

This fix only affects modern-transitional devices in the modern code path;
legacy is untouched.

Thanks






Other types of devices also have similar issues, like virtio-blk.

I propose to fix this problem of modern-transitional devices
with this solution, all in the modern code path:
1) assign the device id to mdev->id.device
2) add transitional device ids to the virtio-net (and others) probe table.

Comments are welcome!

Thanks!

Signed-off-by: Zhu Lingshan 
---
   drivers/net/virtio_net.c   | 1 +
   drivers/virtio/virtio_pci_modern_dev.c | 2 +-
   2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 56ca1d270304..6b45d8602a6b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4250,6 +4250,7 @@ static __maybe_unused int virtnet_restore(struct 
virtio_device *vdev)
   static struct virtio_device_id id_table[] = {
{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
+   { VIRTIO_TRANS_ID_NET, VIRTIO_DEV_ANY_ID },
{ 0 },
   };
diff --git a/drivers/virtio/virtio_pci_modern_dev.c 
b/drivers/virtio/virtio_pci_modern_dev.c
index 869cb46bef96..80846e1195ce 100644
--- a/drivers/virtio/virtio_pci_modern_dev.c
+++ b/drivers/virtio/virtio_pci_modern_dev.c
@@ -229,7 +229,7 @@ int vp_modern_probe(struct virtio_pci_modern_device *mdev)
/* Transitional devices: use the PCI subsystem device id as
 * virtio device id, same as legacy driver always did.
 */
-   mdev->id.device = pci_dev->subsystem_device;
+   mdev->id.device = pci_dev->device;
} else {
/* Modern devices: simply use PCI device id, but start from 
0x1040. */
mdev->id.device = pci_dev->device - 0x1040;
--
2.39.1


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org

Re: [PATCH net-next 3/5] virtio_net: Add page pool fragmentation support

2023-05-29 Thread Liang Chen
On Mon, May 29, 2023 at 9:33 AM Yunsheng Lin  wrote:
>
> On 2023/5/26 13:46, Liang Chen wrote:
>
> ...
>
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 99c0ca0c1781..ac40b8c66c59 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -32,7 +32,9 @@ module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> >
> >  static bool page_pool_enabled;
> > +static bool page_pool_frag;
> >  module_param(page_pool_enabled, bool, 0400);
> > +module_param(page_pool_frag, bool, 0400);
>
> The below patchset unifies the frag and non-frag page for
> page_pool_alloc_frag() API, perhaps it would simplify the
> driver's support of page pool.
>
> https://patchwork.kernel.org/project/netdevbpf/cover/20230526092616.40355-1-linyunsh...@huawei.com/
>

Thanks for the information and the work to make driver support easy. I
will rebase accordingly after it lands.

> >
>
> ...
>
> > @@ -1769,13 +1788,29 @@ static int add_recvbuf_mergeable(struct 
> > virtnet_info *vi,
> >*/
> >   len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
> >   if (rq->page_pool) {
> > - struct page *page;
> > + if (rq->page_pool->p.flags & PP_FLAG_PAGE_FRAG) {
> > + if (unlikely(!page_pool_dev_alloc_frag(rq->page_pool,
> > +
> > &pp_frag_offset, len + room)))
> > + return -ENOMEM;
> > + buf = (char *)page_address(rq->page_pool->frag_page) +
> > + pp_frag_offset;
> > + buf += headroom; /* advance address leaving hole at 
> > front of pkt */
> > + hole = (PAGE_SIZE << rq->page_pool->p.order)
> > + - rq->page_pool->frag_offset;
> > + if (hole < len + room) {
> > + if (!headroom)
> > + len += hole;
> > + rq->page_pool->frag_offset += hole;
>
> Is there any reason why the driver need to be aware of page_pool->frag_offset?
> Isn't the page_pool_dev_alloc_frag() will drain the last page for you when
> page_pool_dev_alloc_frag() is called with size being 'len + room' later?
> One case I can think of needing this is to have an accurate truesize report
> for skb, but I am not sure it matters that much as 'struct page_frag_cache'
> and 'page_frag' implementation both have a similar problem.
>

Yeah, as you pointed out, page_pool_dev_alloc_frag will drain the page
itself, and so does skb_page_frag_refill. This is trying to keep the logic
consistent with the non-page-pool case, where the hole is skipped and
included in the buffer len.

> > + }
> > + } else {
> > + struct page *page;
> >
> > - page = page_pool_dev_alloc_pages(rq->page_pool);
> > - if (unlikely(!page))
> > - return -ENOMEM;
> > - buf = (char *)page_address(page);
> > - buf += headroom; /* advance address leaving hole at front of 
> > pkt */
> > + page = page_pool_dev_alloc_pages(rq->page_pool);
> > + if (unlikely(!page))
> > + return -ENOMEM;
> > + buf = (char *)page_address(page);
> > + buf += headroom; /* advance address leaving hole at 
> > front of pkt */
> > + }
> >   } else {
> >   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, 
> > gfp)))
> >   return -ENOMEM;
> > @@ -3800,13 +3835,16 @@ static void virtnet_alloc_page_pool(struct 
> > receive_queue *rq)
> >   struct virtio_device *vdev = rq->vq->vdev;
> >
> >   struct page_pool_params pp_params = {
> > - .order = 0,
> > + .order = page_pool_frag ? SKB_FRAG_PAGE_ORDER : 0,
> >   .pool_size = rq->vq->num_max,
>
> If it using order SKB_FRAG_PAGE_ORDER page, perhaps pool_size does
> not have to be rq->vq->num_max? Even for order 0 page, perhaps the
> pool_size does not need to be as big as rq->vq->num_max?
>

Thanks for pointing this out! pool_size will be lowered to a more
appropriate value on v2.


> >   .nid = dev_to_node(vdev->dev.parent),
> >   .dev = vdev->dev.parent,
> >   .offset = 0,
> >   };
> >
> > + if (page_pool_frag)
> > + pp_params.flags |= PP_FLAG_PAGE_FRAG;
> > +
> >   rq->page_pool = page_pool_create(&pp_params);
> >   if (IS_ERR(rq->page_pool)) {
> >   dev_warn(&vdev->dev, "page pool creation failed: %ld\n",
> >
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH net-next 3/5] virtio_net: Add page pool fragmentation support

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:25 PM Michael S. Tsirkin  wrote:
>
> On Fri, May 26, 2023 at 01:46:19PM +0800, Liang Chen wrote:
> > To further enhance performance, implement page pool fragmentation
> > support and introduce a module parameter to enable or disable it.
> >
> > In single-core vm testing environments, there is an additional performance
> > gain observed in the normal path compared to the one packet per page
> > approach.
> >   Upstream codebase: 47.5 Gbits/sec
> >   Upstream codebase with page pool: 50.2 Gbits/sec
> >   Upstream codebase with page pool fragmentation support: 52.3 Gbits/sec
> >
> > There is also some performance gain for XDP cpumap.
> >   Upstream codebase: 1.38 Gbits/sec
> >   Upstream codebase with page pool: 9.74 Gbits/sec
> >   Upstream codebase with page pool fragmentation: 10.3 Gbits/sec
> >
> > Signed-off-by: Liang Chen 
>
> I think it's called fragmenting not fragmentation.
>
>

Sure, thanks!

> > ---
> >  drivers/net/virtio_net.c | 72 ++--
> >  1 file changed, 55 insertions(+), 17 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 99c0ca0c1781..ac40b8c66c59 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -32,7 +32,9 @@ module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> >
> >  static bool page_pool_enabled;
> > +static bool page_pool_frag;
> >  module_param(page_pool_enabled, bool, 0400);
> > +module_param(page_pool_frag, bool, 0400);
> >
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
>
> So here again same questions.
>
> -when is this a net perf gain when does it have no effect?
> -can be on by default
> - can we get rid of the extra modes?
>
>

Yeah, now I believe it makes sense to enable it by default to avoid
the extra modes. Thanks.


> > @@ -909,23 +911,32 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >  struct page *p,
> >  int offset,
> >  int page_off,
> > -unsigned int *len)
> > +unsigned int *len,
> > +unsigned int *pp_frag_offset)
> >  {
> >   int tailroom = SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> >   struct page *page;
> > + unsigned int pp_frag_offset_val;
> >
> >   if (page_off + *len + tailroom > PAGE_SIZE)
> >   return NULL;
> >
> >   if (rq->page_pool)
> > - page = page_pool_dev_alloc_pages(rq->page_pool);
> > + if (rq->page_pool->p.flags & PP_FLAG_PAGE_FRAG)
> > + page = page_pool_dev_alloc_frag(rq->page_pool, 
> > pp_frag_offset,
> > + PAGE_SIZE);
> > + else
> > + page = page_pool_dev_alloc_pages(rq->page_pool);
> >   else
> >   page = alloc_page(GFP_ATOMIC);
> >
> >   if (!page)
> >   return NULL;
> >
> > - memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
> > + pp_frag_offset_val = pp_frag_offset ? *pp_frag_offset : 0;
> > +
> > + memcpy(page_address(page) + page_off + pp_frag_offset_val,
> > +page_address(p) + offset, *len);
> >   page_off += *len;
> >
> >   while (--*num_buf) {
> > @@ -948,7 +959,7 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >   goto err_buf;
> >   }
> >
> > - memcpy(page_address(page) + page_off,
> > + memcpy(page_address(page) + page_off + pp_frag_offset_val,
> >  page_address(p) + off, buflen);
> >   page_off += buflen;
> >   virtnet_put_page(rq, p);
> > @@ -1029,7 +1040,7 @@ static struct sk_buff *receive_small_xdp(struct 
> > net_device *dev,
> >   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> > + xdp_page = xdp_linearize_page(rq, &num_buf, page,
> > offset, header_offset,
> > -   &len);
> > +   &len, NULL);
> >   if (!xdp_page)
> >   goto err_xdp;
> >
> > @@ -1323,6 +1334,7 @@ static void *mergeable_xdp_get_buf(struct 
> > virtnet_info *vi,
> >   unsigned int headroom = mergeable_ctx_to_headroom(ctx);
> >   struct page *xdp_page;
> >   unsigned int xdp_room;
> > + unsigned int page_frag_offset = 0;
> >
> >   /* Transient failure which in theory could occur if
> >* in-flight packets from before XDP was enabled reach
> > @@ -1356,7 +1368,8 @@ static void *mergeable_xdp_get_buf(struct 
> > virtnet_info *vi,
> >   xdp_page = xdp_linearize_page(rq, num_buf,
> > 

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:40 PM Michael S. Tsirkin  wrote:
>
> On Sat, May 27, 2023 at 08:35:01PM +0800, Liang Chen wrote:
> > On Fri, May 26, 2023 at 2:51 PM Jason Wang  wrote:
> > >
> > > On Fri, May 26, 2023 at 1:46 PM Liang Chen  
> > > wrote:
> > > >
> > > > The implementation at the moment uses one page per packet in both the
> > > > normal and XDP path.
> > >
> > > It's better to explain why we need a page pool and how it can help the
> > > performance.
> > >
> >
> > Sure, I will include that on v2.
> > > > In addition, introducing a module parameter to enable
> > > > or disable the usage of page pool (disabled by default).
> > >
> > > If page pool wins for most of the cases, any reason to disable it by 
> > > default?
> > >
> >
> > Thank you for raising the point. It does make sense to enable it by default.
>
> I'd like to see more benchmarks pls then, with a variety of packet
> sizes, udp and tcp.
>

Sure, more benchmarks will be provided. Thanks.


> > > >
> > > > In single-core vm testing environments, it gives a modest performance 
> > > > gain
> > > > in the normal path.
> > > >   Upstream codebase: 47.5 Gbits/sec
> > > >   Upstream codebase + page_pool support: 50.2 Gbits/sec
> > > >
> > > > In multi-core vm testing environments, the most significant performance
> > > > gain is observed in XDP cpumap:
> > > >   Upstream codebase: 1.38 Gbits/sec
> > > >   Upstream codebase + page_pool support: 9.74 Gbits/sec
> > >
> > > Please show more details on the test. E.g which kinds of tests have
> > > you measured?
> > >
> > > Btw, it would be better to measure PPS as well.
> > >
> >
> > Sure. It will be added on v2.
> > > >
> > > > With this foundation, we can further integrate page pool fragmentation 
> > > > and
> > > > DMA map/unmap support.
> > > >
> > > > Signed-off-by: Liang Chen 
> > > > ---
> > > >  drivers/net/virtio_net.c | 188 ++-
> > >
> > > I believe we should make virtio-net to select CONFIG_PAGE_POOL or do
> > > the ifdef tricks at least.
> > >
> >
> > Sure. it will be done on v2.
> > > >  1 file changed, 146 insertions(+), 42 deletions(-)
> > > >
> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > > index c5dca0d92e64..99c0ca0c1781 100644
> > > > --- a/drivers/net/virtio_net.c
> > > > +++ b/drivers/net/virtio_net.c
> > > > @@ -31,6 +31,9 @@ module_param(csum, bool, 0444);
> > > >  module_param(gso, bool, 0444);
> > > >  module_param(napi_tx, bool, 0644);
> > > >
> > > > +static bool page_pool_enabled;
> > > > +module_param(page_pool_enabled, bool, 0400);
> > > > +
> > > >  /* FIXME: MTU in config. */
> > > >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> > > >  #define GOOD_COPY_LEN  128
> > > > @@ -159,6 +162,9 @@ struct receive_queue {
> > > > /* Chain pages by the private ptr. */
> > > > struct page *pages;
> > > >
> > > > +   /* Page pool */
> > > > +   struct page_pool *page_pool;
> > > > +
> > > > /* Average packet length for mergeable receive buffers. */
> > > > struct ewma_pkt_len mrg_avg_pkt_len;
> > > >
> > > > @@ -459,6 +465,14 @@ static struct sk_buff *virtnet_build_skb(void 
> > > > *buf, unsigned int buflen,
> > > > return skb;
> > > >  }
> > > >
> > > > +static void virtnet_put_page(struct receive_queue *rq, struct page 
> > > > *page)
> > > > +{
> > > > +   if (rq->page_pool)
> > > > +   page_pool_put_full_page(rq->page_pool, page, true);
> > > > +   else
> > > > +   put_page(page);
> > > > +}
> > > > +
> > > >  /* Called from bottom half context */
> > > >  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> > > >struct receive_queue *rq,
> > > > @@ -555,7 +569,7 @@ static struct sk_buff *page_to_skb(struct 
> > > > virtnet_info *vi,
> > > > hdr = skb_vnet_hdr(skb);
> > > > memcpy(hdr, hdr_p, hdr_len);
> > > > if (page_to_free)
> > > > -   put_page(page_to_free);
> > > > +   virtnet_put_page(rq, page_to_free);
> > > >
> > > > return skb;
> > > >  }
> > > > @@ -802,7 +816,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> > > > return ret;
> > > >  }
> > > >
> > > > -static void put_xdp_frags(struct xdp_buff *xdp)
> > > > +static void put_xdp_frags(struct xdp_buff *xdp, struct receive_queue 
> > > > *rq)
> > > >  {
> > >
> > > rq could be fetched from xdp_rxq_info?
> >
> > Yeah, it has the queue_index there.
> > >
> > > > struct skb_shared_info *shinfo;
> > > > struct page *xdp_page;
> > > > @@ -812,7 +826,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> > > > shinfo = xdp_get_shared_info_from_buff(xdp);
> > > > for (i = 0; i < shinfo->nr_frags; i++) {
> > > > xdp_page = skb_frag_page(&shinfo->frags[i]);
> > > > -   put_page(xdp_page);
> > > > +   virtnet_put_page(rq, xdp_page);
> > > >  

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:27 PM Michael S. Tsirkin  wrote:
>
> On Sat, May 27, 2023 at 12:11:25AM +0800, kernel test robot wrote:
> > Hi Liang,
> >
> > kernel test robot noticed the following build errors:
> >
> > [auto build test ERROR on net-next/main]
> >
> > url:
> > https://github.com/intel-lab-lkp/linux/commits/Liang-Chen/virtio_net-Add-page_pool-support-to-improve-performance/20230526-135805
> > base:   net-next/main
> > patch link:
> > https://lore.kernel.org/r/20230526054621.18371-2-liangchen.linux%40gmail.com
> > patch subject: [PATCH net-next 2/5] virtio_net: Add page_pool support to 
> > improve performance
> > config: x86_64-defconfig 
> > (https://download.01.org/0day-ci/archive/20230526/202305262334.gifq3wpg-...@intel.com/config)
> > compiler: gcc-11 (Debian 11.3.0-12) 11.3.0
> > reproduce (this is a W=1 build):
> > # 
> > https://github.com/intel-lab-lkp/linux/commit/bfba563f43bba37181d8502cb2e566c32f96ec9e
> > git remote add linux-review https://github.com/intel-lab-lkp/linux
> > git fetch --no-tags linux-review 
> > Liang-Chen/virtio_net-Add-page_pool-support-to-improve-performance/20230526-135805
> > git checkout bfba563f43bba37181d8502cb2e566c32f96ec9e
> > # save the config file
> > mkdir build_dir && cp config build_dir/.config
> > make W=1 O=build_dir ARCH=x86_64 olddefconfig
> > make W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
> >
> > If you fix the issue, kindly add following tag where applicable
> > | Reported-by: kernel test robot 
> > | Closes: 
> > https://lore.kernel.org/oe-kbuild-all/202305262334.gifq3wpg-...@intel.com/
> >
> > All errors (new ones prefixed by >>):
> >
> >ld: vmlinux.o: in function `virtnet_find_vqs':
> > >> virtio_net.c:(.text+0x901fb5): undefined reference to `page_pool_create'
> >ld: vmlinux.o: in function `add_recvbuf_mergeable.isra.0':
> > >> virtio_net.c:(.text+0x905618): undefined reference to 
> > >> `page_pool_alloc_pages'
> >ld: vmlinux.o: in function `xdp_linearize_page':
> >virtio_net.c:(.text+0x906b6b): undefined reference to 
> > `page_pool_alloc_pages'
> >ld: vmlinux.o: in function `mergeable_xdp_get_buf.isra.0':
> >virtio_net.c:(.text+0x90728f): undefined reference to 
> > `page_pool_alloc_pages'
>
>
> you need to tweak Kconfig to select PAGE_POOL I think.
>

Sure, thanks!
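
I.e. something roughly like this for the VIRTIO_NET entry in
drivers/net/Kconfig (untested; the surrounding lines are approximate
and may not match the current tree, the added select is the only
intended change):

	config VIRTIO_NET
		tristate "Virtio network driver"
		depends on VIRTIO
		select NET_FAILOVER
		select PAGE_POOL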


> > --
> > 0-DAY CI Kernel Test Service
> > https://github.com/intel/lkp-tests/wiki
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH net-next 2/5] virtio_net: Add page_pool support to improve performance

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:20 PM Michael S. Tsirkin  wrote:
>
> On Fri, May 26, 2023 at 01:46:18PM +0800, Liang Chen wrote:
> > The implementation at the moment uses one page per packet in both the
> > normal and XDP path. In addition, introducing a module parameter to enable
> > or disable the usage of page pool (disabled by default).
> >
> > In single-core vm testing environments, it gives a modest performance gain
> > in the normal path.
> >   Upstream codebase: 47.5 Gbits/sec
> >   Upstream codebase + page_pool support: 50.2 Gbits/sec
> >
> > In multi-core vm testing environments, the most significant performance
> > gain is observed in XDP cpumap:
> >   Upstream codebase: 1.38 Gbits/sec
> >   Upstream codebase + page_pool support: 9.74 Gbits/sec
> >
> > With this foundation, we can further integrate page pool fragmentation and
> > DMA map/unmap support.
> >
> > Signed-off-by: Liang Chen 
>
> Why off by default?
> I am guessing it sometimes has performance costs too?
>
>
> What happens if we use page pool for big mode too?
> The less modes we have the better...
>
>

Sure, now I believe it makes sense to enable it by default. When the
packet size is very small, it reduces the likelihood of skb
coalescing. But such cases are rare.
The usage of page pool for big mode is being evaluated now. Thanks!

> > ---
> >  drivers/net/virtio_net.c | 188 ++-
> >  1 file changed, 146 insertions(+), 42 deletions(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index c5dca0d92e64..99c0ca0c1781 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -31,6 +31,9 @@ module_param(csum, bool, 0444);
> >  module_param(gso, bool, 0444);
> >  module_param(napi_tx, bool, 0644);
> >
> > +static bool page_pool_enabled;
> > +module_param(page_pool_enabled, bool, 0400);
> > +
> >  /* FIXME: MTU in config. */
> >  #define GOOD_PACKET_LEN (ETH_HLEN + VLAN_HLEN + ETH_DATA_LEN)
> >  #define GOOD_COPY_LEN128
> > @@ -159,6 +162,9 @@ struct receive_queue {
> >   /* Chain pages by the private ptr. */
> >   struct page *pages;
> >
> > + /* Page pool */
> > + struct page_pool *page_pool;
> > +
> >   /* Average packet length for mergeable receive buffers. */
> >   struct ewma_pkt_len mrg_avg_pkt_len;
> >
> > @@ -459,6 +465,14 @@ static struct sk_buff *virtnet_build_skb(void *buf, 
> > unsigned int buflen,
> >   return skb;
> >  }
> >
> > +static void virtnet_put_page(struct receive_queue *rq, struct page *page)
> > +{
> > + if (rq->page_pool)
> > + page_pool_put_full_page(rq->page_pool, page, true);
> > + else
> > + put_page(page);
> > +}
> > +
> >  /* Called from bottom half context */
> >  static struct sk_buff *page_to_skb(struct virtnet_info *vi,
> >  struct receive_queue *rq,
> > @@ -555,7 +569,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
> > *vi,
> >   hdr = skb_vnet_hdr(skb);
> >   memcpy(hdr, hdr_p, hdr_len);
> >   if (page_to_free)
> > - put_page(page_to_free);
> > + virtnet_put_page(rq, page_to_free);
> >
> >   return skb;
> >  }
> > @@ -802,7 +816,7 @@ static int virtnet_xdp_xmit(struct net_device *dev,
> >   return ret;
> >  }
> >
> > -static void put_xdp_frags(struct xdp_buff *xdp)
> > +static void put_xdp_frags(struct xdp_buff *xdp, struct receive_queue *rq)
> >  {
> >   struct skb_shared_info *shinfo;
> >   struct page *xdp_page;
> > @@ -812,7 +826,7 @@ static void put_xdp_frags(struct xdp_buff *xdp)
> >   shinfo = xdp_get_shared_info_from_buff(xdp);
> >   for (i = 0; i < shinfo->nr_frags; i++) {
> >   xdp_page = skb_frag_page(&shinfo->frags[i]);
> > - put_page(xdp_page);
> > + virtnet_put_page(rq, xdp_page);
> >   }
> >   }
> >  }
> > @@ -903,7 +917,11 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >   if (page_off + *len + tailroom > PAGE_SIZE)
> >   return NULL;
> >
> > - page = alloc_page(GFP_ATOMIC);
> > + if (rq->page_pool)
> > + page = page_pool_dev_alloc_pages(rq->page_pool);
> > + else
> > + page = alloc_page(GFP_ATOMIC);
> > +
> >   if (!page)
> >   return NULL;
> >
> > @@ -926,21 +944,24 @@ static struct page *xdp_linearize_page(struct 
> > receive_queue *rq,
> >* is sending packet larger than the MTU.
> >*/
> >   if ((page_off + buflen + tailroom) > PAGE_SIZE) {
> > - put_page(p);
> > + virtnet_put_page(rq, p);
> >   goto err_buf;
> >   }
> >
> >   memcpy(page_address(page) + page_off,
> >  page_address(p) + off, buflen);
> >   page_off += buflen;
> > - put_page(p);
> > + virtnet_put_page(rq, 

Re: [PATCH net-next 1/5] virtio_net: Fix an unsafe reference to the page chain

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:29 PM Michael S. Tsirkin  wrote:
>
> On Fri, May 26, 2023 at 02:38:54PM +0800, Jason Wang wrote:
> > On Fri, May 26, 2023 at 1:46 PM Liang Chen  
> > wrote:
> > >
> > > "private" of buffer page is currently used for big mode to chain pages.
> > > But in mergeable mode, that offset of page could mean something else,
> > > e.g. when page_pool page is used instead. So excluding mergeable mode to
> > > avoid such a problem.
> >
> > If this issue happens only in the case of page_pool, it would be
> > better to squash it there.
> >
> > Thanks
>
>
> This is a tiny patch so I don't care. Generally it's ok
> to first rework code then change functionality.
> In this case what Jason says is right, especially because
> you then do not need to explain that the current code is ok.
>

Sure. it will be squashed into the page pool enablement patch. Thanks!

> > >
> > > Signed-off-by: Liang Chen 
> > > ---
> > >  drivers/net/virtio_net.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 5a7f7a76b920..c5dca0d92e64 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -497,7 +497,7 @@ static struct sk_buff *page_to_skb(struct 
> > > virtnet_info *vi,
> > > return NULL;
> > >
> > > page = (struct page *)page->private;
> > > -   if (page)
> > > +   if (!vi->mergeable_rx_bufs && page)
> > > give_pages(rq, page);
> > > goto ok;
> > > }
> > > --
> > > 2.31.1
> > >
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [PATCH net-next 1/5] virtio_net: Fix an unsafe reference to the page chain

2023-05-29 Thread Liang Chen
On Sun, May 28, 2023 at 2:16 PM Michael S. Tsirkin  wrote:
>
> On Fri, May 26, 2023 at 01:46:17PM +0800, Liang Chen wrote:
> > "private" of buffer page is currently used for big mode to chain pages.
> > But in mergeable mode, that offset of page could mean something else,
> > e.g. when page_pool page is used instead. So excluding mergeable mode to
> > avoid such a problem.
> >
> > Signed-off-by: Liang Chen 
>
> Ugh, the subject makes it look like the current code has a problem,
> but I don't think so, because I don't think anything besides
> big packets uses page->private.
>
> The reason the patch is needed is that follow-up patches
> use page_pool.
> Pls adjust the commit log and subject to make all this clear.
>
>
> > ---
> >  drivers/net/virtio_net.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index 5a7f7a76b920..c5dca0d92e64 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -497,7 +497,7 @@ static struct sk_buff *page_to_skb(struct virtnet_info 
> > *vi,
> >   return NULL;
> >
> >   page = (struct page *)page->private;
> > - if (page)
> > + if (!vi->mergeable_rx_bufs && page)
>
> To be safe let's limit to big packets too:
>
> if (!vi->mergeable_rx_bufs && vi->big_packets && page)
>
>
>

Sure, thanks!

> >   give_pages(rq, page);
> >   goto ok;
> >   }
> > --
> > 2.31.1
>
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Michael S. Tsirkin
On Mon, May 29, 2023 at 02:19:36PM +0800, Zhu, Lingshan wrote:
> 
> 
> On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:
> > On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan wrote:
> > > Current virtio-net only probes a device with VIRTIO_ID_NET == 1.
> > > 
> > > For a modern-transitional virtio-net device which has a transitional
> > > device id 0x1000 and acts as a modern device, the current virtio-pci
> > > modern driver will assign the sub-device-id to its mdev->id.device,
> > > which may not be 0x1; this sub-device-id is up to the vendor.
> > > 
> > > That means the virtio-net driver doesn't probe a modern-transitional
> > > virtio-net device with a sub-device-id other than 0x1, which is a bug.
> > No, the bug is in the device. Legacy linux drivers always looked at
> > sub device id (other OSes might differ). So it makes no sense
> > for a transitional device to have sub-device-id other than 0x1.
> > Don't have time to look at spec but I think you will find it there.
> That is true for a software-emulated transitional device,
> because there is only one "generation" of the device instance in the
> hypervisor, which allows it to ensure its sub-device-id is always 0x01,
> and it fits VIRTIO_ID_NET.
> 
> However, a vendor may produce multiple generations of transitional
> hardware. The sub-device-id is up to the vendor, and it is the
> only way for a driver to identify a device; the other IDs are all
> fixed as 0x1af4, 0x1000 and 0x8086 for Intel.

That is one of the issues with legacy virtio, yes.



> So the sub-device-id has to be unique and differ from others; it cannot
> always be 0x01.


If you are trying to build a device and want to create a safe way to
identify it without breaking legacy drivers, then
VIRTIO_PCI_CAP_VENDOR_CFG has been designed for things like this.
For example you can have:

struct virtio_pci_vndr_data {
u8 cap_vndr;/* Generic PCI field: PCI_CAP_ID_VNDR */
u8 cap_next;/* Generic PCI field: next ptr. */
u8 cap_len; /* Generic PCI field: capability length */ 
u8 cfg_type;/* Identifies the structure. */
u16 vendor_id;  /* Identifies the vendor-specific format. */
u16 device_generation;  /* Device generation */
};



> I propose this fix; all changes are for modern-transitional devices in the
> modern code path, not for legacy nor legacy-transitional.
> 
> Thanks

But what good is this fix? If you just want the modern driver to bind
and ignore legacy just create a modern device, you can play
with subsystem id and vendor to your heart's content then.

If you are using transitional then presumably you want
legacy drivers to bind; they will not bind if the subsystem device
id changes.


> > 
> > 
> > > Other types of devices also have similar issues, like virtio-blk.
> > > 
> > > I propose to fix this problem of modern-transitional devices
> > > with this solution, all in the modern code path:
> > > 1) assign the device id to mdev->id.device
> > > 2) add transitional device ids to the virtio-net (and others) probe table.
> > > 
> > > Comments are welcome!
> > > 
> > > Thanks!
> > > 
> > > Signed-off-by: Zhu Lingshan 
> > > ---
> > >   drivers/net/virtio_net.c   | 1 +
> > >   drivers/virtio/virtio_pci_modern_dev.c | 2 +-
> > >   2 files changed, 2 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > > index 56ca1d270304..6b45d8602a6b 100644
> > > --- a/drivers/net/virtio_net.c
> > > +++ b/drivers/net/virtio_net.c
> > > @@ -4250,6 +4250,7 @@ static __maybe_unused int virtnet_restore(struct 
> > > virtio_device *vdev)
> > >   static struct virtio_device_id id_table[] = {
> > >   { VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
> > > + { VIRTIO_TRANS_ID_NET, VIRTIO_DEV_ANY_ID },
> > >   { 0 },
> > >   };
> > > diff --git a/drivers/virtio/virtio_pci_modern_dev.c 
> > > b/drivers/virtio/virtio_pci_modern_dev.c
> > > index 869cb46bef96..80846e1195ce 100644
> > > --- a/drivers/virtio/virtio_pci_modern_dev.c
> > > +++ b/drivers/virtio/virtio_pci_modern_dev.c
> > > @@ -229,7 +229,7 @@ int vp_modern_probe(struct virtio_pci_modern_device 
> > > *mdev)
> > >   /* Transitional devices: use the PCI subsystem device 
> > > id as
> > >* virtio device id, same as legacy driver always did.
> > >*/
> > > - mdev->id.device = pci_dev->subsystem_device;
> > > + mdev->id.device = pci_dev->device;
> > >   } else {
> > >   /* Modern devices: simply use PCI device id, but start 
> > > from 0x1040. */
> > >   mdev->id.device = pci_dev->device - 0x1040;
> > > -- 
> > > 2.39.1

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [RFC] virtio-net: support modern-transtional devices

2023-05-29 Thread Zhu, Lingshan




On 5/28/2023 7:28 PM, Michael S. Tsirkin wrote:

On Sat, May 27, 2023 at 02:15:42AM +0800, Zhu Lingshan wrote:

Current virtio-net only probes a device with VIRTIO_ID_NET == 1.

For a modern-transitional virtio-net device which has a transitional
device id 0x1000 and acts as a modern device, the current virtio-pci
modern driver will assign the sub-device-id to its mdev->id.device,
which may not be 0x1; this sub-device-id is up to the vendor.

That means the virtio-net driver doesn't probe a modern-transitional
virtio-net device with a sub-device-id other than 0x1, which is a bug.

No, the bug is in the device. Legacy linux drivers always looked at
sub device id (other OSes might differ). So it makes no sense
for a transitional device to have sub-device-id other than 0x1.
Don't have time to look at spec but I think you will find it there.

That is true for a software-emulated transitional device,
because there is only one "generation" of the device instance in the
hypervisor, which allows it to ensure its sub-device-id is always 0x01,
and it fits VIRTIO_ID_NET.

However, a vendor may produce multiple generations of transitional
hardware. The sub-device-id is up to the vendor, and it is the
only way for a driver to identify a device; the other IDs are all
fixed as 0x1af4, 0x1000 and 0x8086 for Intel.

So the sub-device-id has to be unique and differ from others; it cannot
always be 0x01.


I propose this fix; all changes are for modern-transitional devices in the
modern code path, not for legacy nor legacy-transitional.

Thanks





Other types of devices also have similar issues, like virtio-blk.

I propose to fix this problem of modern-transitional devices
with this solution, all in the modern code path:
1) assign the device id to mdev->id.device
2) add transitional device ids to the virtio-net (and others) probe table.

Comments are welcome!

Thanks!

Signed-off-by: Zhu Lingshan 
---
  drivers/net/virtio_net.c   | 1 +
  drivers/virtio/virtio_pci_modern_dev.c | 2 +-
  2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 56ca1d270304..6b45d8602a6b 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -4250,6 +4250,7 @@ static __maybe_unused int virtnet_restore(struct 
virtio_device *vdev)
  
  static struct virtio_device_id id_table[] = {

{ VIRTIO_ID_NET, VIRTIO_DEV_ANY_ID },
+   { VIRTIO_TRANS_ID_NET, VIRTIO_DEV_ANY_ID },
{ 0 },
  };
  
diff --git a/drivers/virtio/virtio_pci_modern_dev.c b/drivers/virtio/virtio_pci_modern_dev.c

index 869cb46bef96..80846e1195ce 100644
--- a/drivers/virtio/virtio_pci_modern_dev.c
+++ b/drivers/virtio/virtio_pci_modern_dev.c
@@ -229,7 +229,7 @@ int vp_modern_probe(struct virtio_pci_modern_device *mdev)
/* Transitional devices: use the PCI subsystem device id as
 * virtio device id, same as legacy driver always did.
 */
-   mdev->id.device = pci_dev->subsystem_device;
+   mdev->id.device = pci_dev->device;
} else {
/* Modern devices: simply use PCI device id, but start from 
0x1040. */
mdev->id.device = pci_dev->device - 0x1040;
--
2.39.1


___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization