Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem

2022-06-27 Thread Michael S. Tsirkin
On Tue, Jun 28, 2022 at 11:54:27AM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 6:01 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote:
> > > On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote:
> > > > > We count pinned_vm as follows in vhost-vDPA
> > > > >
> > > > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > > > >  ret = -ENOMEM;
> > > > >  goto unlock;
> > > > > }
> > > > > This means if we have two vDPA devices for the same VM, the pages
> > > > > would be counted twice. So we add a tree to save the pages that
> > > > > were counted, and we will not count them again.
> > > > >
> > > > > Add vdpa_mem_tree to save the memory that is already counted.
> > > > > Use an hlist to save the root for vdpa_mem_tree.
> > > > >
> > > > > Signed-off-by: Cindy Lu 
> > > > > ---
> > > > >  drivers/vhost/vhost.c | 63 +++
> > > > >  drivers/vhost/vhost.h |  1 +
> > > > >  2 files changed, 64 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index 40097826cff0..4ca8b1ed944b 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -32,6 +32,8 @@
> > > > >  #include 
> > > > >
> > > > >  #include "vhost.h"
> > > > > +#include <linux/hashtable.h>
> > > > > +#include <linux/jhash.h>
> > > > >
> > > > >  static ushort max_mem_regions = 64;
> > > > >  module_param(max_mem_regions, ushort, 0444);
> > > > > @@ -49,6 +51,14 @@ enum {
> > > > >  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
> > > > >  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
> > > > >
> > > > > +struct vhost_vdpa_rbtree_node {
> > > > > + struct hlist_node node;
> > > > > + struct rb_root_cached vdpa_mem_tree;
> > > > > + struct mm_struct *mm_using;
> > > > > +};
> > > > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8);
> > > > > +int vhost_vdpa_rbtree_hlist_status;
> > > > > +
> > > > >  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
> > > > >  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
> > > > >  {
> > > >
> > > > Are you trying to save some per-mm information here?
> > > > Can't we just add it to mm_struct?
> > > >
> > > yes, this is per-mm information, but I have checked with Jason before;
> > > it seems it may be difficult to change mm_struct upstream,
> > > so I added an hlist instead
> > > Thanks
> > > Cindy
> >
> > Difficult how?
> 
> It is probably only useful for vDPA. Though it could be used by VFIO
> as well, VFIO does its pinning/accounting at the container level and
> that has been there for years.

Yes it's been there, I'm not sure this means it's perfect.
Also, rdma guys might be interested too I guess?

> vDPA has an implicit "container", the
> mm_struct, but the accounting is done per device right now.
> 
> In the future, when vDPA switches to iommufd, it can be then solved at
> iommufd level.

So is it even worth fixing now?

> And if we do this in mm, it will bring extra overhead.
> 
> Thanks

Pointer per mm, not too bad ...

> > You get more scrutiny if you try, for sure,
> > and you need to generalize it enough that it looks
> > useful outside the driver. But I guess that's good?
> >
> > > >
> > > >
> > > > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev 
> > > > > *dev)
> > > > >   dev->mm = NULL;
> > > > >  }
> > > > >
> > > > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm)
> > > > > +{
> > > > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > > + struct rb_root_cached *vdpa_tree;
> > > > > + u32 key;
> > > > > +
> > > > > + /* No hashed table, init one */
> > > > > + if (vhost_vdpa_rbtree_hlist_status == 0) {
> > > > > + hash_init(vhost_vdpa_rbtree_hlist);
> > > > > + vhost_vdpa_rbtree_hlist_status = 1;
> > > > > + }
> > > > > +
> > > > > + key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, 
> > > > > node,
> > > > > +key) {
> > > > > + if (rbtree_root->mm_using == mm)
> > > > > + return &(rbtree_root->vdpa_mem_tree);
> > > > > + }
> > > > > + rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL);
> > > > > + if (!rbtree_root)
> > > > > + return NULL;
> > > > > + rbtree_root->mm_using = mm;
> > > > > + rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED;
> > > > > + hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key);
> > > > > + vdpa_tree = &(rbtree_root->vdpa_mem_tree);
> > > > > + return vdpa_tree;
> > > > > +}
> > > > > +
> > > > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm)
> > > > > +{
> > > > > 

Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Michael S. Tsirkin
On Tue, Jun 28, 2022 at 11:50:37AM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 7:53 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote:
> > > On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote:
> > > > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > > > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin 
> > > > > > >  wrote:
> > > > > > > >
> > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which 
> > > > > > > > > comes from
> > > > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > > > > > > >
> > > > > > > > > To avoid breaking the uABI, add a new struct
> > > > > > > > > virtio_pci_common_cfg_notify.
> > > > > > > >
> > > > > > > > What exactly is meant by not breaking uABI?
> > > > > > > > Users are supposed to be prepared for struct size to change ... 
> > > > > > > > no?
> > > > > > >
> > > > > > > Not sure, any doc for this?
> > > > > > >
> > > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > Well we have this:
> > > > > >
> > > > > > The drivers SHOULD only map part of configuration structure
> > > > > > large enough for device operation.  The drivers MUST handle
> > > > > > an unexpectedly large \field{length}, but MAY check that 
> > > > > > \field{length}
> > > > > > is large enough for device operation.
> > > > >
> > > > > Yes, but that's the device/driver interface. What's done here is the
> > > > > userspace/kernel.
> > > > >
> > > > > Userspace may break if it uses e.g. sizeof(struct
> > > > > virtio_pci_common_cfg)?
> > > > >
> > > > > Thanks
> > > >
> > > > Hmm I guess there's risk... but then how are we going to maintain this
> > > > going forward?  Add a new struct on any change?
> > >
> > > This is the way we have used it for the past 5 or more years. I don't
> > > see why this must be handled in the vq reset feature.
> > >
> > > >Can we at least
> > > > prevent this going forward somehow?
> > >
> > > Like have some padding?
> > >
> > > Thanks
> >
> > Maybe - this is what QEMU does ...
> 
> Do you want this to be addressed in this series (it's already very huge 
> anyhow)?
> 
> Thanks

Let's come up with a solution at least. QEMU does not seem to need the struct. 
Let's just put
it in virtio_pci_modern.h for now then?

> >
> > > >
> > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > Since I want to add queue_reset after queue_notify_data, I 
> > > > > > > > > submitted
> > > > > > > > > this patch first.
> > > > > > > > >
> > > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > > > Acked-by: Jason Wang 
> > > > > > > > > ---
> > > > > > > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > > > > > > >  1 file changed, 7 insertions(+)
> > > > > > > > >
> > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > > > > > > b/include/uapi/linux/virtio_pci.h
> > > > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > > > > > > --- a/include/uapi/linux/virtio_pci.h
> > > > > > > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > > > > > > >   __le32 queue_used_hi;   /* read-write */
> > > > > > > > >  };
> > > > > > > > >
> > > > > > > > > +struct virtio_pci_common_cfg_notify {
> > > > > > > > > + struct virtio_pci_common_cfg cfg;
> > > > > > > > > +
> > > > > > > > > + __le16 queue_notify_data;   /* read-write */
> > > > > > > > > + __le16 padding;
> > > > > > > > > +};
> > > > > > > > > +
> > > > > > > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > > > > > > >  struct virtio_pci_cfg_cap {
> > > > > > > > >   struct virtio_pci_cap cap;
> > > > > > > > > --
> > > > > > > > > 2.31.0
> > > > > > > >
> > > > > >
> > > >
> >

___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH V3] virtio: disable notification hardening by default

2022-06-27 Thread Michael S. Tsirkin
On Tue, Jun 28, 2022 at 11:49:12AM +0800, Jason Wang wrote:
> > Heh. Yea sure. But things work fine for people. What is the chance
> > your review found and fixed all driver bugs?
> 
> I don't/can't audit all bugs, but the races between open/close and
> ready/reset look to me like a good chance to fix them all; if you
> think differently, let me know
> 
> > After two attempts
> > I don't feel like hoping audit will fix all bugs.
> 
> I've started the auditing and have 15+ patches in the queue (only
> covering bluetooth, console, pmem, virtio-net and caif). Spotting the
> issues is not hard, but the testing would take at least the time of
> one release to finalize, I guess.

Absolutely. So I am looking for a way to implement hardening that does
not break existing drivers.


> >
> >
> > > >
> > > > The reason config was kind of easy is that config interrupt is rarely
> > > > vital for device function so arbitrarily deferring that does not lead to
> > > > deadlocks - what you are trying to do with VQ interrupts is
> > > > fundamentally different. Things are especially bad if we just drop
> > > > an interrupt but deferring can lead to problems too.
> > >
> > > I'm not sure I see the difference; disable_irq() also delays
> > > interrupt processing until enable_irq().
> >
> >
> > Absolutely. I am not at all sure disable_irq fixes all problems.
> >
> > > >
> > > > Consider as an example
> > > > virtio-net: fix race between ndo_open() and virtio_device_ready()
> > > > if you just defer vq interrupts you get deadlocks.
> > > >
> > > >
> > >
> > > I don't see a deadlock here, maybe you can show more detail on this?
> >
> > What I mean is this: if we revert the above commit, things still
> > work (out of spec, but still). If we revert and defer interrupts until
> > device ready then ndo_open that triggers before device ready deadlocks.
> 
> Ok, I guess you meant on a hypervisor that is strictly written to the spec.

I mean on hypervisor that starts processing queues after getting a kick
even without DRIVER_OK.

> >
> >
> > > >
> > > > So, thinking about all this, how about a simple per vq flag meaning
> > > > "this vq was kicked since reset"?
> > >
> > > And ignore the notification if vq is not kicked? It sounds like the
> > > callback needs to be synchronized with the kick.
> >
> > Note we only need to synchronize it when it changes, which is
> > only during initialization and reset.
> 
> Yes.
> 
> >
> >
> > > >
> > > > If driver does not kick then it's not ready to get callbacks, right?
> > > >
> > > > Sounds quite clean, but we need to think through memory ordering
> > > > concerns - I guess it's only when we change the value so
> > > > if (!vq->kicked) {
> > > > vq->kicked = true;
> > > > mb();
> > > > }
> > > >
> > > > will do the trick, right?
> > >
> > > There's not much difference from the existing approach:
> > >
> > > 1) your proposal implicitly makes callbacks ready in virtqueue_kick()
> > > 2) my proposal explicitly makes callbacks ready via virtio_device_ready()
> > >
> > > Both require careful auditing of all the existing drivers to make sure
> > > no kick before DRIVER_OK.
> >
> > Jason, kick before DRIVER_OK is out of spec, sure. But it is unrelated
> > to hardening
> 
> Yes but with your proposal, it seems to couple kick with DRIVER_OK somehow.

I don't see how - my proposal ignores DRIVER_OK issues.

> > and in the absence of config interrupts it is generally easily
> > fixed just by sticking virtio_device_ready() early in initialization.
> 
> So if the kick is done before the subsystem registration, there's
> still a window in the middle (assuming we stick virtio_device_ready()
> early):
> 
> virtio_device_ready()
> virtqueue_kick()
> /* the window */
> subsystem_registration()

Absolutely, however, I do not think we really have many such drivers
since this has been known as a wrong thing to do since the beginning.
Want to try to find any? I couldn't ... except maybe bluetooth,
but that's just the maintainer nacking fixes, saying he'll fix it
his way ...

> And during remove(), we get another window:
> 
> subsystem_unregistration()
> /* the window */
> virtio_device_reset()

Same here.

> if we do reset before, we may end up with other issues like deadlock.
> 
> So I think it's probably very hard to fix issues only at the virtio
> core since we need subsystem specific knowledge to make sure
> everything is synchronized.

How many drivers do you see with the issue above?
As compared to yours, which has 16 patches just in your queue.

> > With the current approach one has to *also* not do virtio_device_ready
> > too early - and it's really tricky.
> 
> Not sure how much we differ here; during probe the driver can just
> place the virtio_device_ready() after the kick.

Not so easy. For example, consider virtio-net without your
locking change. The kick is part of a command vq operation,
which does not directly have anything to do with probe.
Same for many 

Re: [PATCH] Bluetooth: virtio_bt: fix device removal

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 13, 2022 at 02:58:59AM -0400, Michael S. Tsirkin wrote:
> On Fri, Jan 14, 2022 at 03:12:47PM -0500, Michael S. Tsirkin wrote:
> > On Thu, Dec 16, 2021 at 08:58:31PM +0100, Marcel Holtmann wrote:
> > > Hi Michael,
> > > 
> > > > Device removal is clearly out of virtio spec: it attempts to remove
> > > > unused buffers from a VQ before invoking device reset. To fix, make
> > > > open/close NOPs and do all cleanup/setup in probe/remove.
> > >  
> > >  so the virtbt_{open,close} as NOPs is not really what a driver is
> > >  supposed to be doing. These are transport enable/disable callbacks from the BT
> > >  Core towards the driver. It maps to a device being enabled/disabled 
> > >  by
> > >  something like bluetoothd for example. So if disabled, I expect that 
> > >  no
> > >  resources/queues are in use.
> > >  
> > >  Maybe I misunderstand the virtio spec in that regard, but I would 
> > >  like
> > >  to keep this fundamental concept of a Bluetooth driver. It does work
> > >  with all other transports like USB, SDIO, UART etc.
> > >  
> > > > The cost here is a single skb wasted on an unused bt device - which
> > > > seems modest.
> > >  
> > >  There should be no buffer used if the device is powered off. We also 
> > >  don’t
> > >  have any USB URBs in-flight if the transport is not active.
> > >  
> > > > NB: with this fix in place driver still suffers from a race 
> > > > condition if
> > > > an interrupt triggers while device is being reset. Work on a fix for
> > > > that issue is in progress.
> > >  
> > >  In the virtbt_close() callback we should deactivate all interrupts.
> > >  
> > >  Regards
> > >  
> > >  Marcel
> > > >>> 
> > > >>> So Marcel, do I read it right that you are working on a fix
> > > >>> and I can drop this patch for now?
> > > >> 
> > > >> ping
> > > > 
> > > > 
> > > > If I don't hear otherwise I'll queue my version - it might not
> > > > be ideal but it at least does not violate the spec.
> > > > We can work on not allocating/freeing buffers later
> > > > as appropriate.
> > > 
> > > I have a patch, but it is not fully tested yet.
> > > 
> > > Regards
> > > 
> > > Marcel
> > 
> > ping
> > 
> > it's been a month ...
> > 
> > I'm working on cleaning up module/device removal in virtio and bt
> > is kind of sticking out.
> 
> I am inclined to make this driver depend on BROKEN for now.
> Any objections?

OK patch incoming.

> 
> > -- 
> > MST


Re: [PATCH 2/3] vdpa_sim_blk: limit the number of request handled per batch

2022-06-27 Thread Jason Wang
On Thu, Jun 23, 2022 at 4:58 PM Stefano Garzarella  wrote:
>
> On Thu, Jun 23, 2022 at 11:50:22AM +0800, Jason Wang wrote:
> >On Wed, Jun 22, 2022 at 12:09 AM Stefano Garzarella  
> >wrote:
> >>
> >> Limit the number of requests (4 per queue as for vdpa_sim_net) handled
> >> in a batch to prevent the worker from using the CPU for too long.
> >>
> >> Suggested-by: Eugenio Pérez 
> >> Signed-off-by: Stefano Garzarella 
> >> ---
> >>  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 15 ++-
> >>  1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c 
> >> b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> index a83a5c76f620..ac86478845b6 100644
> >> --- a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> @@ -197,6 +197,7 @@ static bool vdpasim_blk_handle_req(struct vdpasim 
> >> *vdpasim,
> >>  static void vdpasim_blk_work(struct work_struct *work)
> >>  {
> >> struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> >> +   bool reschedule = false;
> >> int i;
> >>
> >> spin_lock(&vdpasim->lock);
> >> @@ -206,11 +207,15 @@ static void vdpasim_blk_work(struct work_struct 
> >> *work)
> >>
> >> for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> >> struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> >> +   bool vq_work = true;
> >> +   int reqs = 0;
> >>
> >> if (!vq->ready)
> >> continue;
> >>
> >> -   while (vdpasim_blk_handle_req(vdpasim, vq)) {
> >> +   while (vq_work) {
> >> +   vq_work = vdpasim_blk_handle_req(vdpasim, vq);
> >> +
> >
> >Is it better to check and exit the loop early here?
>
> Maybe, but I'm not sure.
>
> In vdpa_sim_net we call vringh_complete_iotlb() and send notification
> also in the error path,

Looks not?

read = vringh_iov_pull_iotlb(&cvq->vring, &cvq->in_iov, &ctrl,
 sizeof(ctrl));
if (read != sizeof(ctrl))
break;

We break the loop.

Thanks

> so I thought it was better to send a notification
> also when vdpasim_blk_handle_req() returns false, since we will update
> the used.idx.
>
> However, I don't think it's a common path, so if you think it's better
> to exit the loop early, I can do it.
>
> Thanks,
> Stefano
>


Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 6:01 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote:
> > On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin  wrote:
> > >
> > > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote:
> > > > We count pinned_vm as follows in vhost-vDPA
> > > >
> > > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > > >  ret = -ENOMEM;
> > > >  goto unlock;
> > > > }
> > > > This means if we have two vDPA devices for the same VM, the pages
> > > > would be counted twice. So we add a tree to save the pages that
> > > > were counted, and we will not count them again.
> > > >
> > > > Add vdpa_mem_tree to save the memory that is already counted.
> > > > Use an hlist to save the root for vdpa_mem_tree.
> > > >
> > > > Signed-off-by: Cindy Lu 
> > > > ---
> > > >  drivers/vhost/vhost.c | 63 +++
> > > >  drivers/vhost/vhost.h |  1 +
> > > >  2 files changed, 64 insertions(+)
> > > >
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 40097826cff0..4ca8b1ed944b 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -32,6 +32,8 @@
> > > >  #include 
> > > >
> > > >  #include "vhost.h"
> > > > +#include <linux/hashtable.h>
> > > > +#include <linux/jhash.h>
> > > >
> > > >  static ushort max_mem_regions = 64;
> > > >  module_param(max_mem_regions, ushort, 0444);
> > > > @@ -49,6 +51,14 @@ enum {
> > > > >  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
> > > > >  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
> > > >
> > > > +struct vhost_vdpa_rbtree_node {
> > > > + struct hlist_node node;
> > > > + struct rb_root_cached vdpa_mem_tree;
> > > > + struct mm_struct *mm_using;
> > > > +};
> > > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8);
> > > > +int vhost_vdpa_rbtree_hlist_status;
> > > > +
> > > >  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
> > > >  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
> > > >  {
> > >
> > > Are you trying to save some per-mm information here?
> > > Can't we just add it to mm_struct?
> > >
> > yes, this is per-mm information, but I have checked with Jason before;
> > it seems it may be difficult to change mm_struct upstream,
> > so I added an hlist instead
> > Thanks
> > Cindy
>
> Difficult how?

It is probably only useful for vDPA. Though it could be used by VFIO
as well, VFIO does its pinning/accounting at the container level and
that has been there for years. vDPA has an implicit "container", the
mm_struct, but the accounting is done per device right now.

In the future, when vDPA switches to iommufd, it can be then solved at
iommufd level.

And if we do this in mm, it will bring extra overhead.

Thanks

> You get more scrutiny if you try, for sure,
> and you need to generalize it enough that it looks
> useful outside the driver. But I guess that's good?
>
> > >
> > >
> > > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> > > >   dev->mm = NULL;
> > > >  }
> > > >
> > > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm)
> > > > +{
> > > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > + struct rb_root_cached *vdpa_tree;
> > > > + u32 key;
> > > > +
> > > > + /* No hashed table, init one */
> > > > + if (vhost_vdpa_rbtree_hlist_status == 0) {
> > > > + hash_init(vhost_vdpa_rbtree_hlist);
> > > > + vhost_vdpa_rbtree_hlist_status = 1;
> > > > + }
> > > > +
> > > > + key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > > +key) {
> > > > + if (rbtree_root->mm_using == mm)
> > > > + return &(rbtree_root->vdpa_mem_tree);
> > > > + }
> > > > + rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL);
> > > > + if (!rbtree_root)
> > > > + return NULL;
> > > > + rbtree_root->mm_using = mm;
> > > > + rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED;
> > > > + hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key);
> > > > + vdpa_tree = &(rbtree_root->vdpa_mem_tree);
> > > > + return vdpa_tree;
> > > > +}
> > > > +
> > > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm)
> > > > +{
> > > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > + u32 key;
> > > > +
> > > > + key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > +
> > > > + /* No hashed table, init one */
> > > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > > +key) {
> > > > + if (rbtree_root->mm_using == mm) {
> > > > + hash_del(&rbtree_root->node);
> > > > + kfree(rbtree_root);
> > > > + }
> > 

Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 7:53 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote:
> > On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote:
> > > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which 
> > > > > > > > comes from
> > > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > > > > > >
> > > > > > > > > To avoid breaking the uABI, add a new struct
> > > > > > > > > virtio_pci_common_cfg_notify.
> > > > > > >
> > > > > > > What exactly is meant by not breaking uABI?
> > > > > > > Users are supposed to be prepared for struct size to change ... 
> > > > > > > no?
> > > > > >
> > > > > > Not sure, any doc for this?
> > > > > >
> > > > > > Thanks
> > > > >
> > > > >
> > > > > Well we have this:
> > > > >
> > > > > The drivers SHOULD only map part of configuration structure
> > > > > large enough for device operation.  The drivers MUST handle
> > > > > an unexpectedly large \field{length}, but MAY check that 
> > > > > \field{length}
> > > > > is large enough for device operation.
> > > >
> > > > Yes, but that's the device/driver interface. What's done here is the
> > > > userspace/kernel.
> > > >
> > > > Userspace may break if it uses e.g. sizeof(struct virtio_pci_common_cfg)?
> > > >
> > > > Thanks
> > >
> > > Hmm I guess there's risk... but then how are we going to maintain this
> > > going forward?  Add a new struct on any change?
> >
> > This is the way we have used it for the past 5 or more years. I don't
> > see why this must be handled in the vq reset feature.
> >
> > >Can we at least
> > > prevent this going forward somehow?
> >
> > Like have some padding?
> >
> > Thanks
>
> Maybe - this is what QEMU does ...

Do you want this to be addressed in this series (it's already very huge anyhow)?

Thanks

>
> > >
> > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Since I want to add queue_reset after queue_notify_data, I 
> > > > > > > > submitted
> > > > > > > > this patch first.
> > > > > > > >
> > > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > > Acked-by: Jason Wang 
> > > > > > > > ---
> > > > > > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > > > > > >  1 file changed, 7 insertions(+)
> > > > > > > >
> > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > > > > > b/include/uapi/linux/virtio_pci.h
> > > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > > > > > --- a/include/uapi/linux/virtio_pci.h
> > > > > > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > > > > > >   __le32 queue_used_hi;   /* read-write */
> > > > > > > >  };
> > > > > > > >
> > > > > > > > +struct virtio_pci_common_cfg_notify {
> > > > > > > > + struct virtio_pci_common_cfg cfg;
> > > > > > > > +
> > > > > > > > + __le16 queue_notify_data;   /* read-write */
> > > > > > > > + __le16 padding;
> > > > > > > > +};
> > > > > > > > +
> > > > > > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > > > > > >  struct virtio_pci_cfg_cap {
> > > > > > > >   struct virtio_pci_cap cap;
> > > > > > > > --
> > > > > > > > 2.31.0
> > > > > > >
> > > > >
> > >
>



Re: [PATCH V3] virtio: disable notification hardening by default

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 5:52 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 04:11:18PM +0800, Jason Wang wrote:
> > On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote:
> > > > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote:
> > > > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin 
> > > > > >  wrote:
> > > > > > >
> > > > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote:
> > > > > > > > We tried to harden virtio device notifications in 8b4ec69d7e09 
> > > > > > > > ("virtio:
> > > > > > > > harden vring IRQ"). It works with the assumption that the 
> > > > > > > > driver or
> > > > > > > > core can properly call virtio_device_ready() at the right
> > > > > > > > place. Unfortunately, this seems not to be true and uncovers 
> > > > > > > > various
> > > > > > > > bugs in the existing drivers, mainly the issue of using
> > > > > > > > virtio_device_ready() incorrectly.
> > > > > > > >
> > > > > > > > So let's add a Kconfig option and disable it by default. This 
> > > > > > > > gives
> > > > > > > > us a breather to fix the drivers, and then we can consider 
> > > > > > > > enabling it
> > > > > > > > by default.
> > > > > > > >
> > > > > > > > Signed-off-by: Jason Wang 
> > > > > > >
> > > > > > >
> > > > > > > OK I will queue, but I think the problem is fundamental.
> > > > > >
> > > > > > If I understand correctly, you want some core IRQ work?
> > > > >
> > > > > Yes.
> > > > >
> > > > > > As discussed
> > > > > > before, it doesn't solve all the problems, we still need to do per
> > > > > > driver audit.
> > > > > >
> > > > > > Thanks
> > > > >
> > > > > Maybe, but we don't need to tie things to device_ready then.
> > > > > We can do
> > > > >
> > > > > - disable irqs
> > > > > - device ready
> > > > > - setup everything
> > > > > - enable irqs
> > > > >
> > > > >
> > > > > and this works for most things, the only issue is
> > > > > this deadlocks if "setup everything" waits for interrupts.
> > > > >
> > > > >
> > > > > With the current approach there's really no good time:
> > > > > 1.- setup everything
> > > > > - device ready
> > > > >
> > > > > can cause kicks before device is ready
> > > > >
> > > > > 2.- device ready
> > > > > - setup everything
> > > > >
> > > > > can cause callbacks before setup.
> > > > >
> > > > > So I prefer the 1. and fix the hardening in the core.
> > > >
> > > > So my question is:
> > > >
> > > > 1) do similar hardening like config interrupt
> > > > or
> > > > 2) per transport notification work (e.g for PCI core IRQ work)
> > > >
> > > > 1) seems easier and universal, but we pay a little overhead, which could
> > > > be eliminated by the config option.
> > >
> > > I doubt 1 is easy and I am not even sure core IRQ changes will help.
> >
> > Core IRQ won't help in 1), the changes are limited in the virtio.
> >
> > > My concern with adding overhead is that I'm not sure these are not just
> > > wasted CPU cycles.  We spent a bunch of time on irq hardening and so far
> > > we are still at the "all drivers need to be fixed" stage.
> >
> > It's not the fault of the hardening but of the drivers. The hardening just
> > exposes those bugs.
>
> Heh. Yea sure. But things work fine for people. What is the chance
> your review found and fixed all driver bugs?

I don't/can't audit all bugs, but the races between open/close and
ready/reset look to me like a good chance to fix them all; if you
think differently, let me know

> After two attempts
> I don't feel like hoping audit will fix all bugs.

I've started the auditing and have 15+ patches in the queue (only
covering bluetooth, console, pmem, virtio-net and caif). Spotting the
issues is not hard, but the testing would take at least the time of
one release to finalize, I guess.

>
>
> > >
> > > The reason config was kind of easy is that config interrupt is rarely
> > > vital for device function so arbitrarily deferring that does not lead to
> > > deadlocks - what you are trying to do with VQ interrupts is
> > > fundamentally different. Things are especially bad if we just drop
> > > an interrupt but deferring can lead to problems too.
> >
> > I'm not sure I see the difference; disable_irq() also delays
> > interrupt processing until enable_irq().
>
>
> Absolutely. I am not at all sure disable_irq fixes all problems.
>
> > >
> > > Consider as an example
> > > virtio-net: fix race between ndo_open() and virtio_device_ready()
> > > if you just defer vq interrupts you get deadlocks.
> > >
> > >
> >
> > I don't see a deadlock here, maybe you can show more detail on this?
>
> What I mean is this: if we revert the above commit, things still
> work (out of spec, but still). If we revert and defer interrupts until
> device ready, then an ndo_open that triggers before device ready deadlocks.
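The ordering problem under discussion can be sketched in a toy model (illustrative Python, not kernel code): vq notifications that arrive before DRIVER_OK are deferred and replayed at device_ready(), which is exactly what makes an open path that blocks waiting for such a notification before signalling ready hang forever.

```python
class ToyVirtioDevice:
    """Toy model of the 'defer vq notifications until DRIVER_OK' hardening.

    Callbacks that arrive before device_ready() are queued instead of
    dispatched; device_ready() replays them.  A driver whose open path
    waited for one of those callbacks *before* calling device_ready()
    would wait forever -- the deadlock discussed above.
    """
    def __init__(self, handler):
        self.handler = handler
        self.ready = False
        self.pending = []

    def notify(self, vq):
        if not self.ready:
            self.pending.append(vq)   # hardened path: defer, don't drop
        else:
            self.handler(vq)

    def device_ready(self):
        self.ready = True
        for vq in self.pending:       # replay everything deferred so far
            self.handler(vq)
        self.pending.clear()

seen = []
dev = ToyVirtioDevice(seen.append)
dev.notify(0)          # spurious kick before DRIVER_OK: deferred
assert seen == []
dev.device_ready()     # replayed at DRIVER_OK
assert seen == [0]
```

Dropping instead of deferring would lose the event entirely; deferring keeps it but, as Michael notes, only moves the problem if the driver depends on seeing it earlier.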

Ok, I guess you meant on a hypervisor that is 

Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-27 Thread Jason Gunthorpe
On Mon, Jun 27, 2022 at 08:27:37PM +0200, Daniel Borkmann wrote:
> On 6/27/22 8:04 PM, Gustavo A. R. Silva wrote:
> > There is a regular need in the kernel to provide a way to declare
> > having a dynamically sized set of trailing elements in a structure.
> > Kernel code should always use “flexible array members”[1] for these
> > cases. The older style of one-element or zero-length arrays should
> > no longer be used[2].
> > 
> > This code was transformed with the help of Coccinelle:
> > (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file 
> > script.cocci --include-headers --dir . > output.patch)
> > 
> > @@
> > identifier S, member, array;
> > type T1, T2;
> > @@
> > 
> > struct S {
> >...
> >T1 member;
> >T2 array[
> > - 0
> >];
> > };
> > 
> > -fstrict-flex-arrays=3 is coming and we need to land these changes
> > to prevent issues like these in the near future:
> > 
> > ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; 
> > destination buffer has size 0,
> > but the source string has length 2 (including NUL byte) [-Wfortify-source]
> > strcpy(de3->name, ".");
> > ^
> > 
> > Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If
> > this breaks anything, we can use a union with a new member name.
> > 
> > [1] https://en.wikipedia.org/wiki/Flexible_array_member
> > [2] 
> > https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays
> > 
> > Link: https://github.com/KSPP/linux/issues/78
> > Build-tested-by: 
> > https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/
> > Signed-off-by: Gustavo A. R. Silva 
> > ---
> > Hi all!
> > 
> > JFYI: I'm adding this to my -next tree. :)
> 
> Fyi, this breaks BPF CI:
> 
> https://github.com/kernel-patches/bpf/runs/7078719372?check_suite_focus=true
> 
>   [...]
>   progs/map_ptr_kern.c:314:26: error: field 'trie_key' with variable sized 
> type 'struct bpf_lpm_trie_key' not at the end of a struct or class is a GNU 
> extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
>   struct bpf_lpm_trie_key trie_key;
>   ^

This will break the rdma-core userspace as well, with a similar
error:

/usr/bin/clang-13 -DVERBS_DEBUG -Dibverbs_EXPORTS -Iinclude 
-I/usr/include/libnl3 -I/usr/include/drm -g -O2 -fdebug-prefix-map=/__w/1/s=. 
-fstack-protector-strong -Wformat -Werror=format-security -Wdate-time 
-D_FORTIFY_SOURCE=2 -Wmissing-prototypes -Wmissing-declarations -Wwrite-strings 
-Wformat=2 -Wcast-function-type -Wformat-nonliteral -Wdate-time 
-Wnested-externs -Wshadow -Wstrict-prototypes -Wold-style-definition -Werror 
-Wredundant-decls -g -fPIC   -std=gnu11 -MD -MT 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o -MF 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o.d -o 
libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o   -c ../libibverbs/cmd_flow.c
In file included from ../libibverbs/cmd_flow.c:33:
In file included from include/infiniband/cmd_write.h:36:
In file included from include/infiniband/cmd_ioctl.h:41:
In file included from include/infiniband/verbs.h:48:
In file included from include/infiniband/verbs_api.h:66:
In file included from include/infiniband/ib_user_ioctl_verbs.h:38:
include/rdma/ib_user_verbs.h:436:34: error: field 'base' with variable sized 
type 'struct ib_uverbs_create_cq_resp' not at the end of a struct or class is a 
GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
struct ib_uverbs_create_cq_resp base;
^
include/rdma/ib_user_verbs.h:644:34: error: field 'base' with variable sized 
type 'struct ib_uverbs_create_qp_resp' not at the end of a struct or class is a 
GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
struct ib_uverbs_create_qp_resp base;

Which is why I gave up trying to change these...

Though maybe we could just switch off -Wgnu-variable-sized-type-not-at-end
during configuration?

Jason
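As an aside, the [0] → [] rewrite itself is mechanical; a rough, regex-based approximation of the quoted Coccinelle rule (illustrative Python only, far less precise than spatch, which also checks that the member is the last one in the struct) looks like this:

```python
import re

def flex_arrayify(struct_src: str) -> str:
    """Rewrite a trailing zero-length array member `T array[0];` into the
    flexible-array form `T array[];`.

    This toy version only handles the simple case where the member sits
    immediately before the closing brace of the struct.
    """
    return re.sub(r'\[\s*0\s*\]\s*;(\s*})', r'[];\1', struct_src)

before = "struct foo {\n\tint len;\n\tchar data[0];\n};"
after = flex_arrayify(before)
assert after == "struct foo {\n\tint len;\n\tchar data[];\n};"
```

The breakage above comes from the other direction: once the member is a true flexible array, the struct becomes variable-sized, and embedding it anywhere but at the end of another struct (as BPF's trie_key and rdma-core's base fields do) is only a GNU extension.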
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

RE: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-27 Thread Dan Williams
Gustavo A. R. Silva wrote:
> There is a regular need in the kernel to provide a way to declare
> having a dynamically sized set of trailing elements in a structure.
> Kernel code should always use “flexible array members”[1] for these
> cases. The older style of one-element or zero-length arrays should
> no longer be used[2].
> 
> This code was transformed with the help of Coccinelle:
> (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file 
> script.cocci --include-headers --dir . > output.patch)
> 
> @@
> identifier S, member, array;
> type T1, T2;
> @@
> 
> struct S {
>   ...
>   T1 member;
>   T2 array[
> - 0
>   ];
> };
> 
> -fstrict-flex-arrays=3 is coming and we need to land these changes
> to prevent issues like these in the near future:
> 
> ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination 
> buffer has size 0,
> but the source string has length 2 (including NUL byte) [-Wfortify-source]
>   strcpy(de3->name, ".");
>   ^
> 
> Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If
> this breaks anything, we can use a union with a new member name.
> 
> [1] https://en.wikipedia.org/wiki/Flexible_array_member
> [2] 
> https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays
> 
> Link: https://github.com/KSPP/linux/issues/78
> Build-tested-by: 
> https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/
> Signed-off-by: Gustavo A. R. Silva 
> ---
> Hi all!
> 
> JFYI: I'm adding this to my -next tree. :)
> 
[..]
>  include/uapi/linux/ndctl.h| 10 +--

For ndctl.h

Acked-by: Dan Williams 

Re: [PATCH 1/1] Create debugfs file with virtio balloon usage information

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 07:19:09PM +, Alexander Atanasov wrote:
> Allow the guest to know how much it is ballooned by the host.
> It is useful when debugging out of memory conditions.
> 
> When the host gets memory back from the guest it is accounted
> as used memory in the guest, but the guest has no way to know
> how much it is actually ballooned.
> 
> No pretty printing and fixed as per coding style.
> 
> Signed-off-by: Alexander Atanasov 
> ---
>  drivers/virtio/virtio_balloon.c | 71 +
>  1 file changed, 71 insertions(+)
> 
> - Leave pretty print to userspace
> - Fixed coding style
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..1bb6a6d0e37b 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -731,6 +732,71 @@ static void report_free_page_func(struct work_struct 
> *work)
>   }
>  }
>  
> +/*
> + * DEBUGFS Interface
> + */
> +#ifdef CONFIG_DEBUG_FS
> +
> +/**
> + * virtio_balloon_debug_show - shows statistics of balloon operations.
> + * @f: pointer to the  seq_file.
> + * @offset: ignored.
> + *
> + * Provides the statistics that can be accessed in virtio-balloon in the 
> debugfs.
> + *
> + * Return: zero on success or an error code.
> + */
> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
> +{
> + struct virtio_balloon *b = f->private;
> + u32 num_pages;
> + struct sysinfo i;
> +
> + si_meminfo(&i);
> +
> + seq_printf(f, "%-22s: %llx\n", "capabilities", b->vdev->features);

why do we need this in debugfs? Isn't this available in sysfs already?

> + seq_printf(f, "%-22s: %d\n", "page_size", 4096);

I suspect this function doesn't work properly when page size is not 4K.

> +
> + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual,
> + &num_pages);
> + /* Memory returned to host or amount we can inflate if possible */
> + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);

I don't really get the comment here.

> +
> + /* Total Memory for the guest from host */
> + seq_printf(f, "%-22s: %lu\n", "total_pages", i.totalram);
> +
> + /* Current memory for the guest */
> + seq_printf(f, "%-22s: %lu\n", "current_pages", i.totalram - num_pages);

Are you sure these are in units of 4Kbyte pages?

> + return 0;
> +}
> +
> +DEFINE_SHOW_ATTRIBUTE(virtio_balloon_debug);
> +
> +static void  virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> + debugfs_create_file("virtio-balloon", 0444, NULL, b,
> + &virtio_balloon_debug_fops);
> +}
> +
> +static void  virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> + debugfs_remove(debugfs_lookup("virtio-balloon", NULL));
> +}
> +
> +#else
> +
> +static inline void virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> +}
> +
> +static inline void virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> +}
> +
> +#endif   /* CONFIG_DEBUG_FS */
> +
>  #ifdef CONFIG_BALLOON_COMPACTION
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
> @@ -1019,6 +1085,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  
>   if (towards_target(vb))
>   virtballoon_changed(vdev);
> +
> + virtio_balloon_debugfs_init(vb);
> +
>   return 0;
>  
>  out_unregister_oom:
> @@ -1065,6 +1134,8 @@ static void virtballoon_remove(struct virtio_device 
> *vdev)
>  {
>   struct virtio_balloon *vb = vdev->priv;
>  
> + virtio_balloon_debugfs_exit(vb);
> +
>   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>   page_reporting_unregister(&vb->pr_dev_info);
>   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> -- 
> 2.25.1
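For reference, the unit mismatch raised in the review can be made concrete with a small sketch (illustrative Python, not kernel code; the 4096-byte balloon page size is fixed by the virtio spec via VIRTIO_BALLOON_PFN_SHIFT, while si_meminfo() counts guest PAGE_SIZE units):

```python
VIRTIO_BALLOON_PFN_SHIFT = 12   # balloon pages are always 4 KiB per the spec

def ballooned_bytes(actual_pages: int) -> int:
    """The config field 'actual' counts 4 KiB balloon pages, not guest pages."""
    return actual_pages << VIRTIO_BALLOON_PFN_SHIFT

def current_guest_pages(totalram_pages: int, actual_pages: int,
                        guest_page_size: int) -> int:
    # si_meminfo() reports in guest PAGE_SIZE units, so the balloon count
    # must be converted before subtracting -- the mismatch the review
    # comment points at when PAGE_SIZE != 4096.
    return totalram_pages - ballooned_bytes(actual_pages) // guest_page_size

# On a 64 KiB-page guest, 16 balloon pages are only 1 guest page:
assert current_guest_pages(1000, 16, 65536) == 999
# On a 4 KiB-page guest the two units coincide:
assert current_guest_pages(1000, 16, 4096) == 984
```

Subtracting the raw `actual` value from `i.totalram`, as the patch does, is only correct on 4 KiB-page guests.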



Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members

2022-06-27 Thread Daniel Borkmann

On 6/27/22 8:04 PM, Gustavo A. R. Silva wrote:

There is a regular need in the kernel to provide a way to declare
having a dynamically sized set of trailing elements in a structure.
Kernel code should always use “flexible array members”[1] for these
cases. The older style of one-element or zero-length arrays should
no longer be used[2].

This code was transformed with the help of Coccinelle:
(linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci 
--include-headers --dir . > output.patch)

@@
identifier S, member, array;
type T1, T2;
@@

struct S {
   ...
   T1 member;
   T2 array[
- 0
   ];
};

-fstrict-flex-arrays=3 is coming and we need to land these changes
to prevent issues like these in the near future:

../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination 
buffer has size 0,
but the source string has length 2 (including NUL byte) [-Wfortify-source]
strcpy(de3->name, ".");
^

Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If
this breaks anything, we can use a union with a new member name.

[1] https://en.wikipedia.org/wiki/Flexible_array_member
[2] 
https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays

Link: https://github.com/KSPP/linux/issues/78
Build-tested-by: 
https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/
Signed-off-by: Gustavo A. R. Silva 
---
Hi all!

JFYI: I'm adding this to my -next tree. :)


Fyi, this breaks BPF CI:

https://github.com/kernel-patches/bpf/runs/7078719372?check_suite_focus=true

  [...]
  progs/map_ptr_kern.c:314:26: error: field 'trie_key' with variable sized type 
'struct bpf_lpm_trie_key' not at the end of a struct or class is a GNU 
extension [-Werror,-Wgnu-variable-sized-type-not-at-end]
  struct bpf_lpm_trie_key trie_key;
  ^
  1 error generated.
  make: *** [Makefile:519: 
/tmp/runner/work/bpf/bpf/tools/testing/selftests/bpf/map_ptr_kern.o] Error 1
  make: *** Waiting for unfinished jobs
  Error: Process completed with exit code 2.

Re: [GIT PULL] virtio,vdpa: fixes

2022-06-27 Thread pr-tracker-bot
The pull request you sent on Mon, 27 Jun 2022 11:50:24 -0400:

> https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/941e3e7912696b9fbe3586083a7c2e102cee7a87

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/prtracker.html


Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Waiman Long

On 6/27/22 10:27, Guo Hui wrote:

On x86 physical machines, the instructions that compute the
vcpu_is_preempted() function parameter are redundant,
causing the multi-core performance of Unixbench to drop by about 4% to 5%.
The C function is as follows:
static bool vcpu_is_preempted(long vcpu);

The parameter 'vcpu' in the function osq_lock
that calls the function vcpu_is_preempted is assigned as follows:

The C code is in the function node_cpu:
cpu = node->cpu - 1;

The instructions corresponding to the C code are:
mov 0x14(%rax),%edi
sub $0x1,%edi

The above instructions are unnecessary
in the X86 Native operating environment,
causing high cache-misses and degrading performance.

This patch uses static_key to not execute this instruction
in the Native runtime environment.

The patch effect on two machines is as follows;
Unixbench runs with a full-core score:

1. Machine configuration:
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU core: 40
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0  948326591.2  81261.9
Double-Precision Whetstone   55.0 211986.3  38543.0
Execl Throughput 43.0  43453.2  10105.4
File Copy 1024 bufsize 2000 maxblocks  3960.0 438936.2   1108.4
File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2
File Copy 4096 bufsize 8000 maxblocks  5800.01534674.7   2646.0
Pipe Throughput   12440.0   46482107.6  37365.0
Pipe-based Context Switching   4000.01915094.2   4787.7
Process Creation126.0  85442.2   6781.1
Shell Scripts (1 concurrent) 42.4  69400.7  16368.1
Shell Scripts (8 concurrent)  6.0   8877.2  14795.3
System Call Overhead  15000.04714906.1   3143.3

System Benchmarks Index Score7923.3

After using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0  947032915.5  81151.1
Double-Precision Whetstone   55.0 211971.2  38540.2
Execl Throughput 43.0  45054.8  10477.9
File Copy 1024 bufsize 2000 maxblocks  3960.0 515024.9   1300.6
File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3
File Copy 4096 bufsize 8000 maxblocks  5800.01679995.9   2896.5
Pipe Throughput   12440.0   46466394.2  37352.4
Pipe-based Context Switching   4000.01898221.4   4745.6
Process Creation126.0  85653.1   6797.9
Shell Scripts (1 concurrent) 42.4  69437.3  16376.7
Shell Scripts (8 concurrent)  6.0   8898.9  14831.4
System Call Overhead  15000.04658746.7   3105.8

System Benchmarks Index Score8248.8

2. Machine configuration:
Hygon C86 7185 32-core Processor
CPU core: 128
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4
Double-Precision Whetstone   55.0 438969.9  79812.7
Execl Throughput 43.0  10108.6   2350.8
File Copy 1024 bufsize 2000 maxblocks  3960.0 275892.8696.7
File Copy 256 bufsize 500 maxblocks1655.0  72082.7435.5
File Copy 4096 bufsize 8000 maxblocks  5800.0 925043.4   1594.9
Pipe Throughput   12440.0  118905512.5  95583.2
Pipe-based Context Switching   4000.07820945.7  19552.4
Process Creation126.0  31233.3   2478.8
Shell Scripts (1 concurrent) 42.4  49042.8  11566.7
Shell Scripts (8 concurrent)  6.0   6656.0  11093.3
System Call Overhead  15000.06816047.5   4544.0

System Benchmarks Index Score7756.6

After using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8
Double-Precision Whetstone   55.0 451847.2  82154.0
Execl Throughput 43.0  10595.1   2464.0
File Copy 1024 bufsize 2000 maxblocks  3960.0 

[GIT PULL] virtio,vdpa: fixes

2022-06-27 Thread Michael S. Tsirkin
The following changes since commit a111daf0c53ae91e71fd2bfe7497862d14132e3e:

  Linux 5.19-rc3 (2022-06-19 15:06:47 -0500)

are available in the Git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus

for you to fetch changes up to c7cc29aaebf9eaa543b4c70801e0ecef1101b3c8:

  virtio_ring: make vring_create_virtqueue_split prettier (2022-06-27 08:05:35 
-0400)


virtio,vdpa: fixes

Fixes all over the place, most notably we are disabling
IRQ hardening (again!).

Signed-off-by: Michael S. Tsirkin 


Bo Liu (1):
  virtio: Remove unnecessary variable assignments

Deming Wang (1):
  virtio_ring: make vring_create_virtqueue_split prettier

Eli Cohen (2):
  vdpa/mlx5: Update Control VQ callback information
  vdpa/mlx5: Initialize CVQ vringh only once

Jason Wang (3):
  virtio: disable notification hardening by default
  virtio-net: fix race between ndo_open() and virtio_device_ready()
  caif_virtio: fix race between virtio_device_ready() and ndo_open()

Parav Pandit (1):
  vduse: Tie vduse mgmtdev and its device

Stefano Garzarella (1):
  vhost-vdpa: call vhost_vdpa_cleanup during the release

Stephan Gerhold (2):
  virtio_mmio: Add missing PM calls to freeze/restore
  virtio_mmio: Restore guest page size on resume

huangjie.albert (1):
  virtio_ring : keep used_wrap_counter in vq->last_used_idx

 drivers/net/caif/caif_virtio.c | 10 +++-
 drivers/net/virtio_net.c   |  8 ++-
 drivers/s390/virtio/virtio_ccw.c   |  9 +++-
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 33 -
 drivers/vdpa/vdpa_user/vduse_dev.c | 60 ++-
 drivers/vhost/vdpa.c   |  2 +-
 drivers/virtio/Kconfig | 13 +
 drivers/virtio/virtio.c|  2 +
 drivers/virtio/virtio_mmio.c   | 26 ++
 drivers/virtio/virtio_pci_modern_dev.c |  2 -
 drivers/virtio/virtio_ring.c   | 89 +++---
 include/linux/virtio_config.h  |  2 +
 12 files changed, 187 insertions(+), 69 deletions(-)



Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Juergen Gross via Virtualization

On 27.06.22 16:27, Guo Hui wrote:

On x86 physical machines, the instructions that compute the
vcpu_is_preempted() function parameter are redundant,
causing the multi-core performance of Unixbench to drop by about 4% to 5%.
The C function is as follows:
static bool vcpu_is_preempted(long vcpu);

The parameter 'vcpu' in the function osq_lock
that calls the function vcpu_is_preempted is assigned as follows:

The C code is in the function node_cpu:
cpu = node->cpu - 1;

The instructions corresponding to the C code are:
mov 0x14(%rax),%edi
sub $0x1,%edi

The above instructions are unnecessary
in the X86 Native operating environment,
causing high cache-misses and degrading performance.

This patch uses static_key to not execute this instruction
in the Native runtime environment.

The patch effect on two machines is as follows;
Unixbench runs with a full-core score:

1. Machine configuration:
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
CPU core: 40
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0  948326591.2  81261.9
Double-Precision Whetstone   55.0 211986.3  38543.0
Execl Throughput 43.0  43453.2  10105.4
File Copy 1024 bufsize 2000 maxblocks  3960.0 438936.2   1108.4
File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2
File Copy 4096 bufsize 8000 maxblocks  5800.01534674.7   2646.0
Pipe Throughput   12440.0   46482107.6  37365.0
Pipe-based Context Switching   4000.01915094.2   4787.7
Process Creation126.0  85442.2   6781.1
Shell Scripts (1 concurrent) 42.4  69400.7  16368.1
Shell Scripts (8 concurrent)  6.0   8877.2  14795.3
System Call Overhead  15000.04714906.1   3143.3

System Benchmarks Index Score7923.3

After using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0  947032915.5  81151.1
Double-Precision Whetstone   55.0 211971.2  38540.2
Execl Throughput 43.0  45054.8  10477.9
File Copy 1024 bufsize 2000 maxblocks  3960.0 515024.9   1300.6
File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3
File Copy 4096 bufsize 8000 maxblocks  5800.01679995.9   2896.5
Pipe Throughput   12440.0   46466394.2  37352.4
Pipe-based Context Switching   4000.01898221.4   4745.6
Process Creation126.0  85653.1   6797.9
Shell Scripts (1 concurrent) 42.4  69437.3  16376.7
Shell Scripts (8 concurrent)  6.0   8898.9  14831.4
System Call Overhead  15000.04658746.7   3105.8

System Benchmarks Index Score8248.8

2. Machine configuration:
Hygon C86 7185 32-core Processor
CPU core: 128
Memory: 256G
OS Kernel: 5.19-rc3

Before using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4
Double-Precision Whetstone   55.0 438969.9  79812.7
Execl Throughput 43.0  10108.6   2350.8
File Copy 1024 bufsize 2000 maxblocks  3960.0 275892.8696.7
File Copy 256 bufsize 500 maxblocks1655.0  72082.7435.5
File Copy 4096 bufsize 8000 maxblocks  5800.0 925043.4   1594.9
Pipe Throughput   12440.0  118905512.5  95583.2
Pipe-based Context Switching   4000.07820945.7  19552.4
Process Creation126.0  31233.3   2478.8
Shell Scripts (1 concurrent) 42.4  49042.8  11566.7
Shell Scripts (8 concurrent)  6.0   6656.0  11093.3
System Call Overhead  15000.06816047.5   4544.0

System Benchmarks Index Score7756.6

After using the patch:
System Benchmarks Index Values   BASELINE   RESULTINDEX
Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8
Double-Precision Whetstone   55.0 451847.2  82154.0
Execl Throughput 43.0  10595.1   2464.0
File Copy 1024 bufsize 2000 maxblocks  3960.0

Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Waiman Long

On 6/27/22 01:54, Guo Hui wrote:
Thank you very much, Longman. My patch is as you said: it only disables
node_cpu on x86, and keeps it enabled on the arm64, powerpc and s390 architectures;

the code is in file arch/x86/kernel/paravirt-spinlocks.c:
    DECLARE_STATIC_KEY_FALSE(preemted_key);
    static_branch_enable(&preemted_key);

the default value of preemted_key is false and the if conditional 
statement is reversed,

the code is in file kernel/locking/osq_lock.c:
    DEFINE_STATIC_KEY_FALSE(preemted_key);

    static inline int node_cpu(struct optimistic_spin_node *node)
    {
    int cpu = 0;

    if (!static_branch_unlikely(&preemted_key))
    cpu = node->cpu - 1;

    return cpu;
  }

In this way, only one nop instruction is added to architectures arm64, 
powerpc and s390, including virtual machines, without any other changes.


You are right. I was probably too tired last night to read the patch
carefully enough.


Cheers,
Longman
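For readers following along, the effect of the static key on node_cpu() can be modeled in a few lines (illustrative Python, not kernel code; `preemted_key` keeps the patch's own spelling of the identifier):

```python
# Toy model of the patched node_cpu(): with the static key enabled on
# bare metal, the per-node cpu load and subtraction are skipped entirely,
# since the native vcpu_is_preempted() ignores its argument anyway.
preemted_key = False   # DEFINE_STATIC_KEY_FALSE(preemted_key): default off

def enable_native():
    # paravirt-spinlocks.c does static_branch_enable(&preemted_key)
    # when running on a physical machine.
    global preemted_key
    preemted_key = True

def node_cpu(node_cpu_field: int) -> int:
    cpu = 0
    if not preemted_key:             # !static_branch_unlikely(&preemted_key)
        cpu = node_cpu_field - 1     # only paravirt guests need the real CPU
    return cpu

assert node_cpu(5) == 4   # default (paravirt): compute node->cpu - 1
enable_native()
assert node_cpu(5) == 0   # native: the argument is dead, the load is skipped
```

In the real kernel the static key patches the branch into a nop at runtime, so the guarded load never executes on native hardware; the Python flag only models the resulting control flow.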


Re: [PATCH 1/1] Create debugfs file with virtio balloon usage information

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 12:20:38PM +, alexander.atana...@virtuozzo.com 
wrote:
> From: Alexander Atanasov 
> 
> Allow the guest to know how much it is ballooned by the host.
> It is useful when debugging out of memory conditions.
> 
> When the host gets memory back from the guest it is accounted
> as used memory in the guest, but the guest has no way to know
> how much it is actually ballooned.
> 
> Signed-off-by: Alexander Atanasov 
> ---
>  drivers/virtio/virtio_balloon.c | 91 +
>  1 file changed, 91 insertions(+)
> 
> diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c
> index b9737da6c4dd..a32a52c28a1f 100644
> --- a/drivers/virtio/virtio_balloon.c
> +++ b/drivers/virtio/virtio_balloon.c
> @@ -10,6 +10,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -731,6 +732,91 @@ static void report_free_page_func(struct work_struct 
> *work)
>   }
>  }
>  
> +/*
> + * DEBUGFS Interface
> + */
> +#ifdef CONFIG_DEBUG_FS
> +
> +/**
> + * virtio_balloon_debug_show - shows statistics of balloon operations.
> + * @f: pointer to the  seq_file.
> + * @offset: ignored.
> + *
> + * Provides the statistics that can be accessed in virtio-balloon in the 
> debugfs.
> + *
> + * Return: zero on success or an error code.
> + */
> +static int virtio_balloon_debug_show(struct seq_file *f, void *offset)
> +{
> + struct virtio_balloon *b = f->private;
> + u32 num_pages;
> + struct sysinfo i;
> +
> + si_meminfo(&i);
> +
> + seq_printf(f, "%-22s:", "capabilities");

why aren't features in sysfs sufficient for this? pretty printing
can be done in userspace ...

> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST))
> + seq_puts(f, " tell_host");
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_STATS_VQ))
> + seq_puts(f, " stats");
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> + seq_puts(f, " deflate_on_oom");
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT))
> + seq_puts(f, " free_page_hint");
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_PAGE_POISON))
> + seq_puts(f, " page_poison");
> +
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_REPORTING))
> + seq_puts(f, " reporting");
> +
> + seq_puts(f, "\n");
> +
> + seq_printf(f, "%-22s: %d\n", "page_size", 4096);
> +
> + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual,
> + &num_pages);
> + /* Memory returned to host or amount we can inflate if possible */
> + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages);
> +
> + /* Total Memory for the guest from host */
> + seq_printf(f, "%-22s: %lu\n", "total_pages", i.totalram);
> +
> + /* Current memory for the guest */
> + seq_printf(f, "%-22s: %lu\n", "current_pages", i.totalram-num_pages);

spaces around -

> +
> + return 0;
> +}
> +
> +DEFINE_SHOW_ATTRIBUTE(virtio_balloon_debug);
> +
> +static void  virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> + debugfs_create_file("virtio-balloon", 0444, NULL, b,
> + &virtio_balloon_debug_fops);
> +}
> +
> +static void  virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> + debugfs_remove(debugfs_lookup("virtio-balloon", NULL));
> +}
> +
> +#else
> +
> +static inline void virtio_balloon_debugfs_init(struct virtio_balloon *b)
> +{
> +}
> +
> +static inline void virtio_balloon_debugfs_exit(struct virtio_balloon *b)
> +{
> +}
> +
> +#endif   /* CONFIG_DEBUG_FS */
> +
>  #ifdef CONFIG_BALLOON_COMPACTION
>  /*
>   * virtballoon_migratepage - perform the balloon page migration on behalf of
> @@ -1019,6 +1105,9 @@ static int virtballoon_probe(struct virtio_device *vdev)
>  
>   if (towards_target(vb))
>   virtballoon_changed(vdev);
> +
> + virtio_balloon_debugfs_init(vb);
> +
>   return 0;
>  
>  out_unregister_oom:
> @@ -1065,6 +1154,8 @@ static void virtballoon_remove(struct virtio_device 
> *vdev)
>  {
>   struct virtio_balloon *vb = vdev->priv;
>  
> + virtio_balloon_debugfs_exit(vb);
> +
>   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING))
>   page_reporting_unregister(&vb->pr_dev_info);
>   if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM))
> -- 
> 2.25.1



Re: [PATCH net V2] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 04:30:40PM +0800, Jason Wang wrote:
> We currently call virtio_device_ready() after netdev
> registration. Since ndo_open() can be called immediately
> after register_netdev, this means there exists a race between
> ndo_open() and virtio_device_ready(): the driver may start to use the
> device before DRIVER_OK which violates the spec.
> 
> Fix this by switching to use register_netdevice() and protect the
> virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
> only be called after virtio_device_ready().
> 
> Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
> Acked-by: Michael S. Tsirkin 
> Signed-off-by: Jason Wang 

I reworded the caif commit log similarly and put both on my tree.

> ---
>  drivers/net/virtio_net.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db05b5e930be..8a5810bcb839 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (vi->has_rss || vi->has_rss_hash_report)
>   virtnet_init_default_rss(vi);
>  
> - err = register_netdev(dev);
> + /* serialize netdev register + virtio_device_ready() with ndo_open() */
> + rtnl_lock();
> +
> + err = register_netdevice(dev);
>   if (err) {
>   pr_debug("virtio_net: registering device failed\n");
> + rtnl_unlock();
>   goto free_failover;
>   }
>  
>   virtio_device_ready(vdev);
>  
> + rtnl_unlock();
> +
>   err = virtnet_cpu_notif_add(vi);
>   if (err) {
>   pr_debug("virtio_net: registering cpu notifier failed\n");
> -- 
> 2.25.1
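The locking pattern in the patch can be modeled with a toy example (illustrative Python using a plain lock in place of rtnl_lock, not kernel code): because probe publishes registration and DRIVER_OK under the same lock that open takes, open can never observe a registered-but-not-ready device.

```python
import threading

class ToyNetDevice:
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for rtnl_lock
        self.registered = False
        self.driver_ok = False

    def probe(self):
        # register_netdevice() + virtio_device_ready() in one critical
        # section, mirroring the patch.
        with self.lock:
            self.registered = True
            self.driver_ok = True

    def ndo_open(self):
        with self.lock:
            if not self.registered:
                return False            # device not visible yet
            # With the lock held across probe's two steps, this can
            # never fire -- the invariant the patch establishes.
            assert self.driver_ok, "open before DRIVER_OK violates the spec"
            return True

dev = ToyNetDevice()
t = threading.Thread(target=dev.probe)
t.start()
dev.ndo_open()            # races with probe; never sees !driver_ok
t.join()
assert dev.ndo_open() is True
```

The original register_netdev() ordering corresponds to setting `registered` and `driver_ok` in two separate critical sections, which lets open run in the window between them.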



Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote:
> > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which 
> > > > > > > comes from
> > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > > > > >
> > > > > > > To avoid breaking uABI, add a new struct
> > > > > > > virtio_pci_common_cfg_notify.
> > > > > >
> > > > > > What exactly is meant by not breaking uABI?
> > > > > > Users are supposed to be prepared for struct size to change ... no?
> > > > >
> > > > > Not sure, any doc for this?
> > > > >
> > > > > Thanks
> > > >
> > > >
> > > > Well we have this:
> > > >
> > > > The drivers SHOULD only map part of configuration structure
> > > > large enough for device operation.  The drivers MUST handle
> > > > an unexpectedly large \field{length}, but MAY check that 
> > > > \field{length}
> > > > is large enough for device operation.
> > >
> > > Yes, but that's the device/driver interface. What's done here is the
> > > userspace/kernel.
> > >
> > > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)?
> > >
> > > Thanks
> >
> > Hmm I guess there's risk... but then how are we going to maintain this
> > going forward?  Add a new struct on any change?
> 
> This is the way we have used it for the past 5 or more years. I don't
> see why this must be handled in the vq reset feature.
> 
> >Can we at least
> > prevent this going forward somehow?
> 
> Like have some padding?
> 
> Thanks

Maybe - this is what QEMU does ...

> >
> >
> > > >
> > > >
> > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > > Since I want to add queue_reset after queue_notify_data, I 
> > > > > > > submitted
> > > > > > > this patch first.
> > > > > > >
> > > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > > Acked-by: Jason Wang 
> > > > > > > ---
> > > > > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > > > > >  1 file changed, 7 insertions(+)
> > > > > > >
> > > > > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > > > > b/include/uapi/linux/virtio_pci.h
> > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > > > > --- a/include/uapi/linux/virtio_pci.h
> > > > > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > > > > >   __le32 queue_used_hi;   /* read-write */
> > > > > > >  };
> > > > > > >
> > > > > > > +struct virtio_pci_common_cfg_notify {
> > > > > > > + struct virtio_pci_common_cfg cfg;
> > > > > > > +
> > > > > > > + __le16 queue_notify_data;   /* read-write */
> > > > > > > + __le16 padding;
> > > > > > > +};
> > > > > > > +
> > > > > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > > > > >  struct virtio_pci_cfg_cap {
> > > > > > >   struct virtio_pci_cap cap;
> > > > > > > --
> > > > > > > 2.31.0
> > > > > >
> > > >
> >



Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote:
> On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin  wrote:
> >
> > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote:
> > > We count pinned_vm as follow in vhost-vDPA
> > >
> > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > >  ret = -ENOMEM;
> > >  goto unlock;
> > > }
> > > This means that if we have two vDPA devices for the same VM, the pages
> > > would be counted twice. So we add a tree to save the pages that were
> > > already counted so that we will not count them again.
> > >
> > > Add vdpa_mem_tree to save the memory that has already been counted.
> > > Use a hlist to save the root for vdpa_mem_tree.
> > >
> > > Signed-off-by: Cindy Lu 
> > > ---
> > >  drivers/vhost/vhost.c | 63 +++
> > >  drivers/vhost/vhost.h |  1 +
> > >  2 files changed, 64 insertions(+)
> > >
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 40097826cff0..4ca8b1ed944b 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -32,6 +32,8 @@
> > >  #include 
> > >
> > >  #include "vhost.h"
> > > +#include 
> > > +#include 
> > >
> > >  static ushort max_mem_regions = 64;
> > >  module_param(max_mem_regions, ushort, 0444);
> > > @@ -49,6 +51,14 @@ enum {
> > >  #define vhost_used_event(vq) ((__virtio16 __user 
> > > *)&vq->avail->ring[vq->num])
> > >  #define vhost_avail_event(vq) ((__virtio16 __user 
> > > *)&vq->used->ring[vq->num])
> > >
> > > +struct vhost_vdpa_rbtree_node {
> > > + struct hlist_node node;
> > > + struct rb_root_cached vdpa_mem_tree;
> > > + struct mm_struct *mm_using;
> > > +};
> > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8);
> > > +int vhost_vdpa_rbtree_hlist_status;
> > > +
> > >  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
> > >  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
> > >  {
> >
> > Are you trying to save some per-mm information here?
> > Can't we just add it to mm_struct?
> >
> yes, this is per-mm information, but I have checked with Jason before,
> and it seems it may be difficult to change mm_struct upstream,
> so I added a hlist instead
> Thanks
> Cindy

Difficult how? You get more scrutiny if you try, for sure,
and you need to generalize it enough that it looks
useful outside the driver. But I guess that's good?

> >
> >
> > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> > >   dev->mm = NULL;
> > >  }
> > >
> > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm)
> > > +{
> > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > + struct rb_root_cached *vdpa_tree;
> > > + u32 key;
> > > +
> > > + /* No hashed table yet, init one */
> > > + if (vhost_vdpa_rbtree_hlist_status == 0) {
> > > + hash_init(vhost_vdpa_rbtree_hlist);
> > > + vhost_vdpa_rbtree_hlist_status = 1;
> > > + }
> > > +
> > > + key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > +key) {
> > > + if (rbtree_root->mm_using == mm)
> > > + return &(rbtree_root->vdpa_mem_tree);
> > > + }
> > > + rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL);
> > > + if (!rbtree_root)
> > > + return NULL;
> > > + rbtree_root->mm_using = mm;
> > > + rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED;
> > > + hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key);
> > > + vdpa_tree = &(rbtree_root->vdpa_mem_tree);
> > > + return vdpa_tree;
> > > +}
> > > +
> > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm)
> > > +{
> > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > + u32 key;
> > > +
> > > + key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > +
> > > + /* Find the entry for this mm and remove it */
> > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > +key) {
> > > + if (rbtree_root->mm_using == mm) {
> > > + hash_del(&rbtree_root->node);
> > > + kfree(rbtree_root);
> > > + }
> > > + }
> > > +}
> > > +
> > >  /* Caller should have device mutex */
> > >  long vhost_dev_set_owner(struct vhost_dev *dev)
> > >  {
> > > @@ -605,6 +660,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > >   err = vhost_dev_alloc_iovecs(dev);
> > >   if (err)
> > >   goto err_cgroup;
> > > + dev->vdpa_mem_tree = vhost_vdpa_get_mem_tree(dev->mm);
> > > + if (dev->vdpa_mem_tree == NULL) {
> > > + err = -ENOMEM;
> > > + goto err_cgroup;
> > > + }
> > >
> > >   return 0;
> > >  err_cgroup:
> > > @@ -613,6 +673,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev)
> > >   dev->worker = NULL;
> > >   }
> > >  err_worker:

Re: [PATCH V3] virtio: disable notification hardening by default

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 04:11:18PM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote:
> > > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote:
> > > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin  
> > > > > wrote:
> > > > > >
> > > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote:
> > > > > > > We tried to harden virtio device notifications in 8b4ec69d7e09
> > > > > > > ("virtio: harden vring IRQ"). It works under the assumption that
> > > > > > > the driver or core can properly call virtio_device_ready() at the
> > > > > > > right place. Unfortunately, this seems not to be true, and it
> > > > > > > uncovers various bugs in the existing drivers, mainly the issue
> > > > > > > of using virtio_device_ready() incorrectly.
> > > > > > >
> > > > > > > So let's add a Kconfig option and disable it by default. This
> > > > > > > gives us a breather to fix the drivers, and then we can consider
> > > > > > > enabling it by default.
> > > > > > >
> > > > > > > Signed-off-by: Jason Wang 
> > > > > >
> > > > > >
> > > > > > OK I will queue, but I think the problem is fundamental.
> > > > >
> > > > > If I understand correctly, you want some core IRQ work?
> > > >
> > > > Yes.
> > > >
> > > > > As discussed
> > > > > before, it doesn't solve all the problems, we still need to do per
> > > > > driver audit.
> > > > >
> > > > > Thanks
> > > >
> > > > Maybe, but we don't need to tie things to device_ready then.
> > > > We can do
> > > >
> > > > - disable irqs
> > > > - device ready
> > > > - setup everything
> > > > - enable irqs
> > > >
> > > >
> > > > and this works for most things, the only issue is
> > > > this deadlocks if "setup everything" waits for interrupts.
> > > >
> > > >
> > > > With the current approach there's really no good time:
> > > > 1.- setup everything
> > > > - device ready
> > > >
> > > > can cause kicks before device is ready
> > > >
> > > > 2.- device ready
> > > > - setup everything
> > > >
> > > > can cause callbacks before setup.
> > > >
> > > > So I prefer the 1. and fix the hardening in the core.
> > >
> > > So my question is:
> > >
> > > 1) do similar hardening like config interrupt
> > > or
> > > 2) per transport notification work (e.g for PCI core IRQ work)
> > >
> > > 1) seems easier and universal, but we pay little overhead which could
> > > be eliminated by the config option.
> >
> > I doubt 1 is easy and I am not even sure core IRQ changes will help.
> 
> Core IRQ work won't help with 1); the changes are limited to virtio.
> 
> > My concern with adding overhead is that I'm not sure these are not just
> > wasted CPU cycles.  We spent a bunch of time on irq hardening and so far
> > we are still at the "all drivers need to be fixed" stage.
> 
> It's not the fault of the hardening but of the drivers. The hardening just
> exposes those bugs.

Heh. Yea sure. But things work fine for people. What is the chance
your review found and fixed all driver bugs? After two attempts
I don't feel like hoping audit will fix all bugs.


> >
> > The reason config was kind of easy is that config interrupt is rarely
> > vital for device function so arbitrarily deferring that does not lead to
> > deadlocks - what you are trying to do with VQ interrupts is
> > fundamentally different. Things are especially bad if we just drop
> > an interrupt but deferring can lead to problems too.
> 
> I'm not sure I see the difference; disable_irq() also delays
> interrupt processing until enable_irq().


Absolutely. I am not at all sure disable_irq fixes all problems.

> >
> > Consider as an example
> > virtio-net: fix race between ndo_open() and virtio_device_ready()
> > if you just defer vq interrupts you get deadlocks.
> >
> >
> 
> I don't see a deadlock here, maybe you can show more detail on this?

What I mean is this: if we revert the above commit, things still
work (out of spec, but still). If we revert and defer interrupts until
device ready, then an ndo_open() that triggers before device ready deadlocks.


> >
> > So, thinking about all this, how about a simple per vq flag meaning
> > "this vq was kicked since reset"?
> 
> And ignore the notification if vq is not kicked? It sounds like the
> callback needs to be synchronized with the kick.

Note we only need to synchronize it when it changes, which is
only during initialization and reset.


> >
> > If driver does not kick then it's not ready to get callbacks, right?
> >
> > Sounds quite clean, but we need to think through memory ordering
> > concerns - I guess it's only when we change the value so
> > if (!vq->kicked) {
> > 	vq->kicked = true;
> > 	mb();
> > }
> >
> > will do the trick, right?
> 
There's not much difference from the existing 

Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Peter Zijlstra
On Mon, Jun 27, 2022 at 05:18:02PM +0800, Guo Hui wrote:
> Ok thanks Peter, your suggestion is very good, I will modify my patch as you
> suggested.

Because it messes up the order in which people normally read text.
Why is top-posting such a bad thing?
Top-posting.
What is the most annoying thing in e-mail?


Re: vhost-vdpa tests in tools/virtio

2022-06-27 Thread Stefano Garzarella

On Thu, Jun 23, 2022 at 11:59:56AM +0800, Jason Wang wrote:
> On Thu, Jun 23, 2022 at 3:21 AM Eugenio Perez Martin wrote:
> >
> > On Wed, Jun 22, 2022 at 5:39 PM Stefano Garzarella wrote:
> > >
> > > Hi,
> > > while developing/testing the vdpa-blk support in libblkio [1], I
> > > realized that we added several regressions with the "[PATCH v2 00/19]
> > > Control VQ support in vDPA" series (patches/clarifications already
> > > sent).
> > >
> > > To minimize these problems in the future, I was thinking of adding tests
> > > for vhost_vdpa in tools/virtio using vDPA simulators as devices.
> > >
> > > What do you think about this?
> >
> > I think it's definitely a good idea. Maybe a good starting point is to
> > copy the virtio_test for the moment?
> >
> > Maybe to add a new parameter to tell it it will test a vhost-vdpa
> > device? Initialization is pretty similar.
>
> +1

Good idea, I will use it as a starting point.

Thanks,
Stefano



[PATCH net V2] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Jason Wang
We currently call virtio_device_ready() after netdev
registration. Since ndo_open() can be called immediately
after register_netdev, this means there exists a race between
ndo_open() and virtio_device_ready(): the driver may start to use the
device before DRIVER_OK which violates the spec.

Fix this by switching to use register_netdevice() and protect the
virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
only be called after virtio_device_ready().

Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
Acked-by: Michael S. Tsirkin 
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index db05b5e930be..8a5810bcb839 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
if (vi->has_rss || vi->has_rss_hash_report)
virtnet_init_default_rss(vi);
 
-   err = register_netdev(dev);
+   /* serialize netdev register + virtio_device_ready() with ndo_open() */
+   rtnl_lock();
+
+   err = register_netdevice(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
+   rtnl_unlock();
goto free_failover;
}
 
virtio_device_ready(vdev);
 
+   rtnl_unlock();
+
err = virtnet_cpu_notif_add(vi);
if (err) {
pr_debug("virtio_net: registering cpu notifier failed\n");
-- 
2.25.1



Re: [PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 3:44 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 02:36:56PM +0800, Jason Wang wrote:
> > We used to call virtio_device_ready() after netdev registration.
>
> s/used to call/currently call/
>
> > This
> > cause
>
> s/This cause/Since ndo_open can be called immediately
> after register_netdev, this means there exists/
>
> > a race between ndo_open() and virtio_device_ready(): if
> > ndo_open() is called before virtio_device_ready(), the driver may
> > start to use the device before DRIVER_OK which violates the spec.
> >
> > Fixing
>
> s/Fixing/Fix/
>
> > this by switching to use register_netdevice() and protect the
> > virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
> > only be called after virtio_device_ready().
> >
> > Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
>
> it's an unusual use of Fixes - the patch in question does not
> introduce the problem, it just does not fix it completely.

Yes, but I couldn't find a better commit.

> But OK I guess.
>
> > Signed-off-by: Jason Wang 
>
> With commit log changes:

Will post a new version with the above fixed.

Thanks

>
> Acked-by: Michael S. Tsirkin 
>
> > ---
> >  drivers/net/virtio_net.c | 8 +++-
> >  1 file changed, 7 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> > index db05b5e930be..8a5810bcb839 100644
> > --- a/drivers/net/virtio_net.c
> > +++ b/drivers/net/virtio_net.c
> > @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
> >   if (vi->has_rss || vi->has_rss_hash_report)
> >   virtnet_init_default_rss(vi);
> >
> > - err = register_netdev(dev);
> > + /* serialize netdev register + virtio_device_ready() with ndo_open() 
> > */
> > + rtnl_lock();
> > +
> > + err = register_netdevice(dev);
> >   if (err) {
> >   pr_debug("virtio_net: registering device failed\n");
> > + rtnl_unlock();
> >   goto free_failover;
> >   }
> >
> >   virtio_device_ready(vdev);
> >
> > + rtnl_unlock();
> > +
> >   err = virtnet_cpu_notif_add(vi);
> >   if (err) {
> >   pr_debug("virtio_net: registering cpu notifier failed\n");
> > --
> > 2.25.1
>



Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote:
> > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  wrote:
> > >
> > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes 
> > > > > > from
> > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > > > >
> > > > > > To avoid breaking the uABI, add a new struct virtio_pci_common_cfg_notify.
> > > > >
> > > > > What exactly is meant by not breaking uABI?
> > > > > Users are supposed to be prepared for struct size to change ... no?
> > > >
> > > > Not sure, any doc for this?
> > > >
> > > > Thanks
> > >
> > >
> > > Well we have this:
> > >
> > > The drivers SHOULD only map part of configuration structure
> > > large enough for device operation.  The drivers MUST handle
> > > an unexpectedly large \field{length}, but MAY check that 
> > > \field{length}
> > > is large enough for device operation.
> >
> > Yes, but that's the device/driver interface. What's done here is the
> > userspace/kernel.
> >
> > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)?
> >
> > Thanks
>
> Hmm I guess there's risk... but then how are we going to maintain this
> going forward?  Add a new struct on any change?

This is the way we have used it for the past 5 or more years. I don't
see why this must be handled in the vq reset feature.

>Can we at least
> prevent this going forward somehow?

Like have some padding?

Thanks

>
>
> > >
> > >
> > >
> > > >
> > > > >
> > > > >
> > > > > > Since I want to add queue_reset after queue_notify_data, I submitted
> > > > > > this patch first.
> > > > > >
> > > > > > Signed-off-by: Xuan Zhuo 
> > > > > > Acked-by: Jason Wang 
> > > > > > ---
> > > > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > > > >  1 file changed, 7 insertions(+)
> > > > > >
> > > > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > > > b/include/uapi/linux/virtio_pci.h
> > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > > > --- a/include/uapi/linux/virtio_pci.h
> > > > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > > > >   __le32 queue_used_hi;   /* read-write */
> > > > > >  };
> > > > > >
> > > > > > +struct virtio_pci_common_cfg_notify {
> > > > > > + struct virtio_pci_common_cfg cfg;
> > > > > > +
> > > > > > + __le16 queue_notify_data;   /* read-write */
> > > > > > + __le16 padding;
> > > > > > +};
> > > > > > +
> > > > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > > > >  struct virtio_pci_cfg_cap {
> > > > > >   struct virtio_pci_cap cap;
> > > > > > --
> > > > > > 2.31.0
> > > > >
> > >
>



Re: [PATCH V3] virtio: disable notification hardening by default

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote:
> > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin  wrote:
> > >
> > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote:
> > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin  
> > > > wrote:
> > > > >
> > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote:
> > > > > > We tried to harden virtio device notifications in 8b4ec69d7e09
> > > > > > ("virtio: harden vring IRQ"). It works under the assumption that the
> > > > > > driver or core can properly call virtio_device_ready() at the right
> > > > > > place. Unfortunately, this seems not to be true, and it uncovers
> > > > > > various bugs in the existing drivers, mainly the issue of using
> > > > > > virtio_device_ready() incorrectly.
> > > > > >
> > > > > > So let's add a Kconfig option and disable it by default. This gives
> > > > > > us a breather to fix the drivers, and then we can consider enabling
> > > > > > it by default.
> > > > > >
> > > > > > Signed-off-by: Jason Wang 
> > > > >
> > > > >
> > > > > OK I will queue, but I think the problem is fundamental.
> > > >
> > > > If I understand correctly, you want some core IRQ work?
> > >
> > > Yes.
> > >
> > > > As discussed
> > > > before, it doesn't solve all the problems, we still need to do per
> > > > driver audit.
> > > >
> > > > Thanks
> > >
> > > Maybe, but we don't need to tie things to device_ready then.
> > > We can do
> > >
> > > - disable irqs
> > > - device ready
> > > - setup everything
> > > - enable irqs
> > >
> > >
> > > and this works for most things, the only issue is
> > > this deadlocks if "setup everything" waits for interrupts.
> > >
> > >
> > > With the current approach there's really no good time:
> > > 1.- setup everything
> > > - device ready
> > >
> > > can cause kicks before device is ready
> > >
> > > 2.- device ready
> > > - setup everything
> > >
> > > can cause callbacks before setup.
> > >
> > > So I prefer the 1. and fix the hardening in the core.
> >
> > So my question is:
> >
> > 1) do similar hardening like config interrupt
> > or
> > 2) per transport notification work (e.g for PCI core IRQ work)
> >
> > 1) seems easier and universal, but we pay little overhead which could
> > be eliminated by the config option.
>
> I doubt 1 is easy and I am not even sure core IRQ changes will help.

Core IRQ work won't help with 1); the changes are limited to virtio.

> My concern with adding overhead is that I'm not sure these are not just
> wasted CPU cycles.  We spent a bunch of time on irq hardening and so far
> we are still at the "all drivers need to be fixed" stage.

It's not the fault of the hardening but of the drivers. The hardening just
exposes those bugs.

>
> The reason config was kind of easy is that config interrupt is rarely
> vital for device function so arbitrarily deferring that does not lead to
> deadlocks - what you are trying to do with VQ interrupts is
> fundamentally different. Things are especially bad if we just drop
> an interrupt but deferring can lead to problems too.

I'm not sure I see the difference; disable_irq() also delays
interrupt processing until enable_irq().

>
> Consider as an example
> virtio-net: fix race between ndo_open() and virtio_device_ready()
> if you just defer vq interrupts you get deadlocks.
>
>

I don't see a deadlock here, maybe you can show more detail on this?

>
> So, thinking about all this, how about a simple per vq flag meaning
> "this vq was kicked since reset"?

And ignore the notification if vq is not kicked? It sounds like the
callback needs to be synchronized with the kick.

>
> If driver does not kick then it's not ready to get callbacks, right?
>
> Sounds quite clean, but we need to think through memory ordering
> concerns - I guess it's only when we change the value so
> if (!vq->kicked) {
> 	vq->kicked = true;
> 	mb();
> }
>
> will do the trick, right?

There's not much difference from the existing approach:

1) your proposal implicitly makes callbacks ready in virtqueue_kick()
2) my proposal explicitly makes callbacks ready via virtio_device_ready()

Both require careful auditing of all the existing drivers to make sure
no kick before DRIVER_OK.

>
> need to think about the reset path - it already synchronizes callbacks
> and already can lose interrupts so we just need to clear vq->kicked
> before that, right?

Probably.

>
>
> > 2) seems to require more work in the IRQ core and it cannot work for all
> > transports (e.g. vDPA would be kind of difficult)
> >
> > Thanks
>
> Hmm I don't really get why would it be difficult.
> VDPA is mostly PCI isn't it? With PCI both level INT#x and edge MSI
> have interrupt masking support.

Yes, but consider the case of mlx5_vdpa: the PCI stuff is hidden under
the auxiliary bus. And that is the way other vendors will go.

Thanks

>
>
> > >
> > >
> > > > >
> > > > >
> > > > > 

Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  wrote:
> >
> > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes 
> > > > > from
> > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > > >
> > > > > To avoid breaking the uABI, add a new struct virtio_pci_common_cfg_notify.
> > > >
> > > > What exactly is meant by not breaking uABI?
> > > > Users are supposed to be prepared for struct size to change ... no?
> > >
> > > Not sure, any doc for this?
> > >
> > > Thanks
> >
> >
> > Well we have this:
> >
> > The drivers SHOULD only map part of configuration structure
> > large enough for device operation.  The drivers MUST handle
> > an unexpectedly large \field{length}, but MAY check that 
> > \field{length}
> > is large enough for device operation.
> 
> Yes, but that's the device/driver interface. What's done here is the
> userspace/kernel.
> 
> Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)?
> 
> Thanks

Hmm I guess there's risk... but then how are we going to maintain this
going forward?  Add a new struct on any change? Can we at least
prevent this going forward somehow?


> >
> >
> >
> > >
> > > >
> > > >
> > > > > Since I want to add queue_reset after queue_notify_data, I submitted
> > > > > this patch first.
> > > > >
> > > > > Signed-off-by: Xuan Zhuo 
> > > > > Acked-by: Jason Wang 
> > > > > ---
> > > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > > >  1 file changed, 7 insertions(+)
> > > > >
> > > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > > b/include/uapi/linux/virtio_pci.h
> > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > > --- a/include/uapi/linux/virtio_pci.h
> > > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > > >   __le32 queue_used_hi;   /* read-write */
> > > > >  };
> > > > >
> > > > > +struct virtio_pci_common_cfg_notify {
> > > > > + struct virtio_pci_common_cfg cfg;
> > > > > +
> > > > > + __le16 queue_notify_data;   /* read-write */
> > > > > + __le16 padding;
> > > > > +};
> > > > > +
> > > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > > >  struct virtio_pci_cfg_cap {
> > > > >   struct virtio_pci_cap cap;
> > > > > --
> > > > > 2.31.0
> > > >
> >



Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation

2022-06-27 Thread Peter Zijlstra
On Mon, Jun 27, 2022 at 10:13:50AM +0800, Guo Hui wrote:
> The instructions that set up the vcpu_is_preempted() function parameter
> are redundant when running on a bare-metal (native) x86 machine,
> causing multi-core Unixbench performance to drop by about 4% to 5%.
> The C function is as follows:
> static bool vcpu_is_preempted(long vcpu);
> 
> The parameter 'vcpu' passed to vcpu_is_preempted() from the function
> osq_lock() is computed as follows:
> 
> The C code is in the function node_cpu:
> cpu = node->cpu - 1;
> 
> The instructions corresponding to this C code are:
> mov 0x14(%rax),%edi
> sub $0x1,%edi
> 
> These instructions are unnecessary in the native x86 environment;
> they cause extra cache misses and degrade performance.

The above basically says that argument setup is not patched out and
causes significant pain due to a cache-miss.

> Signed-off-by: Guo Hui 
> ---
>  arch/x86/kernel/paravirt-spinlocks.c | 4 
>  kernel/locking/osq_lock.c| 9 -
>  2 files changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/paravirt-spinlocks.c 
> b/arch/x86/kernel/paravirt-spinlocks.c
> index 9e1ea99ad..7a55f8407 100644
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void)
>   __raw_callee_save___native_vcpu_is_preempted;
>  }
>  
> +DECLARE_STATIC_KEY_FALSE(preemted_key);
> +
>  void __init paravirt_set_cap(void)
>  {
>   if (!pv_is_native_spin_unlock())
> @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void)
>  
>   if (!pv_is_native_vcpu_is_preempted())
>   setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT);
> + else
> + static_branch_enable(&preemted_key);
>  }

At least for x86 it makes sense to have the static_key default the other
way around. That is, enable it along with vcpu_is_preempted().

> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
> index d5610ad52..a8798e701 100644
> --- a/kernel/locking/osq_lock.c
> +++ b/kernel/locking/osq_lock.c
> @@ -22,9 +22,16 @@ static inline int encode_cpu(int cpu_nr)
>   return cpu_nr + 1;
>  }
>  
> +DEFINE_STATIC_KEY_FALSE(preemted_key);
> +
>  static inline int node_cpu(struct optimistic_spin_node *node)
>  {
> - return node->cpu - 1;
> + int cpu = 0;
> +
> + if (!static_branch_unlikely(&preemted_key))
> + cpu = node->cpu - 1;
> +
> + return cpu;
>  }

Would not something like:

static inline bool
vcpu_is_preempted_node(struct optimistic_spin_node *node)
{
	if (!static_branch_unlikely(_has_preemption))
		return false;

	return vcpu_is_preempted(node_cpu(node->prev));
}

And then use that like:

	if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() ||
				  vcpu_is_preempted_node(node)))

Not generate better code still?


Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Jason Wang
On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin  wrote:
>
> On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin  wrote:
> > >
> > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes from
> > > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > > >
> > > > To avoid breaking the uABI, add a new struct virtio_pci_common_cfg_notify.
> > >
> > > What exactly is meant by not breaking uABI?
> > > Users are supposed to be prepared for struct size to change ... no?
> >
> > Not sure, any doc for this?
> >
> > Thanks
>
>
> Well we have this:
>
> The drivers SHOULD only map part of configuration structure
> large enough for device operation.  The drivers MUST handle
> an unexpectedly large \field{length}, but MAY check that \field{length}
> is large enough for device operation.

Yes, but that's the device/driver interface. What's done here is the
userspace/kernel.

Userspace may break if it uses e.g. sizeof(struct virtio_pci_common_cfg)?

Thanks

>
>
>
> >
> > >
> > >
> > > > Since I want to add queue_reset after queue_notify_data, I submitted
> > > > this patch first.
> > > >
> > > > Signed-off-by: Xuan Zhuo 
> > > > Acked-by: Jason Wang 
> > > > ---
> > > >  include/uapi/linux/virtio_pci.h | 7 +++
> > > >  1 file changed, 7 insertions(+)
> > > >
> > > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > > b/include/uapi/linux/virtio_pci.h
> > > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > > --- a/include/uapi/linux/virtio_pci.h
> > > > +++ b/include/uapi/linux/virtio_pci.h
> > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > > >   __le32 queue_used_hi;   /* read-write */
> > > >  };
> > > >
> > > > +struct virtio_pci_common_cfg_notify {
> > > > + struct virtio_pci_common_cfg cfg;
> > > > +
> > > > + __le16 queue_notify_data;   /* read-write */
> > > > + __le16 padding;
> > > > +};
> > > > +
> > > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > > >  struct virtio_pci_cfg_cap {
> > > >   struct virtio_pci_cap cap;
> > > > --
> > > > 2.31.0
> > >
>



Re: [PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 02:36:56PM +0800, Jason Wang wrote:
> We used to call virtio_device_ready() after netdev registration.

s/used to call/currently call/

> This
> cause

s/This cause/Since ndo_open can be called immediately
after register_netdev, this means there exists/

> a race between ndo_open() and virtio_device_ready(): if
> ndo_open() is called before virtio_device_ready(), the driver may
> start to use the device before DRIVER_OK which violates the spec.
> 
> Fixing

s/Fixing/Fix/

> this by switching to use register_netdevice() and protect the
> virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
> only be called after virtio_device_ready().
> 
> Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")

it's an unusual use of Fixes - the patch in question does not
introduce the problem, it just does not fix it completely.
But OK I guess.

> Signed-off-by: Jason Wang 

With commit log changes:

Acked-by: Michael S. Tsirkin 

> ---
>  drivers/net/virtio_net.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db05b5e930be..8a5810bcb839 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
>   if (vi->has_rss || vi->has_rss_hash_report)
>   virtnet_init_default_rss(vi);
>  
> - err = register_netdev(dev);
> + /* serialize netdev register + virtio_device_ready() with ndo_open() */
> + rtnl_lock();
> +
> + err = register_netdevice(dev);
>   if (err) {
>   pr_debug("virtio_net: registering device failed\n");
> + rtnl_unlock();
>   goto free_failover;
>   }
>  
>   virtio_device_ready(vdev);
>  
> + rtnl_unlock();
> +
>   err = virtnet_cpu_notif_add(vi);
>   if (err) {
>   pr_debug("virtio_net: registering cpu notifier failed\n");
> -- 
> 2.25.1



Re: [PATCH V3] virtio: disable notification hardening by default

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote:
> On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin  wrote:
> >
> > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote:
> > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin  
> > > wrote:
> > > >
> > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote:
> > > > > We try to harden virtio device notifications in 8b4ec69d7e09 ("virtio:
> > > > > harden vring IRQ"). It works with the assumption that the driver or
> > > > > core can properly call virtio_device_ready() at the right
> > > > > place. Unfortunately, this seems not to be true and uncovers various
> > > > > bugs in the existing drivers, mainly the issue of using
> > > > > virtio_device_ready() incorrectly.
> > > > >
> > > > > So let's have a Kconfig option and disable it by default. This gives
> > > > > us a breather to fix the drivers, and then we can consider enabling it
> > > > > by default.
> > > > >
> > > > > Signed-off-by: Jason Wang 
> > > >
> > > >
> > > > OK I will queue, but I think the problem is fundamental.
> > >
> > > If I understand correctly, you want some core IRQ work?
> >
> > Yes.
> >
> > > As discussed
> > > before, it doesn't solve all the problems, we still need to do per
> > > driver audit.
> > >
> > > Thanks
> >
> > Maybe, but we don't need to tie things to device_ready then.
> > We can do
> >
> > - disable irqs
> > - device ready
> > - setup everything
> > - enable irqs
> >
> >
> > and this works for most things, the only issue is
> > this deadlocks if "setup everything" waits for interrupts.
> >
> >
> > With the current approach there's really no good time:
> > 1.- setup everything
> > - device ready
> >
> > can cause kicks before device is ready
> >
> > 2.- device ready
> > - setup everything
> >
> > can cause callbacks before setup.
> >
> > So I prefer the 1. and fix the hardening in the core.
> 
> So my question is:
> 
> 1) do similar hardening like config interrupt
> or
> 2) per transport notification work (e.g for PCI core IRQ work)
> 
> 1) seems easier and universal, but we pay a small overhead which could
> be eliminated by the config option.

I doubt 1 is easy and I am not even sure core IRQ changes will help.
My concern with adding overhead is that I'm not sure these are not just
wasted CPU cycles.  We spent a bunch of time on irq hardening and so far
we are still at the "all drivers need to be fixed" stage.

The reason config was kind of easy is that config interrupt is rarely
vital for device function so arbitrarily deferring that does not lead to
deadlocks - what you are trying to do with VQ interrupts is
fundamentally different. Things are especially bad if we just drop
an interrupt but deferring can lead to problems too.

Consider as an example
virtio-net: fix race between ndo_open() and virtio_device_ready()
if you just defer vq interrupts you get deadlocks.



So, thinking about all this, how about a simple per vq flag meaning
"this vq was kicked since reset"?

If driver does not kick then it's not ready to get callbacks, right?

Sounds quite clean, but we need to think through memory ordering
concerns - I guess it's only when we change the value so
	if (!vq->kicked) {
		vq->kicked = true;
		mb();
	}

will do the trick, right?

need to think about the reset path - it already synchronizes callbacks
and already can lose interrupts so we just need to clear vq->kicked
before that, right?


> 2) seems to require more work in the IRQ core and it cannot work for all
> transports (e.g. vDPA would be kind of difficult)
> 
> Thanks

Hmm I don't really get why would it be difficult.
VDPA is mostly PCI isn't it? With PCI both level INT#x and edge MSI
have interrupt masking support.


> >
> >
> > > >
> > > >
> > > > > ---
> > > > > Changes since V2:
> > > > > - Tweak the Kconfig help
> > > > > - Add comment for the read_lock() pairing in virtio_ccw
> > > > > ---
> > > > >  drivers/s390/virtio/virtio_ccw.c |  9 -
> > > > >  drivers/virtio/Kconfig   | 13 +
> > > > >  drivers/virtio/virtio.c  |  2 ++
> > > > >  drivers/virtio/virtio_ring.c | 12 
> > > > >  include/linux/virtio_config.h|  2 ++
> > > > >  5 files changed, 37 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/s390/virtio/virtio_ccw.c 
> > > > > b/drivers/s390/virtio/virtio_ccw.c
> > > > > index 97e51c34e6cf..1f6a358f65f0 100644
> > > > > --- a/drivers/s390/virtio/virtio_ccw.c
> > > > > +++ b/drivers/s390/virtio/virtio_ccw.c
> > > > > @@ -1136,8 +1136,13 @@ static void virtio_ccw_int_handler(struct 
> > > > > ccw_device *cdev,
> > > > >   vcdev->err = -EIO;
> > > > >   }
> > > > >   virtio_ccw_check_activity(vcdev, activity);
> > > > > - /* Interrupts are disabled here */
> > > > > +#ifdef CONFIG_VIRTIO_HARDEN_NOTIFICATION
> > > > > + /*
> > > > > +  * Paired with virtio_ccw_synchronize_cbs() and interrupts are
> > > > > +  

RE: [PATCH v3 5/5] vfio/iommu_type1: Simplify group attachment

2022-06-27 Thread Tian, Kevin
> From: Nicolin Chen 
> Sent: Friday, June 24, 2022 4:00 AM
> 
> Un-inline the domain specific logic from the attach/detach_group ops into
> two paired functions vfio_iommu_alloc_attach_domain() and
> vfio_iommu_detach_destroy_domain() that strictly deal with creating and
> destroying struct vfio_domains.
> 
> Add the logic to check for EMEDIUMTYPE return code of
> iommu_attach_group()
> and avoid the extra domain allocations and attach/detach sequences of the
> old code. This allows properly detecting an actual attach error, like
> -ENOMEM, vs treating all attach errors as an incompatible domain.
> 
> Co-developed-by: Jason Gunthorpe 
> Signed-off-by: Jason Gunthorpe 
> Signed-off-by: Nicolin Chen 

Reviewed-by: Kevin Tian 


[PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Jason Wang
We used to call virtio_device_ready() after netdev registration. This
cause a race between ndo_open() and virtio_device_ready(): if
ndo_open() is called before virtio_device_ready(), the driver may
start to use the device before DRIVER_OK which violates the spec.

Fixing this by switching to use register_netdevice() and protect the
virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
only be called after virtio_device_ready().

Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
Signed-off-by: Jason Wang 
---
 drivers/net/virtio_net.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index db05b5e930be..8a5810bcb839 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
if (vi->has_rss || vi->has_rss_hash_report)
virtnet_init_default_rss(vi);
 
-   err = register_netdev(dev);
+   /* serialize netdev register + virtio_device_ready() with ndo_open() */
+   rtnl_lock();
+
+   err = register_netdevice(dev);
if (err) {
pr_debug("virtio_net: registering device failed\n");
+   rtnl_unlock();
goto free_failover;
}
 
virtio_device_ready(vdev);
 
+   rtnl_unlock();
+
err = virtnet_cpu_notif_add(vi);
if (err) {
pr_debug("virtio_net: registering cpu notifier failed\n");
-- 
2.25.1



Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data

2022-06-27 Thread Michael S. Tsirkin
On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin  wrote:
> >
> > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes from
> > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > >
> > > To avoid breaking the uABI, add a new struct virtio_pci_common_cfg_notify.
> >
> > What exactly is meant by not breaking uABI?
> > Users are supposed to be prepared for struct size to change ... no?
> 
> Not sure, any doc for this?
> 
> Thanks


Well we have this:

The drivers SHOULD only map part of configuration structure
large enough for device operation.  The drivers MUST handle
an unexpectedly large \field{length}, but MAY check that \field{length}
is large enough for device operation.



> 
> >
> >
> > > Since I want to add queue_reset after queue_notify_data, I submitted
> > > this patch first.
> > >
> > > Signed-off-by: Xuan Zhuo 
> > > Acked-by: Jason Wang 
> > > ---
> > >  include/uapi/linux/virtio_pci.h | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/virtio_pci.h 
> > > b/include/uapi/linux/virtio_pci.h
> > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > --- a/include/uapi/linux/virtio_pci.h
> > > +++ b/include/uapi/linux/virtio_pci.h
> > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > >   __le32 queue_used_hi;   /* read-write */
> > >  };
> > >
> > > +struct virtio_pci_common_cfg_notify {
> > > + struct virtio_pci_common_cfg cfg;
> > > +
> > > + __le16 queue_notify_data;   /* read-write */
> > > + __le16 padding;
> > > +};
> > > +
> > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > >  struct virtio_pci_cfg_cap {
> > >   struct virtio_pci_cap cap;
> > > --
> > > 2.31.0
> >



Re: [PATCH] virtio-net: fix race between ndo_open() and virtio_device_ready()

2022-06-27 Thread Jason Wang
On Fri, Jun 17, 2022 at 3:29 PM Jason Wang  wrote:
>
> We used to call virtio_device_ready() after netdev registration. This
> cause a race between ndo_open() and virtio_device_ready(): if
> ndo_open() is called before virtio_device_ready(), the driver may
> start to use the device before DRIVER_OK which violates the spec.
>
> Fixing this by switching to use register_netdevice() and protect the
> virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
> only be called after virtio_device_ready().
>
> Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
> Signed-off-by: Jason Wang 

Ok, I think we're fine with this. So I will repost against -net.

If we spot issues with unregistering, we can use a separate patch for that.

Thanks

> ---
>  drivers/net/virtio_net.c | 8 +++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db05b5e930be..8a5810bcb839 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
> if (vi->has_rss || vi->has_rss_hash_report)
> virtnet_init_default_rss(vi);
>
> -   err = register_netdev(dev);
> +   /* serialize netdev register + virtio_device_ready() with ndo_open() 
> */
> +   rtnl_lock();
> +
> +   err = register_netdevice(dev);
> if (err) {
> pr_debug("virtio_net: registering device failed\n");
> +   rtnl_unlock();
> goto free_failover;
> }
>
> virtio_device_ready(vdev);
>
> +   rtnl_unlock();
> +
> err = virtnet_cpu_notif_add(vi);
> if (err) {
> pr_debug("virtio_net: registering cpu notifier failed\n");
> --
> 2.25.1
>
