Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem
On Tue, Jun 28, 2022 at 11:54:27AM +0800, Jason Wang wrote:
> On Mon, Jun 27, 2022 at 6:01 PM Michael S. Tsirkin wrote:
> >
> > On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote:
> > > On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin
> > > wrote:
> > > >
> > > > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote:
> > > > > We count pinned_vm as follow in vhost-vDPA
> > > > >
> > > > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > > > >         ret = -ENOMEM;
> > > > >         goto unlock;
> > > > > }
> > > > > This means if we have two vDPA devices for the same VM the pages
> > > > > would be counted twice. So we add a tree to save the page that
> > > > > counted and we will not count it again.
> > > > >
> > > > > Add vdpa_mem_tree to saved the mem that already counted.
> > > > > use a hlist to saved the root for vdpa_mem_tree.
> > > > >
> > > > > Signed-off-by: Cindy Lu
> > > > > ---
> > > > >  drivers/vhost/vhost.c | 63 +++
> > > > >  drivers/vhost/vhost.h |  1 +
> > > > >  2 files changed, 64 insertions(+)
> > > > >
> > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > > index 40097826cff0..4ca8b1ed944b 100644
> > > > > --- a/drivers/vhost/vhost.c
> > > > > +++ b/drivers/vhost/vhost.c
> > > > > @@ -32,6 +32,8 @@
> > > > >  #include
> > > > >
> > > > >  #include "vhost.h"
> > > > > +#include <linux/hashtable.h>
> > > > > +#include <linux/jhash.h>
> > > > >
> > > > >  static ushort max_mem_regions = 64;
> > > > >  module_param(max_mem_regions, ushort, 0444);
> > > > > @@ -49,6 +51,14 @@ enum {
> > > > >  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
> > > > >  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
> > > > >
> > > > > +struct vhost_vdpa_rbtree_node {
> > > > > +       struct hlist_node node;
> > > > > +       struct rb_root_cached vdpa_mem_tree;
> > > > > +       struct mm_struct *mm_using;
> > > > > +};
> > > > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8);
> > > > > +int vhost_vdpa_rbtree_hlist_status;
> > > > > +
> > > > >  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
> > > > >  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
> > > > >  {
> > > >
> > > > Are you trying to save some per-mm information here?
> > > > Can't we just add it to mm_struct?
> > > >
> > > yes, this is a per-mm information, but I have checked with jason before,
> > > seems it maybe difficult to change the mm_struct in upstream
> > > so I add an to add a hlist instead
> > > Thanks
> > > Cindy
> >
> > Difficult how?
>
> It is only useful for vDPA probably. Though it could be used by VFIO
> as well, VFIO does pinning/accounting at the container level and it
> has been there for years.

Yes it's been there, I'm not sure this means it's perfect.
Also, rdma guys might be interested too I guess?

> vDPA have an implicit "container" the
> mm_struct, but the accounting is done per device right now.
>
> In the future, when vDPA switches to iommufd, it can be then solved at
> iommufd level.

So is it even worth fixing now?

> And if we do this in mm, it will bring extra overheads.
>
> Thanks

Pointer per mm, not too bad ...

> > You get more scrutiny if you try, for sure,
> > and you need to need to generalize it enough that it looks
> > useful outside the driver. But I guess that's good?
> >
> > > > >
> > > > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> > > > >         dev->mm = NULL;
> > > > >  }
> > > > >
> > > > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm)
> > > > > +{
> > > > > +       struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > > +       struct rb_root_cached *vdpa_tree;
> > > > > +       u32 key;
> > > > > +
> > > > > +       /* No hased table, init one */
> > > > > +       if (vhost_vdpa_rbtree_hlist_status == 0) {
> > > > > +               hash_init(vhost_vdpa_rbtree_hlist);
> > > > > +               vhost_vdpa_rbtree_hlist_status = 1;
> > > > > +       }
> > > > > +
> > > > > +       key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > > +       hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > > > +                              key) {
> > > > > +               if (rbtree_root->mm_using == mm)
> > > > > +                       return &(rbtree_root->vdpa_mem_tree);
> > > > > +       }
> > > > > +       rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL);
> > > > > +       if (!rbtree_root)
> > > > > +               return NULL;
> > > > > +       rbtree_root->mm_using = mm;
> > > > > +       rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED;
> > > > > +       hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key);
> > > > > +       vdpa_tree = &(rbtree_root->vdpa_mem_tree);
> > > > > +       return vdpa_tree;
> > > > > +}
> > > > > +
> > > > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm)
> > > > > +{
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Tue, Jun 28, 2022 at 11:50:37AM +0800, Jason Wang wrote: > On Mon, Jun 27, 2022 at 7:53 PM Michael S. Tsirkin wrote: > > > > On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote: > > > On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote: > > > > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin > > > > > wrote: > > > > > > > > > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > > > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin > > > > > > > wrote: > > > > > > > > > > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which > > > > > > > > > comes from > > > > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > > > > > > > > > > > For not breaks uABI, add a new struct > > > > > > > > > virtio_pci_common_cfg_notify. > > > > > > > > > > > > > > > > What exactly is meant by not breaking uABI? > > > > > > > > Users are supposed to be prepared for struct size to change ... > > > > > > > > no? > > > > > > > > > > > > > > Not sure, any doc for this? > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > Well we have this: > > > > > > > > > > > > The drivers SHOULD only map part of configuration structure > > > > > > large enough for device operation. The drivers MUST handle > > > > > > an unexpectedly large \field{length}, but MAY check that > > > > > > \field{length} > > > > > > is large enough for device operation. > > > > > > > > > > Yes, but that's the device/driver interface. What's done here is the > > > > > userspace/kernel. > > > > > > > > > > Userspace may break if it uses e.g sizeof(struct > > > > > virtio_pci_common_cfg)? > > > > > > > > > > Thanks > > > > > > > > Hmm I guess there's risk... but then how are we going to maintain this > > > > going forward? Add a new struct on any change? 
> > > > > > This is the way we have used it for the past 5 or more years. I don't > > > see why this must be handled in the vq reset feature. > > > > > > >Can we at least > > > > prevent this going forward somehow? > > > > > > Like have some padding? > > > > > > Thanks > > > > Maybe - this is what QEMU does ... > > Do you want this to be addressed in this series (it's already very huge > anyhow)? > > Thanks Let's come up with a solution at least. QEMU does not seem to need the struct. Let's just put it in virtio_pci_modern.h for now then? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I > > > > > > > > > submitted > > > > > > > > > this patch first. > > > > > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > > > > > Acked-by: Jason Wang > > > > > > > > > --- > > > > > > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > > > > > > 1 file changed, 7 insertions(+) > > > > > > > > > > > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > > > > > > b/include/uapi/linux/virtio_pci.h > > > > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > > > > > > --- a/include/uapi/linux/virtio_pci.h > > > > > > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > > > > > > __le32 queue_used_hi; /* read-write */ > > > > > > > > > }; > > > > > > > > > > > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > > > > > > + struct virtio_pci_common_cfg cfg; > > > > > > > > > + > > > > > > > > > + __le16 queue_notify_data; /* read-write */ > > > > > > > > > + __le16 padding; > > > > > > > > > +}; > > > > > > > > > + > > > > > > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > > > > > > struct virtio_pci_cfg_cap { > > > > > > > > > struct virtio_pci_cap cap; > > > > > > > > > -- > > > > > > > > > 2.31.0 > > > > > > > > > > > > > > > > > > > > ___ 
Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
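The uABI concern in the thread above is userspace compiled against sizeof(struct virtio_pci_common_cfg): growing the base struct could change that value, while wrapping new fields in a separate struct leaves it untouched. A sketch with trimmed-down stand-in structs (the field layout here is illustrative, not a copy of the real UAPI header):

```c
#include <stddef.h>
#include <stdint.h>

typedef uint16_t le16;
typedef uint32_t le32;

/* Stand-in for virtio_pci_common_cfg; only a few fields shown. */
struct common_cfg {
    le32 device_feature_select;
    le32 device_feature;
    le16 queue_size;
    le16 queue_msix_vector;
    le32 queue_used_hi;
};

/* The approach taken by the patch: new fields go into a wrapper
 * struct, so sizeof(struct common_cfg) never changes for userspace. */
struct common_cfg_notify {
    struct common_cfg cfg;
    le16 queue_notify_data; /* read-write */
    le16 padding;
};
```

The new field starts exactly where the old struct ends, so old binaries that map or copy sizeof(struct common_cfg) bytes keep working.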
Re: [PATCH V3] virtio: disable notification hardening by default
On Tue, Jun 28, 2022 at 11:49:12AM +0800, Jason Wang wrote: > > Heh. Yea sure. But things work fine for people. What is the chance > > your review found and fixed all driver bugs? > > I don't/can't audit all bugs but the race between open/close against > ready/reset. It looks to me a good chance to fix them all but if you > think differently, let me know > > > After two attempts > > I don't feel like hoping audit will fix all bugs. > > I've started the auditing and have 15+ patches in the queue. (only > covers bluetooth, console, pmem, virtio-net and caif). Spotting the > issue is not hard but the testing, It would take at least the time of > one release to finalize I guess. Absolutely. So I am looking for a way to implement hardening that does not break existing drivers. > > > > > > > > > > > > The reason config was kind of easy is that config interrupt is rarely > > > > vital for device function so arbitrarily deferring that does not lead to > > > > deadlocks - what you are trying to do with VQ interrupts is > > > > fundamentally different. Things are especially bad if we just drop > > > > an interrupt but deferring can lead to problems too. > > > > > > I'm not sure I see the difference, disable_irq() stuffs also delay the > > > interrupt processing until enable_irq(). > > > > > > Absolutely. I am not at all sure disable_irq fixes all problems. > > > > > > > > > > Consider as an example > > > > virtio-net: fix race between ndo_open() and virtio_device_ready() > > > > if you just defer vq interrupts you get deadlocks. > > > > > > > > > > > > > > I don't see a deadlock here, maybe you can show more detail on this? > > > > What I mean is this: if we revert the above commit, things still > > work (out of spec, but still). If we revert and defer interrupts until > > device ready then ndo_open that triggers before device ready deadlocks. > > Ok, I guess you meant on a hypervisor that is strictly written with spec. 
I mean on hypervisor that starts processing queues after getting a kick even without DRIVER_OK. > > > > > > > > > > > > So, thinking about all this, how about a simple per vq flag meaning > > > > "this vq was kicked since reset"? > > > > > > And ignore the notification if vq is not kicked? It sounds like the > > > callback needs to be synchronized with the kick. > > > > Note we only need to synchronize it when it changes, which is > > only during initialization and reset. > > Yes. > > > > > > > > > > > > > If driver does not kick then it's not ready to get callbacks, right? > > > > > > > > Sounds quite clean, but we need to think through memory ordering > > > > concerns - I guess it's only when we change the value so > > > > if (!vq->kicked) { > > > > vq->kicked = true; > > > > mb(); > > > > } > > > > > > > > will do the trick, right? > > > > > > There's no much difference with the existing approach: > > > > > > 1) your proposal implicitly makes callbacks ready in virtqueue_kick() > > > 2) my proposal explicitly makes callbacks ready via virtio_device_ready() > > > > > > Both require careful auditing of all the existing drivers to make sure > > > no kick before DRIVER_OK. > > > > Jason, kick before DRIVER_OK is out of spec, sure. But it is unrelated > > to hardening > > Yes but with your proposal, it seems to couple kick with DRIVER_OK somehow. I don't see how - my proposal ignores DRIVER_OK issues. > > and in absence of config interrupts is generally easily > > fixed just by sticking virtio_device_ready early in initialization. > > So if the kick is done before the subsystem registration, there's > still a window in the middle (assuming we stick virtio_device_ready() > early): > > virtio_device_ready() > virtqueue_kick() > /* the window */ > subsystem_registration() Absolutely, however, I do not think we really have many such drivers since this has been known as a wrong thing to do since the beginning. Want to try to find any? I couldn't ... 
except maybe bluetooth but that's just maintainer nacking fixes saying he'll fix it his way ... > And during remove(), we get another window: > > subsysrem_unregistration() > /* the window */ > virtio_device_reset() Same here. > if we do reset before, we may end up with other issues like deadlock. > > So I think it's probably very hard to fix issues only at the virtio > core since we need subsystem specific knowledge to make sure > everything is synchronized. How many drivers do you see with the issue above? As compared to yours which has 16 patches just in your queue. > > With the current approach one has to *also* not do virtio_device_ready > > too early - and it's really tricky. > > Not sure how much we differ here, during the probe driver can just > place the virtio_device_ready() after the kick. Not so easy. For example, consider virtio net without your locking change. kick is part of a command vq operation which does not directly have anything to do with probe. Same for many
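The "this vq was kicked since reset" flag proposed above can be sketched as follows. This is a toy single-threaded model with invented names (the real change would live in drivers/virtio/virtio_ring.c); the memory-order choices mirror the mb() discussion: the flag only changes at kick and reset, so only those transitions need ordering.

```c
#include <stdatomic.h>
#include <stdbool.h>

struct vq_model {
    atomic_bool kicked;
    int handled; /* callbacks actually delivered */
    int dropped; /* callbacks ignored because no kick happened yet */
};

void vq_reset(struct vq_model *vq)
{
    vq->handled = vq->dropped = 0;
    atomic_store_explicit(&vq->kicked, false, memory_order_release);
}

void vq_kick(struct vq_model *vq)
{
    /* Release pairs with the acquire in vq_interrupt(), so ring state
     * written before the kick is visible to the callback handler. */
    if (!atomic_load_explicit(&vq->kicked, memory_order_relaxed))
        atomic_store_explicit(&vq->kicked, true, memory_order_release);
}

void vq_interrupt(struct vq_model *vq)
{
    if (!atomic_load_explicit(&vq->kicked, memory_order_acquire)) {
        vq->dropped++; /* driver never kicked: not ready for callbacks */
        return;
    }
    vq->handled++;
}
```

Under this scheme a spurious or malicious notification before the first kick is simply ignored, without tying anything to DRIVER_OK.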
Re: [PATCH] Bluetooth: virtio_bt: fix device removal
On Mon, Jun 13, 2022 at 02:58:59AM -0400, Michael S. Tsirkin wrote: > On Fri, Jan 14, 2022 at 03:12:47PM -0500, Michael S. Tsirkin wrote: > > On Thu, Dec 16, 2021 at 08:58:31PM +0100, Marcel Holtmann wrote: > > > Hi Michael, > > > > > > > Device removal is clearly out of virtio spec: it attempts to remove > > > > unused buffers from a VQ before invoking device reset. To fix, make > > > > open/close NOPs and do all cleanup/setup in probe/remove. > > > > > > so the virtbt_{open,close} as NOP is not really what a driver is > > > suppose > > > to be doing. These are transport enable/disable callbacks from the BT > > > Core towards the driver. It maps to a device being enabled/disabled > > > by > > > something like bluetoothd for example. So if disabled, I expect that > > > no > > > resources/queues are in use. > > > > > > Maybe I misunderstand the virtio spec in that regard, but I would > > > like > > > to keep this fundamental concept of a Bluetooth driver. It does work > > > with all other transports like USB, SDIO, UART etc. > > > > > > > The cost here is a single skb wasted on an unused bt device - which > > > > seems modest. > > > > > > There should be no buffer used if the device is powered off. We also > > > don’t > > > have any USB URBs in-flight if the transport is not active. > > > > > > > NB: with this fix in place driver still suffers from a race > > > > condition if > > > > an interrupt triggers while device is being reset. Work on a fix for > > > > that issue is in progress. > > > > > > In the virtbt_close() callback we should deactivate all interrupts. > > > > > > Regards > > > > > > Marcel > > > >>> > > > >>> So Marcel, do I read it right that you are working on a fix > > > >>> and I can drop this patch for now? > > > >> > > > >> ping > > > > > > > > > > > > If I don't hear otherwise I'll queue my version - it might not > > > > be ideal but it at least does not violate the spec. 
> > > > We can work on not allocating/freeing buffers later > > > > as appropriate. > > > > > > I have a patch, but it is not fully tested yet. > > > > > > Regards > > > > > > Marcel > > > > ping > > > > it's been a month ... > > > > I'm working on cleaning up module/device removal in virtio and bt > > is kind of sticking out. > > I am inclined to make this driver depend on BROKEN for now. > Any objections? OK patch incoming. > > > -- > > MST
Re: [PATCH 2/3] vdpa_sim_blk: limit the number of request handled per batch
On Thu, Jun 23, 2022 at 4:58 PM Stefano Garzarella wrote:
>
> On Thu, Jun 23, 2022 at 11:50:22AM +0800, Jason Wang wrote:
> >On Wed, Jun 22, 2022 at 12:09 AM Stefano Garzarella
> >wrote:
> >>
> >> Limit the number of requests (4 per queue as for vdpa_sim_net) handled
> >> in a batch to prevent the worker from using the CPU for too long.
> >>
> >> Suggested-by: Eugenio Pérez
> >> Signed-off-by: Stefano Garzarella
> >> ---
> >>  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 15 ++-
> >>  1 file changed, 14 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> index a83a5c76f620..ac86478845b6 100644
> >> --- a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> >> @@ -197,6 +197,7 @@ static bool vdpasim_blk_handle_req(struct vdpasim *vdpasim,
> >>  static void vdpasim_blk_work(struct work_struct *work)
> >>  {
> >>         struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> >> +       bool reschedule = false;
> >>         int i;
> >>
> >>         spin_lock(&vdpasim->lock);
> >> @@ -206,11 +207,15 @@ static void vdpasim_blk_work(struct work_struct *work)
> >>
> >>         for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> >>                 struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> >> +               bool vq_work = true;
> >> +               int reqs = 0;
> >>
> >>                 if (!vq->ready)
> >>                         continue;
> >>
> >> -               while (vdpasim_blk_handle_req(vdpasim, vq)) {
> >> +               while (vq_work) {
> >> +                       vq_work = vdpasim_blk_handle_req(vdpasim, vq);
> >> +
> >
> >Is it better to check and exit the loop early here?
>
> Maybe, but I'm not sure.
>
> In vdpa_sim_net we call vringh_complete_iotlb() and send notification
> also in the error path,

Looks not?

        read = vringh_iov_pull_iotlb(&cvq->vring, &cvq->in_iov, &ctrl,
                                     sizeof(ctrl));
        if (read != sizeof(ctrl))
                break;

We break the loop.

Thanks

> so I thought was better to send notification
> also when vdpasim_blk_handle_req() return false, since we will update
> the used.idx.
> > However, I don't think it's a common path, so if you think it's better > to exit the loop early, I can do it. > > Thanks, > Stefano
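The batching limit under review — handle at most a few requests per worker invocation and reschedule when work remains — can be sketched like this. Names and the counter-based queue are invented for illustration; the real code loops over vringh descriptors.

```c
#include <stdbool.h>

/* Handle at most REQS_PER_BATCH requests per worker invocation, as in
 * the patch (4 per queue, same as vdpa_sim_net), instead of letting
 * the worker monopolize the CPU while requests keep arriving. */
#define REQS_PER_BATCH 4

struct sim_vq {
    int pending; /* stands in for outstanding descriptors */
};

/* Returns true when the worker should be rescheduled. */
bool blk_work_once(struct sim_vq *vq)
{
    int reqs = 0;

    while (vq->pending > 0) {
        vq->pending--; /* "handle" one request */
        if (++reqs >= REQS_PER_BATCH && vq->pending > 0)
            return true; /* budget exhausted, requeue the work */
    }
    return false;
}
```

With 10 pending requests and a budget of 4, the worker runs three times (4 + 4 + 2) instead of once, giving other work a chance to run in between.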
Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem
On Mon, Jun 27, 2022 at 6:01 PM Michael S. Tsirkin wrote:
>
> On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote:
> > On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin wrote:
> > >
> > > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote:
> > > > We count pinned_vm as follow in vhost-vDPA
> > > >
> > > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
> > > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
> > > >         ret = -ENOMEM;
> > > >         goto unlock;
> > > > }
> > > > This means if we have two vDPA devices for the same VM the pages
> > > > would be counted twice. So we add a tree to save the page that
> > > > counted and we will not count it again.
> > > >
> > > > Add vdpa_mem_tree to saved the mem that already counted.
> > > > use a hlist to saved the root for vdpa_mem_tree.
> > > >
> > > > Signed-off-by: Cindy Lu
> > > > ---
> > > >  drivers/vhost/vhost.c | 63 +++
> > > >  drivers/vhost/vhost.h |  1 +
> > > >  2 files changed, 64 insertions(+)
> > > >
> > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > > index 40097826cff0..4ca8b1ed944b 100644
> > > > --- a/drivers/vhost/vhost.c
> > > > +++ b/drivers/vhost/vhost.c
> > > > @@ -32,6 +32,8 @@
> > > >  #include
> > > >
> > > >  #include "vhost.h"
> > > > +#include <linux/hashtable.h>
> > > > +#include <linux/jhash.h>
> > > >
> > > >  static ushort max_mem_regions = 64;
> > > >  module_param(max_mem_regions, ushort, 0444);
> > > > @@ -49,6 +51,14 @@ enum {
> > > >  #define vhost_used_event(vq) ((__virtio16 __user *)&vq->avail->ring[vq->num])
> > > >  #define vhost_avail_event(vq) ((__virtio16 __user *)&vq->used->ring[vq->num])
> > > >
> > > > +struct vhost_vdpa_rbtree_node {
> > > > +       struct hlist_node node;
> > > > +       struct rb_root_cached vdpa_mem_tree;
> > > > +       struct mm_struct *mm_using;
> > > > +};
> > > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8);
> > > > +int vhost_vdpa_rbtree_hlist_status;
> > > > +
> > > >  #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY
> > > >  static void vhost_disable_cross_endian(struct vhost_virtqueue *vq)
> > > >  {
> > >
> > > Are you trying to save some per-mm information here?
> > > Can't we just add it to mm_struct?
> > >
> > yes, this is a per-mm information, but I have checked with jason before,
> > seems it maybe difficult to change the mm_struct in upstream
> > so I add an to add a hlist instead
> > Thanks
> > Cindy
>
> Difficult how?

It is only useful for vDPA probably. Though it could be used by VFIO
as well, VFIO does pinning/accounting at the container level and it
has been there for years. vDPA have an implicit "container" the
mm_struct, but the accounting is done per device right now.

In the future, when vDPA switches to iommufd, it can be then solved at
iommufd level.

And if we do this in mm, it will bring extra overheads.

Thanks

> You get more scrutiny if you try, for sure,
> and you need to need to generalize it enough that it looks
> useful outside the driver. But I guess that's good?
>
> > > >
> > > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev *dev)
> > > >         dev->mm = NULL;
> > > >  }
> > > >
> > > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm)
> > > > +{
> > > > +       struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > +       struct rb_root_cached *vdpa_tree;
> > > > +       u32 key;
> > > > +
> > > > +       /* No hased table, init one */
> > > > +       if (vhost_vdpa_rbtree_hlist_status == 0) {
> > > > +               hash_init(vhost_vdpa_rbtree_hlist);
> > > > +               vhost_vdpa_rbtree_hlist_status = 1;
> > > > +       }
> > > > +
> > > > +       key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > +       hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > > +                              key) {
> > > > +               if (rbtree_root->mm_using == mm)
> > > > +                       return &(rbtree_root->vdpa_mem_tree);
> > > > +       }
> > > > +       rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL);
> > > > +       if (!rbtree_root)
> > > > +               return NULL;
> > > > +       rbtree_root->mm_using = mm;
> > > > +       rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED;
> > > > +       hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key);
> > > > +       vdpa_tree = &(rbtree_root->vdpa_mem_tree);
> > > > +       return vdpa_tree;
> > > > +}
> > > > +
> > > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm)
> > > > +{
> > > > +       struct vhost_vdpa_rbtree_node *rbtree_root = NULL;
> > > > +       u32 key;
> > > > +
> > > > +       key = jhash_1word((u64)mm, JHASH_INITVAL);
> > > > +
> > > > +       /* No hased table, init one */
> > > > +       hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node,
> > > > +                              key) {
> > > > +               if (rbtree_root->mm_using == mm) {
> > > > +                       hash_del(&rbtree_root->node);
> > > > +                       kfree(rbtree_root);
> > > > +               }
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 7:53 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote: > > On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin wrote: > > > > > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote: > > > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin > > > > > > wrote: > > > > > > > > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which > > > > > > > > comes from > > > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > > > > > > > > > For not breaks uABI, add a new struct > > > > > > > > virtio_pci_common_cfg_notify. > > > > > > > > > > > > > > What exactly is meant by not breaking uABI? > > > > > > > Users are supposed to be prepared for struct size to change ... > > > > > > > no? > > > > > > > > > > > > Not sure, any doc for this? > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > Well we have this: > > > > > > > > > > The drivers SHOULD only map part of configuration structure > > > > > large enough for device operation. The drivers MUST handle > > > > > an unexpectedly large \field{length}, but MAY check that > > > > > \field{length} > > > > > is large enough for device operation. > > > > > > > > Yes, but that's the device/driver interface. What's done here is the > > > > userspace/kernel. > > > > > > > > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)? > > > > > > > > Thanks > > > > > > Hmm I guess there's risk... but then how are we going to maintain this > > > going forward? Add a new struct on any change? > > > > This is the way we have used it for the past 5 or more years. I don't > > see why this must be handled in the vq reset feature. 
> > > > >Can we at least > > > prevent this going forward somehow? > > > > Like have some padding? > > > > Thanks > > Maybe - this is what QEMU does ... Do you want this to be addressed in this series (it's already very huge anyhow)? Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I > > > > > > > > submitted > > > > > > > > this patch first. > > > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > > > > Acked-by: Jason Wang > > > > > > > > --- > > > > > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > > > > > 1 file changed, 7 insertions(+) > > > > > > > > > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > > > > > b/include/uapi/linux/virtio_pci.h > > > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > > > > > --- a/include/uapi/linux/virtio_pci.h > > > > > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > > > > > __le32 queue_used_hi; /* read-write */ > > > > > > > > }; > > > > > > > > > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > > > > > + struct virtio_pci_common_cfg cfg; > > > > > > > > + > > > > > > > > + __le16 queue_notify_data; /* read-write */ > > > > > > > > + __le16 padding; > > > > > > > > +}; > > > > > > > > + > > > > > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > > > > > struct virtio_pci_cfg_cap { > > > > > > > > struct virtio_pci_cap cap; > > > > > > > > -- > > > > > > > > 2.31.0 > > > > > > > > > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH V3] virtio: disable notification hardening by default
On Mon, Jun 27, 2022 at 5:52 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 04:11:18PM +0800, Jason Wang wrote: > > On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin wrote: > > > > > > On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote: > > > > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote: > > > > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin > > > > > > wrote: > > > > > > > > > > > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote: > > > > > > > > We try to harden virtio device notifications in 8b4ec69d7e09 > > > > > > > > ("virtio: > > > > > > > > harden vring IRQ"). It works with the assumption that the > > > > > > > > driver or > > > > > > > > core can properly call virtio_device_ready() at the right > > > > > > > > place. Unfortunately, this seems to be not true and uncover > > > > > > > > various > > > > > > > > bugs of the existing drivers, mainly the issue of using > > > > > > > > virtio_device_ready() incorrectly. > > > > > > > > > > > > > > > > So let's having a Kconfig option and disable it by default. It > > > > > > > > gives > > > > > > > > us a breath to fix the drivers and then we can consider to > > > > > > > > enable it > > > > > > > > by default. > > > > > > > > > > > > > > > > Signed-off-by: Jason Wang > > > > > > > > > > > > > > > > > > > > > OK I will queue, but I think the problem is fundamental. > > > > > > > > > > > > If I understand correctly, you want some core IRQ work? > > > > > > > > > > Yes. > > > > > > > > > > > As discussed > > > > > > before, it doesn't solve all the problems, we still need to do per > > > > > > driver audit. > > > > > > > > > > > > Thanks > > > > > > > > > > Maybe, but we don't need to tie things to device_ready then. 
> > > > > We can do > > > > > > > > > > - disable irqs > > > > > - device ready > > > > > - setup everything > > > > > - enable irqs > > > > > > > > > > > > > > > and this works for most things, the only issue is > > > > > this deadlocks if "setup everything" waits for interrupts. > > > > > > > > > > > > > > > With the current approach there's really no good time: > > > > > 1.- setup everything > > > > > - device ready > > > > > > > > > > can cause kicks before device is ready > > > > > > > > > > 2.- device ready > > > > > - setup everything > > > > > > > > > > can cause callbacks before setup. > > > > > > > > > > So I prefer the 1. and fix the hardening in the core. > > > > > > > > So my question is: > > > > > > > > 1) do similar hardening like config interrupt > > > > or > > > > 2) per transport notification work (e.g for PCI core IRQ work) > > > > > > > > 1) seems easier and universal, but we pay little overhead which could > > > > be eliminated by the config option. > > > > > > I doubt 1 is easy and I am not even sure core IRQ changes will help. > > > > Core IRQ won't help in 1), the changes are limited in the virtio. > > > > > My concern with adding overhead is that I'm not sure these are not just > > > wasted CPU cycles. We spent a bunch of time on irq hardening and so far > > > we are still at the "all drivers need to be fixed" stage. > > > > It's not the fault of hardening but the drivers. The hardening just > > expose those bugs. > > Heh. Yea sure. But things work fine for people. What is the chance > your review found and fixed all driver bugs? I don't/can't audit all bugs but the race between open/close against ready/reset. It looks to me a good chance to fix them all but if you think differently, let me know > After two attempts > I don't feel like hoping audit will fix all bugs. I've started the auditing and have 15+ patches in the queue. (only covers bluetooth, console, pmem, virtio-net and caif). 
Spotting the issue is not hard but the testing, It would take at least the time of one release to finalize I guess. > > > > > > > > The reason config was kind of easy is that config interrupt is rarely > > > vital for device function so arbitrarily deferring that does not lead to > > > deadlocks - what you are trying to do with VQ interrupts is > > > fundamentally different. Things are especially bad if we just drop > > > an interrupt but deferring can lead to problems too. > > > > I'm not sure I see the difference, disable_irq() stuffs also delay the > > interrupt processing until enable_irq(). > > > Absolutely. I am not at all sure disable_irq fixes all problems. > > > > > > > Consider as an example > > > virtio-net: fix race between ndo_open() and virtio_device_ready() > > > if you just defer vq interrupts you get deadlocks. > > > > > > > > > > I don't see a deadlock here, maybe you can show more detail on this? > > What I mean is this: if we revert the above commit, things still > work (out of spec, but still). If we revert and defer interrupts until > device ready then ndo_open that triggers before device ready deadlocks. Ok, I guess you meant on a hypervisor that is
Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members
On Mon, Jun 27, 2022 at 08:27:37PM +0200, Daniel Borkmann wrote: > On 6/27/22 8:04 PM, Gustavo A. R. Silva wrote: > > There is a regular need in the kernel to provide a way to declare > > having a dynamically sized set of trailing elements in a structure. > > Kernel code should always use “flexible array members”[1] for these > > cases. The older style of one-element or zero-length arrays should > > no longer be used[2]. > > > > This code was transformed with the help of Coccinelle: > > (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file > > script.cocci --include-headers --dir . > output.patch) > > > > @@ > > identifier S, member, array; > > type T1, T2; > > @@ > > > > struct S { > >... > >T1 member; > >T2 array[ > > - 0 > >]; > > }; > > > > -fstrict-flex-arrays=3 is coming and we need to land these changes > > to prevent issues like these in the short future: > > > > ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; > > destination buffer has size 0, > > but the source string has length 2 (including NUL byte) [-Wfortify-source] > > strcpy(de3->name, "."); > > ^ > > > > Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If > > this breaks anything, we can use a union with a new member name. > > > > [1] https://en.wikipedia.org/wiki/Flexible_array_member > > [2] > > https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays > > > > Link: https://github.com/KSPP/linux/issues/78 > > Build-tested-by: > > https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/ > > Signed-off-by: Gustavo A. R. Silva > > --- > > Hi all! > > > > JFYI: I'm adding this to my -next tree. :) > > Fyi, this breaks BPF CI: > > https://github.com/kernel-patches/bpf/runs/7078719372?check_suite_focus=true > > [...] 
> progs/map_ptr_kern.c:314:26: error: field 'trie_key' with variable sized > type 'struct bpf_lpm_trie_key' not at the end of a struct or class is a GNU > extension [-Werror,-Wgnu-variable-sized-type-not-at-end] > struct bpf_lpm_trie_key trie_key; > ^ This will break the rdma-core userspace as well, with a similar error: /usr/bin/clang-13 -DVERBS_DEBUG -Dibverbs_EXPORTS -Iinclude -I/usr/include/libnl3 -I/usr/include/drm -g -O2 -fdebug-prefix-map=/__w/1/s=. -fstack-protector-strong -Wformat -Werror=format-security -Wdate-time -D_FORTIFY_SOURCE=2 -Wmissing-prototypes -Wmissing-declarations -Wwrite-strings -Wformat=2 -Wcast-function-type -Wformat-nonliteral -Wdate-time -Wnested-externs -Wshadow -Wstrict-prototypes -Wold-style-definition -Werror -Wredundant-decls -g -fPIC -std=gnu11 -MD -MT libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o -MF libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o.d -o libibverbs/CMakeFiles/ibverbs.dir/cmd_flow.c.o -c ../libibverbs/cmd_flow.c In file included from ../libibverbs/cmd_flow.c:33: In file included from include/infiniband/cmd_write.h:36: In file included from include/infiniband/cmd_ioctl.h:41: In file included from include/infiniband/verbs.h:48: In file included from include/infiniband/verbs_api.h:66: In file included from include/infiniband/ib_user_ioctl_verbs.h:38: include/rdma/ib_user_verbs.h:436:34: error: field 'base' with variable sized type 'struct ib_uverbs_create_cq_resp' not at the end of a struct or class is a GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end] struct ib_uverbs_create_cq_resp base; ^ include/rdma/ib_user_verbs.h:644:34: error: field 'base' with variable sized type 'struct ib_uverbs_create_qp_resp' not at the end of a struct or class is a GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end] struct ib_uverbs_create_qp_resp base; Which is why I gave up trying to change these.. Though maybe we could just switch off -Wgnu-variable-sized-type-not-at-end during configuration ? 
Jason ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
RE: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members
Gustavo A. R. Silva wrote: > There is a regular need in the kernel to provide a way to declare > having a dynamically sized set of trailing elements in a structure. > Kernel code should always use “flexible array members”[1] for these > cases. The older style of one-element or zero-length arrays should > no longer be used[2]. > > This code was transformed with the help of Coccinelle: > (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file > script.cocci --include-headers --dir . > output.patch) > > @@ > identifier S, member, array; > type T1, T2; > @@ > > struct S { > ... > T1 member; > T2 array[ > - 0 > ]; > }; > > -fstrict-flex-arrays=3 is coming and we need to land these changes > to prevent issues like these in the short future: > > ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination > buffer has size 0, > but the source string has length 2 (including NUL byte) [-Wfortify-source] > strcpy(de3->name, "."); > ^ > > Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If > this breaks anything, we can use a union with a new member name. > > [1] https://en.wikipedia.org/wiki/Flexible_array_member > [2] > https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays > > Link: https://github.com/KSPP/linux/issues/78 > Build-tested-by: > https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/ > Signed-off-by: Gustavo A. R. Silva > --- > Hi all! > > JFYI: I'm adding this to my -next tree. :) > [..] > include/uapi/linux/ndctl.h| 10 +-- For ndctl.h Acked-by: Dan Williams ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH 1/1] Create debugfs file with virtio balloon usage information
On Mon, Jun 27, 2022 at 07:19:09PM +, Alexander Atanasov wrote: > Allow the guest to know how much it is ballooned by the host. > It is useful when debugging out of memory conditions. > > When host gets back memory from the guest it is accounted > as used memory in the guest but the guest have no way to know > how much it is actually ballooned. > > No pretty printing and fixed as per coding style. > > Signed-off-by: Alexander Atanasov > --- > drivers/virtio/virtio_balloon.c | 71 + > 1 file changed, 71 insertions(+) > > - Leave pretty print to userspace > - Fixed coding style > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index b9737da6c4dd..1bb6a6d0e37b 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -731,6 +732,71 @@ static void report_free_page_func(struct work_struct > *work) > } > } > > +/* > + * DEBUGFS Interface > + */ > +#ifdef CONFIG_DEBUG_FS > + > +/** > + * virtio_balloon_debug_show - shows statistics of balloon operations. > + * @f: pointer to the seq_file. > + * @offset: ignored. > + * > + * Provides the statistics that can be accessed in virtio-balloon in the > debugfs. > + * > + * Return: zero on success or an error code. > + */ > +static int virtio_balloon_debug_show(struct seq_file *f, void *offset) > +{ > + struct virtio_balloon *b = f->private; > + u32 num_pages; > + struct sysinfo i; > + > + si_meminfo(); > + > + seq_printf(f, "%-22s: %llx\n", "capabilities", b->vdev->features); why do we need this in debugfs? Isn't this available in sysfs already? > + seq_printf(f, "%-22s: %d\n", "page_size", 4096); I suspect this function doesn't work properly when page size is not 4K. 
> + > + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual, > + _pages); > + /* Memory returned to host or amount we can inflate if possible */ > + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages); I don't really get the comment here. > + > + /* Total Memory for the guest from host */ > + seq_printf(f, "%-22s: %lu\n", "total_pages", i.totalram); > + > + /* Current memory for the guest */ > + seq_printf(f, "%-22s: %lu\n", "current_pages", i.totalram - num_pages); Are you sure these are in units of 4Kbyte pages? > + return 0; > +} > + > +DEFINE_SHOW_ATTRIBUTE(virtio_balloon_debug); > + > +static void virtio_balloon_debugfs_init(struct virtio_balloon *b) > +{ > + debugfs_create_file("virtio-balloon", 0444, NULL, b, > + _balloon_debug_fops); > +} > + > +static void virtio_balloon_debugfs_exit(struct virtio_balloon *b) > +{ > + debugfs_remove(debugfs_lookup("virtio-balloon", NULL)); > +} > + > +#else > + > +static inline void virtio_balloon_debugfs_init(struct virtio_balloon *b) > +{ > +} > + > +static inline void virtio_balloon_debugfs_exit(struct virtio_balloon *b) > +{ > +} > + > +#endif /* CONFIG_DEBUG_FS */ > + > #ifdef CONFIG_BALLOON_COMPACTION > /* > * virtballoon_migratepage - perform the balloon page migration on behalf of > @@ -1019,6 +1085,9 @@ static int virtballoon_probe(struct virtio_device *vdev) > > if (towards_target(vb)) > virtballoon_changed(vdev); > + > + virtio_balloon_debugfs_init(vb); > + > return 0; > > out_unregister_oom: > @@ -1065,6 +1134,8 @@ static void virtballoon_remove(struct virtio_device > *vdev) > { > struct virtio_balloon *vb = vdev->priv; > > + virtio_balloon_debugfs_exit(vb); > + > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) > page_reporting_unregister(>pr_dev_info); > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > -- > 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH][next] treewide: uapi: Replace zero-length arrays with flexible-array members
On 6/27/22 8:04 PM, Gustavo A. R. Silva wrote: There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. This code was transformed with the help of Coccinelle: (linux-5.19-rc2$ spatch --jobs $(getconf _NPROCESSORS_ONLN) --sp-file script.cocci --include-headers --dir . > output.patch) @@ identifier S, member, array; type T1, T2; @@ struct S { ... T1 member; T2 array[ - 0 ]; }; -fstrict-flex-arrays=3 is coming and we need to land these changes to prevent issues like these in the short future: ../fs/minix/dir.c:337:3: warning: 'strcpy' will always overflow; destination buffer has size 0, but the source string has length 2 (including NUL byte) [-Wfortify-source] strcpy(de3->name, "."); ^ Since these are all [0] to [] changes, the risk to UAPI is nearly zero. If this breaks anything, we can use a union with a new member name. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.16/process/deprecated.html#zero-length-and-one-element-arrays Link: https://github.com/KSPP/linux/issues/78 Build-tested-by: https://lore.kernel.org/lkml/62b675ec.wkx6aoz6cbe71vtf%25...@intel.com/ Signed-off-by: Gustavo A. R. Silva --- Hi all! JFYI: I'm adding this to my -next tree. :) Fyi, this breaks BPF CI: https://github.com/kernel-patches/bpf/runs/7078719372?check_suite_focus=true [...] progs/map_ptr_kern.c:314:26: error: field 'trie_key' with variable sized type 'struct bpf_lpm_trie_key' not at the end of a struct or class is a GNU extension [-Werror,-Wgnu-variable-sized-type-not-at-end] struct bpf_lpm_trie_key trie_key; ^ 1 error generated. make: *** [Makefile:519: /tmp/runner/work/bpf/bpf/tools/testing/selftests/bpf/map_ptr_kern.o] Error 1 make: *** Waiting for unfinished jobs Error: Process completed with exit code 2. 
Re: [GIT PULL] virtio,vdpa: fixes
The pull request you sent on Mon, 27 Jun 2022 11:50:24 -0400: > https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/941e3e7912696b9fbe3586083a7c2e102cee7a87 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/27/22 10:27, Guo Hui wrote: The instructions assigned to the vcpu_is_preempted function parameter in the X86 architecture physical machine are redundant instructions, causing the multi-core performance of Unixbench to drop by about 4% to 5%. The C function is as follows: static bool vcpu_is_preempted(long vcpu); The parameter 'vcpu' in the function osq_lock that calls the function vcpu_is_preempted is assigned as follows: The C code is in the function node_cpu: cpu = node->cpu - 1; The instructions corresponding to the C code are: mov 0x14(%rax),%edi sub $0x1,%edi The above instructions are unnecessary in the X86 Native operating environment, causing high cache-misses and degrading performance. This patch uses static_key to not execute this instruction in the Native runtime environment. The patch effect is as follows two machines, Unixbench runs with full core score: 1. Machine configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz CPU core: 40 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 948326591.2 81261.9 Double-Precision Whetstone 55.0 211986.3 38543.0 Execl Throughput 43.0 43453.2 10105.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 438936.2 1108.4 File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2 File Copy 4096 bufsize 8000 maxblocks 5800.01534674.7 2646.0 Pipe Throughput 12440.0 46482107.6 37365.0 Pipe-based Context Switching 4000.01915094.2 4787.7 Process Creation126.0 85442.2 6781.1 Shell Scripts (1 concurrent) 42.4 69400.7 16368.1 Shell Scripts (8 concurrent) 6.0 8877.2 14795.3 System Call Overhead 15000.04714906.1 3143.3 System Benchmarks Index Score7923.3 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 947032915.5 81151.1 Double-Precision Whetstone 55.0 211971.2 38540.2 Execl Throughput 43.0 45054.8 10477.9 File Copy 1024 bufsize 2000 maxblocks 3960.0 515024.9 
1300.6 File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3 File Copy 4096 bufsize 8000 maxblocks 5800.01679995.9 2896.5 Pipe Throughput 12440.0 46466394.2 37352.4 Pipe-based Context Switching 4000.01898221.4 4745.6 Process Creation126.0 85653.1 6797.9 Shell Scripts (1 concurrent) 42.4 69437.3 16376.7 Shell Scripts (8 concurrent) 6.0 8898.9 14831.4 System Call Overhead 15000.04658746.7 3105.8 System Benchmarks Index Score8248.8 2. Machine configuration: Hygon C86 7185 32-core Processor CPU core: 128 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4 Double-Precision Whetstone 55.0 438969.9 79812.7 Execl Throughput 43.0 10108.6 2350.8 File Copy 1024 bufsize 2000 maxblocks 3960.0 275892.8696.7 File Copy 256 bufsize 500 maxblocks1655.0 72082.7435.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 925043.4 1594.9 Pipe Throughput 12440.0 118905512.5 95583.2 Pipe-based Context Switching 4000.07820945.7 19552.4 Process Creation126.0 31233.3 2478.8 Shell Scripts (1 concurrent) 42.4 49042.8 11566.7 Shell Scripts (8 concurrent) 6.0 6656.0 11093.3 System Call Overhead 15000.06816047.5 4544.0 System Benchmarks Index Score7756.6 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8 Double-Precision Whetstone 55.0 451847.2 82154.0 Execl Throughput 43.0 10595.1 2464.0 File Copy 1024 bufsize 2000 maxblocks 3960.0
[GIT PULL] virtio,vdpa: fixes
The following changes since commit a111daf0c53ae91e71fd2bfe7497862d14132e3e: Linux 5.19-rc3 (2022-06-19 15:06:47 -0500) are available in the Git repository at: https://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus for you to fetch changes up to c7cc29aaebf9eaa543b4c70801e0ecef1101b3c8: virtio_ring: make vring_create_virtqueue_split prettier (2022-06-27 08:05:35 -0400) virtio,vdpa: fixes Fixes all over the place, most notably we are disabling IRQ hardening (again!). Signed-off-by: Michael S. Tsirkin Bo Liu (1): virtio: Remove unnecessary variable assignments Deming Wang (1): virtio_ring: make vring_create_virtqueue_split prettier Eli Cohen (2): vdpa/mlx5: Update Control VQ callback information vdpa/mlx5: Initialize CVQ vringh only once Jason Wang (3): virtio: disable notification hardening by default virtio-net: fix race between ndo_open() and virtio_device_ready() caif_virtio: fix race between virtio_device_ready() and ndo_open() Parav Pandit (1): vduse: Tie vduse mgmtdev and its device Stefano Garzarella (1): vhost-vdpa: call vhost_vdpa_cleanup during the release Stephan Gerhold (2): virtio_mmio: Add missing PM calls to freeze/restore virtio_mmio: Restore guest page size on resume huangjie.albert (1): virtio_ring : keep used_wrap_counter in vq->last_used_idx drivers/net/caif/caif_virtio.c | 10 +++- drivers/net/virtio_net.c | 8 ++- drivers/s390/virtio/virtio_ccw.c | 9 +++- drivers/vdpa/mlx5/net/mlx5_vnet.c | 33 - drivers/vdpa/vdpa_user/vduse_dev.c | 60 ++- drivers/vhost/vdpa.c | 2 +- drivers/virtio/Kconfig | 13 + drivers/virtio/virtio.c| 2 + drivers/virtio/virtio_mmio.c | 26 ++ drivers/virtio/virtio_pci_modern_dev.c | 2 - drivers/virtio/virtio_ring.c | 89 +++--- include/linux/virtio_config.h | 2 + 12 files changed, 187 insertions(+), 69 deletions(-) ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v4] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 27.06.22 16:27, Guo Hui wrote: The instructions assigned to the vcpu_is_preempted function parameter in the X86 architecture physical machine are redundant instructions, causing the multi-core performance of Unixbench to drop by about 4% to 5%. The C function is as follows: static bool vcpu_is_preempted(long vcpu); The parameter 'vcpu' in the function osq_lock that calls the function vcpu_is_preempted is assigned as follows: The C code is in the function node_cpu: cpu = node->cpu - 1; The instructions corresponding to the C code are: mov 0x14(%rax),%edi sub $0x1,%edi The above instructions are unnecessary in the X86 Native operating environment, causing high cache-misses and degrading performance. This patch uses static_key to not execute this instruction in the Native runtime environment. The patch effect is as follows two machines, Unixbench runs with full core score: 1. Machine configuration: Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz CPU core: 40 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 948326591.2 81261.9 Double-Precision Whetstone 55.0 211986.3 38543.0 Execl Throughput 43.0 43453.2 10105.4 File Copy 1024 bufsize 2000 maxblocks 3960.0 438936.2 1108.4 File Copy 256 bufsize 500 maxblocks1655.0 118197.4714.2 File Copy 4096 bufsize 8000 maxblocks 5800.01534674.7 2646.0 Pipe Throughput 12440.0 46482107.6 37365.0 Pipe-based Context Switching 4000.01915094.2 4787.7 Process Creation126.0 85442.2 6781.1 Shell Scripts (1 concurrent) 42.4 69400.7 16368.1 Shell Scripts (8 concurrent) 6.0 8877.2 14795.3 System Call Overhead 15000.04714906.1 3143.3 System Benchmarks Index Score7923.3 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 947032915.5 81151.1 Double-Precision Whetstone 55.0 211971.2 38540.2 Execl Throughput 43.0 45054.8 10477.9 File Copy 1024 bufsize 2000 maxblocks 3960.0 
515024.9 1300.6 File Copy 256 bufsize 500 maxblocks1655.0 146354.6884.3 File Copy 4096 bufsize 8000 maxblocks 5800.01679995.9 2896.5 Pipe Throughput 12440.0 46466394.2 37352.4 Pipe-based Context Switching 4000.01898221.4 4745.6 Process Creation126.0 85653.1 6797.9 Shell Scripts (1 concurrent) 42.4 69437.3 16376.7 Shell Scripts (8 concurrent) 6.0 8898.9 14831.4 System Call Overhead 15000.04658746.7 3105.8 System Benchmarks Index Score8248.8 2. Machine configuration: Hygon C86 7185 32-core Processor CPU core: 128 Memory: 256G OS Kernel: 5.19-rc3 Before using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2256644068.3 193371.4 Double-Precision Whetstone 55.0 438969.9 79812.7 Execl Throughput 43.0 10108.6 2350.8 File Copy 1024 bufsize 2000 maxblocks 3960.0 275892.8696.7 File Copy 256 bufsize 500 maxblocks1655.0 72082.7435.5 File Copy 4096 bufsize 8000 maxblocks 5800.0 925043.4 1594.9 Pipe Throughput 12440.0 118905512.5 95583.2 Pipe-based Context Switching 4000.07820945.7 19552.4 Process Creation126.0 31233.3 2478.8 Shell Scripts (1 concurrent) 42.4 49042.8 11566.7 Shell Scripts (8 concurrent) 6.0 6656.0 11093.3 System Call Overhead 15000.06816047.5 4544.0 System Benchmarks Index Score7756.6 After using the patch: System Benchmarks Index Values BASELINE RESULTINDEX Dhrystone 2 using register variables 116700.0 2252272929.4 192996.8 Double-Precision Whetstone 55.0 451847.2 82154.0 Execl Throughput 43.0 10595.1 2464.0 File Copy 1024 bufsize 2000 maxblocks 3960.0
Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On 6/27/22 01:54, Guo Hui wrote: Thank you very much Longman, my patch is as you said, only disable node_cpu on X86, enable node_cpu on arm64, powerpc, s390 architectures; the code is in file arch/x86/kernel/paravirt-spinlocks.c: DECLARE_STATIC_KEY_FALSE(preemted_key); static_branch_enable(&preemted_key); the default value of preemted_key is false and the if conditional statement is reversed, the code is in file kernel/locking/osq_lock.c: DEFINE_STATIC_KEY_FALSE(preemted_key); static inline int node_cpu(struct optimistic_spin_node *node) { int cpu = 0; if (!static_branch_unlikely(&preemted_key)) cpu = node->cpu - 1; return cpu; } In this way, only one nop instruction is added to architectures arm64, powerpc and s390, including virtual machines, without any other changes. You are right. I am probably too tired last night to read the patch more carefully. Cheers, Longman
Re: [PATCH 1/1] Create debugfs file with virtio balloon usage information
On Mon, Jun 27, 2022 at 12:20:38PM +, alexander.atana...@virtuozzo.com wrote: > From: Alexander Atanasov > > Allow the guest to know how much it is ballooned by the host. > It is useful when debugging out of memory conditions. > > When host gets back memory from the guest it is accounted > as used memory in the guest but the guest have no way to know > how much it is actually ballooned. > > Signed-off-by: Alexander Atanasov > --- > drivers/virtio/virtio_balloon.c | 91 + > 1 file changed, 91 insertions(+) > > diff --git a/drivers/virtio/virtio_balloon.c b/drivers/virtio/virtio_balloon.c > index b9737da6c4dd..a32a52c28a1f 100644 > --- a/drivers/virtio/virtio_balloon.c > +++ b/drivers/virtio/virtio_balloon.c > @@ -10,6 +10,7 @@ > #include > #include > #include > +#include > #include > #include > #include > @@ -731,6 +732,91 @@ static void report_free_page_func(struct work_struct > *work) > } > } > > +/* > + * DEBUGFS Interface > + */ > +#ifdef CONFIG_DEBUG_FS > + > +/** > + * virtio_balloon_debug_show - shows statistics of balloon operations. > + * @f: pointer to the seq_file. > + * @offset: ignored. > + * > + * Provides the statistics that can be accessed in virtio-balloon in the > debugfs. > + * > + * Return: zero on success or an error code. > + */ > +static int virtio_balloon_debug_show(struct seq_file *f, void *offset) > +{ > + struct virtio_balloon *b = f->private; > + u32 num_pages; > + struct sysinfo i; > + > + si_meminfo(); > + > + seq_printf(f, "%-22s:", "capabilities"); why aren't features in sysfs sufficient for this? pretty printing can be done in userspace ... 
> + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_MUST_TELL_HOST)) > + seq_puts(f, " tell_host"); > + > + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_STATS_VQ)) > + seq_puts(f, " stats"); > + > + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > + seq_puts(f, " deflate_on_oom"); > + > + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_FREE_PAGE_HINT)) > + seq_puts(f, " free_page_hint"); > + > + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_PAGE_POISON)) > + seq_puts(f, " page_poison"); > + > + if (virtio_has_feature(b->vdev, VIRTIO_BALLOON_F_REPORTING)) > + seq_puts(f, " reporting"); > + > + seq_puts(f, "\n"); > + > + seq_printf(f, "%-22s: %d\n", "page_size", 4096); > + > + virtio_cread_le(b->vdev, struct virtio_balloon_config, actual, > + _pages); > + /* Memory returned to host or amount we can inflate if possible */ > + seq_printf(f, "%-22s: %u\n", "ballooned_pages", num_pages); > + > + /* Total Memory for the guest from host */ > + seq_printf(f, "%-22s: %lu\n", "total_pages", i.totalram); > + > + /* Current memory for the guest */ > + seq_printf(f, "%-22s: %lu\n", "current_pages", i.totalram-num_pages); spaces around - > + > + return 0; > +} > + > +DEFINE_SHOW_ATTRIBUTE(virtio_balloon_debug); > + > +static void virtio_balloon_debugfs_init(struct virtio_balloon *b) > +{ > + debugfs_create_file("virtio-balloon", 0444, NULL, b, > + _balloon_debug_fops); > +} > + > +static void virtio_balloon_debugfs_exit(struct virtio_balloon *b) > +{ > + debugfs_remove(debugfs_lookup("virtio-balloon", NULL)); > +} > + > +#else > + > +static inline void virtio_balloon_debugfs_init(struct virtio_balloon *b) > +{ > +} > + > +static inline void virtio_balloon_debugfs_exit(struct virtio_balloon *b) > +{ > +} > + > +#endif /* CONFIG_DEBUG_FS */ > + > #ifdef CONFIG_BALLOON_COMPACTION > /* > * virtballoon_migratepage - perform the balloon page migration on behalf of > @@ -1019,6 +1105,9 @@ static int virtballoon_probe(struct virtio_device *vdev) > > if 
(towards_target(vb)) > virtballoon_changed(vdev); > + > + virtio_balloon_debugfs_init(vb); > + > return 0; > > out_unregister_oom: > @@ -1065,6 +1154,8 @@ static void virtballoon_remove(struct virtio_device > *vdev) > { > struct virtio_balloon *vb = vdev->priv; > > + virtio_balloon_debugfs_exit(vb); > + > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_REPORTING)) > page_reporting_unregister(>pr_dev_info); > if (virtio_has_feature(vb->vdev, VIRTIO_BALLOON_F_DEFLATE_ON_OOM)) > -- > 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH net V2] virtio-net: fix race between ndo_open() and virtio_device_ready()
On Mon, Jun 27, 2022 at 04:30:40PM +0800, Jason Wang wrote: > We currently call virtio_device_ready() after netdev > registration. Since ndo_open() can be called immediately > after register_netdev, this means there exists a race between > ndo_open() and virtio_device_ready(): the driver may start to use the > device before DRIVER_OK which violates the spec. > > Fix this by switching to use register_netdevice() and protect the > virtio_device_ready() with rtnl_lock() to make sure ndo_open() can > only be called after virtio_device_ready(). > > Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early") > Acked-by: Michael S. Tsirkin > Signed-off-by: Jason Wang I reworded the caif commit log similarly and put both on my tree. > --- > drivers/net/virtio_net.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index db05b5e930be..8a5810bcb839 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev) > if (vi->has_rss || vi->has_rss_hash_report) > virtnet_init_default_rss(vi); > > - err = register_netdev(dev); > + /* serialize netdev register + virtio_device_ready() with ndo_open() */ > + rtnl_lock(); > + > + err = register_netdevice(dev); > if (err) { > pr_debug("virtio_net: registering device failed\n"); > + rtnl_unlock(); > goto free_failover; > } > > virtio_device_ready(vdev); > > + rtnl_unlock(); > + > err = virtnet_cpu_notif_add(vi); > if (err) { > pr_debug("virtio_net: registering cpu notifier failed\n"); > -- > 2.25.1 ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 04:14:20PM +0800, Jason Wang wrote: > On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin wrote: > > > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote: > > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin > > > > > wrote: > > > > > > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which > > > > > > > comes from > > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > > > > > > > For not breaks uABI, add a new struct > > > > > > > virtio_pci_common_cfg_notify. > > > > > > > > > > > > What exactly is meant by not breaking uABI? > > > > > > Users are supposed to be prepared for struct size to change ... no? > > > > > > > > > > Not sure, any doc for this? > > > > > > > > > > Thanks > > > > > > > > > > > > Well we have this: > > > > > > > > The drivers SHOULD only map part of configuration structure > > > > large enough for device operation. The drivers MUST handle > > > > an unexpectedly large \field{length}, but MAY check that > > > > \field{length} > > > > is large enough for device operation. > > > > > > Yes, but that's the device/driver interface. What's done here is the > > > userspace/kernel. > > > > > > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)? > > > > > > Thanks > > > > Hmm I guess there's risk... but then how are we going to maintain this > > going forward? Add a new struct on any change? > > This is the way we have used it for the past 5 or more years. I don't > see why this must be handled in the vq reset feature. > > >Can we at least > > prevent this going forward somehow? > > Like have some padding? > > Thanks Maybe - this is what QEMU does ... 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I > > > > > > > submitted > > > > > > > this patch first. > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > > > Acked-by: Jason Wang > > > > > > > --- > > > > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > > > > 1 file changed, 7 insertions(+) > > > > > > > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > > > > b/include/uapi/linux/virtio_pci.h > > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > > > > --- a/include/uapi/linux/virtio_pci.h > > > > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > > > > __le32 queue_used_hi; /* read-write */ > > > > > > > }; > > > > > > > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > > > > + struct virtio_pci_common_cfg cfg; > > > > > > > + > > > > > > > + __le16 queue_notify_data; /* read-write */ > > > > > > > + __le16 padding; > > > > > > > +}; > > > > > > > + > > > > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > > > > struct virtio_pci_cfg_cap { > > > > > > > struct virtio_pci_cap cap; > > > > > > > -- > > > > > > > 2.31.0 > > > > > > > > > > > > ___ Virtualization mailing list Virtualization@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/virtualization
Re: [PATCH v3 1/2] vhost: Add rbtree vdpa_mem_tree to saved the counted mem
On Mon, Jun 27, 2022 at 04:12:57PM +0800, Cindy Lu wrote: > On Sun, Jun 26, 2022 at 6:01 PM Michael S. Tsirkin wrote: > > > > On Sun, Jun 26, 2022 at 05:04:08PM +0800, Cindy Lu wrote: > > > We count pinned_vm as follow in vhost-vDPA > > > > > > lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT; > > > if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) { > > > ret = -ENOMEM; > > > goto unlock; > > > } > > > This means if we have two vDPA devices for the same VM the pages > > > would be counted twice. So we add a tree to save the page that > > > counted and we will not count it again. > > > > > > Add vdpa_mem_tree to saved the mem that already counted. > > > use a hlist to saved the root for vdpa_mem_tree. > > > > > > Signed-off-by: Cindy Lu > > > --- > > > drivers/vhost/vhost.c | 63 +++ > > > drivers/vhost/vhost.h | 1 + > > > 2 files changed, 64 insertions(+) > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > index 40097826cff0..4ca8b1ed944b 100644 > > > --- a/drivers/vhost/vhost.c > > > +++ b/drivers/vhost/vhost.c > > > @@ -32,6 +32,8 @@ > > > #include > > > > > > #include "vhost.h" > > > +#include <linux/hashtable.h> > > > +#include <linux/jhash.h> > > > > > > static ushort max_mem_regions = 64; > > > module_param(max_mem_regions, ushort, 0444); > > > @@ -49,6 +51,14 @@ enum { > > > #define vhost_used_event(vq) ((__virtio16 __user > > > *)&vq->avail->ring[vq->num]) > > > #define vhost_avail_event(vq) ((__virtio16 __user > > > *)&vq->used->ring[vq->num]) > > > > > > +struct vhost_vdpa_rbtree_node { > > > + struct hlist_node node; > > > + struct rb_root_cached vdpa_mem_tree; > > > + struct mm_struct *mm_using; > > > +}; > > > +static DECLARE_HASHTABLE(vhost_vdpa_rbtree_hlist, 8); > > > +int vhost_vdpa_rbtree_hlist_status; > > > + > > > #ifdef CONFIG_VHOST_CROSS_ENDIAN_LEGACY > > > static void vhost_disable_cross_endian(struct vhost_virtqueue *vq) > > > { > > > > Are you trying to save some per-mm information here? > > Can't we just add it to mm_struct?
> > > yes, this is a per-mm information, but I have checked with jason before, > seems it maybe difficult to change the mm_struct in upstream > so I add an to add a hlist instead > Thanks > Cindy Difficult how? You get more scrutiny if you try, for sure, and you need to generalize it enough that it looks useful outside the driver. But I guess that's good? > > > > > > > @@ -571,6 +581,51 @@ static void vhost_detach_mm(struct vhost_dev *dev) > > > dev->mm = NULL; > > > } > > > > > > +struct rb_root_cached *vhost_vdpa_get_mem_tree(struct mm_struct *mm) > > > +{ > > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL; > > > + struct rb_root_cached *vdpa_tree; > > > + u32 key; > > > + > > > + /* No hased table, init one */ > > > + if (vhost_vdpa_rbtree_hlist_status == 0) { > > > + hash_init(vhost_vdpa_rbtree_hlist); > > > + vhost_vdpa_rbtree_hlist_status = 1; > > > + } > > > + > > > + key = jhash_1word((u64)mm, JHASH_INITVAL); > > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node, > > > +key) { > > > + if (rbtree_root->mm_using == mm) > > > + return &(rbtree_root->vdpa_mem_tree); > > > + } > > > + rbtree_root = kmalloc(sizeof(*rbtree_root), GFP_KERNEL); > > > + if (!rbtree_root) > > > + return NULL; > > > + rbtree_root->mm_using = mm; > > > + rbtree_root->vdpa_mem_tree = RB_ROOT_CACHED; > > > + hash_add(vhost_vdpa_rbtree_hlist, &rbtree_root->node, key); > > > + vdpa_tree = &(rbtree_root->vdpa_mem_tree); > > > + return vdpa_tree; > > > +} > > > + > > > +void vhost_vdpa_relase_mem_tree(struct mm_struct *mm) > > > +{ > > > + struct vhost_vdpa_rbtree_node *rbtree_root = NULL; > > > + u32 key; > > > + > > > + key = jhash_1word((u64)mm, JHASH_INITVAL); > > > + > > > + /* No hased table, init one */ > > > + hash_for_each_possible(vhost_vdpa_rbtree_hlist, rbtree_root, node, > > > +key) { > > > + if (rbtree_root->mm_using == mm) { > > > + hash_del(&rbtree_root->node); > > > + kfree(rbtree_root); > > > + } > > > + } > > > +} > > > + > > > /* Caller should have
device mutex */ > > > long vhost_dev_set_owner(struct vhost_dev *dev) > > > { > > > @@ -605,6 +660,11 @@ long vhost_dev_set_owner(struct vhost_dev *dev) > > > err = vhost_dev_alloc_iovecs(dev); > > > if (err) > > > goto err_cgroup; > > > + dev->vdpa_mem_tree = vhost_vdpa_get_mem_tree(dev->mm); > > > + if (dev->vdpa_mem_tree == NULL) { > > > + err = -ENOMEM; > > > + goto err_cgroup; > > > + } > > > > > > return 0; > > > err_cgroup: > > > @@ -613,6 +673,7 @@ long vhost_dev_set_owner(struct vhost_dev *dev) > > > dev->worker = NULL; > > > } > > > err_worker:
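Stripped of kernel plumbing, the lookup scheme in the patch — hash the mm_struct pointer into a fixed bucket array, scan the bucket for an existing per-mm node, allocate and link one on a miss — can be sketched in plain userspace C. The names, the bucket count, and the trivial pointer hash standing in for jhash_1word() are all illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

#define NBUCKETS 256 /* stand-in for the 2^8 hashtable in the patch */

struct mem_tree_node {
	struct mem_tree_node *next; /* bucket chain, like hlist_node */
	const void *mm;             /* key: the owning mm_struct pointer */
	long pinned_pages;          /* shared per-mm state, like the rb_root */
};

static struct mem_tree_node *buckets[NBUCKETS];

/* Placeholder for jhash_1word((u64)mm, JHASH_INITVAL). */
static unsigned int hash_mm(const void *mm)
{
	return (unsigned int)(((uintptr_t)mm >> 4) * 2654435761u) % NBUCKETS;
}

/* Find the per-mm node, creating it on first use (vhost_vdpa_get_mem_tree). */
struct mem_tree_node *get_mem_node(const void *mm)
{
	unsigned int key = hash_mm(mm);
	struct mem_tree_node *n;

	for (n = buckets[key]; n; n = n->next)
		if (n->mm == mm)
			return n;

	n = calloc(1, sizeof(*n));
	if (!n)
		return NULL;
	n->mm = mm;
	n->next = buckets[key];
	buckets[key] = n;
	return n;
}

/* Drop the per-mm node again (vhost_vdpa_relase_mem_tree). */
void release_mem_node(const void *mm)
{
	unsigned int key = hash_mm(mm);
	struct mem_tree_node **pp = &buckets[key];

	while (*pp) {
		if ((*pp)->mm == mm) {
			struct mem_tree_node *dead = *pp;
			*pp = dead->next;
			free(dead);
			return;
		}
		pp = &(*pp)->next;
	}
}
```

Two vhost-vDPA devices opened by the same process resolve to the same node, which is what lets the pinned-page accounting be shared instead of counted twice.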
Re: [PATCH V3] virtio: disable notification hardening by default
On Mon, Jun 27, 2022 at 04:11:18PM +0800, Jason Wang wrote: > On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin wrote: > > > > On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote: > > > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote: > > > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin > > > > > wrote: > > > > > > > > > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote: > > > > > > > We try to harden virtio device notifications in 8b4ec69d7e09 > > > > > > > ("virtio: > > > > > > > harden vring IRQ"). It works with the assumption that the driver > > > > > > > or > > > > > > > core can properly call virtio_device_ready() at the right > > > > > > > place. Unfortunately, this seems to be not true and uncover > > > > > > > various > > > > > > > bugs of the existing drivers, mainly the issue of using > > > > > > > virtio_device_ready() incorrectly. > > > > > > > > > > > > > > So let's having a Kconfig option and disable it by default. It > > > > > > > gives > > > > > > > us a breath to fix the drivers and then we can consider to enable > > > > > > > it > > > > > > > by default. > > > > > > > > > > > > > > Signed-off-by: Jason Wang > > > > > > > > > > > > > > > > > > OK I will queue, but I think the problem is fundamental. > > > > > > > > > > If I understand correctly, you want some core IRQ work? > > > > > > > > Yes. > > > > > > > > > As discussed > > > > > before, it doesn't solve all the problems, we still need to do per > > > > > driver audit. > > > > > > > > > > Thanks > > > > > > > > Maybe, but we don't need to tie things to device_ready then. > > > > We can do > > > > > > > > - disable irqs > > > > - device ready > > > > - setup everything > > > > - enable irqs > > > > > > > > > > > > and this works for most things, the only issue is > > > > this deadlocks if "setup everything" waits for interrupts. 
> > > > > > > > > > > > With the current approach there's really no good time: > > > > 1.- setup everything > > > > - device ready > > > > > > > > can cause kicks before device is ready > > > > > > > > 2.- device ready > > > > - setup everything > > > > > > > > can cause callbacks before setup. > > > > > > > > So I prefer the 1. and fix the hardening in the core. > > > > > > So my question is: > > > > > > 1) do similar hardening like config interrupt > > > or > > > 2) per transport notification work (e.g for PCI core IRQ work) > > > > > > 1) seems easier and universal, but we pay little overhead which could > > > be eliminated by the config option. > > > > I doubt 1 is easy and I am not even sure core IRQ changes will help. > > Core IRQ won't help in 1), the changes are limited in the virtio. > > > My concern with adding overhead is that I'm not sure these are not just > > wasted CPU cycles. We spent a bunch of time on irq hardening and so far > > we are still at the "all drivers need to be fixed" stage. > > It's not the fault of hardening but the drivers. The hardening just > expose those bugs. Heh. Yea sure. But things work fine for people. What is the chance your review found and fixed all driver bugs? After two attempts I don't feel like hoping audit will fix all bugs. > > > > The reason config was kind of easy is that config interrupt is rarely > > vital for device function so arbitrarily deferring that does not lead to > > deadlocks - what you are trying to do with VQ interrupts is > > fundamentally different. Things are especially bad if we just drop > > an interrupt but deferring can lead to problems too. > > I'm not sure I see the difference, disable_irq() stuffs also delay the > interrupt processing until enable_irq(). Absolutely. I am not at all sure disable_irq fixes all problems. > > > > Consider as an example > > virtio-net: fix race between ndo_open() and virtio_device_ready() > > if you just defer vq interrupts you get deadlocks. 
> > > > > > I don't see a deadlock here, maybe you can show more detail on this? What I mean is this: if we revert the above commit, things still work (out of spec, but still). If we revert and defer interrupts until device ready then ndo_open that triggers before device ready deadlocks. > > > > So, thinking about all this, how about a simple per vq flag meaning > > "this vq was kicked since reset"? > > And ignore the notification if vq is not kicked? It sounds like the > callback needs to be synchronized with the kick. Note we only need to synchronize it when it changes, which is only during initialization and reset. > > > > If driver does not kick then it's not ready to get callbacks, right? > > > > Sounds quite clean, but we need to think through memory ordering > > concerns - I guess it's only when we change the value so > > if (!vq->kicked) { > > vq->kicked = true; > > mb(); > > } > > > > will do the trick, right? > > There's no much difference with the existing
Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On Mon, Jun 27, 2022 at 05:18:02PM +0800, Guo Hui wrote: > Ok thanks Peter, your suggestion is very good, I will modify my patch as you > suggested. Because it messes up the order in which people normally read text. Why is top-posting such a bad thing? Top-posting. What is the most annoying thing in e-mail?
Re: vhost-vdpa tests in tools/virtio
On Thu, Jun 23, 2022 at 11:59:56AM +0800, Jason Wang wrote: On Thu, Jun 23, 2022 at 3:21 AM Eugenio Perez Martin wrote: On Wed, Jun 22, 2022 at 5:39 PM Stefano Garzarella wrote: > > Hi, > while developing/testing the vdpa-blk support in libblkio [1], I > realized that we added several regressions with the "[PATCH v2 00/19] > Control VQ support in vDPA" series (patches/clarifications already > sent). > > To minimize these problems in the future, I was thinking of adding tests > for vhost_vdpa in tools/virtio using vDPA simulators as devices. > > What do you think about this? I think it's definitely a good idea. Maybe a good starting point is to copy the virtio_test for the moment? Maybe to add a new parameter to tell it it will test a vhost-vdpa device? Initialization is pretty similar. +1 Good idea, I will use it as a starting point. Thanks, Stefano
[PATCH net V2] virtio-net: fix race between ndo_open() and virtio_device_ready()
We currently call virtio_device_ready() after netdev registration. Since ndo_open() can be called immediately after register_netdev, this means there exists a race between ndo_open() and virtio_device_ready(): the driver may start to use the device before DRIVER_OK which violates the spec. Fix this by switching to use register_netdevice() and protect the virtio_device_ready() with rtnl_lock() to make sure ndo_open() can only be called after virtio_device_ready(). Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early") Acked-by: Michael S. Tsirkin Signed-off-by: Jason Wang --- drivers/net/virtio_net.c | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c index db05b5e930be..8a5810bcb839 100644 --- a/drivers/net/virtio_net.c +++ b/drivers/net/virtio_net.c @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev) if (vi->has_rss || vi->has_rss_hash_report) virtnet_init_default_rss(vi); - err = register_netdev(dev); + /* serialize netdev register + virtio_device_ready() with ndo_open() */ + rtnl_lock(); + + err = register_netdevice(dev); if (err) { pr_debug("virtio_net: registering device failed\n"); + rtnl_unlock(); goto free_failover; } virtio_device_ready(vdev); + rtnl_unlock(); + err = virtnet_cpu_notif_add(vi); if (err) { pr_debug("virtio_net: registering cpu notifier failed\n"); -- 2.25.1
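The fix rests on one property of the networking core: ndo_open() always runs under RTNL, so holding rtnl_lock() across register_netdevice() + virtio_device_ready() forces any open to serialize after DRIVER_OK. A toy single-file model of that argument (names hypothetical) enumerates where an open can land in each ordering and checks for the bad window:

```c
#include <stdbool.h>

/* Toy model of the probe-time race: enumerate the instants at which
 * ndo_open() may run relative to the probe steps, and ask whether any of
 * them observes "netdev registered but DRIVER_OK not yet set". */
struct dev_state {
	bool registered;  /* netdev visible, so ndo_open() is callable */
	bool driver_ok;   /* virtio_device_ready() has run */
};

/* An open at this instant would drive the device before DRIVER_OK. */
static bool open_violates_spec(struct dev_state s)
{
	return s.registered && !s.driver_ok;
}

/* Old order: register_netdev() takes and drops RTNL internally before
 * virtio_device_ready(), so an open may land in the middle window. */
bool old_order_has_bad_window(void)
{
	struct dev_state s = { false, false };

	s.registered = true;            /* register_netdev(dev) */
	if (open_violates_spec(s))      /* RTNL free: open can run here */
		return true;
	s.driver_ok = true;             /* virtio_device_ready(vdev) */
	return open_violates_spec(s);
}

/* Fixed order: register_netdevice() + virtio_device_ready() inside one
 * rtnl_lock() section; since ndo_open() also needs RTNL, the only instants
 * it can occupy are before or after the whole section. */
bool new_order_has_bad_window(void)
{
	struct dev_state s = { false, false };

	if (open_violates_spec(s))      /* before rtnl_lock() */
		return true;
	s.registered = true;            /* register_netdevice(dev)   } under */
	s.driver_ok = true;             /* virtio_device_ready(vdev) } RTNL  */
	return open_violates_spec(s);   /* after rtnl_unlock() */
}
```

Only the old ordering exposes an instant where the spec can be violated; the fixed ordering removes the window entirely rather than merely shrinking it.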
Re: [PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()
On Mon, Jun 27, 2022 at 3:44 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 02:36:56PM +0800, Jason Wang wrote: > > We used to call virtio_device_ready() after netdev registration. > > s/used to call/currently call/ > > > This > > cause > > s/This cause/Since ndo_open can be called immediately > after register_netdev, this means there exists/ > > > a race between ndo_open() and virtio_device_ready(): if > > ndo_open() is called before virtio_device_ready(), the driver may > > start to use the device before DRIVER_OK which violates the spec. > > > > Fixing > > s/Fixing/Fix/ > > > this by switching to use register_netdevice() and protect the > > virtio_device_ready() with rtnl_lock() to make sure ndo_open() can > > only be called after virtio_device_ready(). > > > > Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early") > > it's an unusual use of Fixes - the patch in question does not > introduce the problem, it just does not fix it completely. Yes, but I couldn't find a better commit. > But OK I guess. > > > Signed-off-by: Jason Wang > > With commit log changes: Will post a new version with the above fixed. Thanks > > Acked-by: Michael S. 
Tsirkin > > > --- > > drivers/net/virtio_net.c | 8 +++- > > 1 file changed, 7 insertions(+), 1 deletion(-) > > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > > index db05b5e930be..8a5810bcb839 100644 > > --- a/drivers/net/virtio_net.c > > +++ b/drivers/net/virtio_net.c > > @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev) > > if (vi->has_rss || vi->has_rss_hash_report) > > virtnet_init_default_rss(vi); > > > > - err = register_netdev(dev); > > + /* serialize netdev register + virtio_device_ready() with ndo_open() > > */ > > + rtnl_lock(); > > + > > + err = register_netdevice(dev); > > if (err) { > > pr_debug("virtio_net: registering device failed\n"); > > + rtnl_unlock(); > > goto free_failover; > > } > > > > virtio_device_ready(vdev); > > > > + rtnl_unlock(); > > + > > err = virtnet_cpu_notif_add(vi); > > if (err) { > > pr_debug("virtio_net: registering cpu notifier failed\n"); > > -- > > 2.25.1 >
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 3:58 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote: > > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin wrote: > > > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes > > > > > > from > > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > > > > > For not breaks uABI, add a new struct virtio_pci_common_cfg_notify. > > > > > > > > > > What exactly is meant by not breaking uABI? > > > > > Users are supposed to be prepared for struct size to change ... no? > > > > > > > > Not sure, any doc for this? > > > > > > > > Thanks > > > > > > > > > Well we have this: > > > > > > The drivers SHOULD only map part of configuration structure > > > large enough for device operation. The drivers MUST handle > > > an unexpectedly large \field{length}, but MAY check that > > > \field{length} > > > is large enough for device operation. > > > > Yes, but that's the device/driver interface. What's done here is the > > userspace/kernel. > > > > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)? > > > > Thanks > > Hmm I guess there's risk... but then how are we going to maintain this > going forward? Add a new struct on any change? This is the way we have used it for the past 5 or more years. I don't see why this must be handled in the vq reset feature. >Can we at least > prevent this going forward somehow? Like have some padding? Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I submitted > > > > > > this patch first. 
> > > > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > > Acked-by: Jason Wang > > > > > > --- > > > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > > > 1 file changed, 7 insertions(+) > > > > > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > > > b/include/uapi/linux/virtio_pci.h > > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > > > --- a/include/uapi/linux/virtio_pci.h > > > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > > > __le32 queue_used_hi; /* read-write */ > > > > > > }; > > > > > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > > > + struct virtio_pci_common_cfg cfg; > > > > > > + > > > > > > + __le16 queue_notify_data; /* read-write */ > > > > > > + __le16 padding; > > > > > > +}; > > > > > > + > > > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > > > struct virtio_pci_cfg_cap { > > > > > > struct virtio_pci_cap cap; > > > > > > -- > > > > > > 2.31.0 > > > > > > > > > > >
Re: [PATCH V3] virtio: disable notification hardening by default
On Mon, Jun 27, 2022 at 3:33 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote: > > On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin wrote: > > > > > > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote: > > > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin > > > > wrote: > > > > > > > > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote: > > > > > > We try to harden virtio device notifications in 8b4ec69d7e09 > > > > > > ("virtio: > > > > > > harden vring IRQ"). It works with the assumption that the driver or > > > > > > core can properly call virtio_device_ready() at the right > > > > > > place. Unfortunately, this seems to be not true and uncover various > > > > > > bugs of the existing drivers, mainly the issue of using > > > > > > virtio_device_ready() incorrectly. > > > > > > > > > > > > So let's having a Kconfig option and disable it by default. It gives > > > > > > us a breath to fix the drivers and then we can consider to enable it > > > > > > by default. > > > > > > > > > > > > Signed-off-by: Jason Wang > > > > > > > > > > > > > > > OK I will queue, but I think the problem is fundamental. > > > > > > > > If I understand correctly, you want some core IRQ work? > > > > > > Yes. > > > > > > > As discussed > > > > before, it doesn't solve all the problems, we still need to do per > > > > driver audit. > > > > > > > > Thanks > > > > > > Maybe, but we don't need to tie things to device_ready then. > > > We can do > > > > > > - disable irqs > > > - device ready > > > - setup everything > > > - enable irqs > > > > > > > > > and this works for most things, the only issue is > > > this deadlocks if "setup everything" waits for interrupts. 
> > > > > > > > > With the current approach there's really no good time: > > > 1.- setup everything > > > - device ready > > > > > > can cause kicks before device is ready > > > > > > 2.- device ready > > > - setup everything > > > > > > can cause callbacks before setup. > > > > > > So I prefer the 1. and fix the hardening in the core. > > > > So my question is: > > > > 1) do similar hardening like config interrupt > > or > > 2) per transport notification work (e.g for PCI core IRQ work) > > > > 1) seems easier and universal, but we pay little overhead which could > > be eliminated by the config option. > > I doubt 1 is easy and I am not even sure core IRQ changes will help. Core IRQ won't help in 1), the changes are limited in the virtio. > My concern with adding overhead is that I'm not sure these are not just > wasted CPU cycles. We spent a bunch of time on irq hardening and so far > we are still at the "all drivers need to be fixed" stage. It's not the fault of hardening but the drivers. The hardening just expose those bugs. > > The reason config was kind of easy is that config interrupt is rarely > vital for device function so arbitrarily deferring that does not lead to > deadlocks - what you are trying to do with VQ interrupts is > fundamentally different. Things are especially bad if we just drop > an interrupt but deferring can lead to problems too. I'm not sure I see the difference, disable_irq() stuffs also delay the interrupt processing until enable_irq(). > > Consider as an example > virtio-net: fix race between ndo_open() and virtio_device_ready() > if you just defer vq interrupts you get deadlocks. > > I don't see a deadlock here, maybe you can show more detail on this? > > So, thinking about all this, how about a simple per vq flag meaning > "this vq was kicked since reset"? And ignore the notification if vq is not kicked? It sounds like the callback needs to be synchronized with the kick. 
> > If driver does not kick then it's not ready to get callbacks, right? > > Sounds quite clean, but we need to think through memory ordering > concerns - I guess it's only when we change the value so > if (!vq->kicked) { > vq->kicked = true; > mb(); > } > > will do the trick, right? There's no much difference with the existing approach: 1) your proposal implicitly makes callbacks ready in virtqueue_kick() 2) my proposal explicitly makes callbacks ready via virtio_device_ready() Both require careful auditing of all the existing drivers to make sure no kick before DRIVER_OK. > > need to think about the reset path - it already synchronizes callbacks > and already can lose interrupts so we just need to clear vq->kicked > before that, right? Probably. > > > > 2) seems require more work in the IRQ core and it can not work for all > > transports (e.g vDPA would be kind of difficult) > > > > Thanks > > Hmm I don't really get why would it be difficult. > VDPA is mostly PCI isn't it? With PCI both level INT#x and edge MSI > have interrupt masking support. Yes, but consider the case of mlx5_vdpa, PCI stuff was hidden under the auxiliary bus. And that is the way another vendor will go. Thanks > > > > > > > > > > > > > > > > > > > > > > >
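The "kicked since reset" idea being debated can be modeled in userspace with C11 atomics: the flag is published on the first kick and checked on the callback path, so a notification arriving for a never-kicked vq is treated as spurious. This is only a sketch of the proposal, with release/acquire ordering standing in for the mb() in the quoted snippet and all names invented:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the proposed per-vq flag: a callback (interrupt) is only
 * honoured once the driver has kicked the vq at least once since reset. */
struct toy_vq {
	atomic_bool kicked;   /* "this vq was kicked since reset" */
	int handled;          /* callbacks actually delivered */
};

void toy_vq_kick(struct toy_vq *vq)
{
	/* Publish the flag before the doorbell write would reach the
	 * device; the store is only done on the first kick, mirroring the
	 * if (!vq->kicked) { ... mb(); } sketch from the thread. */
	if (!atomic_load_explicit(&vq->kicked, memory_order_relaxed))
		atomic_store_explicit(&vq->kicked, true, memory_order_release);
}

/* vring_interrupt()-style entry: drop callbacks for never-kicked vqs. */
bool toy_vq_interrupt(struct toy_vq *vq)
{
	if (!atomic_load_explicit(&vq->kicked, memory_order_acquire))
		return false;  /* driver never kicked: treat as spurious */
	vq->handled++;
	return true;
}

void toy_vq_reset(struct toy_vq *vq)
{
	/* Reset already synchronizes callbacks; clear the flag under it. */
	atomic_store_explicit(&vq->kicked, false, memory_order_relaxed);
}
```

Unlike deferring interrupts wholesale, dropping not-yet-kicked notifications cannot deadlock a driver that waits for an interrupt it legitimately armed, which is the distinction the thread keeps circling around.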
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 03:45:30PM +0800, Jason Wang wrote: > On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin wrote: > > > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin > > > wrote: > > > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes > > > > > from > > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > > > For not breaks uABI, add a new struct virtio_pci_common_cfg_notify. > > > > > > > > What exactly is meant by not breaking uABI? > > > > Users are supposed to be prepared for struct size to change ... no? > > > > > > Not sure, any doc for this? > > > > > > Thanks > > > > > > Well we have this: > > > > The drivers SHOULD only map part of configuration structure > > large enough for device operation. The drivers MUST handle > > an unexpectedly large \field{length}, but MAY check that > > \field{length} > > is large enough for device operation. > > Yes, but that's the device/driver interface. What's done here is the > userspace/kernel. > > Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)? > > Thanks Hmm I guess there's risk... but then how are we going to maintain this going forward? Add a new struct on any change? Can we at least prevent this going forward somehow? > > > > > > > > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I submitted > > > > > this patch first. 
> > > > > > > > > > Signed-off-by: Xuan Zhuo > > > > > Acked-by: Jason Wang > > > > > --- > > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > > 1 file changed, 7 insertions(+) > > > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > > b/include/uapi/linux/virtio_pci.h > > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > > --- a/include/uapi/linux/virtio_pci.h > > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > > __le32 queue_used_hi; /* read-write */ > > > > > }; > > > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > > + struct virtio_pci_common_cfg cfg; > > > > > + > > > > > + __le16 queue_notify_data; /* read-write */ > > > > > + __le16 padding; > > > > > +}; > > > > > + > > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > > struct virtio_pci_cfg_cap { > > > > > struct virtio_pci_cap cap; > > > > > -- > > > > > 2.31.0 > > > > > >
Re: [PATCH v2] x86/paravirt: useless assignment instructions cause Unixbench full core performance degradation
On Mon, Jun 27, 2022 at 10:13:50AM +0800, Guo Hui wrote: > The instructions assigned to the vcpu_is_preempted function parameter > in the X86 architecture physical machine are redundant instructions, > causing the multi-core performance of Unixbench to drop by about 4% to 5%. > The C function is as follows: > static bool vcpu_is_preempted(long vcpu); > > The parameter 'vcpu' in the function osq_lock > that calls the function vcpu_is_preempted is assigned as follows: > > The C code is in the function node_cpu: > cpu = node->cpu - 1; > > The instructions corresponding to the C code are: > mov 0x14(%rax),%edi > sub $0x1,%edi > > The above instructions are unnecessary > in the X86 Native operating environment, > causing high cache-misses and degrading performance. The above basically says that argument setup is not patched out and causes significant pain due to a cache-miss. > Signed-off-by: Guo Hui > --- > arch/x86/kernel/paravirt-spinlocks.c | 4 > kernel/locking/osq_lock.c| 9 - > 2 files changed, 12 insertions(+), 1 deletion(-) > > diff --git a/arch/x86/kernel/paravirt-spinlocks.c > b/arch/x86/kernel/paravirt-spinlocks.c > index 9e1ea99ad..7a55f8407 100644 > --- a/arch/x86/kernel/paravirt-spinlocks.c > +++ b/arch/x86/kernel/paravirt-spinlocks.c > @@ -33,6 +33,8 @@ bool pv_is_native_vcpu_is_preempted(void) > __raw_callee_save___native_vcpu_is_preempted; > } > > +DECLARE_STATIC_KEY_FALSE(preemted_key); > + > void __init paravirt_set_cap(void) > { > if (!pv_is_native_spin_unlock()) > @@ -40,4 +42,6 @@ void __init paravirt_set_cap(void) > > if (!pv_is_native_vcpu_is_preempted()) > setup_force_cpu_cap(X86_FEATURE_VCPUPREEMPT); > + else > + static_branch_enable(&preemted_key); > } At least for x86 it makes sense to have the static_key default the other way around. That is, enable it along with vcpu_is_preempted().
> diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c > index d5610ad52..a8798e701 100644 > --- a/kernel/locking/osq_lock.c > +++ b/kernel/locking/osq_lock.c > @@ -22,9 +22,16 @@ static inline int encode_cpu(int cpu_nr) > return cpu_nr + 1; > } > > +DEFINE_STATIC_KEY_FALSE(preemted_key); > + > static inline int node_cpu(struct optimistic_spin_node *node) > { > - return node->cpu - 1; > + int cpu = 0; > + > + if (!static_branch_unlikely(&preemted_key)) > + cpu = node->cpu - 1; > + > + return cpu; > } Would not something like: static inline bool vcpu_is_preempted_node(struct optimistic_spin_node *node) { if (!static_branch_unlikely(_has_preemption)) return false; return vcpu_is_preempted(node_cpu(node->prev)); } And then use that like: if (smp_cond_load_relaxed(&node->locked, VAL || need_resched() || vcpu_is_preempted_node(node))) Not generate better code still?
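The shape Peter is suggesting can be illustrated in userspace with a plain bool standing in for the static key (so the "patched out" path is simply an untaken branch here). Everything below is hypothetical scaffolding around the one point that matters: both the node->cpu - 1 argument decode and the hook call sit behind the same branch, so bare metal pays for neither:

```c
#include <stdbool.h>

/* Userspace stand-in for the jump-label machinery. */
bool vcpu_has_preemption;  /* false on bare metal, as in the patch */
int preempt_queries;       /* counts calls into the pv hook */

/* Stand-in for the paravirt vcpu_is_preempted() hook (hypothetical). */
static bool fake_vcpu_is_preempted(int cpu)
{
	preempt_queries++;
	return cpu == 1;  /* pretend vCPU 1 is currently preempted */
}

struct optimistic_spin_node {
	struct optimistic_spin_node *prev;
	int locked;
	int cpu;  /* encoded as cpu_nr + 1, as in osq_lock */
};

/* One helper that hides both the node->cpu - 1 decode and the hook call
 * behind the key check, instead of neutering node_cpu() on its own. */
bool vcpu_is_preempted_node(struct optimistic_spin_node *node)
{
	if (!vcpu_has_preemption)
		return false;  /* bare metal: no argument setup, no call */
	return fake_vcpu_is_preempted(node->prev->cpu - 1);
}
```

With the key off, the decode never executes, which is exactly the redundant mov/sub pair the patch description measured; with the key on, behavior matches the original spin condition.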
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 2:39 PM Michael S. Tsirkin wrote: > > On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote: > > On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin wrote: > > > > > > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote: > > > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes from > > > > here https://github.com/oasis-tcs/virtio-spec/issues/89 > > > > > > > > For not breaks uABI, add a new struct virtio_pci_common_cfg_notify. > > > > > > What exactly is meant by not breaking uABI? > > > Users are supposed to be prepared for struct size to change ... no? > > > > Not sure, any doc for this? > > > > Thanks > > > Well we have this: > > The drivers SHOULD only map part of configuration structure > large enough for device operation. The drivers MUST handle > an unexpectedly large \field{length}, but MAY check that > \field{length} > is large enough for device operation. Yes, but that's the device/driver interface. What's done here is the userspace/kernel. Userspace may break if it uses e.g sizeof(struct virtio_pci_common_cfg)? Thanks > > > > > > > > > > > > > > > Since I want to add queue_reset after queue_notify_data, I submitted > > > > this patch first. 
> > > > > > > > Signed-off-by: Xuan Zhuo > > > > Acked-by: Jason Wang > > > > --- > > > > include/uapi/linux/virtio_pci.h | 7 +++ > > > > 1 file changed, 7 insertions(+) > > > > > > > > diff --git a/include/uapi/linux/virtio_pci.h > > > > b/include/uapi/linux/virtio_pci.h > > > > index 3a86f36d7e3d..22bec9bd0dfc 100644 > > > > --- a/include/uapi/linux/virtio_pci.h > > > > +++ b/include/uapi/linux/virtio_pci.h > > > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg { > > > > __le32 queue_used_hi; /* read-write */ > > > > }; > > > > > > > > +struct virtio_pci_common_cfg_notify { > > > > + struct virtio_pci_common_cfg cfg; > > > > + > > > > + __le16 queue_notify_data; /* read-write */ > > > > + __le16 padding; > > > > +}; > > > > + > > > > /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */ > > > > struct virtio_pci_cfg_cap { > > > > struct virtio_pci_cap cap; > > > > -- > > > > 2.31.0 > > > >
Re: [PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()
On Mon, Jun 27, 2022 at 02:36:56PM +0800, Jason Wang wrote: > We used to call virtio_device_ready() after netdev registration. s/used to call/currently call/ > This > cause s/This cause/Since ndo_open can be called immediately after register_netdev, this means there exists/ > a race between ndo_open() and virtio_device_ready(): if > ndo_open() is called before virtio_device_ready(), the driver may > start to use the device before DRIVER_OK which violates the spec. > > Fixing s/Fixing/Fix/ > this by switching to use register_netdevice() and protect the > virtio_device_ready() with rtnl_lock() to make sure ndo_open() can > only be called after virtio_device_ready(). > > Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early") it's an unusual use of Fixes - the patch in question does not introduce the problem, it just does not fix it completely. But OK I guess. > Signed-off-by: Jason Wang With commit log changes: Acked-by: Michael S. Tsirkin > --- > drivers/net/virtio_net.c | 8 +++- > 1 file changed, 7 insertions(+), 1 deletion(-) > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c > index db05b5e930be..8a5810bcb839 100644 > --- a/drivers/net/virtio_net.c > +++ b/drivers/net/virtio_net.c > @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev) > if (vi->has_rss || vi->has_rss_hash_report) > virtnet_init_default_rss(vi); > > - err = register_netdev(dev); > + /* serialize netdev register + virtio_device_ready() with ndo_open() */ > + rtnl_lock(); > + > + err = register_netdevice(dev); > if (err) { > pr_debug("virtio_net: registering device failed\n"); > + rtnl_unlock(); > goto free_failover; > } > > virtio_device_ready(vdev); > > + rtnl_unlock(); > + > err = virtnet_cpu_notif_add(vi); > if (err) { > pr_debug("virtio_net: registering cpu notifier failed\n"); > -- > 2.25.1
Re: [PATCH V3] virtio: disable notification hardening by default
On Mon, Jun 27, 2022 at 10:50:17AM +0800, Jason Wang wrote:
> On Fri, Jun 24, 2022 at 2:31 PM Michael S. Tsirkin wrote:
> >
> > On Wed, Jun 22, 2022 at 03:09:31PM +0800, Jason Wang wrote:
> > > On Wed, Jun 22, 2022 at 3:03 PM Michael S. Tsirkin
> > > wrote:
> > > >
> > > > On Wed, Jun 22, 2022 at 09:29:40AM +0800, Jason Wang wrote:
> > > > > We try to harden virtio device notifications in 8b4ec69d7e09 ("virtio:
> > > > > harden vring IRQ"). It works with the assumption that the driver or
> > > > > core can properly call virtio_device_ready() at the right
> > > > > place. Unfortunately, this seems to be not true and uncovers various
> > > > > bugs in the existing drivers, mainly the issue of using
> > > > > virtio_device_ready() incorrectly.
> > > > >
> > > > > So let's have a Kconfig option and disable it by default. It gives
> > > > > us a breather to fix the drivers and then we can consider enabling it
> > > > > by default.
> > > > >
> > > > > Signed-off-by: Jason Wang
> > > >
> > > > OK I will queue, but I think the problem is fundamental.
> > >
> > > If I understand correctly, you want some core IRQ work?
> >
> > Yes.
> >
> > > As discussed
> > > before, it doesn't solve all the problems, we still need to do per
> > > driver audit.
> > >
> > > Thanks
> >
> > Maybe, but we don't need to tie things to device_ready then.
> > We can do
> >
> > - disable irqs
> > - device ready
> > - setup everything
> > - enable irqs
> >
> > and this works for most things, the only issue is
> > this deadlocks if "setup everything" waits for interrupts.
> >
> > With the current approach there's really no good time:
> >
> > 1. - setup everything
> >    - device ready
> >
> > can cause kicks before device is ready
> >
> > 2. - device ready
> >    - setup everything
> >
> > can cause callbacks before setup.
> >
> > So I prefer 1. and fix the hardening in the core.
>
> So my question is:
>
> 1) do similar hardening like config interrupt
> or
> 2) per transport notification work (e.g for PCI core IRQ work)
>
> 1) seems easier and universal, but we pay a little overhead which could
> be eliminated by the config option.

I doubt 1 is easy and I am not even sure core IRQ changes will help. My
concern with adding overhead is that I'm not sure these are not just
wasted CPU cycles. We spent a bunch of time on irq hardening and so far
we are still at the "all drivers need to be fixed" stage.

The reason config was kind of easy is that the config interrupt is
rarely vital for device function, so arbitrarily deferring it does not
lead to deadlocks - what you are trying to do with VQ interrupts is
fundamentally different. Things are especially bad if we just drop an
interrupt, but deferring can lead to problems too. Consider as an
example "virtio-net: fix race between ndo_open() and
virtio_device_ready()" - if you just defer vq interrupts you get
deadlocks.

So, thinking about all this, how about a simple per vq flag meaning
"this vq was kicked since reset"? If the driver does not kick then it's
not ready to get callbacks, right? Sounds quite clean, but we need to
think through memory ordering concerns - I guess it's only when we
change the value, so

	if (!vq->kicked) {
		vq->kicked = true;
		mb();
	}

will do the trick, right?

We also need to think about the reset path - it already synchronizes
callbacks and already can lose interrupts, so we just need to clear
vq->kicked before that, right?

> 2) seems to require more work in the IRQ core and it can not work for all
> transports (e.g vDPA would be kind of difficult)
>
> Thanks

Hmm I don't really get why it would be difficult. vDPA is mostly PCI,
isn't it? With PCI both level INT#x and edge MSI have interrupt masking
support.
> > > > > ---
> > > > > Changes since V2:
> > > > > - Tweak the Kconfig help
> > > > > - Add comment for the read_lock() pairing in virtio_ccw
> > > > > ---
> > > > >  drivers/s390/virtio/virtio_ccw.c | 9 ++++++++-
> > > > >  drivers/virtio/Kconfig           | 13 +++++++++++++
> > > > >  drivers/virtio/virtio.c          | 2 ++
> > > > >  drivers/virtio/virtio_ring.c     | 12 ++++++++++++
> > > > >  include/linux/virtio_config.h    | 2 ++
> > > > >  5 files changed, 37 insertions(+), 1 deletion(-)
> > > > >
> > > > > diff --git a/drivers/s390/virtio/virtio_ccw.c
> > > > > b/drivers/s390/virtio/virtio_ccw.c
> > > > > index 97e51c34e6cf..1f6a358f65f0 100644
> > > > > --- a/drivers/s390/virtio/virtio_ccw.c
> > > > > +++ b/drivers/s390/virtio/virtio_ccw.c
> > > > > @@ -1136,8 +1136,13 @@ static void virtio_ccw_int_handler(struct
> > > > > ccw_device *cdev,
> > > > >  		vcdev->err = -EIO;
> > > > >  	}
> > > > >  	virtio_ccw_check_activity(vcdev, activity);
> > > > > -	/* Interrupts are disabled here */
> > > > > +#ifdef CONFIG_VIRTIO_HARDEN_NOTIFICATION
> > > > > +	/*
> > > > > +	 * Paired with virtio_ccw_synchronize_cbs() and interrupts are
> > > > > +
RE: [PATCH v3 5/5] vfio/iommu_type1: Simplify group attachment
> From: Nicolin Chen
> Sent: Friday, June 24, 2022 4:00 AM
>
> Un-inline the domain specific logic from the attach/detach_group ops into
> two paired functions vfio_iommu_alloc_attach_domain() and
> vfio_iommu_detach_destroy_domain() that strictly deal with creating and
> destroying struct vfio_domains.
>
> Add the logic to check for the EMEDIUMTYPE return code of
> iommu_attach_group() and avoid the extra domain allocations and
> attach/detach sequences of the old code. This allows properly detecting
> an actual attach error, like -ENOMEM, vs treating all attach errors as
> an incompatible domain.
>
> Co-developed-by: Jason Gunthorpe
> Signed-off-by: Jason Gunthorpe
> Signed-off-by: Nicolin Chen

Reviewed-by: Kevin Tian
[PATCH net] virtio-net: fix race between ndo_open() and virtio_device_ready()
We used to call virtio_device_ready() after netdev registration. This
causes a race between ndo_open() and virtio_device_ready(): if
ndo_open() is called before virtio_device_ready(), the driver may
start to use the device before DRIVER_OK which violates the spec.

Fix this by switching to register_netdevice() and protecting the
virtio_device_ready() call with rtnl_lock() to make sure ndo_open() can
only be called after virtio_device_ready().

Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
Signed-off-by: Jason Wang
---
 drivers/net/virtio_net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index db05b5e930be..8a5810bcb839 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
 	if (vi->has_rss || vi->has_rss_hash_report)
 		virtnet_init_default_rss(vi);
 
-	err = register_netdev(dev);
+	/* serialize netdev register + virtio_device_ready() with ndo_open() */
+	rtnl_lock();
+
+	err = register_netdevice(dev);
 	if (err) {
 		pr_debug("virtio_net: registering device failed\n");
+		rtnl_unlock();
 		goto free_failover;
 	}
 
 	virtio_device_ready(vdev);
 
+	rtnl_unlock();
+
 	err = virtnet_cpu_notif_add(vi);
 	if (err) {
 		pr_debug("virtio_net: registering cpu notifier failed\n");
-- 
2.25.1
Re: [PATCH v10 25/41] virtio_pci: struct virtio_pci_common_cfg add queue_notify_data
On Mon, Jun 27, 2022 at 10:30:42AM +0800, Jason Wang wrote:
> On Fri, Jun 24, 2022 at 2:59 PM Michael S. Tsirkin wrote:
> >
> > On Fri, Jun 24, 2022 at 10:56:05AM +0800, Xuan Zhuo wrote:
> > > Add queue_notify_data in struct virtio_pci_common_cfg, which comes from
> > > here https://github.com/oasis-tcs/virtio-spec/issues/89
> > >
> > > To avoid breaking uABI, add a new struct virtio_pci_common_cfg_notify.
> >
> > What exactly is meant by not breaking uABI?
> > Users are supposed to be prepared for struct size to change ... no?
>
> Not sure, any doc for this?
>
> Thanks

Well we have this:

	The drivers SHOULD only map part of configuration structure
	large enough for device operation.  The drivers MUST handle
	an unexpectedly large \field{length}, but MAY check that
	\field{length} is large enough for device operation.

> > > Since I want to add queue_reset after queue_notify_data, I submitted
> > > this patch first.
> > >
> > > Signed-off-by: Xuan Zhuo
> > > Acked-by: Jason Wang
> > > ---
> > >  include/uapi/linux/virtio_pci.h | 7 +++++++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/include/uapi/linux/virtio_pci.h
> > > b/include/uapi/linux/virtio_pci.h
> > > index 3a86f36d7e3d..22bec9bd0dfc 100644
> > > --- a/include/uapi/linux/virtio_pci.h
> > > +++ b/include/uapi/linux/virtio_pci.h
> > > @@ -166,6 +166,13 @@ struct virtio_pci_common_cfg {
> > >  	__le32 queue_used_hi;		/* read-write */
> > >  };
> > >
> > > +struct virtio_pci_common_cfg_notify {
> > > +	struct virtio_pci_common_cfg cfg;
> > > +
> > > +	__le16 queue_notify_data;	/* read-write */
> > > +	__le16 padding;
> > > +};
> > > +
> > >  /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
> > >  struct virtio_pci_cfg_cap {
> > >  	struct virtio_pci_cap cap;
> > > --
> > > 2.31.0
Re: [PATCH] virtio-net: fix race between ndo_open() and virtio_device_ready()
On Fri, Jun 17, 2022 at 3:29 PM Jason Wang wrote:
>
> We used to call virtio_device_ready() after netdev registration. This
> cause a race between ndo_open() and virtio_device_ready(): if
> ndo_open() is called before virtio_device_ready(), the driver may
> start to use the device before DRIVER_OK which violates the spec.
>
> Fixing this by switching to use register_netdevice() and protect the
> virtio_device_ready() with rtnl_lock() to make sure ndo_open() can
> only be called after virtio_device_ready().
>
> Fixes: 4baf1e33d0842 ("virtio_net: enable VQs early")
> Signed-off-by: Jason Wang

Ok, I think we're fine with this. So I will repost against -net. If we
spot issues with unregistering, we can use a separate patch for that.

Thanks

> ---
>  drivers/net/virtio_net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index db05b5e930be..8a5810bcb839 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -3655,14 +3655,20 @@ static int virtnet_probe(struct virtio_device *vdev)
>  	if (vi->has_rss || vi->has_rss_hash_report)
>  		virtnet_init_default_rss(vi);
>
> -	err = register_netdev(dev);
> +	/* serialize netdev register + virtio_device_ready() with ndo_open() */
> +	rtnl_lock();
> +
> +	err = register_netdevice(dev);
>  	if (err) {
>  		pr_debug("virtio_net: registering device failed\n");
> +		rtnl_unlock();
>  		goto free_failover;
>  	}
>
>  	virtio_device_ready(vdev);
>
> +	rtnl_unlock();
> +
>  	err = virtnet_cpu_notif_add(vi);
>  	if (err) {
>  		pr_debug("virtio_net: registering cpu notifier failed\n");
> --
> 2.25.1