Re: [PATCH net-next v10] virtio/vsock: replace virtio_vsock_pkt with sk_buff

2023-01-17 Thread Michael S. Tsirkin
On Fri, Jan 13, 2023 at 10:21:37PM +, Bobby Eshleman wrote:
> This commit changes virtio/vsock to use sk_buff instead of
> virtio_vsock_pkt. Beyond better conforming to other net code, using
> sk_buff allows vsock to use sk_buff-dependent features in the future
> (such as sockmap) and improves throughput.
> 
> This patch introduces the following performance changes:
> 
> Tool: Uperf
> Env: Phys Host + L1 Guest
> Payload: 64k
> Threads: 16
> Test Runs: 10
> Type: SOCK_STREAM
> Before: commit b7bfaa761d760 ("Linux 6.2-rc3")
> 
> Before
> --
> g2h: 16.77Gb/s
> h2g: 10.56Gb/s
> 
> After
> -
> g2h: 21.04Gb/s
> h2g: 10.76Gb/s
> 
> Signed-off-by: Bobby Eshleman 


Acked-by: Michael S. Tsirkin 


> ---
> Changes in v10:
> - vhost/vsock: use virtio_vsock_skb_dequeue()
> - vhost/vsock: remove extra iov_length() call
> - vhost/vsock: also consider hdr when evaluating that incoming size is
>   valid
> - new uperf data
> 
> Changes in v9:
> - check length in rx header
> - guard allocator from small requests
> - squashed fix for v8 bug reported by syzbot:
> syzbot+30b72abaa17c07fe3...@syzkaller.appspotmail.com
> 
> Changes in v8:
> - vhost/vsock: remove unused enum
> - vhost/vsock: use spin_lock_bh() instead of spin_lock()
> - vsock/loopback: use __skb_dequeue instead of skb_dequeue
> 
> Changes in v7:
> - use skb_queue_empty() instead of skb_queue_empty_lockless()
> 
> Changes in v6:
> - use skb->cb instead of skb->_skb_refdst
> - use lock-free __skb_queue_purge for rx_queue when rx_lock held
> 
> Changes in v5:
> - last_skb instead of skb: last_hdr->len = cpu_to_le32(last_skb->len)
> 
> Changes in v4:
> - vdso/bits.h -> linux/bits.h
> - add virtio_vsock_alloc_skb() helper
> - virtio/vsock: rename buf_len -> total_len
> - update last_hdr->len
> - fix build_skb() for vsockmon (tested)
> - add queue helpers
> - use spin_{unlock/lock}_bh() instead of spin_lock()/spin_unlock()
> - note: I only ran a few g2h tests to check that this change
>   had no perf impact. The above data is still from patch
>   v3.
> 
> Changes in v3:
> - fix seqpacket bug
> - use zero in vhost_add_used(..., 0) since the device doesn't write to the buffer
> - use xmas tree style declarations
> - vsock_hdr() -> virtio_vsock_hdr() and other include file style fixes
> - no skb merging
> - save space by not using vsock_metadata
> - use _skb_refdst instead of skb buffer space for flags
> - use skb_pull() to keep track of read bytes instead of using an
>   extra variable 'off' in the skb buffer space
> - remove unnecessary sk_allocation assignment
> - do not zero hdr needlessly
> - introduce virtio_transport_skb_len() because skb->len changes now
> - use spin_lock() directly on queue lock instead of sk_buff_head helpers
>   which use spin_lock_irqsave() (e.g., skb_dequeue)
> - do not reduce buffer size to be page size divisible
> - Note: the biggest performance change came from loosening the spinlock
>   variation and not reducing the buffer size.
> 
> Changes in v2:
> - Use alloc_skb() directly instead of sock_alloc_send_pskb() to minimize
>   uAPI changes.
> - Do not marshal errors to -ENOMEM for non-virtio implementations.
> - No longer a part of the original series
> - Some code cleanup and refactoring
> - Include performance stats
> 
>  drivers/vhost/vsock.c   | 214 +---
>  include/linux/virtio_vsock.h| 129 ++--
>  net/vmw_vsock/virtio_transport.c| 149 +++--
>  net/vmw_vsock/virtio_transport_common.c | 422 +---
>  net/vmw_vsock/vsock_loopback.c  |  51 +--
>  5 files changed, 498 insertions(+), 467 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index a2b374372363..1f3b89c885cc 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -51,8 +51,7 @@ struct vhost_vsock {
>   struct hlist_node hash;
>  
>   struct vhost_work send_pkt_work;
> - spinlock_t send_pkt_list_lock;
> - struct list_head send_pkt_list; /* host->guest pending packets */
> + struct sk_buff_head send_pkt_queue; /* host->guest pending packets */
>  
>   atomic_t queued_replies;
>  
> @@ -108,40 +107,31 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
> vhost_disable_notify(&vsock->dev, vq);
>  
>   do {
> - struct virtio_vsock_pkt *pkt;
> + struct virtio_vsock_hdr *hdr;
> + size_t iov_len, payload_len;
>   struct iov_iter iov_iter;
> + u32 flags_to_restore = 0;
> + struct sk_buff *skb;
>   unsigned out, in;
>   size_t nbytes;
> - size_t iov_len, payload_len;
>   int head;
> - u32 flags_to_restore = 0;
>  
> - spin_lock_bh(&vsock->send_pkt_list_lock);
> - if (list_empty(&vsock->send_pkt_list)) {
> - spin_unlock_bh(&vsock->send_pkt_list_lock);
> + skb = virtio_vsock_skb_dequeue(&vsock->send_pkt_queue);
> +
> + if (!skb) {
>   
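
The shape of the conversion quoted above — dropping the open-coded list_head plus spinlock in favour of a queue whose dequeue helper hands back NULL when empty — can be sketched as a plain userspace analog (all names here are illustrative, not the kernel API):

```c
#include <assert.h>
#include <stddef.h>

/* Toy packet with an intrusive next pointer, standing in for sk_buff. */
struct pkt {
    int id;
    struct pkt *next;
};

/* Minimal FIFO standing in for sk_buff_head. The lock is elided in this
 * single-threaded sketch; the kernel queue embeds its own spinlock. */
struct pkt_queue {
    struct pkt *head, *tail;
};

static void pkt_queue_init(struct pkt_queue *q)
{
    q->head = q->tail = NULL;
}

static void pkt_enqueue(struct pkt_queue *q, struct pkt *p)
{
    p->next = NULL;
    if (q->tail)
        q->tail->next = p;
    else
        q->head = p;
    q->tail = p;
}

/* Like virtio_vsock_skb_dequeue(): the empty-check and the removal
 * collapse into one call that returns NULL when nothing is pending. */
static struct pkt *pkt_dequeue(struct pkt_queue *q)
{
    struct pkt *p = q->head;

    if (!p)
        return NULL;
    q->head = p->next;
    if (!q->head)
        q->tail = NULL;
    return p;
}
```

The caller's lock/empty-check/unlock dance collapses into `if (!(p = pkt_dequeue(&q))) break;`, which is essentially what the dequeue helper buys the vhost worker loop above.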

RE: [PATCH 6/8] iommu/intel: Add a gfp parameter to alloc_pgtable_page()

2023-01-17 Thread Tian, Kevin
> From: Jason Gunthorpe 
> Sent: Tuesday, January 17, 2023 9:30 PM
> 
> On Tue, Jan 17, 2023 at 03:35:08AM +, Tian, Kevin wrote:
> > > From: Jason Gunthorpe 
> > > Sent: Saturday, January 7, 2023 12:43 AM
> > >
> > > @@ -2676,7 +2676,7 @@ static int copy_context_table(struct
> intel_iommu
> > > *iommu,
> > >   if (!old_ce)
> > >   goto out;
> > >
> > > - new_ce = alloc_pgtable_page(iommu->node);
> > > + new_ce = alloc_pgtable_page(iommu->node,
> > > GFP_KERNEL);
> >
> > GFP_ATOMIC
> 
> Can't be:
> 
>   old_ce = memremap(old_ce_phys, PAGE_SIZE,
>   MEMREMAP_WB);
>   if (!old_ce)
>   goto out;
> 
>   new_ce = alloc_pgtable_page(iommu->node,
> GFP_KERNEL);
>   if (!new_ce)
> 
> memremap() is sleeping.
> 
> And the only caller is:
> 
>   ctxt_tbls = kcalloc(ctxt_table_entries, sizeof(void *), GFP_KERNEL);
>   if (!ctxt_tbls)
>   goto out_unmap;
> 
>   for (bus = 0; bus < 256; bus++) {
>   ret = copy_context_table(iommu, &old_rt[bus],
>ctxt_tbls, bus, ext);
> 

Yes, but the patch description says "Push the GFP_ATOMIC to all
callers." implying it's purely a refactoring w/o changing those
semantics.

I'm fine with doing this change in this patch, but it's worth
a clarification in the patch description.
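
The refactoring being discussed — hoisting a hard-coded GFP_ATOMIC into a caller-supplied parameter, so a call site that is already allowed to sleep (here, one that just called the sleeping memremap()) can legitimately pass GFP_KERNEL — can be sketched with a hypothetical userspace stand-in:

```c
#include <assert.h>
#include <stdlib.h>

/* Stand-in for gfp_t: whether the caller is allowed to sleep. */
typedef enum { ALLOC_MAY_SLEEP, ALLOC_ATOMIC } alloc_mode_t;

/* Before the refactor, the helper hard-coded the atomic mode internally.
 * After it, the mode travels with the call, so each call site documents
 * its own context. In userspace both modes degenerate to calloc(); the
 * point is the signature, not the allocator. */
static void *alloc_table_page(alloc_mode_t mode)
{
    (void)mode;
    return calloc(1, 4096);
}
```

A sleepable caller then writes `alloc_table_page(ALLOC_MAY_SLEEP)` while an interrupt-context caller passes `ALLOC_ATOMIC`, which is the semantic distinction the review comment asks to have spelled out in the patch description.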
___
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


Re: [PATCH net V3] virtio-net: correctly enable callback during start_xmit

2023-01-17 Thread Laurent Vivier

On 1/17/23 04:47, Jason Wang wrote:

Commit a7766ef18b33 ("virtio_net: disable cb aggressively") enables
virtqueue callback via the following statement:

 do {
if (use_napi)
virtqueue_disable_cb(sq->vq);

free_old_xmit_skbs(sq, false);

} while (use_napi && kick &&
unlikely(!virtqueue_enable_cb_delayed(sq->vq)));

When NAPI is used and kick is false, the callback won't be enabled
here. And when the virtqueue is about to be full, the tx will be
disabled, but we still don't enable tx interrupt which will cause a TX
hang. This could be observed when using pktgen with burst enabled.

To be consistent with the logic that tries to disable the cb only for
NAPI, fix this by trying to enable the delayed callback only when NAPI
is enabled and the queue is about to be full.
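
As a toy model (not the driver code itself), the fixed branch can be read as a small decision function: once the queue is stopped, both modes try to arm the delayed callback, and they differ only in the recovery path taken when arming fails because more buffers were already consumed:

```c
#include <assert.h>
#include <stdbool.h>

enum xmit_action { XMIT_NONE, XMIT_SCHED_NAPI, XMIT_FREE_AND_RECHECK };

/* Hypothetical model of the fixed start_xmit() branch: cb_armed is the
 * result of virtqueue_enable_cb_delayed() after the queue was stopped. */
static enum xmit_action on_queue_full(bool use_napi, bool cb_armed)
{
    if (cb_armed)
        return XMIT_NONE;          /* interrupt will restart the queue */
    if (use_napi)
        return XMIT_SCHED_NAPI;    /* let the NAPI poller clean up */
    return XMIT_FREE_AND_RECHECK;  /* free old skbs inline and recheck */
}
```

The bug being fixed is that the pre-patch code only reached the arming attempt in the `!use_napi` case, so with NAPI and `kick == false` nothing ever re-enabled the tx interrupt.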

Fixes: a7766ef18b33 ("virtio_net: disable cb aggressively")
Signed-off-by: Jason Wang 
---
The patch is needed for -stable.
Changes since V2:
- try to enable the delayed callback and schedule NAPI instead of trying to
   poll as when TX NAPI is disabled
Changes since V1:
- enable tx interrupt after we disable TX
---
  drivers/net/virtio_net.c | 6 --
  1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7723b2a49d8e..18b3de854aeb 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1877,8 +1877,10 @@ static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
 */
if (sq->vq->num_free < 2+MAX_SKB_FRAGS) {
netif_stop_subqueue(dev, qnum);
-   if (!use_napi &&
-   unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
+   if (use_napi) {
+   if (unlikely(!virtqueue_enable_cb_delayed(sq->vq)))
+   virtqueue_napi_schedule(&sq->napi, sq->vq);
+   } else if (unlikely(!virtqueue_enable_cb_delayed(sq->vq))) {
/* More just got used, free them then recheck. */
free_old_xmit_skbs(sq, false);
if (sq->vq->num_free >= 2+MAX_SKB_FRAGS) {


This fix works fine with my test case (netdev stream + passt)

Tested-by: Laurent Vivier 

Thanks,
Laurent



Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 02:21:40PM +, Sudeep Holla wrote:
> On Tue, Jan 17, 2023 at 01:16:21PM +, Mark Rutland wrote:
> > On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> > > On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> > > 
> > > > I'm sorry to have to bear some bad news on that front. :(
> > > 
> > > Moo, something had to give..
> > > 
> > > 
> > > > IIUC what's happening here is the PSCI cpuidle driver has entered idle 
> > > > and RCU
> > > > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > > > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > > > trace_hardirqs_off() and the RCU usage.
> > > 
> > > Right, strictly speaking not needed at this point, IRQs should have been
> > > traced off a long time ago.
> > 
> > True, but there are some other calls around here that *might* end up 
> > invoking
> > RCU stuff (e.g. the MTE code).
> > 
> > That all needs a noinstr cleanup too, which I'll sort out as a follow-up.
> > 
> > > > I think we need RCU to be watching all the way down to cpu_suspend(), 
> > > > and it's
> > > > cpu_suspend() that should actually enter/exit idle context. That and we 
> > > > need to
> > > > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > > > 
> > > > I'm not sure whether 32-bit will have a similar issue or not.
> > > 
> > > I'm not seeing 32bit or Risc-V have similar issues here, but who knows,
> > > maybe I missed something.
> > 
> > I reckon if they do, the core changes here give us the infrastructure to fix
> > them if/when we get reports.
> > 
> > > In any case, the below ought to cure the ARM64 case and remove that last
> > > known RCU_NONIDLE() user as a bonus.
> > 
> > The below works for me testing on a Juno R1 board with PSCI, using 
> > defconfig +
> > CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + 
> > CONFIG_DEBUG_ATOMIC_SLEEP=y.
> > I'm not sure how to test the LPI / FFH part, but it looks good to me.
> > 
> > FWIW:
> > 
> > Reviewed-by: Mark Rutland 
> > Tested-by: Mark Rutland 
> > 
> > Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
> > options above?
> > 
> 
> Not sure if I have messed up something in my mail setup, but I did reply
> earlier.

Sorry, that was my bad; I had been drafting my reply for a while and forgot to
re-check prior to sending.

> I did test both the DT/cpuidle-psci driver and the ACPI/LPI+FFH driver
> with the fix Peter sent. I was seeing the same splat as you in both DT and
> ACPI boot, which the patch fixed. I used the same config as described by
> you above.

Perfect; thanks!

Mark.


Re: [RFC] memory pressure detection in VMs using PSI mechanism for dynamically inflating/deflating VM memory

2023-01-17 Thread David Hildenbrand

On 15.01.23 04:57, Sudarshan Rajagopalan wrote:

Hello all,



Hi,

I'll focus on the virtio-mem side of things :)


We’re from the Linux memory team here at Qualcomm. We are currently
devising a VM memory resizing feature where we dynamically inflate or
deflate the Linux VM based on ongoing memory demands in the VM. We
wanted to propose a few details about this userspace daemon in the form
of an RFC and get the upstream community’s opinion. Here are a few details –


I'd avoid using the terminology of inflating/deflating VM memory when 
talking about virtio-mem. Just call it "dynamically resizing VM memory". 
virtio-mem is one way of doing it using memory devices.


Inflation/deflation, in contrast, reminds one of a traditional balloon 
driver, along the lines of virtio-balloon.




1. This will be a native userspace daemon running only in the Linux VM,
using the virtio-mem driver, which uses memory hotplug to add/remove
memory. The VM (aka Secondary VM, SVM) will request memory from the
host, which is the Primary VM (PVM), via the backend hypervisor, which
takes care of cross-VM communication.

2. This will be a guest-side daemon. It will use the PSI mechanism to
monitor memory pressure and keep track of memory demands in the system.
It will register for a few memory pressure events and make an educated
guess as to when the demand for memory in the system is increasing.


Is that running in the primary or the secondary VM?



3. Currently, the min PSI window size is 500ms, so the PSI monitor
sampling period will be 50ms. In order to get a quicker response from
PSI, we’ve reduced the min window size to 50ms so that an increase in
memory pressure as small as 5ms can be reported to userspace by PSI.

/* PSI trigger definitions */
-#define WINDOW_MIN_US 500000   /* Min window size is 500ms */
+#define WINDOW_MIN_US 50000    /* Min window size is 50ms */
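
For reference, PSI exposes its averages as text lines like `some avg10=1.23 avg60=0.45 avg300=0.06 total=123456` under /proc/pressure/memory; a daemon like the one described would parse them roughly as follows (a hedged sketch, not the daemon's actual code):

```c
#include <assert.h>
#include <stdio.h>

/* Parse one line of /proc/pressure/memory, e.g.
 *   "some avg10=1.23 avg60=0.45 avg300=0.06 total=123456"
 * Returns 0 on success, -1 if the line does not match the PSI format. */
static int parse_psi_line(const char *line, double *avg10, double *avg60,
                          double *avg300, unsigned long long *total)
{
    char kind[8]; /* "some" or "full" */

    if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
               kind, avg10, avg60, avg300, total) != 5)
        return -1;
    return 0;
}
```

Per the kernel PSI documentation, a userspace trigger is registered by writing a string such as "some 5000 50000" (stall threshold and window, both in µs) to /proc/pressure/memory and then poll()ing the fd for POLLPRI — which is exactly the window-size floor the #define change above relaxes.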

4. Detecting an increase in memory demand – when a usecase that does
memory allocations starts in the VM, it will stall, causing the PSI
mechanism to generate a memory pressure event to userspace. Simply put,
when pressure rises past a certain set threshold, the daemon can make an
educated guess that a memory-demanding usecase has run and that memory
needs to be added to the VM.

5. Detecting a decrease in memory pressure – the reverse part, where we
give memory back to the PVM when it is no longer needed, is a bit
tricky. We look for pressure decay, checking whether the PSI averages
(avg10, avg60, avg300) go down, and along with other memory stats (such
as free memory etc.) we make an educated guess that the usecase has
ended and its memory has been freed, so that this memory can be given
back to the PVM.
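
Items 4 and 5 amount to an asymmetric hysteresis policy: plug memory quickly on a spike in avg10, but unplug only once all the decaying averages have settled and free memory confirms the demand is gone. A hypothetical sketch (thresholds invented purely for illustration):

```c
#include <assert.h>

enum mem_action { MEM_HOLD, MEM_PLUG, MEM_UNPLUG };

/* Hypothetical decision step for the daemon's main loop. `high` and
 * `low` are pressure thresholds on the PSI averages; `slack_mib` is the
 * free-memory cushion required before giving memory back. */
static enum mem_action decide(double avg10, double avg60, double avg300,
                              unsigned long free_mib,
                              double high, double low,
                              unsigned long slack_mib)
{
    if (avg10 >= high)
        return MEM_PLUG;    /* react fast to a pressure spike */
    if (avg10 < low && avg60 < low && avg300 < low && free_mib > slack_mib)
        return MEM_UNPLUG;  /* all averages decayed and memory is idle */
    return MEM_HOLD;        /* ambiguous: avoid plug/unplug flapping */
}
```

The asymmetry (one fast trigger to grow, several slow conditions to shrink) is what keeps a short pressure dip from bouncing memory back and forth between SVM and PVM.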

6. I’m skimming over much of the logic and intelligence here, but the
daemon relies on the PSI mechanism to know when memory demand is going
up or down, and communicates with the virtio-mem driver to hot-plug or
unplug memory.


For now, the hypervisor is in charge of triggering a virtio-mem device 
resize request. Will the Linux VM expose a virtio-mem device to the SVM 
and request to resize the SVM memory via that virtio-mem device?



We also factor in the latency involved in roundtrips between SVM<->PVM,
so we size the memory chunk that needs to be plugged in accordingly.

7. The whole purpose of the daemon using the PSI mechanism is to make
this guest driven rather than host driven, which is currently the case
for most virtio-mem users. The memory pressure and usage monitoring
happens inside the SVM, and the SVM makes the decisions to request
memory from the PVM. This avoids any intervention, such as an admin in
the PVM monitoring and controlling the knobs. We have also set a max
limit on how much SVMs can grow in terms of memory, so that a rogue VM
cannot abuse this scheme.


Something I envisioned at some point is to
1) Have a virtio-mem guest driver to request a size change. The
   hypervisor will react accordingly by adjusting the requested size.

   Such a driver<->device request could be communicated via any other
   communication mechanism to the hypervisor, but it already came up a
   couple of times to do it via the virtio-mem protocol directly.

2) Configure the hypervisor to have a lower/upper range. Within that
   range, resize requests by the driver can be granted. The current
   values of these properties can be exposed via the device to the
   driver as well.
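
Point 2 above boils down to the hypervisor clamping a driver-requested size to its configured range before granting it; a trivial sketch of that policy (names hypothetical):

```c
#include <assert.h>

/* Hypothetical grant policy from item 2: the hypervisor honors a
 * driver-requested virtio-mem size only within its configured
 * [lower, upper] range (sizes in arbitrary units, e.g. MiB). */
static unsigned long long grant_resize(unsigned long long requested,
                                       unsigned long long lower,
                                       unsigned long long upper)
{
    if (requested < lower)
        return lower;
    if (requested > upper)
        return upper;
    return requested;
}
```

Exposing `lower`/`upper` through the device, as suggested, would let the guest daemon know in advance which requests can be granted instead of discovering the clamp after the fact.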

Is that what you also proposing here? If so, great.



This daemon is currently just in beta and we have basic
functionality running. We are yet to add more flesh to this scheme to


Good to hear that the basics are running with virtio-mem (I assume :) ).


make sure any potential risks or security concerns are taken care of as well.


It would be great to draw/explain the architecture in more detail.

--
Thanks,

David / dhildenb


Re: [PATCH 0/5] Harden a few virtio drivers

2023-01-17 Thread Michael S. Tsirkin
On Tue, Jan 17, 2023 at 03:57:55PM +0200, Alexander Shishkin wrote:
> Hi,
> 
> Here are 5 patches that harden console, net and 9p drivers against
> various malicious host input.

So I'm all for hardening against buggy devices.
But if it's a malicious host you are worried about, shouldn't
we worry about all the spectre things?



> Alexander Shishkin (2):
>   virtio console: Harden control message handling
>   virtio_net: Guard against buffer length overflow in
> xdp_linearize_page()
> 
> Andi Kleen (3):
>   virtio console: Harden multiport against invalid host input
>   virtio console: Harden port adding
>   virtio 9p: Fix an overflow
> 
>  drivers/char/virtio_console.c | 19 ---
>  drivers/net/virtio_net.c  |  4 +++-
>  net/9p/trans_virtio.c |  2 +-
>  3 files changed, 16 insertions(+), 9 deletions(-)
> 
> -- 
> 2.39.0



[PATCH 0/2] vhost/net: Clear the pending messages when the backend is removed

2023-01-17 Thread Eric Auger
When the vhost iotlb is used along with a guest virtual iommu
and the guest gets rebooted, some MISS messages may have been
recorded just before the reboot and spuriously executed by
the virtual iommu after the reboot. This is due to the fact
the pending messages are not cleared.

As vhost does not have any explicit reset user API,
VHOST_NET_SET_BACKEND looks a reasonable point where to clear
the pending messages, in case the backend is removed (fd = -1).

This version is a follow-up on the discussions held in [1].

The first patch removes an unused 'enabled' parameter in
vhost_init_device_iotlb().

Best Regards

Eric

History:
[1] RFC: [RFC] vhost: Clear the pending messages on vhost_init_device_iotlb()
https://lore.kernel.org/all/20221107203431.368306-1-eric.au...@redhat.com/

Eric Auger (2):
  vhost: Remove the enabled parameter from vhost_init_device_iotlb
  vhost/net: Clear the pending messages when the backend is removed

 drivers/vhost/net.c   | 5 -
 drivers/vhost/vhost.c | 5 +++--
 drivers/vhost/vhost.h | 3 ++-
 3 files changed, 9 insertions(+), 4 deletions(-)

-- 
2.31.1



[PATCH 1/2] vhost: Remove the enabled parameter from vhost_init_device_iotlb

2023-01-17 Thread Eric Auger
The 'enabled' parameter is not used by the function. Remove it.

Signed-off-by: Eric Auger 
Reported-by: Michael S. Tsirkin 
---
 drivers/vhost/net.c   | 2 +-
 drivers/vhost/vhost.c | 2 +-
 drivers/vhost/vhost.h | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 9af19b0cf3b7..135e23254a26 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1642,7 +1642,7 @@ static int vhost_net_set_features(struct vhost_net *n, u64 features)
goto out_unlock;
 
if ((features & (1ULL << VIRTIO_F_ACCESS_PLATFORM))) {
-   if (vhost_init_device_iotlb(&n->dev, true))
+   if (vhost_init_device_iotlb(&n->dev))
goto out_unlock;
}
 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index cbe72bfd2f1f..34458e203716 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -1729,7 +1729,7 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
 }
 EXPORT_SYMBOL_GPL(vhost_vring_ioctl);
 
-int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled)
+int vhost_init_device_iotlb(struct vhost_dev *d)
 {
struct vhost_iotlb *niotlb, *oiotlb;
int i;
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index d9109107af08..4bfa10e52297 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -221,7 +221,7 @@ ssize_t vhost_chr_read_iter(struct vhost_dev *dev, struct iov_iter *to,
int noblock);
 ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
 struct iov_iter *from);
-int vhost_init_device_iotlb(struct vhost_dev *d, bool enabled);
+int vhost_init_device_iotlb(struct vhost_dev *d);
 
 void vhost_iotlb_map_free(struct vhost_iotlb *iotlb,
  struct vhost_iotlb_map *map);
-- 
2.31.1



[PATCH 2/2] vhost/net: Clear the pending messages when the backend is removed

2023-01-17 Thread Eric Auger
When the vhost iotlb is used along with a guest virtual iommu
and the guest gets rebooted, some MISS messages may have been
recorded just before the reboot and spuriously executed by
the virtual iommu after the reboot.

As vhost does not have any explicit reset user API,
VHOST_NET_SET_BACKEND looks like a reasonable point at which to clear
the pending messages, in case the backend is removed.

Export vhost_clear_msg() and call it in vhost_net_set_backend()
when fd == -1.

Signed-off-by: Eric Auger 
Suggested-by: Jason Wang 
Fixes: 6b1e6cc7855b0 ("vhost: new device IOTLB API")

---

Without this patch, with QEMU virtio-iommu, on reboot, we get
spurious messages such as

qemu-kvm: virtio_iommu_translate no mapping for 0xff732800 for sid=1536
---
 drivers/vhost/net.c   | 3 +++
 drivers/vhost/vhost.c | 3 ++-
 drivers/vhost/vhost.h | 1 +
 3 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 135e23254a26..383f8f2ae131 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1511,6 +1511,9 @@ static long vhost_net_set_backend(struct vhost_net *n, unsigned index, int fd)
nvq = &n->vqs[index];
mutex_lock(&vq->mutex);
 
+   if (fd == -1)
+   vhost_clear_msg(&n->dev);
+
/* Verify that ring has been setup correctly. */
if (!vhost_vq_access_ok(vq)) {
r = -EFAULT;
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 34458e203716..f11bdbe4c2c5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -661,7 +661,7 @@ void vhost_dev_stop(struct vhost_dev *dev)
 }
 EXPORT_SYMBOL_GPL(vhost_dev_stop);
 
-static void vhost_clear_msg(struct vhost_dev *dev)
+void vhost_clear_msg(struct vhost_dev *dev)
 {
struct vhost_msg_node *node, *n;
 
@@ -679,6 +679,7 @@ static void vhost_clear_msg(struct vhost_dev *dev)
 
spin_unlock(>iotlb_lock);
 }
+EXPORT_SYMBOL_GPL(vhost_clear_msg);
 
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index 4bfa10e52297..1647b750169c 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -181,6 +181,7 @@ long vhost_dev_ioctl(struct vhost_dev *, unsigned int ioctl, void __user *argp);
 long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *argp);
 bool vhost_vq_access_ok(struct vhost_virtqueue *vq);
 bool vhost_log_access_ok(struct vhost_dev *);
+void vhost_clear_msg(struct vhost_dev *dev);
 
 int vhost_get_vq_desc(struct vhost_virtqueue *,
  struct iovec iov[], unsigned int iov_count,
-- 
2.31.1



Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Mark Rutland
On Tue, Jan 17, 2023 at 11:26:29AM +0100, Peter Zijlstra wrote:
> On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:
> 
> > I'm sorry to have to bear some bad news on that front. :(
> 
> Moo, something had to give..
> 
> 
> > IIUC what's happening here is the PSCI cpuidle driver has entered idle and 
> > RCU
> > is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> > local_daif_*() helpers poke lockdep and tracing, hence the call to
> > trace_hardirqs_off() and the RCU usage.
> 
> Right, strictly speaking not needed at this point, IRQs should have been
> traced off a long time ago.

True, but there are some other calls around here that *might* end up invoking
RCU stuff (e.g. the MTE code).

That all needs a noinstr cleanup too, which I'll sort out as a follow-up.

> > I think we need RCU to be watching all the way down to cpu_suspend(), and 
> > it's
> > cpu_suspend() that should actually enter/exit idle context. That and we 
> > need to
> > make cpu_suspend() and the low-level PSCI invocation noinstr.
> > 
> > I'm not sure whether 32-bit will have a similar issue or not.
> 
> I'm not seeing 32bit or Risc-V have similar issues here, but who knows,
> maybe I missed something.

I reckon if they do, the core changes here give us the infrastructure to fix
them if/when we get reports.

> In any case, the below ought to cure the ARM64 case and remove that last
> known RCU_NONIDLE() user as a bonus.

The below works for me testing on a Juno R1 board with PSCI, using defconfig +
CONFIG_PROVE_LOCKING=y + CONFIG_DEBUG_LOCKDEP=y + CONFIG_DEBUG_ATOMIC_SLEEP=y.
I'm not sure how to test the LPI / FFH part, but it looks good to me.

FWIW:

Reviewed-by: Mark Rutland 
Tested-by: Mark Rutland 

Sudeep, would you be able to give the LPI/FFH side a spin with the kconfig
options above?

Thanks,
Mark.

> 
> ---
> diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c
> index 41974a1a229a..42e19fff40ee 100644
> --- a/arch/arm64/kernel/cpuidle.c
> +++ b/arch/arm64/kernel/cpuidle.c
> @@ -67,10 +67,10 @@ __cpuidle int acpi_processor_ffh_lpi_enter(struct 
> acpi_lpi_state *lpi)
>   u32 state = lpi->address;
>  
>   if (ARM64_LPI_IS_RETENTION_STATE(lpi->arch_flags))
> - return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(psci_cpu_suspend_enter,
> + return 
> CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(psci_cpu_suspend_enter,
>   lpi->index, state);
>   else
> - return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter,
> + return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter,
>lpi->index, state);
>  }
>  #endif
> diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
> index e7163f31f716..0fbdf5fe64d8 100644
> --- a/arch/arm64/kernel/suspend.c
> +++ b/arch/arm64/kernel/suspend.c
> @@ -4,6 +4,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -104,6 +105,10 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>* From this point debug exceptions are disabled to prevent
>* updates to mdscr register (saved and restored along with
>* general purpose registers) from kernel debuggers.
> +  *
> +  * Strictly speaking the trace_hardirqs_off() here is superfluous,
> +  * hardirqs should be firmly off by now. This really ought to use
> +  * something like raw_local_daif_save().
>*/
>   flags = local_daif_save();
>  
> @@ -120,6 +125,8 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   arm_cpuidle_save_irq_context();
>  
> + ct_cpuidle_enter();
> +
>   if (__cpu_suspend_enter()) {
>   /* Call the suspend finisher */
>   ret = fn(arg);
> @@ -133,8 +140,11 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned 
> long))
>*/
>   if (!ret)
>   ret = -EOPNOTSUPP;
> +
> + ct_cpuidle_exit();
>   } else {
> - RCU_NONIDLE(__cpu_suspend_exit());
> + ct_cpuidle_exit();
> + __cpu_suspend_exit();
>   }
>  
>   arm_cpuidle_restore_irq_context();
> diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
> index 4fc4e0381944..312a34ef28dc 100644
> --- a/drivers/cpuidle/cpuidle-psci.c
> +++ b/drivers/cpuidle/cpuidle-psci.c
> @@ -69,16 +69,12 @@ static __cpuidle int 
> __psci_enter_domain_idle_state(struct cpuidle_device *dev,
>   else
>   pm_runtime_put_sync_suspend(pd_dev);
>  
> - ct_cpuidle_enter();
> -
>   state = psci_get_domain_state();
>   if (!state)
>   state = states[idx];
>  
>   ret = psci_cpu_suspend_enter(state) ? -1 : idx;
>  
> - ct_cpuidle_exit();
> -
>   if (s2idle)
>   dev_pm_genpd_resume(pd_dev);
>   else
> @@ -192,7 +188,7 @@ static __cpuidle int 

Re: [PATCH] x86/paravirt: merge activate_mm and dup_mmap callbacks

2023-01-17 Thread Peter Zijlstra
On Sun, Jan 15, 2023 at 08:27:50PM -0800, Srivatsa S. Bhat wrote:

> I see that's not an issue right now since there is no other actual
> user for these callbacks. But are we sure that merging the callbacks
> just because the current user (Xen PV) has the same implementation for
> both is a good idea?

IIRC the pv_ops are part of the PARAVIRT_ME_HARDER (also spelled as
_XXL) suite of ops and they are only to be used by Xen PV, no new users
of these must happen.

The moment we can drop Xen PV (hopes and dreams etc..) all these things
go in the bin right along with it.




Re: [PATCH v3 00/51] cpuidle,rcu: Clean up the mess

2023-01-17 Thread Peter Zijlstra
On Mon, Jan 16, 2023 at 04:59:04PM +, Mark Rutland wrote:

> I'm sorry to have to bear some bad news on that front. :(

Moo, something had to give..


> IIUC what's happening here is the PSCI cpuidle driver has entered idle and RCU
> is no longer watching when arm64's cpu_suspend() manipulates DAIF. Our
> local_daif_*() helpers poke lockdep and tracing, hence the call to
> trace_hardirqs_off() and the RCU usage.

Right, strictly speaking not needed at this point, IRQs should have been
traced off a long time ago.

> I think we need RCU to be watching all the way down to cpu_suspend(), and it's
> cpu_suspend() that should actually enter/exit idle context. That and we need 
> to
> make cpu_suspend() and the low-level PSCI invocation noinstr.
> 
> I'm not sure whether 32-bit will have a similar issue or not.

I'm not seeing 32bit or Risc-V have similar issues here, but who knows,
maybe I missed something.

In any case, the below ought to cure the ARM64 case and remove that last
known RCU_NONIDLE() user as a bonus.

---
diff --git a/arch/arm64/kernel/cpuidle.c b/arch/arm64/kernel/cpuidle.c
index 41974a1a229a..42e19fff40ee 100644
--- a/arch/arm64/kernel/cpuidle.c
+++ b/arch/arm64/kernel/cpuidle.c
@@ -67,10 +67,10 @@ __cpuidle int acpi_processor_ffh_lpi_enter(struct acpi_lpi_state *lpi)
u32 state = lpi->address;
 
if (ARM64_LPI_IS_RETENTION_STATE(lpi->arch_flags))
-   return CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM(psci_cpu_suspend_enter,
+   return CPU_PM_CPU_IDLE_ENTER_RETENTION_PARAM_RCU(psci_cpu_suspend_enter,
lpi->index, state);
else
-   return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter,
+   return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter,
 lpi->index, state);
 }
 #endif
diff --git a/arch/arm64/kernel/suspend.c b/arch/arm64/kernel/suspend.c
index e7163f31f716..0fbdf5fe64d8 100644
--- a/arch/arm64/kernel/suspend.c
+++ b/arch/arm64/kernel/suspend.c
@@ -4,6 +4,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -104,6 +105,10 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 * From this point debug exceptions are disabled to prevent
 * updates to mdscr register (saved and restored along with
 * general purpose registers) from kernel debuggers.
+*
+* Strictly speaking the trace_hardirqs_off() here is superfluous,
+* hardirqs should be firmly off by now. This really ought to use
+* something like raw_local_daif_save().
 */
flags = local_daif_save();
 
@@ -120,6 +125,8 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 */
arm_cpuidle_save_irq_context();
 
+   ct_cpuidle_enter();
+
if (__cpu_suspend_enter()) {
/* Call the suspend finisher */
ret = fn(arg);
@@ -133,8 +140,11 @@ int cpu_suspend(unsigned long arg, int (*fn)(unsigned long))
 */
if (!ret)
ret = -EOPNOTSUPP;
+
+   ct_cpuidle_exit();
} else {
-   RCU_NONIDLE(__cpu_suspend_exit());
+   ct_cpuidle_exit();
+   __cpu_suspend_exit();
}
 
arm_cpuidle_restore_irq_context();
diff --git a/drivers/cpuidle/cpuidle-psci.c b/drivers/cpuidle/cpuidle-psci.c
index 4fc4e0381944..312a34ef28dc 100644
--- a/drivers/cpuidle/cpuidle-psci.c
+++ b/drivers/cpuidle/cpuidle-psci.c
@@ -69,16 +69,12 @@ static __cpuidle int __psci_enter_domain_idle_state(struct cpuidle_device *dev,
else
pm_runtime_put_sync_suspend(pd_dev);
 
-   ct_cpuidle_enter();
-
state = psci_get_domain_state();
if (!state)
state = states[idx];
 
ret = psci_cpu_suspend_enter(state) ? -1 : idx;
 
-   ct_cpuidle_exit();
-
if (s2idle)
dev_pm_genpd_resume(pd_dev);
else
@@ -192,7 +188,7 @@ static __cpuidle int psci_enter_idle_state(struct cpuidle_device *dev,
 {
u32 *state = __this_cpu_read(psci_cpuidle_data.psci_states);
 
-   return CPU_PM_CPU_IDLE_ENTER_PARAM(psci_cpu_suspend_enter, idx, state[idx]);
+   return CPU_PM_CPU_IDLE_ENTER_PARAM_RCU(psci_cpu_suspend_enter, idx, state[idx]);
 }
 
 static const struct of_device_id psci_idle_state_match[] = {
diff --git a/drivers/firmware/psci/psci.c b/drivers/firmware/psci/psci.c
index e7bcfca4159f..f3a044fa4652 100644
--- a/drivers/firmware/psci/psci.c
+++ b/drivers/firmware/psci/psci.c
@@ -462,11 +462,22 @@ int psci_cpu_suspend_enter(u32 state)
if (!psci_power_state_loses_context(state)) {
struct arm_cpuidle_irq_context context;
 
+   ct_cpuidle_enter();
arm_cpuidle_save_irq_context();
ret = psci_ops.cpu_suspend(state, 0);
arm_cpuidle_restore_irq_context();
+ 

Re: [PATCH v3 35/51] trace,hardirq: No moar _rcuidle() tracing

2023-01-17 Thread Peter Zijlstra
On Tue, Jan 17, 2023 at 01:24:46PM +0900, Masami Hiramatsu wrote:
> Hi Peter,
> 
> On Thu, 12 Jan 2023 20:43:49 +0100
> Peter Zijlstra  wrote:
> 
> > Robot reported that trace_hardirqs_{on,off}() tickle the forbidden
> > _rcuidle() tracepoint through local_irq_{en,dis}able().
> > 
> > For 'sane' configs, these calls will only happen with RCU enabled and
> > as such can use the regular tracepoint. This also means it's possible
> > to trace them from NMI context again.
> > 
> > Signed-off-by: Peter Zijlstra (Intel) 
> 
> The code looks good to me. I just have a question about comment.
> 
> > ---
> >  kernel/trace/trace_preemptirq.c |   21 +
> >  1 file changed, 13 insertions(+), 8 deletions(-)
> > 
> > --- a/kernel/trace/trace_preemptirq.c
> > +++ b/kernel/trace/trace_preemptirq.c
> > @@ -20,6 +20,15 @@
> >  static DEFINE_PER_CPU(int, tracing_irq_cpu);
> >  
> >  /*
> > + * ...
> 
> Is this intended? Did you mean to leave a comment here?

I indeed forgot to write the comment before posting, my bad :/ Ingo fixed
it up when he applied.