Re: [PATCH V10 07/19] riscv: qspinlock: errata: Introduce ERRATA_THEAD_QSPINLOCK

2023-09-13 Thread Waiman Long

On 9/13/23 14:54, Palmer Dabbelt wrote:

On Sun, 06 Aug 2023 22:23:34 PDT (-0700), sor...@fastmail.com wrote:

On Wed, Aug 2, 2023, at 12:46 PM, guo...@kernel.org wrote:

From: Guo Ren 

According to qspinlock's requirements, RISC-V provides only a weak
LR/SC forward-progress guarantee, which does not satisfy qspinlock.
But many vendors can produce LR/SC with a stronger forward-progress
guarantee, ensuring that xchg_tail can finish in time on any kind of
hart. T-HEAD is a vendor that implements such strong forward-guarantee
LR/SC instruction pairs, so enable qspinlock for T-HEAD with the
errata framework's help.

Early versions of T-HEAD processors have the merge-buffer delay
problem, so we need ERRATA_WRITEONCE to support qspinlock.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
 arch/riscv/Kconfig.errata  | 13 +
 arch/riscv/errata/thead/errata.c   | 24 
 arch/riscv/include/asm/errata_list.h   | 20 
 arch/riscv/include/asm/vendorid_list.h |  3 ++-
 arch/riscv/kernel/cpufeature.c |  3 ++-
 5 files changed, 61 insertions(+), 2 deletions(-)

diff --git a/arch/riscv/Kconfig.errata b/arch/riscv/Kconfig.errata
index 4745a5c57e7c..eb43677b13cc 100644
--- a/arch/riscv/Kconfig.errata
+++ b/arch/riscv/Kconfig.errata
@@ -96,4 +96,17 @@ config ERRATA_THEAD_WRITE_ONCE

   If you don't know what to do here, say "Y".

+config ERRATA_THEAD_QSPINLOCK
+    bool "Apply T-Head queued spinlock errata"
+    depends on ERRATA_THEAD
+    default y
+    help
+  The T-HEAD C9xx processors implement strong fwd guarantee LR/SC to
+  match the xchg_tail requirement of qspinlock.
+
+  This will apply the QSPINLOCK errata to handle the non-standard
+  behavior via using qspinlock instead of ticket_lock.
+
+  If you don't know what to do here, say "Y".


If this is to be applied, I would like to see a detailed explanation
somewhere, preferably with citations, of:

(a) The memory model requirements for qspinlock


The part of qspinlock that causes problems with many RISC architectures
is its use of a 16-bit xchg() call, which many RISC architectures cannot
do natively and have to emulate, hopefully with some forward-progress
guarantee. Except for that one call, the other atomic operations are all
32 bits in size.
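
For illustration only, here is a sketch of how such a 16-bit exchange ends
up being emulated on top of a 32-bit compare-and-swap loop (simplified from
the 32-bit-cmpxchg fallback of xchg_tail() in kernel/locking/qspinlock.c;
without a forward-progress guarantee for the cmpxchg, or the LR/SC behind
it, this loop can in principle retry indefinitely under contention):

static __always_inline u32 xchg_tail_emulated(atomic_t *lockword, u32 tail)
{
	u32 old, new, val = atomic_read(lockword);

	for (;;) {
		/* keep the low 16 bits (locked + pending), replace the tail */
		new = (val & _Q_LOCKED_PENDING_MASK) | tail;
		old = atomic_cmpxchg_relaxed(lockword, val, new);
		if (old == val)
			break;		/* CAS succeeded */
		val = old;		/* lost a race, retry with the new value */
	}
	return old;			/* previous value; caller extracts the old tail */
}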


Cheers,
Longman


Re: [BUG] virtio-fs: Corruption when running binaries from virtiofsd-backed fs

2023-09-13 Thread Alex Bennée

"Erik Schilling"  writes:

> CCing a few more people as suggested by stefanha on #qemu.

(Add philmd who's tracking the MIPS failure)

> On Wed Sep 13, 2023 at 8:18 AM CEST, Erik Schilling wrote:
>> On Fri Sep 1, 2023 at 12:37 PM CEST, Erik Schilling wrote:
>> > On Wed Aug 30, 2023 at 10:20 AM CEST, Erik Schilling wrote:
>> > > Hi all!
>> > >
>> > > Some days ago I posted to #virtiofs:matrix.org, describing that I am
>> > > observing what looks like a corruption when executing programs from a
>> > > virtiofs-based filesystem.
>> > >
>> > > Over the last few days I spent more time drilling into the problem.
>> > > This is an attempt at summarizing my findings in order to see what other
>> > > people think about this.
>> > >
>> > > When running binaries mounted from virtiofs they may either: fail with a
>> > > segfault, fail with badaddr, get stuck or - sometimes - succeed.
>> > >
>> > > Environment:
>> > >   Host: Fedora 38 running 6.4.11-200.fc38.x86_64
>> > >   Guest: Yocto-based image: 6.4.9-yocto-standard, aarch64
>> > >   virtiofsd: latest main + some debug prints [1]
>> > >   QEMU: built from recent git [2]
>> > >
>> > > virtiofsd invocation:
>> > >   RUST_LOG="debug" ./virtiofsd --seccomp=none --sandbox=none \
>> > > --socket-path "fs.sock0" --shared-dir $PWD/share-dir/ --cache=never
>> > >
>> > > QEMU invocation:
>> > >   ~/projects/qemu/build/qemu-system-aarch64 -kernel Image -machine virt \
>> > > -cpu cortex-a57 \
>> > > -serial mon:stdio \
>> > > -device virtio-net-pci,netdev=net0 \
>> > > -netdev user,id=net0,hostfwd=tcp::2223-:22 \
>> > > -display none -m 2048 -smp 4 \
>> > > -object memory-backend-memfd,id=mem,size=2048M,share=on \
>> > > -numa node,memdev=mem \
>> > > -hda trs-overlay-guest.qcow2 \
>> > > -chardev socket,id=char0,path="fs.sock0" \
>> > > -device 
>> > > vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=/dev/root \
>> > > -append 'root=/dev/vda2 ro log_buf_len=8M'
>> > >
>> > > I figured that launching virtiofsd with --cache=always masks the
>> > > problem. Therefore, I set --cache=never, but I think I observed no
>> > > difference compared to the default setting (auto).
>> > >
>> > > Adding logging to virtiofsd and kernel _felt_ like it made the problem
>> > > harder to reproduce - leaving me with the impression that some race is
>> > > happening somewhere.
>> > >
>> > > Trying to rule out that virtiofsd is returning corrupted data, I added
>> > > some logging and hashsum calculation hacks to it [1]. The hashes check
>> > > out across multiple accesses and the order and kind of queued messages
>> > > is exactly the same in both the error case and crash case. fio was also
>> > > unable to find any errors with a naive job description [3].
>> > >
>> > > Next, I tried to capture info on the guest side. This became a bit
>> > > tricky since the crashes became pretty rare once I followed a fixed
>> > > pattern of starting log capture, running perf and trying to reproduce
>> > > the problem. Ultimately, I had the most consistent results with
>> > > immediately running a program twice:
>> > >
>> > >   /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /; \
>> > > /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /
>> > >
>> > >   (/mnt being the virtiofs mount)
>> > >
>> > > For collecting logs, I made a hack to the guest kernel in order to dump
>> > > the page content after receiving the virtiofs responses [4]. Reproducing
>> > > the problem with this, leaves me with logs that seem to suggest that
>> > > virtiofsd is returning identical content, but the guest kernel seems to
>> > > receive differing pages:
>> > >
>> > > good-kernel [5]:
>> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x312 
>> > > nodeid 0x1 in.len 56 out.len 104
>> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
>> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 1 unique 0x314 
>> > > nodeid 0x1 in.len 53 out.len 128
>> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
>> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x316 
>> > > nodeid 0x29 in.len 56 out.len 104
>> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
>> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 14 unique 0x318 
>> > > nodeid 0x29 in.len 48 out.len 16
>> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
>> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 15 unique 0x31a 
>> > > nodeid 0x29 in.len 80 out.len 832
>> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
>> > >   kernel: virtio_fs: page: 6996d520
>> > >   kernel: virtio_fs: to: de590c14
>> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> > > 
>> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> > > 
>> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
>> > > 
>> > >  

Re: [virtio-comment] virtio-sound linux driver conformance to spec

2023-09-13 Thread Paolo Bonzini
On Wed, Sep 13, 2023 at 5:05 PM Matias Ezequiel Vara Larsen wrote:
>
> Hello,
>
> This email is to report a behavior of the Linux virtio-sound driver that
> does not appear to conform to the VirtIO specification. The kernel
> driver moves buffers from the used ring to the available ring
> without knowing whether the content has been updated from user space. If
> the device picks up buffers from the available ring just after it is
> notified, it can end up reading stale content. This problem can be worked
> around by waiting a period of time after the device dequeues a buffer from
> the available ring. The driver should not be allowed to change the content
> of a buffer that is in the available ring; once buffers are enqueued in the
> available ring, the device may consume them immediately.
>
> 1. Is the current driver implementation following the spec?

If I understand the issue correctly, it's not. As you say, absent any
clarification a buffer that has been placed in the available ring
should be unmodifiable; the driver should operate as if the data in
the available ring is copied to an internal buffer instantly when the
buffer id is added to the ring.

I am assuming this is a playback buffer. To clarify, does the driver
expect buffers to be read only as needed, which is a fraction of a
second before the data is played back?

> 2. Shall the spec be extended to support such a use-case?

If it places N buffers, I think it's a reasonable expectation (for the
sound device only!) that the Nth isn't read until the (N-1)th has
started playing. With two buffers only, the behavior you specify would
not be permissible (because as soon as buffer 1 starts playing, the
device can read buffer 2; there is never an idle buffer). With three
buffers, you could write buffer 3 while buffer 1 plays; write buffer 1
while buffer 2 plays; and write buffer 2 while buffer 3 plays. Is this
enough?
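
A minimal sketch of that three-buffer rotation (the struct and helper names
below are invented purely for illustration; this is not the virtio-sound
driver's code):

#define NBUF 3

struct pcm_buf { unsigned char data[4096]; };	/* placeholder buffer type */

/* Hypothetical helpers, declared only so the sketch reads clearly. */
void fill_from_userspace(struct pcm_buf *b);
void enqueue_available(struct pcm_buf *b);
void wait_period_elapsed(unsigned int playing_idx);

/*
 * Once buffer i starts playing, the device may already be reading
 * buffer i+1, so only buffer i+2 is guaranteed idle and safe to refill.
 */
void playback_loop(struct pcm_buf buf[NBUF])
{
	for (unsigned int n = 0;; n++) {
		unsigned int playing  = n % NBUF;
		unsigned int writable = (n + 2) % NBUF;	/* the only safe slot */

		fill_from_userspace(&buf[writable]);
		enqueue_available(&buf[writable]);
		wait_period_elapsed(playing);
	}
}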

That said, being reasonable isn't enough for virtio-sound to do it and
deviate from other virtio devices. Why does the Linux driver behave
like this? Is it somehow constrained by the kernel->userspace APIs?

Paolo


virtio-sound linux driver conformance to spec

2023-09-13 Thread Matias Ezequiel Vara Larsen
Hello,

This email is to report a behavior of the Linux virtio-sound driver that
does not appear to conform to the VirtIO specification. The kernel
driver moves buffers from the used ring to the available ring
without knowing whether the content has been updated from user space. If
the device picks up buffers from the available ring just after it is
notified, it can end up reading stale content. This problem can be worked
around by waiting a period of time after the device dequeues a buffer from
the available ring. The driver should not be allowed to change the content
of a buffer that is in the available ring; once buffers are enqueued in the
available ring, the device may consume them immediately.

1. Is the current driver implementation following the spec?
2. Shall the spec be extended to support such a use-case?

Thanks, Matias



Re: [BUG] virtio-fs: Corruption when running binaries from virtiofsd-backed fs

2023-09-13 Thread Erik Schilling
CCing a few more people as suggested by stefanha on #qemu.

On Wed Sep 13, 2023 at 8:18 AM CEST, Erik Schilling wrote:
> On Fri Sep 1, 2023 at 12:37 PM CEST, Erik Schilling wrote:
> > On Wed Aug 30, 2023 at 10:20 AM CEST, Erik Schilling wrote:
> > > Hi all!
> > >
> > > Some days ago I posted to #virtiofs:matrix.org, describing that I am
> > > observing what looks like a corruption when executing programs from a
> > > virtiofs-based filesystem.
> > >
> > > Over the last few days I spent more time drilling into the problem.
> > > This is an attempt at summarizing my findings in order to see what other
> > > people think about this.
> > >
> > > When running binaries mounted from virtiofs they may either: fail with a
> > > segfault, fail with badaddr, get stuck or - sometimes - succeed.
> > >
> > > Environment:
> > >   Host: Fedora 38 running 6.4.11-200.fc38.x86_64
> > >   Guest: Yocto-based image: 6.4.9-yocto-standard, aarch64
> > >   virtiofsd: latest main + some debug prints [1]
> > >   QEMU: built from recent git [2]
> > >
> > > virtiofsd invocation:
> > >   RUST_LOG="debug" ./virtiofsd --seccomp=none --sandbox=none \
> > > --socket-path "fs.sock0" --shared-dir $PWD/share-dir/ --cache=never
> > >
> > > QEMU invocation:
> > >   ~/projects/qemu/build/qemu-system-aarch64 -kernel Image -machine virt \
> > > -cpu cortex-a57 \
> > > -serial mon:stdio \
> > > -device virtio-net-pci,netdev=net0 \
> > > -netdev user,id=net0,hostfwd=tcp::2223-:22 \
> > > -display none -m 2048 -smp 4 \
> > > -object memory-backend-memfd,id=mem,size=2048M,share=on \
> > > -numa node,memdev=mem \
> > > -hda trs-overlay-guest.qcow2 \
> > > -chardev socket,id=char0,path="fs.sock0" \
> > > -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=/dev/root 
> > > \
> > > -append 'root=/dev/vda2 ro log_buf_len=8M'
> > >
> > > I figured that launching virtiofsd with --cache=always masks the
> > > problem. Therefore, I set --cache=never, but I think I observed no
> > > difference compared to the default setting (auto).
> > >
> > > Adding logging to virtiofsd and kernel _felt_ like it made the problem
> > > harder to reproduce - leaving me with the impression that some race is
> > > happening somewhere.
> > >
> > > Trying to rule out that virtiofsd is returning corrupted data, I added
> > > some logging and hashsum calculation hacks to it [1]. The hashes check
> > > out across multiple accesses and the order and kind of queued messages
> > > is exactly the same in both the error case and crash case. fio was also
> > > unable to find any errors with a naive job description [3].
> > >
> > > Next, I tried to capture info on the guest side. This became a bit
> > > tricky since the crashes became pretty rare once I followed a fixed
> > > pattern of starting log capture, running perf and trying to reproduce
> > > the problem. Ultimately, I had the most consistent results with
> > > immediately running a program twice:
> > >
> > >   /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /; \
> > > /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /
> > >
> > >   (/mnt being the virtiofs mount)
> > >
> > > For collecting logs, I made a hack to the guest kernel in order to dump
> > > the page content after receiving the virtiofs responses [4]. Reproducing
> > > the problem with this, leaves me with logs that seem to suggest that
> > > virtiofsd is returning identical content, but the guest kernel seems to
> > > receive differing pages:
> > >
> > > good-kernel [5]:
> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x312 nodeid 
> > > 0x1 in.len 56 out.len 104
> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 1 unique 0x314 nodeid 
> > > 0x1 in.len 53 out.len 128
> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x316 nodeid 
> > > 0x29 in.len 56 out.len 104
> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 14 unique 0x318 
> > > nodeid 0x29 in.len 48 out.len 16
> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> > >   kernel: virtio_fs_wake_pending_and_unlock: opcode 15 unique 0x31a 
> > > nodeid 0x29 in.len 80 out.len 832
> > >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> > >   kernel: virtio_fs: page: 6996d520
> > >   kernel: virtio_fs: to: de590c14
> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > > 
> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > > 
> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > > 
> > >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > > 
> > >   [...]
> > >
> > > bad-kernel [6]:
> > >   kernel: 

Re: [PATCH v2] vhost: Allow null msg.size on VHOST_IOTLB_INVALIDATE

2023-09-13 Thread Eric Auger
Hi,

On 8/24/23 11:37, Eric Auger wrote:
> Commit e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb
> entries") forbade vhost iotlb messages with null size to prevent entries
> with size = start = 0 and last = ULONG_MAX from ending up in the iotlb.
>
> Then commit 95932ab2ea07 ("vhost: allow batching hint without size")
> only applied the check to the VHOST_IOTLB_UPDATE and VHOST_IOTLB_INVALIDATE
> message types, to fix a regression observed with the batching hint.
>
> Still, the introduction of that check introduced a regression for
> some users attempting to invalidate the whole ULONG_MAX range by
> setting the size to 0. This is the case with qemu/smmuv3/vhost
> integration which does not work anymore. It looks safe to partially
> revert the original commit and allow VHOST_IOTLB_INVALIDATE messages
> with null size. vhost_iotlb_del_range() will compute a correct end
> iova. Same for vhost_vdpa_iotlb_unmap().
>
> Signed-off-by: Eric Auger 
> Fixes: e2ae38cf3d91 ("vhost: fix hung thread due to erroneous iotlb entries")
> Cc: sta...@vger.kernel.org # v5.17+
> Acked-by: Jason Wang 

Gentle ping for this fix? Any other comments besides Jason's A-b?

Best Regards

Eric
>
> ---
> v1 -> v2:
> - Added Cc stable and Jason's Acked-by
> ---
>  drivers/vhost/vhost.c | 4 +---
>  1 file changed, 1 insertion(+), 3 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index c71d573f1c94..e0c181ad17e3 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1458,9 +1458,7 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>   goto done;
>   }
>  
> - if ((msg.type == VHOST_IOTLB_UPDATE ||
> -  msg.type == VHOST_IOTLB_INVALIDATE) &&
> -  msg.size == 0) {
> + if (msg.type == VHOST_IOTLB_UPDATE && msg.size == 0) {
>   ret = -EINVAL;
>   goto done;
>   }
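
For what it's worth, the full-range invalidation mentioned in the commit
message relies on ordinary unsigned wrap-around when computing the last
address; a standalone sketch of that arithmetic (not the actual vhost code,
whose exact expression may differ):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t iova = 0, size = 0;
	/* With size == 0, "last" wraps to UINT64_MAX, i.e. the whole
	 * address space [0, ULONG_MAX] gets invalidated. */
	uint64_t last = iova + size - 1;

	printf("last = 0x%llx\n", (unsigned long long)last);
	return 0;
}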



Re: [PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Juergen Gross via Virtualization

On 13.09.23 15:26, Steven Rostedt wrote:

On Wed, 13 Sep 2023 13:38:27 +0200
Juergen Gross  wrote:


diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
index 44a3f565264d..0577f0cdd231 100644
--- a/include/trace/events/xen.h
+++ b/include/trace/events/xen.h
@@ -6,26 +6,26 @@
  #define _TRACE_XEN_H
  
  #include 

-#include 
+#include 
  #include 
  
  struct multicall_entry;
  
  /* Multicalls */

  DECLARE_EVENT_CLASS(xen_mc__batch,
-   TP_PROTO(enum paravirt_lazy_mode mode),
+   TP_PROTO(enum xen_lazy_mode mode),
TP_ARGS(mode),
TP_STRUCT__entry(
-   __field(enum paravirt_lazy_mode, mode)
+   __field(enum xen_lazy_mode, mode)
),
TP_fast_assign(__entry->mode = mode),
TP_printk("start batch LAZY_%s",
- (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" :
- (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE")
+ (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
+ (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")


There are helper functions that make the above easier to implement, as well
as exporting the symbols so that user space can parse this better:

TRACE_DEFINE_ENUM(XEN_LAZY_NONE);
TRACE_DEFINE_ENUM(XEN_LAZY_MMU);
TRACE_DEFINE_ENUM(XEN_LAZY_CPU);

[..]

TP_printk("start batch LAZY_%s",
  __print_symbolic(mode,
   { XEN_LAZY_NONE, "NONE" },
   { XEN_LAZY_MMU,  "MMU"  },
   { XEN_LAZY_CPU,  "CPU"  }))

Then user space parsers that read the raw data can convert these events
into something humans can read.


Thanks. I'll add that to another patch I'm just writing for cleaning up
include/trace/events/xen.h (some defined trace events are no longer used).


Juergen



Re: [PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Steven Rostedt
On Wed, 13 Sep 2023 13:38:27 +0200
Juergen Gross  wrote:

> diff --git a/include/trace/events/xen.h b/include/trace/events/xen.h
> index 44a3f565264d..0577f0cdd231 100644
> --- a/include/trace/events/xen.h
> +++ b/include/trace/events/xen.h
> @@ -6,26 +6,26 @@
>  #define _TRACE_XEN_H
>  
>  #include 
> -#include 
> +#include 
>  #include 
>  
>  struct multicall_entry;
>  
>  /* Multicalls */
>  DECLARE_EVENT_CLASS(xen_mc__batch,
> - TP_PROTO(enum paravirt_lazy_mode mode),
> + TP_PROTO(enum xen_lazy_mode mode),
>   TP_ARGS(mode),
>   TP_STRUCT__entry(
> - __field(enum paravirt_lazy_mode, mode)
> + __field(enum xen_lazy_mode, mode)
>   ),
>   TP_fast_assign(__entry->mode = mode),
>   TP_printk("start batch LAZY_%s",
> -   (__entry->mode == PARAVIRT_LAZY_MMU) ? "MMU" :
> -   (__entry->mode == PARAVIRT_LAZY_CPU) ? "CPU" : "NONE")
> +   (__entry->mode == XEN_LAZY_MMU) ? "MMU" :
> +   (__entry->mode == XEN_LAZY_CPU) ? "CPU" : "NONE")

There are helper functions that make the above easier to implement, as well
as exporting the symbols so that user space can parse this better:

TRACE_DEFINE_ENUM(XEN_LAZY_NONE);
TRACE_DEFINE_ENUM(XEN_LAZY_MMU);
TRACE_DEFINE_ENUM(XEN_LAZY_CPU);

[..]

TP_printk("start batch LAZY_%s",
  __print_symbolic(mode,
   { XEN_LAZY_NONE, "NONE" },
   { XEN_LAZY_MMU,  "MMU"  },
   { XEN_LAZY_CPU,  "CPU"  }))

Then user space parsers that read the raw data can convert these events
into something humans can read.

-- Steve

>   );
>  #define DEFINE_XEN_MC_BATCH(name)\
>   DEFINE_EVENT(xen_mc__batch, name,   \
> - TP_PROTO(enum paravirt_lazy_mode mode), \
> + TP_PROTO(enum xen_lazy_mode mode),  \
>TP_ARGS(mode))
>  
>  DEFINE_XEN_MC_BATCH(xen_mc_batch);


Re: [PATCH V11 04/17] locking/qspinlock: Improve xchg_tail for number of cpus >= 16k

2023-09-13 Thread Waiman Long

On 9/13/23 08:52, Guo Ren wrote:

On Wed, Sep 13, 2023 at 4:55 PM Leonardo Bras  wrote:

On Tue, Sep 12, 2023 at 09:10:08AM +0800, Guo Ren wrote:

On Mon, Sep 11, 2023 at 9:03 PM Waiman Long  wrote:

On 9/10/23 23:09, Guo Ren wrote:

On Mon, Sep 11, 2023 at 10:35 AM Waiman Long  wrote:

On 9/10/23 04:28, guo...@kernel.org wrote:

From: Guo Ren 

The target of xchg_tail is to write the tail to the lock value, so
adding prefetchw could help the next cmpxchg step, which may
decrease the cmpxchg retry loops of xchg_tail. Some processors may
utilize this feature to give a forward-progress guarantee; e.g., RISC-V
XuanTie processors block the snoop channel & IRQs for several
cycles after a prefetch.w instruction (from the Zicbop extension)
retires, which guarantees that the next cmpxchg succeeds.

Signed-off-by: Guo Ren 
Signed-off-by: Guo Ren 
---
kernel/locking/qspinlock.c | 5 -
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/qspinlock.c b/kernel/locking/qspinlock.c
index d3f99060b60f..96b54e2ade86 100644
--- a/kernel/locking/qspinlock.c
+++ b/kernel/locking/qspinlock.c
@@ -223,7 +223,10 @@ static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
 */
static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
{
- u32 old, new, val = atomic_read(&lock->val);
+ u32 old, new, val;
+
+ prefetchw(&lock->val);
+ val = atomic_read(&lock->val);

for (;;) {
new = (val & _Q_LOCKED_PENDING_MASK) | tail;

That looks a bit weird. You prefetch and then immediately read it. How
much performance gain do you get from this change alone?

Maybe you can define an arch-specific primitive that defaults back to
atomic_read() if not defined.
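
A minimal sketch of that suggestion, with a made-up helper name (this is
not an existing kernel API):

/*
 * Generic, overridable helper that falls back to a plain atomic_read();
 * an architecture that benefits from write-prefetching could override it.
 */
#ifndef arch_atomic_read_for_xchg
#define arch_atomic_read_for_xchg(v)	atomic_read(v)
#endif

/* A possible arch override, where prefetchw() maps to something useful: */
/* #define arch_atomic_read_for_xchg(v)	({ prefetchw(v); atomic_read(v); }) */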

Thx for the reply. This is a generic optimization point I would like
to talk about with you.

First, prefetchw() brings the cacheline into the exclusive state in
preparation for the next cmpxchg loop, which writes the idx_tail part of
arch_spin_lock. atomic_read() only brings the cacheline into the shared
state, which gives no guarantee for the next cmpxchg loop. A
micro-architecture can utilize prefetchw() to provide a strong forward
progress guarantee for xchg_tail; e.g., the T-HEAD XuanTie processor
holds the exclusive cacheline state until the next cmpxchg write
succeeds.

In the end, let's go back to the principle: xchg_tail is an atomic swap
operation that eventually performs a write, so issuing a prefetchw() at
the beginning is acceptable for all architectures.


I did realize afterward that prefetchw gets the cacheline in the
exclusive state. I would suggest you mention that in your commit log, as
well as adding a comment about its purpose in the code.

Okay, I would do that in v12, thx.

I would suggest adding a snippet from the ISA Extension doc:

"A prefetch.w instruction indicates to hardware that the cache block whose
effective address is the sum of the base address specified in rs1 and the
sign-extended offset encoded in imm[11:0], where imm[4:0] equals 0b0,
is likely to be accessed by a data write (i.e. store) in the near future."

Good point, thx.


qspinlock is generic code. I suppose this is for the RISC-V architecture.
You can mention that in the commit log as an example, but I would prefer a
more generic comment, especially in the code.
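
Something along those lines might be (wording invented here, just a sketch
of a generic comment around the existing prefetchw() call):

	/*
	 * Prefetch the lock word for write: this pulls the cacheline into
	 * the exclusive state up front, so the cmpxchg loop below is more
	 * likely to succeed quickly. Some microarchitectures additionally
	 * use it to provide a forward-progress guarantee for xchg_tail().
	 */
	prefetchw(&lock->val);
	val = atomic_read(&lock->val);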


Cheers,
Longman


[PATCH 0/3] xen: cleanup and fix lazy mode handling

2023-09-13 Thread Juergen Gross via Virtualization
This small series is cleaning up Xen lazy mode handling by removing
unused stuff and moving purely Xen-specific code away from general
kernel code.

The last patch is fixing a regression which was introduced in the
6.6 merge window.

Juergen Gross (3):
  arm/xen: remove lazy mode related definitions
  x86/xen: move paravirt lazy code
  x86/xen: allow nesting of same lazy mode

 arch/x86/include/asm/paravirt_types.h | 15 --
 arch/x86/include/asm/xen/hypervisor.h | 37 +++
 arch/x86/kernel/paravirt.c| 67 ---
 arch/x86/xen/enlighten_pv.c   | 40 +---
 arch/x86/xen/mmu_pv.c | 55 ++
 arch/x86/xen/multicalls.h |  4 +-
 include/trace/events/xen.h| 12 ++---
 include/xen/arm/hypervisor.h  | 12 -
 8 files changed, 114 insertions(+), 128 deletions(-)

-- 
2.35.3



[PATCH 2/3] x86/xen: move paravirt lazy code

2023-09-13 Thread Juergen Gross via Virtualization
Only Xen is using the paravirt lazy mode code, so it can be moved to
Xen specific sources.

This allows making some of the functions static, or merging them into
their only call sites.

While at it do a rename from "paravirt" to "xen" for all moved
specifiers.

No functional change.

Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/paravirt_types.h | 15 --
 arch/x86/include/asm/xen/hypervisor.h | 26 +++
 arch/x86/kernel/paravirt.c| 67 ---
 arch/x86/xen/enlighten_pv.c   | 39 +---
 arch/x86/xen/mmu_pv.c | 55 ++
 arch/x86/xen/multicalls.h |  4 +-
 include/trace/events/xen.h| 12 ++---
 7 files changed, 102 insertions(+), 116 deletions(-)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 4acbcc29..772d03487520 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -9,13 +9,6 @@ struct paravirt_patch_site {
u8 type;/* type of this instruction */
u8 len; /* length of original instruction */
 };
-
-/* Lazy mode for batching updates / context switch */
-enum paravirt_lazy_mode {
-   PARAVIRT_LAZY_NONE,
-   PARAVIRT_LAZY_MMU,
-   PARAVIRT_LAZY_CPU,
-};
 #endif
 
 #ifdef CONFIG_PARAVIRT
@@ -549,14 +542,6 @@ int paravirt_disable_iospace(void);
__PVOP_VCALL(op, PVOP_CALL_ARG1(arg1), PVOP_CALL_ARG2(arg2),\
 PVOP_CALL_ARG3(arg3), PVOP_CALL_ARG4(arg4))
 
-enum paravirt_lazy_mode paravirt_get_lazy_mode(void);
-void paravirt_start_context_switch(struct task_struct *prev);
-void paravirt_end_context_switch(struct task_struct *next);
-
-void paravirt_enter_lazy_mmu(void);
-void paravirt_leave_lazy_mmu(void);
-void paravirt_flush_lazy_mmu(void);
-
 void _paravirt_nop(void);
 void paravirt_BUG(void);
 unsigned long paravirt_ret0(void);
diff --git a/arch/x86/include/asm/xen/hypervisor.h b/arch/x86/include/asm/xen/hypervisor.h
index 5fc35f889cd1..ed05ce3df5c7 100644
--- a/arch/x86/include/asm/xen/hypervisor.h
+++ b/arch/x86/include/asm/xen/hypervisor.h
@@ -36,6 +36,7 @@
 extern struct shared_info *HYPERVISOR_shared_info;
 extern struct start_info *xen_start_info;
 
+#include 
 #include 
 
 #define XEN_SIGNATURE "XenVMMXenVMM"
@@ -63,4 +64,29 @@ void __init xen_pvh_init(struct boot_params *boot_params);
 void __init mem_map_via_hcall(struct boot_params *boot_params_p);
 #endif
 
+/* Lazy mode for batching updates / context switch */
+enum xen_lazy_mode {
+   XEN_LAZY_NONE,
+   XEN_LAZY_MMU,
+   XEN_LAZY_CPU,
+};
+
+DECLARE_PER_CPU(enum xen_lazy_mode, xen_lazy_mode);
+
+static inline void enter_lazy(enum xen_lazy_mode mode)
+{
+   BUG_ON(this_cpu_read(xen_lazy_mode) != XEN_LAZY_NONE);
+
+   this_cpu_write(xen_lazy_mode, mode);
+}
+
+static inline void leave_lazy(enum xen_lazy_mode mode)
+{
+   BUG_ON(this_cpu_read(xen_lazy_mode) != mode);
+
+   this_cpu_write(xen_lazy_mode, XEN_LAZY_NONE);
+}
+
+enum xen_lazy_mode xen_get_lazy_mode(void);
+
 #endif /* _ASM_X86_XEN_HYPERVISOR_H */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 975f98d5eee5..97f1436c1a20 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -143,66 +143,7 @@ int paravirt_disable_iospace(void)
 return request_resource(&ioport_resource, &reserve_ioports);
 }
 
-static DEFINE_PER_CPU(enum paravirt_lazy_mode, paravirt_lazy_mode) = PARAVIRT_LAZY_NONE;
-
-static inline void enter_lazy(enum paravirt_lazy_mode mode)
-{
-   BUG_ON(this_cpu_read(paravirt_lazy_mode) != PARAVIRT_LAZY_NONE);
-
-   this_cpu_write(paravirt_lazy_mode, mode);
-}
-
-static void leave_lazy(enum paravirt_lazy_mode mode)
-{
-   BUG_ON(this_cpu_read(paravirt_lazy_mode) != mode);
-
-   this_cpu_write(paravirt_lazy_mode, PARAVIRT_LAZY_NONE);
-}
-
-void paravirt_enter_lazy_mmu(void)
-{
-   enter_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_leave_lazy_mmu(void)
-{
-   leave_lazy(PARAVIRT_LAZY_MMU);
-}
-
-void paravirt_flush_lazy_mmu(void)
-{
-   preempt_disable();
-
-   if (paravirt_get_lazy_mode() == PARAVIRT_LAZY_MMU) {
-   arch_leave_lazy_mmu_mode();
-   arch_enter_lazy_mmu_mode();
-   }
-
-   preempt_enable();
-}
-
 #ifdef CONFIG_PARAVIRT_XXL
-void paravirt_start_context_switch(struct task_struct *prev)
-{
-   BUG_ON(preemptible());
-
-   if (this_cpu_read(paravirt_lazy_mode) == PARAVIRT_LAZY_MMU) {
-   arch_leave_lazy_mmu_mode();
-   set_ti_thread_flag(task_thread_info(prev), TIF_LAZY_MMU_UPDATES);
-   }
-   enter_lazy(PARAVIRT_LAZY_CPU);
-}
-
-void paravirt_end_context_switch(struct task_struct *next)
-{
-   BUG_ON(preemptible());
-
-   leave_lazy(PARAVIRT_LAZY_CPU);
-
-   if (test_and_clear_ti_thread_flag(task_thread_info(next), TIF_LAZY_MMU_UPDATES))
-   arch_enter_lazy_mmu_mode();
-}

Re: [BUG] virtio-fs: Corruption when running binaries from virtiofsd-backed fs

2023-09-13 Thread Erik Schilling
On Fri Sep 1, 2023 at 12:37 PM CEST, Erik Schilling wrote:
> On Wed Aug 30, 2023 at 10:20 AM CEST, Erik Schilling wrote:
> > Hi all!
> >
> > Some days ago I posted to #virtiofs:matrix.org, describing that I am
> > observing what looks like a corruption when executing programs from a
> > virtiofs-based filesystem.
> >
> > Over the last few days I spent more time drilling into the problem.
> > This is an attempt at summarizing my findings in order to see what other
> > people think about this.
> >
> > When running binaries mounted from virtiofs they may either: fail with a
> > segfault, fail with badaddr, get stuck or - sometimes - succeed.
> >
> > Environment:
> >   Host: Fedora 38 running 6.4.11-200.fc38.x86_64
> >   Guest: Yocto-based image: 6.4.9-yocto-standard, aarch64
> >   virtiofsd: latest main + some debug prints [1]
> >   QEMU: built from recent git [2]
> >
> > virtiofsd invocation:
> >   RUST_LOG="debug" ./virtiofsd --seccomp=none --sandbox=none \
> > --socket-path "fs.sock0" --shared-dir $PWD/share-dir/ --cache=never
> >
> > QEMU invocation:
> >   ~/projects/qemu/build/qemu-system-aarch64 -kernel Image -machine virt \
> > -cpu cortex-a57 \
> > -serial mon:stdio \
> > -device virtio-net-pci,netdev=net0 \
> > -netdev user,id=net0,hostfwd=tcp::2223-:22 \
> > -display none -m 2048 -smp 4 \
> > -object memory-backend-memfd,id=mem,size=2048M,share=on \
> > -numa node,memdev=mem \
> > -hda trs-overlay-guest.qcow2 \
> > -chardev socket,id=char0,path="fs.sock0" \
> > -device vhost-user-fs-pci,queue-size=1024,chardev=char0,tag=/dev/root \
> > -append 'root=/dev/vda2 ro log_buf_len=8M'
> >
> > I figured that launching virtiofsd with --cache=always masks the
> > problem. Therefore, I set --cache=never, but I think I observed no
> > difference compared to the default setting (auto).
> >
> > Adding logging to virtiofsd and kernel _felt_ like it made the problem
> > harder to reproduce - leaving me with the impression that some race is
> > happening somewhere.
> >
> > Trying to rule out that virtiofsd is returning corrupted data, I added
> > some logging and hashsum calculation hacks to it [1]. The hashes check
> > out across multiple accesses and the order and kind of queued messages
> > is exactly the same in both the error case and crash case. fio was also
> > unable to find any errors with a naive job description [3].
> >
> > Next, I tried to capture info on the guest side. This became a bit
> > tricky since the crashes became pretty rare once I followed a fixed
> > pattern of starting log capture, running perf and trying to reproduce
> > the problem. Ultimately, I had the most consistent results with
> > immediately running a program twice:
> >
> >   /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /; \
> > /mnt/ld-linux-aarch64.so.1 /mnt/ls.coreutils /
> >
> >   (/mnt being the virtiofs mount)
> >
> > For collecting logs, I made a hack to the guest kernel in order to dump
> > the page content after receiving the virtiofs responses [4]. Reproducing
> > the problem with this, leaves me with logs that seem to suggest that
> > virtiofsd is returning identical content, but the guest kernel seems to
> > receive differing pages:
> >
> > good-kernel [5]:
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x312 nodeid 
> > 0x1 in.len 56 out.len 104
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 1 unique 0x314 nodeid 
> > 0x1 in.len 53 out.len 128
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x316 nodeid 
> > 0x29 in.len 56 out.len 104
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 14 unique 0x318 nodeid 
> > 0x29 in.len 48 out.len 16
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 15 unique 0x31a nodeid 
> > 0x29 in.len 80 out.len 832
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs: page: 6996d520
> >   kernel: virtio_fs: to: de590c14
> >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > 
> >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > 
> >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > 
> >   kernel: virtio_fs rsp:00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  
> > 
> >   [...]
> >
> > bad-kernel [6]:
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 3 unique 0x162 nodeid 
> > 0x1 in.len 56 out.len 104
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: virtio_fs_wake_pending_and_unlock: opcode 1 unique 0x164 nodeid 
> > 0x1 in.len 53 out.len 128
> >   kernel: virtiofs virtio1: virtio_fs_vq_done requests.0
> >   kernel: