Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic
On 29.05.24 14:46, Roger Pau Monné wrote: On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote: On 28.05.24 13:22, Roger Pau Monné wrote: Hello, When the stop_machine_run() call in cpu_down() fails and calls the CPU notifier CPU_DOWN_FAILED hook the following assert triggers in the scheduling code: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1 [ Xen-4.19-unstable x86_64 debug=y Tainted: C] CPU:0 RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 00010093 CONTEXT: hypervisor rax: rbx: 83202ecc2f80 rcx: 83202f3e64c0 rdx: 0001 rsi: 0002 rdi: 83202ecc2f88 rbp: 83203d58 rsp: 83203d30 r8: r9: 83202f3e6e01 r10: r11: 0f0f0f0f0f0f0f0f r12: 83202ecb80b0 r13: 0001 r14: 0282 r15: 83202ecbbf00 cr0: 8005003b cr4: 007526e0 cr3: 574c2000 cr2: fsb: gsb: gss: ds: es: fs: gs: ss: cs: e008 Xen code around (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e Xen stack trace from rsp=83203d30: 83202d74d100 0001 82d0404c4430 0006 83203d78 82d040257454 0001 83203da8 82d04021f303 82d0404c4628 82d0404c4620 82d0404c4430 0006 83203df0 82d04022bc4c 83203e18 0001 0001 fff0 82d0405e6500 83203e08 82d040204fd5 0001 83203e30 82d0402054f0 82d0404c5860 0001 83202ec75000 83203e48 82d040348c25 83202d74d0d0 83203e68 82d0402071aa 83202ec751d0 82d0405ce210 83203e80 82d0402343c9 82d0405ce200 83203eb0 82d040234631 7fff 82d0405d5080 82d0405ce210 83203ee8 82d040321411 82d040321399 83202f3a9000 001d91a6fa2d 82d0405e6500 83203de0 82d040324391 Xen call trace: [] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [] F free_cpu_rm_data+0x41/0x58 [] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [] F notifier_call_chain+0x6c/0x96 [] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 [] F cpu_down+0xa7/0x143 [] F cpu_down_helper+0x11/0x27 [] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd [] F common/tasklet.c#do_tasklet_work+0x76/0xaf [] F do_tasklet+0x5b/0x8d [] F arch/x86/domain.c#idle_loop+0x78/0xe6 [] F continue_running+0x5b/0x5d Panic on CPU 0: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/credit2.c:4111 The issue seems to be that since the CPU hasn't been removed, it's still part of prv->initialized and the assert in csched2_free_pdata() called as part of free_cpu_rm_data() triggers. It's easy to reproduce by substituting the stop_machine_run() call in cpu_down() with an error. Could you please give the attached patch a try? I still get the following assert: Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all the time. :-( Next try. Juergen From 8b707b0e4d68b1636f1c5ff0e2ef54bea5d9921a Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Wed, 29 May 2024 13:24:36 +0200 Subject: [PATCH] xen/sched: fix error path of cpu removal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in the credit2 scheduler might trigger: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1 [ Xen-4.19-unstable x86_64 debug=y Tainted: C] CPU:0 RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 00010093 CONTEXT: hypervisor rax: rbx: 83202ecc2f80 rcx: 83202f3e64c0 rdx: 0001 rsi: 0002 rdi: 83202ecc2f88 rbp: 83203d58 rsp: 83203d30 r8: r9: 83202f3e6e01 r10:
Re: [PATCH] xen/xenbus: handle potential dangling pointer issue in xen_pcibk_xenbus_probe
On 29.05.24 14:22, ysk...@gmail.com wrote: From: Yunseong Kim If 'xen_pcibk_init_devices()' fails. This ensures that 'pdev->xdev' does not point to 'xdev' when 'pdev' is freed. Signed-off-by: Yunseong Kim --- drivers/xen/xen-pciback/xenbus.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/xen/xen-pciback/xenbus.c b/drivers/xen/xen-pciback/xenbus.c index b11e401f1b1e..348d6803b8c0 100644 --- a/drivers/xen/xen-pciback/xenbus.c +++ b/drivers/xen/xen-pciback/xenbus.c @@ -54,6 +54,7 @@ static struct xen_pcibk_device *alloc_pdev(struct xenbus_device *xdev) INIT_WORK(>op_work, xen_pcibk_do_op); if (xen_pcibk_init_devices(pdev)) { + pdev->xdev = NULL; kfree(pdev); pdev = NULL; } NAK. This doesn't make any sense, as pdev is freed. Juergen
Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic
On 28.05.24 13:22, Roger Pau Monné wrote: Hello, When the stop_machine_run() call in cpu_down() fails and calls the CPU notifier CPU_DOWN_FAILED hook the following assert triggers in the scheduling code: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1 [ Xen-4.19-unstable x86_64 debug=y Tainted: C] CPU:0 RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 00010093 CONTEXT: hypervisor rax: rbx: 83202ecc2f80 rcx: 83202f3e64c0 rdx: 0001 rsi: 0002 rdi: 83202ecc2f88 rbp: 83203d58 rsp: 83203d30 r8: r9: 83202f3e6e01 r10: r11: 0f0f0f0f0f0f0f0f r12: 83202ecb80b0 r13: 0001 r14: 0282 r15: 83202ecbbf00 cr0: 8005003b cr4: 007526e0 cr3: 574c2000 cr2: fsb: gsb: gss: ds: es: fs: gs: ss: cs: e008 Xen code around (common/sched/credit2.c#csched2_free_pdata+0xc8/0x177): fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e Xen stack trace from rsp=83203d30: 83202d74d100 0001 82d0404c4430 0006 83203d78 82d040257454 0001 83203da8 82d04021f303 82d0404c4628 82d0404c4620 82d0404c4430 0006 83203df0 82d04022bc4c 83203e18 0001 0001 fff0 82d0405e6500 83203e08 82d040204fd5 0001 83203e30 82d0402054f0 82d0404c5860 0001 83202ec75000 83203e48 82d040348c25 83202d74d0d0 83203e68 82d0402071aa 83202ec751d0 82d0405ce210 83203e80 82d0402343c9 82d0405ce200 83203eb0 82d040234631 7fff 82d0405d5080 82d0405ce210 83203ee8 82d040321411 82d040321399 83202f3a9000 001d91a6fa2d 82d0405e6500 83203de0 82d040324391 Xen call trace: [] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 [] F free_cpu_rm_data+0x41/0x58 [] F common/sched/cpupool.c#cpu_callback+0xfb/0x466 [] F notifier_call_chain+0x6c/0x96 [] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36 [] F cpu_down+0xa7/0x143 [] F cpu_down_helper+0x11/0x27 [] F common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd [] F common/tasklet.c#do_tasklet_work+0x76/0xaf [] F do_tasklet+0x5b/0x8d [] F arch/x86/domain.c#idle_loop+0x78/0xe6 [] F continue_running+0x5b/0x5d Panic on CPU 0: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/credit2.c:4111 The issue seems to be that since the CPU hasn't been removed, it's still part of prv->initialized and the assert in csched2_free_pdata() called as part of free_cpu_rm_data() triggers. It's easy to reproduce by substituting the stop_machine_run() call in cpu_down() with an error. Could you please give the attached patch a try? Only compile tested, though... Juergen From 5925f15ace8be186c9f41c0cda2d4a675b0f7bb9 Mon Sep 17 00:00:00 2001 From: Juergen Gross Date: Wed, 29 May 2024 13:24:36 +0200 Subject: [PATCH] xen/sched: fix error path of cpu removal MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in the credit2 scheduler might trigger: Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1 [ Xen-4.19-unstable x86_64 debug=y Tainted: C] CPU:0 RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177 RFLAGS: 00010093 CONTEXT: hypervisor rax: rbx: 83202ecc2f80 rcx: 83202f3e64c0 rdx: 0001 rsi: 0002 rdi: 83202ecc2f88 rbp: 83203d58 rsp: 83203d30 r8: r9: 83202f3e6e01 r10: r11: 0f0f0f0f0f0f0f0f r12: 83202ecb80b0 r13: 0001 r14: 0282 r15: 83202ecbbf00 cr0: 8005003b cr4: 007526e0 cr3: 574c2000 cr2: fsb: gsb:
Re: [PATCH 3/4] x86/pci/xen: Fix PCIBIOS_* return code handling
On 27.05.24 14:55, Ilpo Järvinen wrote: xen_pcifront_enable_irq() uses pci_read_config_byte() that returns PCIBIOS_* codes. The error handling, however, assumes the codes are normal errnos because it checks for < 0. xen_pcifront_enable_irq() also returns the PCIBIOS_* code back to the caller but the function is used as the (*pcibios_enable_irq) function which should return normal errnos. Convert the error check to plain non-zero check which works for PCIBIOS_* return codes and convert the PCIBIOS_* return code using pcibios_err_to_errno() into normal errno before returning it. Fixes: 3f2a230caf21 ("xen: handled remapped IRQs when enabling a pcifront PCI device.") Signed-off-by: Ilpo Järvinen Cc: sta...@vger.kernel.org Reviewed-by: Juergen Gross Juergen
Re: [PATCH v3 2/4] xen/arm: Alloc XenStore page for Dom0less DomUs from hypervisor
On 25.05.24 01:19, Stefano Stabellini wrote: On Fri, 24 May 2024, Jürgen Groß wrote: On 24.05.24 15:58, Julien Grall wrote: Hi Henry, + Juergen as the Xenstore maintainers. I'd like his opinion on the approach. The documentation of the new logic is in: https://lore.kernel.org/xen-devel/20240517032156.1490515-5-xin.wa...@amd.com/ FWIW I am happy in principle with the logic (this is what we discussed on the call last week). Some comments below. I'm not against this logic, but I'm wondering why it needs to be so complicated. Actually the reason I like it is that in my view, this is the simplest approach. You allocate a domain, you also allocate the xenstore page together with it. Initially the xenstore connection has an "uninitialized" state, as it should be. That's it. At some point, when xenstored is ready, the state changes to CONNECTED. Can't the domU itself allocate the Xenstore page from its RAM pages, write the PFN into the Xenstore grant tab entry, and then make it public via setting HVM_PARAM_STORE_PFN? This is not simpler in my view Okay, fine with me. I had the impression that violating the 1:1 mapping of the domain would add complexity, but if you are fine with that I don't mind your approach. Juergen
Re: [PATCH v3 2/4] xen/arm: Alloc XenStore page for Dom0less DomUs from hypervisor
On 24.05.24 15:58, Julien Grall wrote: Hi Henry, + Juergen as the Xenstore maintainers. I'd like his opinion on the approach. The documentation of the new logic is in: https://lore.kernel.org/xen-devel/20240517032156.1490515-5-xin.wa...@amd.com/ FWIW I am happy in principle with the logic (this is what we discussed on the call last week). Some comments below. I'm not against this logic, but I'm wondering why it needs to be so complicated. Can't the domU itself allocate the Xenstore page from its RAM pages, write the PFN into the Xenstore grant tab entry, and then make it public via setting HVM_PARAM_STORE_PFN? The init-dom0less application could then check HVM_PARAM_STORE_PFN being set and call XS_introduce_domain. Note that at least C-xenstored does not need the PFN of the Xenstore page, as it is just using GNTTAB_RESERVED_XENSTORE for mapping the page. Juergen
Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl
On 17.05.24 15:33, Roger Pau Monne wrote: Enabling it using an HVM param is fragile, and complicates the logic when deciding whether options that interact with altp2m can also be enabled. Leave the HVM param value for consumption by the guest, but prevent it from being set. Enabling is now done using and additional altp2m specific field in xen_domctl_createdomain. Note that albeit only currently implemented in x86, altp2m could be implemented in other architectures, hence why the field is added to xen_domctl_createdomain instead of xen_arch_domainconfig. Signed-off-by: Roger Pau Monné Reviewed-by: Juergen Gross # tools/libs/ Juergen
Re: [PATCH 3/5] x86/pvh: Set phys_base when calling xen_prepare_pvh()
On 10.04.24 21:48, Jason Andryuk wrote: phys_base needs to be set for __pa() to work in xen_pvh_init() when finding the hypercall page. Set it before calling into xen_prepare_pvh(), which calls xen_pvh_init(). Clear it afterward to avoid __startup_64() adding to it and creating an incorrect value. Signed-off-by: Jason Andryuk --- Instead of setting and clearing phys_base, a dedicated variable could be used just for the hypercall page. Having phys_base set properly may avoid further issues if the use of phys_base or __pa() grows. --- arch/x86/platform/pvh/head.S | 10 ++ 1 file changed, 10 insertions(+) diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S index bb1e582e32b1..c08d08d8cc92 100644 --- a/arch/x86/platform/pvh/head.S +++ b/arch/x86/platform/pvh/head.S @@ -125,7 +125,17 @@ SYM_CODE_START_LOCAL(pvh_start_xen) xor %edx, %edx wrmsr + /* Calculate load offset from LOAD_PHYSICAL_ADDR and store in +* phys_base. __pa() needs phys_base set to calculate the +* hypercall page in xen_pvh_init(). */ Please use the correct style for multi-line comments: /* * comment lines * comment lines */ + movq %rbp, %rbx + subq $LOAD_PHYSICAL_ADDR, %rbx + movq %rbx, phys_base(%rip) call xen_prepare_pvh + /* Clear phys_base. __startup_64 will *add* to its value, +* so reset to 0. */ Comment style again. + xor %rbx, %rbx + movq %rbx, phys_base(%rip) /* startup_64 expects boot_params in %rsi. */ lea rva(pvh_bootparams)(%ebp), %rsi With above fixed: Reviewed-by: Juergen Gross Juergen
Re: [PATCH v2 4/4] tools: Drop libsystemd as a dependency
On 16.05.24 20:58, Andrew Cooper wrote: There are no more users, and we want to disuade people from introducing new users just for sd_notify() and friends. Drop the dependency. We still want the overall --with{,out}-systemd to gate the generation of the service/unit/mount/etc files. Rerun autogen.sh, and mark the dependency as removed in the build containers. Signed-off-by: Andrew Cooper Reviewed-by: Juergen Gross Juergen
Re: [PATCH v2 4/4] tools: Drop libsystemd as a dependency
On 16.05.24 20:58, Andrew Cooper wrote: There are no more users, and we want to disuade people from introducing new users just for sd_notify() and friends. Drop the dependency. We still want the overall --with{,out}-systemd to gate the generation of the service/unit/mount/etc files. Rerun autogen.sh, and mark the dependency as removed in the build containers. Signed-off-by: Andrew Cooper --- CC: Anthony PERARD CC: Juergen Gross CC: George Dunlap CC: Jan Beulich CC: Stefano Stabellini CC: Julien Grall CC: Christian Lindig CC: Edwin Török v2: * Only strip out the library check. --- automation/build/archlinux/current.dockerfile | 1 + .../build/suse/opensuse-leap.dockerfile | 1 + .../build/suse/opensuse-tumbleweed.dockerfile | 1 + automation/build/ubuntu/focal.dockerfile | 1 + config/Tools.mk.in| 2 - m4/systemd.m4 | 9 - tools/configure | 256 -- 7 files changed, 4 insertions(+), 267 deletions(-) diff --git a/automation/build/archlinux/current.dockerfile b/automation/build/archlinux/current.dockerfile index 3e37ab5c40c1..d29f1358c2bd 100644 --- a/automation/build/archlinux/current.dockerfile +++ b/automation/build/archlinux/current.dockerfile @@ -37,6 +37,7 @@ RUN pacman -S --refresh --sysupgrade --noconfirm --noprogressbar --needed \ sdl2 \ spice \ spice-protocol \ +# systemd for Xen < 4.19 Does this work as intended? A comment between the parameters and no "\" at the end of the line? Juergen
Re: [PATCH v2 3/4] tools/{c,o}xenstored: Don't link against libsystemd
On 16.05.24 20:58, Andrew Cooper wrote: Use the local freestanding wrapper instead. Signed-off-by: Andrew Cooper Reviewed-by: Juergen Gross # tools/xenstored Juergen --- CC: Anthony PERARD CC: Juergen Gross CC: George Dunlap CC: Jan Beulich CC: Stefano Stabellini CC: Julien Grall CC: Christian Lindig CC: Edwin Török v2: * Redo almost from scratch, using the freestanding wrapper instead. --- tools/ocaml/xenstored/Makefile| 2 -- tools/ocaml/xenstored/systemd_stubs.c | 2 +- tools/xenstored/Makefile | 5 - tools/xenstored/posix.c | 4 ++-- 4 files changed, 3 insertions(+), 10 deletions(-) diff --git a/tools/ocaml/xenstored/Makefile b/tools/ocaml/xenstored/Makefile index e8aaecf2e630..a8b8bb64698e 100644 --- a/tools/ocaml/xenstored/Makefile +++ b/tools/ocaml/xenstored/Makefile @@ -4,8 +4,6 @@ include $(OCAML_TOPLEVEL)/common.make # Include configure output (config.h) CFLAGS += -include $(XEN_ROOT)/tools/config.h -CFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_CFLAGS) -LDFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_LIBS) CFLAGS += $(CFLAGS-y) CFLAGS += $(APPEND_CFLAGS) diff --git a/tools/ocaml/xenstored/systemd_stubs.c b/tools/ocaml/xenstored/systemd_stubs.c index f4c875075abe..7dbbdd35bf30 100644 --- a/tools/ocaml/xenstored/systemd_stubs.c +++ b/tools/ocaml/xenstored/systemd_stubs.c @@ -25,7 +25,7 @@ #if defined(HAVE_SYSTEMD) -#include +#include CAMLprim value ocaml_sd_notify_ready(value ignore) { diff --git a/tools/xenstored/Makefile b/tools/xenstored/Makefile index e0897ed1ba30..09adfe1d5064 100644 --- a/tools/xenstored/Makefile +++ b/tools/xenstored/Makefile @@ -9,11 +9,6 @@ xenstored: LDLIBS += $(LDLIBS_libxenctrl) xenstored: LDLIBS += -lrt xenstored: LDLIBS += $(SOCKET_LIBS) -ifeq ($(CONFIG_SYSTEMD),y) -$(XENSTORED_OBJS-y): CFLAGS += $(SYSTEMD_CFLAGS) -xenstored: LDLIBS += $(SYSTEMD_LIBS) -endif - TARGETS := xenstored .PHONY: all diff --git a/tools/xenstored/posix.c b/tools/xenstored/posix.c index d88c82d972d7..6037d739d013 100644 --- a/tools/xenstored/posix.c +++ b/tools/xenstored/posix.c @@ -27,7 +27,7 @@ #include #include #if defined(HAVE_SYSTEMD) -#include +#include #endif #include @@ -393,7 +393,7 @@ void late_init(bool live_update) #if defined(HAVE_SYSTEMD) if (!live_update) { sd_notify(1, "READY=1"); - fprintf(stderr, SD_NOTICE "xenstored is ready\n"); + fprintf(stderr, "xenstored is ready\n"); } #endif }
Re: [PATCH v2 2/4] tools: Import standalone sd_notify() implementation from systemd
On 16.05.24 20:58, Andrew Cooper wrote: ... in order to avoid linking against the whole of libsystemd. Only minimal changes to the upstream copy, to function as a drop-in replacement for sd_notify() and as a header-only library. Signed-off-by: Andrew Cooper With s/cleanup(sd_closep)/cleanup(xen_sd_closep)/ Reviewed-by: Juergen Gross Juergen --- CC: Anthony PERARD CC: Juergen Gross CC: George Dunlap CC: Jan Beulich CC: Stefano Stabellini CC: Julien Grall CC: Christian Lindig CC: Edwin Török v2: * New --- tools/include/xen-sd-notify.h | 98 +++ 1 file changed, 98 insertions(+) create mode 100644 tools/include/xen-sd-notify.h diff --git a/tools/include/xen-sd-notify.h b/tools/include/xen-sd-notify.h new file mode 100644 index ..eda9d8b22d9e --- /dev/null +++ b/tools/include/xen-sd-notify.h @@ -0,0 +1,98 @@ +/* SPDX-License-Identifier: MIT-0 */ + +/* + * Implement the systemd notify protocol without external dependencies. + * Supports both readiness notification on startup and on reloading, + * according to the protocol defined at: + * https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html + * This protocol is guaranteed to be stable as per: + * https://systemd.io/PORTABILITY_AND_STABILITY/ + * + * Differences from the upstream copy: + * - Rename/rework as a drop-in replacement for systemd/sd-daemon.h + * - Only take the subset Xen cares about + * - Respect -Wdeclaration-after-statement + */ + +#ifndef XEN_SD_NOTIFY +#define XEN_SD_NOTIFY + +#include +#include +#include +#include +#include +#include + +static inline void xen_sd_closep(int *fd) { + if (!fd || *fd < 0) +return; + + close(*fd); + *fd = -1; +} + +static inline int xen_sd_notify(const char *message) { + union sockaddr_union { +struct sockaddr sa; +struct sockaddr_un sun; + } socket_addr = { +.sun.sun_family = AF_UNIX, + }; + size_t path_length, message_length; + ssize_t written; + const char *socket_path; + int __attribute__((cleanup(sd_closep))) fd = -1; + + /* Verify the argument first */ + if (!message) +return -EINVAL; + + message_length = strlen(message); + if (message_length == 0) +return -EINVAL; + + /* If the variable is not set, the protocol is a noop */ + socket_path = getenv("NOTIFY_SOCKET"); + if (!socket_path) +return 0; /* Not set? Nothing to do */ + + /* Only AF_UNIX is supported, with path or abstract sockets */ + if (socket_path[0] != '/' && socket_path[0] != '@') +return -EAFNOSUPPORT; + + path_length = strlen(socket_path); + /* Ensure there is room for NUL byte */ + if (path_length >= sizeof(socket_addr.sun.sun_path)) +return -E2BIG; + + memcpy(socket_addr.sun.sun_path, socket_path, path_length); + + /* Support for abstract socket */ + if (socket_addr.sun.sun_path[0] == '@') +socket_addr.sun.sun_path[0] = 0; + + fd = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd < 0) +return -errno; + + if (connect(fd, _addr.sa, offsetof(struct sockaddr_un, sun_path) + path_length) != 0) +return -errno; + + written = write(fd, message, message_length); + if (written != (ssize_t) message_length) +return written < 0 ? -errno : -EPROTO; + + return 1; /* Notified! */ +} + +static inline int sd_notify(int unset_environment, const char *message) { +int r = xen_sd_notify(message); + +if (unset_environment) +unsetenv("NOTIFY_SOCKET"); + +return r; +} + +#endif /* XEN_SD_NOTIFY */
Re: [XEN PATCH v8 1/5] xen/vpci: Clear all vpci status of device
On 17.05.24 11:50, Jan Beulich wrote: On 17.05.2024 11:28, Chen, Jiqian wrote: On 2024/5/17 16:20, Jan Beulich wrote: On 17.05.2024 10:08, Chen, Jiqian wrote: On 2024/5/16 21:08, Jan Beulich wrote: On 16.05.2024 11:52, Jiqian Chen wrote: struct physdev_pci_device { /* IN */ uint16_t seg; Is re-using this struct for this new sub-op sufficient? IOW are all possible resets equal, and hence it doesn't need specifying what kind of reset was done? For example, other than FLR most reset variants reset all functions in one go aiui. Imo that would better require only a single hypercall, just to avoid possible confusion. It also reads as if FLR would not reset as many registers as other reset variants would. If I understood correctly that you mean in this hypercall it needs to support resetting both one function and all functions of a slot(dev)? But it can be done for caller to use a cycle to call this reset hypercall for each slot function. It could, yes, but since (aiui) there needs to be an indication of the kind of reset anyway, we can as well avoid relying on the caller doing so (and at the same time simplify what the caller needs to do). Since the corresponding kernel patch has been merged into linux_next branch https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20240515=b272722511d5e8ae580f01830687b8a6b2717f01, if it's not very mandatory and necessary, just let the caller handle it temporarily. As also mentioned for the other patch having a corresponding kernel one: The kernel patch would imo better not be merged until the new sub-op is actually finalized. Oh, sorry to have overlooked that the interfcae change isn't yet committed on Xen side. I'll drop the patch from my linux-next branch. Juergen
Re: Xen crash in scheduler during cpu hotplug
On 15.05.24 15:22, Andrew Cooper wrote: On 15/05/2024 1:39 pm, Jan Beulich wrote: On 15.05.2024 13:58, Andrew Cooper wrote: Just so it doesn't get lost. In XenRT, we've found: (XEN) [ Xen-4.19.0-1-d x86_64 debug=y Tainted: H ] (XEN) CPU: 45 (XEN) RIP: e008:[] common/sched/credit.c#csched_load_balance+0x41/0x877 (XEN) RFLAGS: 00010092 CONTEXT: hypervisor (XEN) rax: 82d040981618 rbx: 82d040981618 rcx: (XEN) rdx: 003ff68cd000 rsi: 002d rdi: 83103723d450 (XEN) rbp: 83207caa7d48 rsp: 83207caa7b98 r8: (XEN) r9: 831037253cf0 r10: 83103767c3f0 r11: 0009 (XEN) r12: 831037237990 r13: 831037237990 r14: 831037253720 (XEN) r15: cr0: 8005003b cr4: 00f526e0 (XEN) cr3: 5bc2f000 cr2: 0010 (XEN) fsb: gsb: gss: (XEN) ds: es: fs: gs: ss: cs: e008 (XEN) Xen code around (common/sched/credit.c#csched_load_balance+0x41/0x877): (XEN) 48 8b 0c 10 48 8b 49 08 <48> 8b 79 10 48 89 bd b8 fe ff ff 49 8b 4e 28 48 (XEN) Xen call trace: (XEN) [] R common/sched/credit.c#csched_load_balance+0x41/0x877 While this is of course pretty little information, I've still tried to decipher it, first noticing it's credit1 that's being used here. Once forcing csched_load_balance() non-inline (no idea why it is a separate function in your build), I can see a sufficiently matching pattern at approximately the same offset into the function. That's const struct cpupool *c = get_sched_res(cpu)->cpupool; ... const cpumask_t *online = c->res_valid; ... BUG_ON(get_sched_res(cpu) != snext->unit->res); overlapping, with the crash being on the middle of the quoted lines. IOW the CPU pool is still NULL for this sched resource. Cc-ing Jürgen for possible clues ... We've seen it in 4.13, 4.17 and upstream, after Roger extended the existing CPU hotplug testing to try and reproduce the MTRR watchdog failure. We've found yet another "no irq for handler" from this too. It's always a deference at NULL+0x10, somewhere within csched_schedule(). I think I've found the reason. In schedule_cpu_add() the cpupool and granularity are set only after releasing the scheduling lock. I think those must be inside the locked region. Can you give this one a try (not tested at all)? diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c index 0cb33831d2..babac7aad6 100644 --- a/xen/common/sched/core.c +++ b/xen/common/sched/core.c @@ -3176,6 +3176,8 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c) sr->scheduler = new_ops; sr->sched_priv = ppriv; +sr->granularity = cpupool_get_granularity(c); +sr->cpupool = c; /* * Reroute the lock to the per pCPU lock as /last/ thing. In fact, @@ -3188,8 +3190,6 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c) /* _Not_ pcpu_schedule_unlock(): schedule_lock has changed! */ spin_unlock_irqrestore(old_lock, flags); -sr->granularity = cpupool_get_granularity(c); -sr->cpupool = c; /* The cpu is added to a pool, trigger it to go pick up some work */ cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ); Juergen
Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq
On 13.05.24 09:47, Chen, Jiqian wrote: Hi, On 2024/5/10 17:06, Chen, Jiqian wrote: Hi, On 2024/5/10 14:46, Jürgen Groß wrote: On 19.04.24 05:36, Jiqian Chen wrote: In PVH dom0, it uses the linux local interrupt mechanism, when it allocs irq for a gsi, it is dynamic, and follow the principle of applying first, distributing first. And the irq number is alloced from small to large, but the applying gsi number is not, may gsi 38 comes before gsi 28, it causes the irq number is not equal with the gsi number. And when passthrough a device, QEMU will use device's gsi number to do pirq mapping, but the gsi number is got from file /sys/bus/pci/devices//irq, irq!= gsi, so it will fail when mapping. And in current linux codes, there is no method to translate irq to gsi for userspace. For above purpose, record the relationship of gsi and irq when PVH dom0 do acpi_register_gsi_ioapic for devices and adds a new syscall into privcmd to let userspace can get that translation when they have a need. Co-developed-by: Huang Rui Signed-off-by: Jiqian Chen --- arch/x86/include/asm/apic.h | 8 +++ arch/x86/include/asm/xen/pci.h | 5 arch/x86/kernel/acpi/boot.c | 2 +- arch/x86/pci/xen.c | 21 + drivers/xen/events/events_base.c | 39 drivers/xen/privcmd.c | 19 include/uapi/xen/privcmd.h | 7 ++ include/xen/events.h | 5 8 files changed, 105 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 9d159b771dc8..dd4139250895 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -169,6 +169,9 @@ extern bool apic_needs_pit(void); extern void apic_send_IPI_allbutself(unsigned int vector); +extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, + int trigger, int polarity); + #else /* !CONFIG_X86_LOCAL_APIC */ static inline void lapic_shutdown(void) { } #define local_apic_timer_c2_ok 1 @@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { } static inline void lapic_assign_system_vectors(void) { } static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { } static inline bool apic_needs_pit(void) { return true; } +static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, + int trigger, int polarity) +{ + return (int)gsi; +} #endif /* !CONFIG_X86_LOCAL_APIC */ #ifdef CONFIG_X86_X2APIC diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h index 9015b888edd6..aa8ded61fc2d 100644 --- a/arch/x86/include/asm/xen/pci.h +++ b/arch/x86/include/asm/xen/pci.h @@ -5,6 +5,7 @@ #if defined(CONFIG_PCI_XEN) extern int __init pci_xen_init(void); extern int __init pci_xen_hvm_init(void); +extern int __init pci_xen_pvh_init(void); #define pci_xen 1 #else #define pci_xen 0 @@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void) { return -1; } +static inline int pci_xen_pvh_init(void) +{ + return -1; +} #endif #ifdef CONFIG_XEN_PV_DOM0 int __init pci_xen_initial_domain(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 85a3ce2a3666..72c73458c083 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 gsi, } #ifdef CONFIG_X86_LOCAL_APIC -static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, +int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, int trigger, int polarity) { int irq = gsi; diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index 652cd53e77f6..f056ab5c0a06 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, u32 gsi, false /* no mapping of GSI to PIRQ */); } +static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi, + int trigger, int polarity) +{ + int irq; + + irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity); + if (irq < 0) + return irq; + + if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST) + printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq); + + return irq; +} + #ifdef CONFIG_XEN_PV_DOM0 static int xen_register_gsi(u32 gsi, int triggering, int polarity) { @@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void) return 0; } +int __init pci_xen_pvh_init(void) +{ + __acpi_register_gsi = acpi_register_gsi_xen_pvh; No support for unregistering the gsi again? __acpi_unregister_gsi is set in function acpi_set_irq_model_ioapic. Maybe I need to use a new function to call acpi_unregister_gsi_ioapic and remove the mapping of irq and gsi from xen_irq_list_head ? When I tried to support unregist
Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq
On 10.05.24 12:32, Chen, Jiqian wrote: On 2024/5/10 18:21, Jürgen Groß wrote: On 10.05.24 12:13, Chen, Jiqian wrote: On 2024/5/10 17:53, Jürgen Groß wrote: On 10.05.24 11:06, Chen, Jiqian wrote: Hi, On 2024/5/10 14:46, Jürgen Groß wrote: On 19.04.24 05:36, Jiqian Chen wrote: + + info->type = IRQT_PIRQ; I am considering whether I need to use a new type(like IRQT_GSI) here to distinguish with IRQT_PIRQ, because function restore_pirqs will process all IRQT_PIRQ. restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this enough? No, it is not enough. xen_pvh_add_gsi_irq_map adds the mapping of gsi and irq, but the value of gsi is not 0, once restore_pirqs is called, it will do PHYSDEVOP_map_pirq for that gsi, but in pvh dom0, we shouldn't do PHYSDEVOP_map_pirq. Okay, then add a new flag to info->u.pirq.flags for that purpose? I feel like adding "new flag to info->u.pirq.flags" is not as good as adding " new type to info->type". Because in restore_pirqs, it considers " info->type != IRQT_PIRQ", if adding " new flag to info->u.pirq.flags", we need to add a new condition in restore_pirqs. And actually this mapping(gsi and irq of pvh) doesn't have pirq, so it is not suitable to add to u.pirq.flags. Does this mean there is no other IRQT_PIRQ related activity relevant for those GSIs/IRQs? In that case I agree to add IRQT_GSI. Juergen
Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq
On 10.05.24 12:13, Chen, Jiqian wrote: On 2024/5/10 17:53, Jürgen Groß wrote: On 10.05.24 11:06, Chen, Jiqian wrote: Hi, On 2024/5/10 14:46, Jürgen Groß wrote: On 19.04.24 05:36, Jiqian Chen wrote: + + info->type = IRQT_PIRQ; I am considering whether I need to use a new type(like IRQT_GSI) here to distinguish with IRQT_PIRQ, because function restore_pirqs will process all IRQT_PIRQ. restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this enough? No, it is not enough. xen_pvh_add_gsi_irq_map adds the mapping of gsi and irq, but the value of gsi is not 0, once restore_pirqs is called, it will do PHYSDEVOP_map_pirq for that gsi, but in pvh dom0, we shouldn't do PHYSDEVOP_map_pirq. Okay, then add a new flag to info->u.pirq.flags for that purpose? Juergen
Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq
On 10.05.24 11:06, Chen, Jiqian wrote: Hi, On 2024/5/10 14:46, Jürgen Groß wrote: On 19.04.24 05:36, Jiqian Chen wrote: + + info->type = IRQT_PIRQ; I am considering whether I need to use a new type(like IRQT_GSI) here to distinguish with IRQT_PIRQ, because function restore_pirqs will process all IRQT_PIRQ. restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this enough? Juergen
Re: [PATCH 1/5] xen: sync elfnote.h from xen tree
On 10.04.24 21:48, Jason Andryuk wrote: Sync Xen's elfnote.h header from xen.git to pull in the XEN_ELFNOTE_PHYS32_RELOC define. xen commit dfc9fab00378 ("x86/PVH: Support relocatable dom0 kernels") This is a copy except for the removal of the emacs editor config at the end of the file. Signed-off-by: Jason Andryuk Reviewed-by: Juergen Gross Juergen
Re: [PATCH] libxl: Fix handling XenStore errors in device creation
On 27.04.24 04:17, Demi Marie Obenour wrote: If xenstored runs out of memory it is possible for it to fail operations that should succeed. libxl wasn't robust against this, and could fail to ensure that the TTY path of a non-initial console was created and read-only for guests. This doesn't qualify for an XSA because guests should not be able to run xenstored out of memory, but it still needs to be fixed. Add the missing error checks to ensure that all errors are properly handled and that at no point can a guest make the TTY path of its frontend directory writable. Signed-off-by: Demi Marie Obenour Apart from one nit below: Reviewed-by: Juergen Gross --- tools/libs/light/libxl_console.c | 10 ++--- tools/libs/light/libxl_device.c | 72 tools/libs/light/libxl_xshelp.c | 13 -- 3 files changed, 59 insertions(+), 36 deletions(-) diff --git a/tools/libs/light/libxl_console.c b/tools/libs/light/libxl_console.c index cd7412a3272a2faf4b9dab0ef4dd077e55472546..adf82aa844a4f4989111bfc8a94af18ad8e114f1 100644 --- a/tools/libs/light/libxl_console.c +++ b/tools/libs/light/libxl_console.c @@ -351,11 +351,10 @@ int libxl__device_console_add(libxl__gc *gc, uint32_t domid, flexarray_append(front, "protocol"); flexarray_append(front, LIBXL_XENCONSOLE_PROTOCOL); } -libxl__device_generic_add(gc, XBT_NULL, device, - libxl__xs_kvs_of_flexarray(gc, back), - libxl__xs_kvs_of_flexarray(gc, front), - libxl__xs_kvs_of_flexarray(gc, ro_front)); -rc = 0; +rc = libxl__device_generic_add(gc, XBT_NULL, device, + libxl__xs_kvs_of_flexarray(gc, back), + libxl__xs_kvs_of_flexarray(gc, front), + libxl__xs_kvs_of_flexarray(gc, ro_front)); out: return rc; } @@ -665,6 +664,7 @@ int libxl_device_channel_getinfo(libxl_ctx *ctx, uint32_t domid, */ if (!val) val = "/NO-SUCH-PATH"; channelinfo->u.pty.path = strdup(val); + if (channelinfo->u.pty.path == NULL) abort(); Even with the bad example 2 lines up, please put the "abort();" into a line of its own. Juergen
Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq
On 19.04.24 05:36, Jiqian Chen wrote: In PVH dom0, it uses the linux local interrupt mechanism, when it allocs irq for a gsi, it is dynamic, and follow the principle of applying first, distributing first. And the irq number is alloced from small to large, but the applying gsi number is not, may gsi 38 comes before gsi 28, it causes the irq number is not equal with the gsi number. And when passthrough a device, QEMU will use device's gsi number to do pirq mapping, but the gsi number is got from file /sys/bus/pci/devices//irq, irq!= gsi, so it will fail when mapping. And in current linux codes, there is no method to translate irq to gsi for userspace. For above purpose, record the relationship of gsi and irq when PVH dom0 do acpi_register_gsi_ioapic for devices and adds a new syscall into privcmd to let userspace can get that translation when they have a need. Co-developed-by: Huang Rui Signed-off-by: Jiqian Chen --- arch/x86/include/asm/apic.h | 8 +++ arch/x86/include/asm/xen/pci.h | 5 arch/x86/kernel/acpi/boot.c | 2 +- arch/x86/pci/xen.c | 21 + drivers/xen/events/events_base.c | 39 drivers/xen/privcmd.c| 19 include/uapi/xen/privcmd.h | 7 ++ include/xen/events.h | 5 8 files changed, 105 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h index 9d159b771dc8..dd4139250895 100644 --- a/arch/x86/include/asm/apic.h +++ b/arch/x86/include/asm/apic.h @@ -169,6 +169,9 @@ extern bool apic_needs_pit(void); extern void apic_send_IPI_allbutself(unsigned int vector); +extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, + int trigger, int polarity); + #else /* !CONFIG_X86_LOCAL_APIC */ static inline void lapic_shutdown(void) { } #define local_apic_timer_c2_ok1 @@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { } static inline void lapic_assign_system_vectors(void) { } static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { } static inline bool apic_needs_pit(void) { return true; } +static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, + int trigger, int polarity) +{ + return (int)gsi; +} #endif /* !CONFIG_X86_LOCAL_APIC */ #ifdef CONFIG_X86_X2APIC diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h index 9015b888edd6..aa8ded61fc2d 100644 --- a/arch/x86/include/asm/xen/pci.h +++ b/arch/x86/include/asm/xen/pci.h @@ -5,6 +5,7 @@ #if defined(CONFIG_PCI_XEN) extern int __init pci_xen_init(void); extern int __init pci_xen_hvm_init(void); +extern int __init pci_xen_pvh_init(void); #define pci_xen 1 #else #define pci_xen 0 @@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void) { return -1; } +static inline int pci_xen_pvh_init(void) +{ + return -1; +} #endif #ifdef CONFIG_XEN_PV_DOM0 int __init pci_xen_initial_domain(void); diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c index 85a3ce2a3666..72c73458c083 100644 --- a/arch/x86/kernel/acpi/boot.c +++ b/arch/x86/kernel/acpi/boot.c @@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 gsi, } #ifdef CONFIG_X86_LOCAL_APIC -static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, +int acpi_register_gsi_ioapic(struct device *dev, u32 gsi, int trigger, int polarity) { int irq = gsi; diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c index 652cd53e77f6..f056ab5c0a06 100644 --- a/arch/x86/pci/xen.c +++ b/arch/x86/pci/xen.c @@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, u32 gsi, false /* no mapping of GSI to PIRQ */); } +static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi, + int trigger, int polarity) +{ + int irq; + + irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity); + if (irq < 0) + return irq; + + if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST) + printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq); + + return irq; +} + #ifdef CONFIG_XEN_PV_DOM0 static int xen_register_gsi(u32 gsi, int triggering, int polarity) { @@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void) return 0; } +int __init pci_xen_pvh_init(void) +{ + __acpi_register_gsi = acpi_register_gsi_xen_pvh; No support for unregistering the gsi again? + return 0; +} + #ifdef CONFIG_XEN_PV_DOM0 int __init pci_xen_initial_domain(void) { diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c index 27553673e46b..80d4f7faac64 100644 --- a/drivers/xen/events/events_base.c +++
Re: [PATCH] xen/x86: add extra pages to unpopulated-alloc if available
On 29.04.24 17:50, Roger Pau Monne wrote: Commit 262fc47ac174 ('xen/balloon: don't use PV mode extra memory for zone device allocations') removed the addition of the extra memory ranges to the unpopulated range allocator, using those only for the balloon driver. This forces the unpopulated allocator to attach hotplug ranges even when spare memory (as part of the extra memory ranges) is available. Furthermore, on PVH domains it defeats the purpose of commit 38620fc4e893 ('x86/xen: attempt to inflate the memory balloon on PVH'), as extra memory ranges would only be used to map foreign memory if the kernel is built without XEN_UNPOPULATED_ALLOC support. Fix this by adding a helpers that adds the extra memory ranges to the list of unpopulated pages, and zeroes the ranges so they are not also consumed by the balloon driver. This should have been part of 38620fc4e893, hence the fixes tag. Note the current logic relies on unpopulated_init() (and hence arch_xen_unpopulated_init()) always being called ahead of balloon_init(), so that the extra memory regions are consumed by arch_xen_unpopulated_init(). Fixes: 38620fc4e893 ('x86/xen: attempt to inflate the memory balloon on PVH') Signed-off-by: Roger Pau Monné Reviewed-by: Juergen Gross Juergen
Re: [PATCH] tools/libxs: Open /dev/xen/xenbus fds as O_CLOEXEC
On 04.05.24 03:16, Andrew Cooper wrote: The header description for xs_open() goes as far as to suggest that the fd is O_CLOEXEC, but it isn't actually. `xl devd` has been observed leaking /dev/xen/xenbus into children. Link: https://github.com/QubesOS/qubes-issues/issues/8292 Reported-by: Demi Marie Obenour Signed-off-by: Andrew Cooper With the style breakage below fixed: Reviewed-by: Juergen Gross --- CC: Anthony PERARD CC: Juergen Gross CC: Demi Marie Obenour CC: Marek Marczykowski-Górecki Entirely speculative patch based on a Matrix report --- tools/libs/store/xs.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/tools/libs/store/xs.c b/tools/libs/store/xs.c index 140b9a28395e..1f74fb3c44a2 100644 --- a/tools/libs/store/xs.c +++ b/tools/libs/store/xs.c @@ -54,6 +54,10 @@ struct xs_stored_msg { #include #endif +#ifndef O_CLOEXEC +#define O_CLOEXEC 0 +#endif + struct xs_handle { /* Communications channel to xenstore daemon. */ int fd; @@ -227,7 +231,7 @@ static int get_socket(const char *connect_to) static int get_dev(const char *connect_to) { /* We cannot open read-only because requests are writes */ - return open(connect_to, O_RDWR); + return open(connect_to, O_RDWR|O_CLOEXEC); Nit: spaces around the "|", please. Juergen } static int all_restrict_cb(Xentoolcore__Active_Handle *ah, domid_t domid) { base-commit: feb9158a620040846d76981acbe8ea9e2255a07b
Re: [PATCH v6 8/8] xen: allow up to 16383 cpus
On 29.04.24 13:04, Julien Grall wrote: Hi Juergen, Sorry for the late reply. On 29/04/2024 11:33, Juergen Gross wrote: On 08.04.24 09:10, Jan Beulich wrote: On 27.03.2024 16:22, Juergen Gross wrote: With lock handling now allowing up to 16384 cpus (spinlocks can handle 65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for the number of cpus to be configured to 16383. The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS. Signed-off-by: Juergen Gross Acked-by: Jan Beulich I'd prefer this to also gain an Arm ack, though. Any comment from Arm side? Can you clarify what the new limits mean in term of (security) support? Are we now claiming that Xen will work perfectly fine on platforms with up to 16383? If so, I can't comment for x86, but for Arm, I am doubtful that it would work without any (at least performance) issues. AFAIK, this is also an untested configuration. In fact I would be surprised if Xen on Arm was tested with more than a couple of hundreds cores (AFAICT the Ampere CPUs has 192 CPUs). I think we should add a security support limit for the number of physical cpus similar to the memory support limit we already have in place. For x86 I'd suggest 4096 cpus for security support (basically the limit we have with this patch), but I'm open for other suggestions, too. I have no idea about any sensible limits for Arm32/Arm64. Juergen
Re: [PATCH 1/2] tools/{c,o}xenstored: Don't link against libsystemd
On 25.04.24 19:32, Andrew Cooper wrote: libsystemd is a giant dependency for one single function, but in the wake of the xz backdoor, it turns out that even systemd leadership recommend against linking against libsystemd for sd_notify(). Since commit 7b61011e1450 ("tools: make xenstore domain easy configurable") in Xen 4.8, the launch-xenstore script invokes systemd-notify directly, so its not even necessary for the xenstored's to call sd_notify() themselves. You are aware that in the daemon case the call of systemd-notify does not signal readyness of xenstored? It is just called with the "--booted" parameter in order to detect whether systemd is active or not. So in order to just drop the sd_notify() call from xenstored you need to modify the launch-xenstore script, too. Juergen Therefore, just drop the calls to sd_notify() and stop linking against libsystemd. No functional change. Signed-off-by: Andrew Cooper --- CC: Anthony PERARD CC: Juergen Gross CC: Christian Lindig CC: Edwin Török CC: Stefano Stabellini --- tools/ocaml/xenstored/Makefile| 12 +-- tools/ocaml/xenstored/systemd.ml | 15 - tools/ocaml/xenstored/systemd.mli | 16 - tools/ocaml/xenstored/systemd_stubs.c | 47 --- tools/ocaml/xenstored/xenstored.ml| 1 - tools/xenstored/Makefile | 5 --- tools/xenstored/posix.c | 9 - 7 files changed, 1 insertion(+), 104 deletions(-) delete mode 100644 tools/ocaml/xenstored/systemd.ml delete mode 100644 tools/ocaml/xenstored/systemd.mli delete mode 100644 tools/ocaml/xenstored/systemd_stubs.c diff --git a/tools/ocaml/xenstored/Makefile b/tools/ocaml/xenstored/Makefile index e8aaecf2e630..1e4b51cc5432 100644 --- a/tools/ocaml/xenstored/Makefile +++ b/tools/ocaml/xenstored/Makefile @@ -4,8 +4,6 @@ include $(OCAML_TOPLEVEL)/common.make # Include configure output (config.h) CFLAGS += -include $(XEN_ROOT)/tools/config.h -CFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_CFLAGS) -LDFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_LIBS) CFLAGS += $(CFLAGS-y) CFLAGS += $(APPEND_CFLAGS) @@ -25,13 +23,6 @@ poll_OBJS = poll poll_C_OBJS = select_stubs OCAML_LIBRARY = syslog poll -LIBS += systemd.cma systemd.cmxa -systemd_OBJS = systemd -systemd_C_OBJS = systemd_stubs -OCAML_LIBRARY += systemd - -LIBS_systemd += $(LDFLAGS-y) - OBJS = paths \ define \ stdext \ @@ -56,12 +47,11 @@ OBJS = paths \ process \ xenstored -INTF = symbol.cmi trie.cmi syslog.cmi systemd.cmi poll.cmi +INTF = symbol.cmi trie.cmi syslog.cmi poll.cmi XENSTOREDLIBS = \ unix.cmxa \ -ccopt -L -ccopt . syslog.cmxa \ - -ccopt -L -ccopt . systemd.cmxa \ -ccopt -L -ccopt . poll.cmxa \ -ccopt -L -ccopt $(OCAML_TOPLEVEL)/libs/mmap $(OCAML_TOPLEVEL)/libs/mmap/xenmmap.cmxa \ -ccopt -L -ccopt $(OCAML_TOPLEVEL)/libs/eventchn $(OCAML_TOPLEVEL)/libs/eventchn/xeneventchn.cmxa \ diff --git a/tools/ocaml/xenstored/systemd.ml b/tools/ocaml/xenstored/systemd.ml deleted file mode 100644 index 39127f712d72.. --- a/tools/ocaml/xenstored/systemd.ml +++ /dev/null @@ -1,15 +0,0 @@ -(* - * Copyright (C) 2014 Luis R. Rodriguez - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU Lesser General Public License as published - * by the Free Software Foundation; version 2.1 only. with the special - * exception on linking described in file LICENSE. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU Lesser General Public License for more details. - *) - -external sd_notify_ready: unit -> unit = "ocaml_sd_notify_ready" diff --git a/tools/ocaml/xenstored/systemd.mli b/tools/ocaml/xenstored/systemd.mli deleted file mode 100644 index 18b9331031f9.. --- a/tools/ocaml/xenstored/systemd.mli +++ /dev/null @@ -1,16 +0,0 @@ -(* - * Copyright (C) 2014 Luis R. Rodriguez - * - * This program is free software; you can redistribute it and/or modify - * it under the terms of the GNU Lesser General Public License as published - * by the Free Software Foundation; version 2.1 only. with the special - * exception on linking described in file LICENSE. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - * GNU Lesser General Public License for more details. - *) - -(** Tells systemd we're ready *) -external sd_notify_ready: unit -> unit = "ocaml_sd_notify_ready" diff --git a/tools/ocaml/xenstored/systemd_stubs.c b/tools/ocaml/xenstored/systemd_stubs.c deleted file mode 100644 index f4c875075abe.. --- a/tools/ocaml/xenstored/systemd_stubs.c +++ /dev/null @@ -1,47 +0,0 @@ -/*
Re: Detecting whether dom0 is in a VM
On 22.04.24 09:12, Jan Beulich wrote: On 19.04.2024 17:29, George Dunlap wrote: On Fri, Jul 7, 2023 at 3:56 PM George Dunlap wrote: Xen's public interface offers access to the featuresets known / found / used by the hypervisor. See XEN_SYSCTL_get_cpu_featureset, accessible via xc_get_cpu_featureset(). Are any of these exposed in dom0 via sysctl, or hypfs? sysctl - yes (as the quoted name also says). hypfs no, afaict. SYSCTLs are unfortunately not stable interfaces, correct? So it wouldn't be practical for systemd to use them. Indeed, neither sysctl-s nor the libxc interfaces are stable. Thinking of it, xen-cpuid is a wrapper tool around those. They may want to look at its output (and, if they want to use it, advise distros to also package it), which I think we try to keep reasonably stable, albeit without providing any guarantees. We haven't had any clear guidance yet on what the systemd team want in the question; I just sort of assumed they wanted the L-1 virtualization *if possible*. It sounds like `vm-other` would be acceptable, particularly as a fall-back output if there's no way to get Xen's picture of the cpuid. It looks like xen-cpuid is available on Fedora, Debian, Ubuntu, and the old Virt SIG CentOS packages; so I'd expect most packages to follow suit. That's a place to start. Just to take the discussion all the way to its conclusion: - Supposing xen-cpuid isn't available, is there any other way to tell if Xen is running in a VM from dom0? - Would it make sense to expose that information somewhere, either in sysfs or in hypfs (or both?), so that eventually even systems which may not get the memo about packaging xen-cpuid will get support (or if the systemd guys would rather avoid executing another process if possible)? Resurrecting this thread. To recap: Currently, `systemd-detect-virt` has the following input / output table: 1. Xen on real hardware, domU: xen 2. Xen on real hardware, dom0: vm-other 3. Xen in a VM, domU: xen 4. Xen in aVM, dom0: vm-other It's the dom0 lines (2 and 4) which are problematic; we'd ideally like those to be `none` and `vm-other` (or the actual value, like `xen` or `kvm`). It looks like ATM, /proc/xen/capabilities will contain `control_d` if it's a dom0. Simply unilaterally returning `none` if /proc/xen/capabilities contains `control_d` would correct the vast majority of instances (since the number of instances of Xen on real hardware is presumably higher than the number running virtualized). Is /proc/xen/capabilities expected to stay around? I don't see anything equivalent in /dev/xen. I believe it's intended to stay around, but a definitive answer can only come from Linux folks. Jürgen, Stefano? I have no plans to remove it. Juergen Jan Other than adding a new interface to Xen, is there any way to tell if Xen is running in a VM? If we do need to expose an interface, what would be the best way to do that? -George
Re: [PATCH] locking/x86/xen: Use try_cmpxchg() in xen_alloc_p2m_entry()
On 05.04.24 10:32, Uros Bizjak wrote: Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) == old. The x86 CMPXCHG instruction returns success in the ZF flag, so this change saves a compare after CMPXCHG. Also, try_cmpxchg() implicitly assigns old *ptr value to "old" when CMPXCHG fails. There is no need to explicitly assign old *ptr value to the temporary, which can simplify the surrounding source code. No functional change intended. Signed-off-by: Uros Bizjak Cc: Juergen Gross Cc: Boris Ostrovsky Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Reviewed-by: Juergen Gross Juergen
Re: [PATCH] x86/pat: fix W^X violation false-positives when running as Xen PV guest
On 10.04.24 15:48, Ingo Molnar wrote: * Juergen Gross wrote: When running as Xen PV guest in some cases W^X violation WARN()s have been observed. Those WARN()s are produced by verify_rwx(), which looks into the PTE to verify that writable kernel pages have the NX bit set in order to avoid code modifications of the kernel by rogue code. As the NX bits of all levels of translation entries are or-ed and the RW bits of all levels are and-ed, looking just into the PTE isn't enough for the decision that a writable page is executable, too. When running as a Xen PV guest, kernel initialization will set the NX bit in PMD entries of the initial page tables covering the .data segment. When finding the PTE to have set the RW bit but no NX bit, higher level entries must be looked at. Only when all levels have the RW bit set and no NX bit set, the W^X violation should be flagged. Additionally show_fault_oops() has a similar problem: it will issue the "kernel tried to execute NX-protected page" message only if it finds the NX bit set in the leaf translation entry, while any NX bit in non-leaf entries are being ignored for issuing the message. Modify lookup_address_in_pgd() to return the effective NX and RW bit values of the non-leaf translation entries and evaluate those as well in verify_rwx() and show_fault_oops(). Ok, this fix makes sense, as that's how the hardware works and we interpret the pagetables poorly. Thanks for confirmation that my approach is sane. Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations") Reported-by: Jason Andryuk Signed-off-by: Juergen Gross --- arch/x86/include/asm/pgtable_types.h | 2 +- arch/x86/kernel/sev.c| 3 +- arch/x86/mm/fault.c | 7 ++-- arch/x86/mm/pat/set_memory.c | 56 +--- arch/x86/virt/svm/sev.c | 3 +- 5 files changed, 52 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 0b748ee16b3d..91ab538d3872 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -565,7 +565,7 @@ static inline void update_page_count(int level, unsigned long pages) { } */ extern pte_t *lookup_address(unsigned long address, unsigned int *level); extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, - unsigned int *level); + unsigned int *level, bool *nx, bool *rw); extern pmd_t *lookup_pmd_address(unsigned long address); extern phys_addr_t slow_virt_to_phys(void *__address); extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, Please introduce a new lookup_address_in_pgd_attr() function or so, which is used by code intentionally. This avoids changing the arch/x86/kernel/sev.c and arch/x86/virt/svm/sev.c uses, that retrieve these attributes but don't do anything with them: Okay. diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index 38ad066179d8..adba581e999d 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -516,12 +516,13 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt unsigned long va = (unsigned long)vaddr; unsigned int level; phys_addr_t pa; + bool nx, rw; pgd_t *pgd; pte_t *pte; pgd = __va(read_cr3_pa()); pgd = [pgd_index(va)]; - pte = lookup_address_in_pgd(pgd, va, ); + pte = lookup_address_in_pgd(pgd, va, , , ); if (!pte) { ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.cr2= vaddr; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 622d12ec7f08..eb8e897a5653 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -514,18 +514,19 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad if (error_code & X86_PF_INSTR) { unsigned int level; + bool nx, rw; pgd_t *pgd; pte_t *pte; pgd = __va(read_cr3_pa()); pgd += pgd_index(address); - pte = lookup_address_in_pgd(pgd, address, ); + pte = lookup_address_in_pgd(pgd, address, , , ); - if (pte && pte_present(*pte) && !pte_exec(*pte)) + if (pte && pte_present(*pte) && (!pte_exec(*pte) || nx)) pr_crit("kernel tried to execute NX-protected page - exploit attempt? (uid: %d)\n", from_kuid(_user_ns, current_uid())); - if (pte && pte_present(*pte) && pte_exec(*pte) && + if (pte && pte_present(*pte) && pte_exec(*pte) && !nx && (pgd_flags(*pgd) & _PAGE_USER) && (__read_cr4() & X86_CR4_SMEP)) pr_crit("unable to execute userspace code (SMEP?) (uid: %d)\n", This should be a separate patch - as it might change
Re: [PATCH 2/2] x86/xen: return a sane initial apic id when running as PV guest
On 05.04.24 14:50, Andrew Cooper wrote: On 05/04/2024 1:34 pm, Juergen Gross wrote: With recent sanity checks for topology information added, there are now warnings issued for APs when running as a Xen PV guest: [Firmware Bug]: CPU 1: APIC ID mismatch. CPUID: 0x APIC: 0x0001 This is due to the initial APIC ID obtained via CPUID for PV guests is always 0. /sigh From Xen: switch ( leaf ) { case 0x1: /* TODO: Rework topology logic. */ res->b &= 0x00ffu; if ( is_hvm_domain(d) ) res->b |= (v->vcpu_id * 2) << 24; I think there's a very good chance it was random prior to Xen 4.6. That used to come straight out of a CPUID value, so would get the APIC ID of whichever pCPU it was scheduled on. Avoid the warnings by synthesizing the CPUID data to contain the same initial APIC ID as xen_pv_smp_config() is using for registering the APIC IDs of all CPUs. Fixes: 52128a7a21f7 ("86/cpu/topology: Make the APIC mismatch warnings complete") Signed-off-by: Juergen Gross --- arch/x86/xen/enlighten_pv.c | 10 +- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c index ace2eb054053..965e4ca36024 100644 --- a/arch/x86/xen/enlighten_pv.c +++ b/arch/x86/xen/enlighten_pv.c @@ -219,13 +219,20 @@ static __read_mostly unsigned int cpuid_leaf5_edx_val; static void xen_cpuid(unsigned int *ax, unsigned int *bx, unsigned int *cx, unsigned int *dx) { - unsigned maskebx = ~0; + unsigned int maskebx = ~0; + unsigned int or_ebx = 0; /* * Mask out inconvenient features, to try and disable as many * unsupported kernel subsystems as possible. */ switch (*ax) { + case 0x1: + /* Replace initial APIC ID in bits 24-31 of EBX. */ + maskebx = 0x00ff; + or_ebx = smp_processor_id() << 24; I think the comment wants to cross-reference explicitly with xen_pv_smp_config(), because what we care about here is the two sources of information matching. I can add that as a comment. OTOH I'd really hope someone changing this code later would look into the commit message of the patch adding it. :-) Also while you're at it, the x2APIC ID in leaf 0xb. I'm not sure this is functionally relevant in PV guests. Note that my patch is only meant to silence warnings during boot. It is not needed for the system working correctly (at least I think so). Juergen
Re: [PATCH v6 7/8] xen/rwlock: raise the number of possible cpus
On 02.04.24 16:52, Jan Beulich wrote: On 27.03.2024 16:22, Juergen Gross wrote: @@ -36,14 +36,21 @@ void queue_write_lock_slowpath(rwlock_t *lock); static inline bool _is_write_locked_by_me(unsigned int cnts) { -BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS); +BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS); +BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX); return (cnts & _QW_WMASK) == _QW_LOCKED && (cnts & _QW_CPUMASK) == smp_processor_id(); } static inline bool _can_read_lock(unsigned int cnts) { -return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts); +/* + * If write locked by the caller, no other readers are possible. + * Not allowing the lock holder to read_lock() another 32768 times ought + * to be fine. + */ +return cnts <= INT_MAX && + (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts)); } What is the 32768 in the comment relating to? INT_MAX is quite a bit higher, yet the comparison against it is the only thing you add. Whereas the reader count is, with the sign bit unused, 17 bits, though (bits 14..30). I think You missed: #define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */ So the reader's shift is 16, resulting in 15 bits for the reader count. even in such a comment rather than using a literal number the corresponding expression would better be stated. Hmm, you mean replacing the 32768 with INT_MAX >> _QR_SHIFT? This would be fine with me. Juergen
Re: [PATCH v6 6/8] xen/spinlock: support higher number of cpus
On 02.04.24 16:42, Jan Beulich wrote: On 27.03.2024 16:22, Juergen Gross wrote: Allow 16 bits per cpu number, which is the limit imposed by spinlock_tickets_t. This will allow up to 65535 cpus, while increasing only the size of recursive spinlocks in debug builds from 8 to 12 bytes. The current Xen limit of 4095 cpus is imposed by SPINLOCK_CPU_BITS being 12. There are machines available with more cpus than the current Xen limit, so it makes sense to have the possibility to use more cpus. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich albeit I have to say that I'm not entirely convinced of ... --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -485,7 +485,9 @@ bool _rspin_trylock(rspinlock_t *lock) /* Don't allow overflow of recurse_cpu field. */ BUILD_BUG_ON(NR_CPUS > SPINLOCK_NO_CPU); +BUILD_BUG_ON(SPINLOCK_CPU_BITS > sizeof(lock->recurse_cpu) * 8); BUILD_BUG_ON(SPINLOCK_RECURSE_BITS < 3); +BUILD_BUG_ON(SPINLOCK_MAX_RECURSE > ((1u << SPINLOCK_RECURSE_BITS) - 1)); check_lock(>debug, true); ... the two additions here: The two checks we had verify independent properties, whereas the new ones basically check that struct rspinlock and its associated #define-s were got right. We don't check such elsewhere, I don't think. I think we do. What about: BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw)) checking that two union elements are of the same size (and both elements don't contain any other structs). Additionally it is not obvious at a first glance that SPINLOCK_CPU_BITS defined in line 11 is relevant for the definition of recurse_cpu in line 217. Regarding the second added BUILD_BUG_ON() there was a comment by Julien related to the definition of SPINLOCK_MAX_RECURSE in V4 of this patch. We settled to use the current form including the added BUILD_BUG_ON(). Juergen
Re: Linux Xen PV CPA W^X violation false-positives
Hi Jason, On 28.03.24 02:24, Jason Andryuk wrote: On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß wrote: On 24.01.24 17:54, Jason Andryuk wrote: + + return new; + } + } + end = start + npg * PAGE_SIZE - 1; WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx - 0x%016lx PFN %lx\n", (unsigned long long)pgprot_val(old), Jason, do you want to send a V2 with your Signed-off, or would you like me to try upstreaming the patch? Hi Jürgen, Yes, please upstream your approach. I wasn't sure how to deal with it, so it was more of a bug report. The final solution was a bit more complicated, as there are some corner cases to be considered. OTOH it is now complete by looking at all used translation entries. Are you able to test the attached patch? I don't see the original issue and can only verify the patch doesn't cause any regression. Juergen From fd25a67d92e44b61d05d92658b23d026202a1656 Mon Sep 17 00:00:00 2001 From: Juergen Gross To: x...@kernel.org To: linux-ker...@vger.kernel.org To: xen-devel@lists.xenproject.org Cc: Thomas Gleixner Cc: Ingo Molnar Cc: Borislav Petkov Cc: Dave Hansen Cc: "H. Peter Anvin" Cc: Andy Lutomirski Cc: Peter Zijlstra Date: Thu, 28 Mar 2024 12:24:48 +0100 Subject: [PATCH] x86/pat: fix W^X violation false-positives when running as Xen PV guest When running as Xen PV guest in some cases W^X violation WARN()s have been observed. Those WARN()s are produced by verify_rwx(), which looks into the PTE to verify that writable kernel pages have the NX bit set in order to avoid code modifications of the kernel by rogue code. As the NX bits of all levels of translation entries are or-ed and the RW bits of all levels are and-ed, looking just into the PTE isn't enough for the decision that a writable page is executable, too. When running as a Xen PV guest, kernel initialization will set the NX bit in PMD entries of the initial page tables covering the .data segment. When finding the PTE to have set the RW bit but no NX bit, higher level entries must be looked at. Only when all levels have the RW bit set and no NX bit set, the W^X violation should be flagged. Additionally show_fault_oops() has a similar problem: it will issue the "kernel tried to execute NX-protected page" message only if it finds the NX bit set in the leaf translation entry, while any NX bit in non-leaf entries are being ignored for issuing the message. Modify lookup_address_in_pgd() to return the effective NX and RW bit values of the non-leaf translation entries and evaluate those as well in verify_rwx() and show_fault_oops(). Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations") Reported-by: Jason Andryuk Signed-off-by: Juergen Gross --- arch/x86/include/asm/pgtable_types.h | 2 +- arch/x86/kernel/sev.c| 3 +- arch/x86/mm/fault.c | 7 ++-- arch/x86/mm/pat/set_memory.c | 56 +--- arch/x86/virt/svm/sev.c | 3 +- 5 files changed, 52 insertions(+), 19 deletions(-) diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h index 0b748ee16b3d..91ab538d3872 100644 --- a/arch/x86/include/asm/pgtable_types.h +++ b/arch/x86/include/asm/pgtable_types.h @@ -565,7 +565,7 @@ static inline void update_page_count(int level, unsigned long pages) { } */ extern pte_t *lookup_address(unsigned long address, unsigned int *level); extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address, -unsigned int *level); +unsigned int *level, bool *nx, bool *rw); extern pmd_t *lookup_pmd_address(unsigned long address); extern phys_addr_t slow_virt_to_phys(void *__address); extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn, diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c index b59b09c2f284..e833183d1adb 100644 --- a/arch/x86/kernel/sev.c +++ b/arch/x86/kernel/sev.c @@ -515,12 +515,13 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt unsigned long va = (unsigned long)vaddr; unsigned int level; phys_addr_t pa; + bool nx, rw; pgd_t *pgd; pte_t *pte; pgd = __va(read_cr3_pa()); pgd = [pgd_index(va)]; - pte = lookup_address_in_pgd(pgd, va, ); + pte = lookup_address_in_pgd(pgd, va, , , ); if (!pte) { ctxt->fi.vector = X86_TRAP_PF; ctxt->fi.cr2= vaddr; diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index 622d12ec7f08..eb8e897a5653 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -514,18 +514,19 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad if (error_code & X86_PF_INSTR) { unsigned int level; + bool nx, rw; pgd_t *pgd; pte_t *pte; pgd = __va(read_cr3_pa()); pgd += pgd_index(address); - pte = lookup_address_in_pgd(pgd, address, ); + pte = lookup_address_in_pgd(pgd, address, , , )
Re: Linux Xen PV CPA W^X violation false-positives
On 24.01.24 17:54, Jason Andryuk wrote: Xen PV domains show CPA W^X violations like: CPA detected W^X violation: 0064 -> 0067 range: 0x8881 - 0x88810fff PFN 10 WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 __change_page_attr_set_clr+0x113a/0x11c0 Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O) CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G O 6.1.38 #1 Workqueue: events bpf_prog_free_deferred RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0 Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 2a fd ff ff 48 8b 85 60 ff ff ff 48 RSP: e02b:c9367c48 EFLAGS: 00010282 RAX: RBX: 000ef064 RCX: RDX: 0003 RSI: f7ff RDI: RBP: c9367d48 R08: R09: c9367aa0 R10: 0001 R11: 0001 R12: 0067 R13: 0001 R14: 8881 R15: c9367d60 FS: () GS:88800b80() knlGS: CS: e030 DS: ES: CR0: 80050033 CR2: 7fdbaeda01c0 CR3: 04312000 CR4: 00050660 Call Trace: ? show_regs.cold+0x1a/0x1f ? __change_page_attr_set_clr+0x113a/0x11c0 ? __warn+0x7b/0xc0 ? __change_page_attr_set_clr+0x113a/0x11c0 ? report_bug+0x111/0x1a0 ? handle_bug+0x4d/0xa0 ? exc_invalid_op+0x19/0x70 ? asm_exc_invalid_op+0x1b/0x20 ? __change_page_attr_set_clr+0x113a/0x11c0 ? __change_page_attr_set_clr+0x113a/0x11c0 ? debug_smp_processor_id+0x17/0x20 ? ___cache_free+0x2e/0x1e0 ? _raw_spin_unlock+0x1e/0x40 ? __purge_vmap_area_lazy+0x2ea/0x6b0 set_direct_map_default_noflush+0x7c/0xa0 __vunmap+0x1ac/0x280 __vfree+0x1d/0x60 vfree+0x27/0x40 __bpf_prog_free+0x44/0x50 bpf_prog_free_deferred+0x104/0x120 process_one_work+0x1ca/0x3d0 ? process_one_work+0x3d0/0x3d0 worker_thread+0x45/0x3c0 ? process_one_work+0x3d0/0x3d0 kthread+0xe2/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x1f/0x30 ---[ end trace ]--- Xen provides a set of page tables that the guest executes out of when it starts. The L1 entries are shared between level2_ident_pgt and level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in the level2_ident_pgt entries. verify_rwx() only checks the l1 entry and reports a false-positive violation. Here is a dump of some kernel virtual addresses and the corresponding L1 and L2 entries: This is the start of the directmap (ident) and they have NX (bit 63) set in the PMD. ndvm-pv (1): [0.466778] va=8880 pte=00100027 level: 1 ndvm-pv (1): [0.466788] va=8880 pmd=8242c067 level: 2 Directmap for kernel text: ndvm-pv (1): [0.466795] va=88800100 pte=00100165 level: 1 ndvm-pv (1): [0.466801] va=88800100 pmd=82434067 level: 2 ndvm-pv (1): [0.466807] va=88800101 pte=001001010065 level: 1 ndvm-pv (1): [0.466814] va=88800101 pmd=82434067 level: 2 The start of the kernel text highmap is unmapped: ndvm-pv (1): [0.466820] va=8000 pte= level: 3 ndvm-pv (1): [0.466826] va=8000 pmd= level: 3 Kernel PMD for .text has NX bit clear ndvm-pv (1): [0.466832] va=8100 pte=00100165 level: 1 ndvm-pv (1): [0.466838] va=8100 pmd=02434067 level: 2 Kernel PTE for rodata_end has NX bit set ndvm-pv (1): [0.466846] va=81e62000 pte=801001e62025 level: 1 ndvm-pv (1): [0.466874] va=81e62000 pmd=0243b067 level: 2 Directmap of rodata_end ndvm-pv (1): [0.466907] va=888001e62000 pte=801001e62025 level: 1 ndvm-pv (1): [0.466913] va=888001e62000 pmd=8243b067 level: 2 Directmap of a low RAM address ndvm-pv (1): [0.466920] va=8881 pte=00110027 level: 1 ndvm-pv (1): [0.466926] va=8881 pmd=8242c067 level: 2 Directmap of another RAM address close to but below kernel text ndvm-pv (1): [0.466932] va=88800096c000 pte=00100096c027 level: 1 ndvm-pv (1): [0.466938] va=88800096c000 pmd=82430067 level: 2 Here are some L2 entries showing the differing NX bits for l2_ident vs. l2_kernel while they point at the same L1 addresses ndvm-pv (1): [0.466944] l2_ident[ 0] pmd=8242c067 ndvm-pv (1): [0.466949] l2_ident[ 1] pmd=8242d067 ndvm-pv (1): [0.466955] l2_ident[ 8] pmd=82434067 ndvm-pv (1): [0.466959] l2_ident[ 9] pmd=82435067 ndvm-pv (1): [0.466964] l2_ident[ 14] pmd=8243a067 ndvm-pv (1): [0.466969] l2_ident[ 15] pmd=8243b067 ndvm-pv (1): [
Re: [PATCH] xen/x86: Remove duplicate include
On 22.03.24 07:39, Jiapeng Chong wrote: ./arch/x86/xen/enlighten.c: linux/memblock.h is included more than once. Reported-by: Abaci Robot Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8610 Signed-off-by: Jiapeng Chong Reviewed-by: Juergen Gross Juergen
Re: [PATCH] xen/rwlock: Don't perpeuatite broken API in new logic
On 19.03.24 12:30, Andrew Cooper wrote: The single user wants this the sane way around. Write it as a normal static inline just like rspin_lock(). Fixes: cc3e8df542ed ("xen/spinlock: add rspin_[un]lock_irq[save|restore]()") Signed-off-by: Andrew Cooper Reviewed-by: Juergen Gross Maybe with the subject fixed (s/rwlock/spinlock/). Juergen
Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()
On 18.03.24 17:08, Jan Beulich wrote: On 18.03.2024 17:05, Jürgen Groß wrote: On 18.03.24 16:59, Jan Beulich wrote: On 18.03.2024 16:55, Jürgen Groß wrote: On 18.03.24 15:43, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: Instead of special casing rspin_lock_irqsave() and rspin_unlock_irqrestore() for the console lock, add those functions to spinlock handling and use them where needed. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich with two remarks: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock) lock->recurse_cnt++; } +unsigned long _rspin_lock_irqsave(rspinlock_t *lock) +{ +unsigned long flags; + +local_irq_save(flags); +_rspin_lock(lock); + +return flags; +} + void _rspin_unlock(rspinlock_t *lock) { if ( likely(--lock->recurse_cnt == 0) ) { lock->recurse_cpu = SPINLOCK_NO_CPU; -spin_unlock(lock); +_spin_unlock(lock); This looks like an unrelated change. I think I can guess the purpose, but it would be nice if such along-the-way changes could be mentioned in the description. I think it would be better to move that change to patch 3. Hmm, it would be a secondary change there, too. I was actually meaning to commit patches 2-5, but if things want moving around I guess I better wait with doing so? Hmm, maybe just drop this hunk and let patch 7 handle it? Ah yes, that seem more logical to me. I take it you don't mean "hunk" though, but really just this one line change. Oh yes, of course. Juergen
Re: [PATCH v5 12/13] xen/rwlock: raise the number of possible cpus
On 18.03.24 17:05, Jan Beulich wrote: On 18.03.2024 17:00, Jürgen Groß wrote: On 18.03.24 16:39, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: @@ -36,14 +36,16 @@ void queue_write_lock_slowpath(rwlock_t *lock); static inline bool _is_write_locked_by_me(unsigned int cnts) { -BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS); +BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS); +BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX); return (cnts & _QW_WMASK) == _QW_LOCKED && (cnts & _QW_CPUMASK) == smp_processor_id(); } static inline bool _can_read_lock(unsigned int cnts) { -return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts); +return cnts <= INT_MAX && + (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts)); } I view this as problematic: Code knowing that a write lock is being held may invoke a function using read_trylock() and expect the lock to be available there. So you expect it to be fine that someone is using read_trylock() 32768 times recursively while holding a lock as a writer? Sure, I can change the condition, but OTOH ... Hmm, yes, the reader count (leaving aside nested read_trylock()) is zero when the lock is held for writing. So yes, I agree the condition is fine, but may I ask for a brief comment to this effect, for blind people like me? Yeah, fine with me. :-) Juergen
Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()
On 18.03.24 16:59, Jan Beulich wrote: On 18.03.2024 16:55, Jürgen Groß wrote: On 18.03.24 15:43, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: Instead of special casing rspin_lock_irqsave() and rspin_unlock_irqrestore() for the console lock, add those functions to spinlock handling and use them where needed. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich with two remarks: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock) lock->recurse_cnt++; } +unsigned long _rspin_lock_irqsave(rspinlock_t *lock) +{ +unsigned long flags; + +local_irq_save(flags); +_rspin_lock(lock); + +return flags; +} + void _rspin_unlock(rspinlock_t *lock) { if ( likely(--lock->recurse_cnt == 0) ) { lock->recurse_cpu = SPINLOCK_NO_CPU; -spin_unlock(lock); +_spin_unlock(lock); This looks like an unrelated change. I think I can guess the purpose, but it would be nice if such along-the-way changes could be mentioned in the description. I think it would be better to move that change to patch 3. Hmm, it would be a secondary change there, too. I was actually meaning to commit patches 2-5, but if things want moving around I guess I better wait with doing so? Hmm, maybe just drop this hunk and let patch 7 handle it? Juergen
Re: [PATCH v5 12/13] xen/rwlock: raise the number of possible cpus
On 18.03.24 16:39, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: The rwlock handling is limiting the number of cpus to 4095 today. The main reason is the use of the atomic_t data type for the main lock handling, which needs 2 bits for the locking state (writer waiting or write locked), 12 bits for the id of a possible writer, and a 12 bit counter for readers. The limit isn't 4096 due to an off by one sanity check. The atomic_t data type is 32 bits wide, so in theory 15 bits for the writer's cpu id and 15 bits for the reader count seem to be fine, but via read_trylock() more readers than cpus are possible. As a result, afaict you choose to use just 14 bits for the CPU, but still 15 bits (with the 16th to deal with overflow) for the reader count. That could do with making explicit here, as a question is whether we deem as sufficient that there is just one extra bit for the reader count. Okay, I'll add a sentence to the commit message. --- a/xen/include/xen/rwlock.h +++ b/xen/include/xen/rwlock.h @@ -23,12 +23,12 @@ typedef struct { #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED) /* Writer states & reader shift and bias. */ -#define_QW_CPUMASK 0xfffU /* Writer CPU mask */ -#define_QW_SHIFT12 /* Writer flags shift */ -#define_QW_WAITING (1U << _QW_SHIFT) /* A writer is waiting */ -#define_QW_LOCKED (3U << _QW_SHIFT) /* A writer holds the lock */ -#define_QW_WMASK(3U << _QW_SHIFT) /* Writer mask */ -#define_QR_SHIFT14 /* Reader count shift */ +#define_QW_SHIFT14 /* Writer flags shift */ +#define_QW_CPUMASK ((1U << _QW_SHIFT) - 1) /* Writer CPU mask */ +#define_QW_WAITING (1U << _QW_SHIFT) /* A writer is waiting */ +#define_QW_LOCKED (3U << _QW_SHIFT) /* A writer holds the lock */ +#define_QW_WMASK(3U << _QW_SHIFT) /* Writer mask */ +#define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */ #define_QR_BIAS (1U << _QR_SHIFT) Btw, seeing all the uppercase U suffixes here, I think you had some lowercase ones earlier in the series. While Misra doesn't demand uppercase for U, it does for L and iirc we decided to use all uppercase suffixes as a result. Would be nice if what goes in could have this correct right away. I'll rescan all the patches and change them accordingly. @@ -36,14 +36,16 @@ void queue_write_lock_slowpath(rwlock_t *lock); static inline bool _is_write_locked_by_me(unsigned int cnts) { -BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS); +BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS); +BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX); return (cnts & _QW_WMASK) == _QW_LOCKED && (cnts & _QW_CPUMASK) == smp_processor_id(); } static inline bool _can_read_lock(unsigned int cnts) { -return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts); +return cnts <= INT_MAX && + (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts)); } I view this as problematic: Code knowing that a write lock is being held may invoke a function using read_trylock() and expect the lock to be available there. So you expect it to be fine that someone is using read_trylock() 32768 times recursively while holding a lock as a writer? Sure, I can change the condition, but OTOH ... Juergen
Re: [PATCH v5 11/13] xen/spinlock: support higher number of cpus
On 18.03.24 16:08, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -8,16 +8,16 @@ #include #include -#define SPINLOCK_CPU_BITS 12 +#define SPINLOCK_CPU_BITS 16 #ifdef CONFIG_DEBUG_LOCKS union lock_debug { -uint16_t val; -#define LOCK_DEBUG_INITVAL 0x +uint32_t val; +#define LOCK_DEBUG_INITVAL 0x With this #define I can see the desire for using a fixed width type for "val". However, ... struct { -uint16_t cpu:SPINLOCK_CPU_BITS; -#define LOCK_DEBUG_PAD_BITS (14 - SPINLOCK_CPU_BITS) -uint16_t :LOCK_DEBUG_PAD_BITS; +uint32_t cpu:SPINLOCK_CPU_BITS; +#define LOCK_DEBUG_PAD_BITS (30 - SPINLOCK_CPU_BITS) +uint32_t :LOCK_DEBUG_PAD_BITS; .. "unsigned int" ought to be fine for both of these. Fine with me. Juergen
Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()
On 18.03.24 15:43, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: Instead of special casing rspin_lock_irqsave() and rspin_unlock_irqrestore() for the console lock, add those functions to spinlock handling and use them where needed. Signed-off-by: Juergen Gross Reviewed-by: Jan Beulich with two remarks: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock) lock->recurse_cnt++; } +unsigned long _rspin_lock_irqsave(rspinlock_t *lock) +{ +unsigned long flags; + +local_irq_save(flags); +_rspin_lock(lock); + +return flags; +} + void _rspin_unlock(rspinlock_t *lock) { if ( likely(--lock->recurse_cnt == 0) ) { lock->recurse_cpu = SPINLOCK_NO_CPU; -spin_unlock(lock); +_spin_unlock(lock); This looks like an unrelated change. I think I can guess the purpose, but it would be nice if such along-the-way changes could be mentioned in the description. I think it would be better to move that change to patch 3. --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -272,7 +272,15 @@ static always_inline void spin_lock_if(bool condition, spinlock_t *l) */ bool _rspin_trylock(rspinlock_t *lock); void _rspin_lock(rspinlock_t *lock); +#define rspin_lock_irqsave(l, f)\ +({ \ +BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long)); \ +((f) = _rspin_lock_irqsave(l)); \ Perhaps in the context of another patch in the series I think I had pointed out that the outer pair of parentheses is unnecessary in constructs like this. Oh, this one slipped through, sorry for that. Will fix it in the next iteration. Juergen
Re: [PATCH v5 08/13] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()
On 18.03.24 16:44, Jan Beulich wrote: On 18.03.2024 16:31, Jürgen Groß wrote: On 18.03.24 15:57, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -395,14 +395,7 @@ static bool always_inline spin_is_locked_common(const spinlock_tickets_t *t) int _spin_is_locked(const spinlock_t *lock) { -/* - * Recursive locks may be locked by another CPU, yet we return - * "false" here, making this function suitable only for use in - * ASSERT()s and alike. - */ -return lock->recurse_cpu == SPINLOCK_NO_CPU - ? spin_is_locked_common(>tickets) - : lock->recurse_cpu == smp_processor_id(); +return spin_is_locked_common(>tickets); } The "only suitable for ASSERT()s and alike" part of the comment wants to survive here, I think. Why? I could understand you asking for putting such a comment to spinlock.h mentioning that any *_is_locked() variant isn't safe, but with _spin_is_locked() no longer covering recursive locks the comment's reasoning is no longer true. Hmm. I guess there is a difference in expectations. To me, these functions in principle ought to report whether the lock is "owned", not just "locked by some CPU". They don't, hence why they may not be used for other than ASSERT()s. Okay, thanks for clarification. I'll change the comment accordingly. Juergen
Re: [PATCH v5 08/13] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()
On 18.03.24 15:57, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -395,14 +395,7 @@ static bool always_inline spin_is_locked_common(const spinlock_tickets_t *t) int _spin_is_locked(const spinlock_t *lock) { -/* - * Recursive locks may be locked by another CPU, yet we return - * "false" here, making this function suitable only for use in - * ASSERT()s and alike. - */ -return lock->recurse_cpu == SPINLOCK_NO_CPU - ? spin_is_locked_common(>tickets) - : lock->recurse_cpu == smp_processor_id(); +return spin_is_locked_common(>tickets); } The "only suitable for ASSERT()s and alike" part of the comment wants to survive here, I think. Why? I could understand you asking for putting such a comment to spinlock.h mentioning that any *_is_locked() variant isn't safe, but with _spin_is_locked() no longer covering recursive locks the comment's reasoning is no longer true. @@ -465,6 +458,23 @@ void _spin_barrier(spinlock_t *lock) spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR); } +bool _rspin_is_locked(const rspinlock_t *lock) +{ +/* + * Recursive locks may be locked by another CPU, yet we return + * "false" here, making this function suitable only for use in + * ASSERT()s and alike. + */ +return lock->recurse_cpu == SPINLOCK_NO_CPU + ? spin_is_locked_common(>tickets) + : lock->recurse_cpu == smp_processor_id(); +} Here otoh I wonder if both the comment and the spin_is_locked_common() part of the condition are actually correct. Oh, the latter needs retaining as long as we have nrspin_*() functions, I suppose. But the comment could surely do with improving a little - at the very least "yet we return "false"" isn't quite right; minimally there's a "may" missing. If anything I guess the comment shouldn't gain a "may", but rather say "Recursive locks may be locked by another CPU via rspin_lock() ..." Juergen
Re: [PATCH v2 1/1] x86: Rename __{start,end}_init_task to __{start,end}_init_stack
On 18.03.24 08:14, Xin Li (Intel) wrote: The stack of a task has been separated from the memory of a task_struct struture for a long time on x86, as a result __{start,end}_init_task no longer mark the start and end of the init_task structure, but its stack only. Rename __{start,end}_init_task to __{start,end}_init_stack. Note other architectures are not affected because __{start,end}_init_task are used on x86 only. Signed-off-by: Xin Li (Intel) Reviewed-by: Juergen Gross Juergen
Re: [PATCH v2] xen/arm: Set correct per-cpu cpu_core_mask
On 15.03.24 11:58, Julien Grall wrote: On 14/03/2024 14:22, Henry Wang wrote: Hi Julien, Hi, On 3/14/2024 9:27 PM, Julien Grall wrote: Hi Henry, On 28/02/2024 01:58, Henry Wang wrote: diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c index a84e706d77..d9ebd55d4a 100644 --- a/xen/arch/arm/smpboot.c +++ b/xen/arch/arm/smpboot.c @@ -66,7 +66,6 @@ static bool cpu_is_dead; /* ID of the PCPU we're running on */ DEFINE_PER_CPU(unsigned int, cpu_id); -/* XXX these seem awfully x86ish... */ :). I guess at the time we didn't realize that MT was supported on Arm. It is at least part of the spec, but as Michal pointed out it doesn't look like a lot of processors supports it. Yep. Do you think changing the content of this line to something like "Although multithread is part of the Arm spec, there are not many processors support multithread and current Xen on Arm assumes there is no multithread" makes sense to you? /* representing HT siblings of each logical CPU */ DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask); /* representing HT and core siblings of each logical CPU */ @@ -89,6 +88,10 @@ static int setup_cpu_sibling_map(int cpu) cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu)); cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu)); + /* Currently we assume there is no multithread. */ I am not very familiar with the scheduling in Xen. Do you know what's the consequence of not properly supporting MT? One thing I can think of is security (I know there were plenty of security issues with SMT). Unfortunately me neither, so adding George to this thread as I think he can bring us some insights on this topic from the scheduler perspective. +Juergen as he worked on co-scheduling. There are four aspects regarding multithreading: - security (basically boils down to one thread possibly being able to use side channel attacks for accessing other thread's data via resources in the core shared between the threads) - performance (trying to spread running vcpus across as many cores as possible will utilize more hardware resources resulting generally in better performance - especially credit2 is supporting that) - power consumption (a completely idle core will consume less power - again credit2 is supporting that with the correct setting) - busy loops (cpu_relax() support) For credit2 MT specifics see the large comment block in xen/common/sched/credit2 starting at line 636. Note that core scheduling is currently not supported on Arm, as there are architecture specific bits missing in the context switching logic. Juergen
Re: IMPORTANT - : Need help on USB port virtualization with Xen hypervisor
On 16.03.24 00:32, Stefano Stabellini wrote: Hi Dominique, You posted this configuration: device_model_args = [ " "-device","nec-usb-xhci,id=xhci", "-device","usb-host,bus=xhci.0,hostbus=1,hostport=13", "-device","usb-host,bus=xhci.0,hostbus=1,hostport=10", "-device","usb-host,bus=xhci.0,hostbus=1,hostport=2", "-device","usb-host,bus=xhci.0,hostbus=2,hostport=2", "-device","usb-host,bus=xhci.0,hostbus=2,hostport=1", "-device","usb-host,bus=xhci.0,hostbus=1,hostport=1"] It looks like you are using QEMU-based USB passthrough. Basically QEMU (independently from Xen) is accessing the USB device in Dom0 and exposing a corresponding device to the guest. I am not sure if there is anything that can be done in QEMU to support USB 3.0 with high speed, people in CC might know. There two other alternatives to this approach. You can use PV USB Passthrough instead. Juergen (CCed) is the original author. I am not sure if that supports USB 3.0 either. PV USB doesn't support USB 3.0. Instead of using device_model_args= in the guest configuration I'd rather use the usbctrl= and usbdev= items (see the related xl.cfg(5) man page). I'm not sure this will really make a performance difference, though. Juergen
Re: [PATCH v5 13/13] xen: allow up to 16383 cpus
On 14.03.24 08:26, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: With lock handling now allowing up to 16384 cpus (spinlocks can handle 65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for the number of cpus to be configured to 16383. The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS. That's quite sad - I was really hoping we'd finally end up with a power-of-2 upper bound. Yes, me too. Juergen
Re: [PATCH v5 01/13] xen/spinlock: remove misra rule 21.1 violations
On 14.03.24 08:32, Jan Beulich wrote: On 14.03.2024 08:20, Juergen Gross wrote: In xen spinlock code there are several violations of MISRA rule 21.1 (identifiers starting with "__" or "_[A-Z]"). Fix them by using trailing underscores instead. Signed-off-by: Juergen Gross I can live with the changes as they are, but before giving an ack, I'd still like to ask if the moved underscores are really useful / necessary in all cases. E.g. --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -22,7 +22,7 @@ union lock_debug { bool unseen:1; }; }; -#define _LOCK_DEBUG { .val = LOCK_DEBUG_INITVAL } +#define LOCK_DEBUG_ { .val = LOCK_DEBUG_INITVAL } ... for an internal helper macro it may indeed be better to have a trailing one here, but ... @@ -95,27 +95,27 @@ struct lock_profile_qhead { int32_t idx; /* index for printout */ }; -#define _LOCK_PROFILE(lockname) { .name = #lockname, .lock = &(lockname), } -#define _LOCK_PROFILE_PTR(name) \ -static struct lock_profile * const __lock_profile_##name \ +#define LOCK_PROFILE_(lockname) { .name = #lockname, .lock = &(lockname), } +#define LOCK_PROFILE_PTR_(name) \ +static struct lock_profile * const lock_profile__##name \ ... I'm not entirely convinced of the need for the double infix ones here ... This reduces the chance of name clashes with other lock profiling variables or functions (e.g. lock_profile_lock). In case you think this can be neglected, I'm fine with dropping the extra underscores. Juergen
Re: [PATCH v1 1/1] x86: Rename __{start,end}_init_task to __{start,end}_init_stack
On 13.03.24 07:05, Xin Li (Intel) wrote: The stack of a task has been separated from the memory of a task_struct struture for a long time on x86, as a result __{start,end}_init_task no longer mark the start and end of the init_task structure, but its stack only. Rename __{start,end}_init_task to __{start,end}_init_stack. Note other architectures are not affected because __{start,end}_init_task are used on x86 only. Signed-off-by: Xin Li (Intel) --- arch/x86/include/asm/processor.h | 4 ++-- arch/x86/kernel/head_64.S | 2 +- arch/x86/xen/xen-head.S | 2 +- include/asm-generic/vmlinux.lds.h | 8 4 files changed, 8 insertions(+), 8 deletions(-) diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h index 811548f131f4..8b3a3f3bb859 100644 --- a/arch/x86/include/asm/processor.h +++ b/arch/x86/include/asm/processor.h @@ -636,10 +636,10 @@ static __always_inline void prefetchw(const void *x) #define KSTK_ESP(task)(task_pt_regs(task)->sp) #else -extern unsigned long __end_init_task[]; +extern unsigned long __end_init_stack[]; #define INIT_THREAD { \ - .sp = (unsigned long)&__end_init_task - \ + .sp = (unsigned long)&__end_init_stack -\ TOP_OF_KERNEL_STACK_PADDING - \ sizeof(struct pt_regs), \ } diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S index d8198fbd70e5..c7babd7ebb0f 100644 --- a/arch/x86/kernel/head_64.S +++ b/arch/x86/kernel/head_64.S @@ -66,7 +66,7 @@ SYM_CODE_START_NOALIGN(startup_64) mov %rsi, %r15 /* Set up the stack for verify_cpu() */ - leaq(__end_init_task - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE)(%rip), %rsp + leaq(__end_init_stack - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE)(%rip), %rsp /* Setup GSBASE to allow stack canary access for C code */ movl$MSR_GS_BASE, %ecx diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S index 04101b984f24..43eadf03f46d 100644 --- a/arch/x86/xen/xen-head.S +++ b/arch/x86/xen/xen-head.S @@ -49,7 +49,7 @@ SYM_CODE_START(startup_xen) ANNOTATE_NOENDBR cld - leaq (__end_init_task - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE)(%rip), %rsp + leaq(__end_init_stack - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE)(%rip), %rsp /* Set up %gs. * diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 5dd3a61d673d..a168be99d522 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -399,13 +399,13 @@ #define INIT_TASK_DATA(align) \ . = ALIGN(align); \ - __start_init_task = .; \ + __start_init_stack = .; \ init_thread_union = .; \ init_stack = .; \ - KEEP(*(.data..init_task)) \ + KEEP(*(.data..init_stack)) \ Is this modification really correct? KEEP(*(.data..init_thread_info))\ - . = __start_init_task + THREAD_SIZE;\ - __end_init_task = .; + . = __start_init_stack + THREAD_SIZE; \ + __end_init_stack = .; #define JUMP_TABLE_DATA \ . = ALIGN(8); \ base-commit: 626856ae97054963e7b8c35335d4418271c8d0c4 Juergen
Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt
On 30.11.23 03:34, Marek Marczykowski-Górecki wrote: Hi, While testing 6.7-rc3 on Qubes OS I found several warning like in the subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure what exactly triggers the issue, but my guess would be unbinding an event channel from userspace (closing vchan connection). Specific message: [ 83.973377] [ cut here ] [ 83.975523] Interrupt for port 77, but apparently not enabled; per-user a0e9f1d1 Finally I think I have a fix (thanks to Demi for finding the problematic patch through bisecting). Could you please try the attached patch? It is based on current upstream, but I think it should apply to 6.7 or stable 6.6, too. Juergen From 9d673c37b2d0c9aa274c53f619c4e9e43a419f41 Mon Sep 17 00:00:00 2001 From: Juergen Gross To: linux-ker...@vger.kernel.org Cc: Juergen Gross Cc: Stefano Stabellini Cc: Oleksandr Tyshchenko Cc: xen-devel@lists.xenproject.org Date: Tue, 12 Mar 2024 13:52:24 +0100 Subject: [PATCH] xen/evtchn: avoid WARN() when unbinding an event channel When unbinding a user event channel, the related handler might be called a last time in case the kernel was built with CONFIG_DEBUG_SHIRQ. This might cause a WARN() in the handler. Avoid that by adding an "unbinding" flag to struct user_event which will short circuit the handler. Fixes: 9e90e58c11b7 ("xen: evtchn: Allow shared registration of IRQ handers") Signed-off-by: Juergen Gross --- drivers/xen/evtchn.c | 6 ++ 1 file changed, 6 insertions(+) diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c index 59717628ca42..f6a2216c2c87 100644 --- a/drivers/xen/evtchn.c +++ b/drivers/xen/evtchn.c @@ -85,6 +85,7 @@ struct user_evtchn { struct per_user_data *user; evtchn_port_t port; bool enabled; + bool unbinding; }; static void evtchn_free_ring(evtchn_port_t *ring) @@ -164,6 +165,10 @@ static irqreturn_t evtchn_interrupt(int irq, void *data) struct per_user_data *u = evtchn->user; unsigned int prod, cons; + /* Handler might be called when tearing down the IRQ. */ + if (evtchn->unbinding) + return IRQ_HANDLED; + WARN(!evtchn->enabled, "Interrupt for port %u, but apparently not enabled; per-user %p\n", evtchn->port, u); @@ -421,6 +426,7 @@ static void evtchn_unbind_from_user(struct per_user_data *u, BUG_ON(irq < 0); + evtchn->unbinding = true; unbind_from_irqhandler(irq, evtchn); del_evtchn(u, evtchn); -- 2.35.3
Re: [PATCH] xen/grant-dma-iommu: Convert to platform remove callback returning void
On 07.03.24 19:11, Uwe Kleine-König wrote: The .remove() callback for a platform driver returns an int which makes many driver authors wrongly assume it's possible to do error handling by returning an error code. However the value returned is ignored (apart from emitting a warning) and this typically results in resource leaks. To improve here there is a quest to make the remove callback return void. In the first step of this quest all drivers are converted to .remove_new(), which already returns void. Eventually after all drivers are converted, .remove_new() will be renamed to .remove(). Trivially convert this driver from always returning zero in the remove callback to the void returning variant. Signed-off-by: Uwe Kleine-König Reviewed-by: Juergen Gross Juergen
Re: [PATCH v3] tools/9pfsd: Fix build error caused by strerror_r
On 07.03.24 14:56, Henry Wang wrote: Below error can be seen when doing Yocto build of the toolstack: | io.c: In function 'p9_error': | io.c:684:5: error: ignoring return value of 'strerror_r' declared with attribute 'warn_unused_result' [-Werror=unused-result] | 684 | strerror_r(err, ring->buffer, ring->ring_size); | | ^~ | cc1: all warnings being treated as errors Using strerror_r() without special casing different build environments is impossible due to the different return types (int vs char *) depending on the environment. As p9_error() is not on a performance critical path, using strerror() with a mutex ought to be fine. So, fix the build by using strerror() to replace strerror_r(). The steps would then become: Acquire the mutex first, invoke strerror(), copy the string from strerror() to the designated buffer and then drop the mutex. Fixes: f4900d6d69b5 ("9pfsd: allow building with old glibc") Signed-off-by: Henry Wang Reviewed-by: Jan Beulich Reviewed-by: Juergen Gross Juergen
Re: [PATCH v3] tools/9pfsd: Fix build error caused by strerror_r
On 07.03.24 14:56, Henry Wang wrote: Below error can be seen when doing Yocto build of the toolstack: | io.c: In function 'p9_error': | io.c:684:5: error: ignoring return value of 'strerror_r' declared with attribute 'warn_unused_result' [-Werror=unused-result] | 684 | strerror_r(err, ring->buffer, ring->ring_size); | | ^~ | cc1: all warnings being treated as errors Using strerror_r() without special casing different build environments is impossible due to the different return types (int vs char *) depending on the environment. As p9_error() is not on a performance critical path, using strerror() with a mutex ought to be fine. So, fix the build by using strerror() to replace strerror_r(). The steps would then become: Acquire the mutex first, invoke strerror(), copy the string from strerror() to the designated buffer and then drop the mutex. Fixes: f4900d6d69b5 ("9pfsd: allow building with old glibc") Signed-off-by: Henry Wang Reviewed-by: Jan Beulich Reviewed-by: Juergen Gross Juergen
Re: [PATCH] MAINTAINERS: add an entry for tools/9pfsd
On 06.03.24 12:23, Jan Beulich wrote: On 06.03.2024 12:16, Juergen Gross wrote: --- a/MAINTAINERS +++ b/MAINTAINERS @@ -206,6 +206,12 @@ Maintainers List (try to look for most precise areas first) --- +9PFSD +M: Juergen Gross +M: Anthony PERARD +S: Supported +F: tools/9pfsd This wants a trailing slash. Oh, indeed. Can pretty certainly be taken care of while committing. Thanks, Juergen
Re: [PATCH] 9pfsd: allow building with old glibc
On 05.03.24 15:33, Jan Beulich wrote: Neither pread() / pwrite() nor O_DIRECTORY are available from glibc 2.11.3 headers without defining (e.g.) _GNU_SOURCE. Supplying the definition unconditionally shouldn't be a problem, seeing that various other tools/ components do so, too. Signed-off-by: Jan Beulich Fine with me. Reviewed-by: Juergen Gross Juergen
Re: [PATCH v4 10/12] xen/spinlock: split recursive spinlocks from normal ones
On 04.03.24 08:25, Jan Beulich wrote: On 01.03.2024 15:37, Juergen Gross wrote: On 29.02.24 16:32, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: +#define nrspin_lock_irqsave(l, f) \ +({ \ +BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long)); \ +((f) = __nrspin_lock_irqsave(l)); \ I don't think the outer pair of parentheses is needed here. Turns out it is needed. Otherwise something like: if ( a ) nrspin_lock_irqsave(l, f); else ... will fail with "else without a previous if". That's for "outer" in the whole #define I suppose, when I commented on just a specific line inside the construct. Sorry, I applied your remark to the wrong context. Yes, one level of parentheses can be removed from this line. Juergen
Re: [PATCH v4 12/12] xen/spinlock: support higher number of cpus
On 29.02.24 17:54, Jan Beulich wrote: On 29.02.2024 17:45, Juergen Gross wrote: On 29.02.24 17:31, Jan Beulich wrote: On 29.02.2024 17:29, Jürgen Groß wrote: On 29.02.24 16:46, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: Allow 16 bits per cpu number, which is the limit imposed by spinlock_tickets_t. This will allow up to 65535 cpus, while increasing only the size of recursive spinlocks in debug builds from 8 to 12 bytes. I think we want to be more conservative here, for the case of there being bugs: The CPU holding a lock may wrongly try to acquire it a 2nd time. That's the 65536th ticket then, wrapping the value. Is this really a problem? There will be no other cpu left seeing the lock as "free" in this case, as all others will be waiting for the head to reach their private tail value. But isn't said CPU then going to make progress, rather than indefinitely spinning on the lock? No, I don't think so. Hmm. If CPU0 takes a pristine lock, it'll get ticket 0x. All other CPUs will get tickets 0x0001 ... 0x. Then CPU0 trying to take the lock No, they'll get 0x0001 ... 0xfffe (only 65535 cpus are supported). again will get ticket 0x again, which equals what .head still has (no And the first cpu will get 0x when trying to get the lock again. unlocks so far), thus signalling the lock to be free when it isn't. The limit isn't 65535 because of the ticket mechanism, but because of the rspin mechanism, where we need a "no cpu is owning the lock" value. Without the recursive locks the limit would be 65536 (or 4096 today). I think this rather belongs ... No, that was meant to tell you that without programming bug 65536 cpus would be perfectly fine for the ticket mechanism, and with bug 65535 cpus are fine. Therefore my suggestion would be to only (mention) go(ing) up to 32k. Signed-off-by: Juergen Gross --- xen/common/spinlock.c | 1 + xen/include/xen/spinlock.h | 18 +- 2 files changed, 10 insertions(+), 9 deletions(-) Shouldn't this also bump the upper bound of the NR_CPUS range then in xen/arch/Kconfig? Fine with me, I can add another patch to the series doing that. Why not do it right here? The upper bound there is like it is only because of the restriction that's lifted here. ... here (for having nothing to do with the supposed lack of hanging that I'm seeing)? I'd prefer splitting the two instances, but if you prefer it to be in a single patch, so be it. I'm not going to insist - if want to do it separately, please do. Perhaps others would actually prefer it that way ... Okay. I'll do it in another patch. Juergen
Re: [PATCH v4 12/12] xen/spinlock: support higher number of cpus
On 29.02.24 16:46, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: Allow 16 bits per cpu number, which is the limit imposed by spinlock_tickets_t. This will allow up to 65535 cpus, while increasing only the size of recursive spinlocks in debug builds from 8 to 12 bytes. I think we want to be more conservative here, for the case of there being bugs: The CPU holding a lock may wrongly try to acquire it a 2nd time. That's the 65536th ticket then, wrapping the value. Is this really a problem? There will be no other cpu left seeing the lock as "free" in this case, as all others will be waiting for the head to reach their private tail value. Therefore my suggestion would be to only (mention) go(ing) up to 32k. Signed-off-by: Juergen Gross --- xen/common/spinlock.c | 1 + xen/include/xen/spinlock.h | 18 +- 2 files changed, 10 insertions(+), 9 deletions(-) Shouldn't this also bump the upper bound of the NR_CPUS range then in xen/arch/Kconfig? Fine with me, I can add another patch to the series doing that. Juergen
Re: [PATCH v4 10/12] xen/spinlock: split recursive spinlocks from normal ones
On 29.02.24 16:32, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -541,6 +541,55 @@ void rspin_unlock_irqrestore(rspinlock_t *lock, unsigned long flags) local_irq_restore(flags); } +int nrspin_trylock(rspinlock_t *lock) +{ +check_lock(>debug, true); + +if ( unlikely(lock->recurse_cpu != SPINLOCK_NO_CPU) ) +return 0; + +return spin_trylock_common(>tickets, >debug, LOCK_PROFILE_PAR); +} I wonder if we shouldn't take the opportunity and stop having trylock functions have "int" return type. Perhaps already spin_trylock_common() when introduced could use "bool" instead, then here following suit. Fine with me. +void nrspin_lock(rspinlock_t *lock) +{ +spin_lock_common(>tickets, >debug, LOCK_PROFILE_PAR, NULL, + NULL); +} + +void nrspin_unlock(rspinlock_t *lock) +{ +spin_unlock_common(>tickets, >debug, LOCK_PROFILE_PAR); +} + +void nrspin_lock_irq(rspinlock_t *lock) +{ +ASSERT(local_irq_is_enabled()); +local_irq_disable(); +nrspin_lock(lock); +} + +void nrspin_unlock_irq(rspinlock_t *lock) +{ +nrspin_unlock(lock); +local_irq_enable(); +} + +unsigned long __nrspin_lock_irqsave(rspinlock_t *lock) +{ +unsigned long flags; + +local_irq_save(flags); +nrspin_lock(lock); +return flags; Nit: Strictly speaking we want a blank line ahead of this "return". Okay, will add it. @@ -166,11 +172,15 @@ struct lock_profile_qhead { }; struct lock_profile { }; #define SPIN_LOCK_UNLOCKED { \ +.debug =_LOCK_DEBUG, \ +} +#define RSPIN_LOCK_UNLOCKED { \ +.debug =_LOCK_DEBUG, \ .recurse_cpu = SPINLOCK_NO_CPU, \ .debug =_LOCK_DEBUG, \ } Initializing .debug twice? Oh, right. Will drop one. @@ -180,7 +190,6 @@ struct lock_profile { }; #endif - typedef union { uint32_t head_tail; struct { Looks like this might be undoing what the earlier patch isn't going to do anymore? Yes, seen it already. @@ -242,6 +257,19 @@ void rspin_unlock_irqrestore(rspinlock_t *lock, unsigned long flags); int rspin_is_locked(const rspinlock_t *lock); void rspin_barrier(rspinlock_t *lock); +int nrspin_trylock(rspinlock_t *lock); +void nrspin_lock(rspinlock_t *lock); +void nrspin_unlock(rspinlock_t *lock); +void nrspin_lock_irq(rspinlock_t *lock); +void nrspin_unlock_irq(rspinlock_t *lock); +#define nrspin_lock_irqsave(l, f) \ +({ \ +BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long)); \ +((f) = __nrspin_lock_irqsave(l)); \ I don't think the outer pair of parentheses is needed here. As to the leading double underscores, see comments elsewhere. Okay. Juergen
Re: [PATCH v4 09/12] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()
On 29.02.24 15:14, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -458,6 +458,23 @@ void _spin_barrier(spinlock_t *lock) spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR); } +int rspin_is_locked(const rspinlock_t *lock) +{ +/* + * Recursive locks may be locked by another CPU, yet we return + * "false" here, making this function suitable only for use in + * ASSERT()s and alike. + */ +return lock->recurse_cpu == SPINLOCK_NO_CPU + ? spin_is_locked_common(>tickets) + : lock->recurse_cpu == smp_processor_id(); +} + +void rspin_barrier(rspinlock_t *lock) +{ +spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR); +} Ah, here we go. Looks all okay to me, but needs re-ordering such that the earlier patch won't transiently introduce a regression. Yes, just wanted to answer something similar to your remark on patch 8. Juergen
Re: [PATCH v9 2/6] stubdom: extend xenstore stubdom configs
On 29.02.24 14:05, Jan Beulich wrote: On 29.02.2024 13:48, Juergen Gross wrote: Extend the config files of the Xenstore stubdoms to include XENBUS and 9PFRONT items in order to support file based logging. Signed-off-by: Juergen Gross Reviewed-by: Jason Andryuk Was an ack from Samuel lost here? Or was it dropped on purpose? Oh, I think I lost it. Juergen
Re: [PATCH v4 06/12] xen/spinlock: make struct lock_profile rspinlock_t aware
On 28.02.24 17:02, Jan Beulich wrote: On 28.02.2024 16:43, Jürgen Groß wrote: On 28.02.24 16:19, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -538,19 +538,31 @@ static void spinlock_profile_iterate(lock_profile_subfunc *sub, void *par) static void cf_check spinlock_profile_print_elem(struct lock_profile *data, int32_t type, int32_t idx, void *par) { -struct spinlock *lock = data->lock; +unsigned int cpu; +uint32_t lockval; Any reason for this not being unsigned int as well? The more that ... +if ( data->is_rlock ) +{ +cpu = data->rlock->debug.cpu; +lockval = data->rlock->tickets.head_tail; +} +else +{ +cpu = data->lock->debug.cpu; +lockval = data->lock->tickets.head_tail; +} I've used the same type as tickets.head_tail. printk("%s ", lock_profile_ancs[type].name); if ( type != LOCKPROF_TYPE_GLOBAL ) printk("%d ", idx); -printk("%s: addr=%p, lockval=%08x, ", data->name, lock, - lock->tickets.head_tail); -if ( lock->debug.cpu == SPINLOCK_NO_CPU ) +printk("%s: addr=%p, lockval=%08x, ", data->name, data->lock, lockval); ... it's then printed with plain x as the format char. Which hasn't been changed by the patch. I can change it to PRIx32 if you want. As per ./CODING_STYLE unsigned int is preferred. Okay, I'll switch to unsigned int then. --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -76,13 +76,19 @@ union lock_debug { }; */ struct spinlock; +/* Temporary hack until a dedicated struct rspinlock is existing. */ +#define rspinlock spinlock struct lock_profile { struct lock_profile *next; /* forward link */ const char *name; /* lock name */ -struct spinlock *lock; /* the lock itself */ +union { +struct spinlock *lock; /* the lock itself */ +struct rspinlock *rlock; /* the recursive lock itself */ +}; _LOCK_PROFILE() wants to initialize this field, unconditionally using .lock. While I expect that problem to be taken care of in one of the later patches, use of the macro won't work anymore with this union in use with very old gcc that formally we still support. While a road to generally raising the baseline requirements is still pretty unclear to me, an option might be to require (and document) that to enable DEBUG_LOCK_PROFILE somewhat newer gcc needs using. Patch 8 is using either .lock or .rlock depending on the lock type. What is the problem with the old gcc version? Static initializers of anonymous union members? Yes. The easiest solution might then be to give the union a name. Juergen
Re: [PATCH v4 06/12] xen/spinlock: make struct lock_profile rspinlock_t aware
On 28.02.24 16:19, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: --- a/xen/common/spinlock.c +++ b/xen/common/spinlock.c @@ -538,19 +538,31 @@ static void spinlock_profile_iterate(lock_profile_subfunc *sub, void *par) static void cf_check spinlock_profile_print_elem(struct lock_profile *data, int32_t type, int32_t idx, void *par) { -struct spinlock *lock = data->lock; +unsigned int cpu; +uint32_t lockval; Any reason for this not being unsigned int as well? The more that ... +if ( data->is_rlock ) +{ +cpu = data->rlock->debug.cpu; +lockval = data->rlock->tickets.head_tail; +} +else +{ +cpu = data->lock->debug.cpu; +lockval = data->lock->tickets.head_tail; +} I've used the same type as tickets.head_tail. printk("%s ", lock_profile_ancs[type].name); if ( type != LOCKPROF_TYPE_GLOBAL ) printk("%d ", idx); -printk("%s: addr=%p, lockval=%08x, ", data->name, lock, - lock->tickets.head_tail); -if ( lock->debug.cpu == SPINLOCK_NO_CPU ) +printk("%s: addr=%p, lockval=%08x, ", data->name, data->lock, lockval); ... it's then printed with plain x as the format char. Which hasn't been changed by the patch. I can change it to PRIx32 if you want. +if ( cpu == SPINLOCK_NO_CPU ) printk("not locked\n"); else -printk("cpu=%d\n", lock->debug.cpu); -printk(" lock:%" PRId64 "(%" PRI_stime "), block:%" PRId64 "(%" PRI_stime ")\n", - data->lock_cnt, data->time_hold, data->block_cnt, data->time_block); +printk("cpu=%u\n", cpu); +printk(" lock:%" PRIu64 "(%" PRI_stime "), block:%" PRIu64 "(%" PRI_stime ")\n", + data->lock_cnt, data->time_hold, (uint64_t)data->block_cnt, I think I know why the cast is suddenly / unexpectedly needed, but imo such wants stating in the description, when generally we aim at avoiding casts where possible. Okay, will add a sentence. --- a/xen/include/xen/spinlock.h +++ b/xen/include/xen/spinlock.h @@ -76,13 +76,19 @@ union lock_debug { }; */ struct spinlock; +/* Temporary hack until a dedicated struct rspinlock is existing. */ +#define rspinlock spinlock struct lock_profile { struct lock_profile *next; /* forward link */ const char *name; /* lock name */ -struct spinlock *lock; /* the lock itself */ +union { +struct spinlock *lock; /* the lock itself */ +struct rspinlock *rlock; /* the recursive lock itself */ +}; _LOCK_PROFILE() wants to initialize this field, unconditionally using .lock. While I expect that problem to be taken care of in one of the later patches, use of the macro won't work anymore with this union in use with very old gcc that formally we still support. While a road to generally raising the baseline requirements is still pretty unclear to me, an option might be to require (and document) that to enable DEBUG_LOCK_PROFILE somewhat newer gcc needs using. Patch 8 is using either .lock or .rlock depending on the lock type. What is the problem with the old gcc version? Static initializers of anonymous union members? uint64_tlock_cnt;/* # of complete locking ops */ -uint64_tblock_cnt; /* # of complete wait for lock */ +uint64_tblock_cnt:63; /* # of complete wait for lock */ +uint64_tis_rlock:1; /* use rlock pointer */ bool? Yes. Juergen
Re: [PATCH v4 05/12] xen/spinlock: add rspin_[un]lock_irq[save|restore]()
On 28.02.24 16:09, Jan Beulich wrote: On 12.12.2023 10:47, Juergen Gross wrote: Instead of special casing rspin_lock_irqsave() and rspin_unlock_irqrestore() for the console lock, add those functions to spinlock handling and use them where needed. Signed-off-by: Juergen Gross --- V2: - new patch In how far is this a necessary part of the series? Not really necessary. It just seemed wrong to have an open coded variant of rspin_lock_irqsave() and rspin_unlock_irqrestore(). --- a/xen/arch/x86/traps.c +++ b/xen/arch/x86/traps.c @@ -647,13 +647,15 @@ void show_stack_overflow(unsigned int cpu, const struct cpu_user_regs *regs) void show_execution_state(const struct cpu_user_regs *regs) { /* Prevent interleaving of output. */ -unsigned long flags = console_lock_recursive_irqsave(); +unsigned long flags; + +rspin_lock_irqsave(_lock, flags); show_registers(regs); show_code(regs); show_stack(regs); -console_unlock_recursive_irqrestore(flags); +rspin_unlock_irqrestore(_lock, flags); } void cf_check show_execution_state_nonconst(struct cpu_user_regs *regs) @@ -663,7 +665,7 @@ void cf_check show_execution_state_nonconst(struct cpu_user_regs *regs) void vcpu_show_execution_state(struct vcpu *v) { -unsigned long flags = 0; +unsigned long flags; if ( test_bit(_VPF_down, >pause_flags) ) { @@ -698,7 +700,7 @@ void vcpu_show_execution_state(struct vcpu *v) #endif /* Prevent interleaving of output. */ -flags = console_lock_recursive_irqsave(); +rspin_lock_irqsave(_lock, flags); vcpu_show_registers(v); @@ -708,7 +710,7 @@ void vcpu_show_execution_state(struct vcpu *v) * Stop interleaving prevention: The necessary P2M lookups involve * locking, which has to occur with IRQs enabled. */ -console_unlock_recursive_irqrestore(flags); +rspin_unlock_irqrestore(_lock, flags); show_hvm_stack(v, >arch.user_regs); } @@ -717,7 +719,7 @@ void vcpu_show_execution_state(struct vcpu *v) if ( guest_kernel_mode(v, >arch.user_regs) ) show_guest_stack(v, >arch.user_regs); -console_unlock_recursive_irqrestore(flags); +rspin_unlock_irqrestore(_lock, flags); } I view these as layering violations; ... --- a/xen/drivers/char/console.c +++ b/xen/drivers/char/console.c @@ -120,7 +120,7 @@ static int __read_mostly sercon_handle = -1; int8_t __read_mostly opt_console_xen; /* console=xen */ #endif -static DEFINE_RSPINLOCK(console_lock); +DEFINE_RSPINLOCK(console_lock); ... this should remain static. The question therefore just is whether to omit this patch or ... @@ -1158,22 +1158,6 @@ void console_end_log_everything(void) atomic_dec(_everything); } -unsigned long console_lock_recursive_irqsave(void) -{ -unsigned long flags; - -local_irq_save(flags); -rspin_lock(_lock); - -return flags; -} - -void console_unlock_recursive_irqrestore(unsigned long flags) -{ -rspin_unlock(_lock); -local_irq_restore(flags); -} ... whether to retain these two functions as thin wrappers around the new, more generic construct. I'd vote for the latter. Juergen
Re: [PATCH] Mini-OS: add symbol exports for xenstore stubdom
On 26.02.24 09:44, Jan Beulich wrote: On 26.02.2024 09:39, Juergen Gross wrote: Xenstore stubdom needs some more symbols exported. Signed-off-by: Juergen Gross Any Reported-by: possibly applicable here? If any, then Andrew. He reported a CI loop failure. Juergen
Re: [PATCH v5 07/22] tools/9pfsd: add 9pfs attach request support
On 14.02.24 18:48, Andrew Cooper wrote: On 08/02/2024 4:55 pm, Juergen Gross wrote: +static struct p9_fid *alloc_fid(device *device, unsigned int fid, +const char *path) +{ +struct p9_fid *fidp = NULL; + +pthread_mutex_lock(>fid_mutex); + +if ( find_fid(device, fid) ) +{ +errno = EBADFD; Also, FreeBSD says no. https://cirrus-ci.com/task/6634697753624576 io.c:580:17: error: use of undeclared identifier 'EBADFD' errno = EBADFD; ^ 1 error generated. Need to use EBADF. Yes. Juergen
Re: [PATCH v5 07/22] tools/9pfsd: add 9pfs attach request support
On 14.02.24 18:40, Andrew Cooper wrote: On 08/02/2024 4:55 pm, Juergen Gross wrote: +static struct p9_fid *alloc_fid_mem(device *device, unsigned int fid, +const char *path) +{ +struct p9_fid *fidp; +size_t pathlen; + +pathlen = strlen(path); +fidp = calloc(sizeof(*fidp) + pathlen + 1, 1); +if ( !fidp ) +return NULL; + +fidp->fid = fid; +strncpy(fidp->path, path, pathlen); + +return fidp; +} GitlabCI has something to say about this. https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/1176787593 I think they're all variations of: io.c: In function 'alloc_fid_mem.isra.8': io.c:566:5: error: 'strncpy' output truncated before terminating nul copying as many bytes from a string as its length [-Werror=stringop-truncation] strncpy(fidp->path, path, pathlen); ^~ io.c:560:15: note: length computed here pathlen = strlen(path); ^~~~ In the end the result is fine, as the buffer is large enough and it is zeroed on allocation. I'll change it nevertheless as it is a bad code pattern. Juergen
Re: [PATCH v5 15/22] tools/libs/light: add backend type for 9pfs PV devices
On 14.02.24 18:27, Anthony PERARD wrote: On Wed, Feb 14, 2024 at 11:18:18AM +0100, Jürgen Groß wrote: On 13.02.24 19:03, Anthony PERARD wrote: On Thu, Feb 08, 2024 at 05:55:39PM +0100, Juergen Gross wrote: +struct libxl__aop9_state { +libxl__spawn_state spawn; +libxl__ao_device *aodev; +libxl_device_p9 p9; +uint32_t domid; +void (*callback)(libxl__egc *, libxl__aop9_state *, int); This "callback" is never used, right? Why do you think so? In xen9pfsd_spawn() it is used: aop9->callback = xen9pfsd_spawn_outcome; By never used, I mean that nothing is reading the value, their is no "aop9->callback(egc, aop9, rc)" call. It might have been useful if a caller of xen9pfsd_spawn() was actually setting this field, but that's not an option here. And callbacks of xen9pfsd_spawn() knows to call xen9pfsd_spawn_outcome() when done. Oh, right. This must be a relict of a previous attempt to make it work. I'll remove the callback. Juergen
Re: [PATCH v5 15/22] tools/libs/light: add backend type for 9pfs PV devices
On 13.02.24 19:03, Anthony PERARD wrote: On Thu, Feb 08, 2024 at 05:55:39PM +0100, Juergen Gross wrote: +struct libxl__aop9_state { +libxl__spawn_state spawn; +libxl__ao_device *aodev; +libxl_device_p9 p9; +uint32_t domid; +void (*callback)(libxl__egc *, libxl__aop9_state *, int); This "callback" is never used, right? Why do you think so? In xen9pfsd_spawn() it is used: aop9->callback = xen9pfsd_spawn_outcome; diff --git a/tools/libs/light/libxl_types_internal.idl b/tools/libs/light/libxl_types_internal.idl index e24288f1a5..39da71cef5 100644 --- a/tools/libs/light/libxl_types_internal.idl +++ b/tools/libs/light/libxl_types_internal.idl @@ -34,6 +34,7 @@ libxl__device_kind = Enumeration("device_kind", [ (16, "VINPUT"), (17, "VIRTIO_DISK"), (18, "VIRTIO"), +(19, "XEN_9PFS"), That's going to need to be rebased ;-). Yes. Juergen
Re: [PATCH] xen/events: close evtchn after mapping cleanup
On 24.01.24 17:31, Maximilian Heyne wrote: shutdown_pirq and startup_pirq are not taking the irq_mapping_update_lock because they can't due to lock inversion. Both are called with the irq_desc->lock being taking. The lock order, however, is first irq_mapping_update_lock and then irq_desc->lock. This opens multiple races: - shutdown_pirq can be interrupted by a function that allocates an event channel: CPU0CPU1 shutdown_pirq { xen_evtchn_close(e) __startup_pirq { EVTCHNOP_bind_pirq -> returns just freed evtchn e set_evtchn_to_irq(e, irq) } xen_irq_info_cleanup() { set_evtchn_to_irq(e, -1) } } Assume here event channel e refers here to the same event channel number. After this race the evtchn_to_irq mapping for e is invalid (-1). - __startup_pirq races with __unbind_from_irq in a similar way. Because __startup_pirq doesn't take irq_mapping_update_lock it can grab the evtchn that __unbind_from_irq is currently freeing and cleaning up. In this case even though the event channel is allocated, its mapping can be unset in evtchn_to_irq. The fix is to first cleanup the mappings and then close the event channel. In this way, when an event channel gets allocated it's potential previous evtchn_to_irq mappings are guaranteed to be unset already. This is also the reverse order of the allocation where first the event channel is allocated and then the mappings are setup. On a 5.10 kernel prior to commit 3fcdaf3d7634 ("xen/events: modify internal [un]bind interfaces"), we hit a BUG like the following during probing of NVMe devices. The issue is that during nvme_setup_io_queues, pci_free_irq is called for every device which results in a call to shutdown_pirq. With many nvme devices it's therefore likely to hit this race during boot because there will be multiple calls to shutdown_pirq and startup_pirq are running potentially in parallel. [ cut here ] blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; indirect descriptors: enabled; bounce buffer: enabled kernel BUG at drivers/xen/events/events_base.c:499! invalid opcode: [#1] SMP PTI CPU: 44 PID: 375 Comm: kworker/u257:23 Not tainted 5.10.201-191.748.amzn2.x86_64 #1 Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006 Workqueue: nvme-reset-wq nvme_reset_work RIP: 0010:bind_evtchn_to_cpu+0xdf/0xf0 Code: 5d 41 5e c3 cc cc cc cc 44 89 f7 e8 2b 55 ad ff 49 89 c5 48 85 c0 0f 84 64 ff ff ff 4c 8b 68 30 41 83 fe ff 0f 85 60 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 0f 1f 44 00 00 RSP: :c9000d533b08 EFLAGS: 00010046 RAX: RBX: RCX: 0006 RDX: 0028 RSI: RDI: RBP: 888107419680 R08: R09: 82d72b00 R10: R11: R12: 01ed R13: R14: R15: 0002 FS: () GS:88bc8b50() knlGS: CS: 0010 DS: ES: CR0: 80050033 CR2: CR3: 02610001 CR4: 001706e0 DR0: DR1: DR2: DR3: DR6: fffe0ff0 DR7: 0400 Call Trace: ? show_trace_log_lvl+0x1c1/0x2d9 ? show_trace_log_lvl+0x1c1/0x2d9 ? set_affinity_irq+0xdc/0x1c0 ? __die_body.cold+0x8/0xd ? die+0x2b/0x50 ? do_trap+0x90/0x110 ? bind_evtchn_to_cpu+0xdf/0xf0 ? do_error_trap+0x65/0x80 ? bind_evtchn_to_cpu+0xdf/0xf0 ? exc_invalid_op+0x4e/0x70 ? bind_evtchn_to_cpu+0xdf/0xf0 ? asm_exc_invalid_op+0x12/0x20 ? bind_evtchn_to_cpu+0xdf/0xf0 ? bind_evtchn_to_cpu+0xc5/0xf0 set_affinity_irq+0xdc/0x1c0 irq_do_set_affinity+0x1d7/0x1f0 irq_setup_affinity+0xd6/0x1a0 irq_startup+0x8a/0xf0 __setup_irq+0x639/0x6d0 ? nvme_suspend+0x150/0x150 request_threaded_irq+0x10c/0x180 ? nvme_suspend+0x150/0x150 pci_request_irq+0xa8/0xf0 ? __blk_mq_free_request+0x74/0xa0 queue_request_irq+0x6f/0x80 nvme_create_queue+0x1af/0x200 nvme_create_io_queues+0xbd/0xf0 nvme_setup_io_queues+0x246/0x320 ? nvme_irq_check+0x30/0x30 nvme_reset_work+0x1c8/0x400 process_one_work+0x1b0/0x350 worker_thread+0x49/0x310 ? process_one_work+0x350/0x350 kthread+0x11b/0x140 ? __kthread_bind_mask+0x60/0x60 ret_from_fork+0x22/0x30 Modules linked in: ---[ end trace a11715de1eee1873 ]--- Fixes: d46a78b05c0e ("xen: implement pirq type event channels") Cc: sta...@vger.kernel.org Co-debugged-by: Andrew Panyakin Signed-off-by: Maximilian Heyne Reviewed-by: Juergen Gross Juergen
Errors when trying to create VM with "channel" device(s)
A customer is complaining to be unable to create multiple "channel" devices in a HVM domain. I've verified the same happens with upstream Xen. The "channel" configuration is: channel = [ "connection=socket, name=test.ch1, path=/run/testsocket1", "connection=socket, name=test.ch2, path=/run/testsocket2" ] When creating the domain, I see: libxl: error: libxl_console.c:285:libxl__device_console_add: Domain 4:Primary console has invalid configuration Both /run/testsocket1 and /run/testsocket2 are being created in dom0. "xl channel-list 4" gives me: Idx BE state evt-ch ring-ref connection 0 0 1 -1 -1 socket "xenstore-ls /local/domain/0/backend/console/4" prints only one channel: 1 = "" frontend = "/local/domain/4/device/console/1" frontend-id = "4" online = "1" state = "2" protocol = "vt100" name = "test.ch2" connection = "socket" path = "/run/testsocket2" hotplug-status = "connected" 0 = "" frontend = "/local/domain/4/console" frontend-id = "4" online = "1" state = "1" protocol = "vt100" While qemu seems to have been started with the information for both. Output of "xl -vvv create": ... libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: -chardev libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: socket,id=libxl-channel0,path=/run/testsocket1,server=on,wait=off libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: -chardev libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: socket,id=libxl-channel1,path=/run/testsocket2,server=on,wait=off ... "xenstore-ls /local/domain/4/device/console" only shows: 1 = "" backend = "/local/domain/0/backend/console/4/1" backend-id = "0" state = "1" protocol = "vt100" name = "test.ch2" limit = "1048576" type = "ioemu" output = "chardev:libxl-channel1" tty = "" So the console AND the first channel are missing. "xenstore-ls /local/domain/4/error" shows: device = "" console = "" 1 = "" error = "22 xenbus_dev_probe on device/console/1" Now doing the same with only one "channel" device shows the same error when creating the guest. The guest has no channel device after that and the Xenstore error node is not present. I'm not that familiar with this code area, so I thought I post this here in the hope someone has an idea where to look first. Juergen
Re: [PATCH v4 14/32] tools/xen-9pfsd: add 9pfs read request support
On 08.02.24 02:28, Jason Andryuk wrote: On Mon, Feb 5, 2024 at 5:51 AM Juergen Gross wrote: Add the read request of the 9pfs protocol. Signed-off-by: Juergen Gross Reviewed-by: Jason Andryuk Acked-by: Anthony PERARD --- V2: - make error check more readable (Jason Andryuk) V4: - add directory read support --- tools/xen-9pfsd/io.c | 90 1 file changed, 90 insertions(+) diff --git a/tools/xen-9pfsd/io.c b/tools/xen-9pfsd/io.c index b763e3d8d9..cfbf3bef59 100644 --- a/tools/xen-9pfsd/io.c +++ b/tools/xen-9pfsd/io.c + +len = count; +buf = ring->buffer + sizeof(*hdr) + sizeof(uint32_t); + +if ( fidp->isdir ) +{ +struct dirent *dirent; +struct stat st; +struct p9_stat p9s; + """ For directories, read returns an integral number of direc- tory entries exactly as in stat (see stat(5)), one for each member of the directory. The read request message must have offset equal to zero or the value of offset in the previous read on the directory, plus the number of bytes returned in the previous read. In other words, seeking other than to the beginning is illegal in a directory (see seek(2)). """ I think you need to check `off`. Looks like QEMU only checks for off == 0 and rewinds in that case. QEMU ignores other values. Oh, indeed. Thanks for noticing. I'll use the same approach as qemu. Juergen
Re: [PATCH v4 02/32] tools: add a new xen logging daemon
On 08.02.24 01:52, Andrew Cooper wrote: On 05/02/2024 10:49 am, Juergen Gross wrote: Add "xen-9pfsd", a new logging daemon meant to support infrastructure domains (e.g. xenstore-stubdom) to access files in dom0. I was still expecting for "logging" to disappear from this. In both cases it could just be deleted the sentences still work, although the subject ought to gain a xen-9pfsd part. Happy to fix on commit. tools/Makefile | 1 + tools/xen-9pfsd/.gitignore | 1 + tools/xen-9pfsd/Makefile| 38 ++ tools/xen-9pfsd/xen-9pfsd.c | 147 Asking just in case... Do we really want the "xen-" in the directory? tools/9pfsd is still perfectly descriptive, and more amenable to tab completion given what else is in tools/ I'll change it. Juergen
Re: [PATCH] config: update Mini-OS commit
On 07.02.24 14:41, Julien Grall wrote: Hi Juergen, On 07/02/2024 13:31, Juergen Gross wrote: Update the Mini-OS upstream revision. Signed-off-by: Juergen Gross --- Config.mk | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/Config.mk b/Config.mk index f7d6d84847..077d841bb7 100644 --- a/Config.mk +++ b/Config.mk @@ -224,7 +224,7 @@ QEMU_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/qemu-xen.git QEMU_UPSTREAM_REVISION ?= master MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git -MINIOS_UPSTREAM_REVISION ?= 090eeeb1631f00a9a41ebf66d9b4aacb97eb51e7 +MINIOS_UPSTREAM_REVISION ?= b119f0178fd86876d0678007dfcf435ab8bb7568 I looked at the changes and I am puzzled by one. In p9_stat(), I see a call with: free_stat(); 'stat' is defined as 'struct p9_stat *stat' and 'free_stat' expects a 'struct p9_stat'. Wouldn't this resuilt to a build error? Hmm, I'm puzzled, too. I didn't see a build error when doing "make testbuild". But I agree that this looks very suspicious. Oh, in test builds we don't have HAVE_LIBC defined. I just verified that 9pfront.c gets compiled, while I didn't pay attention to the parts which were compiled during the build. Shame on me! Fixup patch coming soon... Thanks for noticing! Juergen
Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot
On 07.02.24 13:02, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 12:43:03 +0100, a ecrit: On 07.02.24 12:34, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 12:16:44 +0100, a ecrit: On 07.02.24 12:00, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit: while implementing kexec in Mini-OS. Oh, nice :D For that I need it for sure. It needs to be done by kexec itself then. That's another option, yes. The question is whether we want to support to be kexec-ed from other systems, too. But aren't other systems' kexec supports supposed to do the memset? They really should. I guess there is a reason why the Linux kernel does clear its .bss section in early boot. Maybe it is due to how the boot process works (the ELF file is encapsulated in vmlinuz), Yes, the unpack prevents grub etc. from doing it. but following your reasoning they should have cleared their .bss while unpacking the ELF contents, not while booting the contents. AIUI the decompressor itself doesn't actually know about ELF. decompress_kernel() does call handle_relocations(), but it should indeed clear bss itself and not leave it out to assembly indeed. I'm not sure they do the .bss clearing in kexec either, AIUI they do, see kimage_load_normal_segment() which clears the page before possibly loading some file piece into it. Really, this is part of what "loading an ELF" means. Okay, I'll do the clearing only in the kexec code then. Juergen
Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot
On 07.02.24 12:34, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 12:16:44 +0100, a ecrit: On 07.02.24 12:00, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit: while implementing kexec in Mini-OS. Oh, nice :D For that I need it for sure. It needs to be done by kexec itself then. That's another option, yes. The question is whether we want to support to be kexec-ed from other systems, too. But aren't other systems' kexec supports supposed to do the memset? They really should. I guess there is a reason why the Linux kernel does clear its .bss section in early boot. Maybe it is due to how the boot process works (the ELF file is encapsulated in vmlinuz), but following your reasoning they should have cleared their .bss while unpacking the ELF contents, not while booting the contents. I'm not sure they do the .bss clearing in kexec either, as the kernel is clearing it anyway. Juergen
Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot
On 07.02.24 12:00, Samuel Thibault wrote: Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit: while implementing kexec in Mini-OS. Oh, nice :D For that I need it for sure. It needs to be done by kexec itself then. That's another option, yes. The question is whether we want to support to be kexec-ed from other systems, too. Then I'd rather do it on both sides. Juergen
Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot
On 07.02.24 11:38, Samuel Thibault wrote: Juergen Gross, le mer. 07 févr. 2024 11:31:38 +0100, a ecrit: The .bss segment should be zeroed at very early boot. Is that not done by the elf loader of Xen? It might be done by Xen tools today, but I'm quite sure it is not part of the ABI. The hypervisor doesn't clear .bss when loading a domain (e.g. dom0). I stumbled over it while implementing kexec in Mini-OS. For that I need it for sure. Juergen
Re: [PATCH] Mini-OS: fix 9pfs frontend error path
On 06.02.24 16:26, Samuel Thibault wrote: Juergen Gross, le mar. 06 févr. 2024 07:17:21 +0100, a ecrit: The early error exit in p9_stat() returns without zeroing the p9_stat buffer, resulting in free() being called with an uninitialized pointer. Fix that by doing the zeroing first. This is not coherent with the usual conventions: when a function fails, it is supposed not to have done anything, and thus the caller shouldn't have to clean anything. I.e. i'd rather see the free_stat() call be put after the check for an error returned by p9_stat. I can do that, but this would require two calls of free_stat() (one in p9_stat() in an error case reported via req->result, and one in the caller of p9_stat() in case of no error). Juergen
Re: [PATCH] tools/libs/light: don't allow to stop Xenstore stubdom
On 06.02.24 14:08, Andrew Cooper wrote: On 06/02/2024 12:43 pm, Juergen Gross wrote: A Xenstore stubdom should never be stoppable. Reject attempts to do so. Signed-off-by: Juergen Gross I don't think this is a clever idea. `xl destroy` is also the "please clean up my system when it's in a very dead state" command, and that also includes a dead xenstored stubdom. I don't think xl destroy for a dead Xenstore stubdom will ever work. xl destroy tries to read (and delete) Xenstore entries, after all. I think you'd need a program using libxenctrl without all the xl/libxl actions for achieving this goal. And this would work with my current patch, too. If you're looking for some protection, then maybe a `--force` flag to override, but there must be some way of getting this to run. A system without Xenstore is probably quite useless anyway. At least today there is no way a new Xenstore would be able to connect to existing domains. OTOH I'm inclined to add more hooks, e.g. for "xl pause" and "xl migrate". And I do think that libxl is the right level for that, as I don't want users to be able to kill/pause/migrate Xenstore stubdom via libvirt either. Juergen
Re: [PATCH v4 23/32] tools/xenstored: move all log-pipe handling into posix.c
On 05.02.24 15:33, Julien Grall wrote: Hi Juergen, On 05/02/2024 10:49, Juergen Gross wrote: All of the log-pipe handling is needed only when running as daemon. Move it into posix.c. This requires to have a service function in the main event loop for handling the related requests and one for setting the fds[] array. Use a generic name for those functions, as socket handling can be added to them later, too. I would mention the rename in the commit message. With that: I have modified the commit message to: tools/xenstored: move all log-pipe handling into posix.c All of the log-pipe handling is needed only when running as daemon. Move it into posix.c. This requires to have a service function in the main event loop for handling the related requests and one for setting the fds[] array, which is renamed to poll_fds to have a more specific name. Use a generic name for those functions, as socket handling can be added to them later, too. Signed-off-by: Juergen Gross Reviewed-by: Julien Grall Thanks, Juergen
Re: [PATCH v4 00/32] tools: enable xenstore-stubdom to use 9pfs
On 05.02.24 11:55, Julien Grall wrote: Hi Juergen, On 05/02/2024 10:49, Juergen Gross wrote: This series is adding 9pfs support to Xenstore-stubdom, enabling it to do logging to a dom0 directory. This is a prerequisite for the final goal to add live update support to Xenstore-stubdom, as it enables the stubdom to store its state in a dom0 file. The 9pfs backend is a new daemon written from scratch. Using a dedicated 9pfs daemon has several advantages: - it is using much less resources than a full blown qemu process - it can serve multiple guests (the idea is to use it for other infrastructure domains, like qemu-stubdom or driver domains, too) - it is designed to support several security enhancements, like limiting the number of files for a guest, or limiting the allocated file system space - it doesn't support file links (neither hard nor soft links) or referencing parent directories via "..", minimizing the risk that a guest can "escape" from its home directory Note that for now the daemon only contains the minimal needed functionality to do logging from Xenstore-stubdom. I didn't want to add all the 9pfs commands and security add-ons in the beginning, in order to avoid needless efforts in case the idea of the daemon is being rejected. Changes in V4: - patch 2 of V3 was applied - added support of reading directories - addressed review comments Changes in V3: - new patches 1, 23-25 - addressed review comments Changes in V2: - support of multiple rings per device - xenlogd->xen-9pfsd rename - addressed review comments - fixed some bugs Juergen Gross (32): tools: add access macros for unaligned data tools: add a new xen logging daemon tools/xen-9pfsd: connect to frontend tools/xen-9pfsd: add transport layer tools/xen-9pfsd: add 9pfs response generation support tools/xen-9pfsd: add 9pfs version request support tools/xen-9pfsd: add 9pfs attach request support tools/xen-9pfsd: add 9pfs walk request support tools/xen-9pfsd: add 9pfs open request support tools/xen-9pfsd: add 9pfs clunk request support tools/xen-9pfsd: add 9pfs create request support tools/xen-9pfsd: add 9pfs stat request support tools/xen-9pfsd: add 9pfs write request support tools/xen-9pfsd: add 9pfs read request support tools/libs/light: add backend type for 9pfs PV devices tools/xl: support new 9pfs backend xen_9pfsd tools/helpers: allocate xenstore event channel for xenstore stubdom tools/xenstored: rename xenbus_evtchn() stubdom: extend xenstore stubdom configs tools: add 9pfs device to xenstore-stubdom tools/xenstored: add early_init() function tools/xenstored: move systemd handling to posix.c tools/xenstored: move all log-pipe handling into posix.c tools/xenstored: move all socket handling into posix.c tools/xenstored: get own domid in stubdom case tools/xenstored: rework ring page (un)map functions tools/xenstored: split domain_init() tools/xenstored: map stubdom interface tools/xenstored: mount 9pfs device in stubdom tools/xenstored: add helpers for filename handling tools/xenstored: support complete log capabilities in stubdom tools/xenstored: have a single do_control_memreport() I haven't checked what's the state of the 9PFS patches. Can part of the xenstored changes be committed without the 9PFS changes? The following patches can go in without the 9pfs daemon: tools/helpers: allocate xenstore event channel for xenstore stubdom tools/xenstored: rename xenbus_evtchn() stubdom: extend xenstore stubdom configs tools/xenstored: add early_init() function tools/xenstored: move systemd handling to posix.c tools/xenstored: move all log-pipe handling into posix.c tools/xenstored: move all socket handling into posix.c tools/xenstored: get own domid in stubdom case tools/xenstored: rework ring page (un)map functions tools/xenstored: split domain_init() tools/xenstored: map stubdom interface Juergen
Re: [PATCH v3 17/33] tools/xl: support new 9pfs backend xen-9pfsd
On 15.01.24 16:14, Anthony PERARD wrote: On Thu, Jan 04, 2024 at 10:00:39AM +0100, Juergen Gross wrote: @@ -2242,6 +2256,28 @@ void parse_config_data(const char *config_source, libxl_string_list_dispose(); +if (p9->type == LIBXL_P9_TYPE_UNKNOWN) { +p9->type = LIBXL_P9_TYPE_QEMU; The defaulting is normally done in libxl, so that it works for all users of libxl. Can this be done instead in libxl? Hopefully, it's enough to do it in libxl__device_p9_setdefault(). Same question for the followup checks and default values. I'll look into it. Juergen
Re: [PATCH v3 16/33] tools/libs/light: add backend type for 9pfs PV devices
On 15.01.24 16:38, Anthony PERARD wrote: On Thu, Jan 04, 2024 at 10:00:38AM +0100, Juergen Gross wrote: +static void libxl__device_p9_add(libxl__egc *egc, uint32_t domid, + libxl_device_p9 *p9, + libxl__ao_device *aodev) +{ +int rc; Can you make a copy of `p9` here, or maybe in xen9pfsd_spawn? There's no guaranty that `p9` will still be around by the time xen9pfsd_spawn_outcome() is executed. (with libxl_device_p9_copy()) Okay. +rc = xen9pfsd_spawn(egc, domid, p9, aodev); +if (rc == 1) +return; + +if (rc == 0) +libxl__device_add_async(egc, domid, __p9_devtype, p9, aodev); + +aodev->rc = rc; +if (rc) +aodev->callback(egc, aodev); +} + Juergen
Re: [PATCH v3 16/33] tools/libs/light: add backend type for 9pfs PV devices
On 12.01.24 17:55, Anthony PERARD wrote: On Thu, Jan 04, 2024 at 10:00:38AM +0100, Juergen Gross wrote: diff --git a/tools/libs/light/libxl_9pfs.c b/tools/libs/light/libxl_9pfs.c index 5ab0d3aa21..486bc4326e 100644 --- a/tools/libs/light/libxl_9pfs.c +++ b/tools/libs/light/libxl_9pfs.c @@ -33,20 +33,159 @@ static int libxl__set_xenstore_p9(libxl__gc *gc, uint32_t domid, flexarray_append_pair(front, "tag", p9->tag); +if (p9->type == LIBXL_P9_TYPE_XEN_9PFSD) { +flexarray_append_pair(back, "max-space", + GCSPRINTF("%u", p9->max_space)); +flexarray_append_pair(back, "max-files", + GCSPRINTF("%u", p9->max_files)); +flexarray_append_pair(back, "max-open-files", + GCSPRINTF("%u", p9->max_open_files)); +flexarray_append_pair(back, "auto-delete", + p9->auto_delete ? "1" : "0"); +} + +return 0; +} + +static int libxl__device_from_p9(libxl__gc *gc, uint32_t domid, + libxl_device_p9 *type, libxl__device *device) +{ +device->backend_devid = type->devid; +device->backend_domid = type->backend_domid; +device->backend_kind= type->type == LIBXL_P9_TYPE_QEMU + ? LIBXL__DEVICE_KIND_9PFS + : LIBXL__DEVICE_KIND_XEN_9PFS; +device->devid = type->devid; +device->domid = domid; +device->kind= LIBXL__DEVICE_KIND_9PFS; + return 0; } -#define libxl__add_p9s NULL +static int libxl_device_p9_dm_needed(void *e, unsigned domid) Prefix of the function should be "libxl__" as it's only internal to libxl. Okay. +{ +libxl_device_p9 *elem = e; + +return elem->type == LIBXL_P9_TYPE_QEMU && elem->backend_domid == domid; +} + +typedef struct libxl__aop9_state libxl__aop9_state; + +struct libxl__aop9_state { +libxl__spawn_state spawn; +libxl__ao_device *aodev; +libxl_device_p9 *p9; +uint32_t domid; +void (*callback)(libxl__egc *, libxl__aop9_state *, int); +}; + +static void xen9pfsd_spawn_outcome(libxl__egc *egc, libxl__aop9_state *aop9, + int rc) +{ +aop9->aodev->rc = rc; +if (rc) +aop9->aodev->callback(egc, aop9->aodev); +else +libxl__device_add_async(egc, aop9->domid, __p9_devtype, +aop9->p9, aop9->aodev); +} + +static void xen9pfsd_confirm(libxl__egc *egc, libxl__spawn_state *spawn, + const char *xsdata) +{ +STATE_AO_GC(spawn->ao); + +if (!xsdata) +return; + +if (strcmp(xsdata, "running")) +return; + +libxl__spawn_initiate_detach(gc, spawn); +} + +static void xen9pfsd_failed(libxl__egc *egc, libxl__spawn_state *spawn, int rc) +{ +libxl__aop9_state *aop9 = CONTAINER_OF(spawn, *aop9, spawn); + +xen9pfsd_spawn_outcome(egc, aop9, rc); +} + +static void xen9pfsd_detached(libxl__egc *egc, libxl__spawn_state *spawn) +{ +libxl__aop9_state *aop9 = CONTAINER_OF(spawn, *aop9, spawn); + +xen9pfsd_spawn_outcome(egc, aop9, 0); +} + +static int xen9pfsd_spawn(libxl__egc *egc, uint32_t domid, libxl_device_p9 *p9, + libxl__ao_device *aodev) +{ +STATE_AO_GC(aodev->ao); +struct libxl__aop9_state *aop9; +int rc; +char *args[] = { "xen-9pfsd", NULL }; +char *path = GCSPRINTF("/local/domain/%u/libxl/xen-9pfs", + p9->backend_domid); + +if (p9->type != LIBXL_P9_TYPE_XEN_9PFSD || +libxl__xs_read(gc, XBT_NULL, GCSPRINTF("%s/state", path))) I feel like this check and this function might not work as expected. What happen if we try to add more than one 9pfs "device"? libxl I think is going to try to start several xen-9pfs daemon before the first one have had time to write the "*/state" path. I don't think so. The path is specific for the _backend_ domid. What about two different libxl process trying to spawn that daemon? Is xen-9pfs going to behave well and have one giveup? But that would probably mean that libxl is going to have an error due to the process exiting early, maybe. I think I need to handle this case gracefully in the daemon by exiting with a 0 exit code. +return 0; + +GCNEW(aop9); +aop9->aodev = aodev; +aop9->p9 = p9; +aop9->domid = domid; +aop9->callback = xen9pfsd_spawn_outcome; + +aop9->spawn.ao = aodev->ao; +aop9->spawn.what = "xen-9pfs daemon"; +aop9->spawn.xspath = GCSPRINTF("%s/state", path); +aop9->spawn.timeout_ms = LIBXL_DEVICE_MODEL_START_TIMEOUT * 1000; +aop9->spawn.pidpath = GCSPRINTF("%s/pid", path); +aop9->spawn.midproc_cb = libxl__spawn_record_pid; +aop9->spawn.confirm_cb = xen9pfsd_confirm; +aop9->spawn.failure_cb = xen9pfsd_failed; +aop9->spawn.detached_cb = xen9pfsd_detached; +rc = libxl__spawn_spawn(egc, >spawn); +if (rc < 0) +
Re: [PATCH v3 09/33] tools/xenlogd: add 9pfs walk request support
On 09.01.24 20:19, Jason Andryuk wrote: On Thu, Jan 4, 2024 at 4:10 AM Juergen Gross wrote: Add the walk request of the 9pfs protocol. Signed-off-by: Juergen Gross Reviewed-by: Jason Andryuk With one minor comment. +path = calloc(path_len + 1, 1); +if ( !path ) +{ +p9_error(ring, hdr->tag, ENOMEM); +goto out; +} +strcpy(path, fidp->path); + +if ( n_names ) +{ +qids = calloc(n_names, sizeof(*qids)); +if ( !qids ) +{ +p9_error(ring, hdr->tag, ENOMEM); +goto out; +} +for ( i = 0; i < n_names; i++ ) +{ +if (strcmp(path, "/")) +strcat(path, "/"); strcmp() can only return 0 on the first iteration, so it seems inefficient to have it inside this loop. But the added complexity to avoid calling it doesn't seem worthwhile. With storing the relative path in fidp->path this call can be dropped. Juergen
Re: [PATCH v3 08/33] tools/xenlogd: add 9pfs attach request support
On 09.01.24 19:48, Jason Andryuk wrote: On Thu, Jan 4, 2024 at 4:12 AM Juergen Gross wrote: Add the attach request of the 9pfs protocol. This introduces the "fid" scheme of the 9pfs protocol. As this will be needed later, use a dedicated memory allocation function in alloc_fid() and prepare a fid reference count. For filling the qid data take the approach from the qemu 9pfs backend implementation. Signed-off-by: Juergen Gross --- V2: - make fill_qid() parameter stbuf const (Jason Andryuk) - free fids after disconnecting guest (Jason Andryuk) V3: - only store relative path in fid (Jason Andryuk) The code looks good. I did have a thought though. +static struct p9_fid *alloc_fid_mem(device *device, unsigned int fid, +const char *path) +{ +struct p9_fid *fidp; +size_t pathlen; + +/* Paths always start with "/" as they are starting at the mount point. */ +assert(path[0] == '/'); + ... + +static const char *relpath_from_path(const char *path) +{ +if (!strcmp(path, "/")) +return "."; + +return (path[0] == '/') ? path + 1 : path; +} You've carefully written the code to ensure the *at() functions are not called with paths starting with "/". What do you think about storing the converted paths when storing into the p9_fid? That way the code doesn't have to worry about always going through relpath_from_path() before use. Another option beside performing the relpath_from_path() conversion, would be to save fidp->path with "./" at the start to eliminate absolute paths that way. My thinking is it's more robust to not have any absolute paths that could be passed to a *at() function. I've had a thorough look at the code at the end of the series and I agree this is a good idea. I'll change the related patches accordingly. Juergen
Re: [PATCH] tools/xenstored: Remove unnecessary define XC_WANT_COMPAT_MAP_FOREIGN_API
On 25.03.21 12:42, Julien Grall wrote: From: Julien Grall The last use of the compat foreign API was dropped in commit 38eeb3864de4 "tools/xenstored: Drop mapping of the ring via foreign map". Therefore, we don't need to define XC_WANT_COMPAT_MAP_FOREIGN_API. Signed-off-by: Julien Grall Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH for-4.15?] docs/misc: xenstored: Re-instate and tweak the documentation for XS_RESUME
On 25.03.21 19:06, Julien Grall wrote: From: Julien Grall Commit 13dd372834a4 removed the documentation for XS_RESUME, however this command is still implemented (at least in C Xenstored) and used by libxl when resuming a domain. So re-instate the documentation for the XS_RESUME. Take the opportunity to update it as there is a user of the command. Fixes: 13dd372834a4 ("docs/designs: re-work the xenstore migration document...") Signed-off-by: Julien Grall Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH for-4.15?] docs/design: Update xenstore-migration.md
On 25.03.21 12:12, Julien Grall wrote: From: Julien Grall It is not very clear the shared page adddress is not contained in the connection record. Additionally, it is misleading to say the grant will always point to the share paged as a domain is free to revoke the permission. The restore code would need to make sure it doesn't fail/crash if this is happening. The sentence is now replaced with a paragraph explaining why the GFN is not preserved and that the grant is not guarantee to exist during restore. Take the opportunity to replace "code" with "node" when description the permission. Reported-by: Raphael Ning Signed-off-by: Julien Grall Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH-for-4.15 V2] tools/libs/store: tidy up libxenstore interface
On 24.03.21 15:09, Julien Grall wrote: Hi Juergen, On 24/03/2021 11:30, Juergen Gross wrote: xenstore_lib.h is in need to be tidied up a little bit: - the definition of struct xs_tdb_record_hdr shouldn't be here - some symbols are not namespaced correctly Signed-off-by: Juergen Gross --- V2: minimal variant (Ian Jackson) --- tools/include/xenstore_lib.h | 17 - tools/libs/store/libxenstore.map | 6 +++--- tools/libs/store/xs.c | 12 ++-- tools/xenstore/utils.h | 11 +++ tools/xenstore/xenstore_client.c | 12 ++-- 5 files changed, 30 insertions(+), 28 deletions(-) diff --git a/tools/include/xenstore_lib.h b/tools/include/xenstore_lib.h index 4c9b6d1685..f74ad7024b 100644 --- a/tools/include/xenstore_lib.h +++ b/tools/include/xenstore_lib.h @@ -43,15 +43,6 @@ struct xs_permissions enum xs_perm_type perms; }; -/* Header of the node record in tdb. */ -struct xs_tdb_record_hdr { - uint64_t generation; - uint32_t num_perms; - uint32_t datalen; - uint32_t childlen; - struct xs_permissions perms[0]; -}; - /* Each 10 bits takes ~ 3 digits, plus one, plus one for nul terminator. */ #define MAX_STRLEN(x) ((sizeof(x) * CHAR_BIT + CHAR_BIT-1) / 10 * 3 + 2) @@ -78,18 +69,18 @@ bool xs_perm_to_string(const struct xs_permissions *perm, unsigned int xs_count_strings(const char *strings, unsigned int len); /* Sanitising (quoting) possibly-binary strings. */ -struct expanding_buffer { +struct xs_expanding_buffer { char *buf; int avail; }; /* Ensure that given expanding buffer has at least min_avail characters. */ -char *expanding_buffer_ensure(struct expanding_buffer *, int min_avail); +char *xs_expanding_buffer_ensure(struct xs_expanding_buffer *, int min_avail); /* sanitise_value() may return NULL if malloc fails. */ -char *sanitise_value(struct expanding_buffer *, const char *val, unsigned len); +char *xs_sanitise_value(struct xs_expanding_buffer *, const char *val, unsigned len); /* *out_len_r on entry is ignored; out must be at least strlen(in)+1 bytes. */ -void unsanitise_value(char *out, unsigned *out_len_r, const char *in); +void xs_unsanitise_value(char *out, unsigned *out_len_r, const char *in); #endif /* XENSTORE_LIB_H */ diff --git a/tools/libs/store/libxenstore.map b/tools/libs/store/libxenstore.map index 9854305a2c..fc1c213f13 100644 --- a/tools/libs/store/libxenstore.map +++ b/tools/libs/store/libxenstore.map @@ -42,8 +42,8 @@ VERS_3.0.3 { xs_strings_to_perms; xs_perm_to_string; xs_count_strings; - expanding_buffer_ensure; - sanitise_value; - unsanitise_value; + xs_expanding_buffer_ensure; + xs_sanitise_value; + xs_unsanitise_value; Isn't libxenstore considered stable? If so, shouldn't we bump the version to avoid any breakage for existing app? See https://lists.xen.org/archives/html/xen-devel/2021-03/msg01267.html Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH-for-4.15 V2] tools/libs/store: tidy up libxenstore interface
On 24.03.21 12:42, Andrew Cooper wrote: On 24/03/2021 11:30, Juergen Gross wrote: xenstore_lib.h is in need to be tidied up a little bit: - the definition of struct xs_tdb_record_hdr shouldn't be here - some symbols are not namespaced correctly Signed-off-by: Juergen Gross --- V2: minimal variant (Ian Jackson) --- tools/include/xenstore_lib.h | 17 - tools/libs/store/libxenstore.map | 6 +++--- tools/libs/store/xs.c| 12 ++-- tools/xenstore/utils.h | 11 +++ tools/xenstore/xenstore_client.c | 12 ++-- 5 files changed, 30 insertions(+), 28 deletions(-) diff --git a/tools/include/xenstore_lib.h b/tools/include/xenstore_lib.h index 4c9b6d1685..f74ad7024b 100644 --- a/tools/include/xenstore_lib.h +++ b/tools/include/xenstore_lib.h @@ -43,15 +43,6 @@ struct xs_permissions enum xs_perm_type perms; ^ This enum is still a ABI problem, as it has implementation defined size. The containing struct is used by xs_perm_to_string(). Substituting for int is probably the easiest option, because no amount of trickery with the enum values themselves can prevent the compiler deciding to use a long or larger for the object. Switching to unsigned int and replacing the enum values with #defines seems to be the way to go, as the enum values are basically bit mask values. Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface
On 24.03.21 12:30, Ian Jackson wrote: Jürgen Groß writes ("Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface"): On 24.03.21 12:02, Ian Jackson wrote: Is it possible to do sort this out in a more minimal way ? Eg we could change the name to namespace it properly. (I haven't looked at the code in detail and am still rather under-caffeinated so maybe I am talking nonsense here.) No nonsense. This would be the really minimum option (apart from doing nothing). I can setup the patch for that and keep the rest for 4.16 (which will then probably need to bump the so version). Hmmm. Maybe it would be less disruptive to punt the whole lot for xen-next. That way we don't have a silent withdrawl in one release followed by a soname bump in the next. If you're keen to change this for 4.15, please feel free to show me what the patch looks like. But I would be inclined to postpone this. Minimal variant sent. I'm not keen to have that for 4.15, but the patch was just ready. :-) Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface
On 24.03.21 12:02, Ian Jackson wrote: Juergen Gross writes ("[PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface"): There are some internals in the libxenstore interface which should be removed. Move those functions into xs_lib.c and the related definitions into xs_lib.h. Remove the functions from the mapfile. Add xs_lib.o to xenstore_client as some of the internal functions are needed there. This seems wider in scope than I was expecting. Reviewing it again makes me think that there are more concers than I anticipated and I am now doubtful whether I want to take it in 4.15. I'm fine with that. TBH I would have been surprised if you'd just take it. :-) I thought at this stage we were just going to fix the accidentally-exported symbols with improperly namespaced names. It is those for which I think that withdrawing them without an ABI soname bump, in contravention of usual library ABI stability rules, will not cause trouble in pracice. Just removing them from the mapfile doesn't work. Either we need to keep them (maybe with "xs_" prefixed), or we need to go the way I've done in this patch. My current thoughts are that several of these really ought not to be withdrawn as they might cause actual trouble: /* Path for various daemon things: env vars can override. */ -const char *xs_daemon_rootdir(void); -const char *xs_domain_dev(void); -const char *xs_daemon_tdb(void); Someone who was writing bindings might have exposed these without knowing what they were, resulting in linkage to these symbols. This patch is removing everything not being used in the (known) Xen ecosystem (Xen, qemu, qemu-trad, mini-os). bool xs_strings_to_perms(struct xs_permissions *perms, unsigned int num, const char *strings); -/* Convert permissions to a string (up to len MAX_STRLEN(unsigned int)+1). */ -bool xs_perm_to_string(const struct xs_permissions *perm, - char *buffer, size_t buf_len); Isn't this function potentially useful ? It seems funny to have only one of the conversion directions. As stated above: this patch is doing the absolute possible maximum. I'm absolutely fine to drop some of the removals. +void unsanitise_value(char *out, unsigned *out_len_r, const char *in) Is it possible to do sort this out in a more minimal way ? Eg we could change the name to namespace it properly. (I haven't looked at the code in detail and am still rather under-caffeinated so maybe I am talking nonsense here.) No nonsense. This would be the really minimum option (apart from doing nothing). I can setup the patch for that and keep the rest for 4.16 (which will then probably need to bump the so version). Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH 2/2] Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"
On 17.03.21 12:04, Roger Pau Monne wrote: This partially reverts commit 882213990d32fd224340a4533f6318dd152be4b2. There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order to correctly size the p2m. The generic memory hotplug option has already been tied together with the Xen hotplug limit, so enabling memory hotplug should already trigger a properly sized p2m on Xen PV. Can you add some words here that XEN_UNPOPULATED_ALLOC depends on MEMORY_HOTPLUG via ZONE_DEVICE? Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH v2 for-4.14] tools: Fix pkg-config file for libxenstore
On 22.03.21 17:38, Andrew Cooper wrote: There are no dependenices on evtchn, ctrl or gnttab. Fixes: 1b008e99 ("tools: provide pkg-config file for libxenstore") Signed-off-by: Andrew Cooper Reviewed-by: Juergen Gross Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH for-4.14] tools: Fix pkg-config file for libxenstore
On 22.03.21 17:20, Andrew Cooper wrote: There is no dependency on libxenctrl. Fixes: 1b008e99 ("tools: provide pkg-config file for libxenstore") Signed-off-by: Andrew Cooper --- CC: Ian Jackson CC: Wei Liu CC: Juergen Gross CC: Jan Beulich This has been fixed in Xen 4.15 by the uselibs.mk logic, but 4.14 and older cause everything linking against libxenstore to also try linking against libxenctrl. It also causes RPM to create unexpected dependencies between subpackages, which is a problem when trying to separate the stable and unstable libs. --- tools/xenstore/xenstore.pc.in | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/tools/xenstore/xenstore.pc.in b/tools/xenstore/xenstore.pc.in index 2f64a6b824..98c3f1ab39 100644 --- a/tools/xenstore/xenstore.pc.in +++ b/tools/xenstore/xenstore.pc.in @@ -8,4 +8,4 @@ Version: @@version@@ Cflags: -I${includedir} @@cflagslocal@@ Libs: @@libsflag@@${libdir} -lxenstore Libs.private: -ldl -Requires.private: xenevtchn,xencontrol,xengnttab,xentoolcore +Requires.private: xenevtchn,xengnttab,xentoolcore Any reason you are keeping xenevtchn and xengnttab? Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [PATCH 0/6] tools/libs: add missing support of linear p2m_list, cleanup
On 22.03.21 13:46, Andrew Cooper wrote: On 22/03/2021 10:58, Juergen Gross wrote: There are some corners left which don't support the not so very new linear p2m list of pv guests, which has been introduced in Linux kernel 3.19 and which is mandatory for non-legacy versions of Xen since kernel 4.14. This series adds support for the linear p2m list where it is missing (colo support and "xl dump-core"). In theory it should be possible to merge the p2m list mapping code from migration handling and core dump handling, but this needs quite some cleanup before this is possible. The first three patches of this series are fixing real problems, so I've put them at the start of this series, especially in order to make backports easier. The other three patches are only the first steps of cleanup. The main work done here is to concentrate all p2m mapping in libxenguest instead of having one implementation in each of libxenguest and libxenctrl. Merging the two implementations should be rather easy, but this will require to touch many lines of code, as the migration handling variant seems to be more mature, but it is using the migration stream specific structures heavily. So I'd like to have some confirmation that my way to clean this up is the right one. My idea would be to add the data needed for p2m mapping to struct domain_info_context and replace the related fields in struct xc_sr_context with a struct domain_info_context. Modifying the interface of xc_core_arch_map_p2m() to take most current parameters via struct domain_info_context would then enable migration coding to use xc_core_arch_map_p2m() for mapping the p2m. xc_core_arch_map_p2m() should look basically like the current migration p2m mapping code afterwards. Any comments to that plan? Juergen Gross (6): tools/libs/guest: fix max_pfn setting in map_p2m() tools/libs/ctrl: fix xc_core_arch_map_p2m() to support linear p2m table tools/libs/ctrl: use common p2m mapping code in xc_domain_resume_any() tools/libs: move xc_resume.c to libxenguest tools/libs: move xc_core* from libxenctrl to libxenguest tools/libs/guest: make some definitions private to libxenguest https://gitlab.com/xen-project/patchew/xen/-/jobs/1116936958 xenctrl_stubs.c:342:11: error: implicit declaration of function 'xc_domain_resume' is invalid in C99 [-Werror,-Wimplicit-function-declaration] result = xc_domain_resume(_H(xch), c_domid, 1); ^ 1 error generated. I suspect you need to shuffle the headers in use for the Ocaml stubs too. Yes. Patch 4 needs to gain an "#include " in tools/ocaml/libs/xc/xenctrl_stubs.c Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [net-next 1/2] xen-netback: add module parameter to disable ctrl-ring
On 22.03.21 07:48, Leon Romanovsky wrote: On Mon, Mar 22, 2021 at 06:58:34AM +0100, Jürgen Groß wrote: On 22.03.21 06:39, Leon Romanovsky wrote: On Sun, Mar 21, 2021 at 06:54:52PM +0100, Hsu, Chiahao wrote: <...> Typically there should be one VM running netback on each host, and having control over what interfaces or features it exposes is also important for stability. How about we create a 'feature flags' modparam, each bits is specified for different new features? At the end, it will be more granular module parameter that user still will need to guess. I believe users always need to know any parameter or any tool's flag before they use it. For example, before user try to set/clear this ctrl_ring_enabled, they should already have basic knowledge about this feature, or else they shouldn't use it (the default value is same as before), and that's also why we use the 'ctrl_ring_enabled' as parameter name. It solves only forward migration flow. Move from machine A with no option X to machine B with option X. It doesn't work for backward flow. Move from machine B to A back will probably break. In your flow, you want that users will set all module parameters for every upgrade and keep those parameters differently per-version. I think the flag should be a per guest config item. Adding this item to the backend Xenstore nodes for netback to consume it should be rather easy. Yes, this would need a change in Xen tools, too, but it is the most flexible way to handle it. And in case of migration the information would be just migrated to the new host with the guest's config data. Yes, it will overcome global nature of module parameters, but how does it solve backward compatibility concern? When creating a guest on A the (unknown) feature will not be set to any value in the guest's config data. A migration stream not having any value for that feature on B should set it to "false". When creating a guest on B it will either have the feature value set explicitly in the guest config (either true or false), or it will get the server's default (this value should be configurable in a global config file, default for that global value would be "true"). So with the guest created on B with feature specified as "false" (either for this guest only, or per global config), it will be migratable to machine A without problem. Migrating it back to B would work the same way as above. Trying to migrate a guest with feature set to "true" to B would not work, but this would be the host admin's fault due to not configuring the guest correctly. Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature
Re: [net-next 1/2] xen-netback: add module parameter to disable ctrl-ring
On 22.03.21 06:39, Leon Romanovsky wrote: On Sun, Mar 21, 2021 at 06:54:52PM +0100, Hsu, Chiahao wrote: <...> Typically there should be one VM running netback on each host, and having control over what interfaces or features it exposes is also important for stability. How about we create a 'feature flags' modparam, each bits is specified for different new features? At the end, it will be more granular module parameter that user still will need to guess. I believe users always need to know any parameter or any tool's flag before they use it. For example, before user try to set/clear this ctrl_ring_enabled, they should already have basic knowledge about this feature, or else they shouldn't use it (the default value is same as before), and that's also why we use the 'ctrl_ring_enabled' as parameter name. It solves only forward migration flow. Move from machine A with no option X to machine B with option X. It doesn't work for backward flow. Move from machine B to A back will probably break. In your flow, you want that users will set all module parameters for every upgrade and keep those parameters differently per-version. I think the flag should be a per guest config item. Adding this item to the backend Xenstore nodes for netback to consume it should be rather easy. Yes, this would need a change in Xen tools, too, but it is the most flexible way to handle it. And in case of migration the information would be just migrated to the new host with the guest's config data. Juergen OpenPGP_0xB0DE9DD628BF132F.asc Description: application/pgp-keys OpenPGP_signature Description: OpenPGP digital signature