Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-29 Thread Jürgen Groß

On 29.05.24 14:46, Roger Pau Monné wrote:

On Wed, May 29, 2024 at 01:47:09PM +0200, Jürgen Groß wrote:

On 28.05.24 13:22, Roger Pau Monné wrote:

Hello,

When the stop_machine_run() call in cpu_down() fails and calls the CPU
notifier CPU_DOWN_FAILED hook the following assert triggers in the
scheduling code:

Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at 
common/sched/cred1
[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
CPU:0
RIP:e008:[] 
common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
rbp: 83203d58   rsp: 83203d30   r8:  
r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
r12: 83202ecb80b0   r13: 0001   r14: 0282
r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
cr3: 574c2000   cr2: 
fsb:    gsb:    gss: 
ds:    es:    fs:    gs:    ss:    cs: e008
Xen code around  
(common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
   fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e
Xen stack trace from rsp=83203d30:
 83202d74d100 0001 82d0404c4430 0006
  83203d78 82d040257454 
 0001 83203da8 82d04021f303 82d0404c4628
 82d0404c4620 82d0404c4430 0006 83203df0
 82d04022bc4c 83203e18 0001 0001
 fff0   82d0405e6500
 83203e08 82d040204fd5 0001 83203e30
 82d0402054f0 82d0404c5860 0001 83202ec75000
 83203e48 82d040348c25 83202d74d0d0 83203e68
 82d0402071aa 83202ec751d0 82d0405ce210 83203e80
 82d0402343c9 82d0405ce200 83203eb0 82d040234631
  7fff 82d0405d5080 82d0405ce210
 83203ee8 82d040321411 82d040321399 83202f3a9000
  001d91a6fa2d 82d0405e6500 83203de0
 82d040324391   
    
    
    
    
    
Xen call trace:
 [] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
 [] F free_cpu_rm_data+0x41/0x58
 [] F common/sched/cpupool.c#cpu_callback+0xfb/0x466
 [] F notifier_call_chain+0x6c/0x96
 [] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
 [] F cpu_down+0xa7/0x143
 [] F cpu_down_helper+0x11/0x27
 [] F 
common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
 [] F common/tasklet.c#do_tasklet_work+0x76/0xaf
 [] F do_tasklet+0x5b/0x8d
 [] F arch/x86/domain.c#idle_loop+0x78/0xe6
 [] F continue_running+0x5b/0x5d



Panic on CPU 0:
Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at 
common/sched/credit2.c:4111


The issue seems to be that since the CPU hasn't been removed, it's
still part of prv->initialized and the assert in csched2_free_pdata()
called as part of free_cpu_rm_data() triggers.

It's easy to reproduce by substituting the stop_machine_run() call in
cpu_down() with an error.


Could you please give the attached patch a try?


I still get the following assert:


Oh, silly me. Without core scheduling active nr_sr_unused will be 0 all
the time. :-(

Next try.


Juergen
From 8b707b0e4d68b1636f1c5ff0e2ef54bea5d9921a Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Wed, 29 May 2024 13:24:36 +0200
Subject: [PATCH] xen/sched: fix error path of cpu removal
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in
the credit2 scheduler might trigger:

Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1
[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
CPU:0
RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
rbp: 83203d58   rsp: 83203d30   r8:  
r9:  83202f3e6e01   r10: 

Re: [PATCH] xen/xenbus: handle potential dangling pointer issue in xen_pcibk_xenbus_probe

2024-05-29 Thread Jürgen Groß

On 29.05.24 14:22, ysk...@gmail.com wrote:

From: Yunseong Kim 

If 'xen_pcibk_init_devices()' fails. This ensures that 'pdev->xdev' does
not point to 'xdev' when 'pdev' is freed.

Signed-off-by: Yunseong Kim 
---
  drivers/xen/xen-pciback/xenbus.c | 1 +
  1 file changed, 1 insertion(+)

diff --git a/drivers/xen/xen-pciback/xenbus.c b/drivers/xen/xen-pciback/xenbus.c
index b11e401f1b1e..348d6803b8c0 100644
--- a/drivers/xen/xen-pciback/xenbus.c
+++ b/drivers/xen/xen-pciback/xenbus.c
@@ -54,6 +54,7 @@ static struct xen_pcibk_device *alloc_pdev(struct 
xenbus_device *xdev)
INIT_WORK(>op_work, xen_pcibk_do_op);
  
  	if (xen_pcibk_init_devices(pdev)) {

+   pdev->xdev = NULL;
kfree(pdev);
pdev = NULL;
}


NAK.

This doesn't make any sense, as pdev is freed.


Juergen



Re: CPU_DOWN_FAILED hits ASSERTs in scheduling logic

2024-05-29 Thread Jürgen Groß

On 28.05.24 13:22, Roger Pau Monné wrote:

Hello,

When the stop_machine_run() call in cpu_down() fails and calls the CPU
notifier CPU_DOWN_FAILED hook the following assert triggers in the
scheduling code:

Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at 
common/sched/cred1
[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
CPU:0
RIP:e008:[] 
common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
rbp: 83203d58   rsp: 83203d30   r8:  
r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
r12: 83202ecb80b0   r13: 0001   r14: 0282
r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
cr3: 574c2000   cr2: 
fsb:    gsb:    gss: 
ds:    es:    fs:    gs:    ss:    cs: e008
Xen code around  
(common/sched/credit2.c#csched2_free_pdata+0xc8/0x177):
  fe ff eb 9a 0f 0b 0f 0b <0f> 0b 49 8d 4f 08 49 8b 47 08 48 3b 48 08 75 2e
Xen stack trace from rsp=83203d30:
83202d74d100 0001 82d0404c4430 0006
 83203d78 82d040257454 
0001 83203da8 82d04021f303 82d0404c4628
82d0404c4620 82d0404c4430 0006 83203df0
82d04022bc4c 83203e18 0001 0001
fff0   82d0405e6500
83203e08 82d040204fd5 0001 83203e30
82d0402054f0 82d0404c5860 0001 83202ec75000
83203e48 82d040348c25 83202d74d0d0 83203e68
82d0402071aa 83202ec751d0 82d0405ce210 83203e80
82d0402343c9 82d0405ce200 83203eb0 82d040234631
 7fff 82d0405d5080 82d0405ce210
83203ee8 82d040321411 82d040321399 83202f3a9000
 001d91a6fa2d 82d0405e6500 83203de0
82d040324391   
   
   
   
   
   
Xen call trace:
[] R common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
[] F free_cpu_rm_data+0x41/0x58
[] F common/sched/cpupool.c#cpu_callback+0xfb/0x466
[] F notifier_call_chain+0x6c/0x96
[] F common/cpu.c#cpu_notifier_call_chain+0x1b/0x36
[] F cpu_down+0xa7/0x143
[] F cpu_down_helper+0x11/0x27
[] F 
common/domain.c#continue_hypercall_tasklet_handler+0x50/0xbd
[] F common/tasklet.c#do_tasklet_work+0x76/0xaf
[] F do_tasklet+0x5b/0x8d
[] F arch/x86/domain.c#idle_loop+0x78/0xe6
[] F continue_running+0x5b/0x5d



Panic on CPU 0:
Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at 
common/sched/credit2.c:4111


The issue seems to be that since the CPU hasn't been removed, it's
still part of prv->initialized and the assert in csched2_free_pdata()
called as part of free_cpu_rm_data() triggers.

It's easy to reproduce by substituting the stop_machine_run() call in
cpu_down() with an error.


Could you please give the attached patch a try?

Only compile tested, though...


Juergen
From 5925f15ace8be186c9f41c0cda2d4a675b0f7bb9 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
Date: Wed, 29 May 2024 13:24:36 +0200
Subject: [PATCH] xen/sched: fix error path of cpu removal
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

In case removal of a cpu fails at the CPU_DYING step, an ASSERT() in
the credit2 scheduler might trigger:

Assertion '!cpumask_test_cpu(cpu, >initialized)' failed at common/sched/cred1
[ Xen-4.19-unstable  x86_64  debug=y  Tainted:   C]
CPU:0
RIP:e008:[] common/sched/credit2.c#csched2_free_pdata+0xc8/0x177
RFLAGS: 00010093   CONTEXT: hypervisor
rax:    rbx: 83202ecc2f80   rcx: 83202f3e64c0
rdx: 0001   rsi: 0002   rdi: 83202ecc2f88
rbp: 83203d58   rsp: 83203d30   r8:  
r9:  83202f3e6e01   r10:    r11: 0f0f0f0f0f0f0f0f
r12: 83202ecb80b0   r13: 0001   r14: 0282
r15: 83202ecbbf00   cr0: 8005003b   cr4: 007526e0
cr3: 574c2000   cr2: 
fsb:    gsb: 

Re: [PATCH 3/4] x86/pci/xen: Fix PCIBIOS_* return code handling

2024-05-28 Thread Jürgen Groß

On 27.05.24 14:55, Ilpo Järvinen wrote:

xen_pcifront_enable_irq() uses pci_read_config_byte() that returns
PCIBIOS_* codes. The error handling, however, assumes the codes are
normal errnos because it checks for < 0.

xen_pcifront_enable_irq() also returns the PCIBIOS_* code back to the
caller but the function is used as the (*pcibios_enable_irq) function
which should return normal errnos.

Convert the error check to plain non-zero check which works for
PCIBIOS_* return codes and convert the PCIBIOS_* return code using
pcibios_err_to_errno() into normal errno before returning it.

Fixes: 3f2a230caf21 ("xen: handled remapped IRQs when enabling a pcifront PCI 
device.")
Signed-off-by: Ilpo Järvinen 
Cc: sta...@vger.kernel.org


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v3 2/4] xen/arm: Alloc XenStore page for Dom0less DomUs from hypervisor

2024-05-27 Thread Jürgen Groß

On 25.05.24 01:19, Stefano Stabellini wrote:

On Fri, 24 May 2024, Jürgen Groß wrote:

On 24.05.24 15:58, Julien Grall wrote:

Hi Henry,

+ Juergen as the Xenstore maintainers. I'd like his opinion on the approach.
The documentation of the new logic is in:

https://lore.kernel.org/xen-devel/20240517032156.1490515-5-xin.wa...@amd.com/

FWIW I am happy in principle with the logic (this is what we discussed on
the call last week). Some comments below.


I'm not against this logic, but I'm wondering why it needs to be so
complicated.


Actually the reason I like it is that in my view, this is the simplest
approach. You allocate a domain, you also allocate the xenstore page
together with it. Initially the xenstore connection has an
"uninitialized" state, as it should be. That's it. At some point, when
xenstored is ready, the state changes to CONNECTED.



Can't the domU itself allocate the Xenstore page from its RAM pages,
write the PFN into the Xenstore grant tab entry, and then make it
public via setting HVM_PARAM_STORE_PFN?


This is not simpler in my view


Okay, fine with me. I had the impression that violating the 1:1 mapping
of the domain would add complexity, but if you are fine with that I don't
mind your approach.


Juergen




Re: [PATCH v3 2/4] xen/arm: Alloc XenStore page for Dom0less DomUs from hypervisor

2024-05-24 Thread Jürgen Groß

On 24.05.24 15:58, Julien Grall wrote:

Hi Henry,

+ Juergen as the Xenstore maintainers. I'd like his opinion on the approach. The 
documentation of the new logic is in:


https://lore.kernel.org/xen-devel/20240517032156.1490515-5-xin.wa...@amd.com/

FWIW I am happy in principle with the logic (this is what we discussed on the 
call last week). Some comments below.


I'm not against this logic, but I'm wondering why it needs to be so
complicated.

Can't the domU itself allocate the Xenstore page from its RAM pages,
write the PFN into the Xenstore grant tab entry, and then make it
public via setting HVM_PARAM_STORE_PFN?

The init-dom0less application could then check HVM_PARAM_STORE_PFN
being set and call XS_introduce_domain.

Note that at least C-xenstored does not need the PFN of the Xenstore
page, as it is just using GNTTAB_RESERVED_XENSTORE for mapping the
page.


Juergen



Re: [PATCH for-4.19 v3 2/3] xen: enable altp2m at create domain domctl

2024-05-24 Thread Jürgen Groß

On 17.05.24 15:33, Roger Pau Monne wrote:

Enabling it using an HVM param is fragile, and complicates the logic when
deciding whether options that interact with altp2m can also be enabled.

Leave the HVM param value for consumption by the guest, but prevent it from
being set.  Enabling is now done using and additional altp2m specific field in
xen_domctl_createdomain.

Note that albeit only currently implemented in x86, altp2m could be implemented
in other architectures, hence why the field is added to xen_domctl_createdomain
instead of xen_arch_domainconfig.

Signed-off-by: Roger Pau Monné 


Reviewed-by: Juergen Gross  # tools/libs/


Juergen



Re: [PATCH 3/5] x86/pvh: Set phys_base when calling xen_prepare_pvh()

2024-05-23 Thread Jürgen Groß

On 10.04.24 21:48, Jason Andryuk wrote:

phys_base needs to be set for __pa() to work in xen_pvh_init() when
finding the hypercall page.  Set it before calling into
xen_prepare_pvh(), which calls xen_pvh_init().  Clear it afterward to
avoid __startup_64() adding to it and creating an incorrect value.

Signed-off-by: Jason Andryuk 
---
Instead of setting and clearing phys_base, a dedicated variable could be
used just for the hypercall page.  Having phys_base set properly may
avoid further issues if the use of phys_base or __pa() grows.
---
  arch/x86/platform/pvh/head.S | 10 ++
  1 file changed, 10 insertions(+)

diff --git a/arch/x86/platform/pvh/head.S b/arch/x86/platform/pvh/head.S
index bb1e582e32b1..c08d08d8cc92 100644
--- a/arch/x86/platform/pvh/head.S
+++ b/arch/x86/platform/pvh/head.S
@@ -125,7 +125,17 @@ SYM_CODE_START_LOCAL(pvh_start_xen)
xor %edx, %edx
wrmsr
  
+	/* Calculate load offset from LOAD_PHYSICAL_ADDR and store in

+* phys_base.  __pa() needs phys_base set to calculate the
+* hypercall page in xen_pvh_init(). */


Please use the correct style for multi-line comments:

/*
 * comment lines
 * comment lines
 */


+   movq %rbp, %rbx
+   subq $LOAD_PHYSICAL_ADDR, %rbx
+   movq %rbx, phys_base(%rip)
call xen_prepare_pvh
+   /* Clear phys_base.  __startup_64 will *add* to its value,
+* so reset to 0. */


Comment style again.


+   xor  %rbx, %rbx
+   movq %rbx, phys_base(%rip)
  
  	/* startup_64 expects boot_params in %rsi. */

lea rva(pvh_bootparams)(%ebp), %rsi


With above fixed:

Reviewed-by: Juergen Gross 


Juergen



Re: [PATCH v2 4/4] tools: Drop libsystemd as a dependency

2024-05-23 Thread Jürgen Groß

On 16.05.24 20:58, Andrew Cooper wrote:

There are no more users, and we want to disuade people from introducing new
users just for sd_notify() and friends.  Drop the dependency.

We still want the overall --with{,out}-systemd to gate the generation of the
service/unit/mount/etc files.

Rerun autogen.sh, and mark the dependency as removed in the build containers.

Signed-off-by: Andrew Cooper 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v2 4/4] tools: Drop libsystemd as a dependency

2024-05-23 Thread Jürgen Groß

On 16.05.24 20:58, Andrew Cooper wrote:

There are no more users, and we want to disuade people from introducing new
users just for sd_notify() and friends.  Drop the dependency.

We still want the overall --with{,out}-systemd to gate the generation of the
service/unit/mount/etc files.

Rerun autogen.sh, and mark the dependency as removed in the build containers.

Signed-off-by: Andrew Cooper 
---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: George Dunlap 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Christian Lindig 
CC: Edwin Török 

v2:
  * Only strip out the library check.
---
  automation/build/archlinux/current.dockerfile |   1 +
  .../build/suse/opensuse-leap.dockerfile   |   1 +
  .../build/suse/opensuse-tumbleweed.dockerfile |   1 +
  automation/build/ubuntu/focal.dockerfile  |   1 +
  config/Tools.mk.in|   2 -
  m4/systemd.m4 |   9 -
  tools/configure   | 256 --
  7 files changed, 4 insertions(+), 267 deletions(-)

diff --git a/automation/build/archlinux/current.dockerfile 
b/automation/build/archlinux/current.dockerfile
index 3e37ab5c40c1..d29f1358c2bd 100644
--- a/automation/build/archlinux/current.dockerfile
+++ b/automation/build/archlinux/current.dockerfile
@@ -37,6 +37,7 @@ RUN pacman -S --refresh --sysupgrade --noconfirm 
--noprogressbar --needed \
  sdl2 \
  spice \
  spice-protocol \
+# systemd for Xen < 4.19


Does this work as intended? A comment between the parameters and no "\" at the
end of the line?


Juergen



Re: [PATCH v2 3/4] tools/{c,o}xenstored: Don't link against libsystemd

2024-05-23 Thread Jürgen Groß

On 16.05.24 20:58, Andrew Cooper wrote:

Use the local freestanding wrapper instead.

Signed-off-by: Andrew Cooper 


Reviewed-by: Juergen Gross  # tools/xenstored


Juergen


---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: George Dunlap 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Christian Lindig 
CC: Edwin Török 

v2:
  * Redo almost from scratch, using the freestanding wrapper instead.
---
  tools/ocaml/xenstored/Makefile| 2 --
  tools/ocaml/xenstored/systemd_stubs.c | 2 +-
  tools/xenstored/Makefile  | 5 -
  tools/xenstored/posix.c   | 4 ++--
  4 files changed, 3 insertions(+), 10 deletions(-)

diff --git a/tools/ocaml/xenstored/Makefile b/tools/ocaml/xenstored/Makefile
index e8aaecf2e630..a8b8bb64698e 100644
--- a/tools/ocaml/xenstored/Makefile
+++ b/tools/ocaml/xenstored/Makefile
@@ -4,8 +4,6 @@ include $(OCAML_TOPLEVEL)/common.make
  
  # Include configure output (config.h)

  CFLAGS += -include $(XEN_ROOT)/tools/config.h
-CFLAGS-$(CONFIG_SYSTEMD)  += $(SYSTEMD_CFLAGS)
-LDFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_LIBS)
  
  CFLAGS  += $(CFLAGS-y)

  CFLAGS  += $(APPEND_CFLAGS)
diff --git a/tools/ocaml/xenstored/systemd_stubs.c 
b/tools/ocaml/xenstored/systemd_stubs.c
index f4c875075abe..7dbbdd35bf30 100644
--- a/tools/ocaml/xenstored/systemd_stubs.c
+++ b/tools/ocaml/xenstored/systemd_stubs.c
@@ -25,7 +25,7 @@
  
  #if defined(HAVE_SYSTEMD)
  
-#include 

+#include 
  
  CAMLprim value ocaml_sd_notify_ready(value ignore)

  {
diff --git a/tools/xenstored/Makefile b/tools/xenstored/Makefile
index e0897ed1ba30..09adfe1d5064 100644
--- a/tools/xenstored/Makefile
+++ b/tools/xenstored/Makefile
@@ -9,11 +9,6 @@ xenstored: LDLIBS += $(LDLIBS_libxenctrl)
  xenstored: LDLIBS += -lrt
  xenstored: LDLIBS += $(SOCKET_LIBS)
  
-ifeq ($(CONFIG_SYSTEMD),y)

-$(XENSTORED_OBJS-y): CFLAGS += $(SYSTEMD_CFLAGS)
-xenstored: LDLIBS += $(SYSTEMD_LIBS)
-endif
-
  TARGETS := xenstored
  
  .PHONY: all

diff --git a/tools/xenstored/posix.c b/tools/xenstored/posix.c
index d88c82d972d7..6037d739d013 100644
--- a/tools/xenstored/posix.c
+++ b/tools/xenstored/posix.c
@@ -27,7 +27,7 @@
  #include 
  #include 
  #if defined(HAVE_SYSTEMD)
-#include 
+#include 
  #endif
  #include 
  
@@ -393,7 +393,7 @@ void late_init(bool live_update)

  #if defined(HAVE_SYSTEMD)
if (!live_update) {
sd_notify(1, "READY=1");
-   fprintf(stderr, SD_NOTICE "xenstored is ready\n");
+   fprintf(stderr, "xenstored is ready\n");
}
  #endif
  }





Re: [PATCH v2 2/4] tools: Import standalone sd_notify() implementation from systemd

2024-05-23 Thread Jürgen Groß

On 16.05.24 20:58, Andrew Cooper wrote:

... in order to avoid linking against the whole of libsystemd.

Only minimal changes to the upstream copy, to function as a drop-in
replacement for sd_notify() and as a header-only library.

Signed-off-by: Andrew Cooper 


With s/cleanup(sd_closep)/cleanup(xen_sd_closep)/

Reviewed-by: Juergen Gross 


Juergen


---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: George Dunlap 
CC: Jan Beulich 
CC: Stefano Stabellini 
CC: Julien Grall 
CC: Christian Lindig 
CC: Edwin Török 

v2:
  * New
---
  tools/include/xen-sd-notify.h | 98 +++
  1 file changed, 98 insertions(+)
  create mode 100644 tools/include/xen-sd-notify.h

diff --git a/tools/include/xen-sd-notify.h b/tools/include/xen-sd-notify.h
new file mode 100644
index ..eda9d8b22d9e
--- /dev/null
+++ b/tools/include/xen-sd-notify.h
@@ -0,0 +1,98 @@
+/* SPDX-License-Identifier: MIT-0 */
+
+/*
+ * Implement the systemd notify protocol without external dependencies.
+ * Supports both readiness notification on startup and on reloading,
+ * according to the protocol defined at:
+ * https://www.freedesktop.org/software/systemd/man/latest/sd_notify.html
+ * This protocol is guaranteed to be stable as per:
+ * https://systemd.io/PORTABILITY_AND_STABILITY/
+ *
+ * Differences from the upstream copy:
+ * - Rename/rework as a drop-in replacement for systemd/sd-daemon.h
+ * - Only take the subset Xen cares about
+ * - Respect -Wdeclaration-after-statement
+ */
+
+#ifndef XEN_SD_NOTIFY
+#define XEN_SD_NOTIFY
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+static inline void xen_sd_closep(int *fd) {
+  if (!fd || *fd < 0)
+return;
+
+  close(*fd);
+  *fd = -1;
+}
+
+static inline int xen_sd_notify(const char *message) {
+  union sockaddr_union {
+struct sockaddr sa;
+struct sockaddr_un sun;
+  } socket_addr = {
+.sun.sun_family = AF_UNIX,
+  };
+  size_t path_length, message_length;
+  ssize_t written;
+  const char *socket_path;
+  int __attribute__((cleanup(sd_closep))) fd = -1;
+
+  /* Verify the argument first */
+  if (!message)
+return -EINVAL;
+
+  message_length = strlen(message);
+  if (message_length == 0)
+return -EINVAL;
+
+  /* If the variable is not set, the protocol is a noop */
+  socket_path = getenv("NOTIFY_SOCKET");
+  if (!socket_path)
+return 0; /* Not set? Nothing to do */
+
+  /* Only AF_UNIX is supported, with path or abstract sockets */
+  if (socket_path[0] != '/' && socket_path[0] != '@')
+return -EAFNOSUPPORT;
+
+  path_length = strlen(socket_path);
+  /* Ensure there is room for NUL byte */
+  if (path_length >= sizeof(socket_addr.sun.sun_path))
+return -E2BIG;
+
+  memcpy(socket_addr.sun.sun_path, socket_path, path_length);
+
+  /* Support for abstract socket */
+  if (socket_addr.sun.sun_path[0] == '@')
+socket_addr.sun.sun_path[0] = 0;
+
+  fd = socket(AF_UNIX, SOCK_DGRAM|SOCK_CLOEXEC, 0);
+  if (fd < 0)
+return -errno;
+
+  if (connect(fd, _addr.sa, offsetof(struct sockaddr_un, sun_path) + 
path_length) != 0)
+return -errno;
+
+  written = write(fd, message, message_length);
+  if (written != (ssize_t) message_length)
+return written < 0 ? -errno : -EPROTO;
+
+  return 1; /* Notified! */
+}
+
+static inline int sd_notify(int unset_environment, const char *message) {
+int r = xen_sd_notify(message);
+
+if (unset_environment)
+unsetenv("NOTIFY_SOCKET");
+
+return r;
+}
+
+#endif /* XEN_SD_NOTIFY */





Re: [XEN PATCH v8 1/5] xen/vpci: Clear all vpci status of device

2024-05-17 Thread Jürgen Groß

On 17.05.24 11:50, Jan Beulich wrote:

On 17.05.2024 11:28, Chen, Jiqian wrote:

On 2024/5/17 16:20, Jan Beulich wrote:

On 17.05.2024 10:08, Chen, Jiqian wrote:

On 2024/5/16 21:08, Jan Beulich wrote:

On 16.05.2024 11:52, Jiqian Chen wrote:

  struct physdev_pci_device {
  /* IN */
  uint16_t seg;


Is re-using this struct for this new sub-op sufficient? IOW are all
possible resets equal, and hence it doesn't need specifying what kind of
reset was done? For example, other than FLR most reset variants reset all
functions in one go aiui. Imo that would better require only a single
hypercall, just to avoid possible confusion. It also reads as if FLR would
not reset as many registers as other reset variants would.

If I understood correctly that you mean in this hypercall it needs to support 
resetting both one function and all functions of a slot(dev)?
But it can be done for caller to use a cycle to call this reset hypercall for 
each slot function.


It could, yes, but since (aiui) there needs to be an indication of the
kind of reset anyway, we can as well avoid relying on the caller doing
so (and at the same time simplify what the caller needs to do).

Since the corresponding kernel patch has been merged into linux_next branch
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?h=next-20240515=b272722511d5e8ae580f01830687b8a6b2717f01,
if it's not very mandatory and necessary, just let the caller handle it 
temporarily.


As also mentioned for the other patch having a corresponding kernel one:
The kernel patch would imo better not be merged until the new sub-op is
actually finalized.


Oh, sorry to have overlooked that the interfcae change isn't yet committed on
Xen side.

I'll drop the patch from my linux-next branch.


Juergen




Re: Xen crash in scheduler during cpu hotplug

2024-05-15 Thread Jürgen Groß

On 15.05.24 15:22, Andrew Cooper wrote:

On 15/05/2024 1:39 pm, Jan Beulich wrote:

On 15.05.2024 13:58, Andrew Cooper wrote:

Just so it doesn't get lost.  In XenRT, we've found:


(XEN) [ Xen-4.19.0-1-d  x86_64  debug=y  Tainted: H  ]
(XEN) CPU:    45
(XEN) RIP:    e008:[]
common/sched/credit.c#csched_load_balance+0x41/0x877
(XEN) RFLAGS: 00010092   CONTEXT: hypervisor
(XEN) rax: 82d040981618   rbx: 82d040981618   rcx:

(XEN) rdx: 003ff68cd000   rsi: 002d   rdi:
83103723d450
(XEN) rbp: 83207caa7d48   rsp: 83207caa7b98   r8:

(XEN) r9:  831037253cf0   r10: 83103767c3f0   r11:
0009
(XEN) r12: 831037237990   r13: 831037237990   r14:
831037253720
(XEN) r15:    cr0: 8005003b   cr4:
00f526e0
(XEN) cr3: 5bc2f000   cr2: 0010
(XEN) fsb:    gsb:    gss:

(XEN) ds:    es:    fs:    gs:    ss:    cs: e008
(XEN) Xen code around 
(common/sched/credit.c#csched_load_balance+0x41/0x877):
(XEN)  48 8b 0c 10 48 8b 49 08 <48> 8b 79 10 48 89 bd b8 fe ff ff 49
8b 4e 28 48

(XEN) Xen call trace:
(XEN)    [] R
common/sched/credit.c#csched_load_balance+0x41/0x877

While this is of course pretty little information, I've still tried to
decipher it, first noticing it's credit1 that's being used here. Once
forcing csched_load_balance() non-inline (no idea why it is a separate
function in your build), I can see a sufficiently matching pattern at
approximately the same offset into the function. That's

 const struct cpupool *c = get_sched_res(cpu)->cpupool;
 ...
 const cpumask_t *online = c->res_valid;
 ...
 BUG_ON(get_sched_res(cpu) != snext->unit->res);

overlapping, with the crash being on the middle of the quoted lines.
IOW the CPU pool is still NULL for this sched resource. Cc-ing
Jürgen for possible clues ...


We've seen it in 4.13, 4.17 and upstream, after Roger extended the
existing CPU hotplug testing to try and reproduce the MTRR watchdog
failure.  We've found yet another "no irq for handler" from this too.

It's always a deference at NULL+0x10, somewhere within csched_schedule().


I think I've found the reason.

In schedule_cpu_add() the cpupool and granularity are set only after
releasing the scheduling lock. I think those must be inside the locked
region.

Can you give this one a try (not tested at all)?

diff --git a/xen/common/sched/core.c b/xen/common/sched/core.c
index 0cb33831d2..babac7aad6 100644
--- a/xen/common/sched/core.c
+++ b/xen/common/sched/core.c
@@ -3176,6 +3176,8 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)

 sr->scheduler = new_ops;
 sr->sched_priv = ppriv;
+sr->granularity = cpupool_get_granularity(c);
+sr->cpupool = c;

 /*
  * Reroute the lock to the per pCPU lock as /last/ thing. In fact,
@@ -3188,8 +3190,6 @@ int schedule_cpu_add(unsigned int cpu, struct cpupool *c)
 /* _Not_ pcpu_schedule_unlock(): schedule_lock has changed! */
 spin_unlock_irqrestore(old_lock, flags);

-sr->granularity = cpupool_get_granularity(c);
-sr->cpupool = c;
 /* The  cpu is added to a pool, trigger it to go pick up some work */
 cpu_raise_softirq(cpu, SCHEDULE_SOFTIRQ);


Juergen




Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-13 Thread Jürgen Groß

On 13.05.24 09:47, Chen, Jiqian wrote:

Hi,
On 2024/5/10 17:06, Chen, Jiqian wrote:

Hi,

On 2024/5/10 14:46, Jürgen Groß wrote:

On 19.04.24 05:36, Jiqian Chen wrote:

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
the irq number is alloced from small to large, but the
applying gsi number is not, may gsi 38 comes before gsi 28,
it causes the irq number is not equal with the gsi number.
And when passthrough a device, QEMU will use device's gsi
number to do pirq mapping, but the gsi number is got from
file /sys/bus/pci/devices//irq, irq!= gsi, so it will
fail when mapping.
And in current linux codes, there is no method to translate
irq to gsi for userspace.

For above purpose, record the relationship of gsi and irq
when PVH dom0 do acpi_register_gsi_ioapic for devices and
adds a new syscall into privcmd to let userspace can get
that translation when they have a need.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
   arch/x86/include/asm/apic.h  |  8 +++
   arch/x86/include/asm/xen/pci.h   |  5 
   arch/x86/kernel/acpi/boot.c  |  2 +-
   arch/x86/pci/xen.c   | 21 +
   drivers/xen/events/events_base.c | 39 
   drivers/xen/privcmd.c    | 19 
   include/uapi/xen/privcmd.h   |  7 ++
   include/xen/events.h |  5 
   8 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9d159b771dc8..dd4139250895 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -169,6 +169,9 @@ extern bool apic_needs_pit(void);
     extern void apic_send_IPI_allbutself(unsigned int vector);
   +extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+    int trigger, int polarity);
+
   #else /* !CONFIG_X86_LOCAL_APIC */
   static inline void lapic_shutdown(void) { }
   #define local_apic_timer_c2_ok    1
@@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { }
   static inline void lapic_assign_system_vectors(void) { }
   static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
   static inline bool apic_needs_pit(void) { return true; }
+static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+    int trigger, int polarity)
+{
+    return (int)gsi;
+}
   #endif /* !CONFIG_X86_LOCAL_APIC */
     #ifdef CONFIG_X86_X2APIC
diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
index 9015b888edd6..aa8ded61fc2d 100644
--- a/arch/x86/include/asm/xen/pci.h
+++ b/arch/x86/include/asm/xen/pci.h
@@ -5,6 +5,7 @@
   #if defined(CONFIG_PCI_XEN)
   extern int __init pci_xen_init(void);
   extern int __init pci_xen_hvm_init(void);
+extern int __init pci_xen_pvh_init(void);
   #define pci_xen 1
   #else
   #define pci_xen 0
@@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
   {
   return -1;
   }
+static inline int pci_xen_pvh_init(void)
+{
+    return -1;
+}
   #endif
   #ifdef CONFIG_XEN_PV_DOM0
   int __init pci_xen_initial_domain(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 85a3ce2a3666..72c73458c083 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 
gsi,
   }
     #ifdef CONFIG_X86_LOCAL_APIC
-static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
   int trigger, int polarity)
   {
   int irq = gsi;
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 652cd53e77f6..f056ab5c0a06 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, 
u32 gsi,
    false /* no mapping of GSI to PIRQ */);
   }
   +static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,
+    int trigger, int polarity)
+{
+    int irq;
+
+    irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
+    if (irq < 0)
+    return irq;
+
+    if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST)
+    printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, irq);
+
+    return irq;
+}
+
   #ifdef CONFIG_XEN_PV_DOM0
   static int xen_register_gsi(u32 gsi, int triggering, int polarity)
   {
@@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void)
   return 0;
   }
   +int __init pci_xen_pvh_init(void)
+{
+    __acpi_register_gsi = acpi_register_gsi_xen_pvh;


No support for unregistering the gsi again?

__acpi_unregister_gsi is set in function acpi_set_irq_model_ioapic.
Maybe I need to use a new function to call acpi_unregister_gsi_ioapic and 
remove the mapping of irq and gsi from xen_irq_list_head ?

When I tried to support unregist

Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-10 Thread Jürgen Groß

On 10.05.24 12:32, Chen, Jiqian wrote:

On 2024/5/10 18:21, Jürgen Groß wrote:

On 10.05.24 12:13, Chen, Jiqian wrote:

On 2024/5/10 17:53, Jürgen Groß wrote:

On 10.05.24 11:06, Chen, Jiqian wrote:

Hi,

On 2024/5/10 14:46, Jürgen Groß wrote:

On 19.04.24 05:36, Jiqian Chen wrote:

+
+    info->type = IRQT_PIRQ;

I am considering whether I need to use a new type(like IRQT_GSI) here to 
distinguish with IRQT_PIRQ, because function restore_pirqs will process all 
IRQT_PIRQ.


restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this
enough?

No, it is not enough.
xen_pvh_add_gsi_irq_map adds the mapping of gsi and irq, but the value of gsi 
is not 0,
once restore_pirqs is called, it will do PHYSDEVOP_map_pirq for that gsi, but 
in pvh dom0, we shouldn't do PHYSDEVOP_map_pirq.


Okay, then add a new flag to info->u.pirq.flags for that purpose?

I feel like adding "new flag to info->u.pirq.flags" is not as good as adding " new type 
to info->type".
Because in restore_pirqs, it considers " info->type != IRQT_PIRQ", if adding " new flag 
to info->u.pirq.flags", we need to add a new condition in restore_pirqs.
And actually this mapping(gsi and irq of pvh) doesn't have pirq, so it is not 
suitable to add to u.pirq.flags.


Does this mean there is no other IRQT_PIRQ related activity relevant for those
GSIs/IRQs? In that case I agree to add IRQT_GSI.


Juergen



Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-10 Thread Jürgen Groß

On 10.05.24 12:13, Chen, Jiqian wrote:

On 2024/5/10 17:53, Jürgen Groß wrote:

On 10.05.24 11:06, Chen, Jiqian wrote:

Hi,

On 2024/5/10 14:46, Jürgen Groß wrote:

On 19.04.24 05:36, Jiqian Chen wrote:

+
+    info->type = IRQT_PIRQ;

I am considering whether I need to use a new type(like IRQT_GSI) here to 
distinguish with IRQT_PIRQ, because function restore_pirqs will process all 
IRQT_PIRQ.


restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this
enough?

No, it is not enough.
xen_pvh_add_gsi_irq_map adds the mapping of gsi and irq, but the value of gsi 
is not 0,
once restore_pirqs is called, it will do PHYSDEVOP_map_pirq for that gsi, but 
in pvh dom0, we shouldn't do PHYSDEVOP_map_pirq.


Okay, then add a new flag to info->u.pirq.flags for that purpose?


Juergen




Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-10 Thread Jürgen Groß

On 10.05.24 11:06, Chen, Jiqian wrote:

Hi,

On 2024/5/10 14:46, Jürgen Groß wrote:

On 19.04.24 05:36, Jiqian Chen wrote:

+
+    info->type = IRQT_PIRQ;

I am considering whether I need to use a new type(like IRQT_GSI) here to 
distinguish with IRQT_PIRQ, because function restore_pirqs will process all 
IRQT_PIRQ.


restore_pirqs() already considers gsi == 0 to be not GSI related. Isn't this
enough?


Juergen



Re: [PATCH 1/5] xen: sync elfnote.h from xen tree

2024-05-10 Thread Jürgen Groß

On 10.04.24 21:48, Jason Andryuk wrote:

Sync Xen's elfnote.h header from xen.git to pull in the
XEN_ELFNOTE_PHYS32_RELOC define.

xen commit dfc9fab00378 ("x86/PVH: Support relocatable dom0 kernels")

This is a copy except for the removal of the emacs editor config at the
end of the file.

Signed-off-by: Jason Andryuk 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] libxl: Fix handling XenStore errors in device creation

2024-05-10 Thread Jürgen Groß

On 27.04.24 04:17, Demi Marie Obenour wrote:

If xenstored runs out of memory it is possible for it to fail operations
that should succeed.  libxl wasn't robust against this, and could fail
to ensure that the TTY path of a non-initial console was created and
read-only for guests.  This doesn't qualify for an XSA because guests
should not be able to run xenstored out of memory, but it still needs to
be fixed.

Add the missing error checks to ensure that all errors are properly
handled and that at no point can a guest make the TTY path of its
frontend directory writable.

Signed-off-by: Demi Marie Obenour 


Apart from one nit below:

Reviewed-by: Juergen Gross 


---
  tools/libs/light/libxl_console.c | 10 ++---
  tools/libs/light/libxl_device.c  | 72 
  tools/libs/light/libxl_xshelp.c  | 13 --
  3 files changed, 59 insertions(+), 36 deletions(-)

diff --git a/tools/libs/light/libxl_console.c b/tools/libs/light/libxl_console.c
index 
cd7412a3272a2faf4b9dab0ef4dd077e55472546..adf82aa844a4f4989111bfc8a94af18ad8e114f1
 100644
--- a/tools/libs/light/libxl_console.c
+++ b/tools/libs/light/libxl_console.c
@@ -351,11 +351,10 @@ int libxl__device_console_add(libxl__gc *gc, uint32_t 
domid,
  flexarray_append(front, "protocol");
  flexarray_append(front, LIBXL_XENCONSOLE_PROTOCOL);
  }
-libxl__device_generic_add(gc, XBT_NULL, device,
-  libxl__xs_kvs_of_flexarray(gc, back),
-  libxl__xs_kvs_of_flexarray(gc, front),
-  libxl__xs_kvs_of_flexarray(gc, ro_front));
-rc = 0;
+rc = libxl__device_generic_add(gc, XBT_NULL, device,
+   libxl__xs_kvs_of_flexarray(gc, back),
+   libxl__xs_kvs_of_flexarray(gc, front),
+   libxl__xs_kvs_of_flexarray(gc, ro_front));
  out:
  return rc;
  }
@@ -665,6 +664,7 @@ int libxl_device_channel_getinfo(libxl_ctx *ctx, uint32_t 
domid,
*/
   if (!val) val = "/NO-SUCH-PATH";
   channelinfo->u.pty.path = strdup(val);
+ if (channelinfo->u.pty.path == NULL) abort();


Even with the bad example 2 lines up, please put the "abort();" into a
line of its own.


Juergen



Re: [RFC KERNEL PATCH v6 3/3] xen/privcmd: Add new syscall to get gsi from irq

2024-05-10 Thread Jürgen Groß

On 19.04.24 05:36, Jiqian Chen wrote:

In PVH dom0, it uses the linux local interrupt mechanism,
when it allocs irq for a gsi, it is dynamic, and follow
the principle of applying first, distributing first. And
the irq number is alloced from small to large, but the
applying gsi number is not, may gsi 38 comes before gsi 28,
it causes the irq number is not equal with the gsi number.
And when passthrough a device, QEMU will use device's gsi
number to do pirq mapping, but the gsi number is got from
file /sys/bus/pci/devices//irq, irq!= gsi, so it will
fail when mapping.
And in current linux codes, there is no method to translate
irq to gsi for userspace.

For above purpose, record the relationship of gsi and irq
when PVH dom0 do acpi_register_gsi_ioapic for devices and
adds a new syscall into privcmd to let userspace can get
that translation when they have a need.

Co-developed-by: Huang Rui 
Signed-off-by: Jiqian Chen 
---
  arch/x86/include/asm/apic.h  |  8 +++
  arch/x86/include/asm/xen/pci.h   |  5 
  arch/x86/kernel/acpi/boot.c  |  2 +-
  arch/x86/pci/xen.c   | 21 +
  drivers/xen/events/events_base.c | 39 
  drivers/xen/privcmd.c| 19 
  include/uapi/xen/privcmd.h   |  7 ++
  include/xen/events.h |  5 
  8 files changed, 105 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index 9d159b771dc8..dd4139250895 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -169,6 +169,9 @@ extern bool apic_needs_pit(void);
  
  extern void apic_send_IPI_allbutself(unsigned int vector);
  
+extern int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,

+   int trigger, int polarity);
+
  #else /* !CONFIG_X86_LOCAL_APIC */
  static inline void lapic_shutdown(void) { }
  #define local_apic_timer_c2_ok1
@@ -183,6 +186,11 @@ static inline void apic_intr_mode_init(void) { }
  static inline void lapic_assign_system_vectors(void) { }
  static inline void lapic_assign_legacy_vector(unsigned int i, bool r) { }
  static inline bool apic_needs_pit(void) { return true; }
+static inline int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+   int trigger, int polarity)
+{
+   return (int)gsi;
+}
  #endif /* !CONFIG_X86_LOCAL_APIC */
  
  #ifdef CONFIG_X86_X2APIC

diff --git a/arch/x86/include/asm/xen/pci.h b/arch/x86/include/asm/xen/pci.h
index 9015b888edd6..aa8ded61fc2d 100644
--- a/arch/x86/include/asm/xen/pci.h
+++ b/arch/x86/include/asm/xen/pci.h
@@ -5,6 +5,7 @@
  #if defined(CONFIG_PCI_XEN)
  extern int __init pci_xen_init(void);
  extern int __init pci_xen_hvm_init(void);
+extern int __init pci_xen_pvh_init(void);
  #define pci_xen 1
  #else
  #define pci_xen 0
@@ -13,6 +14,10 @@ static inline int pci_xen_hvm_init(void)
  {
return -1;
  }
+static inline int pci_xen_pvh_init(void)
+{
+   return -1;
+}
  #endif
  #ifdef CONFIG_XEN_PV_DOM0
  int __init pci_xen_initial_domain(void);
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 85a3ce2a3666..72c73458c083 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -749,7 +749,7 @@ static int acpi_register_gsi_pic(struct device *dev, u32 
gsi,
  }
  
  #ifdef CONFIG_X86_LOCAL_APIC

-static int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
+int acpi_register_gsi_ioapic(struct device *dev, u32 gsi,
int trigger, int polarity)
  {
int irq = gsi;
diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c
index 652cd53e77f6..f056ab5c0a06 100644
--- a/arch/x86/pci/xen.c
+++ b/arch/x86/pci/xen.c
@@ -114,6 +114,21 @@ static int acpi_register_gsi_xen_hvm(struct device *dev, 
u32 gsi,
 false /* no mapping of GSI to PIRQ */);
  }
  
+static int acpi_register_gsi_xen_pvh(struct device *dev, u32 gsi,

+   int trigger, int polarity)
+{
+   int irq;
+
+   irq = acpi_register_gsi_ioapic(dev, gsi, trigger, polarity);
+   if (irq < 0)
+   return irq;
+
+   if (xen_pvh_add_gsi_irq_map(gsi, irq) == -EEXIST)
+   printk(KERN_INFO "Already map the GSI :%u and IRQ: %d\n", gsi, 
irq);
+
+   return irq;
+}
+
  #ifdef CONFIG_XEN_PV_DOM0
  static int xen_register_gsi(u32 gsi, int triggering, int polarity)
  {
@@ -558,6 +573,12 @@ int __init pci_xen_hvm_init(void)
return 0;
  }
  
+int __init pci_xen_pvh_init(void)

+{
+   __acpi_register_gsi = acpi_register_gsi_xen_pvh;


No support for unregistering the gsi again?


+   return 0;
+}
+
  #ifdef CONFIG_XEN_PV_DOM0
  int __init pci_xen_initial_domain(void)
  {
diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 27553673e46b..80d4f7faac64 100644
--- a/drivers/xen/events/events_base.c
+++ 

Re: [PATCH] xen/x86: add extra pages to unpopulated-alloc if available

2024-05-10 Thread Jürgen Groß

On 29.04.24 17:50, Roger Pau Monne wrote:

Commit 262fc47ac174 ('xen/balloon: don't use PV mode extra memory for zone
device allocations') removed the addition of the extra memory ranges to the
unpopulated range allocator, using those only for the balloon driver.

This forces the unpopulated allocator to attach hotplug ranges even when spare
memory (as part of the extra memory ranges) is available.  Furthermore, on PVH
domains it defeats the purpose of commit 38620fc4e893 ('x86/xen: attempt to
inflate the memory balloon on PVH'), as extra memory ranges would only be
used to map foreign memory if the kernel is built without XEN_UNPOPULATED_ALLOC
support.

Fix this by adding a helpers that adds the extra memory ranges to the list of
unpopulated pages, and zeroes the ranges so they are not also consumed by the
balloon driver.

This should have been part of 38620fc4e893, hence the fixes tag.

Note the current logic relies on unpopulated_init() (and hence
arch_xen_unpopulated_init()) always being called ahead of balloon_init(), so
that the extra memory regions are consumed by arch_xen_unpopulated_init().

Fixes: 38620fc4e893 ('x86/xen: attempt to inflate the memory balloon on PVH')
Signed-off-by: Roger Pau Monné 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] tools/libxs: Open /dev/xen/xenbus fds as O_CLOEXEC

2024-05-04 Thread Jürgen Groß

On 04.05.24 03:16, Andrew Cooper wrote:

The header description for xs_open() goes as far as to suggest that the fd is
O_CLOEXEC, but it isn't actually.

`xl devd` has been observed leaking /dev/xen/xenbus into children.

Link: https://github.com/QubesOS/qubes-issues/issues/8292
Reported-by: Demi Marie Obenour 
Signed-off-by: Andrew Cooper 


With the style breakage below fixed:

Reviewed-by: Juergen Gross 


---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: Demi Marie Obenour 
CC: Marek Marczykowski-Górecki 

Entirely speculative patch based on a Matrix report
---
  tools/libs/store/xs.c | 6 +-
  1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/tools/libs/store/xs.c b/tools/libs/store/xs.c
index 140b9a28395e..1f74fb3c44a2 100644
--- a/tools/libs/store/xs.c
+++ b/tools/libs/store/xs.c
@@ -54,6 +54,10 @@ struct xs_stored_msg {
  #include 
  #endif
  
+#ifndef O_CLOEXEC

+#define O_CLOEXEC 0
+#endif
+
  struct xs_handle {
/* Communications channel to xenstore daemon. */
int fd;
@@ -227,7 +231,7 @@ static int get_socket(const char *connect_to)
  static int get_dev(const char *connect_to)
  {
/* We cannot open read-only because requests are writes */
-   return open(connect_to, O_RDWR);
+   return open(connect_to, O_RDWR|O_CLOEXEC);


Nit: spaces around the "|", please.


Juergen


  }
  
  static int all_restrict_cb(Xentoolcore__Active_Handle *ah, domid_t domid) {


base-commit: feb9158a620040846d76981acbe8ea9e2255a07b





Re: [PATCH v6 8/8] xen: allow up to 16383 cpus

2024-04-29 Thread Jürgen Groß

On 29.04.24 13:04, Julien Grall wrote:

Hi Juergen,

Sorry for the late reply.

On 29/04/2024 11:33, Juergen Gross wrote:

On 08.04.24 09:10, Jan Beulich wrote:

On 27.03.2024 16:22, Juergen Gross wrote:

With lock handling now allowing up to 16384 cpus (spinlocks can handle
65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for
the number of cpus to be configured to 16383.

The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and
QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS.

Signed-off-by: Juergen Gross 


Acked-by: Jan Beulich 

I'd prefer this to also gain an Arm ack, though.


Any comment from Arm side?


Can you clarify what the new limits mean in term of (security) support? Are we 
now claiming that Xen will work perfectly fine on platforms with up to 16383?


If so, I can't comment for x86, but for Arm, I am doubtful that it would work 
without any (at least performance) issues. AFAIK, this is also an untested 
configuration. In fact I would be surprised if Xen on Arm was tested with more 
than a couple of hundreds cores (AFAICT the Ampere CPUs has 192 CPUs).


I think we should add a security support limit for the number of physical
cpus similar to the memory support limit we already have in place.

For x86 I'd suggest 4096 cpus for security support (basically the limit we
have with this patch), but I'm open for other suggestions, too.

I have no idea about any sensible limits for Arm32/Arm64.


Juergen



Re: [PATCH 1/2] tools/{c,o}xenstored: Don't link against libsystemd

2024-04-29 Thread Jürgen Groß

On 25.04.24 19:32, Andrew Cooper wrote:

libsystemd is a giant dependency for one single function, but in the wake of
the xz backdoor, it turns out that even systemd leadership recommend against
linking against libsystemd for sd_notify().

Since commit 7b61011e1450 ("tools: make xenstore domain easy configurable") in
Xen 4.8, the launch-xenstore script invokes systemd-notify directly, so its
not even necessary for the xenstored's to call sd_notify() themselves.


You are aware that in the daemon case the call of systemd-notify does not
signal readyness of xenstored? It is just called with the "--booted" parameter
in order to detect whether systemd is active or not.

So in order to just drop the sd_notify() call from xenstored you need to
modify the launch-xenstore script, too.


Juergen



Therefore, just drop the calls to sd_notify() and stop linking against
libsystemd.

No functional change.

Signed-off-by: Andrew Cooper 
---
CC: Anthony PERARD 
CC: Juergen Gross 
CC: Christian Lindig 
CC: Edwin Török 
CC: Stefano Stabellini 
---
  tools/ocaml/xenstored/Makefile| 12 +--
  tools/ocaml/xenstored/systemd.ml  | 15 -
  tools/ocaml/xenstored/systemd.mli | 16 -
  tools/ocaml/xenstored/systemd_stubs.c | 47 ---
  tools/ocaml/xenstored/xenstored.ml|  1 -
  tools/xenstored/Makefile  |  5 ---
  tools/xenstored/posix.c   |  9 -
  7 files changed, 1 insertion(+), 104 deletions(-)
  delete mode 100644 tools/ocaml/xenstored/systemd.ml
  delete mode 100644 tools/ocaml/xenstored/systemd.mli
  delete mode 100644 tools/ocaml/xenstored/systemd_stubs.c

diff --git a/tools/ocaml/xenstored/Makefile b/tools/ocaml/xenstored/Makefile
index e8aaecf2e630..1e4b51cc5432 100644
--- a/tools/ocaml/xenstored/Makefile
+++ b/tools/ocaml/xenstored/Makefile
@@ -4,8 +4,6 @@ include $(OCAML_TOPLEVEL)/common.make
  
  # Include configure output (config.h)

  CFLAGS += -include $(XEN_ROOT)/tools/config.h
-CFLAGS-$(CONFIG_SYSTEMD)  += $(SYSTEMD_CFLAGS)
-LDFLAGS-$(CONFIG_SYSTEMD) += $(SYSTEMD_LIBS)
  
  CFLAGS  += $(CFLAGS-y)

  CFLAGS  += $(APPEND_CFLAGS)
@@ -25,13 +23,6 @@ poll_OBJS = poll
  poll_C_OBJS = select_stubs
  OCAML_LIBRARY = syslog poll
  
-LIBS += systemd.cma systemd.cmxa

-systemd_OBJS = systemd
-systemd_C_OBJS = systemd_stubs
-OCAML_LIBRARY += systemd
-
-LIBS_systemd += $(LDFLAGS-y)
-
  OBJS = paths \
define \
stdext \
@@ -56,12 +47,11 @@ OBJS = paths \
process \
xenstored
  
-INTF = symbol.cmi trie.cmi syslog.cmi systemd.cmi poll.cmi

+INTF = symbol.cmi trie.cmi syslog.cmi poll.cmi
  
  XENSTOREDLIBS = \

unix.cmxa \
-ccopt -L -ccopt . syslog.cmxa \
-   -ccopt -L -ccopt . systemd.cmxa \
-ccopt -L -ccopt . poll.cmxa \
-ccopt -L -ccopt $(OCAML_TOPLEVEL)/libs/mmap 
$(OCAML_TOPLEVEL)/libs/mmap/xenmmap.cmxa \
-ccopt -L -ccopt $(OCAML_TOPLEVEL)/libs/eventchn 
$(OCAML_TOPLEVEL)/libs/eventchn/xeneventchn.cmxa \
diff --git a/tools/ocaml/xenstored/systemd.ml b/tools/ocaml/xenstored/systemd.ml
deleted file mode 100644
index 39127f712d72..
--- a/tools/ocaml/xenstored/systemd.ml
+++ /dev/null
@@ -1,15 +0,0 @@
-(*
- * Copyright (C) 2014 Luis R. Rodriguez 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU Lesser General Public License as published
- * by the Free Software Foundation; version 2.1 only. with the special
- * exception on linking described in file LICENSE.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU Lesser General Public License for more details.
- *)
-
-external sd_notify_ready: unit -> unit = "ocaml_sd_notify_ready"
diff --git a/tools/ocaml/xenstored/systemd.mli 
b/tools/ocaml/xenstored/systemd.mli
deleted file mode 100644
index 18b9331031f9..
--- a/tools/ocaml/xenstored/systemd.mli
+++ /dev/null
@@ -1,16 +0,0 @@
-(*
- * Copyright (C) 2014 Luis R. Rodriguez 
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU Lesser General Public License as published
- * by the Free Software Foundation; version 2.1 only. with the special
- * exception on linking described in file LICENSE.
- *
- * This program is distributed in the hope that it will be useful,
- * but WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
- * GNU Lesser General Public License for more details.
- *)
-
-(** Tells systemd we're ready *)
-external sd_notify_ready: unit -> unit = "ocaml_sd_notify_ready"
diff --git a/tools/ocaml/xenstored/systemd_stubs.c 
b/tools/ocaml/xenstored/systemd_stubs.c
deleted file mode 100644
index f4c875075abe..
--- a/tools/ocaml/xenstored/systemd_stubs.c
+++ /dev/null
@@ -1,47 +0,0 @@
-/*

Re: Detecting whether dom0 is in a VM

2024-04-22 Thread Jürgen Groß

On 22.04.24 09:12, Jan Beulich wrote:

On 19.04.2024 17:29, George Dunlap wrote:

On Fri, Jul 7, 2023 at 3:56 PM George Dunlap  wrote:

Xen's public interface offers access to the featuresets known / found /
used by the hypervisor. See XEN_SYSCTL_get_cpu_featureset, accessible
via xc_get_cpu_featureset().



Are any of these exposed in dom0 via sysctl, or hypfs?


sysctl - yes (as the quoted name also says). hypfs no, afaict.


  SYSCTLs are
unfortunately not stable interfaces, correct?  So it wouldn't be practical
for systemd to use them.


Indeed, neither sysctl-s nor the libxc interfaces are stable.


Thinking of it, xen-cpuid is a wrapper tool around those. They may want
to look at its output (and, if they want to use it, advise distros to
also package it), which I think we try to keep reasonably stable,
albeit without providing any guarantees.



We haven't had any clear guidance yet on what the systemd team want in the  question; I just sort of assumed they wanted the L-1 
virtualization *if possible*.  It sounds like `vm-other` would be acceptable, 
particularly as a fall-back output if there's no way to get Xen's picture of the 
cpuid.

It looks like xen-cpuid is available on Fedora, Debian, Ubuntu, and the old 
Virt SIG CentOS packages; so I'd expect most packages to follow suit.  That's a 
place to start.

Just to take the discussion all the way to its conclusion:

- Supposing xen-cpuid isn't available, is there any other way to tell if Xen is 
running in a VM from dom0?

- Would it make sense to expose that information somewhere, either in sysfs or 
in hypfs (or both?), so that eventually even systems which may not get the memo 
about packaging xen-cpuid will get support (or if the systemd guys would rather 
avoid executing another process if possible)?


Resurrecting this thread.

To recap:

Currently, `systemd-detect-virt` has the following  input / output table:

1. Xen on real hardware, domU: xen
2. Xen on real hardware, dom0: vm-other
3. Xen in a VM, domU: xen
4. Xen in aVM, dom0: vm-other

It's the dom0 lines (2 and 4) which are problematic; we'd ideally like
those to be `none` and `vm-other` (or the actual value, like `xen` or
`kvm`).

It looks like ATM, /proc/xen/capabilities will contain `control_d` if
it's a dom0.  Simply unilaterally returning `none` if
/proc/xen/capabilities contains `control_d` would correct the vast
majority of instances (since the number of instances of Xen on real
hardware is presumably higher than the number running virtualized).

Is /proc/xen/capabilities expected to stay around?  I don't see
anything equivalent in /dev/xen.


I believe it's intended to stay around, but a definitive answer can only
come from Linux folks. Jürgen, Stefano?


I have no plans to remove it.


Juergen



Jan


Other than adding a new interface to Xen, is there any way to tell if
Xen is running in a VM?  If we do need to expose an interface, what
would be the best way to do that?

  -George







Re: [PATCH] locking/x86/xen: Use try_cmpxchg() in xen_alloc_p2m_entry()

2024-04-11 Thread Jürgen Groß

On 05.04.24 10:32, Uros Bizjak wrote:

Use try_cmpxchg() instead of cmpxchg(*ptr, old, new) == old.

The x86 CMPXCHG instruction returns success in the ZF flag,
so this change saves a compare after CMPXCHG.

Also, try_cmpxchg() implicitly assigns old *ptr value to "old"
when CMPXCHG fails. There is no need to explicitly assign
old *ptr value to the temporary, which can simplify the
surrounding source code.

No functional change intended.

Signed-off-by: Uros Bizjak 
Cc: Juergen Gross 
Cc: Boris Ostrovsky 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 


Reviewed-by: Juergen Gross 


Juergen



Re: [PATCH] x86/pat: fix W^X violation false-positives when running as Xen PV guest

2024-04-10 Thread Jürgen Groß

On 10.04.24 15:48, Ingo Molnar wrote:


* Juergen Gross  wrote:


When running as Xen PV guest in some cases W^X violation WARN()s have
been observed. Those WARN()s are produced by verify_rwx(), which looks
into the PTE to verify that writable kernel pages have the NX bit set
in order to avoid code modifications of the kernel by rogue code.

As the NX bits of all levels of translation entries are or-ed and the
RW bits of all levels are and-ed, looking just into the PTE isn't enough
for the decision that a writable page is executable, too. When running
as a Xen PV guest, kernel initialization will set the NX bit in PMD
entries of the initial page tables covering the .data segment.

When finding the PTE to have set the RW bit but no NX bit, higher level
entries must be looked at. Only when all levels have the RW bit set and
no NX bit set, the W^X violation should be flagged.

Additionally show_fault_oops() has a similar problem: it will issue the
"kernel tried to execute NX-protected page" message only if it finds
the NX bit set in the leaf translation entry, while any NX bit in
non-leaf entries are being ignored for issuing the message.

Modify lookup_address_in_pgd() to return the effective NX and RW bit
values of the non-leaf translation entries and evaluate those as well
in verify_rwx() and show_fault_oops().


Ok, this fix makes sense, as that's how the hardware works and we interpret
the pagetables poorly.


Thanks for confirmation that my approach is sane.




Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations")
Reported-by: Jason Andryuk 
Signed-off-by: Juergen Gross 
---
  arch/x86/include/asm/pgtable_types.h |  2 +-
  arch/x86/kernel/sev.c|  3 +-
  arch/x86/mm/fault.c  |  7 ++--
  arch/x86/mm/pat/set_memory.c | 56 +---
  arch/x86/virt/svm/sev.c  |  3 +-
  5 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h 
b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..91ab538d3872 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -565,7 +565,7 @@ static inline void update_page_count(int level, unsigned 
long pages) { }
   */
  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
  extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
-   unsigned int *level);
+   unsigned int *level, bool *nx, bool *rw);
  extern pmd_t *lookup_pmd_address(unsigned long address);
  extern phys_addr_t slow_virt_to_phys(void *__address);
  extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,


Please introduce a new lookup_address_in_pgd_attr() function or so, which
is used by code intentionally.

This avoids changing the arch/x86/kernel/sev.c and arch/x86/virt/svm/sev.c
uses, that retrieve these attributes but don't do anything with them:


Okay.




diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index 38ad066179d8..adba581e999d 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -516,12 +516,13 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb 
*ghcb, struct es_em_ctxt
unsigned long va = (unsigned long)vaddr;
unsigned int level;
phys_addr_t pa;
+   bool nx, rw;
pgd_t *pgd;
pte_t *pte;
  
  	pgd = __va(read_cr3_pa());

pgd = [pgd_index(va)];
-   pte = lookup_address_in_pgd(pgd, va, );
+   pte = lookup_address_in_pgd(pgd, va, , , );
if (!pte) {
ctxt->fi.vector = X86_TRAP_PF;
ctxt->fi.cr2= vaddr;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 622d12ec7f08..eb8e897a5653 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -514,18 +514,19 @@ show_fault_oops(struct pt_regs *regs, unsigned long 
error_code, unsigned long ad
  
  	if (error_code & X86_PF_INSTR) {

unsigned int level;
+   bool nx, rw;
pgd_t *pgd;
pte_t *pte;
  
  		pgd = __va(read_cr3_pa());

pgd += pgd_index(address);
  
-		pte = lookup_address_in_pgd(pgd, address, );

+   pte = lookup_address_in_pgd(pgd, address, , , );
  
-		if (pte && pte_present(*pte) && !pte_exec(*pte))

+   if (pte && pte_present(*pte) && (!pte_exec(*pte) || nx))
pr_crit("kernel tried to execute NX-protected page - exploit 
attempt? (uid: %d)\n",
from_kuid(_user_ns, current_uid()));
-   if (pte && pte_present(*pte) && pte_exec(*pte) &&
+   if (pte && pte_present(*pte) && pte_exec(*pte) && !nx &&
(pgd_flags(*pgd) & _PAGE_USER) &&
(__read_cr4() & X86_CR4_SMEP))
pr_crit("unable to execute userspace code (SMEP?) (uid: 
%d)\n",


This should be a separate patch - as it might change 

Re: [PATCH 2/2] x86/xen: return a sane initial apic id when running as PV guest

2024-04-05 Thread Jürgen Groß

On 05.04.24 14:50, Andrew Cooper wrote:

On 05/04/2024 1:34 pm, Juergen Gross wrote:

With recent sanity checks for topology information added, there are now
warnings issued for APs when running as a Xen PV guest:

   [Firmware Bug]: CPU   1: APIC ID mismatch. CPUID: 0x APIC: 0x0001

This is due to the initial APIC ID obtained via CPUID for PV guests is
always 0.


/sigh

 From Xen:

     switch ( leaf )
     {
     case 0x1:
     /* TODO: Rework topology logic. */
     res->b &= 0x00ffu;
     if ( is_hvm_domain(d) )
     res->b |= (v->vcpu_id * 2) << 24;


I think there's a very good chance it was random prior to Xen 4.6.  That
used to come straight out of a CPUID value, so would get the APIC ID of
whichever pCPU it was scheduled on.


Avoid the warnings by synthesizing the CPUID data to contain the same
initial APIC ID as xen_pv_smp_config() is using for registering the
APIC IDs of all CPUs.

Fixes: 52128a7a21f7 ("86/cpu/topology: Make the APIC mismatch warnings 
complete")
Signed-off-by: Juergen Gross 
---
  arch/x86/xen/enlighten_pv.c | 10 +-
  1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index ace2eb054053..965e4ca36024 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -219,13 +219,20 @@ static __read_mostly unsigned int cpuid_leaf5_edx_val;
  static void xen_cpuid(unsigned int *ax, unsigned int *bx,
  unsigned int *cx, unsigned int *dx)
  {
-   unsigned maskebx = ~0;
+   unsigned int maskebx = ~0;
+   unsigned int or_ebx = 0;
  
  	/*

 * Mask out inconvenient features, to try and disable as many
 * unsupported kernel subsystems as possible.
 */
switch (*ax) {
+   case 0x1:
+   /* Replace initial APIC ID in bits 24-31 of EBX. */
+   maskebx = 0x00ff;
+   or_ebx = smp_processor_id() << 24;


I think the comment wants to cross-reference explicitly with
xen_pv_smp_config(), because what we care about here is the two sources
of information matching.


I can add that as a comment. OTOH I'd really hope someone changing this
code later would look into the commit message of the patch adding it. :-)



Also while you're at it, the x2APIC ID in leaf 0xb.


I'm not sure this is functionally relevant in PV guests.

Note that my patch is only meant to silence warnings during boot. It is not
needed for the system working correctly (at least I think so).


Juergen



Re: [PATCH v6 7/8] xen/rwlock: raise the number of possible cpus

2024-04-02 Thread Jürgen Groß

On 02.04.24 16:52, Jan Beulich wrote:

On 27.03.2024 16:22, Juergen Gross wrote:

@@ -36,14 +36,21 @@ void queue_write_lock_slowpath(rwlock_t *lock);
  
  static inline bool _is_write_locked_by_me(unsigned int cnts)

  {
-BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS);
+BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS);
+BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX);
  return (cnts & _QW_WMASK) == _QW_LOCKED &&
 (cnts & _QW_CPUMASK) == smp_processor_id();
  }
  
  static inline bool _can_read_lock(unsigned int cnts)

  {
-return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+/*
+ * If write locked by the caller, no other readers are possible.
+ * Not allowing the lock holder to read_lock() another 32768 times ought
+ * to be fine.
+ */
+return cnts <= INT_MAX &&
+   (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts));
  }


What is the 32768 in the comment relating to? INT_MAX is quite a bit higher,
yet the comparison against it is the only thing you add. Whereas the reader
count is, with the sign bit unused, 17 bits, though (bits 14..30). I think


You missed:

#define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */

So the reader's shift is 16, resulting in 15 bits for the reader count.


even in such a comment rather than using a literal number the corresponding
expression would better be stated.


Hmm, you mean replacing the 32768 with INT_MAX >> _QR_SHIFT? This would be
fine with me.


Juergen



Re: [PATCH v6 6/8] xen/spinlock: support higher number of cpus

2024-04-02 Thread Jürgen Groß

On 02.04.24 16:42, Jan Beulich wrote:

On 27.03.2024 16:22, Juergen Gross wrote:

Allow 16 bits per cpu number, which is the limit imposed by
spinlock_tickets_t.

This will allow up to 65535 cpus, while increasing only the size of
recursive spinlocks in debug builds from 8 to 12 bytes.

The current Xen limit of 4095 cpus is imposed by SPINLOCK_CPU_BITS
being 12. There are machines available with more cpus than the current
Xen limit, so it makes sense to have the possibility to use more cpus.

Signed-off-by: Juergen Gross 


Reviewed-by: Jan Beulich 
albeit I have to say that I'm not entirely convinced of ...


--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -485,7 +485,9 @@ bool _rspin_trylock(rspinlock_t *lock)
  
  /* Don't allow overflow of recurse_cpu field. */

  BUILD_BUG_ON(NR_CPUS > SPINLOCK_NO_CPU);
+BUILD_BUG_ON(SPINLOCK_CPU_BITS > sizeof(lock->recurse_cpu) * 8);
  BUILD_BUG_ON(SPINLOCK_RECURSE_BITS < 3);
+BUILD_BUG_ON(SPINLOCK_MAX_RECURSE > ((1u << SPINLOCK_RECURSE_BITS) - 1));
  
  check_lock(>debug, true);


... the two additions here: The two checks we had verify independent
properties, whereas the new ones basically check that struct rspinlock
and its associated #define-s were got right. We don't check such
elsewhere, I don't think.


I think we do.

What about:
  BUILD_BUG_ON(sizeof(hwp_req) != sizeof(hwp_req.raw))
checking that two union elements are of the same size (and both elements don't
contain any other structs).

Additionally it is not obvious at a first glance that SPINLOCK_CPU_BITS defined
in line 11 is relevant for the definition of recurse_cpu in line 217.

Regarding the second added BUILD_BUG_ON() there was a comment by Julien related
to the definition of SPINLOCK_MAX_RECURSE in V4 of this patch. We settled to use
the current form including the added BUILD_BUG_ON().


Juergen



Re: Linux Xen PV CPA W^X violation false-positives

2024-03-28 Thread Jürgen Groß

Hi Jason,

On 28.03.24 02:24, Jason Andryuk wrote:

On Wed, Mar 27, 2024 at 7:46 AM Jürgen Groß  wrote:


On 24.01.24 17:54, Jason Andryuk wrote:

+
+ return new;
+ }
+ }
+
   end = start + npg * PAGE_SIZE - 1;
   WARN_ONCE(1, "CPA detected W^X violation: %016llx -> %016llx range: 0x%016lx 
- 0x%016lx PFN %lx\n",
 (unsigned long long)pgprot_val(old),


Jason, do you want to send a V2 with your Signed-off, or would you like me to
try upstreaming the patch?


Hi Jürgen,

Yes, please upstream your approach.  I wasn't sure how to deal with
it, so it was more of a bug report.


The final solution was a bit more complicated, as there are some
corner cases to be considered. OTOH it is now complete by looking
at all used translation entries.

Are you able to test the attached patch? I don't see the original
issue and can only verify the patch doesn't cause any regression.


Juergen
From fd25a67d92e44b61d05d92658b23d026202a1656 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
To: x...@kernel.org
To: linux-ker...@vger.kernel.org
To: xen-devel@lists.xenproject.org
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: Dave Hansen 
Cc: "H. Peter Anvin" 
Cc: Andy Lutomirski 
Cc: Peter Zijlstra 
Date: Thu, 28 Mar 2024 12:24:48 +0100
Subject: [PATCH] x86/pat: fix W^X violation false-positives when running as
 Xen PV guest

When running as Xen PV guest in some cases W^X violation WARN()s have
been observed. Those WARN()s are produced by verify_rwx(), which looks
into the PTE to verify that writable kernel pages have the NX bit set
in order to avoid code modifications of the kernel by rogue code.

As the NX bits of all levels of translation entries are or-ed and the
RW bits of all levels are and-ed, looking just into the PTE isn't enough
for the decision that a writable page is executable, too. When running
as a Xen PV guest, kernel initialization will set the NX bit in PMD
entries of the initial page tables covering the .data segment.

When finding the PTE to have set the RW bit but no NX bit, higher level
entries must be looked at. Only when all levels have the RW bit set and
no NX bit set, the W^X violation should be flagged.

Additionally show_fault_oops() has a similar problem: it will issue the
"kernel tried to execute NX-protected page" message only if it finds
the NX bit set in the leaf translation entry, while any NX bit in
non-leaf entries are being ignored for issuing the message.

Modify lookup_address_in_pgd() to return the effective NX and RW bit
values of the non-leaf translation entries and evaluate those as well
in verify_rwx() and show_fault_oops().

Fixes: 652c5bf380ad ("x86/mm: Refuse W^X violations")
Reported-by: Jason Andryuk 
Signed-off-by: Juergen Gross 
---
 arch/x86/include/asm/pgtable_types.h |  2 +-
 arch/x86/kernel/sev.c|  3 +-
 arch/x86/mm/fault.c  |  7 ++--
 arch/x86/mm/pat/set_memory.c | 56 +---
 arch/x86/virt/svm/sev.c  |  3 +-
 5 files changed, 52 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 0b748ee16b3d..91ab538d3872 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -565,7 +565,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
  */
 extern pte_t *lookup_address(unsigned long address, unsigned int *level);
 extern pte_t *lookup_address_in_pgd(pgd_t *pgd, unsigned long address,
-unsigned int *level);
+unsigned int *level, bool *nx, bool *rw);
 extern pmd_t *lookup_pmd_address(unsigned long address);
 extern phys_addr_t slow_virt_to_phys(void *__address);
 extern int __init kernel_map_pages_in_pgd(pgd_t *pgd, u64 pfn,
diff --git a/arch/x86/kernel/sev.c b/arch/x86/kernel/sev.c
index b59b09c2f284..e833183d1adb 100644
--- a/arch/x86/kernel/sev.c
+++ b/arch/x86/kernel/sev.c
@@ -515,12 +515,13 @@ static enum es_result vc_slow_virt_to_phys(struct ghcb *ghcb, struct es_em_ctxt
 	unsigned long va = (unsigned long)vaddr;
 	unsigned int level;
 	phys_addr_t pa;
+	bool nx, rw;
 	pgd_t *pgd;
 	pte_t *pte;
 
 	pgd = __va(read_cr3_pa());
 	pgd = [pgd_index(va)];
-	pte = lookup_address_in_pgd(pgd, va, );
+	pte = lookup_address_in_pgd(pgd, va, , , );
 	if (!pte) {
 		ctxt->fi.vector = X86_TRAP_PF;
 		ctxt->fi.cr2= vaddr;
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 622d12ec7f08..eb8e897a5653 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -514,18 +514,19 @@ show_fault_oops(struct pt_regs *regs, unsigned long error_code, unsigned long ad
 
 	if (error_code & X86_PF_INSTR) {
 		unsigned int level;
+		bool nx, rw;
 		pgd_t *pgd;
 		pte_t *pte;
 
 		pgd = __va(read_cr3_pa());
 		pgd += pgd_index(address);
 
-		pte = lookup_address_in_pgd(pgd, address, );
+		pte = lookup_address_in_pgd(pgd, address, , , )

Re: Linux Xen PV CPA W^X violation false-positives

2024-03-27 Thread Jürgen Groß

On 24.01.24 17:54, Jason Andryuk wrote:

Xen PV domains show CPA W^X violations like:

CPA detected W^X violation: 0064 -> 0067 range: 
0x8881 - 0x88810fff PFN 10
WARNING: CPU: 0 PID: 30 at arch/x86/mm/pat/set_memory.c:613 
__change_page_attr_set_clr+0x113a/0x11c0
Modules linked in: xt_physdev xt_MASQUERADE iptable_nat nf_nat nf_conntrack 
libcrc32c nf_defrag_ipv4 ip_tables x_tables xen_argo(O)
CPU: 0 PID: 30 Comm: kworker/0:2 Tainted: G   O   6.1.38 #1
Workqueue: events bpf_prog_free_deferred
RIP: e030:__change_page_attr_set_clr+0x113a/0x11c0
Code: 4c 89 f1 4c 89 e2 4c 89 d6 4c 89 8d 70 ff ff ff 4d 8d 86 ff 0f 00 00 48 c7 c7 
f0 3c da 81 c6 05 d0 0e 0e 01 01 e8 f6 71 00 00 <0f> 0b 4c 8b 8d 70 ff ff ff e9 
2a fd ff ff 48 8b 85 60 ff ff ff 48
RSP: e02b:c9367c48 EFLAGS: 00010282
RAX:  RBX: 000ef064 RCX: 
RDX: 0003 RSI: f7ff RDI: 
RBP: c9367d48 R08:  R09: c9367aa0
R10: 0001 R11: 0001 R12: 0067
R13: 0001 R14: 8881 R15: c9367d60
FS:  () GS:88800b80() knlGS:
CS:  e030 DS:  ES:  CR0: 80050033
CR2: 7fdbaeda01c0 CR3: 04312000 CR4: 00050660
Call Trace:
  
  ? show_regs.cold+0x1a/0x1f
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? __warn+0x7b/0xc0
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? report_bug+0x111/0x1a0
  ? handle_bug+0x4d/0xa0
  ? exc_invalid_op+0x19/0x70
  ? asm_exc_invalid_op+0x1b/0x20
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? __change_page_attr_set_clr+0x113a/0x11c0
  ? debug_smp_processor_id+0x17/0x20
  ? ___cache_free+0x2e/0x1e0
  ? _raw_spin_unlock+0x1e/0x40
  ? __purge_vmap_area_lazy+0x2ea/0x6b0
  set_direct_map_default_noflush+0x7c/0xa0
  __vunmap+0x1ac/0x280
  __vfree+0x1d/0x60
  vfree+0x27/0x40
  __bpf_prog_free+0x44/0x50
  bpf_prog_free_deferred+0x104/0x120
  process_one_work+0x1ca/0x3d0
  ? process_one_work+0x3d0/0x3d0
  worker_thread+0x45/0x3c0
  ? process_one_work+0x3d0/0x3d0
  kthread+0xe2/0x110
  ? kthread_complete_and_exit+0x20/0x20
  ret_from_fork+0x1f/0x30
  
---[ end trace  ]---

Xen provides a set of page tables that the guest executes out of when it
starts.  The L1 entries are shared between level2_ident_pgt and
level2_kernel_pgt, and xen_setup_kernel_pagetable() sets the NX bit in
the level2_ident_pgt entries.  verify_rwx() only checks the l1 entry and
reports a false-positive violation.

Here is a dump of some kernel virtual addresses and the corresponding
L1 and L2 entries:
This is the start of the directmap (ident) and they have NX (bit 63) set
in the PMD.
ndvm-pv (1): [0.466778] va=8880 pte=00100027 level: 1
ndvm-pv (1): [0.466788] va=8880 pmd=8242c067 level: 2
Directmap for kernel text:
ndvm-pv (1): [0.466795] va=88800100 pte=00100165 level: 1
ndvm-pv (1): [0.466801] va=88800100 pmd=82434067 level: 2
ndvm-pv (1): [0.466807] va=88800101 pte=001001010065 level: 1
ndvm-pv (1): [0.466814] va=88800101 pmd=82434067 level: 2
The start of the kernel text highmap is unmapped:
ndvm-pv (1): [0.466820] va=8000 pte= level: 3
ndvm-pv (1): [0.466826] va=8000 pmd= level: 3
Kernel PMD for .text has NX bit clear
ndvm-pv (1): [0.466832] va=8100 pte=00100165 level: 1
ndvm-pv (1): [0.466838] va=8100 pmd=02434067 level: 2
Kernel PTE for rodata_end has NX bit set
ndvm-pv (1): [0.466846] va=81e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466874] va=81e62000 pmd=0243b067 level: 2
Directmap of rodata_end
ndvm-pv (1): [0.466907] va=888001e62000 pte=801001e62025 level: 1
ndvm-pv (1): [0.466913] va=888001e62000 pmd=8243b067 level: 2
Directmap of a low RAM address
ndvm-pv (1): [0.466920] va=8881 pte=00110027 level: 1
ndvm-pv (1): [0.466926] va=8881 pmd=8242c067 level: 2
Directmap of another RAM address close to but below kernel text
ndvm-pv (1): [0.466932] va=88800096c000 pte=00100096c027 level: 1
ndvm-pv (1): [0.466938] va=88800096c000 pmd=82430067 level: 2

Here are some L2 entries showing the differing NX bits for l2_ident vs.
l2_kernel while they point at the same L1 addresses
ndvm-pv (1): [0.466944]  l2_ident[  0] pmd=8242c067
ndvm-pv (1): [0.466949]  l2_ident[  1] pmd=8242d067
ndvm-pv (1): [0.466955]  l2_ident[  8] pmd=82434067
ndvm-pv (1): [0.466959]  l2_ident[  9] pmd=82435067
ndvm-pv (1): [0.466964]  l2_ident[ 14] pmd=8243a067
ndvm-pv (1): [0.466969]  l2_ident[ 15] pmd=8243b067
ndvm-pv (1): [

Re: [PATCH] xen/x86: Remove duplicate include

2024-03-22 Thread Jürgen Groß

On 22.03.24 07:39, Jiapeng Chong wrote:

./arch/x86/xen/enlighten.c: linux/memblock.h is included more than once.

Reported-by: Abaci Robot 
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=8610
Signed-off-by: Jiapeng Chong 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] xen/rwlock: Don't perpeuatite broken API in new logic

2024-03-20 Thread Jürgen Groß

On 19.03.24 12:30, Andrew Cooper wrote:

The single user wants this the sane way around.  Write it as a normal static
inline just like rspin_lock().

Fixes: cc3e8df542ed ("xen/spinlock: add rspin_[un]lock_irq[save|restore]()")
Signed-off-by: Andrew Cooper 


Reviewed-by: Juergen Gross 

Maybe with the subject fixed (s/rwlock/spinlock/).


Juergen




Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()

2024-03-18 Thread Jürgen Groß

On 18.03.24 17:08, Jan Beulich wrote:

On 18.03.2024 17:05, Jürgen Groß wrote:

On 18.03.24 16:59, Jan Beulich wrote:

On 18.03.2024 16:55, Jürgen Groß wrote:

On 18.03.24 15:43, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

Instead of special casing rspin_lock_irqsave() and
rspin_unlock_irqrestore() for the console lock, add those functions
to spinlock handling and use them where needed.

Signed-off-by: Juergen Gross 


Reviewed-by: Jan Beulich 
with two remarks:


--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock)
lock->recurse_cnt++;
}

+unsigned long _rspin_lock_irqsave(rspinlock_t *lock)

+{
+unsigned long flags;
+
+local_irq_save(flags);
+_rspin_lock(lock);
+
+return flags;
+}
+
void _rspin_unlock(rspinlock_t *lock)
{
if ( likely(--lock->recurse_cnt == 0) )
{
lock->recurse_cpu = SPINLOCK_NO_CPU;
-spin_unlock(lock);
+_spin_unlock(lock);


This looks like an unrelated change. I think I can guess the purpose, but
it would be nice if such along-the-way changes could be mentioned in the
description.


I think it would be better to move that change to patch 3.


Hmm, it would be a secondary change there, too. I was actually meaning to
commit patches 2-5, but if things want moving around I guess I better
wait with doing so?


Hmm, maybe just drop this hunk and let patch 7 handle it?


Ah yes, that seem more logical to me. I take it you don't mean "hunk"
though, but really just this one line change.


Oh yes, of course.


Juergen




Re: [PATCH v5 12/13] xen/rwlock: raise the number of possible cpus

2024-03-18 Thread Jürgen Groß

On 18.03.24 17:05, Jan Beulich wrote:

On 18.03.2024 17:00, Jürgen Groß wrote:

On 18.03.24 16:39, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

@@ -36,14 +36,16 @@ void queue_write_lock_slowpath(rwlock_t *lock);
   
   static inline bool _is_write_locked_by_me(unsigned int cnts)

   {
-BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS);
+BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS);
+BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX);
   return (cnts & _QW_WMASK) == _QW_LOCKED &&
  (cnts & _QW_CPUMASK) == smp_processor_id();
   }
   
   static inline bool _can_read_lock(unsigned int cnts)

   {
-return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+return cnts <= INT_MAX &&
+   (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts));
   }


I view this as problematic: Code knowing that a write lock is being held
may invoke a function using read_trylock() and expect the lock to be
available there.


So you expect it to be fine that someone is using read_trylock() 32768 times
recursively while holding a lock as a writer? Sure, I can change the condition,
but OTOH ...


Hmm, yes, the reader count (leaving aside nested read_trylock()) is zero
when the lock is held for writing. So yes, I agree the condition is fine,
but may I ask for a brief comment to this effect, for blind people like
me?


Yeah, fine with me. :-)


Juergen




Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()

2024-03-18 Thread Jürgen Groß

On 18.03.24 16:59, Jan Beulich wrote:

On 18.03.2024 16:55, Jürgen Groß wrote:

On 18.03.24 15:43, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

Instead of special casing rspin_lock_irqsave() and
rspin_unlock_irqrestore() for the console lock, add those functions
to spinlock handling and use them where needed.

Signed-off-by: Juergen Gross 


Reviewed-by: Jan Beulich 
with two remarks:


--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock)
   lock->recurse_cnt++;
   }
   
+unsigned long _rspin_lock_irqsave(rspinlock_t *lock)

+{
+unsigned long flags;
+
+local_irq_save(flags);
+_rspin_lock(lock);
+
+return flags;
+}
+
   void _rspin_unlock(rspinlock_t *lock)
   {
   if ( likely(--lock->recurse_cnt == 0) )
   {
   lock->recurse_cpu = SPINLOCK_NO_CPU;
-spin_unlock(lock);
+_spin_unlock(lock);


This looks like an unrelated change. I think I can guess the purpose, but
it would be nice if such along-the-way changes could be mentioned in the
description.


I think it would be better to move that change to patch 3.


Hmm, it would be a secondary change there, too. I was actually meaning to
commit patches 2-5, but if things want moving around I guess I better
wait with doing so?


Hmm, maybe just drop this hunk and let patch 7 handle it?


Juergen



Re: [PATCH v5 12/13] xen/rwlock: raise the number of possible cpus

2024-03-18 Thread Jürgen Groß

On 18.03.24 16:39, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

The rwlock handling is limiting the number of cpus to 4095 today. The
main reason is the use of the atomic_t data type for the main lock
handling, which needs 2 bits for the locking state (writer waiting or
write locked), 12 bits for the id of a possible writer, and a 12 bit
counter for readers. The limit isn't 4096 due to an off by one sanity
check.

The atomic_t data type is 32 bits wide, so in theory 15 bits for the
writer's cpu id and 15 bits for the reader count seem to be fine, but
via read_trylock() more readers than cpus are possible.


As a result, afaict you choose to use just 14 bits for the CPU, but
still 15 bits (with the 16th to deal with overflow) for the reader count.
That could do with making explicit here, as a question is whether we
deem as sufficient that there is just one extra bit for the reader
count.


Okay, I'll add a sentence to the commit message.




--- a/xen/include/xen/rwlock.h
+++ b/xen/include/xen/rwlock.h
@@ -23,12 +23,12 @@ typedef struct {
  #define rwlock_init(l) (*(l) = (rwlock_t)RW_LOCK_UNLOCKED)
  
  /* Writer states & reader shift and bias. */

-#define_QW_CPUMASK  0xfffU /* Writer CPU mask */
-#define_QW_SHIFT12 /* Writer flags shift */
-#define_QW_WAITING  (1U << _QW_SHIFT)  /* A writer is waiting */
-#define_QW_LOCKED   (3U << _QW_SHIFT)  /* A writer holds the lock */
-#define_QW_WMASK(3U << _QW_SHIFT)  /* Writer mask */
-#define_QR_SHIFT14 /* Reader count shift */
+#define_QW_SHIFT14  /* Writer flags shift */
+#define_QW_CPUMASK  ((1U << _QW_SHIFT) - 1) /* Writer CPU mask */
+#define_QW_WAITING  (1U << _QW_SHIFT)   /* A writer is waiting */
+#define_QW_LOCKED   (3U << _QW_SHIFT)   /* A writer holds the lock */
+#define_QW_WMASK(3U << _QW_SHIFT)   /* Writer mask */
+#define_QR_SHIFT(_QW_SHIFT + 2) /* Reader count shift */
  #define_QR_BIAS (1U << _QR_SHIFT)


Btw, seeing all the uppercase U suffixes here, I think you had some
lowercase ones earlier in the series. While Misra doesn't demand
uppercase for U, it does for L and iirc we decided to use all
uppercase suffixes as a result. Would be nice if what goes in could
have this correct right away.


I'll rescan all the patches and change them accordingly.




@@ -36,14 +36,16 @@ void queue_write_lock_slowpath(rwlock_t *lock);
  
  static inline bool _is_write_locked_by_me(unsigned int cnts)

  {
-BUILD_BUG_ON(_QW_CPUMASK < NR_CPUS);
+BUILD_BUG_ON((_QW_CPUMASK + 1) < NR_CPUS);
+BUILD_BUG_ON(NR_CPUS * _QR_BIAS > INT_MAX);
  return (cnts & _QW_WMASK) == _QW_LOCKED &&
 (cnts & _QW_CPUMASK) == smp_processor_id();
  }
  
  static inline bool _can_read_lock(unsigned int cnts)

  {
-return !(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts);
+return cnts <= INT_MAX &&
+   (!(cnts & _QW_WMASK) || _is_write_locked_by_me(cnts));
  }


I view this as problematic: Code knowing that a write lock is being held
may invoke a function using read_trylock() and expect the lock to be
available there.


So you expect it to be fine that someone is using read_trylock() 32768 times
recursively while holding a lock as a writer? Sure, I can change the condition,
but OTOH ...


Juergen




Re: [PATCH v5 11/13] xen/spinlock: support higher number of cpus

2024-03-18 Thread Jürgen Groß

On 18.03.24 16:08, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -8,16 +8,16 @@
  #include 
  #include 
  
-#define SPINLOCK_CPU_BITS  12

+#define SPINLOCK_CPU_BITS  16
  
  #ifdef CONFIG_DEBUG_LOCKS

  union lock_debug {
-uint16_t val;
-#define LOCK_DEBUG_INITVAL 0x
+uint32_t val;
+#define LOCK_DEBUG_INITVAL 0x


With this #define I can see the desire for using a fixed width type for "val".
However, ...


  struct {
-uint16_t cpu:SPINLOCK_CPU_BITS;
-#define LOCK_DEBUG_PAD_BITS (14 - SPINLOCK_CPU_BITS)
-uint16_t :LOCK_DEBUG_PAD_BITS;
+uint32_t cpu:SPINLOCK_CPU_BITS;
+#define LOCK_DEBUG_PAD_BITS (30 - SPINLOCK_CPU_BITS)
+uint32_t :LOCK_DEBUG_PAD_BITS;


.. "unsigned int" ought to be fine for both of these.


Fine with me.


Juergen




Re: [PATCH v5 04/13] xen/spinlock: add rspin_[un]lock_irq[save|restore]()

2024-03-18 Thread Jürgen Groß

On 18.03.24 15:43, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

Instead of special casing rspin_lock_irqsave() and
rspin_unlock_irqrestore() for the console lock, add those functions
to spinlock handling and use them where needed.

Signed-off-by: Juergen Gross 


Reviewed-by: Jan Beulich 
with two remarks:


--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -475,15 +475,31 @@ void _rspin_lock(rspinlock_t *lock)
  lock->recurse_cnt++;
  }
  
+unsigned long _rspin_lock_irqsave(rspinlock_t *lock)

+{
+unsigned long flags;
+
+local_irq_save(flags);
+_rspin_lock(lock);
+
+return flags;
+}
+
  void _rspin_unlock(rspinlock_t *lock)
  {
  if ( likely(--lock->recurse_cnt == 0) )
  {
  lock->recurse_cpu = SPINLOCK_NO_CPU;
-spin_unlock(lock);
+_spin_unlock(lock);


This looks like an unrelated change. I think I can guess the purpose, but
it would be nice if such along-the-way changes could be mentioned in the
description.


I think it would be better to move that change to patch 3.




--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -272,7 +272,15 @@ static always_inline void spin_lock_if(bool condition, 
spinlock_t *l)
   */
  bool _rspin_trylock(rspinlock_t *lock);
  void _rspin_lock(rspinlock_t *lock);
+#define rspin_lock_irqsave(l, f)\
+({  \
+BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long));   \
+((f) = _rspin_lock_irqsave(l)); \


Perhaps in the context of another patch in the series I think I had
pointed out that the outer pair of parentheses is unnecessary in
constructs like this.


Oh, this one slipped through, sorry for that.

Will fix it in the next iteration.


Juergen



Re: [PATCH v5 08/13] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()

2024-03-18 Thread Jürgen Groß

On 18.03.24 16:44, Jan Beulich wrote:

On 18.03.2024 16:31, Jürgen Groß wrote:

On 18.03.24 15:57, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -395,14 +395,7 @@ static bool always_inline spin_is_locked_common(const 
spinlock_tickets_t *t)
   
   int _spin_is_locked(const spinlock_t *lock)

   {
-/*
- * Recursive locks may be locked by another CPU, yet we return
- * "false" here, making this function suitable only for use in
- * ASSERT()s and alike.
- */
-return lock->recurse_cpu == SPINLOCK_NO_CPU
-   ? spin_is_locked_common(>tickets)
-   : lock->recurse_cpu == smp_processor_id();
+return spin_is_locked_common(>tickets);
   }


The "only suitable for ASSERT()s and alike" part of the comment wants
to survive here, I think.


Why?

I could understand you asking for putting such a comment to spinlock.h
mentioning that any *_is_locked() variant isn't safe, but with
_spin_is_locked() no longer covering recursive locks the comment's reasoning
is no longer true.


Hmm. I guess there is a difference in expectations. To me, these
functions in principle ought to report whether the lock is "owned",
not just "locked by some CPU". They don't, hence why they may not be
used for other than ASSERT()s.


Okay, thanks for clarification. I'll change the comment accordingly.


Juergen



Re: [PATCH v5 08/13] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()

2024-03-18 Thread Jürgen Groß

On 18.03.24 15:57, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -395,14 +395,7 @@ static bool always_inline spin_is_locked_common(const 
spinlock_tickets_t *t)
  
  int _spin_is_locked(const spinlock_t *lock)

  {
-/*
- * Recursive locks may be locked by another CPU, yet we return
- * "false" here, making this function suitable only for use in
- * ASSERT()s and alike.
- */
-return lock->recurse_cpu == SPINLOCK_NO_CPU
-   ? spin_is_locked_common(>tickets)
-   : lock->recurse_cpu == smp_processor_id();
+return spin_is_locked_common(>tickets);
  }


The "only suitable for ASSERT()s and alike" part of the comment wants
to survive here, I think.


Why?

I could understand you asking for putting such a comment to spinlock.h
mentioning that any *_is_locked() variant isn't safe, but with
_spin_is_locked() no longer covering recursive locks the comment's reasoning
is no longer true.




@@ -465,6 +458,23 @@ void _spin_barrier(spinlock_t *lock)
  spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR);
  }
  
+bool _rspin_is_locked(const rspinlock_t *lock)

+{
+/*
+ * Recursive locks may be locked by another CPU, yet we return
+ * "false" here, making this function suitable only for use in
+ * ASSERT()s and alike.
+ */
+return lock->recurse_cpu == SPINLOCK_NO_CPU
+   ? spin_is_locked_common(>tickets)
+   : lock->recurse_cpu == smp_processor_id();
+}


Here otoh I wonder if both the comment and the spin_is_locked_common()
part of the condition are actually correct. Oh, the latter needs
retaining as long as we have nrspin_*() functions, I suppose. But the
comment could surely do with improving a little - at the very least
"yet we return "false"" isn't quite right; minimally there's a "may"
missing.


If anything I guess the comment shouldn't gain a "may", but rather say
"Recursive locks may be locked by another CPU via rspin_lock() ..."


Juergen



Re: [PATCH v2 1/1] x86: Rename __{start,end}_init_task to __{start,end}_init_stack

2024-03-18 Thread Jürgen Groß

On 18.03.24 08:14, Xin Li (Intel) wrote:

The stack of a task has been separated from the memory of a task_struct
struture for a long time on x86, as a result __{start,end}_init_task no
longer mark the start and end of the init_task structure, but its stack
only.

Rename __{start,end}_init_task to __{start,end}_init_stack.

Note other architectures are not affected because __{start,end}_init_task
are used on x86 only.

Signed-off-by: Xin Li (Intel) 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v2] xen/arm: Set correct per-cpu cpu_core_mask

2024-03-18 Thread Jürgen Groß

On 15.03.24 11:58, Julien Grall wrote:



On 14/03/2024 14:22, Henry Wang wrote:

Hi Julien,


Hi,



On 3/14/2024 9:27 PM, Julien Grall wrote:

Hi Henry,

On 28/02/2024 01:58, Henry Wang wrote:

diff --git a/xen/arch/arm/smpboot.c b/xen/arch/arm/smpboot.c
index a84e706d77..d9ebd55d4a 100644
--- a/xen/arch/arm/smpboot.c
+++ b/xen/arch/arm/smpboot.c
@@ -66,7 +66,6 @@ static bool cpu_is_dead;
    /* ID of the PCPU we're running on */
  DEFINE_PER_CPU(unsigned int, cpu_id);
-/* XXX these seem awfully x86ish... */


:). I guess at the time we didn't realize that MT was supported on Arm. It is 
at least part of the spec, but as Michal pointed out it doesn't look like a 
lot of processors supports it.


Yep. Do you think changing the content of this line to something like 
"Although multithread is part of the Arm spec, there are not many processors 
support multithread and current Xen on Arm assumes there is no multithread" 
makes sense to you?



  /* representing HT siblings of each logical CPU */
  DEFINE_PER_CPU_READ_MOSTLY(cpumask_var_t, cpu_sibling_mask);
  /* representing HT and core siblings of each logical CPU */
@@ -89,6 +88,10 @@ static int setup_cpu_sibling_map(int cpu)
  cpumask_set_cpu(cpu, per_cpu(cpu_sibling_mask, cpu));
  cpumask_set_cpu(cpu, per_cpu(cpu_core_mask, cpu));
  +    /* Currently we assume there is no multithread. */


I am not very familiar with the scheduling in Xen. Do you know what's the 
consequence of not properly supporting MT? One thing I can think of is 
security (I know there were plenty of security issues with SMT).


Unfortunately me neither, so adding George to this thread as I think he can 
bring us some insights on this topic from the scheduler perspective.


+Juergen as he worked on co-scheduling.


There are four aspects regarding multithreading:

- security (basically boils down to one thread possibly being able to use
  side channel attacks for accessing other thread's data via resources in
  the core shared between the threads)

- performance (trying to spread running vcpus across as many cores as
  possible will utilize more hardware resources resulting generally in better
  performance - especially credit2 is supporting that)

- power consumption (a completely idle core will consume less power - again
  credit2 is supporting that with the correct setting)

- busy loops (cpu_relax() support)

For credit2 MT specifics see the large comment block in xen/common/sched/credit2
starting at line 636.

Note that core scheduling is currently not supported on Arm, as there are
architecture specific bits missing in the context switching logic.


Juergen



Re: IMPORTANT - : Need help on USB port virtualization with Xen hypervisor

2024-03-16 Thread Jürgen Groß

On 16.03.24 00:32, Stefano Stabellini wrote:

Hi Dominique,

You posted this configuration:

device_model_args = [ "
   "-device","nec-usb-xhci,id=xhci",
   "-device","usb-host,bus=xhci.0,hostbus=1,hostport=13",
   "-device","usb-host,bus=xhci.0,hostbus=1,hostport=10",
   "-device","usb-host,bus=xhci.0,hostbus=1,hostport=2",
   "-device","usb-host,bus=xhci.0,hostbus=2,hostport=2",
   "-device","usb-host,bus=xhci.0,hostbus=2,hostport=1",
   "-device","usb-host,bus=xhci.0,hostbus=1,hostport=1"]

It looks like you are using QEMU-based USB passthrough. Basically QEMU
(independently from Xen) is accessing the USB device in Dom0 and
exposing a corresponding device to the guest. I am not sure if there is
anything that can be done in QEMU to support USB 3.0 with high speed,
people in CC might know.

There two other alternatives to this approach. You can use PV USB
Passthrough instead. Juergen (CCed) is the original author. I am not
sure if that supports USB 3.0 either.


PV USB doesn't support USB 3.0.

Instead of using device_model_args= in the guest configuration I'd rather
use the usbctrl= and usbdev= items (see the related xl.cfg(5) man page).
I'm not sure this will really make a performance difference, though.


Juergen




Re: [PATCH v5 13/13] xen: allow up to 16383 cpus

2024-03-14 Thread Jürgen Groß

On 14.03.24 08:26, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

With lock handling now allowing up to 16384 cpus (spinlocks can handle
65535 cpus, rwlocks can handle 16384 cpus), raise the allowed limit for
the number of cpus to be configured to 16383.

The new limit is imposed by IOMMU_CMD_BUFFER_MAX_ENTRIES and
QINVAL_MAX_ENTRY_NR required to be larger than 2 * CONFIG_NR_CPUS.


That's quite sad - I was really hoping we'd finally end up with a
power-of-2 upper bound.


Yes, me too.


Juergen



Re: [PATCH v5 01/13] xen/spinlock: remove misra rule 21.1 violations

2024-03-14 Thread Jürgen Groß

On 14.03.24 08:32, Jan Beulich wrote:

On 14.03.2024 08:20, Juergen Gross wrote:

In xen spinlock code there are several violations of MISRA rule 21.1
(identifiers starting with "__" or "_[A-Z]").

Fix them by using trailing underscores instead.

Signed-off-by: Juergen Gross 


I can live with the changes as they are, but before giving an ack, I'd
still like to ask if the moved underscores are really useful / necessary
in all cases. E.g.


--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -22,7 +22,7 @@ union lock_debug {
  bool unseen:1;
  };
  };
-#define _LOCK_DEBUG { .val = LOCK_DEBUG_INITVAL }
+#define LOCK_DEBUG_ { .val = LOCK_DEBUG_INITVAL }


... for an internal helper macro it may indeed be better to have a
trailing one here, but ...


@@ -95,27 +95,27 @@ struct lock_profile_qhead {
  int32_t   idx; /* index for printout */
  };
  
-#define _LOCK_PROFILE(lockname) { .name = #lockname, .lock = &(lockname), }

-#define _LOCK_PROFILE_PTR(name)   \
-static struct lock_profile * const __lock_profile_##name  \
+#define LOCK_PROFILE_(lockname) { .name = #lockname, .lock = &(lockname), }
+#define LOCK_PROFILE_PTR_(name)   \
+static struct lock_profile * const lock_profile__##name   \


... I'm not entirely convinced of the need for the double infix ones
here ...


This reduces the chance of name clashes with other lock profiling variables or
functions (e.g. lock_profile_lock). In case you think this can be neglected, I'm
fine with dropping the extra underscores.


Juergen



Re: [PATCH v1 1/1] x86: Rename __{start,end}_init_task to __{start,end}_init_stack

2024-03-13 Thread Jürgen Groß

On 13.03.24 07:05, Xin Li (Intel) wrote:

The stack of a task has been separated from the memory of a task_struct
struture for a long time on x86, as a result __{start,end}_init_task no
longer mark the start and end of the init_task structure, but its stack
only.

Rename __{start,end}_init_task to __{start,end}_init_stack.

Note other architectures are not affected because __{start,end}_init_task
are used on x86 only.

Signed-off-by: Xin Li (Intel) 
---
  arch/x86/include/asm/processor.h  | 4 ++--
  arch/x86/kernel/head_64.S | 2 +-
  arch/x86/xen/xen-head.S   | 2 +-
  include/asm-generic/vmlinux.lds.h | 8 
  4 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 811548f131f4..8b3a3f3bb859 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -636,10 +636,10 @@ static __always_inline void prefetchw(const void *x)
  #define KSTK_ESP(task)(task_pt_regs(task)->sp)
  
  #else

-extern unsigned long __end_init_task[];
+extern unsigned long __end_init_stack[];
  
  #define INIT_THREAD {			\

-   .sp = (unsigned long)&__end_init_task - \
+   .sp = (unsigned long)&__end_init_stack -\
  TOP_OF_KERNEL_STACK_PADDING - \
  sizeof(struct pt_regs),   \
  }
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index d8198fbd70e5..c7babd7ebb0f 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -66,7 +66,7 @@ SYM_CODE_START_NOALIGN(startup_64)
mov %rsi, %r15
  
  	/* Set up the stack for verify_cpu() */

-   leaq(__end_init_task - TOP_OF_KERNEL_STACK_PADDING - 
PTREGS_SIZE)(%rip), %rsp
+   leaq(__end_init_stack - TOP_OF_KERNEL_STACK_PADDING - 
PTREGS_SIZE)(%rip), %rsp
  
  	/* Setup GSBASE to allow stack canary access for C code */

movl$MSR_GS_BASE, %ecx
diff --git a/arch/x86/xen/xen-head.S b/arch/x86/xen/xen-head.S
index 04101b984f24..43eadf03f46d 100644
--- a/arch/x86/xen/xen-head.S
+++ b/arch/x86/xen/xen-head.S
@@ -49,7 +49,7 @@ SYM_CODE_START(startup_xen)
ANNOTATE_NOENDBR
cld
  
-	leaq	(__end_init_task - TOP_OF_KERNEL_STACK_PADDING - PTREGS_SIZE)(%rip), %rsp

+   leaq(__end_init_stack - TOP_OF_KERNEL_STACK_PADDING - 
PTREGS_SIZE)(%rip), %rsp
  
  	/* Set up %gs.

 *
diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 5dd3a61d673d..a168be99d522 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -399,13 +399,13 @@
  
  #define INIT_TASK_DATA(align)		\

. = ALIGN(align);   \
-   __start_init_task = .;  \
+   __start_init_stack = .; \
init_thread_union = .;  \
init_stack = .; \
-   KEEP(*(.data..init_task))   \
+   KEEP(*(.data..init_stack))  \


Is this modification really correct?


KEEP(*(.data..init_thread_info))\
-   . = __start_init_task + THREAD_SIZE;\
-   __end_init_task = .;
+   . = __start_init_stack + THREAD_SIZE;   \
+   __end_init_stack = .;
  
  #define JUMP_TABLE_DATA			\

. = ALIGN(8);   \

base-commit: 626856ae97054963e7b8c35335d4418271c8d0c4



Juergen




Re: Linux 6.7-rc1+: WARNING at drivers/xen/evtchn.c:167 evtchn_interrupt

2024-03-12 Thread Jürgen Groß

On 30.11.23 03:34, Marek Marczykowski-Górecki wrote:

Hi,

While testing 6.7-rc3 on Qubes OS I found several warning like in the
subject in dom0 log. I see them when running 6.7-rc1 too. I'm not sure
what exactly triggers the issue, but my guess would be unbinding an
event channel from userspace (closing vchan connection).

Specific message:

[   83.973377] [ cut here ]
[   83.975523] Interrupt for port 77, but apparently not enabled; per-user 
a0e9f1d1


Finally I think I have a fix (thanks to Demi for finding the problematic patch
through bisecting).

Could you please try the attached patch? It is based on current upstream, but
I think it should apply to 6.7 or stable 6.6, too.


Juergen
From 9d673c37b2d0c9aa274c53f619c4e9e43a419f41 Mon Sep 17 00:00:00 2001
From: Juergen Gross 
To: linux-ker...@vger.kernel.org
Cc: Juergen Gross 
Cc: Stefano Stabellini 
Cc: Oleksandr Tyshchenko 
Cc: xen-devel@lists.xenproject.org
Date: Tue, 12 Mar 2024 13:52:24 +0100
Subject: [PATCH] xen/evtchn: avoid WARN() when unbinding an event channel

When unbinding a user event channel, the related handler might be
called a last time in case the kernel was built with
CONFIG_DEBUG_SHIRQ. This might cause a WARN() in the handler.

Avoid that by adding an "unbinding" flag to struct user_event which
will short circuit the handler.

Fixes: 9e90e58c11b7 ("xen: evtchn: Allow shared registration of IRQ handers")
Signed-off-by: Juergen Gross 
---
 drivers/xen/evtchn.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/drivers/xen/evtchn.c b/drivers/xen/evtchn.c
index 59717628ca42..f6a2216c2c87 100644
--- a/drivers/xen/evtchn.c
+++ b/drivers/xen/evtchn.c
@@ -85,6 +85,7 @@ struct user_evtchn {
 	struct per_user_data *user;
 	evtchn_port_t port;
 	bool enabled;
+	bool unbinding;
 };
 
 static void evtchn_free_ring(evtchn_port_t *ring)
@@ -164,6 +165,10 @@ static irqreturn_t evtchn_interrupt(int irq, void *data)
 	struct per_user_data *u = evtchn->user;
 	unsigned int prod, cons;
 
+	/* Handler might be called when tearing down the IRQ. */
+	if (evtchn->unbinding)
+		return IRQ_HANDLED;
+
 	WARN(!evtchn->enabled,
 	 "Interrupt for port %u, but apparently not enabled; per-user %p\n",
 	 evtchn->port, u);
@@ -421,6 +426,7 @@ static void evtchn_unbind_from_user(struct per_user_data *u,
 
 	BUG_ON(irq < 0);
 
+	evtchn->unbinding = true;
 	unbind_from_irqhandler(irq, evtchn);
 
 	del_evtchn(u, evtchn);
-- 
2.35.3



Re: [PATCH] xen/grant-dma-iommu: Convert to platform remove callback returning void

2024-03-08 Thread Jürgen Groß

On 07.03.24 19:11, Uwe Kleine-König wrote:

The .remove() callback for a platform driver returns an int which makes
many driver authors wrongly assume it's possible to do error handling by
returning an error code. However the value returned is ignored (apart
from emitting a warning) and this typically results in resource leaks.

To improve here there is a quest to make the remove callback return
void. In the first step of this quest all drivers are converted to
.remove_new(), which already returns void. Eventually after all drivers
are converted, .remove_new() will be renamed to .remove().

Trivially convert this driver from always returning zero in the remove
callback to the void returning variant.

Signed-off-by: Uwe Kleine-König 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v3] tools/9pfsd: Fix build error caused by strerror_r

2024-03-07 Thread Jürgen Groß

On 07.03.24 14:56, Henry Wang wrote:

Below error can be seen when doing Yocto build of the toolstack:

| io.c: In function 'p9_error':
| io.c:684:5: error: ignoring return value of 'strerror_r' declared
   with attribute 'warn_unused_result' [-Werror=unused-result]
|   684 | strerror_r(err, ring->buffer, ring->ring_size);
|   | ^~
| cc1: all warnings being treated as errors

Using strerror_r() without special casing different build environments
is impossible due to the different return types (int vs char *)
depending on the environment. As p9_error() is not on a performance
critical path, using strerror() with a mutex ought to be fine. So,
fix the build by using strerror() to replace strerror_r(). The steps
would then become: Acquire the mutex first, invoke strerror(), copy
the string from strerror() to the designated buffer and then drop the
mutex.

Fixes: f4900d6d69b5 ("9pfsd: allow building with old glibc")
Signed-off-by: Henry Wang 
Reviewed-by: Jan Beulich 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v3] tools/9pfsd: Fix build error caused by strerror_r

2024-03-07 Thread Jürgen Groß

On 07.03.24 14:56, Henry Wang wrote:

Below error can be seen when doing Yocto build of the toolstack:

| io.c: In function 'p9_error':
| io.c:684:5: error: ignoring return value of 'strerror_r' declared
   with attribute 'warn_unused_result' [-Werror=unused-result]
|   684 | strerror_r(err, ring->buffer, ring->ring_size);
|   | ^~
| cc1: all warnings being treated as errors

Using strerror_r() without special casing different build environments
is impossible due to the different return types (int vs char *)
depending on the environment. As p9_error() is not on a performance
critical path, using strerror() with a mutex ought to be fine. So,
fix the build by using strerror() to replace strerror_r(). The steps
would then become: Acquire the mutex first, invoke strerror(), copy
the string from strerror() to the designated buffer and then drop the
mutex.

Fixes: f4900d6d69b5 ("9pfsd: allow building with old glibc")
Signed-off-by: Henry Wang 
Reviewed-by: Jan Beulich 


Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH] MAINTAINERS: add an entry for tools/9pfsd

2024-03-06 Thread Jürgen Groß

On 06.03.24 12:23, Jan Beulich wrote:

On 06.03.2024 12:16, Juergen Gross wrote:

--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -206,6 +206,12 @@ Maintainers List (try to look for most precise areas first)
  
  		---
  
+9PFSD

+M: Juergen Gross 
+M: Anthony PERARD 
+S: Supported
+F: tools/9pfsd


This wants a trailing slash.


Oh, indeed.


Can pretty certainly be taken care of while committing.


Thanks,


Juergen




Re: [PATCH] 9pfsd: allow building with old glibc

2024-03-05 Thread Jürgen Groß

On 05.03.24 15:33, Jan Beulich wrote:

Neither pread() / pwrite() nor O_DIRECTORY are available from glibc
2.11.3 headers without defining (e.g.) _GNU_SOURCE. Supplying the
definition unconditionally shouldn't be a problem, seeing that various
other tools/ components do so, too.

Signed-off-by: Jan Beulich 


Fine with me.

Reviewed-by: Juergen Gross 


Juergen




Re: [PATCH v4 10/12] xen/spinlock: split recursive spinlocks from normal ones

2024-03-03 Thread Jürgen Groß

On 04.03.24 08:25, Jan Beulich wrote:

On 01.03.2024 15:37, Juergen Gross wrote:

On 29.02.24 16:32, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

+#define nrspin_lock_irqsave(l, f)   \
+({  \
+BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long));   \
+((f) = __nrspin_lock_irqsave(l));   \


I don't think the outer pair of parentheses is needed here.


Turns out it is needed. Otherwise something like:


if ( a )
  nrspin_lock_irqsave(l, f);
else
  ...

will fail with "else without a previous if".


That's for "outer" in the whole #define I suppose, when I commented on
just a specific line inside the construct.


Sorry, I applied your remark to the wrong context.

Yes, one level of parentheses can be removed from this line.


Juergen



Re: [PATCH v4 12/12] xen/spinlock: support higher number of cpus

2024-02-29 Thread Jürgen Groß

On 29.02.24 17:54, Jan Beulich wrote:

On 29.02.2024 17:45, Juergen Gross wrote:

On 29.02.24 17:31, Jan Beulich wrote:

On 29.02.2024 17:29, Jürgen Groß wrote:

On 29.02.24 16:46, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

Allow 16 bits per cpu number, which is the limit imposed by
spinlock_tickets_t.

This will allow up to 65535 cpus, while increasing only the size of
recursive spinlocks in debug builds from 8 to 12 bytes.


I think we want to be more conservative here, for the case of there
being bugs: The CPU holding a lock may wrongly try to acquire it a
2nd time. That's the 65536th ticket then, wrapping the value.


Is this really a problem? There will be no other cpu left seeing the lock
as "free" in this case, as all others will be waiting for the head to reach
their private tail value.


But isn't said CPU then going to make progress, rather than indefinitely
spinning on the lock?


No, I don't think so.


Hmm. If CPU0 takes a pristine lock, it'll get ticket 0x. All other
CPUs will get tickets 0x0001 ... 0x. Then CPU0 trying to take the lock


No, they'll get 0x0001 ... 0xfffe (only 65535 cpus are supported).


again will get ticket 0x again, which equals what .head still has (no


And the first cpu will get 0x when trying to get the lock again.


unlocks so far), thus signalling the lock to be free when it isn't.


The limit isn't 65535 because of the ticket mechanism, but because of the
rspin mechanism, where we need a "no cpu is owning the lock" value. Without
the recursive locks the limit would be 65536 (or 4096 today).


I think this rather belongs ...


No, that was meant to tell you that without programming bug 65536 cpus would
be perfectly fine for the ticket mechanism, and with bug 65535 cpus are fine.




Therefore my suggestion would be to only (mention) go(ing) up to 32k.


Signed-off-by: Juergen Gross 
---
xen/common/spinlock.c  |  1 +
xen/include/xen/spinlock.h | 18 +-
2 files changed, 10 insertions(+), 9 deletions(-)


Shouldn't this also bump the upper bound of the NR_CPUS range then
in xen/arch/Kconfig?


Fine with me, I can add another patch to the series doing that.


Why not do it right here? The upper bound there is like it is only
because of the restriction that's lifted here.


... here (for having nothing to do with the supposed lack of hanging
that I'm seeing)?


I'd prefer splitting the two instances, but if you prefer it to be in a
single patch, so be it.


I'm not going to insist - if want to do it separately, please do.
Perhaps others would actually prefer it that way ...


Okay. I'll do it in another patch.


Juergen



Re: [PATCH v4 12/12] xen/spinlock: support higher number of cpus

2024-02-29 Thread Jürgen Groß

On 29.02.24 16:46, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

Allow 16 bits per cpu number, which is the limit imposed by
spinlock_tickets_t.

This will allow up to 65535 cpus, while increasing only the size of
recursive spinlocks in debug builds from 8 to 12 bytes.


I think we want to be more conservative here, for the case of there
being bugs: The CPU holding a lock may wrongly try to acquire it a
2nd time. That's the 65536th ticket then, wrapping the value.


Is this really a problem? There will be no other cpu left seeing the lock
as "free" in this case, as all others will be waiting for the head to reach
their private tail value.


Therefore my suggestion would be to only (mention) go(ing) up to 32k.


Signed-off-by: Juergen Gross 
---
  xen/common/spinlock.c  |  1 +
  xen/include/xen/spinlock.h | 18 +-
  2 files changed, 10 insertions(+), 9 deletions(-)


Shouldn't this also bump the upper bound of the NR_CPUS range then
in xen/arch/Kconfig?


Fine with me, I can add another patch to the series doing that.


Juergen



Re: [PATCH v4 10/12] xen/spinlock: split recursive spinlocks from normal ones

2024-02-29 Thread Jürgen Groß

On 29.02.24 16:32, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -541,6 +541,55 @@ void rspin_unlock_irqrestore(rspinlock_t *lock, unsigned 
long flags)
  local_irq_restore(flags);
  }
  
+int nrspin_trylock(rspinlock_t *lock)

+{
+check_lock(>debug, true);
+
+if ( unlikely(lock->recurse_cpu != SPINLOCK_NO_CPU) )
+return 0;
+
+return spin_trylock_common(>tickets, >debug, LOCK_PROFILE_PAR);
+}


I wonder if we shouldn't take the opportunity and stop having trylock
functions have "int" return type. Perhaps already spin_trylock_common()
when introduced could use "bool" instead, then here following suit.


Fine with me.




+void nrspin_lock(rspinlock_t *lock)
+{
+spin_lock_common(>tickets, >debug, LOCK_PROFILE_PAR, NULL,
+ NULL);
+}
+
+void nrspin_unlock(rspinlock_t *lock)
+{
+spin_unlock_common(>tickets, >debug, LOCK_PROFILE_PAR);
+}
+
+void nrspin_lock_irq(rspinlock_t *lock)
+{
+ASSERT(local_irq_is_enabled());
+local_irq_disable();
+nrspin_lock(lock);
+}
+
+void nrspin_unlock_irq(rspinlock_t *lock)
+{
+nrspin_unlock(lock);
+local_irq_enable();
+}
+
+unsigned long __nrspin_lock_irqsave(rspinlock_t *lock)
+{
+unsigned long flags;
+
+local_irq_save(flags);
+nrspin_lock(lock);
+return flags;


Nit: Strictly speaking we want a blank line ahead of this "return".


Okay, will add it.




@@ -166,11 +172,15 @@ struct lock_profile_qhead { };
  struct lock_profile { };
  
  #define SPIN_LOCK_UNLOCKED {  \

+.debug =_LOCK_DEBUG,  \
+}
+#define RSPIN_LOCK_UNLOCKED { \
+.debug =_LOCK_DEBUG,  \
  .recurse_cpu = SPINLOCK_NO_CPU,   
\
  .debug =_LOCK_DEBUG,  
\
  }


Initializing .debug twice?


Oh, right. Will drop one.




@@ -180,7 +190,6 @@ struct lock_profile { };
  
  #endif
  
-

  typedef union {
  uint32_t head_tail;
  struct {


Looks like this might be undoing what the earlier patch isn't going to
do anymore?


Yes, seen it already.




@@ -242,6 +257,19 @@ void rspin_unlock_irqrestore(rspinlock_t *lock, unsigned 
long flags);
  int rspin_is_locked(const rspinlock_t *lock);
  void rspin_barrier(rspinlock_t *lock);
  
+int nrspin_trylock(rspinlock_t *lock);

+void nrspin_lock(rspinlock_t *lock);
+void nrspin_unlock(rspinlock_t *lock);
+void nrspin_lock_irq(rspinlock_t *lock);
+void nrspin_unlock_irq(rspinlock_t *lock);
+#define nrspin_lock_irqsave(l, f)   \
+({  \
+BUILD_BUG_ON(sizeof(f) != sizeof(unsigned long));   \
+((f) = __nrspin_lock_irqsave(l));   \


I don't think the outer pair of parentheses is needed here. As to the
leading double underscores, see comments elsewhere.


Okay.


Juergen



Re: [PATCH v4 09/12] xen/spinlock: add missing rspin_is_locked() and rspin_barrier()

2024-02-29 Thread Jürgen Groß

On 29.02.24 15:14, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -458,6 +458,23 @@ void _spin_barrier(spinlock_t *lock)
  spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR);
  }
  
+int rspin_is_locked(const rspinlock_t *lock)

+{
+/*
+ * Recursive locks may be locked by another CPU, yet we return
+ * "false" here, making this function suitable only for use in
+ * ASSERT()s and alike.
+ */
+return lock->recurse_cpu == SPINLOCK_NO_CPU
+   ? spin_is_locked_common(>tickets)
+   : lock->recurse_cpu == smp_processor_id();
+}
+
+void rspin_barrier(rspinlock_t *lock)
+{
+spin_barrier_common(>tickets, >debug, LOCK_PROFILE_PAR);
+}


Ah, here we go. Looks all okay to me, but needs re-ordering such that the
earlier patch won't transiently introduce a regression.


Yes, just wanted to answer something similar to your remark on patch 8.


Juergen



Re: [PATCH v9 2/6] stubdom: extend xenstore stubdom configs

2024-02-29 Thread Jürgen Groß

On 29.02.24 14:05, Jan Beulich wrote:

On 29.02.2024 13:48, Juergen Gross wrote:

Extend the config files of the Xenstore stubdoms to include XENBUS
and 9PFRONT items in order to support file based logging.

Signed-off-by: Juergen Gross 
Reviewed-by: Jason Andryuk 


Was an ack from Samuel lost here? Or was it dropped on purpose?


Oh, I think I lost it.


Juergen




Re: [PATCH v4 06/12] xen/spinlock: make struct lock_profile rspinlock_t aware

2024-02-28 Thread Jürgen Groß

On 28.02.24 17:02, Jan Beulich wrote:

On 28.02.2024 16:43, Jürgen Groß wrote:

On 28.02.24 16:19, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -538,19 +538,31 @@ static void spinlock_profile_iterate(lock_profile_subfunc 
*sub, void *par)
   static void cf_check spinlock_profile_print_elem(struct lock_profile *data,
   int32_t type, int32_t idx, void *par)
   {
-struct spinlock *lock = data->lock;
+unsigned int cpu;
+uint32_t lockval;


Any reason for this not being unsigned int as well? The more that ...


+if ( data->is_rlock )
+{
+cpu = data->rlock->debug.cpu;
+lockval = data->rlock->tickets.head_tail;
+}
+else
+{
+cpu = data->lock->debug.cpu;
+lockval = data->lock->tickets.head_tail;
+}


I've used the same type as tickets.head_tail.

   
   printk("%s ", lock_profile_ancs[type].name);

   if ( type != LOCKPROF_TYPE_GLOBAL )
   printk("%d ", idx);
-printk("%s: addr=%p, lockval=%08x, ", data->name, lock,
-   lock->tickets.head_tail);
-if ( lock->debug.cpu == SPINLOCK_NO_CPU )
+printk("%s: addr=%p, lockval=%08x, ", data->name, data->lock, lockval);


... it's then printed with plain x as the format char.


Which hasn't been changed by the patch. I can change it to PRIx32 if you want.


As per ./CODING_STYLE unsigned int is preferred.


Okay, I'll switch to unsigned int then.




--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -76,13 +76,19 @@ union lock_debug { };
   */
   
   struct spinlock;

+/* Temporary hack until a dedicated struct rspinlock is existing. */
+#define rspinlock spinlock
   
   struct lock_profile {

   struct lock_profile *next;   /* forward link */
   const char  *name;   /* lock name */
-struct spinlock *lock;   /* the lock itself */
+union {
+struct spinlock *lock;   /* the lock itself */
+struct rspinlock *rlock; /* the recursive lock itself */
+};


_LOCK_PROFILE() wants to initialize this field, unconditionally using
.lock. While I expect that problem to be taken care of in one of the
later patches, use of the macro won't work anymore with this union in
use with very old gcc that formally we still support. While a road to
generally raising the baseline requirements is still pretty unclear to
me, an option might be to require (and document) that to enable
DEBUG_LOCK_PROFILE somewhat newer gcc needs using.


Patch 8 is using either .lock or .rlock depending on the lock type.

What is the problem with the old gcc version? Static initializers of
anonymous union members?


Yes.


The easiest solution might then be to give the union a name.


Juergen




Re: [PATCH v4 06/12] xen/spinlock: make struct lock_profile rspinlock_t aware

2024-02-28 Thread Jürgen Groß

On 28.02.24 16:19, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

--- a/xen/common/spinlock.c
+++ b/xen/common/spinlock.c
@@ -538,19 +538,31 @@ static void spinlock_profile_iterate(lock_profile_subfunc 
*sub, void *par)
  static void cf_check spinlock_profile_print_elem(struct lock_profile *data,
  int32_t type, int32_t idx, void *par)
  {
-struct spinlock *lock = data->lock;
+unsigned int cpu;
+uint32_t lockval;


Any reason for this not being unsigned int as well? The more that ...


+if ( data->is_rlock )
+{
+cpu = data->rlock->debug.cpu;
+lockval = data->rlock->tickets.head_tail;
+}
+else
+{
+cpu = data->lock->debug.cpu;
+lockval = data->lock->tickets.head_tail;
+}


I've used the same type as tickets.head_tail.

  
  printk("%s ", lock_profile_ancs[type].name);

  if ( type != LOCKPROF_TYPE_GLOBAL )
  printk("%d ", idx);
-printk("%s: addr=%p, lockval=%08x, ", data->name, lock,
-   lock->tickets.head_tail);
-if ( lock->debug.cpu == SPINLOCK_NO_CPU )
+printk("%s: addr=%p, lockval=%08x, ", data->name, data->lock, lockval);


... it's then printed with plain x as the format char.


Which hasn't been changed by the patch. I can change it to PRIx32 if you want.




+if ( cpu == SPINLOCK_NO_CPU )
  printk("not locked\n");
  else
-printk("cpu=%d\n", lock->debug.cpu);
-printk("  lock:%" PRId64 "(%" PRI_stime "), block:%" PRId64 "(%" PRI_stime 
")\n",
-   data->lock_cnt, data->time_hold, data->block_cnt, data->time_block);
+printk("cpu=%u\n", cpu);
+printk("  lock:%" PRIu64 "(%" PRI_stime "), block:%" PRIu64 "(%" PRI_stime 
")\n",
+   data->lock_cnt, data->time_hold, (uint64_t)data->block_cnt,


I think I know why the cast is suddenly / unexpectedly needed, but imo
such wants stating in the description, when generally we aim at avoiding
casts where possible.


Okay, will add a sentence.




--- a/xen/include/xen/spinlock.h
+++ b/xen/include/xen/spinlock.h
@@ -76,13 +76,19 @@ union lock_debug { };
  */
  
  struct spinlock;

+/* Temporary hack until a dedicated struct rspinlock is existing. */
+#define rspinlock spinlock
  
  struct lock_profile {

  struct lock_profile *next;   /* forward link */
  const char  *name;   /* lock name */
-struct spinlock *lock;   /* the lock itself */
+union {
+struct spinlock *lock;   /* the lock itself */
+struct rspinlock *rlock; /* the recursive lock itself */
+};


_LOCK_PROFILE() wants to initialize this field, unconditionally using
.lock. While I expect that problem to be taken care of in one of the
later patches, use of the macro won't work anymore with this union in
use with very old gcc that formally we still support. While a road to
generally raising the baseline requirements is still pretty unclear to
me, an option might be to require (and document) that to enable
DEBUG_LOCK_PROFILE somewhat newer gcc needs using.


Patch 8 is using either .lock or .rlock depending on the lock type.

What is the problem with the old gcc version? Static initializers of
anonymous union members?




  uint64_tlock_cnt;/* # of complete locking ops */
-uint64_tblock_cnt;   /* # of complete wait for lock */
+uint64_tblock_cnt:63; /* # of complete wait for lock */
+uint64_tis_rlock:1;  /* use rlock pointer */


bool?


Yes.


Juergen



Re: [PATCH v4 05/12] xen/spinlock: add rspin_[un]lock_irq[save|restore]()

2024-02-28 Thread Jürgen Groß

On 28.02.24 16:09, Jan Beulich wrote:

On 12.12.2023 10:47, Juergen Gross wrote:

Instead of special casing rspin_lock_irqsave() and
rspin_unlock_irqrestore() for the console lock, add those functions
to spinlock handling and use them where needed.

Signed-off-by: Juergen Gross 
---
V2:
- new patch


In how far is this a necessary part of the series?


Not really necessary. It just seemed wrong to have an open coded
variant of rspin_lock_irqsave() and rspin_unlock_irqrestore().




--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -647,13 +647,15 @@ void show_stack_overflow(unsigned int cpu, const struct 
cpu_user_regs *regs)
  void show_execution_state(const struct cpu_user_regs *regs)
  {
  /* Prevent interleaving of output. */
-unsigned long flags = console_lock_recursive_irqsave();
+unsigned long flags;
+
+rspin_lock_irqsave(_lock, flags);
  
  show_registers(regs);

  show_code(regs);
  show_stack(regs);
  
-console_unlock_recursive_irqrestore(flags);

+rspin_unlock_irqrestore(_lock, flags);
  }
  
  void cf_check show_execution_state_nonconst(struct cpu_user_regs *regs)

@@ -663,7 +665,7 @@ void cf_check show_execution_state_nonconst(struct 
cpu_user_regs *regs)
  
  void vcpu_show_execution_state(struct vcpu *v)

  {
-unsigned long flags = 0;
+unsigned long flags;
  
  if ( test_bit(_VPF_down, >pause_flags) )

  {
@@ -698,7 +700,7 @@ void vcpu_show_execution_state(struct vcpu *v)
  #endif
  
  /* Prevent interleaving of output. */

-flags = console_lock_recursive_irqsave();
+rspin_lock_irqsave(_lock, flags);
  
  vcpu_show_registers(v);
  
@@ -708,7 +710,7 @@ void vcpu_show_execution_state(struct vcpu *v)

   * Stop interleaving prevention: The necessary P2M lookups involve
   * locking, which has to occur with IRQs enabled.
   */
-console_unlock_recursive_irqrestore(flags);
+rspin_unlock_irqrestore(_lock, flags);
  
  show_hvm_stack(v, >arch.user_regs);

  }
@@ -717,7 +719,7 @@ void vcpu_show_execution_state(struct vcpu *v)
  if ( guest_kernel_mode(v, >arch.user_regs) )
  show_guest_stack(v, >arch.user_regs);
  
-console_unlock_recursive_irqrestore(flags);

+rspin_unlock_irqrestore(_lock, flags);
  }
  


I view these as layering violations; ...


--- a/xen/drivers/char/console.c
+++ b/xen/drivers/char/console.c
@@ -120,7 +120,7 @@ static int __read_mostly sercon_handle = -1;
  int8_t __read_mostly opt_console_xen; /* console=xen */
  #endif
  
-static DEFINE_RSPINLOCK(console_lock);

+DEFINE_RSPINLOCK(console_lock);


... this should remain static. The question therefore just is whether
to omit this patch or ...


@@ -1158,22 +1158,6 @@ void console_end_log_everything(void)
  atomic_dec(_everything);
  }
  
-unsigned long console_lock_recursive_irqsave(void)

-{
-unsigned long flags;
-
-local_irq_save(flags);
-rspin_lock(_lock);
-
-return flags;
-}
-
-void console_unlock_recursive_irqrestore(unsigned long flags)
-{
-rspin_unlock(_lock);
-local_irq_restore(flags);
-}


... whether to retain these two functions as thin wrappers around the
new, more generic construct.


I'd vote for the latter.


Juergen



Re: [PATCH] Mini-OS: add symbol exports for xenstore stubdom

2024-02-26 Thread Jürgen Groß

On 26.02.24 09:44, Jan Beulich wrote:

On 26.02.2024 09:39, Juergen Gross wrote:

Xenstore stubdom needs some more symbols exported.

Signed-off-by: Juergen Gross 


Any Reported-by: possibly applicable here?


If any, then Andrew.

He reported a CI loop failure.


Juergen




Re: [PATCH v5 07/22] tools/9pfsd: add 9pfs attach request support

2024-02-14 Thread Jürgen Groß

On 14.02.24 18:48, Andrew Cooper wrote:

On 08/02/2024 4:55 pm, Juergen Gross wrote:

+static struct p9_fid *alloc_fid(device *device, unsigned int fid,
+const char *path)
+{
+struct p9_fid *fidp = NULL;
+
+pthread_mutex_lock(>fid_mutex);
+
+if ( find_fid(device, fid) )
+{
+errno = EBADFD;


Also, FreeBSD says no.

https://cirrus-ci.com/task/6634697753624576

io.c:580:17: error: use of undeclared identifier 'EBADFD'
     errno = EBADFD;
     ^
1 error generated.

Need to use EBADF.


Yes.


Juergen




Re: [PATCH v5 07/22] tools/9pfsd: add 9pfs attach request support

2024-02-14 Thread Jürgen Groß

On 14.02.24 18:40, Andrew Cooper wrote:

On 08/02/2024 4:55 pm, Juergen Gross wrote:

+static struct p9_fid *alloc_fid_mem(device *device, unsigned int fid,
+const char *path)
+{
+struct p9_fid *fidp;
+size_t pathlen;
+
+pathlen = strlen(path);
+fidp = calloc(sizeof(*fidp) + pathlen + 1, 1);
+if ( !fidp )
+return NULL;
+
+fidp->fid = fid;
+strncpy(fidp->path, path, pathlen);
+
+return fidp;
+}


GitlabCI has something to say about this.
https://gitlab.com/xen-project/people/andyhhp/xen/-/pipelines/1176787593

I think they're all variations of:

io.c: In function 'alloc_fid_mem.isra.8':
io.c:566:5: error: 'strncpy' output truncated before terminating nul
copying as many bytes from a string as its length
[-Werror=stringop-truncation]
  strncpy(fidp->path, path, pathlen);
  ^~
io.c:560:15: note: length computed here
  pathlen = strlen(path);
    ^~~~


In the end the result is fine, as the buffer is large enough and it is
zeroed on allocation.

I'll change it nevertheless as it is a bad code pattern.


Juergen




Re: [PATCH v5 15/22] tools/libs/light: add backend type for 9pfs PV devices

2024-02-14 Thread Jürgen Groß

On 14.02.24 18:27, Anthony PERARD wrote:

On Wed, Feb 14, 2024 at 11:18:18AM +0100, Jürgen Groß wrote:

On 13.02.24 19:03, Anthony PERARD wrote:

On Thu, Feb 08, 2024 at 05:55:39PM +0100, Juergen Gross wrote:

+struct libxl__aop9_state {
+libxl__spawn_state spawn;
+libxl__ao_device *aodev;
+libxl_device_p9 p9;
+uint32_t domid;
+void (*callback)(libxl__egc *, libxl__aop9_state *, int);


This "callback" is never used, right?


Why do you think so?

In xen9pfsd_spawn() it is used:

   aop9->callback = xen9pfsd_spawn_outcome;


By never used, I mean that nothing is reading the value, their is no
"aop9->callback(egc, aop9, rc)" call.
It might have been useful if a caller of xen9pfsd_spawn() was actually
setting this field, but that's not an option here. And callbacks of
xen9pfsd_spawn() knows to call xen9pfsd_spawn_outcome() when done.


Oh, right. This must be a relict of a previous attempt to make it work.

I'll remove the callback.


Juergen




Re: [PATCH v5 15/22] tools/libs/light: add backend type for 9pfs PV devices

2024-02-14 Thread Jürgen Groß

On 13.02.24 19:03, Anthony PERARD wrote:

On Thu, Feb 08, 2024 at 05:55:39PM +0100, Juergen Gross wrote:

+struct libxl__aop9_state {
+libxl__spawn_state spawn;
+libxl__ao_device *aodev;
+libxl_device_p9 p9;
+uint32_t domid;
+void (*callback)(libxl__egc *, libxl__aop9_state *, int);


This "callback" is never used, right?


Why do you think so?

In xen9pfsd_spawn() it is used:

  aop9->callback = xen9pfsd_spawn_outcome;




diff --git a/tools/libs/light/libxl_types_internal.idl 
b/tools/libs/light/libxl_types_internal.idl
index e24288f1a5..39da71cef5 100644
--- a/tools/libs/light/libxl_types_internal.idl
+++ b/tools/libs/light/libxl_types_internal.idl
@@ -34,6 +34,7 @@ libxl__device_kind = Enumeration("device_kind", [
  (16, "VINPUT"),
  (17, "VIRTIO_DISK"),
  (18, "VIRTIO"),
+(19, "XEN_9PFS"),


That's going to need to be rebased ;-).


Yes.


Juergen



Re: [PATCH] xen/events: close evtchn after mapping cleanup

2024-02-13 Thread Jürgen Groß

On 24.01.24 17:31, Maximilian Heyne wrote:

shutdown_pirq and startup_pirq are not taking the
irq_mapping_update_lock because they can't due to lock inversion. Both
are called with the irq_desc->lock being taking. The lock order,
however, is first irq_mapping_update_lock and then irq_desc->lock.

This opens multiple races:
- shutdown_pirq can be interrupted by a function that allocates an event
   channel:

   CPU0CPU1
   shutdown_pirq {
 xen_evtchn_close(e)
   __startup_pirq {
 EVTCHNOP_bind_pirq
   -> returns just freed evtchn e
 set_evtchn_to_irq(e, irq)
   }
 xen_irq_info_cleanup() {
   set_evtchn_to_irq(e, -1)
 }
   }

   Assume here event channel e refers here to the same event channel
   number.
   After this race the evtchn_to_irq mapping for e is invalid (-1).

- __startup_pirq races with __unbind_from_irq in a similar way. Because
   __startup_pirq doesn't take irq_mapping_update_lock it can grab the
   evtchn that __unbind_from_irq is currently freeing and cleaning up. In
   this case even though the event channel is allocated, its mapping can
   be unset in evtchn_to_irq.

The fix is to first cleanup the mappings and then close the event
channel. In this way, when an event channel gets allocated it's
potential previous evtchn_to_irq mappings are guaranteed to be unset already.
This is also the reverse order of the allocation where first the event
channel is allocated and then the mappings are setup.

On a 5.10 kernel prior to commit 3fcdaf3d7634 ("xen/events: modify internal
[un]bind interfaces"), we hit a BUG like the following during probing of NVMe
devices. The issue is that during nvme_setup_io_queues, pci_free_irq
is called for every device which results in a call to shutdown_pirq.
With many nvme devices it's therefore likely to hit this race during
boot because there will be multiple calls to shutdown_pirq and
startup_pirq are running potentially in parallel.

   [ cut here ]
   blkfront: xvda: barrier or flush: disabled; persistent grants: enabled; 
indirect descriptors: enabled; bounce buffer: enabled
   kernel BUG at drivers/xen/events/events_base.c:499!
   invalid opcode:  [#1] SMP PTI
   CPU: 44 PID: 375 Comm: kworker/u257:23 Not tainted 
5.10.201-191.748.amzn2.x86_64 #1
   Hardware name: Xen HVM domU, BIOS 4.11.amazon 08/24/2006
   Workqueue: nvme-reset-wq nvme_reset_work
   RIP: 0010:bind_evtchn_to_cpu+0xdf/0xf0
   Code: 5d 41 5e c3 cc cc cc cc 44 89 f7 e8 2b 55 ad ff 49 89 c5 48 85 c0 0f 84 64 
ff ff ff 4c 8b 68 30 41 83 fe ff 0f 85 60 ff ff ff <0f> 0b 66 66 2e 0f 1f 84 00 
00 00 00 00 0f 1f 40 00 0f 1f 44 00 00
   RSP: :c9000d533b08 EFLAGS: 00010046
   RAX:  RBX:  RCX: 0006
   RDX: 0028 RSI:  RDI: 
   RBP: 888107419680 R08:  R09: 82d72b00
   R10:  R11:  R12: 01ed
   R13:  R14:  R15: 0002
   FS:  () GS:88bc8b50() knlGS:
   CS:  0010 DS:  ES:  CR0: 80050033
   CR2:  CR3: 02610001 CR4: 001706e0
   DR0:  DR1:  DR2: 
   DR3:  DR6: fffe0ff0 DR7: 0400
   Call Trace:
? show_trace_log_lvl+0x1c1/0x2d9
? show_trace_log_lvl+0x1c1/0x2d9
? set_affinity_irq+0xdc/0x1c0
? __die_body.cold+0x8/0xd
? die+0x2b/0x50
? do_trap+0x90/0x110
? bind_evtchn_to_cpu+0xdf/0xf0
? do_error_trap+0x65/0x80
? bind_evtchn_to_cpu+0xdf/0xf0
? exc_invalid_op+0x4e/0x70
? bind_evtchn_to_cpu+0xdf/0xf0
? asm_exc_invalid_op+0x12/0x20
? bind_evtchn_to_cpu+0xdf/0xf0
? bind_evtchn_to_cpu+0xc5/0xf0
set_affinity_irq+0xdc/0x1c0
irq_do_set_affinity+0x1d7/0x1f0
irq_setup_affinity+0xd6/0x1a0
irq_startup+0x8a/0xf0
__setup_irq+0x639/0x6d0
? nvme_suspend+0x150/0x150
request_threaded_irq+0x10c/0x180
? nvme_suspend+0x150/0x150
pci_request_irq+0xa8/0xf0
? __blk_mq_free_request+0x74/0xa0
queue_request_irq+0x6f/0x80
nvme_create_queue+0x1af/0x200
nvme_create_io_queues+0xbd/0xf0
nvme_setup_io_queues+0x246/0x320
? nvme_irq_check+0x30/0x30
nvme_reset_work+0x1c8/0x400
process_one_work+0x1b0/0x350
worker_thread+0x49/0x310
? process_one_work+0x350/0x350
kthread+0x11b/0x140
? __kthread_bind_mask+0x60/0x60
ret_from_fork+0x22/0x30
   Modules linked in:
   ---[ end trace a11715de1eee1873 ]---

Fixes: d46a78b05c0e ("xen: implement pirq type event channels")
Cc: sta...@vger.kernel.org
Co-debugged-by: Andrew Panyakin 
Signed-off-by: Maximilian Heyne 


Reviewed-by: Juergen Gross 


Juergen




Errors when trying to create VM with "channel" device(s)

2024-02-12 Thread Jürgen Groß

A customer is complaining to be unable to create multiple "channel" devices
in a HVM domain. I've verified the same happens with upstream Xen.

The "channel" configuration is:

channel = [ "connection=socket, name=test.ch1, path=/run/testsocket1", 
"connection=socket, name=test.ch2, path=/run/testsocket2" ]


When creating the domain, I see:

libxl: error: libxl_console.c:285:libxl__device_console_add: Domain 4:Primary 
console has invalid configuration


Both /run/testsocket1 and /run/testsocket2 are being created in dom0.

"xl channel-list 4" gives me:
Idx BE state evt-ch ring-ref connection
0   0  1 -1 -1   socket

"xenstore-ls /local/domain/0/backend/console/4" prints only one channel:
1 = ""
 frontend = "/local/domain/4/device/console/1"
 frontend-id = "4"
 online = "1"
 state = "2"
 protocol = "vt100"
 name = "test.ch2"
 connection = "socket"
 path = "/run/testsocket2"
 hotplug-status = "connected"
0 = ""
 frontend = "/local/domain/4/console"
 frontend-id = "4"
 online = "1"
 state = "1"
 protocol = "vt100"

While qemu seems to have been started with the information for both.
Output of "xl -vvv create":

...
libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4:  -chardev
libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: 
socket,id=libxl-channel0,path=/run/testsocket1,server=on,wait=off

libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4:  -chardev
libxl: debug: libxl_dm.c:2994:libxl__spawn_local_dm: Domain 4: 
socket,id=libxl-channel1,path=/run/testsocket2,server=on,wait=off

...

"xenstore-ls /local/domain/4/device/console" only shows:
1 = ""
 backend = "/local/domain/0/backend/console/4/1"
 backend-id = "0"
 state = "1"
 protocol = "vt100"
 name = "test.ch2"
 limit = "1048576"
 type = "ioemu"
 output = "chardev:libxl-channel1"
 tty = ""

So the console AND the first channel are missing.

"xenstore-ls /local/domain/4/error" shows:
device = ""
 console = ""
  1 = ""
   error = "22 xenbus_dev_probe on device/console/1"

Now doing the same with only one "channel" device shows the same error when
creating the guest. The guest has no channel device after that and the Xenstore
error node is not present.

I'm not that familiar with this code area, so I thought I post this here in the
hope someone has an idea where to look first.


Juergen



Re: [PATCH v4 14/32] tools/xen-9pfsd: add 9pfs read request support

2024-02-08 Thread Jürgen Groß

On 08.02.24 02:28, Jason Andryuk wrote:

On Mon, Feb 5, 2024 at 5:51 AM Juergen Gross  wrote:


Add the read request of the 9pfs protocol.

Signed-off-by: Juergen Gross 
Reviewed-by: Jason Andryuk 
Acked-by: Anthony PERARD 
---
V2:
- make error check more readable (Jason Andryuk)
V4:
- add directory read support
---
  tools/xen-9pfsd/io.c | 90 
  1 file changed, 90 insertions(+)

diff --git a/tools/xen-9pfsd/io.c b/tools/xen-9pfsd/io.c
index b763e3d8d9..cfbf3bef59 100644
--- a/tools/xen-9pfsd/io.c
+++ b/tools/xen-9pfsd/io.c



+
+len = count;
+buf = ring->buffer + sizeof(*hdr) + sizeof(uint32_t);
+
+if ( fidp->isdir )
+{
+struct dirent *dirent;
+struct stat st;
+struct p9_stat p9s;
+


"""
For directories, read returns an integral number of direc- tory
entries exactly as in stat (see stat(5)), one for each member of the
directory. The read request message must have offset equal to zero or
the value of offset in the previous read on the directory, plus the
number of bytes returned in the previous read. In other words, seeking
other than to the beginning is illegal in a directory (see seek(2)).
"""

I think you need to check `off`.  Looks like QEMU only checks for off
== 0 and rewinds in that case.  QEMU ignores other values.


Oh, indeed. Thanks for noticing. I'll use the same approach as qemu.


Juergen



Re: [PATCH v4 02/32] tools: add a new xen logging daemon

2024-02-08 Thread Jürgen Groß

On 08.02.24 01:52, Andrew Cooper wrote:

On 05/02/2024 10:49 am, Juergen Gross wrote:

Add "xen-9pfsd", a new logging daemon meant to support infrastructure
domains (e.g. xenstore-stubdom) to access files in dom0.


I was still expecting for "logging" to disappear from this.

In both cases it could just be deleted the sentences still work,
although the subject ought to gain a xen-9pfsd part.

Happy to fix on commit.


  tools/Makefile  |   1 +
  tools/xen-9pfsd/.gitignore  |   1 +
  tools/xen-9pfsd/Makefile|  38 ++
  tools/xen-9pfsd/xen-9pfsd.c | 147 


Asking just in case...  Do we really want the "xen-" in the directory?
tools/9pfsd is still perfectly descriptive, and more amenable to tab
completion given what else is in tools/


I'll change it.


Juergen




Re: [PATCH] config: update Mini-OS commit

2024-02-07 Thread Jürgen Groß

On 07.02.24 14:41, Julien Grall wrote:

Hi Juergen,

On 07/02/2024 13:31, Juergen Gross wrote:

Update the Mini-OS upstream revision.

Signed-off-by: Juergen Gross 
---
  Config.mk | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Config.mk b/Config.mk
index f7d6d84847..077d841bb7 100644
--- a/Config.mk
+++ b/Config.mk
@@ -224,7 +224,7 @@ QEMU_UPSTREAM_URL ?= 
https://xenbits.xen.org/git-http/qemu-xen.git

  QEMU_UPSTREAM_REVISION ?= master
  MINIOS_UPSTREAM_URL ?= https://xenbits.xen.org/git-http/mini-os.git
-MINIOS_UPSTREAM_REVISION ?= 090eeeb1631f00a9a41ebf66d9b4aacb97eb51e7
+MINIOS_UPSTREAM_REVISION ?= b119f0178fd86876d0678007dfcf435ab8bb7568


I looked at the changes and I am puzzled by one. In p9_stat(), I see a call 
with:

   free_stat();

'stat' is defined as 'struct p9_stat *stat' and 'free_stat' expects a 'struct 
p9_stat'.


Wouldn't this resuilt to a build error?


Hmm, I'm puzzled, too. I didn't see a build error when doing "make testbuild".

But I agree that this looks very suspicious.

Oh, in test builds we don't have HAVE_LIBC defined. I just verified that
9pfront.c gets compiled, while I didn't pay attention to the parts which were
compiled during the build. Shame on me!

Fixup patch coming soon...

Thanks for noticing!


Juergen



Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot

2024-02-07 Thread Jürgen Groß

On 07.02.24 13:02, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 12:43:03 +0100, a ecrit:

On 07.02.24 12:34, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 12:16:44 +0100, a ecrit:

On 07.02.24 12:00, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit:

while implementing kexec in Mini-OS.


Oh, nice :D


For that I need it for sure.


It needs to be done by kexec itself then.


That's another option, yes.

The question is whether we want to support to be kexec-ed from other
systems, too.


But aren't other systems' kexec supports supposed to do the memset?

They really should.


I guess there is a reason why the Linux kernel does clear its .bss section
in early boot. Maybe it is due to how the boot process works (the ELF file
is encapsulated in vmlinuz),


Yes, the unpack prevents grub etc. from doing it.


but following your reasoning they should have cleared their .bss while
unpacking the ELF contents, not while booting the contents.


AIUI the decompressor itself doesn't actually know about ELF.
decompress_kernel() does call handle_relocations(), but it should indeed
clear bss itself and not leave it out to assembly indeed.


I'm not sure they do the .bss clearing in kexec either,


AIUI they do, see kimage_load_normal_segment() which clears the page
before possibly loading some file piece into it.

Really, this is part of what "loading an ELF" means.


Okay, I'll do the clearing only in the kexec code then.


Juergen



Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot

2024-02-07 Thread Jürgen Groß

On 07.02.24 12:34, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 12:16:44 +0100, a ecrit:

On 07.02.24 12:00, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit:

while implementing kexec in Mini-OS.


Oh, nice :D


For that I need it for sure.


It needs to be done by kexec itself then.


That's another option, yes.

The question is whether we want to support to be kexec-ed from other
systems, too.


But aren't other systems' kexec supports supposed to do the memset?

They really should.


I guess there is a reason why the Linux kernel does clear its .bss section
in early boot. Maybe it is due to how the boot process works (the ELF file
is encapsulated in vmlinuz), but following your reasoning they should have
cleared their .bss while unpacking the ELF contents, not while booting the
contents.

I'm not sure they do the .bss clearing in kexec either, as the kernel is
clearing it anyway.


Juergen



Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot

2024-02-07 Thread Jürgen Groß

On 07.02.24 12:00, Samuel Thibault wrote:

Jürgen Groß, le mer. 07 févr. 2024 11:42:20 +0100, a ecrit:

while implementing kexec in Mini-OS.


Oh, nice :D


For that I need it for sure.


It needs to be done by kexec itself then.


That's another option, yes.

The question is whether we want to support to be kexec-ed from other
systems, too. Then I'd rather do it on both sides.


Juergen



Re: [PATCH] Mini-OS: x86: zero out .bss segment at boot

2024-02-07 Thread Jürgen Groß

On 07.02.24 11:38, Samuel Thibault wrote:

Juergen Gross, le mer. 07 févr. 2024 11:31:38 +0100, a ecrit:

The .bss segment should be zeroed at very early boot.


Is that not done by the elf loader of Xen?


It might be done by Xen tools today, but I'm quite sure it is not part
of the ABI. The hypervisor doesn't clear .bss when loading a domain (e.g.
dom0).

I stumbled over it while implementing kexec in Mini-OS. For that I need
it for sure.


Juergen



Re: [PATCH] Mini-OS: fix 9pfs frontend error path

2024-02-06 Thread Jürgen Groß

On 06.02.24 16:26, Samuel Thibault wrote:

Juergen Gross, le mar. 06 févr. 2024 07:17:21 +0100, a ecrit:

The early error exit in p9_stat() returns without zeroing the p9_stat
buffer, resulting in free() being called with an uninitialized pointer.

Fix that by doing the zeroing first.


This is not coherent with the usual conventions: when a function fails,
it is supposed not to have done anything, and thus the caller shouldn't
have to clean anything.

I.e. i'd rather see the free_stat() call be put after the check for
an error returned by p9_stat.


I can do that, but this would require two calls of free_stat() (one in
p9_stat() in an error case reported via req->result, and one in the
caller of p9_stat() in case of no error).


Juergen



Re: [PATCH] tools/libs/light: don't allow to stop Xenstore stubdom

2024-02-06 Thread Jürgen Groß

On 06.02.24 14:08, Andrew Cooper wrote:

On 06/02/2024 12:43 pm, Juergen Gross wrote:

A Xenstore stubdom should never be stoppable.

Reject attempts to do so.

Signed-off-by: Juergen Gross 


I don't think this is a clever idea.  `xl destroy` is also the "please
clean up my system when it's in a very dead state" command, and that
also includes a dead xenstored stubdom.


I don't think xl destroy for a dead Xenstore stubdom will ever work.
xl destroy tries to read (and delete) Xenstore entries, after all.

I think you'd need a program using libxenctrl without all the xl/libxl
actions for achieving this goal. And this would work with my current
patch, too.


If you're looking for some protection, then maybe a `--force` flag to
override, but there must be some way of getting this to run.


A system without Xenstore is probably quite useless anyway. At least today
there is no way a new Xenstore would be able to connect to existing domains.

OTOH I'm inclined to add more hooks, e.g. for "xl pause" and "xl migrate".

And I do think that libxl is the right level for that, as I don't want users
to be able to kill/pause/migrate Xenstore stubdom via libvirt either.


Juergen



Re: [PATCH v4 23/32] tools/xenstored: move all log-pipe handling into posix.c

2024-02-05 Thread Jürgen Groß

On 05.02.24 15:33, Julien Grall wrote:

Hi Juergen,

On 05/02/2024 10:49, Juergen Gross wrote:

All of the log-pipe handling is needed only when running as daemon.

Move it into posix.c. This requires to have a service function in the
main event loop for handling the related requests and one for setting
the fds[] array. Use a generic name for those functions, as socket
handling can be added to them later, too.


I would mention the rename in the commit message. With that:


I have modified the commit message to:

tools/xenstored: move all log-pipe handling into posix.c

All of the log-pipe handling is needed only when running as daemon.

Move it into posix.c. This requires to have a service function in the
main event loop for handling the related requests and one for setting
the fds[] array, which is renamed to poll_fds to have a more specific
name. Use a generic name for those functions, as socket handling can
be added to them later, too.





Signed-off-by: Juergen Gross 


Reviewed-by: Julien Grall 


Thanks,


Juergen




Re: [PATCH v4 00/32] tools: enable xenstore-stubdom to use 9pfs

2024-02-05 Thread Jürgen Groß

On 05.02.24 11:55, Julien Grall wrote:

Hi Juergen,

On 05/02/2024 10:49, Juergen Gross wrote:

This series is adding 9pfs support to Xenstore-stubdom, enabling it
to do logging to a dom0 directory.

This is a prerequisite for the final goal to add live update support
to Xenstore-stubdom, as it enables the stubdom to store its state in
a dom0 file.

The 9pfs backend is a new daemon written from scratch. Using a
dedicated 9pfs daemon has several advantages:

- it is using much less resources than a full blown qemu process
- it can serve multiple guests (the idea is to use it for other
   infrastructure domains, like qemu-stubdom or driver domains, too)
- it is designed to support several security enhancements, like
   limiting the number of files for a guest, or limiting the allocated
   file system space
- it doesn't support file links (neither hard nor soft links) or
   referencing parent directories via "..", minimizing the risk that
   a guest can "escape" from its home directory

Note that for now the daemon only contains the minimal needed
functionality to do logging from Xenstore-stubdom. I didn't want to
add all the 9pfs commands and security add-ons in the beginning, in
order to avoid needless efforts in case the idea of the daemon is
being rejected.

Changes in V4:
- patch 2 of V3 was applied
- added support of reading directories
- addressed review comments

Changes in V3:
- new patches 1, 23-25
- addressed review comments

Changes in V2:
- support of multiple rings per device
- xenlogd->xen-9pfsd rename
- addressed review comments
- fixed some bugs

Juergen Gross (32):
   tools: add access macros for unaligned data
   tools: add a new xen logging daemon
   tools/xen-9pfsd: connect to frontend
   tools/xen-9pfsd: add transport layer
   tools/xen-9pfsd: add 9pfs response generation support
   tools/xen-9pfsd: add 9pfs version request support
   tools/xen-9pfsd: add 9pfs attach request support
   tools/xen-9pfsd: add 9pfs walk request support
   tools/xen-9pfsd: add 9pfs open request support
   tools/xen-9pfsd: add 9pfs clunk request support
   tools/xen-9pfsd: add 9pfs create request support
   tools/xen-9pfsd: add 9pfs stat request support
   tools/xen-9pfsd: add 9pfs write request support
   tools/xen-9pfsd: add 9pfs read request support
   tools/libs/light: add backend type for 9pfs PV devices
   tools/xl: support new 9pfs backend xen_9pfsd
   tools/helpers: allocate xenstore event channel for xenstore stubdom
   tools/xenstored: rename xenbus_evtchn()
   stubdom: extend xenstore stubdom configs
   tools: add 9pfs device to xenstore-stubdom
   tools/xenstored: add early_init() function
   tools/xenstored: move systemd handling to posix.c
   tools/xenstored: move all log-pipe handling into posix.c
   tools/xenstored: move all socket handling into posix.c
   tools/xenstored: get own domid in stubdom case
   tools/xenstored: rework ring page (un)map functions
   tools/xenstored: split domain_init()
   tools/xenstored: map stubdom interface
   tools/xenstored: mount 9pfs device in stubdom
   tools/xenstored: add helpers for filename handling
   tools/xenstored: support complete log capabilities in stubdom
   tools/xenstored: have a single do_control_memreport()


I haven't checked what's the state of the 9PFS patches. Can part of the 
xenstored changes be committed without the 9PFS changes?


The following patches can go in without the 9pfs daemon:

tools/helpers: allocate xenstore event channel for xenstore stubdom
tools/xenstored: rename xenbus_evtchn()
stubdom: extend xenstore stubdom configs
tools/xenstored: add early_init() function
tools/xenstored: move systemd handling to posix.c
tools/xenstored: move all log-pipe handling into posix.c
tools/xenstored: move all socket handling into posix.c
tools/xenstored: get own domid in stubdom case
tools/xenstored: rework ring page (un)map functions
tools/xenstored: split domain_init()
tools/xenstored: map stubdom interface


Juergen



Re: [PATCH v3 17/33] tools/xl: support new 9pfs backend xen-9pfsd

2024-01-31 Thread Jürgen Groß

On 15.01.24 16:14, Anthony PERARD wrote:

On Thu, Jan 04, 2024 at 10:00:39AM +0100, Juergen Gross wrote:

@@ -2242,6 +2256,28 @@ void parse_config_data(const char *config_source,
  
  libxl_string_list_dispose();
  
+if (p9->type == LIBXL_P9_TYPE_UNKNOWN) {

+p9->type = LIBXL_P9_TYPE_QEMU;


The defaulting is normally done in libxl, so that it works for all users
of libxl. Can this be done instead in libxl? Hopefully, it's enough to
do it in libxl__device_p9_setdefault().

Same question for the followup checks and default values.


I'll look into it.


Juergen



Re: [PATCH v3 16/33] tools/libs/light: add backend type for 9pfs PV devices

2024-01-31 Thread Jürgen Groß

On 15.01.24 16:38, Anthony PERARD wrote:

On Thu, Jan 04, 2024 at 10:00:38AM +0100, Juergen Gross wrote:

+static void libxl__device_p9_add(libxl__egc *egc, uint32_t domid,
+ libxl_device_p9 *p9,
+ libxl__ao_device *aodev)
+{
+int rc;


Can you make a copy of `p9` here, or maybe in xen9pfsd_spawn? There's no
guaranty that `p9` will still be around by the time
xen9pfsd_spawn_outcome() is executed.

(with libxl_device_p9_copy())


Okay.




+rc = xen9pfsd_spawn(egc, domid, p9, aodev);
+if (rc == 1)
+return;
+
+if (rc == 0)
+libxl__device_add_async(egc, domid, __p9_devtype, p9, aodev);
+
+aodev->rc = rc;
+if (rc)
+aodev->callback(egc, aodev);
+}
+



Juergen




Re: [PATCH v3 16/33] tools/libs/light: add backend type for 9pfs PV devices

2024-01-31 Thread Jürgen Groß

On 12.01.24 17:55, Anthony PERARD wrote:

On Thu, Jan 04, 2024 at 10:00:38AM +0100, Juergen Gross wrote:

diff --git a/tools/libs/light/libxl_9pfs.c b/tools/libs/light/libxl_9pfs.c
index 5ab0d3aa21..486bc4326e 100644
--- a/tools/libs/light/libxl_9pfs.c
+++ b/tools/libs/light/libxl_9pfs.c
@@ -33,20 +33,159 @@ static int libxl__set_xenstore_p9(libxl__gc *gc, uint32_t 
domid,
  
  flexarray_append_pair(front, "tag", p9->tag);
  
+if (p9->type == LIBXL_P9_TYPE_XEN_9PFSD) {

+flexarray_append_pair(back, "max-space",
+  GCSPRINTF("%u", p9->max_space));
+flexarray_append_pair(back, "max-files",
+  GCSPRINTF("%u", p9->max_files));
+flexarray_append_pair(back, "max-open-files",
+  GCSPRINTF("%u", p9->max_open_files));
+flexarray_append_pair(back, "auto-delete",
+  p9->auto_delete ? "1" : "0");
+}
+
+return 0;
+}
+
+static int libxl__device_from_p9(libxl__gc *gc, uint32_t domid,
+ libxl_device_p9 *type, libxl__device *device)
+{
+device->backend_devid   = type->devid;
+device->backend_domid   = type->backend_domid;
+device->backend_kind= type->type == LIBXL_P9_TYPE_QEMU
+  ? LIBXL__DEVICE_KIND_9PFS
+  : LIBXL__DEVICE_KIND_XEN_9PFS;
+device->devid   = type->devid;
+device->domid   = domid;
+device->kind= LIBXL__DEVICE_KIND_9PFS;
+
  return 0;
  }
  
-#define libxl__add_p9s NULL

+static int libxl_device_p9_dm_needed(void *e, unsigned domid)


Prefix of the function should be "libxl__" as it's only internal to
libxl.


Okay.




+{
+libxl_device_p9 *elem = e;
+
+return elem->type == LIBXL_P9_TYPE_QEMU && elem->backend_domid == domid;
+}
+
+typedef struct libxl__aop9_state libxl__aop9_state;
+
+struct libxl__aop9_state {
+libxl__spawn_state spawn;
+libxl__ao_device *aodev;
+libxl_device_p9 *p9;
+uint32_t domid;
+void (*callback)(libxl__egc *, libxl__aop9_state *, int);
+};
+
+static void xen9pfsd_spawn_outcome(libxl__egc *egc, libxl__aop9_state *aop9,
+   int rc)
+{
+aop9->aodev->rc = rc;
+if (rc)
+aop9->aodev->callback(egc, aop9->aodev);
+else
+libxl__device_add_async(egc, aop9->domid, __p9_devtype,
+aop9->p9, aop9->aodev);
+}
+
+static void xen9pfsd_confirm(libxl__egc *egc, libxl__spawn_state *spawn,
+ const char *xsdata)
+{
+STATE_AO_GC(spawn->ao);
+
+if (!xsdata)
+return;
+
+if (strcmp(xsdata, "running"))
+return;
+
+libxl__spawn_initiate_detach(gc, spawn);
+}
+
+static void xen9pfsd_failed(libxl__egc *egc, libxl__spawn_state *spawn, int rc)
+{
+libxl__aop9_state *aop9 = CONTAINER_OF(spawn, *aop9, spawn);
+
+xen9pfsd_spawn_outcome(egc, aop9, rc);
+}
+
+static void xen9pfsd_detached(libxl__egc *egc, libxl__spawn_state *spawn)
+{
+libxl__aop9_state *aop9 = CONTAINER_OF(spawn, *aop9, spawn);
+
+xen9pfsd_spawn_outcome(egc, aop9, 0);
+}
+
+static int xen9pfsd_spawn(libxl__egc *egc, uint32_t domid, libxl_device_p9 *p9,
+ libxl__ao_device *aodev)
+{
+STATE_AO_GC(aodev->ao);
+struct libxl__aop9_state *aop9;
+int rc;
+char *args[] = { "xen-9pfsd", NULL };
+char *path = GCSPRINTF("/local/domain/%u/libxl/xen-9pfs",
+   p9->backend_domid);
+
+if (p9->type != LIBXL_P9_TYPE_XEN_9PFSD ||
+libxl__xs_read(gc, XBT_NULL, GCSPRINTF("%s/state", path)))


I feel like this check and this function might not work as expected.
What happen if we try to add more than one 9pfs "device"? libxl I think
is going to try to start several xen-9pfs daemon before the first one
have had time to write the "*/state" path.


I don't think so. The path is specific for the _backend_ domid.


What about two different libxl process trying to spawn that daemon? Is
xen-9pfs going to behave well and have one giveup? But that would
probably mean that libxl is going to have an error due to the process
exiting early, maybe.


I think I need to handle this case gracefully in the daemon by exiting with
a 0 exit code.




+return 0;
+
+GCNEW(aop9);
+aop9->aodev = aodev;
+aop9->p9 = p9;
+aop9->domid = domid;
+aop9->callback = xen9pfsd_spawn_outcome;
+
+aop9->spawn.ao = aodev->ao;
+aop9->spawn.what = "xen-9pfs daemon";
+aop9->spawn.xspath = GCSPRINTF("%s/state", path);
+aop9->spawn.timeout_ms = LIBXL_DEVICE_MODEL_START_TIMEOUT * 1000;
+aop9->spawn.pidpath = GCSPRINTF("%s/pid", path);
+aop9->spawn.midproc_cb = libxl__spawn_record_pid;
+aop9->spawn.confirm_cb = xen9pfsd_confirm;
+aop9->spawn.failure_cb = xen9pfsd_failed;
+aop9->spawn.detached_cb = xen9pfsd_detached;
+rc = libxl__spawn_spawn(egc, >spawn);
+if (rc < 0)
+

Re: [PATCH v3 09/33] tools/xenlogd: add 9pfs walk request support

2024-01-31 Thread Jürgen Groß

On 09.01.24 20:19, Jason Andryuk wrote:

On Thu, Jan 4, 2024 at 4:10 AM Juergen Gross  wrote:


Add the walk request of the 9pfs protocol.

Signed-off-by: Juergen Gross 


Reviewed-by: Jason Andryuk 

With one minor comment.


+path = calloc(path_len + 1, 1);
+if ( !path )
+{
+p9_error(ring, hdr->tag, ENOMEM);
+goto out;
+}
+strcpy(path, fidp->path);
+
+if ( n_names )
+{
+qids = calloc(n_names, sizeof(*qids));
+if ( !qids )
+{
+p9_error(ring, hdr->tag, ENOMEM);
+goto out;
+}
+for ( i = 0; i < n_names; i++ )
+{
+if (strcmp(path, "/"))
+strcat(path, "/");


strcmp() can only return 0 on the first iteration, so it seems
inefficient to have it inside this loop.  But the added complexity to
avoid calling it doesn't seem worthwhile.


With storing the relative path in fidp->path this call can be dropped.


Juergen



Re: [PATCH v3 08/33] tools/xenlogd: add 9pfs attach request support

2024-01-31 Thread Jürgen Groß

On 09.01.24 19:48, Jason Andryuk wrote:

On Thu, Jan 4, 2024 at 4:12 AM Juergen Gross  wrote:


Add the attach request of the 9pfs protocol. This introduces the "fid"
scheme of the 9pfs protocol.

As this will be needed later, use a dedicated memory allocation
function in alloc_fid() and prepare a fid reference count.

For filling the qid data take the approach from the qemu 9pfs backend
implementation.

Signed-off-by: Juergen Gross 
---
V2:
- make fill_qid() parameter stbuf const (Jason Andryuk)
- free fids after disconnecting guest (Jason Andryuk)
V3:
- only store relative path in fid (Jason Andryuk)


The code looks good.  I did have a thought though.


+static struct p9_fid *alloc_fid_mem(device *device, unsigned int fid,
+const char *path)
+{
+struct p9_fid *fidp;
+size_t pathlen;
+
+/* Paths always start with "/" as they are starting at the mount point. */
+assert(path[0] == '/');
+


...


+
+static const char *relpath_from_path(const char *path)
+{
+if (!strcmp(path, "/"))
+return ".";
+
+return (path[0] == '/') ? path + 1 : path;
+}


You've carefully written the code to ensure the *at() functions are
not called with paths starting with "/".  What do you think about
storing the converted paths when storing into the p9_fid?  That way
the code doesn't have to worry about always going through
relpath_from_path() before use.  Another option beside performing the
relpath_from_path() conversion, would be to save fidp->path with "./"
at the start to eliminate absolute paths that way.  My thinking is
it's more robust to not have any absolute paths that could be passed
to a *at() function.


I've had a thorough look at the code at the end of the series and I agree
this is a good idea.

I'll change the related patches accordingly.


Juergen



Re: [PATCH] tools/xenstored: Remove unnecessary define XC_WANT_COMPAT_MAP_FOREIGN_API

2021-03-26 Thread Jürgen Groß

On 25.03.21 12:42, Julien Grall wrote:

From: Julien Grall 

The last use of the compat foreign API was dropped in commit
38eeb3864de4 "tools/xenstored: Drop mapping of the ring via foreign
map".

Therefore, we don't need to define XC_WANT_COMPAT_MAP_FOREIGN_API.

Signed-off-by: Julien Grall 


Reviewed-by: Juergen Gross 


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH for-4.15?] docs/misc: xenstored: Re-instate and tweak the documentation for XS_RESUME

2021-03-26 Thread Jürgen Groß

On 25.03.21 19:06, Julien Grall wrote:

From: Julien Grall 

Commit 13dd372834a4 removed the documentation for XS_RESUME, however
this command is still implemented (at least in C Xenstored) and used by
libxl when resuming a domain.

So re-instate the documentation for the XS_RESUME. Take the opportunity
to update it as there is a user of the command.

Fixes: 13dd372834a4 ("docs/designs: re-work the xenstore migration document...")
Signed-off-by: Julien Grall 


Reviewed-by: Juergen Gross 


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH for-4.15?] docs/design: Update xenstore-migration.md

2021-03-26 Thread Jürgen Groß

On 25.03.21 12:12, Julien Grall wrote:

From: Julien Grall 

It is not very clear the shared page adddress is not contained in the
connection record. Additionally, it is misleading to say the grant
will always point to the share paged as a domain is free to revoke the
permission. The restore code would need to make sure it doesn't
fail/crash if this is happening.

The sentence is now replaced with a paragraph explaining why the GFN is
not preserved and that the grant is not guarantee to exist during
restore.

Take the opportunity to replace "code" with "node" when description the
permission.

Reported-by: Raphael Ning 
Signed-off-by: Julien Grall 


Reviewed-by: Juergen Gross 


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH-for-4.15 V2] tools/libs/store: tidy up libxenstore interface

2021-03-24 Thread Jürgen Groß

On 24.03.21 15:09, Julien Grall wrote:

Hi Juergen,

On 24/03/2021 11:30, Juergen Gross wrote:

xenstore_lib.h is in need to be tidied up a little bit:

- the definition of struct xs_tdb_record_hdr shouldn't be here
- some symbols are not namespaced correctly

Signed-off-by: Juergen Gross 
---
V2: minimal variant (Ian Jackson)
---
  tools/include/xenstore_lib.h | 17 -
  tools/libs/store/libxenstore.map |  6 +++---
  tools/libs/store/xs.c    | 12 ++--
  tools/xenstore/utils.h   | 11 +++
  tools/xenstore/xenstore_client.c | 12 ++--
  5 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/tools/include/xenstore_lib.h b/tools/include/xenstore_lib.h
index 4c9b6d1685..f74ad7024b 100644
--- a/tools/include/xenstore_lib.h
+++ b/tools/include/xenstore_lib.h
@@ -43,15 +43,6 @@ struct xs_permissions
  enum xs_perm_type perms;
  };
-/* Header of the node record in tdb. */
-struct xs_tdb_record_hdr {
-    uint64_t generation;
-    uint32_t num_perms;
-    uint32_t datalen;
-    uint32_t childlen;
-    struct xs_permissions perms[0];
-};
-
  /* Each 10 bits takes ~ 3 digits, plus one, plus one for nul 
terminator. */
  #define MAX_STRLEN(x) ((sizeof(x) * CHAR_BIT + CHAR_BIT-1) / 10 * 3 
+ 2)
@@ -78,18 +69,18 @@ bool xs_perm_to_string(const struct xs_permissions 
*perm,

  unsigned int xs_count_strings(const char *strings, unsigned int len);
  /* Sanitising (quoting) possibly-binary strings. */
-struct expanding_buffer {
+struct xs_expanding_buffer {
  char *buf;
  int avail;
  };
  /* Ensure that given expanding buffer has at least min_avail 
characters. */

-char *expanding_buffer_ensure(struct expanding_buffer *, int min_avail);
+char *xs_expanding_buffer_ensure(struct xs_expanding_buffer *, int 
min_avail);

  /* sanitise_value() may return NULL if malloc fails. */
-char *sanitise_value(struct expanding_buffer *, const char *val, 
unsigned len);
+char *xs_sanitise_value(struct xs_expanding_buffer *, const char 
*val, unsigned len);
  /* *out_len_r on entry is ignored; out must be at least strlen(in)+1 
bytes. */

-void unsanitise_value(char *out, unsigned *out_len_r, const char *in);
+void xs_unsanitise_value(char *out, unsigned *out_len_r, const char 
*in);

  #endif /* XENSTORE_LIB_H */
diff --git a/tools/libs/store/libxenstore.map 
b/tools/libs/store/libxenstore.map

index 9854305a2c..fc1c213f13 100644
--- a/tools/libs/store/libxenstore.map
+++ b/tools/libs/store/libxenstore.map
@@ -42,8 +42,8 @@ VERS_3.0.3 {
  xs_strings_to_perms;
  xs_perm_to_string;
  xs_count_strings;
-    expanding_buffer_ensure;
-    sanitise_value;
-    unsanitise_value;
+    xs_expanding_buffer_ensure;
+    xs_sanitise_value;
+    xs_unsanitise_value;


Isn't libxenstore considered stable? If so, shouldn't we bump the 
version to avoid any breakage for existing app?


See https://lists.xen.org/archives/html/xen-devel/2021-03/msg01267.html


Juergen



OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH-for-4.15 V2] tools/libs/store: tidy up libxenstore interface

2021-03-24 Thread Jürgen Groß

On 24.03.21 12:42, Andrew Cooper wrote:

On 24/03/2021 11:30, Juergen Gross wrote:

xenstore_lib.h is in need to be tidied up a little bit:

- the definition of struct xs_tdb_record_hdr shouldn't be here
- some symbols are not namespaced correctly

Signed-off-by: Juergen Gross 
---
V2: minimal variant (Ian Jackson)
---
  tools/include/xenstore_lib.h | 17 -
  tools/libs/store/libxenstore.map |  6 +++---
  tools/libs/store/xs.c| 12 ++--
  tools/xenstore/utils.h   | 11 +++
  tools/xenstore/xenstore_client.c | 12 ++--
  5 files changed, 30 insertions(+), 28 deletions(-)

diff --git a/tools/include/xenstore_lib.h b/tools/include/xenstore_lib.h
index 4c9b6d1685..f74ad7024b 100644
--- a/tools/include/xenstore_lib.h
+++ b/tools/include/xenstore_lib.h
@@ -43,15 +43,6 @@ struct xs_permissions
enum xs_perm_type perms;


^ This enum is still a ABI problem, as it has implementation defined
size.  The containing struct is used by xs_perm_to_string().

Substituting for int is probably the easiest option, because no amount
of trickery with the enum values themselves can prevent the compiler
deciding to use a long or larger for the object.


Switching to unsigned int and replacing the enum values with #defines
seems to be the way to go, as the enum values are basically bit mask
values.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface

2021-03-24 Thread Jürgen Groß

On 24.03.21 12:30, Ian Jackson wrote:

Jürgen Groß writes ("Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore 
interface"):

On 24.03.21 12:02, Ian Jackson wrote:

Is it possible to do sort this out in a more minimal way ?  Eg we
could change the name to namespace it properly.  (I haven't looked at
the code in detail and am still rather under-caffeinated so maybe I am
talking nonsense here.)


No nonsense. This would be the really minimum option (apart from doing
nothing).

I can setup the patch for that and keep the rest for 4.16 (which will
then probably need to bump the so version).


Hmmm.  Maybe it would be less disruptive to punt the whole lot for
xen-next.  That way we don't have a silent withdrawl in one release
followed by a soname bump in the next.

If you're keen to change this for 4.15, please feel free to show me
what the patch looks like.  But I would be inclined to postpone this.


Minimal variant sent. I'm not keen to have that for 4.15, but the patch
was just ready. :-)

Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH-for-4.15] tools/libs/store: cleanup libxenstore interface

2021-03-24 Thread Jürgen Groß

On 24.03.21 12:02, Ian Jackson wrote:

Juergen Gross writes ("[PATCH-for-4.15] tools/libs/store: cleanup libxenstore 
interface"):

There are some internals in the libxenstore interface which should be
removed.

Move those functions into xs_lib.c and the related definitions into
xs_lib.h. Remove the functions from the mapfile. Add xs_lib.o to
xenstore_client as some of the internal functions are needed there.


This seems wider in scope than I was expecting.

Reviewing it again makes me think that there are more concers than I
anticipated and I am now doubtful whether I want to take it in 4.15.


I'm fine with that. TBH I would have been surprised if you'd just take
it. :-)


I thought at this stage we were just going to fix the
accidentally-exported symbols with improperly namespaced names.  It is
those for which I think that withdrawing them without an ABI soname
bump, in contravention of usual library ABI stability rules, will not
cause trouble in pracice.


Just removing them from the mapfile doesn't work.

Either we need to keep them (maybe with "xs_" prefixed), or we need
to go the way I've done in this patch.


My current thoughts are that several of these really ought not to be
withdrawn as they might cause actual trouble:


  /* Path for various daemon things: env vars can override. */
-const char *xs_daemon_rootdir(void);
-const char *xs_domain_dev(void);
-const char *xs_daemon_tdb(void);


Someone who was writing bindings might have exposed these without
knowing what they were, resulting in linkage to these symbols.


This patch is removing everything not being used in the (known) Xen
ecosystem (Xen, qemu, qemu-trad, mini-os).




  bool xs_strings_to_perms(struct xs_permissions *perms, unsigned int num,
 const char *strings);
  
-/* Convert permissions to a string (up to len MAX_STRLEN(unsigned int)+1). */

-bool xs_perm_to_string(const struct xs_permissions *perm,
-   char *buffer, size_t buf_len);


Isn't this function potentially useful ?  It seems funny to have only
one of the conversion directions.


As stated above: this patch is doing the absolute possible maximum.
I'm absolutely fine to drop some of the removals.


+void unsanitise_value(char *out, unsigned *out_len_r, const char *in)


Is it possible to do sort this out in a more minimal way ?  Eg we
could change the name to namespace it properly.  (I haven't looked at
the code in detail and am still rather under-caffeinated so maybe I am
talking nonsense here.)


No nonsense. This would be the really minimum option (apart from doing
nothing).

I can setup the patch for that and keep the rest for 4.16 (which will
then probably need to bump the so version).


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 2/2] Revert "xen: fix p2m size in dom0 for disabled memory hotplug case"

2021-03-23 Thread Jürgen Groß

On 17.03.21 12:04, Roger Pau Monne wrote:

This partially reverts commit 882213990d32fd224340a4533f6318dd152be4b2.

There's no need to special case XEN_UNPOPULATED_ALLOC anymore in order
to correctly size the p2m. The generic memory hotplug option has
already been tied together with the Xen hotplug limit, so enabling
memory hotplug should already trigger a properly sized p2m on Xen PV.


Can you add some words here that XEN_UNPOPULATED_ALLOC depends on
MEMORY_HOTPLUG via ZONE_DEVICE?


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH v2 for-4.14] tools: Fix pkg-config file for libxenstore

2021-03-22 Thread Jürgen Groß

On 22.03.21 17:38, Andrew Cooper wrote:

There are no dependenices on evtchn, ctrl or gnttab.

Fixes: 1b008e99 ("tools: provide pkg-config file for libxenstore")
Signed-off-by: Andrew Cooper 


Reviewed-by: Juergen Gross 


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH for-4.14] tools: Fix pkg-config file for libxenstore

2021-03-22 Thread Jürgen Groß

On 22.03.21 17:20, Andrew Cooper wrote:

There is no dependency on libxenctrl.

Fixes: 1b008e99 ("tools: provide pkg-config file for libxenstore")
Signed-off-by: Andrew Cooper 
---
CC: Ian Jackson 
CC: Wei Liu 
CC: Juergen Gross 
CC: Jan Beulich 

This has been fixed in Xen 4.15 by the uselibs.mk logic, but 4.14 and older
cause everything linking against libxenstore to also try linking against
libxenctrl.  It also causes RPM to create unexpected dependencies between
subpackages, which is a problem when trying to separate the stable and
unstable libs.
---
  tools/xenstore/xenstore.pc.in | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tools/xenstore/xenstore.pc.in b/tools/xenstore/xenstore.pc.in
index 2f64a6b824..98c3f1ab39 100644
--- a/tools/xenstore/xenstore.pc.in
+++ b/tools/xenstore/xenstore.pc.in
@@ -8,4 +8,4 @@ Version: @@version@@
  Cflags: -I${includedir} @@cflagslocal@@
  Libs: @@libsflag@@${libdir} -lxenstore
  Libs.private: -ldl
-Requires.private: xenevtchn,xencontrol,xengnttab,xentoolcore
+Requires.private: xenevtchn,xengnttab,xentoolcore


Any reason you are keeping xenevtchn and xengnttab?


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [PATCH 0/6] tools/libs: add missing support of linear p2m_list, cleanup

2021-03-22 Thread Jürgen Groß

On 22.03.21 13:46, Andrew Cooper wrote:

On 22/03/2021 10:58, Juergen Gross wrote:

There are some corners left which don't support the not so very new
linear p2m list of pv guests, which has been introduced in Linux kernel
3.19 and which is mandatory for non-legacy versions of Xen since kernel
4.14.

This series adds support for the linear p2m list where it is missing
(colo support and "xl dump-core").

In theory it should be possible to merge the p2m list mapping code
from migration handling and core dump handling, but this needs quite
some cleanup before this is possible.

The first three patches of this series are fixing real problems, so
I've put them at the start of this series, especially in order to make
backports easier.

The other three patches are only the first steps of cleanup. The main
work done here is to concentrate all p2m mapping in libxenguest instead
of having one implementation in each of libxenguest and libxenctrl.

Merging the two implementations should be rather easy, but this will
require to touch many lines of code, as the migration handling variant
seems to be more mature, but it is using the migration stream specific
structures heavily. So I'd like to have some confirmation that my way
to clean this up is the right one.

My idea would be to add the data needed for p2m mapping to struct
domain_info_context and replace the related fields in struct
xc_sr_context with a struct domain_info_context. Modifying the
interface of xc_core_arch_map_p2m() to take most current parameters
via struct domain_info_context would then enable migration coding to
use xc_core_arch_map_p2m() for mapping the p2m. xc_core_arch_map_p2m()
should look basically like the current migration p2m mapping code
afterwards.

Any comments to that plan?

Juergen Gross (6):
   tools/libs/guest: fix max_pfn setting in map_p2m()
   tools/libs/ctrl: fix xc_core_arch_map_p2m() to support linear p2m
 table
   tools/libs/ctrl: use common p2m mapping code in xc_domain_resume_any()
   tools/libs: move xc_resume.c to libxenguest
   tools/libs: move xc_core* from libxenctrl to libxenguest
   tools/libs/guest: make some definitions private to libxenguest


https://gitlab.com/xen-project/patchew/xen/-/jobs/1116936958

xenctrl_stubs.c:342:11: error: implicit declaration of function
'xc_domain_resume' is invalid in C99
[-Werror,-Wimplicit-function-declaration]
     result = xc_domain_resume(_H(xch), c_domid, 1);
  ^
1 error generated.

I suspect you need to shuffle the headers in use for the Ocaml stubs too.


Yes. Patch 4 needs to gain an "#include " in
tools/ocaml/libs/xc/xenctrl_stubs.c


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [net-next 1/2] xen-netback: add module parameter to disable ctrl-ring

2021-03-22 Thread Jürgen Groß

On 22.03.21 07:48, Leon Romanovsky wrote:

On Mon, Mar 22, 2021 at 06:58:34AM +0100, Jürgen Groß wrote:

On 22.03.21 06:39, Leon Romanovsky wrote:

On Sun, Mar 21, 2021 at 06:54:52PM +0100, Hsu, Chiahao wrote:




<...>


Typically there should be one VM running netback on each host,
and having control over what interfaces or features it exposes is also
important for stability.
How about we create a 'feature flags' modparam, each bits is specified for
different new features?

At the end, it will be more granular module parameter that user still
will need to guess.

I believe users always need to know any parameter or any tool's flag before
they use it.
For example, before user try to set/clear this ctrl_ring_enabled, they
should already have basic knowledge about this feature,
or else they shouldn't use it (the default value is same as before), and
that's also why we use the 'ctrl_ring_enabled' as parameter name.


It solves only forward migration flow. Move from machine A with no
option X to machine B with option X. It doesn't work for backward
flow. Move from machine B to A back will probably break.

In your flow, you want that users will set all module parameters for
every upgrade and keep those parameters differently per-version.


I think the flag should be a per guest config item. Adding this item to
the backend Xenstore nodes for netback to consume it should be rather
easy.

Yes, this would need a change in Xen tools, too, but it is the most
flexible way to handle it. And in case of migration the information
would be just migrated to the new host with the guest's config data.


Yes, it will overcome global nature of module parameters, but how does
it solve backward compatibility concern?


When creating a guest on A the (unknown) feature will not be set to
any value in the guest's config data. A migration stream not having any
value for that feature on B should set it to "false".

When creating a guest on B it will either have the feature value set
explicitly in the guest config (either true or false), or it will get
the server's default (this value should be configurable in a global
config file, default for that global value would be "true").

So with the guest created on B with feature specified as "false" (either
for this guest only, or per global config), it will be migratable to
machine A without problem. Migrating it back to B would work the same
way as above. Trying to migrate a guest with feature set to "true" to
B would not work, but this would be the host admin's fault due to not
configuring the guest correctly.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


Re: [net-next 1/2] xen-netback: add module parameter to disable ctrl-ring

2021-03-21 Thread Jürgen Groß

On 22.03.21 06:39, Leon Romanovsky wrote:

On Sun, Mar 21, 2021 at 06:54:52PM +0100, Hsu, Chiahao wrote:




<...>


Typically there should be one VM running netback on each host,
and having control over what interfaces or features it exposes is also
important for stability.
How about we create a 'feature flags' modparam, each bits is specified for
different new features?

At the end, it will be more granular module parameter that user still
will need to guess.

I believe users always need to know any parameter or any tool's flag before
they use it.
For example, before user try to set/clear this ctrl_ring_enabled, they
should already have basic knowledge about this feature,
or else they shouldn't use it (the default value is same as before), and
that's also why we use the 'ctrl_ring_enabled' as parameter name.


It solves only forward migration flow. Move from machine A with no
option X to machine B with option X. It doesn't work for backward
flow. Move from machine B to A back will probably break.

In your flow, you want that users will set all module parameters for
every upgrade and keep those parameters differently per-version.


I think the flag should be a per guest config item. Adding this item to
the backend Xenstore nodes for netback to consume it should be rather
easy.

Yes, this would need a change in Xen tools, too, but it is the most
flexible way to handle it. And in case of migration the information
would be just migrated to the new host with the guest's config data.


Juergen


OpenPGP_0xB0DE9DD628BF132F.asc
Description: application/pgp-keys


OpenPGP_signature
Description: OpenPGP digital signature


  1   2   3   4   5   6   7   8   9   10   >